The terms "artificial intelligence," "machine learning," and "deep learning" appear constantly in discussions of formulation development, often used interchangeably as if they describe the same capability. Commercial systems and vendor literature advertise AI-powered formulation tools without specifying what type of AI is involved, what data it was trained on, or what problems it can actually solve. For formulation scientists evaluating these tools, the confusion is more than semantic. The technical approach underlying a predictive model fundamentally determines what kinds of formulation problems it can address, how much data it requires, what inductive biases it encodes (handcrafted features versus learned representations), and how it generalizes to new compositions outside its training distribution.
Formulation systems are inherently high-dimensional and interaction-dominated: the behavior of a 20-component mixture depends not just on individual ingredient properties but on concentration-dependent interactions among all components. This structural characteristic places specific demands on predictive modeling approaches. The distinctions between AI methods matter because they determine whether a tool can feasibly explore this design space or remains limited to interpolation within narrow regions of prior experimental data.
This article clarifies what these terms mean in the context of formulation science, explains why general-purpose AI tools like large language models are not suitable for quantitative formulation prediction, and examines the specific technical advantages that deep neural networks offer over traditional machine learning methods for complex formulation design.
Definitions: What These Terms Actually Mean
These terms span methods, paradigms, and application domains, which contributes to their frequent conflation. Establishing clear definitions is necessary before comparing their applicability to formulation problems.
Artificial intelligence is the broadest category, encompassing any computational system designed to perform tasks that typically require human intelligence. Historically, AI included symbolic systems based on explicit rules and logic, such as expert systems developed in the 1980s. Modern AI is dominated by statistical learning systems that extract patterns from data. The term itself conveys little information about capability or methodology, which is why more precise terminology is needed when evaluating tools for specific applications.
Machine learning is a subset of AI in which systems learn from data rather than following explicit programmed rules. Traditional machine learning encompasses methods such as linear regression, random forests, support vector machines, and gradient boosting algorithms. These methods have been applied successfully across many domains, including materials science and chemistry. A defining characteristic of traditional ML is its reliance on handcrafted features, sometimes called descriptors. The feature space is fixed at model design time, and the model is fundamentally limited by the descriptors chosen. When predicting molecular or formulation properties, traditional ML models take as input a set of numerical features calculated for each component or mixture: molecular weight, LogP, topological indices, HLB values for surfactant blends, polymer molecular weight and polydispersity, charge density, or electrolyte concentration. The model then learns relationships between these input features and the target property.
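The descriptor-driven workflow can be made concrete with a toy sketch. The example below fits a one-feature linear model relating a single handcrafted descriptor, alkyl chain length, to log CMC, in the spirit of the well-known Klevens relation; the numerical CMC values are illustrative, not measured data, and the point is only that the model sees nothing beyond the descriptor chosen at design time.

```python
# A minimal sketch of traditional ML on a handcrafted descriptor: one fixed
# feature (tail carbon number) and an ordinary least-squares fit.
# The log CMC values below are illustrative, not experimental data.

def fit_ols(xs, ys):
    """Ordinary least squares for a single feature: y = a*x + b."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

# Handcrafted descriptor: carbon number of the surfactant tail.
chain_length = [10, 12, 14, 16]
log_cmc      = [-1.9, -2.5, -3.1, -3.7]  # illustrative log10(CMC / M)

a, b = fit_ols(chain_length, log_cmc)
predict = lambda c: a * c + b
print(round(predict(13), 2))  # → -2.8, interpolated log CMC for a C13 tail
```

If a property depends on anything not encoded in the chosen descriptor, branching, headgroup chemistry, counterion effects, this model has no way to see it: that is the fixed-feature limitation in miniature.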
Deep learning is a subset of machine learning that uses neural networks with multiple layers to learn hierarchical representations of data. The key distinction from traditional ML is that representation learning replaces explicit feature engineering: deep neural networks learn directly from raw inputs, building their own internal representations that are adaptive to the prediction task. In molecular applications, graph neural networks can take a molecular structure as input, represented either as a SMILES string or as a molecular graph with atoms as nodes and bonds as edges, and learn to predict properties without requiring explicit descriptor calculation. Deep architectures also support variable-length and set-structured inputs, making them naturally suited to formulations with varying numbers of components.
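A stripped-down illustration of the graph idea, in pure Python with no learned parameters, shows the mechanics of message passing: each atom repeatedly aggregates its neighbors' features, and a permutation-invariant readout pools atom states into a molecule-level representation. A real graph neural network would interleave learned weight matrices and nonlinearities at each round; here the "update" is a bare sum, kept only to show the data flow.

```python
# Toy message passing on a molecular graph (pure Python, no learned weights).
# Ethanol as a heavy-atom graph: C-C-O, nodes carry one-hot [is_C, is_O].
nodes = [[1, 0], [1, 0], [0, 1]]
edges = [(0, 1), (1, 2)]

def message_pass(h, edges):
    """One round: each atom aggregates its neighbors' feature vectors."""
    agg = [[0.0] * len(h[0]) for _ in h]
    for i, j in edges:                       # undirected: send both ways
        for k in range(len(h[0])):
            agg[i][k] += h[j][k]
            agg[j][k] += h[i][k]
    # Update: combine self and aggregated features (identity "weights" here;
    # a trained GNN applies learned transforms and a nonlinearity instead).
    return [[hi + ai for hi, ai in zip(h[n], agg[n])] for n in range(len(h))]

h = message_pass(nodes, edges)
h = message_pass(h, edges)               # two rounds: info travels two bonds
readout = [sum(col) for col in zip(*h)]  # permutation-invariant sum pooling
print(readout)                           # → [12.0, 5.0]
```

After two rounds, every atom's state already reflects atoms two bonds away, which is how such models pick up substructure context without any descriptor telling them to.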
Simulation represents a complementary approach that solves physics equations rather than learning from data. Molecular dynamics simulations, computational fluid dynamics, and thermodynamic models predict behavior from first principles. These approaches derive their predictions from established physical laws, but they carry substantial computational cost and often require simplifying assumptions that can limit their accuracy for complex formulation systems involving multiple interacting phases and length scales.
Materials informatics applies data science and machine learning methods to materials discovery and optimization. Traditional materials informatics relies heavily on handcrafted descriptors, such as elemental fractions or MagPie features that encode composition and periodic table properties. Much of this work has focused on crystalline materials with well-defined stoichiometry and structure, which differs substantially from the amorphous, multi-phase, concentration-dependent systems typical in formulation science.
Surrogate models deserve explicit mention as a unifying concept. A surrogate model is any ML or DL model that approximates a property prediction function, replacing expensive experimental measurements or simulations with fast inference. Both traditional ML and deep learning can produce surrogate models; the distinction lies in how they represent the input space and what relationships they can learn.
Why Large Language Models Are Not Suitable for Quantitative Formulation Prediction
Large language models have captured enormous attention for their ability to engage in sophisticated conversation, write code, and synthesize information across domains. It is natural to wonder whether these models can accelerate formulation development. The short answer is that they are not suitable for the core challenge of quantitative property prediction, though they may provide value in adjacent tasks.
LLMs are trained on text data: scientific papers, websites, books, and other written material. They learn statistical patterns in language and can reproduce information that appears frequently in their training data. When asked about formulation chemistry, they can provide general information about surfactant behavior, describe what HLB values mean, or summarize textbook knowledge about colloidal stability. LLMs can be useful for literature synthesis, hypothesis generation, and experimental planning by surfacing relevant prior work or suggesting directions to explore. These capabilities operate in the knowledge and interface layer of formulation development.
What LLMs are not suited for is producing calibrated, quantitative, composition-conditioned predictions. Formulation prediction requires a specific mapping: given a structured input (a list of ingredients, their concentrations, and processing conditions), produce a numerical estimate of measurable properties such as viscosity as a function of shear rate, surface tension dynamics, colloidal stability over time, or foam generation and decay. This is fundamentally different from the text-to-text mapping that LLMs perform.
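To make the required mapping concrete, the sketch below shows one possible interface for a composition-conditioned surrogate. The dataclass fields, ingredient names, base-viscosity values, blend rule, and shear-thinning exponent are all hypothetical stand-ins invented for illustration; a real surrogate would be a trained model, not a hand-set formula. The point is the contract: structured composition and conditions in, calibrated numbers out.

```python
# Sketch of the input/output contract for composition-conditioned prediction.
# Every name and number here is a hypothetical placeholder.
from dataclasses import dataclass
import math

@dataclass
class Formulation:
    components: list[str]          # ingredient identities
    weight_fractions: list[float]  # composition (sums to 1)
    temperature_c: float           # a processing/measurement condition

def predict_viscosity(f: Formulation, shear_rate: float) -> float:
    """Stand-in surrogate: log-linear blend of illustrative base viscosities
    with a power-law shear-thinning term. A trained model would replace this."""
    base = {"water": 1.0, "sles": 80.0, "capb": 50.0}  # mPa·s, illustrative
    log_mix = sum(w * math.log(base.get(c, 10.0))
                  for c, w in zip(f.components, f.weight_fractions))
    return math.exp(log_mix) * shear_rate ** -0.2      # hypothetical exponent

f = Formulation(["water", "sles", "capb"], [0.85, 0.10, 0.05], 25.0)
print(round(predict_viscosity(f, 10.0), 3))
```

An LLM has no native way to consume the `Formulation` object or emit the continuous, shear-rate-dependent output; a surrogate model is built around exactly this signature.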
LLMs lack a grounded mapping from structured composition inputs to physical property outputs. They do not encode calibrated representations of how co-surfactant concentration affects phase behavior in a specific ternary system, how shear-thinning profiles emerge from wormlike micelle formation, or how ionic strength modulates the interaction between a polymer and a charged surfactant. These phenomena require surrogate models trained on physicochemical data with structured inputs and continuous outputs, not models that operate on textual descriptions of chemistry.
The distinction between knowledge and system state is important here. LLMs can describe what happens when surfactant concentration increases, drawing on textual descriptions in their training data. Domain-specific surrogate models can predict the specific viscosity value at a specific shear rate for a specific composition. The former provides qualitative guidance; the latter provides quantitative predictions that can inform experimental decisions.
Traditional Machine Learning vs. Deep Learning: A Technical Comparison
Both traditional ML and deep learning can be applied to formulation problems, but they differ substantially in their capabilities and limitations. The comparison can be organized along several axes: how each approach represents inputs, how it models interactions among variables, how it scales with problem dimensionality, how efficiently it uses data, and how it handles heterogeneous data types. For complex formulation systems involving multiple components, nonlinear interactions, and coupled physicochemical phenomena, deep learning provides meaningful advantages along most of these axes, though with important caveats.
Capturing Complex Nonlinear Interactions
Formulation behavior is rarely additive. The performance of a surfactant blend depends not just on the individual surfactants but on how they interact: whether they form mixed micelles, how their CMC shifts upon mixing, and how the blend affects phase behavior and interfacial properties. Add a polymer to the system and the interactions become more complex still: surfactant-polymer complexation can dramatically alter rheology, and the presence of electrolytes can screen charges and modify binding behavior. Add processing conditions like temperature and shear, and the dimensionality of the interaction space expands further.
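The non-additivity point can be made quantitative with a minimal toy, unrelated to any real formulation data: a two-ingredient synergy y = x1 * x2 on a binary grid (an AND-like effect, nonzero only when both ingredients are present) cannot be matched by any purely additive model, no matter the coefficients. The brute-force search below confirms that the best additive fit always leaves a worst-case error of 0.25.

```python
# Toy illustration of why additive models miss interactions: the synergistic
# response y = x1 * x2 on {0,1}^2 cannot be fit by y ≈ a*x1 + b*x2 + c.
import itertools

data = [((x1, x2), x1 * x2) for x1, x2 in itertools.product([0, 1], repeat=2)]

def best_additive_error(data):
    """Brute-force the best additive fit over a coarse coefficient grid."""
    best = float("inf")
    grid = [i / 4 for i in range(-4, 9)]  # coefficients in [-1, 2]
    for a in grid:
        for b in grid:
            for c in grid:
                err = max(abs(a * x1 + b * x2 + c - y)
                          for (x1, x2), y in data)
                best = min(best, err)
    return best

print(best_additive_error(data))  # → 0.25: no additive model can do better
```

A short argument shows 0.25 is a true lower bound for all real coefficients, not a grid artifact: the signed errors at the four corners always satisfy e01 + e10 - e00 - e11 = 1, so at least one error has magnitude 0.25. Multi-way synergies in real formulations pose the same obstacle in higher dimensions.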
Traditional machine learning methods handle some nonlinearity. Random forests and gradient boosting can capture nonlinear relationships and pairwise or other low-order interaction effects through their tree-based partitioning of feature space. However, they scale poorly and use data inefficiently when high-order interactions, those involving many variables simultaneously, become important.
Deep neural networks learn nonlinear, multivariate relationships through a different mechanism. Their layered architecture with distributed representations and parameter sharing allows them to build hierarchical compositions of features: low-level patterns combine into higher-level patterns that combine into predictions. Rather than partitioning feature space into discrete regions, DNNs learn smooth, high-dimensional functions that can capture emergent behavior in formulation systems: shear-thinning from wormlike micelle networks, phase inversion phenomena, colloidal destabilization at critical electrolyte concentrations.
End-to-End Learning from Molecular Structure
Traditional machine learning models for molecular property prediction require input features. Someone must decide which descriptors to calculate for each molecule or mixture: perhaps molecular weight, number of hydrogen bond donors and acceptors, calculated LogP, surface area, and a selection of topological indices. The model then learns relationships between these descriptors and the target property.
This approach has significant limitations, particularly for formulations. First, the choice of descriptors constrains what the model can learn; if an important structural feature or interaction is not captured by any input descriptor, the model has no access to it. Second, standard molecular descriptors were developed primarily for single molecules and do not naturally capture mixture-level phenomena.
Graph neural networks and other deep learning architectures for chemistry take molecular structure as direct input. The molecule is represented as a graph, with atoms as nodes and bonds as edges, or as a SMILES string that encodes connectivity. The neural network learns to extract relevant features from this structural representation, building internal embeddings that capture structure-property relationships without requiring handcrafted descriptors.
Handling High-Dimensional Formulation Space
A realistic formulation may contain 15 to 30 ingredients, each at variable concentrations, along with processing parameters such as mixing speed, temperature, and addition order. The resulting design space is combinatorial: even under coarse discretization, the number of candidate formulations quickly reaches billions. Exhaustive experimental exploration is therefore infeasible.
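A back-of-envelope count makes the combinatorics explicit. The ingredient count, grid resolution, and process settings below are illustrative choices consistent with the ranges just described, not a specific product.

```python
# Rough size of the design space: each ingredient varied over a coarse
# concentration grid, times a few discrete process settings (illustrative).
n_ingredients = 20
levels_per_ingredient = 5     # coarse concentration grid
process_settings = 3 * 4 * 2  # e.g. mixing speeds x temperatures x orders

candidates = levels_per_ingredient ** n_ingredients * process_settings
print(f"{candidates:.2e}")    # → 2.29e+15 candidate formulations
```

Even at one experiment per second around the clock, screening this grid exhaustively would take tens of millions of years, which is why surrogate-guided prioritization is the only practical route.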
Deep learning does not eliminate the curse of dimensionality; no method can. However, the effective complexity of the problem is often much lower than the raw dimensionality suggests. Empirically, many real-world systems appear to obey a version of the manifold hypothesis: although formulations are specified in a high-dimensional space, the subset that yields physically meaningful, stable, and performant products may lie on a much lower-dimensional manifold embedded within it.
Deep neural networks are well-suited to exploit this structure when it exists. Through hierarchical nonlinear transformations, they learn compressed latent representations that can align with the underlying manifold of valid formulations. This allows them to generalize across sparsely sampled regions of the input space and capture interactions between ingredients and process variables without requiring exhaustive coverage.
Applying Foundational Models to New Chemical Systems
Formulation data is expensive to generate. Each data point requires synthesizing a formulation and running measurements, which takes time and resources. Traditional machine learning models must be trained from scratch for each new chemical system. A model trained on anionic surfactant formulations provides limited advantage when a formulator begins working with a new class of nonionic or biosurfactant systems, because the model has no mechanism to transfer what it learned about molecular behavior and interactions.
Deep learning enables a different paradigm: foundational models trained on broad, diverse formulation data can be applied to new chemical systems with limited additional data. The key is that a well-trained deep network learns general representations of how molecular structure relates to formulation behavior. When encountering a new surfactant or polymer system, the model can leverage its learned understanding of structure-property relationships, requiring only modest fine-tuning data from the new system rather than building knowledge from scratch.
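The fine-tuning mechanics can be sketched in a few lines: a "pretrained" encoder is frozen, and only a small head is refit on a handful of measurements from the new system. Everything below is a stand-in built for illustration; the encoder is a hand-written function rather than a real pretrained network, and the data is synthetic (generated from y = 2x^2 + 1).

```python
# Sketch of fine-tuning with a frozen encoder (all components are stand-ins).

def frozen_encoder(x):
    """Stand-in for a pretrained featurizer: raw input -> 2-d embedding.
    In the real paradigm, this is the part learned from broad pretraining."""
    return [x, x * x]

def fit_head(xs, ys, steps=10000, lr=0.1):
    """Refit only the linear head by SGD; the encoder is never updated."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(steps):
        for x, y in zip(xs, ys):
            e = frozen_encoder(x)
            g = (w[0] * e[0] + w[1] * e[1] + b) - y  # prediction error
            w[0] -= lr * g * e[0]
            w[1] -= lr * g * e[1]
            b -= lr * g
    return w, b

# "Few-shot" data from the new system: three points suffice here because
# the frozen features already expose the right structure (synthetic data).
xs, ys = [0.0, 0.5, 1.0], [1.0, 1.5, 3.0]
w, b = fit_head(xs, ys)
predict = lambda x: sum(wi * ei for wi, ei in zip(w, frozen_encoder(x))) + b
print(round(predict(0.8), 2))  # generating function gives 2.28 at x = 0.8
```

Training from scratch on three points with no prior representation would be hopeless; the transfer setting works precisely because the frozen features carry knowledge from elsewhere.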
The Critical Role of Pretraining Data
The value of a foundational model depends entirely on what it was trained on. A model learns to recognize patterns present in its training data; if that data does not cover the phenomena relevant to formulation, the model will not encode useful representations for formulation prediction.
Building an effective pretraining dataset for formulation requires deliberate coverage across multiple dimensions. First, chemical diversity: the dataset must include the classes of molecules that appear in real formulations, including surfactants of various architectures and charges, polymers with different backbone chemistries and molecular weights, and the range of cosolvents, salts, and functional additives used in practice. Second, concentration ranges: formulation behavior depends strongly on concentration, and models must see examples across dilute, semi-dilute, and concentrated regimes to learn how properties change. Third, interaction regimes: the dataset should be built up hierarchically, from single-component systems through binary and ternary mixtures to full multi-component formulations, so the model can learn how pairwise and higher-order interactions compose.
This breadth requirement is why not all deep learning approaches to formulation are equivalent. A model trained on a narrow slice of chemical space, or only at a single concentration regime, will have blind spots when applied to systems outside that training distribution. Pretraining defines the inductive bias of the model, determining what patterns it is predisposed to recognize. Systematic coverage of formulation-relevant chemistry is what separates a useful foundational model from one that fails to generalize.
FastFormulator's Approach
FastFormulator's technology is built on deep neural networks designed specifically for formulation property prediction. The platform takes structured formulation inputs (component identities, concentrations, processing parameters) and produces numerical estimates of physicochemical properties as outputs.
At the molecular level, the platform uses a molecular encoder trained on millions of chemical compounds to learn representations of molecular structure. This encoder captures intra-molecular features that influence how each ingredient behaves. Critically, the architecture extends beyond single-molecule encoding: additional model components learn inter-component interactions conditioned on mixture composition, capturing concentration-dependent effects that single-molecule descriptors miss. The models are trained on thousands of complex formulation examples, not just isolated compound properties. This formulation-specific training enables the models to learn multi-ingredient interactions and predict emergent properties that arise from component combinations.
The platform provides foundational models covering the physicochemical properties most critical to formulation design. The Virtual Viscometer predicts viscosity and full rheology flow curves across shear rates, the Virtual Stability Chamber predicts formulation stability after a specified time under defined environmental conditions such as temperature, the Virtual Surface Tensiometer models surface tension, and the Virtual Foam Analyzer predicts foam formation and longevity. These properties are tightly coupled: changing one affects the others, and optimizing a formulation requires predicting all of them together.
The practical value lies in identifying promising formulation candidates computationally before committing to laboratory synthesis and testing. This addresses the combinatorial design space challenge discussed earlier: rather than exhaustive trial-and-error screening, formulators can use predictive models to prioritize the most promising regions of formulation space. When the pretraining data covers the relevant chemical space and interaction regimes, this approach can reduce failed experiments and accelerate development cycles by focusing experimental effort where it is most likely to yield useful results.
Takeaways
The terminology around AI, machine learning, and deep learning matters because these terms describe fundamentally different technical approaches. Large language models are not suitable for quantitative formulation prediction; they operate on text, not structured composition-to-property mappings. Traditional machine learning works well when relationships can be captured in handcrafted features, but formulation design involves complex nonlinear interactions among many components where these methods face limitations in scalability and sample efficiency.
Deep neural networks provide meaningful advantages for complex formulation problems when appropriate pretraining data is available: they capture high-order interactions, learn directly from molecular structure, and leverage foundational models to make predictions on new chemical systems with limited additional data. The quality of pretraining data determines whether deep learning delivers on this potential. Models must be trained on formulation-relevant chemistry across diverse surfactants, polymers, concentration regimes, and interaction phenomena.
For complex formulation systems, deep learning represents a different class of capability than traditional approaches, enabling predictions that traditional methods do not reliably make and opening design spaces that would otherwise be practically unexplorable.