Decode and Design the Structure of Life

The world's most general-purpose atomistic foundation model, unifying structure prediction and de novo generation for the atoms of life

The 2024 Nobel Prize in Chemistry marked a historic achievement: a once intractable problem in biology has now largely been solved through AI. This problem was once the “Holy Grail” of structural biology: predicting a protein structure from its primary amino acid sequence. Yet, in drug design, knowing a protein's structure is just the first step in an effort that often takes over a decade. Solving monomer folding is merely a stepping stone to the ultimate goal: making biology programmable, opening endless possibilities in medicine and the many life science industries beyond.

Achieving this goal will require stepping into a new frontier to develop tools that not only predict the structures of life, but which also design new molecules in a programmable manner, integrated with as much experimental data as can be harnessed. Furthermore, we must strive to increase the rate of experimental data collection, increasing the fidelity and applicability of these tools.

Today, we introduce VantAI's first foundational model: Neo-1. Neo-1 unifies structure prediction and molecular design at an atomic level, allowing prompting with multimodal and fine-grained structural information both for individual molecules and their interactions. In addition to designing biomolecules, this programmability allows Neo-1 to accelerate the collection of structural data when combined with our cross-linking mass spectrometry (XLMS) platform, NeoLink.

Neo-1 integrates state-of-the-art structure prediction and all-atom molecular generation into a single, unified model—to our knowledge, the world's first model to decode and design the structure of life. We're excited to give a glimpse today into how we're combining the capabilities of Neo-1 and NeoLink, bringing us one step closer to transforming biology into an engineering discipline, where we can engineer the biological circuits of nature.

Fig. 1: Capabilities of structural biology foundation models over the generations.

Introducing Neo

Fig. 2: Neo-1 de novo designing and co-folding an all-atom molecular glue structure simultaneously from sequence alone, a feat not possible with previous models.

VantAI was started with a grand vision: making protein interactions programmable. The current approach to designing artificial proteins as therapies is, in a sense, reinventing the wheel. Billions of years of evolution have endowed our cells with proteins which fulfill every function needed for survival. What if, instead, we could leverage the diversity of proteins and their many functions already present in our cells, and precisely direct them against diseases?

Many protein therapeutics (e.g. antibodies) are currently only able to reach extracellular targets or a limited set of tissues, often triggering unwanted immune responses. By contrast, reprogramming proteins already present within cells—redirecting them toward new targets using small molecules, peptides, or macrocycles—enables the treatment of a wide array of diseases currently considered untreatable, opening a new chapter in medicine, as evidenced by the dozens of ongoing clinical trials in protein degradation.

Naturally occurring proteins reprogrammed to have new functions are known as “Neoproteins.” They are at the center of a new therapeutic modality called Proximity Modulation (ProMod) that VantAI is pioneering.

However, existing approaches prevalent in small molecule drug discovery, where the structure of a protein target is first determined experimentally, or computationally via folding methods, and then a molecule is designed against that structure, do not apply here. This is because the complex often only exists (stably) in the presence of the molecule and hence its structure can often not be determined correctly without the molecule. For this task in particular, simultaneous co-folding and generation is required, but impossible with current methods.

Designing Neoproteins with a unified model that simultaneously folds the complex structure and designs a small molecule is now achievable for the first time. Hence we call our foundation model series “Neo”.

The technical leap required to achieve this unlocked not just ProMod design, but design of all molecular modalities in the process. We trained Neo-1 to not only decode and design ProMods, but to decode and design the structure of any molecule, including proteins, small molecules, and more. We are excited to work together with partners in academia and industry to leverage these capabilities.

Neo-1 Foundation Model Capabilities

Molecular Generation

Designing novel therapeutics

Protein Design

Creating the molecules of life

Structure Prediction

Revealing molecular architecture

Inpainting

Generating improved molecules

And many more...

A Next-Generation Model Unifying Structure Prediction and De Novo Generation

Current all-atom structure prediction methods share a common blueprint, replicating or closely following the ideas introduced in AlphaFold2 in 2020: they predict 3D atomic coordinates directly, provided an input sequence.

In parallel, there has recently been a surge in task-specific models developed for purposes such as protein backbone generation or designing small molecules targeting predefined binding pockets. These approaches typically alternate between separate molecule generation and structure prediction stages, each utilizing specialized models.

Such iterative workflows can inadvertently accumulate errors and diminish controllability, primarily due to the absence of comprehensive, all-atom structural understanding. Critically, this fragmented approach restricts the design of molecules capable of inducing large changes in protein conformations or complexes—a key requirement of rational ProMod design—where simultaneous, integrated structure prediction and molecular generation is essential.

Neo-1, for the first time, enables such integration between prediction and generation, ushering in the next chapter of atom-level foundation models. We achieve this by moving the diffusion process from the conventional coordinate space to the latent space, enabling the model to reason over a smoother landscape of both sequence and structure. This shift has enabled Neo-1 to generate completely novel molecules, including proteins, peptides, and small molecules, at all-atom resolution, while simultaneously predicting their structures with state-of-the-art accuracy.

Fig. 3: All-atom de novo designed molecules by Neo-1 across different molecular types. Designed parts shown in green. Select prompting highlighted in pink.

A Step-Change in Generality and Programmability

Neo-1 unifies a plethora of tasks traditionally tackled with separate specialist models: all-atom co-folding, docking, inverse folding, all-atom protein design, small molecule design, motif scaffolding, R-group enumeration, fragment linking, among others—all within a single model.

At its core, Neo-1 uses a learned, unified latent representation of different biomolecules that compresses and abstracts information from various input modalities. This learned latent representation can be decoded into complete molecules, including small molecules, lipids, proteins, and DNA/RNA, along with their atom types and coordinates. By varying the input from which the model has to construct this latent representation, the model can perform any task from entirely de novo protein-ligand complex generation to inpainting of small molecular fragments in an otherwise provided structure, and we train Neo-1 on a mixture of these tasks. For example, providing sequence-only inputs turns the prediction into a folding task, providing partial structure conditioning turns the prediction into an inpainting task, and prompting the model to generate a small molecule given protein sequence(s) simultaneously designs the small molecule and co-folds the complex structure. We also include auxiliary conditioning, such as molecule type, binding site, distance restraints, and molecular properties, to further increase programmability. This means at inference time, Neo-1 can be prompted with desired sequence, structure and/or property information.
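The way the choice of inputs determines the task can be illustrated with a small sketch. This is not Neo-1's interface: the `Prompt` class, its field names, and the `infer_task` function are hypothetical and exist only to make the mapping described above concrete.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Prompt:
    """Hypothetical prompt container; field names are illustrative only."""
    protein_sequences: list[str] = field(default_factory=list)
    partial_structure: Optional[str] = None  # e.g. an identifier for known coordinates
    generate_small_molecule: bool = False    # ask the model to design a ligand


def infer_task(prompt: Prompt) -> str:
    """Name the task implied by which inputs are present (assumed mapping)."""
    if prompt.partial_structure is not None:
        # Partial structure conditioning: fill in the missing parts.
        return "inpainting"
    if prompt.generate_small_molecule and prompt.protein_sequences:
        # Sequences plus a request for a ligand: design and co-fold together.
        return "small molecule design + co-folding"
    if prompt.protein_sequences:
        # Sequence-only input: plain structure prediction.
        return "folding"
    # Nothing provided: fully unconditional generation.
    return "de novo generation"


print(infer_task(Prompt(protein_sequences=["MKTAYIAK"])))
```

Auxiliary conditioning such as binding sites or distance restraints would simply be further optional fields on the same hypothetical container.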

Fig. 4 shows a limited selection of Neo-1 generated small molecules to illustrate just a few of the precise and diverse structural and non-structural prompts Neo-1 can leverage to generate novel small molecules and ProMods. They include 1) pocket-specific co-folding and small molecule generation if provided with a sequence and four residues to indicate the binding site, 2) R-group enumeration if provided with a known structure and molecular scaffold, 3) molecular glue design and complex co-folding if prompted with two protein sequences and 4) expanding a molecular glue for a specific protein-protein interface when provided with a binder scaffold and protein structures.

Fig. 4: Neo-1 can de novo design biomolecules with fine-grained and diverse structural prompting. Prompts to generate structures are shown on the left. De novo generated small molecules are shown in the middle, with the generated atoms highlighted in green. Known reference binders are shown on the right with the re-designed elements highlighted in gray.

Fig. 5 highlights the same versatility for proteins, including 1) binder design when provided with a desired target protein structure, 2) antibody V_H loop design against a specific epitope when provided with distance restraints, a desired target structure, and partial V_H structure, 3) DNA/RNA binder design when provided with a protein scaffold structure and a desired oligonucleotide binder structure, 4) design of peptide non-canonical amino acids when provided a partial peptide and target structure.

Fig. 5: Neo-1 can de novo design biomolecules with fine-grained and diverse structural prompting. Prompts to generate structures with desired features are shown on the left. Examples of Neo-1 de novo generated proteins, loops, and peptides with non-canonical amino acids are shown in the middle, with generated atoms highlighted in green. Known reference molecules are shown on the right, with the re-designed elements highlighted in gray.

Neo-1 Generates Diverse and Desirable Molecules

Neo-1 generates valid and structurally diverse proteins and small molecules with desirable properties. This demonstrates it has accurately learned the underlying data distributions, as seen in Fig. 6 & 7.

Fig. 6 illustrates property distributions for small molecules generated by Neo-1 without prompting into specific directions of molecular space, highlighting the inherent versatility and robustness of the model. Neo-1 consistently yields diverse and chemically valid molecules exhibiting drug-like properties, with atom type distributions closely matching those found in known drug-like molecules (Fig. 6C). Ligands produced by Neo-1 have both similar and different shapes to reference compounds (Fig. 6D), demonstrating utility to explore both validated and novel interactions for drug discovery.

Fig. 6: A & B: For around 6000 samples generated against 42 diverse protein sequences, the distribution of QED (Quantitative Estimate of Druglikeness) and SAS (Synthetic Accessibility Score) produced by Neo-1 is similar to drug-like molecules in the PDB. C: Neo-1 produces similar atom type distribution to PDB. D: A high-druglikeness sample of around 800 Neo-1 generated molecules shows both similar and different shapes compared to co-crystallized reference compounds. Right panel: Neo-1 generated molecules.

Fig. 7 illustrates an analogous case for protein generation, focusing on secondary structure. The structure and sequence of 610 proteins were jointly generated, sampling lengths approximately evenly from 51 to 392 amino acids. Unlike early protein design models, Neo-1 does not exhibit a bias towards helicity, accurately matching the secondary structure distribution found in the PDB (Fig. 7A & B). Amino acid distributions are matched equally well (Fig. 7C), and Neo-1 generates a mixture of known and novel structures as seen by the TM-score (Fig. 7D).

Fig. 7: A & B: Generated proteins closely match the secondary structure distribution of proteins in the training dataset. C: Generated proteins contain approximately the same distribution of amino acids as the training dataset, with a slight overrepresentation of alanine. D: Neo-1 can generate proteins significantly different from PDB. Right panel: Neo-1 generated protein samples.
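As a rough illustration of how a secondary structure distribution like the one in Fig. 7 can be tabulated, the sketch below reduces a DSSP-style per-residue string to helix, strand, and coil fractions. The three-state grouping (H/G/I as helix, E/B as strand, everything else as coil) is a common convention and an assumption here; the post does not say which assignment tool or grouping was used.

```python
from collections import Counter

# Assumed three-state reduction of DSSP codes.
HELIX_CODES = set("HGI")
STRAND_CODES = set("EB")


def secondary_structure_fractions(dssp: str) -> dict[str, float]:
    """Return the fraction of residues in helix, strand, and coil."""
    if not dssp:
        raise ValueError("secondary structure string must not be empty")
    counts = Counter(
        "helix" if code in HELIX_CODES
        else "strand" if code in STRAND_CODES
        else "coil"
        for code in dssp
    )
    total = len(dssp)
    return {state: counts[state] / total for state in ("helix", "strand", "coil")}


print(secondary_structure_fractions("HHHHEE--TT"))
```

Averaging these fractions over a set of generated proteins and over a reference set gives two distributions that can be compared directly.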

Steerable Molecular Generation Unlocks Precision Generation

Neo-1 simultaneously generates predictions in a "coarse-to-fine" manner, offering distinct advantages over autoregressive models commonly used in protein and small molecule design. Autoregressive models sequentially generate atoms without the flexibility to adjust previously generated portions based on later additions. In contrast, Neo-1 enables steering of molecule generation towards any objective by applying intermediate rewards across the entire molecular structure. Furthermore, unlike diffusion-based guidance methods, Neo-1's inference-time steering accommodates complex multi-property optimization, including properties that are non-differentiable, without requiring retraining or external classifiers. Fig. 8 illustrates Neo-1's steering capability on a simple example by demonstrating the generation of more rigid molecules, an essential process in lead optimization, particularly for ProMods.

Fig. 8: Steered design of a muscarinic M1 receptor binder, based on input sequence (6ZFZ). Design process aims to reduce number of rotatable bonds, yielding more rigid molecular scaffolds. Top right plot compares distribution of the steered property (rotatable bonds) with and without steering. Examples of Neo-1 generation with and without steering are shown on the left (with number of rotatable bonds annotated), while the bottom right illustrates an example of the entire steered complex generated by Neo-1.
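The general idea of steering with intermediate, non-differentiable rewards can be shown with a toy example. The post does not describe Neo-1's steering algorithm, so the sketch below swaps in a generic reward-ranked resampling scheme over a population of candidates. Each "molecule" is reduced to a single integer (its rotatable-bond count), and the `refine` step is a random stand-in for a real generative refinement step. All names are hypothetical.

```python
import random
from statistics import mean


def refine(value: int, rng: random.Random) -> int:
    """Toy refinement step: randomly nudge the rotatable-bond count."""
    return max(0, value + rng.choice((-1, 0, 1)))


def reward(value: int) -> float:
    """Non-differentiable reward: fewer rotatable bonds scores higher."""
    return -float(value)


def generate(n_particles: int, n_steps: int, steer: bool, seed: int = 0) -> list[int]:
    """Run the toy generation, optionally resampling by reward after each step."""
    rng = random.Random(seed)
    particles = [10] * n_particles  # every candidate starts with 10 rotatable bonds
    for _ in range(n_steps):
        particles = [refine(p, rng) for p in particles]
        if steer:
            # Keep the better-scoring half and copy it to refill the population.
            ranked = sorted(particles, key=reward, reverse=True)
            survivors = ranked[: max(1, n_particles // 2)]
            particles = [survivors[i % len(survivors)] for i in range(n_particles)]
    return particles


unsteered = generate(200, 20, steer=False)
steered = generate(200, 20, steer=True)
print(mean(unsteered), mean(steered))
```

Because the reward is only ever evaluated and ranked, never differentiated, any scoring function can be plugged in, which is the property the paragraph above emphasizes.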

Highly Accurate Structure Prediction

When prompted with complete sequence information but no structure, Neo-1 serves as a structure prediction model.

Neo-1 was trained using structural data and clusters defined in our previously presented datasets, PINDER and PLINDER, which were created through a collaboration with NVIDIA, MIT, and the University of Basel. We also included curated, real and synthetic datasets covering monomers, protein-protein, and protein-ligand complexes. Training and evaluating Neo-1 was made possible by computational accelerations provided by GPU-mmseqs2, a tool developed by NVIDIA and SNU. Inference will be powered by NVIDIA's recently announced MSA-Search NIM.

We compared Neo-1's structure prediction capabilities against Boltz-1, a validated open-source reproduction of the current state-of-the-art AlphaFold 3 model. We use identical time cutoffs for train and test splits and leverage the test set results directly provided by Boltz-1 where possible. We split the evaluation across three drug-discovery relevant scenarios to prevent biases due to varying sample sizes: Protein-Protein interactions (PPIs, i.e., multiple protein chains with no small molecules), Protein-Ligand interactions (PLIs, i.e., at least one small molecule with more than five heavy atoms, excl. oligosaccharides and covalently bound ligands), and Monomers (i.e., single protein chains with no small molecules).

Overall, Neo-1 achieves performance comparable to Boltz-1, with strengths and limitations reflecting its training distribution. It notably excels in the prediction of binding pockets (success defined as <5.0 Å RMSD), protein-protein complexes, and, in particular, ProMod-induced complexes, as demonstrated in Fig. 9 & 10. Neo-1's slightly lower accuracy on monomers reflects their reduced representation in training, consistent with the model's primary focus on complex structures relevant to drug discovery involving ligands or binder proteins.

TYPE	METRIC	NEO-1	BOLTZ-1
PPI (PROTEIN-PROTEIN INTERFACES)
	PPI Success Rate (DockQ > 0.23) (⭡)	0.68	0.69
	I-RMSD (⭣)	5.55	6.51
	L-RMSD (⭣)	10.60	15.42
PLI (PROTEIN-LIGAND INTERFACES)
	PLI-LDDT (⭡)	0.49	0.48
	PLI Success Rate (< 2.0 Å RMSD) (⭡)	0.33	0.36
	PLI Success Rate (< 5.0 Å RMSD) (⭡)	0.65	0.50
MONOMER
	BB-LDDT (⭡)	0.81	0.92
	TM-Score (⭡)	0.87	0.93

Fig. 9: Structure prediction performance of Neo-1 and Boltz-1. Only systems with predictions from both Neo-1 and Boltz-1 are included (excluding systems with oligonucleotides), resulting in 191 PPIs, 30 PLIs and 163 monomers. Boltz-1 predictions sourced from official GitHub. Both methods use matched inputs (MSA, sequences, SMILES) and do not incorporate protein templates or binding-site conditioning. Mean oracle metrics reported. I-RMSD/L-RMSD: Interface/Ligand root mean square deviation.
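For readers unfamiliar with the reporting convention, the sketch below shows one way "mean oracle metrics" and a DockQ-based success rate can be computed. It assumes that "oracle" means scoring every sampled prediction for a system against the ground truth and keeping the best one, which is the usual meaning in co-folding benchmarks; the post does not define the term or state how many samples were drawn. The data values are made up for illustration.

```python
DOCKQ_SUCCESS_THRESHOLD = 0.23  # threshold quoted in the table above


def oracle_scores(samples_per_system: dict[str, list[float]]) -> dict[str, float]:
    """Keep the best (highest) DockQ among the samples of each system."""
    return {system: max(scores) for system, scores in samples_per_system.items()}


def mean_oracle(oracle: dict[str, float]) -> float:
    """Average the per-system oracle scores."""
    return sum(oracle.values()) / len(oracle)


def success_rate(oracle: dict[str, float],
                 threshold: float = DOCKQ_SUCCESS_THRESHOLD) -> float:
    """Fraction of systems whose oracle score exceeds the threshold."""
    return sum(score > threshold for score in oracle.values()) / len(oracle)


# Hypothetical DockQ values for four systems, several samples each.
dockq = {
    "sys_a": [0.10, 0.50],
    "sys_b": [0.05, 0.20],
    "sys_c": [0.30],
    "sys_d": [0.90, 0.10],
}
best = oracle_scores(dockq)
print(mean_oracle(best), success_rate(best))
```

For metrics where lower is better, such as I-RMSD or L-RMSD, the same logic applies with `min` in place of `max`.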

As illustrated in Fig. 10, Neo-1 demonstrates exceptional performance in challenging prediction scenarios highly relevant to real-world drug discovery programs, such as ternary complexes, antibody-antigen interactions, and protein-peptide complexes.

Fig. 10: Structure prediction capabilities of Neo-1 for biomolecular complexes outside the training set. Ground truth structures are shown in transparent shades, predicted structures in blue (proteins) and green (small molecules/peptides). First panel: M1-StaR-T4L in complex with GSK1034702, a muscarinic M1 receptor agonist for Alzheimer's disease (6ZG9). Second panel: Molecular glue ternary complex: HIV-1 protease in complex with a novel non-peptidic inhibitor (7WBS). Third panel: Protein-peptide & antibody complexes: Fab Fragment of Monoclonal Antibody LNKB-2 complexed with Antigenic Nonapeptide from Human Interleukin-2 (7YZJ).

Proteolysis-targeting Chimeras (PROTACs), a class of ProMods, are two small molecules connected by a linker designed to bind two proteins together in the cell. They are an ideal and difficult testing ground for all-atom folding models. Their protein interface is difficult to predict because the two proteins show little or no co-evolution, a critical input for current folding models. Additionally, the protein interaction is often transient and highly mobile, as long linkers allow proteins with limited compatibility and few interactions to be brought together, unlike molecular glues, which often induce less transient and more substantial protein interactions.

As seen in Fig. 11, Neo-1 excels in PROTAC-complex prediction, significantly outperforming Boltz-1, across 19 PROTAC structures released after the training time cutoff. As shown in the next section and unique to Neo-1, its performance can be even further improved by leveraging structural information that's available for PROTACs due to their chemical composition, leading to highly accurate predictions not possible with other models.

Fig. 11: Performance of Neo-1 and Boltz-1 on 19 PROTAC ternary complex structures. Neo-1 recovers the correct binding interface in 12 structures, while Boltz-1 correctly predicts 9. Neo-1 correctly predicts the PROTAC conformation and binding mode to sub-2 Å accuracy in 7 cases, despite the long, flexible linkers used to induce the non-native interface.

NeoLink: A New Era in Structural Biology Powered by Black-Box Data and AI

Neo-1 can effectively incorporate any degree of structural restraints, binding sites, or partial structures for both individual molecules and their interfaces within a complex. This broad and precise programmability is fundamental to fully realizing the potential of VantAI's innovative structural data generation platform, NeoLink.

NeoLink represents a transformative advancement in proteomics and structural biology more broadly, analogous to the revolutionary impact of shotgun sequencing in genomics. Just as shotgun sequencing dramatically reduced the cost and increased the throughput of genomic data generation by fragmenting DNA into easily sequenced segments, NeoLink leverages cross-linking mass spectrometry (XLMS)-based chemical "rulers" to measure interatomic distances at scale. These chemical rulers generate sparse structural restraints by capturing proximity information about atoms on molecular surfaces, providing cost-effective, high-throughput snapshots of structural interactions.

However, like shotgun sequencing's fragmented reads, NeoLink's structural outputs are sparse and not directly interpretable by humans without computational assembly. Prior to Neo-1, these sparse structural constraints lacked sufficient resolution to accurately reconstruct complete protein structures. Neo-1 addresses this limitation by computationally assembling the sparse restraints into full, atomic-resolution structures, similar to how bioinformatics tools were developed in the early 2000s to reassemble genomic fragments into a coherent genome. This integration positions NeoLink as an exemplary case of "black-box" data generation—highly automated, AI-targeted, and optimized for computational, rather than manual, interpretation.

Moreover, NeoLink creates a powerful data-driven feedback loop: as Neo-1 continually refines its structure-assembly capabilities, each generation of Neo-1 produces increasingly precise structures, further enriching subsequent datasets. This data flywheel progressively enhances Neo-1's predictive capabilities, laying the foundation for continual improvement in future models. Fig. 12 & 13 illustrate this process, showing how NeoLink's cross-linking approach interrogates structural interactions to enable precise molecular reconstruction at the cellular level.

Fig. 12: Illustration of a ternary complex induced by a molecular glue captured by crosslinks.

Unlike existing experimental methods, with NeoLink, structural data can be obtained for the whole cell at once, exceeding the throughput of current structure measurement paradigms by several orders of magnitude at a fraction of their cost.

Fig. 13: Illustration of proteome-wide crosslinking capturing whole-proteome information in a single experiment. Bright lines symbolize crosslinks (3D Illustration of cellular landscape by Naveen Devasagayam).

NeoLink Reveals Both Induced and Natural Interactions

These data are particularly helpful when making predictions for ProMods (e.g. PROTACs), where data is especially limited. Compared to obligate natural interactions, ProMod-induced protein interfaces are often more transient and show little to no co-evolution, making them hard to predict for methods that rely on co-evolutionary signals extracted from Multiple Sequence Alignments (MSAs). When provided with structural restraints such as those obtained via NeoLink, Neo-1 can reconstruct all tested structures with even greater accuracy, as shown in Fig. 14.

Fig. 14: When 5 distance restraints between protein chains are available (4-10 Å potential distance), Neo-1 can recover the correct binding interface in all cases for glues (left) and PROTACs (right). Note that only a small number of structures for ternary complexes are available after Neo-1’s training cutoff (N=19 for PROTACs, N=32 for glues). To obtain sufficient samples, the glue set is enriched with molecules binding between two chains (excluding artifacts).
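Checking whether a predicted structure honors a set of inter-chain distance restraints is simple to sketch. The code below assumes each restraint is a pair of 3D coordinates (one atom per chain) and reads the "4-10 Å" in the caption as an allowed distance window; that reading, the function names, and the example coordinates are all assumptions made for illustration.

```python
from math import dist

Coord = tuple[float, float, float]


def restraint_satisfied(a: Coord, b: Coord,
                        min_d: float = 4.0, max_d: float = 10.0) -> bool:
    """True if the two atoms lie within the allowed distance window (in Å)."""
    return min_d <= dist(a, b) <= max_d


def fraction_satisfied(pairs: list[tuple[Coord, Coord]],
                       min_d: float = 4.0, max_d: float = 10.0) -> float:
    """Fraction of restraints that the predicted coordinates satisfy."""
    if not pairs:
        raise ValueError("at least one restraint is required")
    hits = sum(restraint_satisfied(a, b, min_d, max_d) for a, b in pairs)
    return hits / len(pairs)


# Hypothetical atom pairs taken from a predicted complex.
restraints = [
    ((0.0, 0.0, 0.0), (0.0, 0.0, 6.0)),   # 6 Å apart: satisfied
    ((0.0, 0.0, 0.0), (0.0, 0.0, 12.0)),  # 12 Å apart: violated
    ((0.0, 0.0, 0.0), (3.0, 4.0, 0.0)),   # 5 Å apart: satisfied
]
print(fraction_satisfied(restraints))
```

The same check with a larger upper bound applies to crosslink-derived restraints, where the maximum distance depends on the linker chemistry.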

A compelling example of the impact of such restraints is illustrated by the prediction of a PROTAC-induced WDR5-VHL complex (Fig. 15). While many PROTAC ternary complexes involve targets such as VHL and CRBN, which are already structurally well-characterized and closely resemble complexes seen previously in model training data, neither Boltz-1 nor Neo-1 has previously encountered WDR5 ternary complexes. Without structural restraints, Neo-1 gets the interface close, while Boltz-1 shows a less aligned prediction. When provided with the available distance restraints, Neo-1 predicts the complex with extremely high accuracy and places the PROTAC within 1 Å RMSD of the reference.

Fig. 15: Boltz-1 and Neo-1 co-folding predictions of the WDR5-VHL PROTAC ternary complex (7JTO), released after both models' training cutoff.
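Ligand placement figures such as "within 1 Å RMSD" or the 2.0 Å and 5.0 Å success cutoffs used earlier come down to a root mean square deviation over matched atoms. The sketch below is a minimal version that assumes the predicted and reference ligands are already in the same frame (for example after superposing the proteins) and that atoms are listed in the same order; it does not handle symmetry-equivalent atoms, which real evaluation pipelines typically do.

```python
from math import sqrt

Coord = tuple[float, float, float]


def rmsd(pred: list[Coord], ref: list[Coord]) -> float:
    """Root mean square deviation between matched atom coordinates (in Å)."""
    if not pred or len(pred) != len(ref):
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (p - r) ** 2
        for atom_pred, atom_ref in zip(pred, ref)
        for p, r in zip(atom_pred, atom_ref)
    )
    return sqrt(squared / len(pred))


def is_success(pred: list[Coord], ref: list[Coord], cutoff: float = 2.0) -> bool:
    """True if the prediction falls under the chosen RMSD cutoff."""
    return rmsd(pred, ref) < cutoff


# Hypothetical two-atom ligand, shifted by 1 Å along z in the prediction.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
predicted = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(predicted, reference), is_success(predicted, reference))
```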

To illustrate Neo-1's transformative potential, Fig. 16 presents two novel complexes absent from the PDB, accurately predicted by Neo-1. To our knowledge, these structures represent entirely new biological insights (i.e. “new-to-science”). Existing folding methods either fail to produce high-confidence predictions or generate structures inconsistent with experimental evidence. In contrast, Neo-1 leverages even minimal cross-linking data to successfully predict alternative protein interfaces. These interfaces are strongly supported by complementary, orthogonal evidence and fully align with the experimental cross-linking data, highlighting how previously inaccessible structural insights can be revealed through the combined Neo-1 & NeoLink application.

Fig. 16: Two new-to-science protein complexes, not found in the PDB, predicted by Neo-1 using structural data from VantAI’s NeoLink platform shown in pink dashed lines. Left panel: Neo-1 and Boltz-1 models of ATPAF2-ATP5F1A interaction showing Neo-1 satisfies the reference crosslink restraint at 26.8 Å, while the Boltz-1 pose violates it. Right panel: Neo-1 and Boltz-1 models of UBE3D-CPSF3 interaction, with Neo-1 conforming to the reference crosslink restraint at 14.1 Å, whereas the Boltz-1 pose does not satisfy the restraint.

The ATPAF2-ATP5F1A complex, shown in Fig. 16 (left panel), is a key regulator of ATP synthase biogenesis, ensuring proper assembly of the F1 catalytic core by preventing premature α-subunit aggregation and misfolding. This interaction, vital for eukaryotic life, was structurally elusive for decades despite its central role in mitochondrial function. ATPAF2 (ATP12) binds ATP5F1A's α-subunit in a conserved “boxing glove” conformation, shielding oligomerization-prone surfaces until β-subunits are available for proper assembly (PMID: 9446613).

Neo-1 positioned ATPAF2's wrist domain near ATP5F1A's C-terminal helical bundle, satisfying the Lys531-Lys72 crosslink restraint (26.8 Å) and aligning with biochemical evidence of ATPAF2 preventing α-α self-association. In contrast, Boltz-1 models violate the crosslink restraint (>30 Å apart). Notably, Glu240 at the interface matches Glu249 of ATP12 (yeast homolog of ATPAF2), identified in prior mutational studies as a determinant for ATP synthase α-subunit binding (PMID: 9446613). While no human ATPAF2-ATP5F1A structure exists, our model provides the first atomic-level insight into this transient intermediate, supporting a conserved assembly mechanism.

CPSF3 (Fig. 16, right) is an essential endonuclease for mRNA 3'-end processing. Its dysregulation, particularly in breast cancer, makes it a promising therapeutic target (PMID: 35992060). Unexpectedly, UBE3D, a HECT E3 ligase typically involved in protein degradation, protects CPSF3 from ubiquitin-mediated degradation (PMID: 39032490). While X-ray crystallography resolved CPSF3's core nuclease and β-CASP domains, its flexible C-terminal domain remains structurally undefined. Crosslinking and HDX-MS suggest that UBE3D interacts with this region, stabilizing CPSF3. This evidence is supported evolutionarily: yeast Ysh1 interacts with Ipa1 at a similar site, with Ipa1 functioning analogously to UBE3D. While this interaction is well-documented in the literature, its structure remains unsolved.

A Neo-1 model of the UBE3D-CPSF3 complex, guided by a single K88–K487 crosslink, aligns with prior structural knowledge, positioning UBE3D’s catalytic Cys144 near CPSF3 without interfering with other subunits (e.g., Mpe1 from PDB 6I1D). Unlike AlphaFold Multimer and Boltz-1, which struggle with flexible regions, this model better reflects experimental evidence. While the precise mechanism by which UBE3D stabilizes CPSF3—whether through de-ubiquitination, SUMOylation, or alternative ubiquitin linkages—remains unclear, the Neo-1 framework offers valuable insights into its role in mRNA processing and cancer.

VantAI's NeoLink-derived data covers ~70% of the human proteome and diverse species, strategically selected to maximize geometric diversity of protein interfaces. Collectively, this data surpasses the Protein Data Bank (PDB) coverage by over 1.5-fold. To our knowledge, this is the largest and most diverse proprietary structural dataset.

Broad Structural Programmability Beyond NeoLink

Beyond NeoLink-derived distance restraints and protein-ligand distance restraints, highlighted earlier for molecular generation, Neo-1 has also been trained to be conditioned on available monomer and/or ligand structures. In these scenarios, Neo-1 effectively functions as a docking method. As seen in Fig. 17, providing known monomer structures significantly boosts accuracy.

TYPE	METRIC	DOCKING	CO-FOLDING
PPI (PROTEIN-PROTEIN INTERFACES)
	PPI Success Rate (DockQ > 0.23) (⭡)	0.82	0.68
	I-RMSD (⭣)	2.29	5.55
	L-RMSD (⭣)	6.34	10.60
PLI (PROTEIN-LIGAND INTERFACES)
	PLI-LDDT (⭡)	0.61	0.49
	PLI Success Rate (< 2.0 Å RMSD) (⭡)	0.55	0.33
	PLI Success Rate (< 5.0 Å RMSD) (⭡)	0.73	0.65

Fig. 17: When available, providing monomer or ligand structural information significantly improves the structure prediction performance of Neo-1. Compared to sequence-only inputs, this conditioning enables a docking-based approach that achieves higher accuracy across all metrics on PLI and PPI systems in the test set (N=191 for PPI, N=30 for PLI). Mean oracle metrics reported.

How a Fully Programmable Model Unlocks Drug Discovery

Neo-1's broad programmability enables many steps of traditional protein and small molecule design in a single model. In typical optimization campaigns, more and more information becomes available as a program progresses. Neo-1, unlike specialist models, can be prompted with a broad range of datapoints spanning sequence and structure, progressively increasing its utility as additional information becomes available. This unlocks what we believe to be the future of AI-enabled drug discovery—seamless iterative interaction between three interlinked contributors: 1) drug designers, 2) experimental data and 3) AI tools as copilots.

Despite this significant technological leap, AI models such as Neo-1 still have many limitations. However, these limitations can be effectively addressed through precise control, leveraging the decades of domain knowledge from drug hunters and experimental evidence.

To showcase Neo-1's unique features highlighted above, two examples are presented below. These case studies on molecular glues/inhibitors and on antibodies show how Neo-1 manages to rediscover known molecules or molecules with similar properties through step-by-step optimization typical in discovery campaigns.

Case Study 1: End-to-End Glue and Small Molecule Discovery

Fig. 18: Neo-1 unlocks end-to-end rational glue discovery. Schematic illustrating how Neo-1 can accelerate multiple stages of the drug discovery pipeline through its unique programmability. Neo-1 inputs for each example are shown at the top, while a representative Neo-1 generated molecule is highlighted in green.

Once a disease target is identified, the initial step in both small molecule and ProMod discovery is finding a molecule that engages the target (or effector and target in the ProMod case). Especially if targets are not structurally well-characterized or change conformation upon binding, being able to simultaneously generate protein structures and small molecules is critical. This is particularly important for molecular glues, which are small molecules inducing protein-protein interfaces that are often unstable without the molecule. In such a case the structure of the complex is undefined in the absence of the glue. Neo-1 is the first model able to generate small molecules and molecular glues in re-folded structures from sequence alone.

As shown for the target CDK2 in Fig. 19, Neo-1 is able to de novo generate active site inhibitors directly from protein sequence alone. Many validated active site inhibitors (e.g., roscovitine bound to CDK2, PDB ID 3DDQ) make hydrogen bonding interactions with the hinge region, so newly generated molecules can be filtered for this pattern. Generated molecules both “rediscover” the core of known binders and show different scaffolds that maintain the hinge-binding interactions crucial for kinase inhibitors. Neo-1's strong all-atom structure understanding is critical to designing such precise interactions.

Fig. 19: When prompted with the CDK2 sequence, Neo-1 generates a variety of small molecules that capture the key hinge binding interactions observed in known binders. Neo-1 co-folded and generated protein-ligand complex on left, with Neo-1 designs, filtered to recover the hydrogen bonding pattern, shown in green. Roscovitine bound to CDK2 structure (3DDQ) shown as a reference in white.
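The hinge-binding filter described above can be illustrated with a minimal sketch. It is not the filter used to produce Fig. 19: it assumes a purely distance-based hydrogen bond criterion (no angle check), a hypothetical 3.5 Å cutoff, and invented toy coordinates. The function names are illustrative only.

```python
from math import dist

# Assumed cutoff (angstroms) for a donor-acceptor heavy-atom distance.
HBOND_MAX_DISTANCE = 3.5


def has_hinge_contact(ligand_polar_atoms, hinge_polar_atoms, cutoff=HBOND_MAX_DISTANCE):
    """True if any ligand N/O atom lies within `cutoff` of a hinge backbone N/O atom."""
    return any(
        dist(lig, hinge) <= cutoff
        for lig in ligand_polar_atoms
        for hinge in hinge_polar_atoms
    )


def filter_by_hinge_contact(designs, hinge_polar_atoms):
    """Keep the names of designs that make at least one distance-based hinge contact."""
    return [
        name
        for name, atoms in designs.items()
        if has_hinge_contact(atoms, hinge_polar_atoms)
    ]


# Toy coordinates in angstroms, invented for illustration only.
hinge = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
designs = {
    "design_a": [(0.0, 2.9, 0.0)],     # 2.9 A from the first hinge atom
    "design_b": [(10.0, 10.0, 10.0)],  # far from both hinge atoms
}
kept = filter_by_hinge_contact(designs, hinge)
print(kept)
```

A production filter would typically also check donor/acceptor identity and bond angles; the distance test here only conveys the idea of screening designs for a known interaction pattern.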

If instead prompted with CDK12 and DDB1 sequences, Neo-1 is able to design molecular glues de novo that stabilize their interface. The generated molecules are diverse and vary in drug-likeness, while the co-folded complexes simultaneously reconstruct ternary complex features observed in the reference structure (PDB ID 8BUG) with high accuracy (Fig. 20).

Fig. 20: Neo-1 produces diverse molecules when prompted with only sequences of CDK12 and DDB1. Neo-1 co-folded and designed complex on the left, crystallized reference complex showing strong alignment on the right. Generated molecules shown in green. Additional designed molecules shown in the middle. The generated structures match the protein-protein (PPI) and protein-ligand (PLI) interactions of the previously reported structure (8BUG), while the generated molecular glues themselves are structurally varied and novel.

In typical drug discovery workflows, after experimental validation of initial hits, molecules are optimized for properties such as tissue exposure, stability, half-life, and many other objectives. QED (Quantitative Estimate of Druglikeness) is often used as a composite score summarizing how drug-like a molecule's physicochemical properties are. Fig. 21 shows how steering de novo generation of molecules towards higher QED scores, given only the previous DDB1/CDK12 sequences, produces diverse but increasingly drug-like molecules. This example highlights Neo-1's capability not only in initial hit discovery but critically in lead optimization, enabling rapid molecular refinement, a step that traditionally consumes multiple years in conventional drug discovery pipelines.

Fig. 21: Steering generation for higher QED score yields additional starting points with improved molecular properties. Left: Neo-1 generations without steering. Center: shift in distribution of molecular property through steering. Right: molecules generated with steering. The inputs for generation were identical to Fig. 20 except for an additional goal to sample trajectories leading to higher QED values.
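To give an intuition for how steering shifts a property distribution, the sketch below reweights a pool of candidates by an exponential tilt of their score and resamples from it. This is a generic post hoc resampling analogue, not Neo-1's steering mechanism, which acts on generation trajectories. The scores, molecule names, and the `strength` parameter are invented for illustration.

```python
import math
import random


def tilted_weights(scores, strength):
    """Normalized weights proportional to exp(strength * score)."""
    top = max(scores)  # subtract the max for numerical stability
    raw = [math.exp(strength * (s - top)) for s in scores]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_mean(scores, weights):
    """Expected score under the given normalized weights."""
    return sum(s * w for s, w in zip(scores, weights))


def steer(candidates, scores, strength, n_keep, seed=0):
    """Resample candidates with probability proportional to the tilted weights."""
    rng = random.Random(seed)
    weights = tilted_weights(scores, strength)
    return rng.choices(candidates, weights=weights, k=n_keep)


# Invented QED-like scores in [0, 1] for four hypothetical molecules.
candidates = ["mol_1", "mol_2", "mol_3", "mol_4"]
scores = [0.20, 0.45, 0.60, 0.85]

unsteered_mean = sum(scores) / len(scores)
steered_mean = weighted_mean(scores, tilted_weights(scores, strength=5.0))
resampled = steer(candidates, scores, strength=5.0, n_keep=10)
print(unsteered_mean, steered_mean, resampled)
```

With `strength=0` the weights are uniform and nothing changes; larger values concentrate sampling on high-scoring candidates at the cost of diversity, which is the trade-off any steering scheme has to balance.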

Once optimal starting points have been found and optimized, a "lead series" of molecules is often identified. There, a common scaffold that has been validated to be critical for binding is held constant while other parts of the molecule are changed to optimize different properties. Neo-1 can be prompted with the structure and/or sequence of such scaffolds and the target to further expand on the lead molecules. To accommodate the compound, part of the protein complex may also be generated ad hoc by the model: only the structure known to bind the key recognition motif is provided, and the rest is co-folded while new molecules are generated (Fig. 22).

Fig. 22: Prompted with CDK12 structure, its conserved binding motif and DDB1 sequence, Neo-1 correctly folds the substrate protein and completes the ligand fragment while flexibly accommodating the “inpainted” interactions. Parts generated by Neo-1 are shown with carbon atoms in green. Reference molecule co-crystallized with CDK12 shown in gray (6TD3).

Case Study 2: End-to-End Antibody Discovery

Fig. 23: Neo-1 unlocks end-to-end rational antibody discovery by incorporating knowledge via conditioning. Generated structure is shown in green, folded or input structure in blues. 1) Initial design by folding V_H against structure of SARS-CoV-2 RBD and generating portion of CDRH3 sequence. 2) Generation is steered toward target epitope via distance restraint conditioning (pink). 3) Sequence conditioning of paratope residues (arrows) steers sequence choice. 4) Loop generation can be conditioned on structures of both antibody framework regions and antigen.

This case study highlights how Neo-1 can be used in rational antibody design against a known antigen, an incredibly important task both in proximity modulation drug discovery and more broadly. As illustrated by several emerging proximity modality types, such as DACs, LYTACs, and others now approaching clinical studies, antibodies can create unique possibilities in the proximity modulation landscape.

We leverage the recently crystallized SARS-CoV-2 RBD (PDB ID: 7MSQ) as an example. For conceptual simplicity, we focus on the V_H antibody fragment (i.e., nanobody design) and ignore V_L, which could be designed in a similar fashion. Fig. 23 shows the end-to-end process, which displays a key innovation of Neo-1: it can be directly conditioned with multiple types of information to ensure that experimentally obtained knowledge is used efficiently during the entire design process.

In the initial step, Neo-1 is prompted with a partial antibody sequence and the structure of the antigen, then simultaneously co-folds the V_H against the antigen structure and generates a portion of the CDRH3 sequence (Fig. 23.1). Despite this structure being disclosed after Neo-1's training cutoff, Neo-1 designs a V_H that is predicted to interact via its CDRH3 with a real epitope. Neo-1 also samples diverse putative epitopes, which is desirable when epitopes are not known a priori (Fig. 24).

Fig. 24: Restrained generation steers V_H designs toward target epitopes. Epitope heatmaps (pink) indicate antigen sites frequently interacting with the designed portion of the CDRH3 in unrestrained (left) and restrained (right) samples. Color intensity reflects normalized occurrence.
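One plausible way to compute a normalized occurrence like the one shown in Fig. 24 is to count, for each antigen residue, how many samples place it in contact with the designed CDRH3 portion, then divide by the highest count. The sketch below assumes exactly that normalization (the original analysis may differ) and uses invented residue identifiers.

```python
from collections import Counter


def epitope_occurrence(samples):
    """Map each antigen residue to its contact count divided by the maximum count.

    `samples` is a list of sets; each set holds the antigen residue ids
    contacted by the designed CDRH3 portion in one generated sample.
    """
    counts = Counter(res for contacted in samples for res in contacted)
    if not counts:
        return {}
    peak = max(counts.values())
    return {res: n / peak for res, n in counts.items()}


# Hypothetical contact sets from four generated samples.
samples = [
    {"A:450", "A:451"},
    {"A:450"},
    {"A:450", "A:490"},
    {"A:451"},
]
heat = epitope_occurrence(samples)
print(heat)
```

Comparing such maps between unrestrained and restrained runs gives a simple, quantitative view of how strongly conditioning concentrates designs on the intended epitope.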

As the antigen becomes experimentally characterized by epitope mapping with Neo-1-designed or naturally occurring antibodies, scientists will increasingly design targeted antibodies that bind to a preferred epitope. This information can be used to generate refined designs that capture key CDRH3 characteristics using Neo-1. Prompting Neo-1 with interval distance restraints between CDRH3 and epitope residues results in increased exploration of the target epitope (Fig. 24) and generation of more favorable beta-strand interactions (Fig. 23.2).
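An interval distance restraint can be thought of as a pair of residues plus a lower and upper bound on their separation. The sketch below checks how many such restraints a generated structure satisfies. The tuple layout, residue identifiers, bounds, and coordinates are all assumptions made for illustration and do not reflect Neo-1's actual prompt format.

```python
from math import dist


def restraint_satisfied(coords, restraint):
    """Check one restraint of the form (residue_a, residue_b, lower, upper), in angstroms."""
    res_a, res_b, lower, upper = restraint
    return lower <= dist(coords[res_a], coords[res_b]) <= upper


def satisfied_fraction(coords, restraints):
    """Fraction of restraints whose distance falls inside the allowed interval."""
    if not restraints:
        return 0.0
    hits = sum(restraint_satisfied(coords, r) for r in restraints)
    return hits / len(restraints)


# Hypothetical representative-atom coordinates (angstroms) for two CDRH3
# residues (chain H) and two epitope residues (chain A).
coords = {
    "H:100": (0.0, 0.0, 0.0),
    "H:101": (3.8, 0.0, 0.0),
    "A:450": (0.0, 6.0, 0.0),
    "A:451": (3.8, 15.0, 0.0),
}
restraints = [
    ("H:100", "A:450", 4.0, 8.0),  # distance 6.0: inside the interval
    ("H:101", "A:451", 4.0, 8.0),  # distance 15.0: outside the interval
]
fraction = satisfied_fraction(coords, restraints)
print(fraction)
```

Using an interval rather than a single target distance leaves the model room to find a favorable geometry while still keeping the loop near the chosen epitope.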

Fig. 25: Neo-1 conditioned with antigen structure, epitope distance restraints, and sequence motifs generates biologically reasonable all-atom features. Top left: hydrogen-bonding network stabilizing CDRH3 loop. Top right: burial of user-specified tryptophan in hydrophobic pocket. Bottom left: polar and charged interactions with antigen residues. Bottom right: backbone hydrogen bonds in antiparallel β-strand. Designed residues in green, folded residues in dark blue.

As designed antibodies are tested and improved via design iteration, affinity maturation, or other strategies, paratope sequence patterns can emerge that illuminate conserved interactions that should be maintained. Neo-1 can be conditioned with sequence information to maintain key motifs while exploring protein sequence space. Fig. 23.3 depicts a scenario in which Neo-1 is conditioned with antigen structure, distance restraints and two key sequence motifs as they might have been found through affinity maturation. The resulting generated structure exhibits favorable interactions both encouraged by user conditioning and generated unconditionally (Fig. 25). Distance restraint conditioning encourages the formation of the beta strand interaction with the desired epitope, while sequence conditioning encourages the selection of user-specified hydrophobic side chains which are buried in an antigen pocket. Favorable interactions also arise that were not explicitly conditioned, e.g., the formation of inter-chain polar or charged side chain interactions and emergence of a hydrogen bonding network between designed and folded V_H residues that may help pre-arrange the CDRH3 loop.

After multiple iterations of affinity optimization, structures of antibody:antigen complexes are often solved to further rationalize and design interactions. Fig. 23.4 shows how Neo-1 integrates this additional information and produces CDRH3 designs conditioned on such experimentally obtained binder structures in conjunction with precise distance and epitope information. With this, Neo-1 manages to rediscover strand interactions highly similar to those of the crystallized, optimized binder (Fig. 26).

Fig. 26: Neo-1 conditioned with the structures of antigen, V_H framework, and distance restraints recovers strand interactions with high fidelity. CDRH3 loop inpainting designs form an antiparallel beta strand interaction with the antigen in a manner highly similar to the reference structure (7MSQ), recovering four of five hydrogen bonds.
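A recovery figure such as "four of five hydrogen bonds" can be expressed as the fraction of reference donor-acceptor pairs that also appear in the design. The sketch below shows this bookkeeping; the atom identifiers are invented placeholders, not the actual bonds in 7MSQ.

```python
def hbond_recovery(reference_hbonds, design_hbonds):
    """Fraction of reference (donor, acceptor) pairs that also occur in the design."""
    reference = set(reference_hbonds)
    if not reference:
        raise ValueError("reference must contain at least one hydrogen bond")
    return len(reference & set(design_hbonds)) / len(reference)


# Hypothetical backbone hydrogen bonds between antigen (A) and V_H (H) residues.
reference = {
    ("A:1:N", "H:5:O"),
    ("H:5:N", "A:1:O"),
    ("A:3:N", "H:3:O"),
    ("H:3:N", "A:3:O"),
    ("A:5:N", "H:1:O"),
}
design = {
    ("A:1:N", "H:5:O"),
    ("H:5:N", "A:1:O"),
    ("A:3:N", "H:3:O"),
    ("H:3:N", "A:3:O"),
    ("H:7:N", "A:9:O"),  # extra bond in the design, absent from the reference
}
recovery = hbond_recovery(reference, design)
print(recovery)
```

Bonds present only in the design do not count against recovery here; a stricter metric could also report precision over the designed bonds.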

Limitations and the Way Forward

While the benchmarks and case studies presented above underscore the capabilities of Neo-1 as a powerful, unified model—demonstrating its ability to predict structures, rediscover critical molecular interactions, and serve as a highly controllable co-pilot in drug discovery—these retrospective evaluations naturally come with inherent limitations. We have already deployed Neo-1 in both internal and collaborative research programs to great effect and look forward to sharing prospective and experimentally validated results soon.

Decoding and designing molecular structures is a breakthrough, but it's only the beginning. NeoLink takes us further by illuminating how proteins and molecules dynamically interact within the living cell, revealing the real-time choreography that underpins both therapeutic success and potential side effects. We look forward to scaling our technology with aims to bridge the gap between static structure and dynamic function, offering critical insights into the emergent clinical outcomes shaped by these complex molecular networks. For VantAI's internal programs, we already benefit from these unique capabilities, and are able to design and prioritize molecules in ways not previously possible.

Additionally, while Neo-1 and the underlying NeoLink data platform represent a significant milestone with substantial future potential, they remain one step among many to come. We are grateful to the academic community and the many companies that have released their science and developments, and we look forward to a large release of novel data from VantAI to supplement our PINDER and PLINDER resources. As the amount of additional information that can be extracted from the RCSB PDB and public data sources is waning, our structural proteomics data platform and breakthrough model advances speak to a new frontier, where we and others will innovate what is possible in our quest to map the true complexity of biology.

If this kind of frontier research as part of a small and extremely talented team with a compute, data and model advantage excites you, and you have a track record of excellence—we are hiring. Reach out to us at [email protected].

The Team Behind Neo-1

Clemens Isert*, Michael Pun*, Emanuele Rossi*, Thomas Castiglione, Doug Tischer, Mehmet Akdel, Daniel Kovtun, Marco Pegoraro, Thomas Duignan, Alex Zhang, Vladas Oleinikovas, Graham Holt, Yusuf Adeshina, Patrick Kunzmann, Arjun Ramesh, Douglas Wu, Alex Goncearenco, Lidor Foguel, Dana Felker, Davide Sabbadin, Vivian Lam, Matthias Grass, Zach Carpenter, Michael Bronstein, Luca Naef

*Equal contribution