InstaDeep has introduced Nucleotide Transformer v3 (NTv3), a multi-species genomics foundation model for genomic prediction and sequence design. The model processes DNA sequences in contexts of up to 1 megabase (Mb) at single-nucleotide resolution, so that it can relate local sequence motifs to long-range regulatory mechanisms across many organisms.
Pushing the Boundaries of Genomic AI
NTv3 combines three capabilities in a single architecture: representation learning, prediction of functional tracks and genome annotations, and controllable sequence generation. Earlier Nucleotide Transformer models, ranging from 50 million to 2.5 billion parameters, showed that self-supervised pretraining on thousands of human and other species' genomes yields robust features for molecular phenotype prediction. NTv3 builds on this foundation by extending the context length and adding explicit functional supervision alongside a generative mode.
Innovative Architecture for Deep Genomic Understanding
NTv3 uses a U-Net style architecture designed for very long genomic windows. A convolutional downsampling tower compresses the input sequence, a transformer stack models long-range dependencies over the compressed representation, and a deconvolution tower restores base-level resolution for prediction and generation. Input sequences are tokenized at the character level: one token per base (A, T, C, G, N), plus special tokens. The publicly available models use this single-base tokenization with an 11-token vocabulary.
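As a rough illustration of the tokenization and the resolution changes described above, the following Python sketch maps a DNA string to one token ID per base and computes how the sequence length shrinks through a downsampling tower and is restored by a mirrored upsampling tower. The special-token names, the ID assignments, the per-stage factor of 2, and the choice of seven stages are all assumptions made for illustration; the reporting states only that the vocabulary has 11 tokens including A, T, C, G, and N.

```python
# Hypothetical 11-token vocabulary: the 5 base tokens named in the article plus
# 6 special tokens whose names and IDs are assumed for illustration only.
SPECIAL_TOKENS = ["<pad>", "<mask>", "<unk>", "<cls>", "<bos>", "<eos>"]
BASE_TOKENS = ["A", "T", "C", "G", "N"]
VOCAB = {tok: i for i, tok in enumerate(SPECIAL_TOKENS + BASE_TOKENS)}


def tokenize(sequence: str) -> list[int]:
    """Map each character to one token ID; unknown characters become <unk>."""
    unk = VOCAB["<unk>"]
    return [VOCAB.get(base, unk) for base in sequence.upper()]


def unet_lengths(input_length: int, num_stages: int, factor: int = 2) -> list[int]:
    """Sequence length after each down stage, then after each mirrored up stage."""
    lengths = [input_length]
    for _ in range(num_stages):  # convolutional downsampling tower
        if lengths[-1] % factor != 0:
            raise ValueError("length must be divisible by the factor at every stage")
        lengths.append(lengths[-1] // factor)
    for _ in range(num_stages):  # deconvolution tower back to base resolution
        lengths.append(lengths[-1] * factor)
    return lengths


if __name__ == "__main__":
    print(tokenize("ACGTN"))
    # Assumed example: a 2**20-base window (about 1 Mb) and 7 stages.
    lengths = unet_lengths(2**20, num_stages=7)
    print("transformer sees", lengths[7], "positions; output has", lengths[-1])
```

Under these assumed settings the transformer would attend over 8,192 positions rather than roughly one million, which is the practical reason for placing attention between a downsampling and an upsampling tower.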
NTv3 models come in several sizes, from the compact NTv3 8M pre, with approximately 7.69 million parameters, to NTv3 650M, with 650 million parameters. The larger model adds conditioning layers that enable species-specific predictions.
Vast Training Data and Superior Performance
NTv3 was first pretrained on nine trillion base pairs from the OpenGenome2 resource using base-resolution masked language modeling. It was then post-trained with a joint objective that combines continued self-supervision with supervised learning on approximately 16,000 functional tracks and annotation labels from 24 animal and plant species.
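To make the pretraining objective concrete, here is a minimal sketch of how inputs and targets for base-resolution masked language modeling can be prepared: a fraction of bases is hidden, and the model is trained to recover only those positions. The 15% mask rate, the `<mask>` token name, and the function name `mask_bases` are assumptions for illustration, not details from the NTv3 reporting.

```python
import random

MASK = "<mask>"  # hypothetical mask token name
IGNORE = None    # label for positions that do not contribute to the loss


def mask_bases(sequence, mask_rate=0.15, seed=0):
    """Return (masked_tokens, labels) for base-resolution masked language modeling.

    mask_rate is an assumed value; the article does not state the rate used.
    """
    tokens = list(sequence)
    if not tokens:
        return [], []
    rng = random.Random(seed)
    num_to_mask = min(len(tokens), max(1, round(mask_rate * len(tokens))))
    masked_positions = set(rng.sample(range(len(tokens)), num_to_mask))
    # Inputs: hide the chosen bases. Labels: keep the true base only where hidden.
    masked_tokens = [MASK if i in masked_positions else t for i, t in enumerate(tokens)]
    labels = [t if i in masked_positions else IGNORE for i, t in enumerate(tokens)]
    return masked_tokens, labels


if __name__ == "__main__":
    inputs, targets = mask_bases("ACGTACGTACGTACGTACGT")
    print(inputs)
    print(targets)
```

The loss would then be computed only at positions whose label is not `IGNORE`, so the model learns to predict each hidden base from its surrounding context.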
After this training, NTv3 is reported to achieve state-of-the-art accuracy in functional track prediction and genome annotation across species, outperforming existing sequence-to-function models and prior genomic foundation models on established public benchmarks. It also leads on the newly introduced NTv3 Benchmark, a standardized suite of 106 long-range, single-nucleotide, cross-assay, and cross-species tasks that defines a controlled downstream fine-tuning setup with 32 kb input windows and base-resolution outputs. Exposure to thousands of tracks from many species during post-training is said to let the model learn a shared regulatory grammar that transfers between organisms and assays and supports coherent long-range genome-to-function inference.
From Prediction to Controllable Genomic Design
Beyond prediction, NTv3 can be adapted into a controllable generative model through masked diffusion language modeling. In this mode, the model receives conditioning signals that encode the desired enhancer activity level and promoter selectivity, and fills in masked spans of a DNA sequence so that the result is consistent with those conditions. In experiments with the Stark Lab, the team designed 1,000 enhancer sequences with specified activity and promoter specificity. In vitro validation with STARR-seq assays confirmed that the generated enhancers recovered the intended ordering of activity levels and achieved more than twofold better promoter specificity than baseline methods. This capability is relevant to synthetic biology and therapeutic design.
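The conditional infilling described above can be pictured with a toy iterative-unmasking loop of the kind commonly used to sample from masked diffusion language models. This is only a sketch: NTv3's actual sampler is not described in the reporting, `toy_denoiser` is a random stand-in for a trained network, and the `target_activity` key and its effect on GC content are hypothetical and carry no biological meaning.

```python
import random

MASK = "_"  # placeholder symbol for a masked base (hypothetical)


def toy_denoiser(tokens, condition, rng):
    """Stand-in for a trained model: propose (base, confidence) per masked position.

    A real model would compute these from sequence context and the condition.
    Here the condition only biases GC content, to show where it plugs in.
    """
    gc_bias = condition["target_activity"]  # hypothetical signal in [0, 1]
    proposals = {}
    for i, tok in enumerate(tokens):
        if tok == MASK:
            alphabet = "GC" if rng.random() < gc_bias else "AT"
            proposals[i] = (rng.choice(alphabet), rng.random())
    return proposals


def infill(sequence, condition, num_steps=4, seed=0):
    """Fill masked positions over several steps, most confident proposals first."""
    rng = random.Random(seed)
    tokens = list(sequence)
    for step in range(num_steps):
        proposals = toy_denoiser(tokens, condition, rng)
        if not proposals:
            break
        steps_left = num_steps - step
        # Reveal an equal share of the remaining masks each step (ceiling division).
        num_to_reveal = -(-len(proposals) // steps_left)
        ranked = sorted(proposals.items(), key=lambda kv: kv[1][1], reverse=True)
        for position, (base, _confidence) in ranked[:num_to_reveal]:
            tokens[position] = base
    return "".join(tokens)


if __name__ == "__main__":
    print(infill("ACGT____TTGA____", {"target_activity": 0.8}))
```

Unmasked bases are never modified, so the surrounding sequence acts as fixed context while the masked spans are completed under the given condition.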
This article is a rewritten summary based on publicly available reporting. For the original story, visit the source.
Source: MarkTechPost