An Open MetaGenomic Corpus for Mixed-Modality Genomic Language Modeling
Offered By: Valence Labs via YouTube
Course Description
Overview
This talk introduces the Open MetaGenomic (OMG) corpus, a genomic pretraining dataset, and its use in mixed-modality genomic language modeling. The corpus totals 3.1T base pairs and 3.3B protein coding sequences, derived from JGI's IMG and EMBL's MGnify repositories. The talk covers how the data was quality-filtered, what the corpus contains, and how it is represented: protein coding sequences as translated amino acids and intergenic sequences as nucleic acids. It then presents gLM2, the first mixed-modality genomic language model, which uses genomic context to learn robust functional representations and coevolutionary signals in protein-protein interfaces. Further topics include deduplication in embedding space, used to balance the corpus and improve downstream task performance, as well as unsupervised protein-protein interaction analysis and future directions. Links to the OMG dataset and the gLM2 model on the Hugging Face Hub are provided, and the talk ends with a Q&A session.
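To make the mixed-modality idea concrete, the following is a minimal Python sketch that splits a contig into amino-acid elements (for protein coding sequences) and nucleotide elements (for intergenic regions). It is an illustration under stated assumptions, not the pipeline used to build OMG or the input format of gLM2: the function names, the `(modality, strand, sequence)` tuple layout, the lowercase convention for nucleotides, and the 0-based half-open coordinates are all hypothetical choices made for this example. It uses the standard genetic code and assumes the coding regions are non-overlapping.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(_BASES, repeat=3), _AMINO)
}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(_COMPLEMENT)[::-1]


def translate(dna):
    """Translate DNA to protein; unknown codons become 'X', trailing stop is dropped."""
    protein = "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )
    return protein.rstrip("*")


def mixed_modality_elements(contig, cds_regions):
    """Split a contig into ('aa' | 'nt', strand, sequence) elements.

    cds_regions: (start, end, strand) tuples, 0-based half-open,
    assumed non-overlapping (a hypothetical annotation format).
    """
    contig = contig.upper()
    elements = []
    cursor = 0
    for start, end, strand in sorted(cds_regions):
        if start > cursor:
            # Intergenic region: kept as nucleotides (lowercased here).
            elements.append(("nt", "+", contig[cursor:start].lower()))
        coding = contig[start:end]
        if strand == "-":
            coding = reverse_complement(coding)
        # Coding region: represented by its translated amino acids.
        elements.append(("aa", strand, translate(coding)))
        cursor = end
    if cursor < len(contig):
        elements.append(("nt", "+", contig[cursor:].lower()))
    return elements


if __name__ == "__main__":
    # Toy contig: one forward-strand gene and one reverse-strand gene.
    toy_contig = "TTGACA" + "ATGAAATAA" + "GGC" + "TTATTTCAT" + "AA"
    for element in mixed_modality_elements(toy_contig, [(6, 15, "+"), (18, 27, "-")]):
        print(element)
```

Carrying the strand on each coding element preserves gene orientation, which is part of the genomic context a model of this kind can draw on.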
Syllabus
- Introduction
- Background
- OMG dataset
- gLM2
- Unsupervised protein-protein interaction
- Next steps
- Q&A
Taught by
Valence Labs
Related Courses
- Synapses, Neurons and Brains: Hebrew University of Jerusalem via Coursera
- Моделирование биологических молекул на GPU (Biomolecular modeling on GPU): Moscow Institute of Physics and Technology via Coursera
- Bioinformatics Algorithms (Part 2): University of California, San Diego via Coursera
- Biology Meets Programming: Bioinformatics for Beginners: University of California, San Diego via Coursera
- Neuronal Dynamics: École Polytechnique Fédérale de Lausanne via edX