Genes and Geography - A Bioinformatics Project
Offered By: OMGenomics via YouTube
Course Description
Overview
Walk through a complete bioinformatics project that explores the relationship between genes and geography by analyzing population genotype data. Learn to run Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE) on genetic data from the 1000 Genomes Project. Follow step-by-step instructions to download VCF files, parse them with pysam, convert the genotypes into NumPy arrays, and save and manipulate the results with pandas. Move between Python scripts and Google Colab as the work requires, and build visualizations with both Matplotlib and Altair. Gain insight into population genetics by coloring data points by ancestry label and merging in additional population information. Conclude with an exercise on performing PCA on the SNPs and the origin story behind the project.
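As a rough illustration of the "alleles to numbers" and PCA steps described above, the sketch below encodes each genotype as a count of alternate alleles and projects the samples onto two principal components. It is not the course's own code. The genotype tuples are hand-written, hypothetical stand-ins for values that would normally be read from a VCF, the helper names are made up for this example, missing calls are counted as 0 as a simplifying assumption, and PCA is done with a plain NumPy SVD instead of whichever PCA implementation the video uses.

```python
import numpy as np

# Hypothetical genotype calls: one row per variant, one (allele, allele) tuple per sample.
# 0 = reference allele, 1 = alternate allele, None = missing call.
genotypes = [
    [(0, 0), (0, 1), (1, 1), (0, 0)],
    [(0, 1), (1, 1), (1, 1), (0, 0)],
    [(0, 0), (0, 0), (0, 1), (None, None)],
]


def alt_allele_count(gt):
    """Count alternate alleles in one call; missing alleles count as 0 (an assumption)."""
    return sum(1 for allele in gt if allele is not None and allele > 0)


# Build a variants x samples table of counts, then transpose so that
# each row is a sample and each column is a variant.
matrix = np.array(
    [[alt_allele_count(gt) for gt in row] for row in genotypes],
    dtype=float,
).T


def pca(data, n_components=2):
    """Project rows of `data` onto the top principal components via SVD."""
    centered = data - data.mean(axis=0)  # center each variant column
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


coords = pca(matrix)
print(matrix)        # 4 samples x 3 variants
print(coords.shape)  # one (PC1, PC2) pair per sample
```

Each row of `coords` is one sample's position in the two-component space, which is the table a scatter plot would be drawn from.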
Syllabus
Intro
Hunting for data
Inspecting the VCF
Finding population labels for the samples
Parsing VCF with pysam
Going from alleles to numbers for a numpy array
When to work in Colab versus a Python script
Saving data with pandas
Adding population labels from the panel file
To Colab!
PCA
First plot! Mission accomplished
Using Altair for plotting with labels
Second plot with population labels!
Merging with the igsr_population.tsv data
TSNE
Exercise: PCA on the SNPs
Conclusion and origin story for this project
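The two label-related steps in the syllabus (adding population labels from the panel file, then merging with the population table) amount to two table joins. The sketch below shows the idea with pandas on tiny hand-made tables. It is an illustration under assumptions, not the course's code: the sample IDs and coordinates are invented, and the column names `sample`, `pop`, and `description` are placeholders that may not match the headers in the real panel file or in igsr_population.tsv.

```python
import pandas as pd

# Hypothetical PCA output: one row per sample (values invented for illustration).
pca_df = pd.DataFrame({
    "sample": ["S1", "S2", "S3"],
    "PC1": [-1.2, 0.3, 0.9],
    "PC2": [0.4, -0.8, 0.4],
})

# Hypothetical stand-in for the panel file: sample ID -> population code.
panel = pd.DataFrame({
    "sample": ["S1", "S2", "S3"],
    "pop": ["GBR", "YRI", "CHB"],
})

# Hypothetical stand-in for the population table: code -> description.
populations = pd.DataFrame({
    "pop": ["GBR", "YRI", "CHB"],
    "description": [
        "British in England and Scotland",
        "Yoruba in Ibadan, Nigeria",
        "Han Chinese in Beijing, China",
    ],
})

# Left joins keep every sample even if a label is missing for it.
labeled = (
    pca_df
    .merge(panel, on="sample", how="left")
    .merge(populations, on="pop", how="left")
)
print(labeled)
```

The result has one row per sample with its coordinates and labels side by side, which is the layout a plotting library needs in order to color points by a label column.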
Taught by
OMGenomics
Related Courses
Mathematics and Python for Data Analysis (Математика и Python для анализа данных) - Moscow Institute of Physics and Technology via Coursera
Introduction to Python for Data Science - Microsoft via edX
Python for Data Science - University of California, San Diego via edX
Get Data Off the Ground with Python - George Washington University via Independent
Programming for Business Computing in Python (3) (用 Python 做商管程式設計(三)) - National Taiwan University via Coursera