Overview and Implementation of the GBS Pipeline

Qi Sun
Computational Biology Service Unit
Cornell University

Overview of the Data Analysis Strategy

Genotyping by Sequencing (GBS)
[Figure labels: B73; ApeKI site (GCWGC); < 450 bp; 64-base sequence tag]
• Reduced genome representation
• Reads can be aligned without a reference genome

Identification of markers with/without the reference genome
[Figure: B73 vs. Mo17; labels: SNP and small INDELs; Loss of cut site]

Identification of Presence/Absence Variations (PAV)
[Figure: B73 vs. Mo17]

Reads -> Tags -> Aligned Tags -> SNPs/INDELs
CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCATGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
…

Reads -> Tags -> Aligned Tags -> SNPs/INDELs
(the same reads, sorted so that identical sequences are adjacent)
CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCATGTAGACGGGC
…
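
As a rough illustration of the Reads -> Tags step shown above, the sketch below collapses identical reads into tags and counts how many reads support each tag. It is not the TASSEL implementation: the function name reads_to_tags and the variable names are hypothetical, chosen only for this example. The input is the eleven example reads from the slide.

```python
from collections import Counter

TAG_LENGTH = 64  # tag length stated on the slides


def reads_to_tags(reads):
    """Collapse identical reads into tags; return (sequence, count) pairs.

    Hypothetical helper for illustration only, not part of TASSEL.
    """
    counts = Counter(read[:TAG_LENGTH] for read in reads)
    # most_common() orders tags from the highest read count to the lowest
    return counts.most_common()


# The three distinct sequences that appear on the slide
SEQ_A = "CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC"
SEQ_B = "CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC"
SEQ_C = "CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCATGTAGACGGGC"

# The eleven reads in the order they appear on the unsorted slide
reads = [SEQ_A, SEQ_A, SEQ_C, SEQ_B, SEQ_A, SEQ_B,
         SEQ_A, SEQ_B, SEQ_A, SEQ_B, SEQ_A]

tags = reads_to_tags(reads)
for number, (sequence, count) in enumerate(tags, start=1):
    print(f"Tag {number}: {count} reads  {sequence}")
```

With these eleven reads the result is three tags supported by 6, 4, and 1 reads. Keeping the read count next to each tag matters later, because a tag seen only once is more likely to be a sequencing error than a real variant.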

Reads -> Tags -> Aligned Tags -> SNPs/INDELs
Tag 1:
CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
Tag 2:
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
Tag 3:
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCATGTAGACGGGC
…
Maize NAM population (5000 lines): 2.6 billion reads, 6 million tags

Reads -> Tags -> Aligned Tags -> SNPs/INDELs
CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
…
Two ways of alignment:
a. Anchored to the reference genome (regular pipeline)
b. Pairwise alignment between tags (UNEAK)

Reads -> Tags -> Aligned Tags -> SNPs/INDELs
[Figure label: ApeKI site]
CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCATGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCATGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
…
Reproducible sequencing errors

Summary of the GBS pipeline
[Flow diagram: Reads, Tags, Aligned Tags, Tags by Taxa, HapMap; additional labels: SNP/INDEL, Filtering]
All reads in the sequenced populations are merged to create the tag list.

Experimental Design
1. Depth of coverage
2. Choice of enzyme
3. Type of population

Depth of Coverage
[Figure: two depth histograms. Whole Genome Shotgun: axes Depths and Sites. GBS: axes Depths and Tags.]

Depth of coverage is controlled by multiplexing level and choice of enzyme
• Multiplexing level: 48-plex, 96-plex, 384-plex
• Enzyme selection
    ApeKI   GCWGC    Expected size: 0.5 kb
    PstI    CTGCAG   Expected size: 4 kb

Population type determines the method of filtering and imputation
• RIL population (used in the exercise for this workshop)
• F1 hybrids of highly heterozygous parents
• Families with pedigree information

Pipeline Implementation

Three ways to access the software
1. Computers with GBS software pre-installed
   • Cornell BioHPC Lab
   • iPlant Discovery Environment
2. Pre-compiled Java code from http://www.maizegenetics.net
3. Source code from SourceForge, http://sourceforge.net (project name: TASSEL)

GBS Pipeline on Cornell BioHPC Lab (for both Cornell and external users)
Step 1: Reserve a machine
    http://cbsu.tc.cornell.edu
Step 2: Upload files
    Fetch (Mac), FileZilla (Windows) or WinSCP (Windows)
Step 3: Type the command to run the pipeline (Mac: terminal window; PC: PuTTY)
    tassel/run_pipeline.pl -fork1 -QseqToTagCountPlugin -i . -k rice.key -e apeki -endPlugin -runfork1

Using iPlant
Two ways to upload files to the iPlant data store
1. Web interface
2. Command line tool: iCommands

GBS on iPlant Discovery Environment (beta version now)
http://www.iplantcollaborative.org/

Set up the pre-compiled pipeline on your own computer
• A computer with at least 8 GB of RAM (Linux or Mac)
• Download TASSEL Standalone from maizegenetics.net (zip file: TASSEL_x.x_Standalone)
• Set up Java (64-bit)
• BWA (for alignment to the reference genome)
Document for installation: http://www.maizegenetics.net/tassel/docs/TasselPipelineCLI.pdf

Set up TASSEL source code in NetBeans
http://www.maizegenetics.net/tassel-in-netbeans
(make sure to use 64-bit Java and NetBeans)

The intermediate files are compressed binary files
• Tag-Counts (TC): *.cnt.txt (text), *.cnt (binary)
• Tag-by-taxa (TBT): *.tbt.txt (text), *.tbt.bin (binary)
• Tags-on-physical-map (TOPM): *.topm.txt (text), *.topm.bin (binary)
• Hapmap: *.hmp.txt, GDPDM blobs
* 64 bp tags are represented as 2 long integers (a long is 8 bytes in Java).
BinaryToTextPlugin can be used to convert the binary files to text files.

1. Documentation of the tools: http://www.maizegenetics.net/gbs-bioinformatics
2. Training project data is provided by Chih-Wei Tung & Susan McCouch.
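
The note above that a 64 bp tag fits in 2 long integers follows from simple arithmetic: 4 possible bases need 2 bits each, so 64 bases need 128 bits, which is two 64-bit values. The sketch below shows one way to do such packing. The particular bit codes (A=0, C=1, G=2, T=3), the base order within each integer, and the function names are assumptions made for illustration; the slides do not describe the encoding TASSEL actually uses. Python integers are unbounded, so the values here are kept in the unsigned 64-bit range, whereas a Java long is signed.

```python
# Assumed 2-bit code per base (illustrative, not taken from TASSEL)
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"
BASES_PER_LONG = 32  # 32 bases x 2 bits = 64 bits


def pack_tag(tag):
    """Pack a 64-base tag into two unsigned 64-bit integers."""
    if len(tag) != 2 * BASES_PER_LONG:
        raise ValueError("expected a 64-base tag")
    longs = []
    for start in (0, BASES_PER_LONG):
        value = 0
        for base in tag[start:start + BASES_PER_LONG]:
            # Shift earlier bases left and append the new base's 2 bits
            value = (value << 2) | BASE_TO_BITS[base]
        longs.append(value)
    return tuple(longs)


def unpack_tag(longs):
    """Recover the 64-base string from the two packed integers."""
    bases = []
    for value in longs:
        # Read 2 bits at a time, most significant pair first
        for shift in range(62, -2, -2):
            bases.append(BITS_TO_BASE[(value >> shift) & 3])
    return "".join(bases)


example_tag = "ACGT" * 16  # synthetic 64-base tag used only for this demo
packed = pack_tag(example_tag)
print(len(packed), all(0 <= v < 2**64 for v in packed))
print(unpack_tag(packed) == example_tag)
```

Stored this way, a tag takes 16 bytes instead of 64 bytes of text, and two tags can be compared with two integer comparisons. That saving is significant when a population yields millions of tags.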