Overview and Implementation of the GBS Pipeline Qi Sun Computational Biology Service Unit Cornell University Overview of the Data Analysis Strategy Genotyping by Sequencing (GBS) < 450 bp ApeKI site (GCWGC) ( ) 64‐base sequence tag B73 • Reduced genome representation; • Reads can be aligned without reference genome; Identification of markers with/without the reference genome SNP and small INDELs B73 Mo17 Loss of cut site Identification of Presence/Absence Variations (PAV) B73 Mo17 Reads ‐> Tags ‐> Call SNPs from Tags CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCATGTAGACGGGC CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCATGTAGACGGGC …………. Tag 1 Tag 2 Tag 3 Reads ‐> Tags ‐> Call SNPs from Tags CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCATGTAGACGGGC …………. Tag 1 Tag 2 Tag 3 Reads ‐> Tags ‐> Call SNPs from Tags Tag 2 Tag 3 Tassel Stacks Tag catalog is created from pooled all individuals. Tag catalog is created from each individual separately. CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCATGTAGACGGGC …………. Tag 1 Tag 2 Tag 3 Reads ‐> Tags ‐> Call SNPs from Tags Maize N M population (5000 lines) 2.6 billion reads 6 million tags CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC ………………………………… Two ways of alignments: a. Anchored to reference genome (regular pipeline) b. Pair-wise alignment between tags (UNEAK) Reads ‐> Tags ‐> Call SNPs from Tags ApeK I site CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTATAGACGGGC ………………………………… Reads ‐> Tags ‐> Call SNPs from Tags tag1 tag2 tag3 Many tags with low read depth have sequencing errors. Reads from a population p Unique tag list from ooled reads Summary of the GBS pipeline g g Call SNPs from ali ned ta s p Unique tag list from ooled reads Summary of the GBS pipeline g g Call SNPs from ali ned ta s Genotypes Match reads from each individual to the tags (tags‐to‐taxa matrix) Reads from a population Basic Filtering Strategy 02000000 4000000 6000000 8000000 10000000 12000000 1 4 7 10131619222528313437404346495255 #Depth 0 500 1000 1500 2000 2500 3000 3500 4000 1 8 1 5 2 2 2 9 3 6 4 3 5 0 5 7 6 4 7 1 7 8 8 5 9 2 9 9 Depth of Coverage Whole Genome Shotgun GBS Depths Sites Depths Tags 02000000 4000000 6000000 8000000 10000000 12000000 1 4 7 10131619222528313437404346495255 0 500 1000 1500 2000 2500 3000 3500 4000 1 8 1 5 2 2 2 9 3 6 4 3 5 0 5 7 6 4 7 1 7 8 8 5 9 2 9 9 Depths Sites Depths Tags The GBS pipeline filtered out tags with very low read depth across the population WGS GBS • Reads with sequencing errors; • Extremely rare alleles; • Reads from highly repetitive regions; 02000000 4000000 6000000 8000000 10000000 12000000 1 4 7 10131619222528313437404346495255 0 500 1000 1500 2000 2500 3000 3500 4000 1 8 1 5 2 2 2 9 3 6 4 3 5 0 5 7 6 4 7 1 7 8 8 5 9 2 9 9 Depths Sites Depths Tags WGS GBS The GBS pipeline filtered out tags with very low read depth across the population • Reads with sequencing errors; • Extremely rare alleles; Generally we do not remove tags with very high depth. As they could come from a mix of (1) good loci with efficient PCR amplification; (2) bad loci from repetitive regions. • Reads from highly repetitive regions; 0500 1000 1500 2000 2500 3000 3500 4000 1 6 1 1 1 6 2 1 2 6 3 1 3 6 4 1 4 6 5 1 5 6 6 1 6 6 7 1 7 6 8 1 8 6 9 1 9 6 Uneven coverage enrich sequencing depth in some sites, but cause missing data in others Strategies 1. Filter out sites with too much missing data; 2. Filter out genotypes with low depth to minimize heterozygous genotyping errors. 3. Imputation based on haplotypes; Filter genotyping errors Reference genome Paralogous regions Individual B Individual A Most of the genotyping errors are caused by reads from paralogous regions Pipeline Implementation Three ways to access the software 1. Computers with GBS software pre-installed • Cornell BioHPC Lab • iPlant Discovery Environment 2. Using pre-compiled Java code from . http://www.maizegenetics.net 3. Get the source code from sourceforge .net http://sourceforge.net (Project name: TASSEL) GBS Pipeline on Cornell BioHPC Lab (for both Cornell and external users only) Step 1: Reserve a machine http://cbsu.tc.cornell.edu GBS Pipeline on Cornell BioHPC Lab Step 2: Upload files Fetch (mac), FileZilla (win) or WinSCP (win) GBS Pipeline on Cornell BioHPC Lab Step 3: Type the command to run pipeline tassel/run_pipeline.pl -fork1 -QseqToTagCountPlugin -i . -k rice.key -e apeki -endPlugin -runfork1 Mac: terminal window; PC: Putty Two ways to upload files to iPlant data store 1. Web interface 2. Command line tool: icommand Using iPlant GBS on iPlant Discovery Environment http://www.iplantcollaborative.org/ (Beta version now) Download the zip file: TASSEL_x.x _Standalone Set up the pre-compiled pipeline on your own computer A computer with at least 8GB or more RAM (Linux or Mac) Download TASSEL Standalone from maizegenetics.net Set up Java (64bit) BWA (for alignment to reference genome) http://www.maizegenetics.net/tassel/docs/TasselPipelineCLI.pdf Document for installation: Set up TASSEL source code in Netbeans http://www.maizegenetics.net/tassel-in-netbeans (make user use 64-bit Java and Netbeans) The intermediate files are compressed binary files • Tag‐Counts (TC): *.cnt.txt *.cnt • Tag‐by‐taxa (TBT): *.tbt.txt *.tbt.bin • Tags‐on‐physical‐map (TOPM): *.topm.txt *.topm.bin • Hapmap *.hmp.txt GDPDM blobs * 64 bp tags were represented as 2 long integers (8 bytes for long in Java). BinaryToTextPlugin can be used to convert to text file