Overview and Implementation 
of the GBS Pipeline  
Qi Sun 
Computational Biology Service Unit 
Cornell University 
Overview of the Data Analysis Strategy 
Genotyping by Sequencing (GBS) 
[Figure: B73 sequence with ApeKI sites (GCWGC); fragments < 450 bp; 64-base sequence tags]
• Reduced genome representation
• Reads can be aligned without a reference genome
Identification of markers with/without the reference genome
SNP and small INDELs
[Figure: B73 vs. Mo17]
Loss of cut site
Identification of Presence/Absence Variations (PAV)
[Figure: B73 vs. Mo17]
Reads -> Tags -> Aligned Tags -> SNPs/INDELs
CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC 
 
CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC 
 
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCATGTAGACGGGC 
 
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC 
 
CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC 
 
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC 
 
CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC 
 
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC 
 
CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC 
 
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC 
 
CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC 
 
 
…………. 
 
 
Reads -> Tags -> Aligned Tags -> SNPs/INDELs
Tag 1
CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC

Tag 2
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC

Tag 3
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCATGTAGACGGGC

………….
Maize NAM population (5,000 lines)
2.6 billion reads
6 million tags
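The reads-to-tags step can be illustrated with a minimal sketch: identical reads are collapsed into one tag with a read count. This is only an illustration on the seven reads shown on the slide, not the pipeline's own code; the function name `collapse_reads` is hypothetical.

```python
from collections import Counter

# The three distinct sequences shown on the slide.
T1 = "CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC"
T2 = "CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC"
T3 = "CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCATGTAGACGGGC"

# Seven reads, as grouped on the slide (3 + 3 + 1).
READS = [T1] * 3 + [T2] * 3 + [T3]


def collapse_reads(reads):
    """Collapse identical reads into tags; return {tag sequence: read count}."""
    # most_common() sorts by count, keeping first-seen order for ties.
    return dict(Counter(reads).most_common())


tags = collapse_reads(READS)
for number, (sequence, count) in enumerate(tags.items(), start=1):
    print(f"Tag {number}: {count} read(s)  {sequence}")
```

On real data the same idea reduces billions of reads to a much smaller list of distinct tags.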
 
Reads -> Tags -> Aligned Tags -> SNPs/INDELs
CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC 
 
 
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC 
 
 
………………………………… 
 
 
Two ways to align tags:
a. Anchored to the reference genome (regular pipeline)
b. Pair-wise alignment between tags (UNEAK)
[Figure label: ApeKI site]
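The idea behind option (b) can be sketched as a pairwise comparison that reports tag pairs differing at exactly one position. This is a simplified illustration on the three tags from the slides, not the UNEAK implementation; the single-mismatch criterion and the function names are assumptions made for the example.

```python
from itertools import combinations

# Tags copied from the slides.
TAGS = {
    "Tag 1": "CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC",
    "Tag 2": "CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC",
    "Tag 3": "CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCATGTAGACGGGC",
}


def mismatches(a, b):
    """Return the positions at which two equal-length tags differ."""
    if len(a) != len(b):
        raise ValueError("tags must have equal length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


def single_mismatch_pairs(tags):
    """Return (name1, name2, position, base1, base2) for pairs with one mismatch."""
    pairs = []
    for (name1, seq1), (name2, seq2) in combinations(tags.items(), 2):
        diff = mismatches(seq1, seq2)
        if len(diff) == 1:
            pos = diff[0]
            pairs.append((name1, name2, pos, seq1[pos], seq2[pos]))
    return pairs


pairs = single_mismatch_pairs(TAGS)
for name1, name2, pos, base1, base2 in pairs:
    print(f"{name1} vs {name2}: position {pos}, {base1}/{base2}")
```

Tag 1 and Tag 3 differ at two positions, so they are not reported as a candidate pair under this criterion.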
 
Reads -> Tags -> Aligned Tags -> SNPs/INDELs
CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC 
 
CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC 
 
CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC 
 
CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC 
 
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC 
 
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCATGTAGACGGGC 
 
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCATGTAGACGGGC 
 
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC 
 
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC 
 
………………………………… 
 
 
Reproducible sequencing errors
Summary of the GBS pipeline
[Flow diagram stages: Reads, Tags, Aligned Tags, Tags by Taxa, SNP/INDEL, Filtering, HapMap]
All reads in the sequenced populations are merged to create the tag list.
Experimental Design 
1. Depth of coverage 
 
2. Choice of enzyme 
 
3. Type of population 
 
Depth of Coverage
[Figure: two histograms. Whole Genome Shotgun: number of sites at each depth. GBS: number of tags at each depth.]
Depth of coverage controlled by multiplexing level and choice of enzyme
• Multiplexing level
  48-plex, 96-plex, 384-plex

• Enzyme selection
  ApeKI  GCWGC   Expected size: 0.5 kb
  PstI   CTGCAG  Expected size: 4 kb
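The expected sizes above can be reproduced with a back-of-the-envelope calculation: under the simplifying assumption of random sequence with equal base frequencies, a recognition site occurs on average once every 1/p bases, where p is the probability that a given position matches the site. The sketch below is such an estimate, not a statement about any real genome; the function name is hypothetical.

```python
from fractions import Fraction

# Number of bases matched by each IUPAC code used here (W = A or T).
MATCHES = {"A": 1, "C": 1, "G": 1, "T": 1, "W": 2}


def expected_site_spacing(site):
    """Expected distance in bp between occurrences of a recognition site,
    assuming independent positions and equal base frequencies."""
    probability = Fraction(1)
    for code in site:
        probability *= Fraction(MATCHES[code], 4)
    return 1 / probability


for enzyme, site in [("ApeKI", "GCWGC"), ("PstI", "CTGCAG")]:
    print(f"{enzyme} ({site}): about {int(expected_site_spacing(site))} bp")
```

The 5-base degenerate ApeKI site gives 512 bp (about 0.5 kb) and a 6-base site such as PstI's gives 4096 bp (about 4 kb). Real genomes deviate from this because base composition is not uniform.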
Population type determines method of filtering and imputation
• RIL population (used in the exercise for this workshop)

• F1 hybrids of highly heterozygous parents

• Families with pedigree information
Pipeline Implementation 
Three ways to access the software 
1. Computers with GBS software pre-installed 
• Cornell BioHPC Lab 
• iPlant Discovery Environment 
 
2. Use the pre-compiled Java code from
 http://www.maizegenetics.net

3. Get the source code from sourceforge.net
http://sourceforge.net
(Project name: TASSEL)
GBS Pipeline on Cornell BioHPC Lab
 (for both Cornell and external users)
Step 1: Reserve a machine 
http://cbsu.tc.cornell.edu  
Step 2: Upload files 
Fetch (Mac), FileZilla (Windows), or WinSCP (Windows)
Step 3: Type the command to run the pipeline
tassel/run_pipeline.pl -fork1 -QseqToTagCountPlugin -i . -k rice.key -e apeki -endPlugin -runfork1
Mac: Terminal window; PC: PuTTY
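If the command is launched from a script rather than typed by hand, it can be assembled as an argument list. The sketch below only rebuilds the command shown above from its parts; the function name and parameter names are hypothetical, and only flags that appear in the slide are used.

```python
import shlex


def qseq_to_tag_count_command(input_dir, key_file, enzyme,
                              launcher="tassel/run_pipeline.pl"):
    """Build the argument list for the QseqToTagCountPlugin step shown above."""
    return [
        launcher,
        "-fork1",
        "-QseqToTagCountPlugin",
        "-i", input_dir,   # directory with the sequence files
        "-k", key_file,    # key file
        "-e", enzyme,      # enzyme name
        "-endPlugin",
        "-runfork1",
    ]


command = qseq_to_tag_count_command(".", "rice.key", "apeki")
print(shlex.join(command))
```

The list can be passed to `subprocess.run(command, check=True)` on a machine where the launcher script is present.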
Using iPlant
GBS on iPlant Discovery Environment (beta version now)
http://www.iplantcollaborative.org/
Two ways to upload files to the iPlant Data Store
1. Web interface
2. Command-line tool: iCommands
Set up the pre-compiled pipeline on your own computer

• A computer with at least 8 GB of RAM (Linux or Mac)
• Download TASSEL Standalone from maizegenetics.net (zip file: TASSEL_x.x_Standalone)
• Set up Java (64-bit)
• BWA (for alignment to the reference genome)
Document for installation:
http://www.maizegenetics.net/tassel/docs/TasselPipelineCLI.pdf
Set up TASSEL source code in NetBeans
http://www.maizegenetics.net/tassel-in-netbeans
(make sure to use 64-bit Java and NetBeans)
The intermediate files are compressed binary files

• Tag-Counts (TC): *.cnt.txt, *.cnt
• Tag-by-taxa (TBT): *.tbt.txt, *.tbt.bin
• Tags-on-physical-map (TOPM): *.topm.txt, *.topm.bin
• Hapmap: *.hmp.txt, GDPDM blobs

* 64 bp tags are represented as 2 long integers (a long in Java is 8 bytes).

BinaryToTextPlugin can be used to convert a binary file to a text file.
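Why two long integers are enough: each base needs 2 bits, so 32 bases fit in one 64-bit value and a 64-base tag fits in two. The sketch below shows one such packing. The bit assignment (A=0, C=1, G=2, T=3, first base in the highest bits) is an assumption made for illustration and is not taken from the TASSEL source; the values are also shown as unsigned integers, whereas a Java long is signed.

```python
# Assumed 2-bit code per base (illustrative, not confirmed against TASSEL).
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"
BASES_PER_LONG = 32  # 32 bases x 2 bits = 64 bits


def pack_tag(tag):
    """Pack a 64-base tag into two unsigned 64-bit integers."""
    if len(tag) != 2 * BASES_PER_LONG:
        raise ValueError("expected a 64-base tag")
    longs = []
    for start in (0, BASES_PER_LONG):
        value = 0
        for base in tag[start:start + BASES_PER_LONG]:
            value = (value << 2) | BASE_TO_BITS[base]
        longs.append(value)
    return tuple(longs)


def unpack_tag(longs):
    """Recover the 64-base tag from its two packed integers."""
    bases = []
    for value in longs:
        # Read 2 bits at a time, highest bits first.
        for shift in range(62, -2, -2):
            bases.append(BITS_TO_BASE[(value >> shift) & 3])
    return "".join(bases)


example_tag = "ACGT" * 16  # synthetic 64-base tag
packed = pack_tag(example_tag)
print(packed, unpack_tag(packed) == example_tag)
```

Stored this way a tag takes 16 bytes instead of 64 bytes as text.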
1. Documentation of the tools:
http://www.maizegenetics.net/gbs-bioinformatics

2. Training project data is provided by Chih-Wei Tung & Susan McCouch.