Java程序辅导

C C++ Java Python Processing编程在线培训 程序编写 软件开发 视频讲解

客服在线QQ:2653320439 微信:ittutor Email:itutor@qq.com
wx: cjtutor
QQ: 2653320439
Overview and Implementation 
of the GBS Pipeline 
Qi Sun
Computational Biology Service Unit
Cornell University
Overview of the Data Analysis Strategy
Genotyping by Sequencing (GBS)
< 450 bp
ApeKI site (GCWGC)
(   ) 64‐base sequence tag
B73
• Reduced genome representation;
• Reads can be aligned without reference 
genome;
Identification of markers with/without the 
reference genome
SNP and small INDELs
B73
Mo17
Loss of cut site
Identification of Presence/Absence 
Variations (PAV)
B73
Mo17
Reads ‐> Tags ‐> Call SNPs from Tags
CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCATGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCATGTAGACGGGC
………….
Tag 1
Tag 2
Tag 3
Reads ‐> Tags ‐> Call SNPs from Tags
CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCATGTAGACGGGC
………….
Tag 1
Tag 2
Tag 3
Reads ‐> Tags ‐> Call SNPs from Tags
Tag 2
Tag 3
Tassel Stacks
Tag catalog is 
created from pooled 
all individuals.
Tag catalog is created 
from each individual 
separately. 
CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCATGTAGACGGGC
………….
Tag 1
Tag 2
Tag 3
Reads ‐> Tags ‐> Call SNPs from Tags
Maize N M population (5000 lines)
2.6 billion reads
6 million tags
CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
…………………………………
Two ways of alignments:
a. Anchored to reference genome (regular pipeline)
b. Pair-wise alignment between tags (UNEAK)
Reads ‐> Tags ‐> Call SNPs from Tags
ApeK I site
CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGCGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTATAGACGGGC
…………………………………
Reads ‐> Tags ‐> Call SNPs from Tags
tag1
tag2
tag3
Many tags with low read depth have 
sequencing errors.
Reads from a population
p
Unique tag list 
from  ooled reads
Summary of the GBS pipeline
g g
Call SNPs from 
ali ned ta s
p
Unique tag list 
from  ooled reads
Summary of the GBS pipeline
g g
Call SNPs from 
ali ned ta s
Genotypes
Match reads from each 
individual to the tags 
(tags‐to‐taxa matrix)
Reads from a population
Basic Filtering Strategy
02000000
4000000
6000000
8000000
10000000
12000000
1 4 7 10131619222528313437404346495255
#Depth
0
500
1000
1500
2000
2500
3000
3500
4000
1 8
1
5
2
2
2
9
3
6
4
3
5
0
5
7
6
4
7
1
7
8
8
5
9
2
9
9
Depth of Coverage
Whole Genome Shotgun GBS
Depths
Sites
Depths
Tags
02000000
4000000
6000000
8000000
10000000
12000000
1 4 7 10131619222528313437404346495255 0
500
1000
1500
2000
2500
3000
3500
4000
1 8
1
5
2
2
2
9
3
6
4
3
5
0
5
7
6
4
7
1
7
8
8
5
9
2
9
9
Depths
Sites
Depths
Tags
The GBS pipeline filtered out tags with very 
low read depth across the population
WGS GBS
• Reads with sequencing errors;
• Extremely rare alleles;
• Reads from highly repetitive 
regions;
02000000
4000000
6000000
8000000
10000000
12000000
1 4 7 10131619222528313437404346495255 0
500
1000
1500
2000
2500
3000
3500
4000
1 8
1
5
2
2
2
9
3
6
4
3
5
0
5
7
6
4
7
1
7
8
8
5
9
2
9
9
Depths
Sites
Depths
Tags
WGS GBS
The GBS pipeline filtered out tags with very 
low read depth across the population
• Reads with sequencing errors;
• Extremely rare alleles;
Generally we do not remove tags with very high depth. As 
they could come from a mix of (1) good loci with efficient 
PCR amplification; (2) bad loci from repetitive regions.
• Reads from highly repetitive 
regions;
0500
1000
1500
2000
2500
3000
3500
4000
1 6
1
1
1
6
2
1
2
6
3
1
3
6
4
1
4
6
5
1
5
6
6
1
6
6
7
1
7
6
8
1
8
6
9
1
9
6
Uneven coverage enrich sequencing 
depth in some sites, but cause 
missing data in others
Strategies
1. Filter out sites with too much missing data;
2. Filter out genotypes with low depth to minimize 
heterozygous genotyping errors.
3. Imputation based on haplotypes;
Filter genotyping errors
Reference genome
Paralogous regions
Individual B
Individual A
Most of the genotyping errors are caused by 
reads from paralogous regions
Pipeline Implementation
Three ways to access the software
1. Computers with GBS software pre-installed
• Cornell BioHPC Lab
• iPlant Discovery Environment
2. Using pre-compiled Java code from . 
http://www.maizegenetics.net
3. Get the source code from sourceforge .net
http://sourceforge.net
(Project name: TASSEL)
GBS Pipeline on Cornell BioHPC Lab
(for both Cornell and external users only)
Step 1: Reserve a machine
http://cbsu.tc.cornell.edu
GBS Pipeline on Cornell BioHPC Lab
Step 2: Upload files
Fetch (mac), FileZilla (win) or WinSCP (win)  
GBS Pipeline on Cornell BioHPC Lab
Step 3: Type the command to run pipeline
tassel/run_pipeline.pl -fork1 -QseqToTagCountPlugin -i . -k rice.key -e 
apeki -endPlugin -runfork1
Mac: terminal window;  PC: Putty
Two ways to upload files to iPlant data store
1. Web interface
2. Command line tool: icommand
Using iPlant
GBS on iPlant Discovery Environment
http://www.iplantcollaborative.org/
(Beta version now)
Download the zip 
file: TASSEL_x.x
_Standalone
Set up the pre-compiled pipeline on your own 
computer
 A computer with at least 8GB or more RAM (Linux or 
Mac) 
 Download TASSEL Standalone from maizegenetics.net
 Set up Java (64bit)
 BWA (for alignment to reference genome)
http://www.maizegenetics.net/tassel/docs/TasselPipelineCLI.pdf
Document for installation:
Set up TASSEL source code in Netbeans
http://www.maizegenetics.net/tassel-in-netbeans
(make user use 64-bit  Java and Netbeans)
The intermediate files are compressed binary files
• Tag‐Counts (TC):  
*.cnt.txt   *.cnt
• Tag‐by‐taxa (TBT):  
*.tbt.txt *.tbt.bin
• Tags‐on‐physical‐map (TOPM): 
*.topm.txt *.topm.bin
• Hapmap
*.hmp.txt GDPDM blobs
* 64 bp tags were represented as 2 long integers (8 bytes for long in Java). 
BinaryToTextPlugin can be used to convert to text file