BIT150 - Lab 10  
Part I. Nucleotide Diversity Analysis 
Part II. Association mapping 
 
Part I.  DnaSP 
Sequence analysis. Estimation of nucleotide diversity. Test for evidence of selection 
What evolutionary information can we infer from these data? 
 
 
 DnaSP 
1. About DnaSP  
2. Importing data  
3. Calculation of nucleotide diversity  
4. Detecting departures from neutrality  
 
1. About DnaSP  
 
DnaSP (Rozas and Rozas, 1999; Rozas et al., 2003) is a software package for the analysis of 
the DNA polymorphism from nucleotide sequence data. DnaSP runs on a Windows platform 
and is freely available at http://www.ub.es/dnasp/. 
 
In this lab, you will learn how to use DnaSP to calculate the nucleotide diversity present in 
nucleotide sequence data, and how to test for departure from a neutral model of 
evolution, i.e. genetic drift. However, DnaSP is also capable of performing a number of other 
calculations. 
 
As explained at the beginning of this lab, DnaSP requires a multiple DNA sequence alignment 
file in FASTA format, such as one created using ClustalW through MEGA4, BioEdit, etc., or 
using ClustalX with its graphical interface. 
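
Before importing a file, it can be useful to confirm that it really is an alignment, i.e. that every sequence has the same length once gaps are included. The following is a minimal Python sketch of that check. The function names and the three example sequences are hypothetical and are not part of DnaSP.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a dict of {name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The sequence name is the first word after '>'.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {n: "".join(parts) for n, parts in records.items()}


def is_aligned(records):
    """True when every sequence has the same length (gaps included)."""
    return len({len(seq) for seq in records.values()}) <= 1


# Hypothetical example data, not taken from the lab dataset.
example = """>seq1
ACGT-ACGT
>seq2
ACGTTACGT
>seq3
ACGTTACCT
"""
alignment = read_fasta(example)
print(len(alignment), "sequences; aligned:", is_aligned(alignment))
```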
 
2. Importing data 
 
- Open DnaSP. The opening screen has animated images of DNA double-helices, which stop 
when you click anywhere on the screen. 
 
- Click on File in the toolbar to get a blank screen. 
 
- Go to File|Open Data File... and Open the .fas file. 
 
This opens a Data Information window, which shows a summary of your data, i.e. total number 
of nucleotide sites, total number of sequences, etc. Close this window (to open the Data 
Information window at any time, go to Display|Data info). 
 
 
 
- Go to Display|View Data to see the multiple sequence alignment. 
 
This opens a DNA Sequence Polymorphism window with the aligned sequence names along 
the left side, and nucleotide positions along the top. Use the scroll bars along the bottom and 
the right side to move along the length of the sequences and to view all the sequences. 
 
In the bottom right corner, there is a Select Sites/Codons… drop-down box with options of how 
you can view your data. This includes options for highlighting the invariable (monomorphic) or 
the variable (polymorphic) sites only. Select these options in turn to see how this affects the 
data shown. If the sequences were annotated, you could also view the sequence as codons, 
and highlight synonymous and nonsynonymous nucleotide sites. 
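
The distinction between monomorphic and polymorphic sites can be sketched in a few lines of Python. This is only an illustration of the idea, not DnaSP's own code. The function name and the example sequences are hypothetical, and skipping any column that contains a gap or ambiguous base is an assumption made here for simplicity.

```python
def classify_sites(seqs):
    """Return (monomorphic, polymorphic) lists of 0-based column indices.

    Columns containing anything other than A, C, G or T are skipped
    (an assumption of this sketch).
    """
    mono, poly = [], []
    for i, column in enumerate(zip(*seqs)):
        if any(base not in "ACGT" for base in column):
            continue
        if len(set(column)) > 1:
            poly.append(i)   # more than one base observed: variable site
        else:
            mono.append(i)   # a single base observed: invariable site
    return mono, poly


# Hypothetical aligned sequences.
seqs = ["ACGTAC", "ACGTTC", "ACCTTC", "ACGTAC"]
mono, poly = classify_sites(seqs)
print("monomorphic:", mono)
print("polymorphic:", poly)
```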
 
 
3. Calculation of nucleotide diversity  
 
- Click on Analysis in the toolbar to see the variety of analyses that DnaSP can perform. 
 
- Go to Analysis|DNA polymorphism. This brings up a DNA Polymorphism Options window. 
 
• The Data Set drop-down box gives you the option to select the dataset to be analyzed. 
Since your dataset contains only one set of sequences, the only option given will be All 
Included Sequences.  
• You can estimate the nucleotide diversity in your data set either across the entire sequence 
or in specific regions by selecting the Region to Analyze. 
• You can also estimate whether nucleotide diversity is particularly high in a specific region of 
the sequence using the Sliding Window option. If you check the Compute box, you can 
then define the size of the sequence block (Window Length) and how often to repeat the 
calculation (Step Size). As an example, you can see how the pattern of nucleotide diversity 
changes in 100 nucleotide blocks, every 25 nucleotides along your sequence.  
• Finally, there are several Options that control how nucleotide diversity (the average 
number of nucleotide differences per site between two sequences) is calculated: 
 
(i) Variance of Pi - this refers to the variance in the average number of nucleotide 
differences per site between two sequences (Nei 1987).  
 
(ii) Nucleotide diversity with Jukes and Cantor correction factor - this model corrects for 
sites where mutation has occurred more than once, which would otherwise be counted as a 
single difference or as no difference at all (Jukes and Cantor 1969; Lynch and Crease 
1990).  
(iii) Nucleotide diversity (gaps/missing data) - both of the earlier options assume that 
there are no gaps in the sequence data. However, if there are indels in the sequence, 
you will need to select this option; otherwise, these indels will be ignored during 
the analysis.  
 
NOTE: You can select only (i), only (ii), both (i) and (ii), or only (iii), but not all 
three of them.  
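
To make the definition of nucleotide diversity concrete, here is a minimal Python sketch that computes Pi as the average proportion of differing sites over all pairs of sequences, with an optional Jukes and Cantor correction applied to each pairwise value. It assumes gap-free aligned sequences, and the function names and example data are hypothetical. DnaSP's exact implementation may differ in detail.

```python
import math
from itertools import combinations


def pairwise_p(a, b):
    """Proportion of sites that differ between two gap-free aligned sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def nucleotide_diversity(seqs, jukes_cantor=False):
    """Average pairwise differences per site (Pi) over all sequence pairs."""
    values = []
    for a, b in combinations(seqs, 2):
        p = pairwise_p(a, b)
        if jukes_cantor:
            # Jukes-Cantor distance: corrects p for multiple hits at a site.
            p = -0.75 * math.log(1 - 4 * p / 3)
        values.append(p)
    return sum(values) / len(values)


# Hypothetical aligned sequences (10 sites, no gaps).
seqs = ["ACGTACGTAC", "ACGTTCGTAC", "ACCTTCGTAC", "ACGTACGTAA"]
pi = nucleotide_diversity(seqs)
pi_jc = nucleotide_diversity(seqs, jukes_cantor=True)
print(f"Pi = {pi:.4f}, Pi (JC) = {pi_jc:.4f}")
```

The corrected value is always at least as large as the uncorrected one, because the correction adds back differences hidden by repeated mutation at the same site.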
 
- Select All Included Sequences as Data Set, the entire region as Region to Analyze, and both 
Compute Variance of Pi and Compute Pi as the Options to calculate nucleotide diversity. Click 
on OK. 
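
The Sliding Window option described above (a Window Length and a Step Size) can be sketched as follows. This is an illustrative Python sketch under the assumption of gap-free sequences. The function names are hypothetical, and the short example uses a window of 4 and a step of 2 rather than the 100 and 25 suggested for real data.

```python
from itertools import combinations


def pi_per_site(seqs):
    """Average pairwise differences per site for gap-free aligned sequences."""
    pairs = list(combinations(seqs, 2))
    diffs = sum(sum(x != y for x, y in zip(a, b)) for a, b in pairs)
    return diffs / (len(pairs) * len(seqs[0]))


def sliding_pi(seqs, window, step):
    """Return (first_site, last_site, pi) for each window, sites 1-based."""
    length = len(seqs[0])
    results = []
    for start in range(0, length - window + 1, step):
        block = [s[start:start + window] for s in seqs]
        results.append((start + 1, start + window, pi_per_site(block)))
    return results


# Hypothetical sequences whose variation is concentrated at the 3' end.
seqs = ["AAAAAAAAAAAA", "AAAAAAAAAATT", "AAAAAAAAAAGT"]
windows = sliding_pi(seqs, window=4, step=2)
for first, last, pi in windows:
    print(f"sites {first}-{last}: Pi = {pi:.3f}")
```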
 
 
 
4. Detecting departures from neutrality 
 
Once we have estimated nucleotide diversity, we can find out whether selection has 
potentially played any role in influencing these sequence changes.  
 
Tajima's test, or D test statistic (Tajima, 1989), tests the neutral theory of molecular 
evolution (Kimura, 1983). This theory holds that the vast majority of molecular differences 
that arise through spontaneous mutation do not influence the fitness of the individual. A 
corollary to this theory is that genomes evolve primarily through the process of genetic drift.  
 
Tajima's D statistic compares two estimates of the amount of nucleotide variation, one 
based on the number of segregating sites (Watterson, 1975) and the other on the average 
number of pairwise differences (Nei and Li, 1979; Tajima, 1983). In a constant-sized 
population experiencing only genetic drift, both estimates should give equal values. 
Dissimilar values suggest that some form of selection could be acting on this sequence, or 
that the population has not remained constant in size.  
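
The comparison of the two estimates can be written out explicitly. The sketch below follows the formula published by Tajima (1989) and assumes gap-free aligned sequences. The function name and the example data are hypothetical, and the code is meant to show the arithmetic, not to reproduce DnaSP's output for real data.

```python
import math
from itertools import combinations


def tajimas_d(seqs):
    """Return (S, k, theta_w, D) for gap-free aligned sequences."""
    n = len(seqs)
    # S: number of segregating (polymorphic) sites.
    S = sum(1 for column in zip(*seqs) if len(set(column)) > 1)
    if S == 0:
        raise ValueError("no segregating sites; D is undefined")
    # k: average number of pairwise differences over the whole sequence.
    pairs = list(combinations(seqs, 2))
    k = sum(sum(x != y for x, y in zip(a, b)) for a, b in pairs) / len(pairs)
    # Constants that depend only on the sample size n.
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    # Watterson's estimate, based on the number of segregating sites.
    theta_w = S / a1
    d = (k - theta_w) / math.sqrt(e1 * S + e2 * S * (S - 1))
    return S, k, theta_w, d


# Hypothetical aligned sequences (10 sites, no gaps).
seqs = ["ACGTACGTAC", "ACGTTCGTAC", "ACCTTCGTAC", "ACGTACGTAA"]
S, k, theta_w, d = tajimas_d(seqs)
print(f"S = {S}, k = {k:.3f}, theta_W = {theta_w:.3f}, D = {d:.3f}")
```

When k and theta_W are nearly equal, as in this small example, D is close to zero.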
 
A positive value of Tajima's D suggests that there may have been 'balancing selection', and 
the data will show a few divergent haplotypes, whereas a negative value suggests that 
'purifying selection' may have occurred, and the data will reveal an excess of singletons. 
 
- Go to Analysis|Tajima's test. 
 
- Select All Included Sequences as Data Set, the entire region as Region to Analyze, and 
Segregating Sites as the Nucleotide Substitutions Considered for the analysis. Click on OK. 
 
 
 
Questions to consider:  
What is the frequency of SNPs? 
What are the nucleotide diversity statistics theta and pi?  
Does the gene appear to be under selection? Why yes/no?  
 
Part II.  Tassel 
 
1. About Tassel 
2. The data  
3. The analysis  
4. Example  
 
1. About Tassel 
 
Trait Analysis by Association, Evolution, and Linkage (TASSEL) is a Java-based program 
intended to infer correlations between genetic markers and phenotypic traits (association 
mapping). In this lab we will only focus on methods used to infer correlations between genetic 
and phenotypic data. Specifically, we will assess correlations between single nucleotide 
polymorphism (SNP) markers and various wood property characteristics in loblolly pine (Pinus 
taeda). However, this software also performs a variety of other quantitative analyses including 
calculation of molecular diversity, estimation of linkage disequilibrium, and inference of 
phylogenetic trees.  
 
2. The data 
  
The goal in association mapping is to correlate genotypic with phenotypic variation. We refer to 
this as marker-trait associations. The dataset, therefore, consists of both types of data. The 
genotypic data are comprised of genotypic classes defined across a large number (n) of SNPs 
(n = 58 SNPs in the dataset). For a standard SNP with only two states, there are only three 
genotypic classes in a diploid individual (homozygous for state 1, heterozygous, homozygous 
for state 2). ‘State’ refers to what nucleotides are found at a given SNP. These genotypic 
classes at each SNP are coded with single letters for each individual in the dataset. The 
phenotypic data are comprised of quantitative measurements of various wood property traits (n 
= 18 traits).  
 
3. The analysis 
 
We will use a General Linear Model (GLM) to estimate genetic effects on phenotypic data. In 
this context, variation at SNP markers is used to explain variation in phenotypes (y = a + bx + e, 
where y is the phenotypic trait, a is the intercept, x is the genotype at the SNP, b is the linear 
term corresponding to the SNP, and e is the error). 
A statistical test of the following form will be performed for each SNP and phenotypic trait:  
 
H0: The linear term (b) corresponding to SNP i is equal to zero. 
 
HA: The linear term (b) corresponding to SNP i is not equal to zero.  
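
As an illustration of the model y = a + bx + e for a single SNP and a single trait, the sketch below fits the line by least squares and computes the F statistic for the linear term. Coding the three genotypic classes as 0, 1 and 2 is an assumption made here for simplicity, and the trait values are invented. The p-value would come from the F distribution with 1 and n - 2 degrees of freedom, which is not computed here.

```python
def linear_fit(x, y):
    """Least-squares fit of y = a + b*x; returns (a, b, F)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx                 # slope: the linear term for the SNP
    a = mean_y - b * mean_x       # intercept
    # Sum of squares explained by the SNP, and residual sum of squares.
    ss_model = b * sxy
    ss_error = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    # F = (model mean square) / (error mean square), with 1 and n-2 df.
    f = ss_model / (ss_error / (n - 2))
    return a, b, f


# Hypothetical data: genotype classes coded 0/1/2 and invented trait values.
genotype = [0, 0, 1, 1, 2, 2]
trait = [1.0, 1.2, 1.9, 2.1, 3.0, 2.8]
a, b, f = linear_fit(genotype, trait)
print(f"a = {a:.2f}, b = {b:.2f}, F = {f:.1f}")
```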
 
The null hypothesis is rejected when the corresponding p-value is less than 0.05 or some other 
predetermined significance threshold. Since there are as many tests as there are combinations 
of SNPs and phenotypic traits, the p-value (or the threshold) is often adjusted to take into 
account the large number of statistical tests performed (in this dataset that is 58*18 = 1044 
tests!). When the p-value is less than the chosen threshold, we reject the null hypothesis and 
conclude that variation at this particular SNP is significantly correlated, or associated, with 
variation for a certain phenotypic trait.  
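
The effect of performing this many tests can be worked out directly. The sketch below uses a Bonferroni correction, which is one common adjustment and is an assumption here, since the handout does not name a specific method.

```python
n_snps = 58
n_traits = 18
alpha = 0.05

# One test per combination of SNP and trait.
n_tests = n_snps * n_traits

# Bonferroni: divide the significance threshold by the number of tests.
bonferroni_threshold = alpha / n_tests

# Tests expected to fall below 0.05 by chance if no SNP has any effect.
expected_false_positives = alpha * n_tests

print(f"tests: {n_tests}")
print(f"Bonferroni threshold: {bonferroni_threshold:.2e}")
print(f"expected false positives at 0.05: {expected_false_positives:.1f}")
```

Without any adjustment, roughly 52 of the 1044 tests would be expected to reach p < 0.05 even if no marker were associated with any trait.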
 
 
 
4. Example 
 
The files you will need to work with Tassel are in a folder named ‘tassel_files’ on the Z drive. 
There are two files corresponding to genotypic and phenotypic data, called ‘genWood.txt’ and 
‘phenoWood.txt’, respectively.  
  
- Using FileZilla, transfer the ‘genWood.txt’ and ‘phenoWood.txt’ files from the Z drive to your 
directory on the C drive.  
 
- Open Tassel. 
 
 
 
Note that the window is divided into three major frames. In the upper left is a data tree, where all 
the input data files and subsequent output files are listed. In the lower left is a status frame, 
which summarizes commands that are executed. On the right is a data window, which shows the 
data once it is imported into the program. 
  
- Click on POLY and open the genWood.txt file. The SNP genotypes for each individual are now 
loaded into the program. Click on the file named Allele located in the data tree to see the SNP 
data in the data window.  
 
 
 
- Click on TRAIT and open the phenoWood.txt file. The phenotypic data for each individual for 
each trait are now loaded into the program. Click on the file named 18 traits/environ located in 
the data tree to see the phenotypic data in the data window.  
 
The next step is to combine the phenotypic data with the genotypic data to get a single dataset. 
 
- Highlight the files corresponding to each dataset, the SNP and the trait, in the data tree. To 
highlight both files (Allele and 18 traits/environ) hold the Ctrl key down while you click on each 
of the files. Click on U join. 
 
We now have a complete dataset comprised of both genotypic and phenotypic data. 
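
Conceptually, the join matches the rows of the two files by individual. The sketch below assumes that the join is a union, i.e. that every individual found in either file is kept and that missing values are left empty. That reading of the button, the function name, the sample IDs, the column names and the values are all hypothetical.

```python
def union_join(genotypes, phenotypes):
    """Merge two {sample_id: {column: value}} tables, keeping every sample."""
    geno_cols = sorted({c for row in genotypes.values() for c in row})
    pheno_cols = sorted({c for row in phenotypes.values() for c in row})
    joined = {}
    for sample in sorted(set(genotypes) | set(phenotypes)):
        # Start with the trait columns, then add the marker columns.
        row = {c: phenotypes.get(sample, {}).get(c) for c in pheno_cols}
        row.update({c: genotypes.get(sample, {}).get(c) for c in geno_cols})
        joined[sample] = row
    return joined


# Hypothetical individuals, marker codes and trait values.
genotypes = {"tree1": {"m1": "A"}, "tree2": {"m1": "R"}}
phenotypes = {"tree1": {"density": 0.52}, "tree3": {"density": 0.47}}
combined = union_join(genotypes, phenotypes)
for sample, row in combined.items():
    print(sample, row)
```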
 
- Click on the file named 18 traits/environ + Allele in the data tree to view the complete file.  
 
We are now ready to perform an analysis. 
 
- Click on Analysis, and then on GLM. 
 
The Input Data Definition window will appear, and is composed of two frames. The one on the 
left lists all the input data. For this lab that is the phenotypic traits and the population. Since 
there is only a single population in this dataset, the drop-down menu for pop should be set to 
Exclude. Next to each trait is a drop-down menu that specifies what should be done with these 
traits. They may be selected as data, a factor, a covariate, or be excluded. You will want to have 
all the phenotypic traits set to Data. Lastly, check the box in the right frame that is labeled as 
Analyze each data column separately. This will perform separate analyses for each 
phenotypic trait. 
- Click on OK.  The Build a Linear Model window will appear, which allows a number of 
additional specifications to be listed. 
 
- Click on Run.   The analysis should now be running. You can verify this by looking at the 
status bar in the upper right corner of the program window. The results are printed to the results 
folder located in the data tree. 
 
- Click on the output file named GLM_18 traits/environ + Allele. 
 
 
 
The results are located in the data window. The first column lists the phenotypic trait. There 
are 18 traits. Subsequent columns list important values of the GLM fittings and tests of those fits 
for each trait and SNP. Each phenotypic trait has 58 rows, one for each SNP. SNPs are labeled 
as markers with the abbreviation m(i) or q(j), where i = 1, 2, 3, ..., 48 and j = 1, 2, ..., 10.  
There are two very important columns that you should inspect. The first is named F_marker. It 
is the test statistic used to test the hypothesis of marker-trait association. The larger the 
F value, the more of the trait variation the marker explains relative to the error. The second 
is named p_marker. This is the p-value associated with the test statistic (F). Remember that a 
p-value of less than 0.05 is considered significant. When p is far below 0.05, the evidence for 
a marker-trait association is very strong.  
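
With 1044 rows in the results table, it helps to pick out the significant rows systematically. The sketch below assumes the table has been copied into a list of rows. The column names F_marker and p_marker come from the output described above, while the Trait and Marker keys, the function name and all values shown are invented for illustration.

```python
def significant(rows, threshold=0.05):
    """Rows with p_marker below the threshold, smallest p-value first."""
    hits = [row for row in rows if row["p_marker"] < threshold]
    return sorted(hits, key=lambda row: row["p_marker"])


# Hypothetical rows; the real table has one row per trait and SNP.
results = [
    {"Trait": "trait1", "Marker": "m1", "F_marker": 0.8, "p_marker": 0.45},
    {"Trait": "trait1", "Marker": "m2", "F_marker": 6.1, "p_marker": 0.003},
    {"Trait": "trait2", "Marker": "q1", "F_marker": 3.4, "p_marker": 0.04},
    {"Trait": "trait2", "Marker": "m2", "F_marker": 12.9, "p_marker": 0.00002},
]
hits = significant(results)
for row in hits:
    print(row["Trait"], row["Marker"], row["F_marker"], row["p_marker"])
```

Passing a stricter threshold, for example one adjusted for the number of tests, narrows the list further.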
 
Questions to consider:  
Are there significant associations between the traits and markers listed?  
Do significant associations alone provide conclusive evidence of causation (i.e., the variation in 
these markers CAUSES the variation in the phenotype)? What additional data would be helpful to 
prove causation?  
What is the relationship between F and p-value?