BIT150 - Lab 10

Part I. Nucleotide Diversity Analysis
Part II. Association mapping

Part I. DnaSP

Sequence analysis. Estimation of nucleotide diversity. Test for evidence of selection. What evolutionary information can we infer from these data?

DnaSP
1. About DnaSP
2. Importing data
3. Calculation of nucleotide diversity
4. Detecting departures from neutrality

1. About DnaSP

DnaSP (Rozas and Rozas, 1999; Rozas et al., 2003) is a software package for the analysis of DNA polymorphism from nucleotide sequence data. DnaSP runs on Windows and is freely available at http://www.ub.es/dnasp/. In this lab, you will learn how to use DnaSP to calculate the nucleotide diversity present in nucleotide sequence data, and how to test for departure from a neutral model of evolution, i.e. one in which variation is shaped by genetic drift alone. DnaSP can also perform a number of other calculations. As explained at the beginning of this lab, DnaSP requires a multiple DNA sequence alignment file in FASTA format, such as one created using ClustalW through MEGA4, BioEdit, etc., or using ClustalX with its graphical interface.

2. Importing data

- Open DnaSP. The opening screen has animated images of DNA double-helices, which stop when you click anywhere on the screen.
- Click on File in the toolbar to get a blank screen.
- Go to File|Open Data File... and open the .fas file. This opens a Data Information window, which shows a summary of your data, i.e. total number of nucleotide sites, total number of sequences, etc. Close this window (to open the Data Information window at any time, go to Display|Data info).
- Go to Display|View Data to see the multiple sequence alignment. This opens a DNA Sequence Polymorphism window with the aligned sequence names along the left side, and nucleotide positions along the top. You can move along the length of the sequence, or up and down through the sequences, using the scroll bars.
In the bottom right corner, there is a Select Sites/Codons… drop-down box with options for how to view your data. These include highlighting only the invariable (monomorphic) or only the variable (polymorphic) sites. Select these options in turn to see how each affects the data shown. If the sequences were annotated, you could also view the sequence as codons, and highlight synonymous and nonsynonymous nucleotide sites.

3. Calculation of nucleotide diversity

- Click on Analysis in the toolbar to see the variety of analyses that DnaSP can perform.
- Go to Analysis|DNA polymorphism. This brings up a DNA Polymorphism Options window.
• The Data Set drop-down box gives you the option to select the dataset to be analyzed. Since your dataset contains only one set of sequences, the only option given will be All Included Sequences.
• You can estimate the nucleotide diversity in your data set either across the entire sequence or in specific regions by selecting the Region to Analyze.
• You can also estimate whether nucleotide diversity is particularly high in a specific region of the sequence using the Sliding Window option. If you check the Compute box, you can then define the size of the sequence block (Window Length) and how far to move the block before repeating the calculation (Step Size). As an example, you can see how the pattern of nucleotide diversity changes in 100-nucleotide blocks, every 25 nucleotides along your sequence.
• Finally, there are several Options for the calculation of nucleotide diversity (the average number of nucleotide differences per site between two sequences):
(i) Variance of Pi - this refers to the variance in the average number of nucleotide differences per site between two sequences (Nei 1987).
(ii) Nucleotide diversity with Jukes and Cantor correction factor - this model corrects for sites where mutation has occurred more than once.
As such, the Jukes and Cantor correction accounts for multiple substitutions at the same site as the sequences evolved (Jukes and Cantor 1969; Lynch and Crease 1990).
(iii) Nucleotide diversity (gaps/missing data) - both of the earlier options assume that there are no gaps in the sequence data. If there are indels in the sequence, you will need to select this option; otherwise these indels will be ignored during the analysis.
NOTE: You can select only (i), only (ii), both (i) and (ii), or only (iii), but not all three of them.
- Select All Included Sequences as Data Set, the entire region as Region to Analyze, and both Compute Variance of Pi and Compute Pi as the Options to calculate nucleotide diversity. Click on OK.

4. Detecting departures from neutrality

Once we have estimated nucleotide diversity, we can ask whether selection has potentially played a role in shaping these sequence changes. Tajima's test, or D test statistic (Tajima, 1989), tests the neutral theory of molecular evolution (Kimura, 1983). This theory holds that the vast majority of molecular differences that arise through spontaneous mutation do not influence the fitness of the individual. A corollary to this theory is that genomes evolve primarily through the process of genetic drift.

Tajima's D statistic is based on the difference between two estimates of the amount of nucleotide variation, one derived from the number of segregating sites (Watterson, 1975) and the other from the average number of pairwise differences (Nei and Li, 1979; Tajima, 1983). In a constant-sized population experiencing only genetic drift, both estimates are expected to give equal values, so D is expected to be close to zero. Dissimilar values suggest that some form of selection could be acting on this sequence, or that the population has not remained constant in size.

A positive value of Tajima's D is consistent with 'balancing selection', in which case the data will show a few divergent haplotypes, whereas a negative value suggests that 'purifying selection' may have occurred, in which case the data will reveal an excess of singletons.
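To make the two estimates described above concrete, here is a minimal Python sketch that computes the number of segregating sites, the average number of pairwise differences, Watterson's theta, and Tajima's D using the constants from Tajima (1989). It is an illustration only, not DnaSP's implementation: the function name `tajima_d` and the toy alignment are hypothetical, the sequences are assumed to be aligned and free of gaps, and no significance test is performed.

```python
from itertools import combinations
from math import sqrt


def tajima_d(seqs):
    """Return (S, k, theta_w, D) for equal-length, gap-free aligned sequences.

    S       = number of segregating (polymorphic) sites
    k       = average number of pairwise differences (pi per sequence)
    theta_w = Watterson's estimate per sequence, S / a1
    D       = Tajima's D, or None when there are no segregating sites

    The variance term is zero for fewer than 4 sequences, so use n >= 4.
    """
    n = len(seqs)
    # A site is segregating if its alignment column holds more than one base.
    S = sum(1 for column in zip(*seqs) if len(set(column)) > 1)

    # Average number of differences over all pairs of sequences.
    pairs = list(combinations(seqs, 2))
    k = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs) / len(pairs)

    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    theta_w = S / a1
    if S == 0:
        return S, k, theta_w, None

    # Constants for the standard deviation of (k - theta_w), Tajima (1989).
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    d = (k - theta_w) / sqrt(e1 * S + e2 * S * (S - 1))
    return S, k, theta_w, d


# Hypothetical toy alignment: three singleton mutations in four sequences.
toy_alignment = ["ATGCATGC", "ATGCATGA", "ATGAATGC", "TTGCATGC"]
S, k, theta_w, d = tajima_d(toy_alignment)
pi_per_site = k / len(toy_alignment[0])  # nucleotide diversity per site
print(S, k, pi_per_site, theta_w, d)
```

In this toy alignment every variable site is a singleton, so the pairwise estimate falls below the segregating-sites estimate and D comes out negative, matching the pattern described above.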
- Go to Analysis|Tajima's test.
- Select All Included Sequences as Data Set, the entire region as Region to Analyze, and Segregating Sites as the Nucleotide Substitutions Considered for the analysis. Click on OK.

Questions to consider:
What is the frequency of SNPs?
What are the nucleotide diversity statistics theta and pi?
Does the gene appear to be under selection? Why or why not?

Part II. Tassel
1. About Tassel
2. The data
3. The analysis
4. Example

1. About Tassel

Trait Analysis by Association, Evolution, and Linkage (TASSEL) is a Java-based program intended to infer correlations between genetic markers and phenotypic traits (association mapping). In this lab we will focus only on methods used to infer correlations between genetic and phenotypic data. Specifically, we will assess correlations between single nucleotide polymorphism (SNP) markers and various wood property characteristics in loblolly pine (Pinus taeda). However, this software also performs a variety of other quantitative analyses, including calculation of molecular diversity, estimation of linkage disequilibrium, and inference of phylogenetic trees.

2. The data

The goal in association mapping is to correlate genotypic with phenotypic variation. We refer to these correlations as marker-trait associations. The dataset, therefore, consists of both types of data. The genotypic data consist of genotypic classes defined across a large number (n) of SNPs (n = 58 SNPs in the dataset). For a standard SNP with only two states, there are only three genotypic classes in a diploid individual (homozygous for state 1, heterozygous, homozygous for state 2). ‘State’ refers to which nucleotide is found at a given SNP. These genotypic classes at each SNP are coded with single letters for each individual in the dataset. The phenotypic data consist of quantitative measurements of various wood property traits (n = 18 traits).

3. The analysis

We will use a General Linear Model (GLM) to estimate genetic effects on phenotypic data. In this context, variation at SNP markers is used to explain variation in phenotypes (y = a + bx + e, where y is the phenotypic trait, a is the intercept, x is the genotype at the SNP, b is the linear term corresponding to the SNP, and e is the error). A statistical test of the following form will be performed for each SNP and phenotypic trait:
H0: The linear term (b) corresponding to SNP i is equal to zero.
HA: The linear term (b) corresponding to SNP i is not equal to zero.
The null hypothesis is rejected when the corresponding p-value is less than 0.05 or some other predetermined significance threshold. Since there are as many tests as there are combinations of SNPs and phenotypic traits, the p-value is often adjusted to account for the large number of statistical tests performed (in this dataset that is 58*18 = 1044 tests!). When the p-value is less than 0.05 (or the adjusted threshold), we reject the null hypothesis and conclude that variation at this particular SNP is significantly correlated, or associated, with variation for a certain phenotypic trait.

4. Example

The files you will need to work with Tassel are in a folder named ‘tassel_files’ on the Z drive. There are two files corresponding to genotypic and phenotypic data, called ‘genWood.txt’ and ‘phenoWood.txt’, respectively.
- Using FileZilla, transfer the ‘genWood.txt’ and ‘phenoWood.txt’ files from the Z drive to your directory in C.
- Open Tassel. Note that the window is divided into three major frames. In the upper left is a data tree, where all the input data files and subsequent output files are listed. In the lower left is a status frame, which summarizes commands that are executed. On the right is a data window, which shows the data once they are imported into the program.
- Click on POLY and open the genWood.txt file. The SNP genotypes for each individual are now loaded into the program.
Click on the file named Allele located in the data tree to see the SNP data in the data window.
- Click on TRAIT and open the phenoWood.txt file. The phenotypic data for each individual for each trait are now loaded into the program. Click on the file named 18 traits/environ located in the data tree to see the phenotypic data in the data window.

The next step is to combine the phenotypic data with the genotypic data to get a single dataset.
- Highlight the files corresponding to each dataset, the SNP and the trait, in the data tree. To highlight both files (Allele and 18 traits/environ), hold the Ctrl key down while you click on each of the files. Click on U join. We now have a complete dataset comprising both genotypic and phenotypic data.
- Click on the file named 18 traits/environ + Allele in the data tree to view the complete file.

We are now ready to perform an analysis.
- Click on Analysis, and then on GLM. The Input Data Definition window will appear; it is composed of two frames. The one on the left lists all the input data. For this lab, these are the phenotypic traits and the population. Since there is only a single population in this dataset, the drop-down menu for pop should be set to Exclude. Next to each trait is a drop-down menu that specifies what should be done with that trait. Traits may be selected as data, a factor, or a covariate, or be excluded. You will want to have all the phenotypic traits set to Data. Lastly, check the box in the right frame that is labeled Analyze each data column separately. This will perform separate analyses for each phenotypic trait.
- Click on OK. The Build a Linear Model window will appear, which allows a number of additional specifications to be listed.
- Click on Run. The analysis should now be running. You can verify this by looking at the status bar in the upper right corner of the program window. The results are printed to the results folder located in the data tree.
- Click on the output file named GLM_18 traits/environ + Allele. The results are shown in the data window. The first column lists the phenotypic trait. There are 18 traits. Subsequent columns list important values of the GLM fits and tests of those fits for each trait and SNP. Each phenotypic trait has 58 rows, one for each SNP. SNPs are labeled as markers with the abbreviation m(i) or q(j), where i = 1, 2, 3, …, 48 and j = 1, 2, …, 10.

There are two very important columns that you should inspect. The first is named F_marker. It is the test statistic used to test the hypothesis of marker-trait association. The larger the F value, the better the fit of the GLM. The second is named p_marker. This is the p-value associated with the test statistic (F). Remember that a p-value of less than 0.05 is considered significant. When p <<< 0.05, the evidence for the marker-trait association is very strong.

Questions to consider:
Are there significant associations between the traits and markers listed?
Do significant associations alone provide conclusive evidence of causation (i.e., that the variation in this marker CAUSES the variation in the phenotype)?
What additional data would be helpful to prove causation?
What is the relationship between F and p-value?
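
As a rough illustration of the single-marker model y = a + bx + e from the analysis section, the Python sketch below fits the model by least squares for one SNP and one trait and returns the F statistic for the linear term. It also computes a Bonferroni-adjusted threshold for the 58*18 tests in this dataset. Several things here are assumptions rather than a description of what Tassel does: the function name `marker_f_test`, the toy data, the 0/1/2 numeric coding of the three genotypic classes, and the choice of Bonferroni as the adjustment method.

```python
def marker_f_test(genotypes, phenotypes):
    """Fit y = a + b*x + e by least squares and return (a, b, F).

    genotypes  : numeric genotype codes, e.g. 0, 1, 2 (an assumed coding)
    phenotypes : trait measurements for the same individuals
    F compares variation explained by the marker (1 degree of freedom)
    with residual variation (n - 2 degrees of freedom).
    """
    n = len(genotypes)
    mean_x = sum(genotypes) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((x - mean_x) ** 2 for x in genotypes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(genotypes, phenotypes))

    b = sxy / sxx            # linear term for the SNP
    a = mean_y - b * mean_x  # intercept

    ss_model = b * b * sxx
    ss_error = sum((y - (a + b * x)) ** 2 for x, y in zip(genotypes, phenotypes))
    f_stat = ss_model / (ss_error / (n - 2))
    return a, b, f_stat


# Hypothetical toy data: six individuals, two per genotypic class.
toy_genotypes = [0, 0, 1, 1, 2, 2]
toy_trait = [1.0, 1.2, 2.1, 1.9, 3.0, 3.2]
a, b, f_stat = marker_f_test(toy_genotypes, toy_trait)
print(a, b, f_stat)

# One common (conservative) adjustment: divide the threshold by the
# number of tests. Bonferroni is an assumed choice for illustration.
n_snps, n_traits = 58, 18
n_tests = n_snps * n_traits
bonferroni_alpha = 0.05 / n_tests
print(n_tests, bonferroni_alpha)
```

With an adjusted threshold of this kind, a marker-trait pair would need a p_marker value far below 0.05 to be declared significant.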