Software Tutorial for Microarray Meta-analysis

August 29, 2012

Contents

1 Introduction
  1.1 Background
  1.2 How to get it
  1.3 Reporting bugs and update
2 Getting Started
  2.1 Graphical Interface Introduction
  2.2 Prepare the data
  2.3 Load Data and Preprocessing
    2.3.1 Load Data
    2.3.2 Preprocessing
  2.4 Meta Analysis
    2.4.1 MetaQC
    2.4.2 MetaDE
    2.4.3 MetaPath
3 Example
  3.1 Load data
  3.2 Preprocessing
  3.3 MetaQC
  3.4 MetaDE
  3.5 MetaPath

1 Introduction

1.1 Background

Microarrays play an important role in genomic studies. With the rapid development of high-throughput genomic technology, combining multiple studies has become an important way to increase statistical power.
Many researchers have proposed methods for these genomic problems, and some of these methods have been implemented as packages in languages such as R. However, they remain hard for experimental scientists to use, because operating these languages is complicated and is not the focus of many researchers [1]. This tutorial therefore presents a user-friendly GUI software that implements the metaOmics package [2]. It is designed to be easy to use and to make the results easy to understand.

1.2 How to get it

The software can be downloaded from http://www.biostat.pitt.edu/bioinfo/software.htm

1.3 Reporting bugs and update

Some settings may feel uncomfortable or inconvenient to use. If you find any, please let the author know so that they can be improved in a future version. Suggestions and feedback are sincerely appreciated. If you find any bugs, please contact the author (zhh18@pitt.edu) and they will be fixed in the next version. Thank you for your contribution.

2 Getting Started

2.1 Graphical Interface Introduction

The GUI consists of 4 parts, which you can see when you open the software.

• Studies Information. This field keeps track of information about the input data, including study name, gene size, sample size, platform name and the year the data were generated. Keep in mind that the gene size may change if the user merges or filters the data.
• Console window. This window reports which job the software is currently running. If the user clicks a button that requires another action to be completed first, this window shows a warning message.
• Load data panel. This panel allows the user to load data from the local computer. Afterwards, the user can preprocess the data, which is then ready for the meta-analysis panel.
• Meta-analysis panel. Three analysis tools are currently available in this panel: metaQC, metaDE and metaPath.
Users can visualize and save the results.

2.2 Prepare the data

Data should be arranged in matrix format. Each row represents a gene and each column represents a sample. The data should also include proper gene names or probe ID names, and sample names. See Figure 2.1.

Figure 2.1: two class. (a) data matched (b) data unmatched

Figure 2.2: survival data

At present only .txt and .csv files are allowed; support for xls files may be added in the future. It is convenient to arrange the data in Excel: just open the data in Excel and save it as txt or csv.

Different data types have subtle format differences. For two class, multiple class and continuous data, the first row should be the sample names and the second row should be the class labels. For survival data, the first row should be the sample names, the second row should be the time and the third row should be the censoring status. See Figure 2.2.

For matched data, the first column should be the gene symbol. See Figure 2.1a. For unmatched data, the first column should be the probe name and the second column should be the gene symbol. See Figure 2.1b. The rest should be the expression matrix of the study. If there are missing values in the data, please mark them as NA, as is the convention.

2.3 Load Data and Preprocessing

2.3.1 Load Data

• File type: two types of files can be read into the software: txt files or csv files.
• Data type: four data types are allowed in the software: two class, multiple class, Continue (continuous) and Survival.
• Logged: if the microarray data have been log transformed, please select this checkbox. If not, please leave it unselected.
• Matched: if the probe IDs have not been mapped to unique gene symbols, please unselect the matched checkbox so that, for each gene symbol with multiple probe IDs, the software keeps the probe with the greatest IQR value. If you have already matched the probe IDs to gene symbols and each study has unique gene names, please select this checkbox.
• Add studies: select the data files and click Open. Each dataset should be in its own file, and all files must be in the same directory.
• Add info (optional): additional information associated with each dataset can be added to the software. This information includes the platform and the year in which each dataset was generated. The information file consists of one line per study. Each line contains tab-delimited tokens: the first is the study name (author name), the second is the platform and the third is the year.
• Confirm: after you add the studies and info, click the confirm button and the data will be read into the software. If the dataset is big, this may take a while.

2.3.2 Preprocessing

• Merge: after you click the merge button, only the genes shared across all studies are kept; other genes are removed.
• Filter: the user can filter out unexpressed genes by mean value and uninformative genes by standard deviation. Both the mean and the standard deviation filter thresholds need to be specified.
• KNN imputation: if there are missing values in the datasets, please run KNN imputation before using any further functions.

After this step, the data have been successfully loaded into the software and preprocessing is finished. You can now use the meta-analysis sections.

2.4 Meta Analysis

At present this software contains only 3 meta-analysis tools; more may be added in the future. This tutorial shows how to use all three packages, but users may use any one or more of them as they wish.

2.4.1 MetaQC

MetaQC [3] uses 6 quantitative quality control measures to determine the rank of each study. A high rank score suggests that the study is probably irrelevant to the other studies.

• Pathway database for EQCp: here the user needs to specify the pathway database used for EQCp.
Several databases are available, such as GO, Biocarta, KEGG and Reactome. All of these databases can be downloaded from MSigDB. Users can also use their own pathway database by clicking load, but the pathway database must be in gmt file format. The user can either use the default pathway databases or load a pathway database from the local PC, but pathway information cannot be taken from both sources at the same time.
• Pathway information for CQCp and AQCp: here the user needs to specify the pathway database used for CQCp and AQCp. The default databases and the way to load the user's own database are the same as in the previous item.
• Number of top pathways for EQCp: the number of top pathways used for the EQC calculation. For good performance, this should be set to a reasonably small number.
• B for EQC: B is the number of permutations used for the EQC calculation.
• Pval cut: p-value cutoff for the AQC calculation.
• Pval adjust: whether to use B-H adjustment [7].

After setting these parameters, you can click the metaQC button. It may take a while if the number of permutations B is large or the datasets are big. When processing finishes, a QC result table pops up with six kinds of quality score, and a PCA plot is shown. The user can save the PCA plot. After you decide which studies are of poor quality, you can delete these studies and load the data again. The gene information window will then update.

2.4.2 MetaDE

In the MetaDE panel, the user can perform meta DE analysis on multiple studies. Several individual-study test methods and meta test methods are available, depending on the input data type. The software helps users identify differentially expressed genes and provides detailed information about them. The p-value from each individual study and the meta-analysis result are also given.
The user can save the result and generate a heatmap of the differentially expressed genes at a chosen false discovery rate.

• Individual test: there are three options: regular t-test, modified t-test and paired t-test. For the paired t-test, samples must be correctly labeled.
• Individual tail: the user ought to specify which tail is to be used. The default is abs.
• Meta-test: the options are maxP, minP, roP, Fisher [6], AW [4], Stouffer [8], SR, PR, minMCC, FEM, REM and randProd [9], with their OC correction. If OC correction is selected, the user does not need to specify which tail to use because the software compares the results of both tails. Some methods have an asymptotic result and some do not. For methods without an asymptotic approach, the user has to use the permutation method and specify the number of permutations. For methods with an asymptotic approach, the user can still use the permutation method. If the roP method is selected, the number rth must be specified; it is a number between 1 and the number of studies.
• Meta analysis: starts the metaDE analysis. A table pops up listing all the shared genes with their p-value in each individual study, the meta p-value and the meta q-value. The user can sort the genes by any p-value or q-value column. Users can also click on a gene of interest to get detailed information about it (the information comes from Bioconductor org.Hs.eg.db).
• Save as file: the user can save the result as a csv or txt file.
• Heatmap: the user can generate a heatmap of the DE genes at the given false discovery rate. The heatmap can also be saved.

2.4.3 MetaPath

Pathway enrichment is an important method to validate whether the discovered DE genes are reasonable. The MetaPath [5] section provides 3 Meta-Analysis Pathway Enrichment (MAPE) tools, based on the gene level (MAPE_G), the pathway level (MAPE_P) and a hybrid of both levels (MAPE_I).
If the user has multiple studies, this section makes it easy to detect pathways and visualize the result.

• Pathway database: first the user has to specify which pathway database to use. There are 4 default pathway databases, the same as those used in metaQC. Users can also use their own pathway database by loading gmt files. As before, the user can choose the pathway database by only one of these methods.
• Permutation type: the user needs to specify the permutation type (by gene or by sample).
• Meta test: the user needs to specify the meta test method; only 4 methods are available here (maxP, minP, roP, Fisher).
• Pathway gene size range: the user needs to specify the range of the number of genes a pathway may contain. The software then filters out pathways with fewer genes than the min size or more genes than the max size.
• Number of permutations: the number of permutations used during pathway detection.
• Meta pathway: click this button to start MAPE_G, MAPE_P and MAPE_I. It may take a while if the number of permutations is large or the input dataset is big.
• Qvalue.cut: the user can obtain the significant pathways from the three methods by controlling the false discovery rate.
• Plot: the user can visualize these pathways under the given false discovery rate. The plot can also be saved.

3 Example

The software comes with example data to show users how to use it. It is quite easy to go through these examples. Double-click metaOmics.exe to open the software. It may take a couple of seconds to load the required R packages. See Figure 3.1.

There are 9 prostate cancer datasets in txt files in the default load folder. Their probe IDs have been matched to gene symbols and the expression data have not been log2 transformed. The data have two classes: benign tumors are labelled 0 and localized tumors are labelled 1.
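The input layout that such two-class, matched txt files follow (Section 2.2) can be illustrated with a small Python sketch. It is only a sketch under stated assumptions: the function name, the toy gene symbols and values, and the text placed in the first cell of the two header rows are all hypothetical, since the tutorial only shows that layout in Figure 2.1.

```python
import csv
import io


def format_two_class_matched(sample_names, labels, expression):
    """Build tab-delimited text for a two-class, matched study.

    expression maps gene symbol -> list of values, one per sample;
    None marks a missing value and is written as NA.
    """
    if len(labels) != len(sample_names):
        raise ValueError("need exactly one class label per sample")
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    # Row 1: sample names. The corner cell text is an assumption.
    writer.writerow(["symbol"] + list(sample_names))
    # Row 2: class labels (0 = benign, 1 = localized tumor in the example).
    writer.writerow(["label"] + [str(label) for label in labels])
    # Remaining rows: gene symbol first, then the expression values.
    for gene, values in expression.items():
        if len(values) != len(sample_names):
            raise ValueError(f"{gene}: expected {len(sample_names)} values")
        writer.writerow([gene] + ["NA" if v is None else str(v) for v in values])
    return buf.getvalue()


# Hypothetical toy study: 3 samples, 2 genes, one missing value.
demo_text = format_two_class_matched(
    ["S1", "S2", "S3"],
    [0, 0, 1],
    {"TP53": [7.1, None, 6.8], "BRCA1": [5.0, 5.2, 4.9]},
)
print(demo_text)
```

Saving the returned text to a .txt file gives one dataset per file, as the Add studies step expects.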

3.1 Load data

First, the user needs to set the file type to txt and the data type to two class in the corresponding combo boxes, select the matched checkbox because the data are matched, and unselect the logged checkbox because the data are not log2 transformed. Click the add studies button (see Figure 3.2) and an open-file dialog will pop up. The default folder contains 9 studies. Select them all and click Open. See Figure 3.3a. If you are not satisfied with what you have loaded, you can delete studies from the studies information table by clicking the delete studies button. However, make sure that all the studies in the information table are in the same directory.

The user can add external information about the studies by clicking the add info button. This is optional and the requirements are explained in Section 2.3.1. After these steps, click the confirm button and the software will read in the data. It may take a while if your dataset is big. After confirmation, the console window reports that loading is complete. At this point the studies information window is also updated to show the current study names, the gene size of each study, the sample size and the external information. See Figure 3.4. You have now finished loading the data and can move on to the preprocessing section.

Figure 3.1: Open the metaOmics software
Figure 3.2: add studies
Figure 3.3: add studies and information dialog. (a) add studies dialog (b) add information dialog
Figure 3.4: after confirmation

3.2 Preprocessing

In this section, you will do some preprocessing before the meta-analysis. The first step is to merge the data by clicking the merge button. All the studies will then contain the same set of genes, namely the genes common to all the studies. The studies information window will also update its information.
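
The effect of the merge step just described, keeping only the genes common to all studies, can be sketched in Python as a set intersection. This is a minimal illustration and not the software's own code; the function name, the study names and the toy values are hypothetical.

```python
def merge_common_genes(studies):
    """Restrict every study to the genes present in all studies.

    studies maps study name -> {gene symbol: list of expression values}.
    """
    if not studies:
        return {}
    # Intersect the gene sets of all studies.
    gene_sets = [set(genes) for genes in studies.values()]
    common = sorted(set.intersection(*gene_sets))
    # Rebuild each study with the shared genes only, in the same order.
    return {
        name: {gene: genes[gene] for gene in common}
        for name, genes in studies.items()
    }


# Hypothetical toy input: two studies sharing two of their three genes.
toy_studies = {
    "StudyA": {"TP53": [7.1, 6.8], "BRCA1": [5.0, 5.2], "MYC": [9.3, 9.1]},
    "StudyB": {"TP53": [6.9, 7.4], "MYC": [8.8, 9.0], "EGFR": [4.1, 4.4]},
}
merged = merge_common_genes(toy_studies)
print({name: list(genes) for name, genes in merged.items()})
```

After such a merge every study has the same gene size, which is why the studies information window changes at this point.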
The second step is to filter out unexpressed genes by their mean value and uninformative genes by their standard deviation. In the example, 50% of the genes are filtered out by mean value and afterwards 50% of the genes are filtered out by standard deviation. The user can specify these numbers in the filter by mean(%) and filter by sd(%) text fields. Then click the filter button to finish the filtering. See Figure 3.5.

Finally, you need to run KNN imputation in case there are missing values. Missing values are not allowed in metaQC and metaPath. In metaDE, missing values are allowed in some statistical tests, but for convenience the user should run KNN imputation before any further action. After clicking this button, preprocessing is done and the user can move on to the 3 meta-analysis sections.

Figure 3.5: filter

3.3 MetaQC

This section allows users to perform metaQC, which includes 6 quality control measures. EQCp, CQCp and AQCp additionally need the user to specify external pathway information. In the example, a combination of the default pathway databases is selected. See Figure 3.6. The number of pathways for EQC is set to 5. To perform metaQC quickly in the example, the permutation number B is set to 100. Pvalcut is set to 5%. For more details about these parameters, please refer to Section 2.4.1.

Figure 3.6: metaQC panel
Figure 3.7: metaQC result. (a) metaQC 6 quality scores and rank (b) PCA plot of the metaQC result

After specifying these parameters, you can perform metaQC by clicking the metaQC button. The progress of metaQC is shown in the console window. When the process is finished, two dialogs pop up. One dialog shows the rank scores of the metaQC result, based on which the user can decide which studies are of poor quality (see Figure 3.7a).
The other dialog is a PCA plot of the six quality scores; the user can save the plot to a file on the local PC (see Figure 3.7b). The studies Nanni and Dhanasekaran have high metaQC rank scores and relatively small sample sizes, which is evidence that these two studies have low correlation with the other studies. It is reasonable to remove these two studies in order to achieve a more consistent result. Therefore, select these two studies in the studies information window and click the delete button (see Figure 3.8a), then confirm again and merge the data. This time you will probably find that there are more shared genes across the studies, which also indicates that the remaining studies are better correlated. In the example we again filter 50% unexpressed genes by mean value and 50% uninformative genes by standard deviation, and then run KNN imputation again. The data after metaQC selection are then ready for further meta-analysis (see Figure 3.8b).

Figure 3.8: metaQC selections and reload data. (a) remove studies with poor metaQC performance (b) reload data
Figure 3.9: metaDE analysis parameter selection

3.4 MetaDE

After users load the data, or reload the data after metaQC, and complete the corresponding preprocessing, it is time to perform the metaDE analysis. In the example, individual test is set to regt, individual tail to abs and meta test to roP; the asymptotic checkbox is selected and rth is set to 5 (see Figure 3.9). Perform the metaDE analysis by clicking the meta analysis button. After a while, a dialog pops up listing all the shared genes with their p-value in each individual study, the meta p-value and the meta q-value. The user can sort these genes alphabetically or by the significance of any p-value (see Figure 3.10a). To see more information about a particular gene, click on the gene name and a dialog with detailed gene information pops up (see Figure 3.10b).
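
To make the roP setting above more concrete, the following Python sketch combines the per-study p-values of one gene with the r-th ordered p-value. It relies on a standard result: for K independent uniform p-values, the r-th smallest is at most x with probability P(Binomial(K, x) >= r). The sketch also converts a list of p-values to q-values with the B-H procedure [7]. Both functions are illustrative assumptions rather than the software's actual implementation; the tutorial does not state how MetaDE computes its meta q-value, and the function names and toy p-values are hypothetical.

```python
from math import comb


def rop_pvalue(pvalues, r):
    """Asymptotic p-value of the r-th ordered p-value (roP) for one gene."""
    k = len(pvalues)
    if not 1 <= r <= k:
        raise ValueError("r must be between 1 and the number of studies")
    stat = sorted(pvalues)[r - 1]  # r-th smallest p-value
    # P(r-th smallest of k uniforms <= stat) = P(Binomial(k, stat) >= r)
    return sum(
        comb(k, j) * stat**j * (1 - stat) ** (k - j) for j in range(r, k + 1)
    )


def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


# Hypothetical p-values of one gene in 7 studies, combined with rth = 5.
gene_pvalues = [0.001, 0.02, 0.03, 0.04, 0.2, 0.6, 0.9]
meta_p = rop_pvalue(gene_pvalues, 5)

# Hypothetical meta p-values of 4 genes, converted to q-values.
meta_q = bh_qvalues([0.01, 0.04, 0.03, 0.5])
print(meta_p, meta_q)
```

With r equal to the number of studies the statistic reduces to maxP, and with r = 1 it reduces to minP, which is why rth must lie between 1 and the number of studies.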

Figure 3.10: metaDE analysis result. (a) sort metaDE analysis result (b) get detailed information about a particular gene

By clicking the save as file button, the user can save the metaDE analysis result as a txt or csv file. In the example, with the false discovery rate controlled at 5%, clicking the heatmap button generates the expression heatmap of all the differentially expressed genes (see Figure 3.11).

Figure 3.11: differentially expressed genes' heatmap

3.5 MetaPath

In the example, the parameters are set as in Figure 3.12. The default pathway database KEGG is chosen. Permutation type is set to gene and the meta statistic is set to maxP. The lower limit on the number of genes in a pathway is 5 and the upper limit is 500. To get the result quickly, the number of permutations is set to 500. Then click the meta pathway button to perform MAPE_G, MAPE_P and MAPE_I separately. After it finishes, the user can click the qvalue cutoff button to get the significant pathways under a given q-value threshold. A commonly used threshold is 0.05. However, in our example no pathway is detected under this threshold. If the q-value cutoff is 1, all the pathways are listed with their p-value in each individual study and the 3 MAPE results (see Figure 3.13a). The user can also click plot to plot the significant pathways below the given q-value cutoff (see Figure 3.13b).

Figure 3.12: metaPath panel
Figure 3.13: meta pathway result. (a) metaPath result (b) metaPath plot

Bibliography

[1] George C. Tseng, Debashis Ghosh and Eleanor Feingold. Comprehensive literature review and statistical consideration for microarray meta-analysis. Nucleic Acids Research, 2012, Vol. 40, No. 9, 3785-3799.
[2] Xingbin Wang, Dongwan D. Kang, Kui Shen, Chi Song, Shuya Lu, Lun-Ching Chang, Serena G. Liao, Zhiguang Huo, Shaowu Tang, Naftali Kaminski, Etienne Sibille, Yan Lin, Jia Li and George C. Tseng.
An R package Suite for Microarray Meta-analysis in Quality Control, Differentially Expressed Gene Analysis and Pathway Enrichment Detection.
[3] Dongwan D. Kang, Etienne Sibille, Naftali Kaminski and George C. Tseng. (2012) MetaQC: Objective Quality Control and Inclusion/Exclusion Criteria for Genomic Meta-Analysis. Nucleic Acids Research, 40(2):e15.
[4] Li, J. and Tseng, G.C. An adaptively weighted statistic for detecting differential gene expression when combining multiple transcriptomic studies. Annals of Applied Statistics. Accepted.
[5] Kui Shen and George C. Tseng. (2010) Meta-analysis for pathway enrichment analysis when combining multiple microarray studies. Bioinformatics, 26:1316-1323.
[6] Fisher, R. Combining independent tests of significance. American Statistician, 2(5):30, 1948.
[7] Benjamini, Y. and Hochberg, Y. Controlling the False Discovery Rate - a Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society Series B-Methodological, 57, 289-300, 1995.
[8] Stouffer, S., Suchman, E., DeVinney, L., Star, S. and Williams, J. The American Soldier, Volume I: Adjustment during Army Life. Princeton University Press, 1949.
[9] Hong, F., et al. (2006) RankProd: a bioconductor package for detecting differentially expressed genes in meta-analysis. Bioinformatics, 22, 2825-2827.