Software Tutorial for Microarray Meta-analysis
by line
August 29, 2012
Contents
1 Introduction
1.1 Background
1.2 How to get it
1.3 Reporting bugs and update
2 Getting Started
2.1 Graphical Interface Introduction
2.2 Prepare the data
2.3 Load Data and Preprocessing
2.3.1 Load Data
2.3.2 Preprocessing
2.4 Meta Analysis
2.4.1 MetaQC
2.4.2 MetaDE
2.4.3 MetaPath
3 Example
3.1 Load data
3.2 Preprocessing
3.3 MetaQC
3.4 MetaDE
3.5 MetaPath
1 Introduction
1.1 Background
Microarrays play an important role in genomic studies. With the rapid development of
high-throughput genomic technology, combining multiple studies is very important for
increasing statistical power. Many researchers have put forward methodology for dealing
with these genomic problems, and some of these methods have been released as packages
in languages such as R. They remain hard for experimental scientists to use, however,
because operating these languages is complicated and is not the focus of many such
researchers[1]. This tutorial therefore presents a user-friendly GUI software implementing
the metaOmics package written by. It is intended to be easy to use and to give results
that are easy to understand.
1.2 How to get it.
The software can be downloaded from http://www.biostat.pitt.edu/bioinfo/software.htm
1.3 Reporting bugs and update
Some settings may feel uncomfortable or inconvenient to use. If you find any, feel free
to let the author know and they will be improved in a future version. The author
sincerely appreciates any suggestions and feedback.
If you find any bugs, feel free to contact the author¹ and they will be fixed in the
next version. Thank you for your contribution.
¹ Please contact zhh18@pitt.edu
2 Getting Started
2.1 Graphical Interface Introduction
The GUI consists of 4 parts, as you can see when opening the software.
• Studies information. This field keeps track of the input data information,
including study name, gene size, sample size, platform name and the year the data
were generated. Keep in mind that the gene size may change if the user merges or
filters the data.
• Console window. This window reports what job the software is currently doing. If
the user clicks a button that is not allowed before another action has been taken,
this window shows a warning message.
• Load data panel. This panel allows the user to load data from the local computer.
Afterwards the user can preprocess the data, which is then ready for the
meta-analysis panel.
• Meta-analysis panel. There are currently 3 analysis tools available in this panel:
metaQC, metaDE and metaPath. The user can visualize and save the results.
2.2 Prepare the data
Data should be arranged in matrix format. Each row represents a gene and
each column represents a sample. There should also be proper gene names or probe ID
names, and sample names. See Figure 2.1.
Figure 2.1: two class. (a) data matched (b) data unmatched
Figure 2.2: survival data
At present only .txt and .csv files are allowed; support for xls files might be added in
the future. It is convenient to arrange the data using Excel: just open the
data in Excel and save it as txt or csv.
Different data types have subtly different formats.
For two class data, multiple class data and continuous data, the first row should be the
sample name and the second row should be the class label.
For survival data, the first row should be the sample name, the second row should be
time and the third row should be censoring status. See Figure 2.2.
For matched data, the first column should be the gene symbol. See Figure 2.1a.
For unmatched data, the first column should be the probe name and the second column
should be the gene symbol. See Figure 2.1b.
The rest should be the expression matrix of the study.
If there are missing values in the data, please mark them as NA, as is the convention.
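As an illustration of the layout described above, the following Python sketch reads a small two class, matched dataset in tab-delimited form. The sample names, gene symbols and values are invented for the example, the contents of the two header cells in the first column ("Symbol" and "label") are an assumption, and the reader function is hypothetical rather than part of the software.

```python
import csv
import io

# Hypothetical two class, matched dataset (contents invented for illustration):
# first column = gene symbols, first row = sample names, second row = class labels.
EXAMPLE = (
    "Symbol\tS1\tS2\tS3\tS4\n"
    "label\t0\t0\t1\t1\n"
    "TP53\t7.1\t6.9\tNA\t8.4\n"
    "BRCA1\t5.2\t5.0\t6.1\t6.3\n"
)

def read_two_class(handle):
    """Parse a tab-delimited two class file into samples, labels and a gene -> values dict."""
    rows = list(csv.reader(handle, delimiter="\t"))
    samples = rows[0][1:]                    # row 1: sample names
    labels = [int(x) for x in rows[1][1:]]   # row 2: class labels
    expr = {}
    for row in rows[2:]:                     # remaining rows: expression matrix
        gene, values = row[0], row[1:]
        # Missing values are marked NA in the file; keep them as None here.
        expr[gene] = [None if v == "NA" else float(v) for v in values]
    return samples, labels, expr

samples, labels, expr = read_two_class(io.StringIO(EXAMPLE))
print(samples, labels, expr["TP53"])
```

An unmatched file would carry one extra leading column (probe name first, gene symbol second), and a survival file would carry a third header row for censoring status.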
2.3 Load Data and Preprocessing
2.3.1 Load Data
• File type: two types of file can be read into the software: txt files and
csv files.
• Data type: four data types are allowed in the software: two class,
multiple class, continuous and survival.
• Logged: if the microarray data have been log transformed, please select this
checkbox. If not, please leave it unselected.
• Matched: if the probe IDs have not been mapped to unique gene symbols, please
unselect the matched checkbox. For each gene symbol with multiple probe IDs, the
software will then keep the probe with the greatest IQR value. If you have already
matched the probe IDs to gene symbols and each study has genes with unique gene
names, please select this checkbox.
• Add studies: select the data files and click on open. Each dataset should
be in its own file, and all files must be in the same directory.
• Add info (optional): additional information associated with each dataset can be
added into the software. This information includes the platform and the year in which
each dataset was generated. The file consists of one line per study. Each line holds
tab-delimited tokens: the first is the study name (author name), the second is
the platform and the third is the year.
• Confirm: after you add studies and info, click on the confirm button and the data
will be read into the software. If the dataset is big, this may take a while.
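The probe selection described under the Matched option can be sketched as follows. This is only an illustration of the idea of keeping the probe with the greatest IQR per gene symbol, not the software's own code: the function names are hypothetical, the quartile convention is an assumption, and missing values are assumed to be absent.

```python
import statistics

def iqr(values):
    """Interquartile range (Q3 - Q1); the 'inclusive' quartile convention is an assumption."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

def collapse_probes(rows):
    """rows: list of (probe_id, gene_symbol, values).

    Returns {gene_symbol: (probe_id, values)}, keeping for each gene symbol
    the probe whose expression values have the largest IQR.
    """
    best = {}
    for probe, gene, values in rows:
        score = iqr(values)
        if gene not in best or score > best[gene][0]:
            best[gene] = (score, probe, values)
    return {gene: (probe, values) for gene, (_, probe, values) in best.items()}

# Invented example: two probes map to the same gene symbol.
rows = [
    ("probe_1", "TP53", [1, 2, 3, 4, 5]),
    ("probe_2", "TP53", [0, 5, 10, 15, 20]),
    ("probe_3", "BRCA1", [2, 2, 2, 2, 2]),
]
collapsed = collapse_probes(rows)
print(collapsed["TP53"][0])
```

A probe with a larger IQR varies more across samples, which is why it is preferred as the representative of its gene.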
2.3.2 Preprocessing
• Merge: after you click the merge button, only the genes shared across all studies
are kept and the other genes are removed.
• Filter: the user can filter out unexpressed genes by their mean value and uninformative
genes by their standard deviation. Both the mean and the standard deviation filter
thresholds need to be specified.
• KNN imputation: if there are missing values in the datasets, please run
KNN imputation before using any further functions.
After this step, the data have been successfully loaded into the software and the
pre-processing part is finished. You can use the meta-analysis sections now.
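The merge and filter steps can be pictured with the short Python sketch below. It is an assumption-laden illustration: the function names are hypothetical, the filter is applied to a single study, and the way the software combines the filter across studies is not described in this tutorial.

```python
import statistics

def merge_studies(studies):
    """studies: {study_name: {gene: [values]}}. Keep only genes present in every study."""
    shared = set.intersection(*(set(genes) for genes in studies.values()))
    return {name: {g: genes[g] for g in sorted(shared)} for name, genes in studies.items()}

def filter_genes(study, mean_fraction, sd_fraction):
    """Drop the lowest mean_fraction of genes by mean, then the lowest
    sd_fraction of the remaining genes by standard deviation (assumed order)."""
    def drop_lowest(genes, score, fraction):
        ranked = sorted(genes, key=score)
        n_drop = int(len(ranked) * fraction)
        return set(ranked[n_drop:])

    kept = drop_lowest(study.keys(), lambda g: statistics.mean(study[g]), mean_fraction)
    kept = drop_lowest(kept, lambda g: statistics.pstdev(study[g]), sd_fraction)
    return {g: v for g, v in study.items() if g in kept}

# Invented data: only g2 and g3 are shared by both studies.
studies = {
    "A": {"g1": [1, 2], "g2": [3, 4], "g3": [5, 6]},
    "B": {"g2": [1, 1], "g3": [2, 2], "g4": [0, 0]},
}
merged = merge_studies(studies)

study = {"a": [0, 0, 0, 0], "b": [1, 1, 1, 1], "c": [5, 5, 5, 5], "d": [4, 8, 4, 8]}
filtered = filter_genes(study, 0.5, 0.5)
print(sorted(merged["A"]), sorted(filtered))
```

Because merging changes the set of genes, the gene size shown in the studies information field changes after this step.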
2.4 Meta Analysis
At present this software contains only 3 meta-analysis tools. In the future, it might be
extended with more. This tutorial shows how to use all three packages,
but users can run just one of them or several, as they wish.
2.4.1 MetaQC
MetaQC[3] uses 6 quantitative quality control measures to determine the rank
of each study. If a study has a high rank score, the study is probably
irrelevant to the other studies.
• Pathway database for EQCp: here the user needs to specify the pathway database used
for EQCp. Several databases are available, such as GO, Biocarta, KEGG and
Reactome. All of these databases can be downloaded from MSigDB. Users can also
use their own pathway database by clicking load, but the pathway database must be
in gmt file format. The user can use either a default pathway database or one
loaded from the local PC, but not both at the same time.
• Pathway information for CQCp and AQCp: here the user needs to specify the pathway
database used for CQCp and AQCp. The default databases available and the way to
load the user's own database are the same as in the previous item.
• Number of top pathways for EQCp: the number of top pathways used for the EQC
calculation. For good performance, this should be set to a reasonably small number.
• B for EQC: B is the number of permutations for the EQC calculation.
• Pval cut: p-value cutoff for the AQC calculation.
• Pval adjust: whether to use B-H adjustment[7].
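For readers unfamiliar with the B-H adjustment[7] mentioned in the last item, the sketch below shows the standard textbook procedure in Python. It is not taken from the software, and the function name is hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

adjusted = bh_adjust([0.01, 0.04, 0.03, 0.005])
print(adjusted)
```

Comparing the adjusted values with a cutoff controls the false discovery rate at that level.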
After setting these, you can safely click the metaQC button. It may take a while if the
number of permutations B is large or the datasets are big. When processing finishes, a QC
result table pops up showing six kinds of quality score, together with a PCA plot.
The user can save the PCA plot.
After you decide which studies are of poor quality, you can delete these studies
and load the data again by simply loading again. The gene information window will
then update.
2.4.2 MetaDE
In the MetaDE panel, the user can perform meta DE analysis on multiple studies. Several
individual study test methods and meta test methods are available, depending on the input
data type. The software helps users identify the differentially expressed genes and
provides detailed information about the genes. The p-value of each individual study
and the meta-analysis result are also given. The user can save the result and generate
a heatmap of the differentially expressed genes by controlling the false discovery rate.
• Individual test: there are three options: regular t-test, modified t-test and paired
t-test. For the paired t-test, samples must be correctly labeled.
• Individual tail: the user ought to specify what kind of tail is to be used. The default is abs.
• Meta-test: several options are available: maxP, minP, roP, Fisher[6],
AW[4], Stouffer[8], SR, PR, minMCC, FEM, REM and randProd[9], along with their OC
correction. If OC correction is selected, the user does not need to tell the software
what kind of tail to use because it will compare the results of both tails. Some
methods have an asymptotic result while others do not. For methods without an
asymptotic approach, the user has to use the permutation method and specify the number
of permutations. For methods with an asymptotic approach, the user
can still use the permutation method. If the roP method is selected, the number rth
must be specified, which is a number between 1 and the number of studies.
• Meta analysis: starts the metaDE analysis. A table pops up with a
list of all the shared genes, the p-value in each individual study, the meta
p-value and the meta q-value. The user can sort the genes by any p-value or
q-value column. Users can also click on a gene of interest to get detailed information
about the gene (the information is from Bioconductor org.Hs.eg.db).
• Save as file: the user can save the result as a csv file or txt file.
• Heatmap: the user can generate a heatmap of the DE genes at a given false discovery
rate. The heatmap can also be saved.
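To make the asymptotic meta tests more concrete, the sketch below gives textbook forms of four of the listed methods (Fisher, roP, maxP, minP) in Python. It assumes independent studies and p-values in (0, 1], the function names are hypothetical, and it is not the code used by the software.

```python
import math

def fisher(pvals):
    """Fisher's method: -2 * sum(log p) follows a chi-square distribution with 2k df."""
    k = len(pvals)
    half = -sum(math.log(p) for p in pvals)  # half of the Fisher statistic
    # The chi-square survival function has a closed form when the df is even.
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(k))

def rop(pvals, r):
    """r-th ordered p-value; under the null it follows Beta(r, k - r + 1)."""
    k = len(pvals)
    x = sorted(pvals)[r - 1]
    # Beta CDF with integer parameters, written as a binomial tail sum.
    return sum(math.comb(k, j) * x ** j * (1 - x) ** (k - j) for j in range(r, k + 1))

def max_p(pvals):
    """maxP is roP with r = k, which reduces to max(p) ** k."""
    return rop(pvals, len(pvals))

def min_p(pvals):
    """minP is roP with r = 1, which reduces to 1 - (1 - min(p)) ** k."""
    return rop(pvals, 1)

p = [0.1, 0.2, 0.5]
print(fisher(p), rop(p, 2), max_p(p), min_p(p))
```

The parameter r here plays the role of rth in the panel: r = 1 asks for a gene to be significant in at least one study, while r equal to the number of studies asks for significance in all of them.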
2.4.3 MetaPath
Pathway enrichment is an important method for validating whether the discovered DE
genes are reasonable. The MetaPath[5] section provides 3 Meta-Analysis Pathway Enrichment
(MAPE) tools, based on the gene level (MAPE_G), the pathway level (MAPE_P) and a hybrid
of both levels (MAPE_I). If the user has multiple studies, this section makes it easy to
detect pathways and visualize the result.
• Pathway database: first the user has to specify which pathway database to use. There
are 4 default pathway databases, the same as used in metaQC. Users can also use their
own pathway database by loading gmt files themselves. As before, the user can choose
the pathway database via only one of these methods.
• Permutation type: the user needs to specify the permutation type (by gene or by sample).
• Meta test: the user needs to specify the meta test method to be used; only 4
methods are available here (maxP, minP, roP, Fisher).
• Pathway gene size range: the user needs to give the range of the number of genes in a
pathway. The software will then filter out pathways with fewer genes than
the min size or more genes than the max size.
• Number of permutations: the number of permutations used during pathway detection.
• Meta pathway: click this button to start MAPE_G, MAPE_P and MAPE_I. It may
take a while if the number of permutations is large or the input dataset is big.
• Qvalue.cut: the user can obtain the desired pathways from the three methods by
controlling the false discovery rate.
• Plot: users can visualize these pathways under the given false discovery rate and
can also save the plot.
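The gmt format and the pathway gene size range can be illustrated with the Python sketch below. A gmt line holds a pathway name, a description and then the gene symbols, all tab-separated. The function names and example pathways are hypothetical, and whether the software counts all genes in a pathway or only those present in the data is an assumption not settled by this tutorial; the sketch counts all of them.

```python
def parse_gmt(lines):
    """Parse gmt lines (name, description, gene symbols; tab-separated) into {name: set of genes}."""
    pathways = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, genes = fields[0], fields[2:]
        pathways[name] = {g for g in genes if g}
    return pathways

def filter_by_size(pathways, min_size, max_size):
    """Keep pathways whose gene count lies within [min_size, max_size]."""
    return {n: g for n, g in pathways.items() if min_size <= len(g) <= max_size}

# Invented gmt content for illustration.
lines = ["PATHWAY_1\tdesc\tA\tB\tC\n", "PATHWAY_2\tdesc\tA\n", "\n"]
pathways = parse_gmt(lines)
kept = filter_by_size(pathways, 2, 5)
print(sorted(kept))
```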
3 Example
The software ships with some example data to show users how to use it. It is quite easy
to go through these examples. Double-click metaOmics.exe to open the software.
It may take a couple of seconds to load the required packages from R. See Figure 3.1.
There are 9 prostate cancer datasets in txt files in the default load folder. Their Probe_IDs
have been matched to gene symbols and the expression data have not been log2 transformed.
The data have two classes: benign tumors are labelled 0 and localized tumors are
labelled 1.
3.1 Load data
First the user needs to specify the file type as txt file and the data type as two class in
the related combo boxes, indicate that the data are matched by selecting the matched
checkbox, and indicate that the data are not log2 transformed by unselecting the logged
checkbox.
Click on the add studies button (see Figure 3.2) and an open-file dialog will pop up. In
the default folder, there are 9 studies. Select them all and then click on open. See
Figure 3.3a.
If you are not satisfied with what you have loaded, you can delete studies from the
studies information table by clicking the delete studies button. You need to make sure,
however, that all the studies in the information table are in the same directory.
The user can add external information about the studies by clicking the add info button.
This is optional and the requirements are explained in Section 2.3.1.
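As a hedged illustration of the optional information file, the Python sketch below parses tab-delimited lines of the form study name, platform, year. The study names and platforms shown are placeholders rather than the contents of the example folder, and the parser is hypothetical.

```python
def parse_study_info(text):
    """Return {study_name: (platform, year)} from tab-delimited lines, one line per study."""
    info = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore empty lines
        name, platform, year = line.split("\t")
        info[name] = (platform, int(year))
    return info

# Placeholder content: one line per study, three tab-delimited tokens.
INFO_TEXT = "StudyA\tPlatformX\t2005\nStudyB\tPlatformY\t2007\n"
info = parse_study_info(INFO_TEXT)
print(info)
```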
After these steps, click on the confirm button and the software will read in the
data. This may take a while if your dataset is large. After confirmation, the console
window will report that the load is complete. At this point the studies information window
is also updated to show the current study names, the gene size of each study, the sample
size and the external information. See Figure 3.4. After this step, you have finished
loading the data and can move on to the pre-processing section.
Figure 3.1: Open the metaOmics software
Figure 3.2: add studies
Figure 3.3: add studies and information dialog. (a) add studies dialog (b) add information dialog
Figure 3.4: after confirmation
3.2 Preprocessing
In this section, you will do some pre-processing before the meta-analysis.
The first thing you need to do is to merge the data by simply clicking on the merge
button. All the studies will then have the same set of genes, namely the
genes common to all the studies. The studies information window
will also update its information.
The second thing you need to do is to filter out unexpressed genes by their mean
value and uninformative genes by their standard deviation. Here in the example,
50% of genes are filtered out by mean value and afterwards 50% of genes are filtered out
by standard deviation value. The user can specify these numbers in the filter by mean(%)
text field and the filter by sd(%) text field. Then click on the filter button to finish the
filtering part. See Figure 3.5.
Finally you need to do KNN imputation in case there are missing values. Missing
values are not allowed in metaQC and metaPath. In metaDE, missing values
are allowed in some statistical tests, but for convenience the user should do KNN
imputation before any further action. After clicking this button, the pre-processing part
is done and the user can go to the 3 meta-analysis sections.
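The idea behind KNN imputation can be sketched in Python as follows. This is a simplified illustration under stated assumptions (genes as rows, distance computed over columns observed in both genes, a plain average of the k nearest neighbours); the routine actually used by the software may differ in its distance, weighting and defaults, and the function name is hypothetical.

```python
import math

def knn_impute(matrix, k=2):
    """matrix: list of gene rows, with None marking a missing value. Returns a filled copy."""
    def distance(a, b):
        # Root mean squared difference over the columns observed in both rows.
        pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not pairs:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in pairs) / len(pairs))

    filled = [row[:] for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbours: other genes that are observed in column j.
            candidates = sorted(
                (distance(row, other), other[j])
                for n, other in enumerate(matrix)
                if n != i and other[j] is not None
            )
            nearest = [v for d, v in candidates[:k] if d < math.inf]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled

# Invented data: the first gene is missing its third sample.
matrix = [
    [1.0, 2.0, None],
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 5.0],
    [9.0, 9.0, 9.0],
]
filled = knn_impute(matrix, k=2)
print(filled[0])
```

In this example the two genes closest to the first one supply the values 3.0 and 5.0, so the missing entry is filled with their average.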
3.3 MetaQC
This section allows users to perform metaQC, which includes 6 quality control measures.
EQCp, CQCp and AQCp additionally need the user to specify external pathway information,
so in the example a combination of default pathway databases is selected. See Figure 3.6.
The number of pathways for EQC is set to 5. In order to perform metaQC quickly in
the example, the permutation number B is set to 100. Pvalcut is set to 5%. For more
detailed information about these parameters please refer to Section 2.4.1. After specifying
these parameters, you can perform metaQC by clicking the metaQC button. The
progress of metaQC is shown in the console information window. When the process
is finished, two dialogs pop up. One dialog shows the rank scores of the metaQC result,
based on which the user can decide which studies are of bad quality (see Figure
3.7a). The other dialog is a PCA plot of the six quality scores, which the user can save
as a file on the local PC (see Figure 3.7b).
Figure 3.5: filter
Figure 3.6: metaQC panel
Figure 3.7: metaQC result. (a) metaQC 6 quality scores and rank (b) PCA plot of the metaQC result
The studies Nanni and Dhanasekaran have high metaQC rank scores and relatively small
sample sizes, which serves as evidence that these two studies have low correlation
with the other studies. It is reasonable to remove these two studies in order to achieve a
more consistent result. Therefore select these two studies in the studies information
window and click on the delete button (see Figure 3.8a), then confirm again and
merge the data. This time you will probably find that there are more shared genes
across the studies, which also indicates that the remaining studies are better correlated.
In the example we again filter 50% unexpressed genes by mean value and 50% uninformative
genes by standard deviation value, and then do the KNN imputation again. The
data after metaQC selection are then ready for further meta-analysis (see Figure 3.8b).
Figure 3.8: metaQC selections and reload data. (a) remove studies with poor metaQC performance (b) reload data
Figure 3.9: metaDE analysis parameter selection
3.4 MetaDE
After users load the data, or reload the data after metaQC, and complete the corresponding
pre-processing, it is time to perform the metaDE analysis. In the example,
individual test is set to regt, individual tail is set to abs, meta test is set
to roP, the asymptotic checkbox is selected and rth is chosen as 5 (see Figure 3.9).
Perform the metaDE analysis by clicking the meta analysis button. After a while, a dialog
pops up with all the shared genes, their p-values in each individual study, the meta
p-value and the meta q-value. The user can sort these genes alphabetically or by
the significance of any p-value (see Figure 3.10a). If the user wants to see more
information about a particular gene, simply click on the gene name and a dialog with
detailed gene information will pop up (see Figure 3.10b).
Figure 3.10: metaDE analysis result. (a) sort metaDE analysis result (b) get detailed information about a particular gene
By clicking the save as file button, the user can save the metaDE analysis result as a txt
or csv file. In the example, with the false discovery rate controlled at 5%, clicking on
the heatmap button generates the expression heatmap of all the differentially expressed
genes (see Figure 3.11).
3.5 MetaPath
In the example, the parameters are set as in Figure 3.12. The default pathway database KEGG
is chosen. Permutation type is set to gene and the meta statistic is set to
maxP. The lower limit on the number of genes in a pathway is 5 and the upper limit is 500.
In order to get the result quickly, the permutation number is set to 500. Then click on
the meta pathway button to perform MAPE_G, MAPE_P and MAPE_I separately. After it
finishes, the user can click on the qvalue cutoff button to get the significant pathways
under a certain q-value threshold. A commonly used threshold is 0.05. In our example,
however, no pathway is detected under such a threshold. If the qvalue cutoff is 1, all the
pathways are listed with their p-values in each individual study and the 3 MAPE results
(see Figure 3.13a). The user can also click on plot to display the significant pathways
below the given qvalue cutoff (see Figure 3.13b).
Figure 3.11: differentially expressed genes' heatmap
Figure 3.12: metaPath panel
Figure 3.13: meta pathway result. (a) metaPath result (b) metaPath plot
Bibliography
[1] George C. Tseng, Debashis Ghosh and Eleanor Feingold. Comprehensive literature
review and statistical consideration for microarray meta-analysis. Nucleic Acids
Research, 2012, Vol. 40, No. 9, 3785-3799.
[2] Xingbin Wang, Dongwan D. Kang, Kui Shen, Chi Song, Shuya Lu, Lun-Ching Chang,
Serena G. Liao, Zhiguang Huo, Shaowu Tang, Naftali Kaminski, Etienne Sibille, Yan
Lin, Jia Li, and George C. Tseng. An R package Suite for Microarray Meta-analysis
in Quality Control, Differentially Expressed Gene Analysis and Pathway Enrichment
Detection.
[3] Dongwan D. Kang, Etienne Sibille, Naftali Kaminski, and George C. Tseng. (2012)
MetaQC: Objective Quality Control and Inclusion/Exclusion Criteria for Genomic
Meta-Analysis. Nucleic Acids Research. 40(2):e15.
[4] Li, J. and Tseng, G.C. An adaptively weighted statistic for detecting differential gene
expression when combining multiple transcriptomic studies. Annals of Applied
Statistics. accepted.
[5] Kui Shen and George C. Tseng. (2010) Meta-analysis for pathway enrichment analysis
when combining multiple microarray studies. Bioinformatics. 26:1316-1323.
[6] Fisher, R. Combining independent tests of significance. American Statistician, 2(5):30,
1948.
[7] Benjamini, Y. and Hochberg, Y. Controlling the False Discovery Rate - a Practical
and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society
Series B-Methodological, 57, 289-300, 1995.
[8] Stouffer, S., Suchman, E., DeVinney, L., Star, S., and Williams, J. The American
Soldier, Volume I: Adjustment during Army Life. Princeton University Press, 1949.
[9] Hong, F., et al. (2006) RankProd: a bioconductor package for detecting differentially
expressed genes in meta-analysis, Bioinformatics, 22, 2825-2827.