Java程序辅导

C C++ Java Python Processing编程在线培训 程序编写 软件开发 视频讲解

客服在线QQ:2653320439 微信:ittutor Email:itutor@qq.com
wx: cjtutor
QQ: 2653320439
MICROARRAY DATA ANALYSIS TOOL USING JAVA AND R 
 
 
 
By 
 
Vasundhara Akkineni 
B.Tech, University of Madras, 2003 
 
 
 
 
A Thesis 
Submitted to the Faculty of the 
Graduate School of the University of Louisville 
In Partial Fulfillment of the Requirements 
for the Degree of 
 
 
 
 
 
MASTER OF SCIENCE 
 
 
 
Department of Computer Engineering and Computer Science 
University of Louisville 
Louisville, Kentucky 
 
 
 
May 2006 
 
 
 
 
 
 
MICROARRAY DATA ANALYSIS USING JAVA AND R 
 
 
By 
 
Vasundhara Akkineni 
B.Tech, University of Madras, 2003 
 
 
 
A Thesis Approved on 
 
 
 
April 14, 2006 
 
 
By the following Thesis Committee: 
 
 
 
 
 
Dr. Eric C. Rouchka, Thesis Director 
 
 
 
 
 
 
Dr. Dar-jen Chang 
 
 
 
 
 
 
Dr. Thomas Knudsen 
 
 
 
 ii
DEDICATION 
 
 
 
 
Dedicated to my parents 
  
 
Mr. Sarat Kumar Akkineni 
 
 
and 
 
 
Mrs. Surya Rani Akkineni 
 
 
Thanks for everything, papa and ma. 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 iii
ACKNOWLEDGEMENTS 
 
 
 
First, I would like to thank my thesis director, Dr. Eric C. Rouchka for his direction, 
assistance and guidance. I would also like to thank the members of my thesis committee, 
Dr. Dar-jen Chang and Dr. Thomas Knudsen for their time. I thank Tim Hardin, 
Elizabeth Cha, Yamini Rudraraju and Eric Stutzenberger for making the Bioinformatics 
lab an enjoyable place to come to each day. Additional thanks to the other members of 
the Bioinformatics Research Group. I thank all the friends I have made over the years at 
the University of Louisville. I have learned a lot from each one of you. Finally, I thank 
my parents with all due respect for their love, support and encouragement. Support for 
this project was provided by NIH-NCRR grant # P20 RR16481 (Nigel Cooper, PI). 
 
 
 
 
 
 
 
 
 
 
 
 
 
 iv
ABSTRACT 
MICROARRAY DATA ANALYSIS USING JAVA AND R 
VASUNDHARA AKKINENI 
APRIL 14, 2006 
 
Microarray technology has become an essential tool in functional genomics for 
monitoring the expression of many genes in parallel.  Gene expression values obtained 
from microarray experiments help biologists to understand the way in which a cell 
responds to varying conditions (including, but not limited to development over time, 
response to environmental stimuli, or disease states) by analyzing the increase or 
decrease in the expression level of genes. We have developed web-based software that 
provides biologists with several statistical solutions for analyzing gene expression data. 
This platform independent java servlet first performs normalization of the gene 
expression values in order to eliminate any systematic bias in the measured intensity 
values arising from the microarray process. Several normalization methods like Total 
Intensity Normalization, Median Normalization and Lowess Normalization have been 
implemented. After normalization, visualization of the experimental data can be 
performed using scatter plots, MA plots, RI plots and image maps of the intensity ratios. 
For detection of genes which are differentially expressed the software provides fold-
change detection and t-test techniques. The tool also provides the users the ability to 
create a workflow of the different analysis tools used to study the uploaded 
 v
data. All the statistical routines used in this software were developed in R called from 
Java code. This software is a freely available tool to statistically analyze microarray 
experiments. 
 vi
TABLE OF CONTENTS 
DEDICATION……………………………………………………………………………iii 
ACKNOWLEDGEMENTS………………………………………………………………iv 
ABSTRACT……………………………………………………………………………….v 
LIST OF TABLES………………………………………………………………………..xi 
LIST OF FIGURES...…………………………………………………………….……...xii 
1. INTRODUCTION ...................................................................................................... 1 
1.1 Overview of molecular biology .......................................................................... 4 
1.2 DNA.................................................................................................................... 4 
1.3 RNA .................................................................................................................... 6 
1.4 mRNA................................................................................................................. 6 
1.5 Gene .................................................................................................................... 7 
1.6 Central Dogma of Molecular Biology ................................................................ 7 
1.7 MicroRNA (miRNA) .......................................................................................... 8 
1.8 Microarrays ......................................................................................................... 9 
2. MICROARRAY ANALYSIS TECHNIQUES......................................................... 13 
2.1 Microarray data analysis ................................................................................... 13 
2.2 Log ratios .......................................................................................................... 16 
2.3 Normalization ................................................................................................... 17 
2.3.1 Total intensity normalization .................................................................... 17 
2.3.2 Median normalization ............................................................................... 18 
2.3.3 Lowess normalization ............................................................................... 18 
2.4 Scatter plot ........................................................................................................ 19 
 vii
2.5 MA plot............................................................................................................. 20 
2.6 RI plot ............................................................................................................... 21 
2.7 Difference between MA and RI plots ............................................................... 22 
2.8 Identifying differentially expressed genes ........................................................ 23 
2.8.1 Fold change............................................................................................... 23 
2.9 Clustering.......................................................................................................... 24 
2.10 Types of clustering............................................................................................ 24 
2.10.1 Hierarchical clustering .............................................................................. 25 
2.10.2 Dendrogram .............................................................................................. 26 
2.10.3 Heat maps.................................................................................................. 27 
3. LITERATURE REVIEW ......................................................................................... 29 
3.1 Bioconductor..................................................................................................... 29 
3.2 TM4 and MIDAS.............................................................................................. 31 
3.3 BASE: BioArray Software Environment.......................................................... 32 
3.4 WebArray: an online platform for microarray data analysis ............................ 32 
3.5 SNOMAD (Standardization and Normalization of MicroArray Data)............. 33 
4. IMPLEMENTATION SPECIFICS .......................................................................... 35 
4.1 R........................................................................................................................ 35 
4.1.1 Statistics and R.......................................................................................... 36 
4.1.2 R and Windows™..................................................................................... 36 
4.2 Rserve ............................................................................................................... 37 
 viii
4.2.1 Installation of Rserve ................................................................................ 38 
4.3 Java ................................................................................................................... 39 
4.3.1 Java language ............................................................................................ 39 
4.3.2 Java platform............................................................................................. 39 
4.4 Java servlets ...................................................................................................... 40 
4.5 JSP (Java Server Pages) .................................................................................... 42 
4.6 JDBC (Java Database Connectivity)................................................................. 42 
4.7 MySQL ............................................................................................................. 43 
4.8 Apache Tomcat ................................................................................................. 44 
5. OBJECTIVES AND RESULTS............................................................................... 46 
5.1 Uploading experiment data ............................................................................... 47 
5.2 Normalization methods..................................................................................... 48 
5.3 Data visualization.............................................................................................. 50 
5.3.1 Scatter plot ................................................................................................ 50 
5.3.2 MA plot..................................................................................................... 51 
5.3.3 RI plot ....................................................................................................... 52 
5.4 Creating a process pipeline ............................................................................... 53 
5.5 Identifying genes of interest.............................................................................. 57 
5.5.1 Fold change cut-off ................................................................................... 57 
5.6 Clustering genes................................................................................................ 60 
5.7 Top and bottom intensity ratios ........................................................................ 61 
 ix
5.8 Search genes...................................................................................................... 63 
5.9 Output results as files........................................................................................ 65 
6. CONCLUSIONS....................................................................................................... 67 
6.1 Possible improvements ..................................................................................... 68 
REFERENCES…………………………………………………………………………..70 
APPENDICES…………………………………………………………………………...72 
CURRICULUM VITAE…………………………………………………………………77 
 
 
 
 
 
 
 
 
 x
LIST OF TABLES 
Table 2.1 - Gene expression matrix with raw gene expression data................................. 14 
Table 2.2 - Gene expression matrix with intensity ratio values........................................ 15 
Table 2.3 - Gene expression matrix with log 2 intensity ratio values ............................... 16 
Table 3.1 - Bioconductor packages................................................................................... 30 
Table 5.1 - Color coding scheme for differentially expressed genes using fold change .. 57 
 xi
LIST OF FIGURES 
Figure 1.1 - An overview of the formation of proteins....................................................... 4 
Figure 1.2 - DNA double helix structure ............................................................................ 5 
Figure 1.3 - Formation of mRNA ....................................................................................... 6 
Figure 1.4 - Central Dogma of Molecular Biology............................................................. 8 
Figure 1.5 - Preparation of microarrays............................................................................ 10 
Figure 2.1 - Process of obtaining a gene expression matrix ............................................. 13 
Figure 2.2 - Effects of lowess normalization .................................................................... 19 
Figure 2.3 - A scatter plot ................................................................................................. 20 
Figure 2.4 - An MA plot ................................................................................................... 21 
Figure 2.5 - An RI plot...................................................................................................... 22 
Figure 2.6 - Construction of a two-dimensional dendrogram representing a hierarchical 
cluster of related genes...................................................................................................... 26 
Figure 2.7 - A heat map with a dendrogram and a color key............................................ 28 
Figure 4.1 - R command line interface on startup ............................................................ 37 
Figure 4.2 - Java servlet execution process ...................................................................... 41 
Figure 4.3 - The three-tier architecture of a JDBC connection......................................... 43 
Figure 5.1 - Sample data file for analysis ......................................................................... 47 
Figure 5.2 - Process of uploading data files...................................................................... 48 
Figure 5.3 - Normalized data file using total intensity normalization .............................. 49 
Figure 5.4 - Normalized data file using median normalization ........................................ 49 
 xii
Figure 5.5 - Total intensity normalization scatter plots and text files .............................. 50 
Figure 5.6 - matrix of scatter plots with a zoomed out portion for two specific 
experiments ....................................................................................................................... 51 
nn×
Figure 5.7 - matrix of MA plots with a zoomed out portion for two specific 
experiments ....................................................................................................................... 52 
nn×
Figure 5.8 - matrix of RI plots with a zoomed out portion for two specific 
experiments ....................................................................................................................... 53 
nn×
Figure 5.9 - Steps for pipelining analysis ......................................................................... 54 
Figure 5.10 - Pipeline screen to define a sequence of routines to be performed on the data
........................................................................................................................................... 55 
Figure 5.11 - Results screen after submitting the pipeline screen .................................... 56 
Figure 5.12 - Color based image map of the gene’s intensity ratios between two 
experiments ....................................................................................................................... 59 
Figure 5.13 – Individual gene details................................................................................ 59 
Figure 5.14 - Heat map with a dendrogram to represent expression clusters ................... 61 
Figure 5.15 - User selected columns for calculating the intensity ratios and the number of 
top/bottom genes needed................................................................................................... 62 
Figure 5.16 - Results showing top 10 ratios and bottom 25 ratios between two 
experiments ....................................................................................................................... 63 
Figure 5.17 - Search screen with the list of genes from the uploaded data file ................ 64 
Figure 5.18 - Output screen for a search done on gene information................................. 65
 xiii
1. INTRODUCTION 
Two complementary advances, one in knowledge and one in technology, are 
greatly facilitating the study of gene expression and the discovery of the roles played by 
specific genes in the development of disease.  As a result of the Human Genome 
Project[1], there has been an explosion in the amount of information available about the 
DNA sequence of the human genome, including identification of a large number of genes 
within these previously unknown sequences. The challenge currently faced by scientists 
is to find a way to organize and catalog this vast amount of information into a usable 
form. The full impact of the Human Genome Project will be realized only after the 
functions of the new genes are discovered.  
With this vast amount of information comes the need for tools to make sense of 
the data. This led to the second advance which facilitated the identification and 
classification of the DNA sequence information and the assignment of functions to these 
new genes- the DNA microarray technology.  With the invention of the DNA chip, 
researchers have gone from looking at genes one at a time to tens of thousands at a 
time[2]. In order to really understand a genome, scientists need to understand how genes 
interact with each other and which genes are present under different conditions. This can 
be done by measuring the amount of each mRNA present in the cell. Microarrays enable 
us to measure this for thousands of genes simultaneously.  With the aid of a computer, the 
amount of mRNA bound to the spots on the microarray is precisely measured, generating  
 1
a profile of gene expression in the cell. Microarrays generate huge amounts of valuable 
data and the handling and analysis of such data is becoming one of the major bottlenecks 
in the utilization of the technology. The raw microarray data are images, which have to 
be transformed into gene expression matrices—tables where rows represent genes, 
columns represent various samples such as tissues or experimental conditions, and 
numbers in each cell characterize the expression level of the particular gene in the 
particular sample. These matrices have to be analyzed further, if any knowledge about the 
underlying biological processes is to be extracted and this forms the basis for my thesis- 
microarray data analysis.  
The data analysis process constitutes the analysis of the gene expression matrix 
using either supervised or unsupervised methods. Among the many statistical packages 
available for data analysis, ‘R’ is a statistical package which is widely used for the 
analysis of microarray data[3]. Several open source software are available which perform 
data analysis using R functionality as their base. Most of these packages either require 
some hands on programming experience and syntactical knowledge of the software in 
order to perform the analysis of the microarray data or are platform dependent and not 
universally available for all types of users.  
The Bioinformatics Research Group (BRG) [http://kbrin.a-
bldg.louisville.edu/brg/], which is a joint collaboration between the Speed School of 
Engineering and the School of Medicine at the University of Louisville, came up with the 
initiative for developing user-friendly software that can be used by biologists who 
generally lack programming knowledge. This thesis work is concentrated on developing a 
web based java tool which allows users to upload their data files in the format of a gene 
 2
expression matrix and then performs normalization of the data, produces plots to 
visualize the data, perform clustering of similar patterns of differentially expressed genes 
and lets users to save their results to a text file.  
It should be noted that a good understanding of these methods and the biology 
behind the data is needed to choose the most appropriate for solving a particular problem. 
The rest of chapter one is devoted to an overview of molecular biology, including a 
discussion of DNA, RNA, genes and microarrays. Chapter two discusses microarray 
analysis techniques, including an overview of log ratios, normalization, visualization 
plots, differentially expressed genes using the fold change method, and clustering. 
Chapter three is a literature review of existing microarray data analysis software, their 
drawbacks and how the system being developed caters to the needs of the user who lacks 
programming expertise. Chapter four gives a detailed description of the software used for 
the development of the microarray analysis tool. Installation and implementation 
specifics are also covered in a detailed manner. An overall discussion of the system being 
developed in the form of its objectives and the results obtained is dealt with in chapter 
five. Conclusions and further improvements to the microarray data analysis tool are 
discussed in chapter six. A detailed glossary of terms is also available as part of this 
thesis for the reader’s reference. 
 
 
 
 
 
 3
1.1 Overview of molecular biology 
Every cell in an organism contains a full set of chromosomes and identical genes. 
At a given point of time, only a subset of these genes is active. These genes define certain 
unique properties of a cell type. The information contained in the DNA is transcribed into 
messenger RNA (mRNA) molecules, which are then translated into proteins, which 
perform most of the important functions of the cell. Figure 1.1 illustrates this process.  
Cell Nucleus
Chromosome
Protein Gene (DNA)Gene (mRNA), 
single strand  
Figure 1.1 - An overview of the formation of proteins 
 
1.2 DNA 
Deoxyribonucleic Acid (DNA) is the basis for the building blocks encoding the 
information of life.  A single stranded DNA molecule, called a polynucleotide or 
oligomer, is a chain of small molecules called nucleotides.  There are four different 
nucleotides, or bases:  adenosine (A), cytosine (C), guanine (G) and thymine (T).   
 4
Stringing together a simple alphabet of four characters together we can get 
enough information to create a complex organism. The ends of the polynucleotide are 
marked either 3’ or 5’.  The general convention is to label the coding strand from 5’ to 3’ 
(left to right).  For instance, the following is a polynucleotide: 
5’ G→T→A→A→A→G→T→C→C→C→G→T→T→A→G→C 3’ 
DNA can be either single-stranded or double stranded.  When DNA is double-
stranded, the second strand is referred to as the reverse complement strand. 
Complementary bases are determined by which pairs of nucleotides can form bonds 
between them.  In the case of DNA, A binds to T, and C binds to G.  For the 
polynucleotide given above, the double-stranded polynucleotide is as follows: 
5’ G→T→A→A→A→G→T→C→C→C→G→T→T→A→G→C 3’ 
|       |      |     |      |      |      |      |      |      |      |      |      |      |      |     | 
3’ C←A←T←T←T←C←A←G←G←G←C←A←A←T←C←G 5’ 
Two complementary polynucleotide chains form a stable structure known as the 
DNA double helix (Figure 1.2). Using the double stranded molecule as a template, 
proteins will be produced for active genes with the help of RNA molecule. 
 
 
Image source: www.genecrc.org/site/ lc/lc2b.htm
Figure 1.2 - DNA double helix structure 
 5
1.3 RNA 
Ribonucleic Acid (RNA) is similar to DNA in the fact that it is constructed from 
nucleotides.  However, instead of thymine (T), an alternative base uracil (U) is found in 
RNA.  RNA can be found as double-stranded or single-stranded, and can also be part of a 
hybrid helix where one strand is an RNA strand and the other is a DNA strand. RNA is 
important in the cell and contributes in a variety of ways.  One of the most important 
roles of RNA is in protein synthesis. Two of the major RNA molecules involved in 
protein synthesis are messenger RNA (mRNA) and transfer RNA (tRNA). 
1.4 mRNA 
Messenger RNA (mRNA) is a linear molecule encoding genetic information 
copied from DNA molecules. DNA is copied into a single stranded mRNA molecule by 
the transcription process. This occurs as follows. Genes consist of coding regions called 
exons and non-coding regions called introns. mRNA processing removes introns and 
splices the exons together. Processed mRNA can be translated into a protein sequence. 
 
 
 
 
 
 
Source: http://www.ebi.ac.uk/microarray/biology_intro.html
Figure 1.3 - Formation of mRNA 
 
 6
Therefore, in order to determine which genes are active in a cell (i.e., those that are 
producing a protein product) one can measure the amount of mRNA present. This gives 
an approximation of the activity of individual genes in a cell. 
1.5 Gene 
A gene can be described as the physical and functional unit of heredity that carries 
information from one generation to the next[4].  A gene can be thought of as the DNA 
sequence necessary for the synthesis of a functional protein or RNA molecule. Proteins 
are important components of the body that determine how the different kinds of 
molecules in the body are organized and act. Thus, proteins play a key role in the way we 
look and the also in making us a unique individual. Genes are expressed as proteins, a 
complex process consisting of two main steps: Each gene (DNA) is converted 
(transcribed) into messenger RNA (mRNA), RNA that serves as a template for protein 
synthesis. The resulting mRNA then guides the synthesis of a protein through a process 
called translation. Thus isolating the mRNA helps us to find expressed genes from the 
human genome. 
1.6 Central Dogma of Molecular Biology 
The Central Dogma of Molecular Biology states that the region of a double 
stranded DNA molecule that corresponds to a gene is copied, or transcribed, to a 
complementary single stranded mRNA molecule[5]. The single stranded mRNA 
molecule then gets translated to a protein (Figure 1.4). If mRNA molecules can be 
identified, the expression level of the corresponding genes can be determined. 
 7
 Source: http://www.accessexcellence.org/RC/VL/GG/images/central.gif
Figure 1.4 - Central Dogma of Molecular Biology 
1.7 MicroRNA (miRNA) 
A miRNA is a form of single-stranded RNA which is typically 20-25 nucleotides 
long, and is thought to regulate the expression of other genes[6]. miRNAs are RNA genes 
which are transcribed from DNA, but are not translated into protein. The DNA sequence 
that codes for a miRNA gene is longer than the miRNA. This DNA sequence includes the 
miRNA sequence and an approximate reverse complement. When this DNA sequence is 
transcribed into a single stranded RNA molecule, the miRNA sequence and its reverse-
complement base pair to form a double stranded hairpin loop which is a primary miRNA 
structure (pri-miRNA). Pri –miRNAs are processed in the nucleus into hairpin RNAs 
called Pre-miRNAs. The pre-miRNA molecule is then actively transported out of the 
nucleus by a carrier protein. Thus through a mechanism that is not fully characterized, the 
 8
bound mRNA remains untranslated resulting in reduced expression of the corresponding 
gene. 
The function of miRNAs appears to be in gene regulation. miRNAs have been 
reported to be critical in the development of organisms; they are differentially expressed 
in tissues and are involved in viral infection processes. In the past two to three years, a 
great deal of effort has gone in understanding how, when and where miRNAs are 
produced and functions in cells, tissues and organisms. Several research groups have 
provided evidence that miRNAs may act as key regulators of processes as diverse as 
early development, cell proliferation and cell death, apoptosis and fat metabolism and cell 
differentiation. There is speculation that the role of miRNAs in regulating gene 
expression could be as important as that of transcription factors. The discovery of 
miRNAs and their functions has added insight into how gene regulation is much more 
complex than the Central Dogma of Molecular Biology previously led biologists to 
believe. 
1.8 Microarrays 
Microarrays, developed in the lab of professor Patrick Brown at Stanford, in the 
early 1990’s, took molecular biology by storm[7]. They are small slides spotted with 
fixed samples of DNA, each for a different gene. When a researcher prepares a labeled 
cell extract and incubates it with the slide, messengers in the sample anneal to the fixed 
DNA, showing which genes in the sample are active. Microarray technology helps to 
identify genes that are expressed under different conditions such as during the stages of a 
cell cycle, under different environmental conditions, under diseased states at a particular 
 9
time, or under different tissue or cell types. A microarray is typically a glass slide, on to 
which DNA, cDNA or Oligonucleotide molecules are attached at fixed locations (spots).  
There may be tens of thousands of spots on an array, each containing a huge 
number of identical DNA molecules of varying lengths. For gene expression studies, each 
of these molecules ideally should uniquely identify a single gene in the genome. 
Microarrays are used to compare gene expression levels in two different samples, for 
example, a cell in a healthy state and a diseased state. A microarray employs the ability of 
a given mRNA molecule to bind specifically to, or hybridize to, the complementary DNA 
from which it is originated. 
 
Source: http://www.bioteach.ubc.ca/MolecularBiology/microarray/index.htm
Figure 1.5 - Preparation of microarrays 
 
 10
 A microarray contains many DNA sequences, and the expression levels of 
thousands of genes can be determined in a single experiment by measuring the amount of 
mRNA bound onto each spot of the array. Arranged systematically, the particular 
sequences can be identified by the location of the spots on the slide. 
For two channel experiments, the relative abundance of each of the gene-specific 
sequences in two RNA samples (test and reference) may be estimated by fluorescently 
labeling the samples, mixing them and hybridizing them to the sequences on the glass 
slide. The two samples of mRNA from the cells (target) are reverse transcribed into 
cDNA, and labeled using two different dyes (red Cyanine 5 and green Cyanine 3). 
Usually, the reference sample is labeled Cy3 and the test sample with Cy5. The mixture 
reacts with the spotted cDNA sequences (probes). This results in cDNA sequences from 
the targets and the probes base-pairing with one another. After this hybridization step is 
complete, the microarray is placed in a scanner, consisting of lasers with different 
wavelengths, a microscope and a camera. The slide is scanned twice, first using one 
colored laser and then the second. Laser light excites the fluorescent dyes, Cy3 is excited 
by green laser light and Cy5 is excited by red laser light[4]. Green spots indicate that the 
test substance has lower activity than the reference substance, red spots indicate that the 
test substance is more abundant than the reference substance; yellow spots mean that 
there is no change in the activity level between the two populations of test and reference 
substance. Black represents areas where neither test nor control substance has bound to 
the target DNA. The process of creating and labeling a microarray can be observed in 
Figure 1.5. 
 11
Having an introduction to the central dogma of molecular biology, genes, 
microarrays, their preparation process and uses, the next chapter introduces microarray 
data analysis and the techniques used to analyze microarray data for obtaining useful 
information and knowledge about the underlying biological processes. 
 
 
 
 
 
 
 
 
 
 
 
 12
2. MICROARRAY ANALYSIS TECHNIQUES 
2.1 Microarray data analysis 
Analysis of microarray data is performed to identify which genes are involved in 
the process being studied.  It involves statistical analysis by various graphical and 
numerical means to select differentially expressed (DE) genes or to find groups of genes 
whose expression profiles can reliably classify the different RNA sources into 
meaningful groups. The analysis of gene expression data is performed by constructing the 
gene expression matrix that describes spot quantitations from different hybridizations. 
The process of constructing a gene expression matrix from the raw microarray data is 
summarized in Figure 2.1. 
 
 
 
 
 
 
 
 
 
Figure 2.1 - Process of obtaining a gene expression matrix 
 
 13
A gene expression matrix is a matrix, in which the first column represents the 
gene names, and the subsequent columns represent the different experimental conditions 
and the cell values usually represent the gene expression value for the given experiment. 
Given in Table 2.1 is a gene expression matrix with sample gene expression values. 
 
 
 
 
 
 Exp 1 Exp 2 Exp 3 Exp 4 Exp 5 Exp 6 
Gene 1 403 409.3 611.5 569.2 536.6 580.2 
Gene 2 757.3 574.4 826.7 595.3 755.2 956 
Gene 3 284.4 327.3 421.6 336.6 391.3 412.6 
Gene 4 2314.2 1685.3 2264.7 2204.1 2233.1 2458.4 
Gene 5 1574.5 1273 1484.6 1321.2 1474.7 1774.1 
Gene 6 2333.7 1796.8 2464.5 2372.5 2095.9 2735.7 
Table 2.1 - Gene expression matrix with raw gene expression data 
The cell values can also be the intensity ratios of the particular experiment with a 
preset experiment value. In the gene expression matrix in Table 2.2, the cell values are 
the intensity ratios of all the other experiments with experiment 1, calculated from the 
expression matrix of Table 2.1. 
 
 
 
 
 
 
 14
 Exp 2/Exp 1 Exp 3/Exp 1 Exp 4/Exp 1 Exp 5/Exp 1 Exp 6/Exp 1
Gene 1 1.015633 1.51737 1.412407 1.331514 1.439702 
Gene 2 0.758484 1.091641 0.786082 0.997227 1.26238 
Gene 3 1.150844 1.482419 1.183544 1.375879 1.450774 
Gene 4 0.728243 0.97861 0.952424 0.964955 1.062311 
Gene 5 0.808511 0.942903 0.839124 0.936615 1.12677 
Gene 6 0.769936 1.056048 1.016626 0.898102 1.172259 
 
Table 2.2 - Gene expression matrix with intensity ratio values 
Data analysis is based on the hypothesis that there are biologically relevant 
patterns to be discovered in the data. The microarray data analysis process depends on the 
analysis of the gene expression matrix using both supervised and unsupervised methods. 
Most data analysis methods use raw expression values, intensity ratios or both for their 
analysis routines. Data analysis methods take these huge sets of data as input and produce 
both visual and numerical results for interpretation and further analysis. The most 
commonly used microarray data analysis methods include log ratios of the gene intensity 
data in order to spread the values across a given range, normalization to identify and 
remove bias from the data, diagnostic plots of the microarray data for visualization 
purposes, and methods to identify differentially expressed genes and clustering of genes 
with similar behavior patterns. Each of these analysis techniques are discussed in detail in 
the following sections. 
 15
2.2 Log ratios 
A logarithmic transformation produces a continuous spectrum of values and treats 
up and down regulated genes evenly across a range [8]. Equation 2.1 shows the formula 
for calculating log2  ratios. 
i
i
i
G
RX 2log=     [Equation 2.1] 
 
Where i=1, 2…..Ngenes and  Ri , Gi are the measured intensity values for gene ‘i’ from two 
different experimental conditions. 
By using  log2 values, X=0 represents equal expression, X=1 represents up 
regulation by a factor 2, X=-1 down regulation by a factor 2, X=2 up regulation by factor 
4, and so on. Additionally, calculating log 2 values spreads the values more evenly across 
the intensity range and provides better visualization of the data and it tends to make the 
variability of data more constant over the intensity range[9].  Given in Table 2.3 are the 
log2 values of the intensity ratios calculated in Table 2.2. 
 Exp 2/Exp 1 Exp 3/Exp 1 Exp 4/Exp 1 Exp 5/Exp 1 Exp 6/Exp 1
Gene 1 0.02237883 0.60157266 0.49815582 0.413067216 0.52577046 
Gene 2 -0.39880918 0.12649896 -0.34724803 -0.004006164 0.33614569 
Gene 3 0.20269214 0.56795340 0.24311371 0.460353645 0.53682236 
Gene 4 -0.45750812 -0.03119360 -0.07032387 -0.051465694 0.08720612 
Gene 5 -0.30666134 -0.08481948 -0.25304488 -0.094472262 0.17219357 
Gene 6 -0.37718928 0.07867587 0.02378897 -0.155049228 0.22929092 
 
Table 2.3 - Gene expression matrix with log 2 intensity ratio values 
 16
2.3 Normalization 
The goal of normalization is to identify and remove any systematic bias in the 
measured fluorescence intensities, arising from variation in the microarray process rather 
than from biological differences between the RNA samples or the printed probes[10;11]. 
Sources of bias include: 
• labeling efficiencies of the dyes 
• different amounts of Cy3 and Cy5 labeled mRNA 
• scanning parameters 
• spatial or plate effects, print tip effects, etc. 
In the normalization process, a normalization factor (also referred to as scaling 
factor) is calculated and is multiplied to all the values of an experiment. Either of the 
experiments which are being compared can be multiplied with the normalization factor. 
This process is the same as taking a constant value away from the log of the normal ratio. 
2.3.1 Total intensity normalization 
Total intensity normalization computes the normalization factor by summing the 
measured intensities in both the experiments considered[10]. This is shown in Equation 
2.2, 
∑
∑
=
==
array
k
k
array
k
k
total
N
G
N
R
N
1
1   [Equation 2.2] 
where Narray is the total number of genes, Gk and Rk are the measured intensity values of 
the kth gene in both the experiments. The intensities are then rescaled such that Gk’ = Ntotal 
 17
Gk   and   Rk’= Rk and the normalized expression ratio for each feature are calculated 
(Equation 2.3). 
k
k
totalk
k
k
G
R
NG
RT 1
'
'' ==   [Equation 2.3] 
 
This is equivalent to 
)(log)(log)'(log 222 totalkk NTT −=  
2.3.2 Median normalization 
In median normalization the normalization factor is found by calculating the 
median of the array in question. Hence the equation becomes  
akk medianTT −= )(log)'(log 22  where a is the experiment array. 
The advantage of using the median normalization is that it is insensitive to outliers which 
occur commonly in microarray data sets. 
2.3.3 Lowess normalization 
Lowess stands for Locally Weighted Linear Regression. It is also referred to as 
Loess. Lowess uses a linear regression model whereas Loess uses a quadratic regression 
model. The lowess normalization procedure subtracts a Lowess regression curve from the 
data to normalize it[10;12]. 
 
 
 
 
 
 18
  
 
 
Figure 2.2 - Effects of lowess normalization 
 
A Lowess curve is first drawn on the RI Plot. The lowess curve is calculated by a 
regression process which calculates the dependence of the ratio on the intensity and puts 
it in a mathematical context. 
The dependence,  for each gene (i) is calculated by observing its distance 
from the curve. On subtracting the dependence from the observed log
)( ixy
2 ratios, the 
equation becomes: 
))(^2(log)(log)()(log)'(log 2222 ikikk xyTxyTT −=>−=   [Equation 2.4] 
Figure 2.2 shows the effect of lowess regression on a set of data. The plot on the right 
hand side is the RI plot itself and the plot on the left hand side is the RI plot fitted with 
the lowess curve. Lowess detects the systematic deviations in the RI plot and corrects 
them by carrying out a local weighted linear regression function given by Equation 2.4, 
and uses this function, point by point, to correct the measured ratio values. The results of 
applying such a lowess correction can be seen in the left hand side plot of Figure 2.2. 
Lowess analysis is used as a normalization method that can remove intensity dependant 
effects in the log2 ratio values. 
2.4 Scatter plot 
The scatter plot is an important graphical tool for studying the spread and linearity 
of data[8]. In its simplest form, two variables are plotted along the axes, and marks are 
 19
drawn according to these coordinates. The intensity values of genes under different 
experiments can be depicted as a scatter plot. A scatter plot is straightforward, but very 
high correlation between the two experimental intensity values makes the features of the 
plot difficult to discern. In an ideal scatter plot, all the spots are clustered around the 
diagonal line representing y=x. Figure 2.3 shows a scatter plot with most of the data 
points clustered around the diagonal line. 
 
Figure 2.3 - A scatter plot 
 
2.5 MA plot 
An MA plot is a scatter plot with transformed axes[8]. The X-axis conforms to the 
logged total intensity value of the two experiments; the Y-axis shows exactly the log-
ratio of the two experiments. MA plots are used to identify spot artifacts and detect 
intensity-dependant patterns in the log ratios. Since the interest lies in deviations of the 
points from the diagonal line, it is beneficial to rotate the axes and re-scale the axes as in 
the MA plot. The MA plot serves to increase the room available to represent the range of 
differential expression and makes it easier to see non-linear relationships between the log 
intensities. The MA plot in Figure 2.4 shows the differentially expressed genes more 
 20
clearly than the scatter plot in Figure 2.3. If an MA plot clearly shows the dependence of 
the log ratio M on overall spot intensity A, this suggests that intensity or ‘A’ dependent 
normalization method may be preferable. 
 
Figure 2.4 - An MA plot 
 
2log1log 22 ExpExpM −=  )2log1(log
2
1
22 ExpExpA +=  
2.6 RI plot 
A ratio-intensity (RI) plot is also a scatter plot like the MA plot that shows the 
intensity specific effects for all the genes by plotting the log ratio as a function of the 
product of the intensities[9;12]. RI plots are used to determine if there is a rough 
correlation between the total intensity of a spot and its ratio. The easiest way to visualize 
intensity-dependent effects, and the starting point for the lowess analysis described in 
section 2.3.3, is to plot the measured log2 (Exp 1/Exp 2) for each gene as a function of the 
log10 (Exp 1*Exp 2) product intensities. This ‘R-I’ plot can reveal intensity-specific 
artifacts in the log2 (ratio) measurements which can be eliminated using lowess 
 21
normalization method. Under the assumption that most genes are not differentially 
expressed, most of the points in the RI plot should fall along the horizontal line. Figure 
2.5 shows an RI plot where a large number of genes which are not differentially 
expressed fall along the horizontal line, and a number of differentially expressed genes 
are scattered away from the horizontal line. 
 
Figure 2.5 - An RI plot 
2.7 Difference between MA and RI plots 
The MA plot and RI plot are used to check if the data exhibits an intensity dependent 
structure. RI plots and MA plots are used in an alternative manner by scientists. 
In an MA plot, plot M=log 2 (R/G) Vs A= (1/2) log 2 (R*G) 
In a RI plot, plot R= log 2 (R/G) Vs I=log 10 (R*G) 
where R and G are two different experiments. 
The type of plots used for analysis is a source of confusion due to the fact that the RI plot 
looks very similar to an MA plot. It is important to know that MA plots are similar to RI 
plots but are not the same. RI plots are most commonly used to show the effect of lowess 
 22
normalization. MA plots are used instead of scatter plots because they serve to increase 
the room available to represent the range of differential expression and makes it easier to 
see non-linear relationships between the log intensities. 
2.8 Identifying differentially expressed genes 
One of the main goals of microarray experiments is to identify differentially 
expressed (DE) genes[11]. It will be practical to identify a limited number of genes which 
are the most likely candidates. This set of DE genes can be further analyzed using 
clustering techniques, etc. 
2.8.1 Fold change 
The fold change detection is a simple approach where a fixed fold-change-cutoff 
interval is used to find genes which are differentially expressed.[10; 13] 
2
1
sampleinvalueExpression
sampleinvalueExpressionchangeFold =   [Equation 2.5] 
 
If a gene’s experimental log-ratio exceeds the upper cutoff interval boundary, then 
it is marked as significant and over expressed. If a gene’s experimental log-ratio falls 
below the lower cutoff interval boundary, then it is marked as significant and under 
expressed. Genes with experimental log-ratios in the range of the interval are marked for 
regular behavior. Important factors to be considered in fold change method are what 
cutoff should be used and should the cutoff be the same for all the genes. Though it is a 
very straightforward method for classifying genes, the fold change method has the 
disadvantage of not considering variability. Hence, genes with large variances are more 
likely to make the cutoff just because of noise. For poorly expressed genes, small changes 
 23
in intensity can lead to large calculated fold changes. And it is not a statistically based 
method. 
2.9 Clustering 
Microarray experiments deal with a large amount of data, which has to be stored 
and analyzed. Therefore a general idea is to reduce the dimensionality of the data. The 
basic concepts in clustering are to try to identify and group together similarly expressed 
genes and then try to correlate and interpret the observations at the biological end. The 
basic principles in gene clustering are: 
1. Organize the data into a small number of homogeneous groups. 
2. Find similar expression patterns of genes. Both low and high expression level 
genes can be placed in the same cluster if their expression profiles have similar 
shape. 
2.10 Types of clustering 
Clustering can be hierarchical or flat, as well as agglomerative or divisive[10]. 
Agglomerative processes start out by considering each object as a separate cluster and 
proceed to group the most similar objects in an iterative fashion until all the data are 
included. Divisive methods start out with the complete set of data as one large group, or 
cluster, and proceed by partitioning the objects starting with those that are most 
dissimilar. Based on their background principle, the different types of clustering methods 
available are Hierarchical agglomerative clustering[9;10], Hierarchical divisive 
clustering[9;10], k-means clustering and self organizing maps (SOM’s)[9;10;13].  
 24
2.10.1 Hierarchical clustering 
The clustering method used for analysis in this tool is hierarchical clustering. The 
hierarchical clustering algorithm uses a bottom-up approach where it iteratively joins the 
two closest clusters starting from a single cluster[9;10;13].  After each step, a new 
distance matrix between the newly formed clusters and the other clusters is recalculated.  
For a set of N genes to be clustered, and a NN ×  distance matrix, the hierarchical 
clustering is performed as follows: 
1. Assign each gene to a cluster of its own. 
2. Find the closest pair of clusters and merge them into a single cluster. 
3. Compute the distances between the new cluster and each of the old clusters. 
4. Steps 2 and 3 are repeated until all the genes are clustered. 
The distance matrix is calculated by considering the shortest distance from any 
member of one cluster to any member of the other cluster. Hierarchical clustering has 
become popular for the following reasons: 
• Hierarchical clustering techniques are meaningful to cluster data at the experiment 
level rather than at the level of individual genes. Such experiments are most often 
used to identify similarities in overall gene expression patterns in the context of 
different treatment regimens. 
• The analysis reveals groups of similar genes that can be studied in greater depth. 
• It is possible to visualize the data in a hierarchical way using interactive computer 
programs. 
While intuitively appealing as a method, hierarchical clustering is not an efficient 
method for very large gene expression matrices as the full distance matrix of all pair-wise 
 25
distances has to be calculated in advance, which for n objects takes an order of n2 steps. 
Hierarchical clustering is also less suitable for noisy data.  
2.10.2 Dendrogram 
Hierarchical clustering can be represented as a tree called a dendrogram[9;10].  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Source 1: http://www.awprofessional.com/articles/article.asp?p=357695&seqNum=4&rl=
Figure 2.6 - Construction of a two-dimensional dendrogram representing a 
hierarchical cluster of related genes 
 
 26
By cutting the dendrogram at a particular height will give the different clusters and the 
ze of the clusters. The dissimilarity of the clusters is proportional to the length of the 
ertical lines projecting from each cluster. Figure 2.6 is an example of how a dendrogram 
f clusters is obtained. 
Each column represents a different experiment, each row a different spot on the 
icroarray. The height of each link is inversely proportional to the strength of the 
orrelation. Relative correlation strengths are represented by integers in the 
ccompanying chart sequence. Genes 1 and 2 are most closely coregulated, followed by 
enes 4 and 5. The regulation of gene 3 is more closely linked with the regulation of 
enes 4 and 5 than any remaining link or combination of links. The strength of the 
orrelation between the expression levels of genes 1 and 2 and the cluster containing 
enes 3, 4, and 5 is the weakest (relative score of 10). (Adapted from: Jeffrey Augen, 
ioinformatics and Data Mining in Support of Drug Discover," Handbook of Anticancer 
rug Development. D. Budman, A. Calvert, E. Rowinsky, editors. Lippincott Williams 
2.10.3 Heat maps 
A heat map is a color image with a dendrogram attached to the left side and to the 
top of the image[10;14]. The rows and columns plotted in the heatmap are re-ordered 
based on the restrictions imposed by the dendrogram. Each row in the heatmap represents 
a gene and the columns represent the different experiments to which the gene is 
subjected. The colors in the heat map simply represent the values in the gene expression 
matrix. One can observe from a heat map (Figure 2.7) that genes with similar gene 
expression profiles (i.e. strings of similar colors) are grouped close together. 
si
v
o
m
c
a
g
g
c
g
"B
D
and Wilkins. 2003) 
 27
 Figure 2.7 - A heat map with a dendrogram and a color key 
The next chapter is a literature review of existing microarray data analysis tools, 
and their advantages and disadvantages in terms of ease of use, availability and 
functionality. It also includes a discussion about the motivation for the developed 
microarray data analysis tool. 
 
 
 
 
 
 
 
 28
3. LITERATURE REVIEW 
There are several commercial and non-commercial solutions as well as a growing 
body of freely available open source software for analyzing microarray data. A review of 
some popular open source microarray data analysis tools is presented here including 
Bioconductor, TM4, MIDAS, BASE, WebArray and SNOMAD. 
3.1 Bioconductor 
Bioconductor is an open source project for computational biology[15].  The main 
focus is to d on analysis. 
Biocon
s at least one vignette, a document that provides a textual, 
sk oriented description of the package’s functionality and can be used interactively.  
lthough initial efforts focused primarily on DNA microarray data analysis, many of the 
ftware tools are general and can be used broadly for the analysis of genomic and 
xpression data. Bioconductor has adopted object-oriented programming as its primary 
rogramming paradigm. 
he main features of the Bioconductor project are: 
ƒ Use of R to provide a wide range of statistical and graphical methods for the 
analysis of genomic data. 
eliver high-quality infrastructure and end-user tools for expressi
ductor is built completely on R[3;14] and R packages. A list of the different types 
of packages available is given in Table 3.1. In addition to providing genomic data 
analysis tools, Bioconductor has excellent integrated, dynamic documentation. Each 
Bioconductor package contain
ta
A
so
e
p
T
 29
ƒ Help integrate biological literature data from PubMed and LocusLink with the 
analysis of genomic data. 
ƒ Allows the development of extensible, scalable and interoperable software. 
ƒ Provide high qual le research. 
ƒ 
nalysis tool with a simple user interface, which does not require 
the user upload data for analysis and 
dow o  the 
web-ba e focus of this thesis. 
ity documentation and reproducib
Provide training in computational and statistical methods for the analysis of 
genomic data. 
Task Packages 
General programming tools Biobase, graph, tkWidgets, reposTools, 
rhdf5 
Annotation AnnBuilder, 
Table 3.1 - Bioconductor packages 
Although Bioconductor has the advantage of building on the existing toolkit of 
statistical applications, it is command line based which is imposing for many users.  The 
tkWidgets package provides some functionality for creating GUI’s, but even that requires 
additional programming. 
The need for an a
annotate 
Graphics Geneplotter, hexbin 
Preprocessing microarray data Affy, marrayClasses, marrayInput, 
marrayNorm, marrayPlots, marrayTools  
Differential gene expression Genefilter, multtest, ROC 
any kind of programming skills, and instead lets 
nl ad the results in a point-and-click fashion, is the main factor in developing
sed application which is th
 30
3.2
alysis suite of tools was developed to provide the 
mic of the 
mic ajor applications, 
Mic a
System  Multiexperiment Viewer (MeV).  Since the focus of this project is 
in array data analysis, the discussion is confined to MIDAS.  
M alysis System
MIDAS is a java application which pr  users an intuitive interface to design 
an sses combining one or more ring steps.  MIDAS 
reads “.tav” (TIGR ArrayViewer file type, w mn, tab-delimited text 
fo  purposes o
a single slide) files generated by TIGR Spo ia 
M les include lo ormalization.  It 
also includes background- ate analysis and filtering, 
and the
s the data in tav format. While TM4 overcomes some of the 
limitati
accessed through web, instead of 
 TM4 and MIDAS 
The TM4 microarray an
roarray community with a comprehensive set of tools to handle all aspects 
roarray process[16]. The TM4 suite of tools consist of four m
ro rray Data Manager (MADAM), TIGR_Spotfinder, Microarray Data Analysis 
 (MIDAS), and
 micro
IDAS: Microarray Data An  
ovides
alysis proce normalization and filte
hich is an eight-colu
rmat developed at TIGR for the f storing the intensity values of the spots on 
tfinder or retrieved from the database v
ADAM. Normalization modu wess and total intensity n
 and quality- control trimming, replic
 identification of differentially expressed genes using intensity dependent Z-scores 
and user defined fixed fold-change cut-offs. MIDAS provides scatter plots that illustrate 
the effects of each algorithm on the data. When the normalization and filtering steps are 
complete, MIDAS output
ons of a command-line driven system, it has the disadvantage of requiring users to 
maintain current copies of the software locally and to update the system as it evolves.  
Thus the need for an analysis tool which can accept data as a simple text file, 
instead of program specific formats and which can be 
 31
maintaining local copies of the program on a user’s computer, has been another 
motiva
BASE was developed using a web-based approach which closely integrates a data 
management system with a data analysis system[16]. Since expression analysis tools are 
evolving rapidly, BASE has a plug-in architecture that allows new modules to be easily 
added for data transformation, analysis, or visualization. BASE incorporates a data 
analysis interface that allows users to define an analysis method that passes data through 
multiple routines and to create transformed datasets and subsets. This allows the original 
unmodified data to be analyzed in a number of ways to create multiple analyses. BASE 
allows data to be visualized in a variety of ways. Unmodified and transformed datasets 
can be plotted interactively as scatter plots, displayed in histograms, or viewed as tables. 
Though BASE minimizes the software update problem through its web-based approach, 
it has the disadvantage that it loses a good deal of the graphical functionality that local 
applications can provide. 
The motivation for creating a pipeline process in the application being developed 
comes from the analysis method of BASE. Also, though not yet implemented, the 
integration of the data analysis module with a data management system as done in BASE 
is a good future improvement. 
ting fact in developing this application. 
3.3 BASE: BioArray Software Environment 
3.4 WebArray: an online platform for microarray data analysis 
WebArray offers a convenient platform for biologists to access several cutting-
edge microarray data analysis tools[17]. WebArray runs on a LAMP system (Linux + 
 32
Apache + MySQL + Python) system. Background computations are mostly done by R 
scripts.  The currently implemented functions of WebArray were based on limma (Linear 
Models for Microarray Analysis) and affy package from Bioconductor, the spacings 
LOESS histogram (SPLOSH) method, PCA-assisted normalization method and genome 
mapping method. WebArray incorporates these packages and provides a user-friendly 
interface for accessing a wide range of key functions of limma and others, such as spot 
quality weight, background correction, graphical plotting, normalization, linear modeling, 
empirical bayes statistical analysis, false discovery rate (FDR) estimation, and 
chromosomal mapping for genome comparison. Microarray analysis using WebArray can 
be executed in three steps: 1) uploading and managing files; 2) selecting datasets and 
methods for analysis, 3) browsing results. A good help document is also available with 
detailed annotation of all the functions of WebArray.  Thus WebArray is an excellent free 
open source software for microarray analysis that can be used by an average biologist 
after some training.  
ardization and Normalization of MicroArray Data) 
on to the regular transformations and visualization tools, 
SNOMAD includes two non-linear transformations which correct bias and variance 
which are non-uniformly distributed across the range of microarray element signal 
intensities: 1) local mean normalization; and 2) local variance correction (Z-score 
generation using locally calculated standard deviation). 
3.5 SNOMAD (Stand
SNOMAD is an interactive, user-friendly web-application which can be accessed 
freely via the internet with any standard HTML browser[18]. SNOMAD is a collection of 
algorithms for the normalization and standardization of gene expression datasets derived 
from diverse sources. In additi
 33
The SNOMAD tool is available at - 
http://pevsnerlab.kennedykrieger.org/snomad.htm. No programming expertise or 
software installation is required. Users can upload their gene expression data and specify 
the transformations they wish to apply on their data. Results come in the form of both a 
text file containing numeric values and image files of graphs of the data corresponding to 
all the transformations. 
WebArray and SNOMAD are two user-friendly tools available in the market for 
microarray data analysis. But they have their own disadvantage of having limited 
functionality, confined to a certain set of routines that the user can perform on the data. 
They do not have the scope for adding new R programs to the already existing system. In 
such a case, biologists tend to use multiple tools for obtaining the required results. This 
lack of extensibility formed another motivation for the development of the application 
under discussion. Thus all these above discussed factors led to the development of the 
current application to provide a solution to the community driven need for an easy to use, 
readily available and extensible microarray data analysis tool, which uses R routines for 
analysis. 
 
 
 
 34
4. IMPLEMENTATION SPECIFICS 
In this chapter, the software implementation specifics for the microarray analysis 
tool are discussed. A brief introduction to the R package, which forms the base for the 
statistic
4.1 R 
R is a powerful software environment for data manipulation, calculation and 
graphical display. It is a GNU General Public License project similar to the S language. 
The name is partly based on the first names of the first two authors (Robert Gentleman 
and Ross Ihaka), and partly a play on the name of the Bell Labs language ‘S’[3;14]. 
 supports a wide range of statistical techniques including descriptive statistics, 
linear and nonlinear modeling, classical statistical tests, probability distributions, analysis 
of variance (ANOVA), time series analysis, classification, clustering, robust regression 
and maximum likelihood. 
 
al analysis of the microarray data is given. Description of Java which has been 
used to develop the user interface, and information about Rserve, which is the plug-in 
used to connect to R from Java are also given. Some background about MySQL database 
and the JDBC connection needed to connect to a database from Java code is also 
provided. A clear understanding of the software is needed to understand the 
implementation techniques discussed in this thesis. 
R
 35
R is extensible via user defined functions written in its own language, or through 
the use of dynamically loaded modules written in other languages. It can be used with 
Linux, UNIX and Microsoft Windows™.  
4.1.1 Statistics and R 
Most of the statistical techniques have been built into the base R environment and 
many more are supplied in the form of packages. There are about 10 packages called 
standard packages which are supplied with R and many more can be downloaded from 
the Comprehensive R Archive Network (CRAN) website (http://cran.r-project.org). 
The major difference between R and other statistical systems is that in R, the 
statistical analysis is performed as a sequence of steps with the results of every step 
stored in objects. In systems like SAS and SPSS copious output is obtained from a 
regression or analysis whereas R will give minimal output and store the results in a fit 
ct f further processing by R functions. 
4.1.2 R and Windows™ 
The latest version of R for Windows™ can be downloaded from the CRAN 
website. The version used for development of this project is R 2.1.1. A full installation of 
R on Windows™ takes up to 50 MB of disk space. To install, double click on the icon for 
rw2011.exe and follow the instructions. R installed in this way can be started from the 
start menu or by double clicking the R shortcut. To add packages to the existing R 
system, download the packages from the CRAN website and unzip them into the 
R/rw2011/library folder directly. 
obje or 
 36
 Figure 4.1 - R command line interface on startup 
erted into native data types. 
• Persistent connections until the connection are closed. 
4.2 Rserve 
Rserve [19]is a TCP/IP server which allows other programs to use R facilities 
from various languages without the need to initialize R or link against R library. Rserve 
supports remote connection, authentication and file transfer. Typical use of Rserve is to 
integrate R backend for computing statistical models, plots, etc from other applications. 
The features of Rserve include: 
• R initialization is not necessary. 
• Most R data types are conv
 37
• Offers client independence since the client is not linked to R. 
• Rserve provides some basic security in the form of encrypted user/password 
authentication. 
• Rserve allows transferring files between the client and the server. 
Rserve itself is the server which responds to requests from the clients. It listens 
for incoming connections and processes incoming requests. A client framework was also 
developed – JRclient. JRclient is a client suite which allows a java application to access 
Rserve.  It was developed in java. It provides automatic type translation for most objects 
such as int, double, arrays, string or vector and classes for special R objects such as 
RList, RBool, etc. The idea behind the separation of client/server side allows handling 
multi-threading better when linking to R library directly. 
4.2.1 Installation of Rserve 
R 1.5.0 or  to be able to use 
AIX and Windows™. The Windows™ version 
of Rserve was used for development. Although Rserve works on Windows™, it is not the 
recommended since Windows™ lacks important features that make the separation of 
namespaces possible. Therefore Rserve for Windows™ allows only one connection at a 
time and all subsequent connections share the same namespace.  
Installation process for Windows™: 
1. Make sure to download the proper binary based on the version of R. 
2. Copy the binary Rserve.exe to the same directory where R.dll is located. By 
default it is in the R\rw2011\bin folder. 
3. Run rserve.exe to start the server and to make connections to R. 
higher needs to be installed on your system in order
Rserve. Rserve works on Linux, Solaris, 
 38
Rserve was developed by Simon Urbanek, a researcher at AT&T Research labs. 
Any e
(General Public License). 
4.3 
This microarray data analysis tool was mostly developed using Java in order to 
provide a platform independent solution. The IDE (Integrated Development 
Environment) used for code development is Eclipse SDK 3.1.1. Other IDE’s that can be 
used are Borland’s JBuilder or Netbeans. 
ple object-oriented, distributed, 
interpreted, robust, secure, architecture neutral, portable, high-performance, 
multithreaded, and dynamic language[20]. A program written in java is both compiled 
and interpreted. A java compiler generates an architecture independent object file 
executable on any system supporting the java runtime environment. The object code 
consists of bytecode instructions designed to be both easy to interpret on any machine 
and easily translated into native machine code at load time. So compilation takes place 
only once, interpretation occurs each time the program is executed. 
rogram runs. 
Som
operati e. The java platform differs from these 
on  interested to contribute to the project can do so since it is released under GPL 
Java 
4.3.1 Java language 
Java as described by Sun Microsystems is, a sim
4.3.2 Java platform 
A platform is the hardware or software environment in which a p
e popular platforms like Windows™, Linux, Mac OS, etc. are a combination of the 
ng system and the underlying hardwar
 39
platform
parts: 
1. The Java Virtual Machine (JVM) 
2. The Java Application Programming Interface (API) 
The JVM is the interpreter and the runtime system, which lets java programs run 
on any hardware-based platform where it has been already ported to. The API is a large 
collection of ready-made software components that provide several capabilities. It is a 
grouped up collection of libraries of related classes and interfaces. These libraries are also 
  
4.4 
Servlets[21] are java programs that run on a web server and build web pages.  
Servlets provide a component-based, platform-independent method for building web-
based applications. Servlets are server- and platform- independent which leaves us free to 
select any server, platform and tools for running our application. 
s based on the fact that it is a software-only platform that runs on top of other 
hardware-based platforms. 
The java platform has two 
known as packages[20].
Java servlets 
 40
 Source: http://cs.nmu.edu/~jeffhorn/Classes/CS122/Figures/javaTranslation.gif
Figure 4.2 - Java servlet execution process 
 
 form that specifies 
method=POST. To be a servlet, a class should extend HttpServlet and contain the doGet 
and the doPost methods to handle the GET and POST requests respectively. Both these 
methods take two arguments: an HttpServletRequest and an HttpServletResponse. The 
HttpServletRequest has methods to handle all incoming information such as form data, 
HTTP request headers, and the client’s hostname. The HttpServletResponse lets you 
specify outgoing information such as HTTP status codes, content-type, cookies and most 
importantly lets you post document content back to the client. The two important 
packages that have to be imported into the servlet file are: 
Servlets are class files which handle GET and POST requests. GET requests are 
the usual type of browser requests for web pages in HTTP, when a user types a URL on 
the address line or follows a link from a web page. Servlets also handle POST requests, 
which are generated when someone submits an HTML
 41
1. javax.servlet (for HttpServlet) and  
2. javax.servlet.http (for HttpServletRequest and HttpServletResponse). 
4.5 JSP (Java Server Pages) 
Java Server Pages[20;21] is a technology that lets you mix regular, static HTML with 
dynamically-generated HTML. You simply write the regular HTML in the normal 
manner, using whatever web-page building tools you normally use. You can then enclose 
the code for the dynamic parts in special tags which start with “<%” and end with “%>”.  
A JSP is saved with a .jsp extension and it can be invoked just like any other normal web 
page. Though it appears to be a normal HTML file, a JSP acts like a servlet behind the 
scene
.6 JDBC (Java D
s.  
4 atabase Connectivity) 
JDBC[20;21] defines how a java program can communicate with a database. 
JDBC API provides two packages – java.sql and javax.sql. By using JDBC API, one can 
connect to any database, send queries to the database and process the results. 
JDBC architecture defines the different layers to work with any database and Java. 
1. JDBC API interfaces and classes which are at top most layer (to work with java) 
2. A driver which is at the middle layer (maps java to database specific language) 
3. A database at the bottom (to store physical data) 
 42
 Source: http://www.dbmsmag.com/9610i06.html#figure1
Figure 4.3 - The three-tier architecture of a JDBC connection 
 
The three main interfaces provided by the JDBC API to work with databases are: 
 connection functionality. 
2. 
hich comes 
4.7
conjunction with web technology server applications[22]. 
The database has been designed for speed, which would be useful in large transactions. 
MySQL is currently the most widely installed database, a well respected product that is 
more than capable of commercial operation. In fact, the entire Google search engine is 
built upon MySQL technology[23]. MySQL offers most of the functionality one will 
1. Connection interface provides database 
Statement interface provides SQL query representation and execution 
functionality. 
3. ResultSet interface provides functionality for retrieving the data w
from the execution of a SQL query using Statement.  
 MySQL 
MySQL is a very popular open source database server which is commonly used in 
to create dynamic and powerful 
 43
expect from an RDBMS. It ensures that transactions comply with the ACID model 
(Atomicity, Consistency, Isolation, and Durability), allows the building of indexes, 
supports standard data types and allows for database replication, among other features. 
One area where MySQL falls short is its lack of certain features like sub-queries, 
constraints, views, cursors and objects. MySQL is fast, easy to use, is open source and if 
the application is a web application then MySQL meshes in perfectly with most of the 
web development languages. When using MySQL with java, the MySQL Connector/J 
driver needs to be downloaded from MySQL’s website 
[http://www.mysq
.8 Apache Tomcat 
developm
application container that was created to run Servlets and Java Server Pages (JSP) in web 
applications. Java m
eb pages, 
servlets and JSP into a single directory structure. It can be thought of as a container 
h a deployment directory where you can place all your web application files 
for them
l.com/products/connector/j/]. 
4
Apache Tomcat (also codenamed Catalina) is a standalone Web server used as a 
ent server on your desktop. The Tomcat server [21] is a java based web 
ust be installed for Tomcat to operate.  
Tomcat organizes all the parts of a web application such as static w
whic cts as the 
 to execute without any hassles.  
The root folder is the deployment folder where all the static html files and JSPs 
can be placed. The Servlets are placed in the ROOT/WEB-INF/Classes folder. Pros of 
Tomcat are that it is an open source project, stays on top of the Servlet API 
developments, and works extremely well. The cons are that it is not the fastest 
implementation and that you are on your own for support. 
 44
In the next chapter, a detailed discussion on the developed microarray data 
analysis tool is presented. The discussion is in the form of objectives of the analysis tool 
and the results that have been achieved along with screen shots of the system for easier 
understanding. 
 
 
 
 
 45
5. OBJECTIVES AND RESULTS 
The aim of this thesis was to develop a freely available, platform independent 
plication for visualization, normalization and analysis of microarray experiments and 
so a tool which will guide the users through the steps of normalization and data analysis 
such as identifying differentially expressed genes and to cluster those differentially 
expressed genes into clusters of genes exhibiting similar behavior. 
R, the statistical package which is freely available can be used to perform all 
types of analysis on microarray data, but it has the disadvantage of being a command line 
based package which requires the biologists to know the syntax of various commands and 
also requires the users to be familiar with programming techniques and concepts. The 
users have to in short, be well versed in R to perform efficient data analysis. Thus the 
motivation for the tool comes from the need for a easy to use, point and click kind of 
interface which is easily accessible over the internet, to which users can easily connect to, 
upload their data files retailored to a particular format and get both numeric and visual 
results for interpreting the data without having to worry about the intricacies of 
programming. 
I will now discuss about the different objectives of the tool and the way they were 
implemented and also discuss the results of normalizing and analyzing the microarray 
data using the software tool, which was developed during the course of this work. 
ap
al
 46
5.1 Uploading experiment data 
The user can upload experimental data as a text file, which is more importantly in 
the format of a gene expression m
the genes and the expression values of the genes for different experiments (which may be 
different biological conditions, different time-points, etc). 
atrix. The data text files should contain a listing of all 
 
Figure 5.1 - Sample data file for analysis 
 
As shown in figure 5.1, the gene names form the first column, the other columns 
are the different experiments and the numerical values represent the gene expression data 
for each gene under the different experiments. The data file can be uploaded through a 
user friendly web page as shown below and it will be saved in the working directory of 
the application on the server and will be used for all further analysis (figure 5.2). 
 47
 Figure 5.2 - Process of uploading data files 
5.2 Normalization methods 
A wide choice of common normalization methods is offered to the user to remove 
the systematic bias within the data.  It is also possible to add newly developed 
normalization techniques at a later stage.  
All the previou e applied to the data 
from the uploaded file. The normalization can be applied to more than one experimental 
column or to all the experimental columns.  In the case of multiple experimental columns, 
all the columns are normalized with respect to the first experimental column using the 
normalization method selected. The normalized data can be downloaded as a text file for 
reference or for input to another system. 
 
sly discussed normalization methods can b
 48
 Figure 5.3 - rmalization Normalized data file using total intensity no
 
 
Figure 5.4 - Normalized data file using median normalization 
 49
 50
Figure 5.3 shows an example of total intensity normalization, which can be compared to 
figure 5.4 (median normalization). The slight variation in the normalized data from the 
different methods can be observed.  
5.3 Data visualization  
To get an idea about the condition of the data sets or the effects of different 
normalization methods, different means of graphical display such as scatter plots, MA 
plots and RI plots have been implemented. 
5.3.1 Scatter plot 
Plotting the log 2 intensity values of the one experiment condition versus the log 2 
intensity values of another experiment condition is a common way to display the 
distribution of the data. The normalization step gives two forms of output. One is the data 
file with the normalized data and the second form is a scatter plot of the normalized data. 
 
 
 
 
 
 
 
 
 
Figure 5.5 - Total intensity normalization scatter plots and text files 
 
The graphical result of the normalization step is a nn×  matrix of scatter plots 
where n is the number of experimental columns to be normalized (figure 5.5 and 5.6). 
The nn×  matrix of scatter plots consists of scatte ent versus itself 
and all the other experiments selected. Thus it can be observe om the results that the 
scatter plot of an experiment versus itself is a straight line passing through the origin. The 
 matrix of the scatter plots has an image area map defined on it which lets users to 
zoom in on a particular scatter plot for better viewing of the plots. 
r plots of each experim
d fr
nn×
 
 51
Figure 5.6 - matrix of scatter plots with a zoomed out portion for two specific 
xperiments 
 MA plot 
An alternative to the scatter plot is the MA plot with transformed axes to provide 
intensity information. This tool produces MA plots for both normalized and raw data.  
The MA plots are also produced as a 
 nn×
e
5.3.2
nn×  matrix of individual MA plots where n is the 
number of experiments selected (figure 5.7). The image area map logic has also been 
used for the MA plot which lets the user to zoom in the plot in a separate window. 
 
Figure 5.7 - matrix of MA plots with a z
experiments 
5.3.3 RI plot 
The tool produces RI plots for both raw and normalized data. The   matrix of 
RI plots and the image area map logic has been used for this visualization technique as 
well, which lets the users to view the RI plots for all the n experiments at the same time 
and also to zoom in on a particular experiment’s RI plot (figure 5.8). 
nn× oomed out portion for two specific 
nn×
 52
 Figure 5.8 - matrix of RI plots wi
experiments 
5.4 Creating a process pipeline 
This software tool allows the user to create a process pipeline where the user can 
a outines to be performed on the data uploaded. The pipeline window is 
reached data
After selecting the 
experimental columns for further analysis, the user interface window submits into a 
process pipeline screen, where a variety of operations are listed out from which the user 
can select the processes of is choice and form a sequence of steps to be performed on the 
data. The sequence of steps to reach the pipeline step is discussed in detail below. 
nn× th a zoomed out portion for two specific 
select  set of r
 after uploading the data file (figure 5.9). On uploading the  file, the 
background code parses out the different experiments conducted and displays a window 
to the user with all the experiments listed out from where the user can select those 
experimental columns on which he /she wants the analysis done. 
 53
  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Figure 5.9 - Steps for pipelining analysis 
 
 
 
 
 
The columns parsed out 
from the uploaded file, 
which represent the 
different experimental 
columns. 
The columns transferred 
double right arrow button, 
will be subjected to 
here by clicking on the 
further analysis routines.
On clicking the Submit 
button, the window which 
lets the user select a 
pipeline of processes to be 
performed on the data 
uploaded, will open up.
Form 
 54
  
 
 
 
 
 
 
 
 
 
 
igu
rocesses listed as list boxes, which the user can perform on the uploaded data. The first 
ategory is the normalization routines from where the user can select a particular 
ormalization method and apply it to the data. The second category is the various 
isualization plots available to view the distribution of data. The third list box forms the 
ird category of processes called “Ratios and Clustering” which consists of finding the 
p and bottom intensity ratios between experiments, a color coded representation of the 
differentially expressed y the clustering of the 
ost differentially expressed genes into groups with the same pattern of behavior. The 
re 5.10 - Pipeline screen to define a sequence of routines to be performed on the 
data 
In the pipeline window as shown in the figure 5.10, there are three categories of 
 
 
F
p
c
n
v
th
to
 genes by the fold change method and finall
m
Th
ca
The list box with the 
sequence of processes to 
be performed on the data. 
Buttons to change 
the order of the 
processes or to 
delete a selected 
process. 
e three 
tegories of 
routines available 
r analyzing the 
data 
fo
Submit to get 
results. 
Button for transferring 
routines from LHS to 
RHS. 
 55
buttons with the right arrows by the side of each list box of methods is used to select that 
ethod into the pipeline list box on the right hand side. The “Up”, “Down” and “Delete” 
uttons on the right hand side of the pipeline window are used to change the sequence of 
e processes lined up in the “Selected Order of Execution” list box. In order to move a 
articular process up to the beginning of the pipeline, the user has to select the process 
nd click on the “Up” button. Similarly, for moving a particular process to a later stage of 
e pipeline, the user has to select the process and click on the “Down” button. The 
Delete” button is used for deleting a particular process from the pipeline. On clicking 
the “Submit” button the pipeline is taken into the system and the sequence of steps are 
performed on the data. On completion of the pipeline, the results screen is displayed as a 
parate window as shown in figure 5.11. 
 
 
 
 
 
 
 
 
m
b
th
p
a
th
“
se
 
 
 
 
 
 
 
 
 
Figure 5.11 - Results screen after submitting the pipeline screen 
Click on this link to 
data file. 
get the normalized 
Click on the 
links to get 
visual results. 
 56
5.5 Identifying genes of interest 
One of the most important goals of microarray technology is the search for new 
target genes. Methods have been provided to detect differentially expressed (DE) genes. 
The techniques implemented in these methods are fold change detection and statistical t-
test. 
 A certain threshold value is set with which the log 
5.5.1 Fold change cut-off 
ous colors are assigned to represent the genes 
log intensity ratio values.  
2 values of the intensity ratios 
of a gene between different experiments, are compared. If a gene’s log 2 intensity ratio 
value exceeds the threshold value it is marked as differentially expressed (DE) in that 
given experiment. In this software tool, vari
based on the gene’s 2 
Cutoff Range Color coding in RGB(red, green, blue) Color image
< -2.0 (153, 0, 0)  
5.12 −<−≥ and  (255, 0, 0)  
0.15.1 −<−≥ and  (255, 51, 0)  
7.00.1 −<−≥ and  (255, 103, 0)  
5.07.0 −<−≥ and  (255, 204, 51)  
3.05.0 −<−≥ and  (255, 204, 102)  
1.03.0 −<−≥ and  (255, 255, 204)  
0.01.0 <−≥ and  (255, 255, 230)  
1.00.0 <≥ and  (235, 255, 230)  
3.01.0 <≥ and  (204, 255, 153)  
5.03.0 <≥ and  (153, 255, 0)  
7.05.0 <≥ and  (10 , 255, 0) 3  
0.17.0 <≥ and  (51, 255, 0)  
5.10.1 <≥ and  (51, 204, 0)  
0.25.1 <≥ and  (0, 153, 0)  
0.2≥  (0, 51, 0)  
 
Table 5.1 - Color coding scheme for differentially expressed genes using fold change 
 57
The ignment of a color was done foass r easier interpretation of data. The different ranges 
of the 
The process of assigning colors to the different genes for a given experiment is as 
follows. The intensity ratios of all the genes between the two experiments the user has 
log 2 values of the intensity ratios are then calculated. The 
log 2 v
provided as a tool tip (figure 5.12). On 
clicking a r b tensity 
ratios of the gene in other experiments and f the gene’s expressio  value in all 
the experiments can be vie ed (figure 5.1
 
 
 
log 2 intensity ratio values and the corresponding color coding assigned is given 
below. 
 
selected are calculated. The 
alues are then compared with the ranges in Table 5.1. to determine a range and 
then a corresponding color is assigned to the gene, thus categorizing it as either regularly 
expressed, under expressed or over expressed.  All the genes are plotted as matrix of 
color images, where when a user does a mouse over on a particular color image, the gene 
corresponding to it and the intensity value is 
 particular colo ox, a window opens up where the gene name, the in
 a line plot o n
w 3). 
 58
 Figure 5.12 - Color based image map of the gene’s intensity ratios between two 
experiments 
 
Figure 5.13 – Individual gene details 
 
 
 
 
 
 
 
 
 
 59
 60
5.6 Clustering genes 
Clustering of genes allows biologists to identify genes which exhibit similar 
behavior patterns over a set of experiments.  Modules which perform clustering of genes 
using hierarchical clustering technique have been implemented. The outputs obtained 
from the clustering module are a few groups which contain genes which behave similarly. 
This software tool performs clustering in the following way.  The background 
code calculates the intensity ratios of the gene expression values for all the experimental 
columns selected by the user. The intensity ratios are calculated with respect to the very 
first experimental column. Then the background code checks for those genes whose log 2 
intensity ratios across all the experiments is greater than the upper limit of 2.0. It selects
all m 
into groups of gene with the same pat r. The cluster of genes are displayed 
sing a heat map which uses a dendrogram to show the gene clusters and also a color key 
e a color-based illustration of the gene in
In the case of huge data file  
ough the heat map can accommodate all the genes and their corresponding 
lusters, it appears very clumsy and illegible. In order to solve this problem, the top fifty 
ected and then clustered into similar pattern 
er is not looking at a large number of genes clustered together, 
ut can expect a more refined clustering of the most differently behaving genes. 
mploying such a clustering method, the tool tends to overlook the other gene clusters 
which might be carrying  be understood that the 
clustering information of all the genes is present and the tool is applying the clustering 
 
 those genes and then applies hierarchical clustering technique in order to cluster the
tern of behavio
u
to provid tensity ratio values. 
s, the number of genes selected for clustering may be
large and th
c
most differentially expressed genes can be sel
groups. This way, the us
b
By e
 some significant information. It should
routi  to all the genes anne d not just the top fifty genes. But the number of genes used for 
displaying the clustered information are the ones which are most differentially expressed 
in order to make the heat map (Figure 5.14) easier to view and understand. 
 
Figure 5.14 - Heat map with a dendrogram to represent expression clusters 
5.7 
2 
Top and bottom intensity ratios 
The tool has two methods implemented for finding the top and bottom 10, 25, 50, 
75 and 100 genes based on their log intensity ratios between different experiments. The 
user can select the experiments for which the ratio has to be calculated for and also select 
the number of top/bottom ratios needed which can be either of 10, 25, 50, 75 and 100. 
Both these methods return a ordered list of genes and their intensity ratio values. The 
purpose of these two methods is to provide a quick way to determine those genes which 
are the most over/under expressed in a set of selected experiments. 
 61
  
 
 
 
or calculating the intensity ratios and the F ure 5.15 - User selected columns fig
number of top/bottom genes needed 
The result screens obtained for the selections done in figure 5.15 are shown in figure 
5.16. The results are displayed in the form of a table with the gene name and its  
log 2  intensity ratio value. 
 62
  
 
Figure 5.16 - Results showing top 10 ratios and bottom 25 ratios between two 
experiments 
5.8
A module has bee m thousands of rows of 
different 
he search also returns the results of a t-test on the gene 
expression profile which consists of the confidence interval for a regular gene expression 
value and also the up and down regulated gene expression values based on the confidence 
interval. 
 Search genes 
n implemented for searching genes fro
gene expression information (figure 5.17). The search is done by the gene name and 
returns as output, a line plot showing the behavior of the gene expression for the 
experimental conditions. T
 63
 Figure 5.17 - Search screen with the list of genes from the uploaded data file 
n intelligent one which does not require the user to scroll through 
thousan
 
The search is a
ds of rows of gene names. Keying in the first few letters of the gene name will 
highlight the gene in the displayed list. On submitting the above servlet by clicking on the 
“View Graph” button, the user can view in a new window, a line plot of the gene 
expression values for all the experiments in the data file, and a color coded representation 
of the results of a t-test on the gene’s expression profile. The resulting screen is shown in 
figure 5.18. 
 64
 Figure 5.18 - Output screen for a search done on gene information 
5.9 Output results as files 
The normalized data, the clustered groups of genes and the intensity ratios of all 
the genes are available as text files, which can be saved onto the local system of the user 
for further usage or reference. The scatter plots, MA plots, RI plots and the heat maps of 
clustered genes can all be saved locally as JPEG image files.  
 
 
 
 65
The above discussion of the results and features of the microarray data analysis 
tool, show that it is a tool which will be intuitive for all the users who have a basic 
understanding of normalization and data analysis. It will be a very handy tool for all 
kinds of users- biologists and software developers, mainly. It’s simple user interface and 
easily accessible results, supported by a help manual will help the analysis of microarray 
data a simpler and easier task. 
In the next chapter, some of the potential improvements that can be added to the 
tool are discussed as well as the conclusions reached about optimal usage of the tool. 
 
 
 
 
 
 
 
 
 
 
 66
6. CONCLUSIONS 
In this thesis, a platform-independent and versatile software tool for normalizing 
and analyzing microarray data was developed. The software meets the requirements that 
were o
ion. 
The web-based front end is accessible from any web browser and handles all user 
interaction. The program with its current functionality can be used by biologists for data 
analysis, but there are certain improvements still in progress.  
he presented program handles a wide range of functions as listed below: 
• Since normalization of data is an important concern, the software tool provides 
different means of normalizing the data. The possibilities are total intensity 
normalization, median normalization and lowess normalization. 
• The effects of normalization can be observed by useful graphical plots like scatter 
plots, MA plots and RI plots. All plots can be created before and after 
normalization. 
• The system provides the capability of creating a pipeline of processes to be 
performed on the microarray data. The user can also subject the data to specific 
individual analysis techniques incase he/she does not want to create a pipeline. 
riginally set forth for a tool of this nature. The interface is user-friendly. The 
system is available for access to multiple users simultaneously upon user authenticat
T
 67
• To detect differentially expressed genes, this tool provides simple fold-change 
detection and also statistical tests like t-test. The detected target genes can be 
printed to a file, for further analysis. 
• Clustering methods ar sis and are also useful for reducing 
6.1 Possible improvements 
As with most of the software tools, this tool has a lot of scope for improvement. 
Since currently new methods for normalization and analysis are developed, it may be 
useful to adapt the present system to theses needs. 
Possible improvements would be: 
• Further normalization methods 
• Additional diagnostic plots (QQ-plot, volcano plots) 
• 
• ent 
• 
tool. As far as issues are concerned, most of them have been solved and some are still in 
progress.  
 
e widely used for analy
the amount of microarray data to a subset of genes, usually to those which are 
most variable between different experimental samples. This has been achieved by 
using hierarchical clustering method. 
• The system is expandable for further development. 
ANOVA 
User managem
• More sophisticated methods for detection of differentially expressed genes 
Compatibility with multiple types of input data files 
Many suggestions for improvement have been provided by potential users of the 
 68
In conclusion, this microarray data analysis tool developed using Java and R is a 
nity driven solution developed to help make the analysis of microarray data 
 and efficient. Additional functiona
commu
simpler lity will be added on with the continuing 
dev p
 
 
 
elo ment by other members of the Bioinformatics Research Group (BRG). 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 69
REFERENCES 
 [1]  International Human Genome Sequencing Consortium, "Finishing the 
uchromatic sequence of the human genome," Nature, vol. 431, pp. 931-945, 
 [2]  Tom A.van de Goor, "A History of DNA Microarrays," Advanstar 
ommunications Inc., 2005. 
 [3]  eter Dalgaard, Introductory Statistics with R Springer, 2002. 
 [4]  ff Augen, "Bioinformatics and Transcription," in Bioinformatics in the Post-
Genomic Era: Genome, Transcriptome, Proteome, and Information-Based 
edicine Addison Wesley Professional, 2004. 
 [5]  .Crick, "Central dogma of molecular biology," Nature, vol. 227, no. 5258 
 [6]  . Ruvkun, "Molecular biology.Glimpses of a tiny RNA world," Science, vol. 
94, no. 5543, pp. 797-799, Aug.2001. 
 [7]  an Cray, "Gene Detective," 2001. 
 [8]  i Pasanen, Janna Saarela, Ilana Saarikko, Teemu Toivanen, Martti Tolvanen, 
auno Vihinen, and Garry Wong, DNA Microarray Data Analysis Picaset Oy, 
 [9]  Dov Stekel, Microarray Bioinformatics Cambridge University Press, 2003. 
 [10]  Helen C.Causton, John Quackenbush, and Alvis Brazma, A beginner's guide. 
icroarray Gene expression data analysis Blackwell Publishing, 2003. 
 [11]  ordon K.Amyth, Yee Hwa Yang, and Terry Speed, "Statistical issues in cDNA 
Microarray Data Analysis," Functional Genomics:Methods and Protocols, vol. 
24, no. Methods in Molecular Biology, pp. 111-136, 2003. 
 [12]  hn Quackenbush, "Microarray data normalization and transformation," nature 
genetics supplement, vol. 32, no. December 2002, pp. 496-501, 2002. 
 [13]  Knudsen, Guide to Analysis of DNA Microarray Data, Second ed John 
iley & Sons, 2004. 
 
e
Oct.2004. 
C
P
Je
M
F
Aug.1970. 
G
2
D
Tom
M
2003. 
M
G
2
Jo
Steen 
W
 70
 [14]  W.N.Venables, D.M.Smith, and R Development Core Team, An Introduction to R 
Network Theory Ltd, 2004. 
 [15]  Robert C.Gentleman, Wolfgang Huber, Vincent Carey, Rafael Irizarry, and 
Sandrine Dudoit, Bioinformatics and Computational Biology Solutions Using R 
and Bioconductor (Statistics for Biology and Health Series) Springer, 2005. 
[ Source 
rch 
[ an online 
306 
vsner, "SNOMAD 
 
. 
 [19]  ide R Functionality to 
 [20]  gan, Java in a Nutshell O'Reilly & Associates, Inc., 1997. 
pache Sams Publishing, 2003. 
[
 16]  Sandrine Dudoit, Robert C.Gentleman, and John Quackenbush, "Open 
Software for the Analysis of Microarray Data," Biotechniques, vol. 34, no. Ma
2003, p. s45-s51, 2003. 
 17]  Xiaoqin Xia, Michael McClelland, and Yipeng Wang, "WebArray: 
platform for microarray data analysis," BMC Bioinformatics, vol. 6:
Dec.2005. 
 [18]  Carlo Colantouni, George Henry, Scott Zeger, and Jonathan Pe
(Standardization and NOrmalization of MicroArray Data): web-accessible gene
expression data analysis," Bioinformatics, vol. 18, no. 11, pp. 1540-1541, 2002
Simon Urbanek, "Rserve -- A Fast Way to Prov
Applications," 2003. 
David Flana
 [21]  Bruce W.Perry, Java Servlet & JSP Cookbook O'Reilly Media Inc, 2004. 
 [22]  Julie C.Meloni, PHP, MySQL and A
 23]  Robert McMillan, "Loosen the reins, says Google CEO,", 11 ed 2005. 
 71
GLOSSARY 
 
ANOVA (Analysis Of Variance) is a collection of statistical models and their 
associated procedures which compare means by splitting the overall 
cDNA mplementary DNA) DNA synthesized from mRNA or DNA by 
Chromosomes
 the form of one or more large macromolecules called 
 several 
exually 
ome, one from 
CRAN (Comprehensive R Archive Network). A network of ftp and web servers 
around the world that store identical, up-to-date versions of code and 
documentation for R. 
DNA (Deoxyribonucleic Acid) is a nucleic acid, usually in the form of a 
double helix that contains the genetic instructions specifying the 
biological development of all cellular forms of life, and most viruses. 
Exons The coding regions of DNA. 
observed variance into different parts. 
(Co
reverse transcriptase often synthesized from a cellular extract. 
 The DNA which carries genetic information in cells is normally 
packaged in
chromosomes. Most multicellular organisms have
chromosomes, which together comprise the genome. S
reproducing organisms have two copies of each chromos
each parent. 
 72
Fold change The ratio of gene expression between two samples in a microarray 
experiment. 
Gene Units of heredity in living organisms. They are encoded in the 
organism’s genetic material (usually DNA or RNA), and control the 
physical development and behavior of the organism. 
Genome The whole hereditary information of an organism encoded in the 
DNA. It includes both genes and non-coding sequences. 
Human 
Genome Project
A project initiated by the government of United States for DNA 
sequencing of the human genome. 
Hybridization The process of combining complementary, single stranded nucleic 
acids into a single molecule. Nucleotides will bind to their 
complement under normal conditions, so two perfectly 
complementary strands will bind to each other readily. 
IDE (Integrated Development Environment). Environment used for 
developing code for any application. 
 A list of coordinates relating to a specific image, created in order to 
hyperlink areas of the image to various destinations (as opposed to a 
age links to a 
single destination). 
Introns Non-coding regions of DNA. 
Java API (Java Application Programming Interface). Collection of ready-made 
software components that provide several capabilities. 
JDBC (Java Database Connectivity). Set of communication protocols 
 
Image map
normal image link, in which the entire area of the im
 73
between a java program and a database. 
(Joint PhotogJPEG raphic Experts Group) is a commonly used standard 
JSP gy which mixes 
JVM programs to 
JVM ter and runtime system, which lets 
LAMP 
LocusLink rmation about 
Microarray  spots attached to a physical 
Microarray 
experiment 
ing gene expression in a system under controlled 
time, stimulus, developmental stage, or 
MIDAS  TM4 suite. 
method of lossy compression for photographic images. The file 
extensions for this format are .jpeg or .jpg. 
(Java Server Pages). A java programming technolo
static HTML with dynamically-generated HTML. 
(Java Virtual Machine). The interpreter which lets java 
run on any platform where it has already been installed. 
(Java Virtual Machine). An interpre
java programs run on any platform it has been ported to. 
(Linux + Apache + MySQL + Python). A platform used by the 
microarray data analysis tool, WebArray. 
Single query interface to sequence and descriptive info
genetic loci. 
A collection of microscopic DNA
substrate such as glass, plastic or silicon chip forming an array. 
Microarrays are hybridized with labeled samples and then scanned 
and analyzed to generate data. 
An experiment study
conditions to factors such as 
dosage on a sample. 
(Microarray Data Analysis System). Forms a part of the
A java application which provides users an intuitive interface to 
 74
design analysis processes combining one or more normalization and 
miRNA 
pression of other genes.  
A 
nthesis to undergo 
which a gene 
MySQL 
zation 
riation from the microarray 
Oligo tide) Short sequence of nucleotides (<80 base pairs) 
Protein 
gical functions of all living cells and 
PubMed 
from MEDLINE and other life science journals 
RNA 
filtering steps. 
(micro RNA) is a form of single-stranded RNA which is 20-25 
nucleotides long and which regulates the ex
mRN (messenger RNA) is RNA that encodes and carries information from 
DNA during transcription to sites of protein sy
translation in order to yield a gene product. The amount of any 
particular type of mRNA in a cell reflects the extent to 
has been “expressed”.  
A very popular open source database server. 
Normali The process used to standardize microarray data by removing the 
effect of all sources of non-biological va
data, making them comparable. 
(Oligonucleo
always single stranded to be used as probes or spots. 
A complex, high-molecular-weight, organic compound that consists 
of amino acids joined by peptide bonds. Proteins perform a wide 
variety of structural and biolo
viruses. 
A service of the U.S. National Library of Medicine that includes over 
16 million citations 
for biomedical articles back to the 1950s.  
(Ribonucleic acid) A class of nucleic acids that consist of nucleotides 
 75
containing the bases- adenine (A), guanine (G), cytosine (C), and 
uracil (U). An RNA molecule is typically single-stranded and can pair 
Servlet 
SQL guage). A standard computer language for 
Microarray Data Analysis System 
Transcription  from DNA into 
Translation 
ce a specific protein according to the rules specified by the 
tRNA 
tide chain at the ribosomal 
with DNA, another RNA molecule, or form secondary structure by 
hybridizing to itself. 
Rserve Rserve is a TCP/IP server which allows other programs to use R 
facilities without the need to initialize or link against R library. 
Java program that runs on a web server and which is used to build 
web pages. 
(Structured Query Lan
accessing and manipulating databases. 
TM4 A suite of analysis tools developed to handle all aspects of microarray 
process. Includes four major applications, Microarray Data Manager 
(MADAM), TIGR_Spotfinder, 
(MIDAS), and Multiexperiment Viewer (MeV). 
The process in which transfer of genetic information
RNA takes place. It is the beginning of the process that ultimately 
leads to the translation of the genetic code into a protein. 
The second process of protein synthesis, in which mRNA is decoded 
to produ
genetic code. Translation is preceded by transcription. 
A small RNA chain (approximately 75 bp in length) that transfers a 
specific amino acid to a growing polypep
site of protein synthesis during translation. 
 76
CURRICULUM VITAE 
Date of Birth  20, 1982 
ce of Birth 
Undergraduate  India 
Graduate Study
ter Science 
Experience 
uisville, KY
 
 
VASUNDHARA AKKINENI 
 
May
Pla Madras, India 
 Study University of Madras, Madras,
B.Tech. Information Technology 
(1999-2003) 
 University of Louisville, Louisville, Kentucky 
M.S. Computer Engineering and Compu
(2003-2006) 
IT Digitization Intern, GE Energy, Atlanta, GA 
(Jan, 2005 - July, 2005) 
QA Analyst Intern, Yum! Brands Inc, Louisville, KY 
(June, 2004 – Dec, 2004) 
Student Assistant, University of Louisville, Lo
(Aug, 2003 – May, 2004)  
 77