APPEX: Analysis Platform for identification of Prognostic gene EXpression signature in cancer
Please cite your use of APPEX in your publication:
Seon-Kyu Kim, Jong Hwan Kim, Seok-Joong Yun, Wun-Jae Kim and Seon-Young Kim.
APPEX: analysis platform for the identification of prognostic gene expression signatures in cancer. Bioinformatics. 2014 Nov 15;30(22):3284-6.
You can download a user manual containing a full description of APPEX: appex_document.pdf
You can also select an appropriate APPEX workflow using the following schematic diagram as a guideline:
Typical analysis cases for selecting APPEX workflows.
Contents
(4) Supporting analysis methods
2.3. Analysis methods in detail
5. Supporting analysis methods
5.1. Cox proportional hazard model
5.2. In-trans correlation approach
5.4. Time dependent ROC curves
8. Downloading example datasets
8.1. Column-oriented dataset (single molecule): example1.column.single_mol.zip
8.2. Row-oriented dataset (multiple molecules): example2.row.multi_mol.zip
Identifying robust molecular signatures that predict cancer patients’ outcomes is profoundly important, because cancer patients have heterogeneous clinical courses even when they share similar clinico-pathological characteristics. With a prognostic molecular signature, cancer patients can be treated more effectively. For example, the Oncotype DX breast cancer assay is now performed in the clinic to predict the clinical behavior of a breast cancer patient (1). Developing molecular signatures that predict a patient’s response to treatments such as chemotherapy or radiotherapy is also important, because such signatures can be used to predict treatment effectiveness, select drugs, and prevent side effects.
While many researchers have tried to develop robust prognostic and predictive signatures from genomics data (2-12), there is no suitable web-based analysis tool that supports signature development. Currently, most researchers use either commercial programs such as SPSS or Matlab, or an open-source scripting language such as R, for statistical analysis. For genome-wide analyses, several tools including BRB-arrayTools (Excel plugin) (13), TM4 (Java-based standalone) (14), and GEPAS (web-based platform) (15) have been widely used. However, for many investigators, particularly clinicians and oncologists, doing proper statistical analyses with publicly available tools can be a daunting task. In addition, most genome-wide analysis tools are not equipped to identify prognostic signatures by survival analysis. We therefore constructed APPEX, a web-based software platform that helps researchers identify prognostic or predictive molecular signatures from genomics data. APPEX was designed to be easy to use, flexible, and freely available for advanced statistical survival analyses.
APPEX is a web-based platform for survival analysis, and in particular for identifying molecular signatures significantly associated with cancer patients’ outcomes. APPEX provides various analysis methods to discover genes or other molecules associated with the survival of cancer patients. Currently, APPEX supports seven analyses: the Cox proportional hazard model (for a single molecule and for multiple molecules) (16), Super-PC (17), in-trans correlation analysis (for a single molecule and for multiple molecules) (7, 18), time-dependent ROC analysis (19), and multivariate Cox regression analysis (16). Although the main data type for APPEX is gene expression intensity obtained from cancer patients together with their follow-up times, APPEX is also applicable to any other continuous numeric signal intensities with time-to-event information.
APPEX is mainly designed for clinicians and oncologists who investigate cancer behavior and are interested in discovering prognostic or predictive signatures. It provides a user-friendly graphical interface similar to a desktop application, so users can easily handle their own data in APPEX even if they are not familiar with statistical analysis packages such as SPSS or R. APPEX produces various charts and figures as well as downloadable data tables that list the molecules significantly associated with survival in each analysis. To serve diverse users, from those who want to estimate the prognostic value of a single factor to those who want to find a set of molecules associated with survival, APPEX supports both a simple copy/paste approach for single-factor analysis and file upload with configuration for identifying multiple factors. APPEX defines two easy and flexible data formats: column-oriented and row-oriented tab-delimited text (for more information, click the link).
As for users’ personal information, APPEX does not operate a user login system and does not require any user information except an e-mail address, which is used to alert the user as soon as a time-consuming job completes. Instead, APPEX uses a “connection ID” that is generated automatically when the user accesses APPEX. With this auto-generated connection ID, a user can always perform an analysis, access previous analysis results, or remove analysis histories. Users are responsible for managing their own connection IDs; APPEX takes no responsibility for them (for more information, click the link).
In summary, APPEX is well suited to discovering significant novel factors that predict the clinical behavior of cancer patients from continuous numeric intensity data accompanied by patients’ follow-up time information.
When you access the APPEX website, the main page presents two buttons, as shown in the following figure:
Figure 1. Main page of APPEX website
(1) APPEX analyzer button: Clicking this button opens the APPEX analyzer dialog. The APPEX analyzer is the starting point for analyses of the user’s own data.
(2) Public dataset explorer button: Clicking this button opens a dialog listing public datasets that contain numeric intensities and follow-up time information. The public datasets were collected from the NCBI GEO public data repository.
If you click the APPEX analyzer button on the main page, the APPEX analyzer dialog opens. This is where you perform survival analyses, as shown in the following figure:
Figure 2. APPEX analyzer
To perform an analysis in the APPEX analyzer, choose one of the analysis methods that APPEX supports. Clicking the main menu button shows the following menu list.
Figure 3. APPEX selective menu
As shown in Figure 3, the APPEX analyzer menu consists of a sub-menu of the seven analyses, a button to open public datasets, a button to change the connection ID, and a button to quit the APPEX analyzer. By clicking a menu item, you can run an analysis on your own data or a public dataset, change the current connection ID, or terminate the APPEX analyzer.
APPEX manages multiple user sessions with a connection ID, which the system generates automatically when a user accesses APPEX. When you access the APPEX website for the first time, the APPEX analyzer shows your initial connection ID, as in the following figure:
Figure 4. Connection ID field
If you perform several analyses, all of the results are stored on the APPEX server under the current connection ID. To access previous analysis results, you need the connection ID that was in use at the time of the analysis; replace the current connection ID with that earlier one, and the previous analysis histories appear in the left panel of the APPEX analyzer. To change the connection ID, click “Change connection ID” either in the APPEX menu or in the upper toolbar of the APPEX analyzer. The following dialog window then opens:
Figure 5. Connection ID setup dialog
The connection ID shown in the upper part of the dialog is the current ID. If you have a connection ID from a previous analysis, enter it in the text field in the lower part of the dialog. The APPEX analyzer then shows the analysis history tree containing your previous analysis results.
To record the history of a user’s analyses and to allow later access once a time-consuming survival analysis has finished, APPEX keeps analysis histories, indexed by connection ID, for a limited period of two months. Within that period, users can freely access their own previous analysis results or remove histories. The analysis history tree is located in the left panel of the APPEX analyzer. It consists of two folders: one holds analysis results and the other holds the data uploaded by the user.
Figure 6. Tree panel of analysis history
Currently, APPEX supports seven survival analyses to detect significant signatures. It also provides analysis results for public datasets. We define a short name for each analysis as follows:
1) CoxSingle: Cox proportional hazard model to estimate the prognostic value of a single factor.
2) CoxMulti: Cox proportional hazard model to estimate the prognostic value of multiple factors. A typical genome-wide expression matrix (column: sample; row: gene) can be applied.
3) SuperPC: Semi-supervised method to predict patient survival. A typical genome-wide expression matrix (column: sample; row: gene) can be applied.
4) IntransSingle: Estimation of prognostic value using in-trans molecules correlated with a single factor. A typical genome-wide expression matrix (column: sample; row: gene) can be applied.
5) IntransMulti: Estimation of prognostic value using in-trans molecules correlated with multiple factors. A typical genome-wide expression matrix (column: sample; row: gene) can be applied.
6) TimeRoc: Time-dependent ROC analysis. A typical genome-wide expression matrix (column: sample; row: gene) can be applied.
7) Multivariate: Multivariate Cox proportional hazard model.
The next section describes how to use each analysis method. All analysis methods follow the same typical analysis flow, shown in the following scheme:
Figure 7. Schematic diagram of APPEX analyzer
CoxSingle is a survival analysis based on the Cox proportional hazard model that estimates the prognostic value of a single factor (a molecule). It is a simple and fast way for clinicians and oncologists to estimate the prognostic value of a molecule. CoxSingle requires column-oriented, tab-delimited text data. Users can enter data by copying and pasting it into the website or by uploading a file that contains numeric intensities, censor, and follow-up time information. For more information on the column-oriented data format, please click the link.
As the first step, click the “CoxSingle” button in the APPEX analyzer. You can also select the menu item “Simple Cox proportional hazard model (Single molecule)” from the main menu. The APPEX analyzer then shows a panel for data uploading, as in the following figure:
Figure 8. Dialog of data uploading for CoxSingle
In the upload dialog, you can either copy and paste your data into the upper text area (Figure 9) or upload a text file by clicking the “Browse…” button in the lower file uploading panel (Figure 10). The data must be in the column-oriented text format.
Figure 9. Copy and pasted text area on a dialog
Figure 10. File uploading on a dialog
When you click “Go to next step” (copy and paste) or “Upload Data”, your data is uploaded to the APPEX server and the APPEX analyzer shows a dialog for configuring your data properties, as in the following figure:
Figure 11. Column identification and parameter setup
In the parameter setup dialog, select a property for each column. At least four columns must be designated: “Patient ID”, “Survival Time”, “Censor (death:1/alive:0)”, and “Intensity value” (Figure 11). In addition, select the patient division method by which the patients in your data are divided into two groups (high or low intensity). Finally, use the check box on the parameter setup panel to indicate whether your data contains a header line. Once CoxSingle is configured, click “Perform analysis!” to run the analysis based on the Cox proportional hazard model. APPEX shows a small progress panel for a few seconds and then presents an analysis result tab that includes a summary of the user’s input parameters, the estimated prognostic value of the molecule, and various charts. For a CoxSingle analysis, APPEX provides the hazard ratio, the p-value from Cox regression analysis, the p-value from the log-rank test with a Kaplan-Meier plot, a bar plot of signal intensities, a receiver operating characteristic (ROC) curve with its area under the curve (AUC) value, and a box plot of the two patient groups with a two-sample t-test p-value (Figure 12).
Figure 12. An example of analysis result based on Cox proportional hazard model and supporting charts
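The patient division and log-rank steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the APPEX implementation: it assumes a median split as the division method (APPEX lets the user choose the method), and the function names and the toy data are hypothetical.

```python
import math
from statistics import median


def median_split(values):
    """Label each patient 'high' or 'low' relative to the median intensity (assumed method)."""
    cut = median(values)
    return ["high" if v > cut else "low" for v in values]


def logrank_test(times, events, groups):
    """Two-group log-rank test. events: 1 = death, 0 = censored (alive).

    Returns (chi-square statistic, p-value) with one degree of freedom.
    """
    observed = expected = variance = 0.0
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    for t in event_times:
        n = sum(1 for x in times if x >= t)  # patients still at risk
        n1 = sum(1 for x, g in zip(times, groups) if x >= t and g == "high")
        d = sum(1 for x, e in zip(times, events) if x == t and e == 1)
        d1 = sum(1 for x, e, g in zip(times, events, groups)
                 if x == t and e == 1 and g == "high")
        observed += d1
        expected += d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi2 = (observed - expected) ** 2 / variance
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Toy data (hypothetical): intensity, follow-up time, censor (death:1/alive:0).
intensity = [9.1, 8.7, 8.9, 9.4, 2.1, 2.5, 1.9, 2.2]
time = [5, 8, 12, 15, 40, 45, 50, 60]
censor = [1, 1, 1, 1, 0, 1, 0, 0]

groups = median_split(intensity)
chi2, p_value = logrank_test(time, censor, groups)
print(groups, round(chi2, 3), round(p_value, 4))
```

The censor coding follows the “Censor (death:1/alive:0)” convention of the parameter setup dialog. The hazard ratio, ROC curve, and t-test that APPEX also reports are not covered by this sketch.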
After an analysis, its entry appears in the analysis history tree in the left panel of the APPEX analyzer. You can reopen it later, or remove it from the APPEX analyzer by right-clicking it (Figure 13).
Figure 13. Tree panel of analysis history and popup menu for removing by clicking right mouse button
“CoxMulti” is an analysis based on the Cox proportional hazard model that estimates the prognostic value of multiple factors (molecules). If you have censor and follow-up time information together with a data matrix of genome-wide expression values, CoxMulti is the typical approach for estimating the prognostic value of each molecule. CoxMulti requires row-oriented, tab-delimited text data. The user uploads a file that contains censor, follow-up time information, and genome-wide (multiple genes) expression data. For more information on the row-oriented data format, please click the link.
As the first step, click the “CoxMulti” button in the APPEX analyzer. You can also select the menu item “Cox proportional hazard model (Multiple molecules)” from the main menu. The APPEX analyzer then shows a panel for data uploading, as in the following figure:
Figure 14. Dialog of data uploading for CoxMulti
In the upload dialog, you can upload a text file by clicking the “Browse…” button in the upper file uploading panel. You can also choose one of your datasets previously stored on the APPEX server by double-clicking an item in the lower tree panel (Figure 14). The uploaded file must be in the row-oriented text format. When you click “Upload Data” or double-click an item in the stored list, the selected data is uploaded to the APPEX server and the APPEX analyzer shows a dialog for configuring your data properties, as in the following figure:
Figure 15. Line identification and parameter setup
In the parameter setup dialog, select a property for each row of your data. At least four lines must be designated: “Patient ID”, “Survival Time”, “Censor (death:1/alive:0)”, and “Data start line” (Figure 15). In particular, the clinical information (patient ID, censor, and survival time) must be located above the data start line. In addition, enter the cut-off p-value used to select statistically significant molecules, select the molecule type (such as gene symbol or RefSeq ID), and enter your email address to receive a message when the analysis completes. After you click “Perform analysis!”, APPEX shows an analysis progress tab that includes a summary of the user’s data and input parameters (Figure 16).
Figure 16. Summary of your analysis and a progress bar in CoxMulti
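To make the row-oriented layout concrete, the following Python sketch splits such a file into its clinical rows and its molecule rows. It is an illustration only: the function name, the assumption that the first column of every row holds a label, and the toy values are hypothetical and are not taken from APPEX. Line numbers are 1-based, as a user would count them in the setup dialog.

```python
import csv
import io


def parse_row_oriented(text, id_line, time_line, censor_line, data_start_line):
    """Split row-oriented, tab-delimited text into clinical rows and molecule rows.

    Assumes (hypothetically) that the first column of each row is a row label.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    if max(id_line, time_line, censor_line) >= data_start_line:
        raise ValueError("clinical rows must be located above the data start line")
    patients = rows[id_line - 1][1:]
    times = [float(x) for x in rows[time_line - 1][1:]]
    censors = [int(x) for x in rows[censor_line - 1][1:]]
    # Every row from the data start line onward is one molecule.
    expression = {
        row[0]: [float(x) for x in row[1:]]
        for row in rows[data_start_line - 1:]
        if row
    }
    return patients, times, censors, expression


# Toy dataset: three patients, two genes (values are made up).
example = (
    "PatientID\tP1\tP2\tP3\n"
    "Time\t12\t30\t45\n"
    "Censor\t1\t0\t1\n"
    "E2F1\t7.2\t5.1\t6.3\n"
    "FOXM1\t8.0\t4.4\t5.9\n"
)

patients, times, censors, expression = parse_row_oriented(
    example, id_line=1, time_line=2, censor_line=3, data_start_line=4
)
print(patients, times, censors, sorted(expression))
```

The check on line positions mirrors the requirement that clinical information sits above the data start line.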
CoxMulti is a time-consuming job whose duration depends on the number of molecules in the uploaded data. When the analysis completes, you receive an email message that includes the connection ID and analysis ID needed to access the result. The APPEX analyzer presents a table of the statistically significant molecules correlated with patients’ survival (Figure 17). You can download the table by clicking “Click to download table”. When you click “Survival Curve” in a table row, APPEX runs CoxSingle for the selected molecule (Figure 12).
Figure 17. Table view of significant molecules by CoxMulti
The “IntransSingle” analysis estimates the prognostic value of a driving candidate (driver) and its associated molecules (effectors) in disease events. IntransSingle uses a correlation-based approach to select the genes associated with a candidate molecule. Then, using the selected gene set (a prognostic signature), APPEX performs unsupervised hierarchical clustering to divide all samples into two clusters based on their numeric intensities. Finally, APPEX estimates the prognostic value of this signature using the log-rank test, a Kaplan-Meier plot, and two-group box plots. IntransSingle requires row-oriented, tab-delimited text data. The user uploads a file that contains censor, follow-up time information, and genome-wide (multiple genes) expression data. For more information on the row-oriented data format, please click the link.
As the first step, click the “IntransSingle” button in the APPEX analyzer. You can also select the menu item “In-trans correlation analysis (Single molecule)” from the main menu. The APPEX analyzer then shows a panel for data uploading, as in the following figure:
Figure 18. Dialog of data uploading for IntransSingle
In the upload dialog, you can upload a text file by clicking the “Browse…” button in the upper file uploading panel. You can also choose one of your datasets previously stored on the APPEX server by double-clicking an item in the lower tree panel (Figure 18). The uploaded file must be in the row-oriented text format. When you click “Upload Data” or double-click an item in the stored list, the selected data is uploaded to the APPEX server and the APPEX analyzer shows a dialog for configuring your data properties, as in the following figure:
Figure 18. Line identification and parameter setup for IntransSingle
In the parameter setup dialog, select a property for each row of your data. At least four lines must be designated: “Patient ID”, “Survival Time”, “Censor (death:1/alive:0)”, and “Data start line” (Figure 18). In particular, the clinical information (patient ID, censor, and survival time) must be located above the data start line. In addition, enter the parameters needed to perform IntransSingle, as follows:
(1) Cor.coefficient (r): the correlation coefficient cut-off used to select associated molecules. Enter a value from 0 to 1. For example, if 0.4 is entered, APPEX selects molecules whose correlation coefficient with the candidate is greater than 0.4 or less than -0.4.
(2) P-value: the significance level for the correlation test. APPEX selects molecules whose p-value is lower than the value entered.
(3) Driving candidate molecule: the name of the driving candidate factor. Enter an identifier that exists in the uploaded data matrix.
(4) Molecular Id type: select one of the ID types. APPEX handles the following identifiers: Gene symbol, Entrez Gene ID, RefSeq, Unigene, Affymetrix ID, Illumina ID, and Agilent ID.
(5) Similarity metric: the similarity metric for hierarchical cluster analysis. APPEX supports the following metrics: pearson, euclidean, manhattan, canberra, abspearson, spearman, and kendall.
(6) Linkage method: the linkage method for hierarchical cluster analysis. APPEX supports the following methods: single, complete, average, ward, median, mcquitty, and centroid.
(7) Email address: your email address, used to send a message when the analysis completes.
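The correlation cut-off in parameter (1) can be illustrated with a short Python sketch. It assumes Pearson correlation and applies only the |r| cut-off; the p-value filter of parameter (2) and the clustering step are omitted. The function names and expression values are hypothetical and are not part of APPEX.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant profile carries no correlation information
    return sxy / math.sqrt(sxx * syy)


def select_in_trans(expression, driver, r_cutoff):
    """Return {molecule: r} for molecules with r > cutoff or r < -cutoff."""
    reference = expression[driver]
    selected = {}
    for name, values in expression.items():
        if name == driver:
            continue
        r = pearson(reference, values)
        if abs(r) > r_cutoff:
            selected[name] = r
    return selected


# Made-up expression values for five patients.
expression = {
    "E2F1": [1.0, 2.0, 3.0, 4.0, 5.0],     # driving candidate
    "CCNB1": [2.0, 4.0, 6.0, 8.0, 10.5],   # strongly positively correlated
    "S100A8": [5.0, 4.0, 3.0, 2.0, 1.0],   # strongly negatively correlated
    "GAPDH": [3.0, 1.0, 4.0, 1.0, 5.0],    # weakly correlated, below the cut-off
}

selected = select_in_trans(expression, driver="E2F1", r_cutoff=0.4)
print(sorted(selected))
```

Both positively and negatively correlated molecules pass the filter, which matches the two-sided cut-off described for the Cor.coefficient field.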
After you click “Perform analysis!”, APPEX shows a progress tab that includes a summary of the user’s data and input parameters (Figure 19).
Figure 19. Summary of your analysis and a progress bar in IntransSingle
IntransSingle takes some time to run, but it is relatively light compared with heavier jobs such as CoxMulti. Its analysis time depends on the size of the uploaded data. When the analysis completes, you receive an email message that includes the connection ID and analysis ID needed to access the result. The APPEX analyzer presents several charts and a table of the molecules significantly associated with the driving candidate (Figure 20). You can download the table by clicking “Click to download table”. When you click “Survival Curve” in a table row, APPEX runs CoxSingle for the selected molecule (Figure 12).
Figure 20. Charts and a table obtained from IntransSingle process
“IntransMulti” is an extended version of IntransSingle that estimates the prognostic values of user-supplied driving candidates in a disease event. IntransMulti runs the IntransSingle process for each driving candidate in turn and estimates its prognostic value. It is suitable when you have not yet settled on a specific disease-driving candidate within a gene set. In theory, every gene or probe ID in the uploaded dataset could be set as a driving candidate. However, that requires enormous resources and time, so APPEX currently limits an IntransMulti run to at most 200 driving candidates. In our performance test, IntransMulti needed about 6 days to finish on a dataset of 28,000 genes and 100 patients when all 28,000 genes were set as driving candidates. IntransMulti requires row-oriented, tab-delimited text data. The user uploads a file that contains censor, follow-up time information, and genome-wide (multiple genes) expression data. For more information on the row-oriented data format, please click the link.
As the first step, click the “IntransMulti” button in the APPEX analyzer. You can also select the menu item “In-trans correlation analysis (Multiple molecules)” from the main menu. The APPEX analyzer then shows a panel for data uploading, as in the following figure:
Figure 21. Dialog of data uploading for IntransMulti
In the upload dialog, you can upload a text file by clicking the “Browse…” button in the upper file uploading panel. You can also choose one of your datasets previously stored on the APPEX server by double-clicking an item in the lower tree panel (Figure 21). The uploaded file must be in the row-oriented text format. When you click “Upload Data” or double-click an item in the stored list, the selected data is uploaded to the APPEX server and the APPEX analyzer shows a dialog for configuring your data properties, as in the following figure:
Figure 21. Line identification and parameter setup for IntransMulti
In the parameter setup dialog, select a property for each row of your data. At least four lines must be designated: “Patient ID”, “Survival Time”, “Censor (death:1/alive:0)”, and “Data start line” (Figure 21). In particular, the clinical information (patient ID, censor, and survival time) must be located above the data start line. In addition, enter the parameters needed to perform IntransMulti, as follows:
(1) Cor.coefficient (r): the correlation coefficient cut-off used to select associated molecules. Enter a value from 0 to 1. For example, if 0.4 is entered, APPEX selects molecules whose correlation coefficient with a candidate is greater than 0.4 or less than -0.4.
(2) P-value: the significance level for the correlation test. APPEX selects molecules whose p-value is lower than the value entered.
(3) Similarity metric: the similarity metric for hierarchical cluster analysis. APPEX supports the following metrics: pearson, euclidean, manhattan, canberra, abspearson, spearman, and kendall.
(4) Linkage method: the linkage method for hierarchical cluster analysis. APPEX supports the following methods: single, complete, average, ward, median, mcquitty, and centroid.
(5) Molecule Id type: select one of the molecule types. APPEX handles the following identifiers: Gene symbol, Entrez Gene ID, RefSeq, Unigene, Affymetrix ID, Illumina ID, and Agilent ID.
(6) Driving candidate molecule list: the list of driving candidate factors. Enter identifiers that exist in the uploaded data matrix. Currently, the maximum number of driving candidate identifiers is 200, and identifiers are delimited by a carriage return or new line (‘\r’ or ‘\n’).
(7) Email: your email address, used to send a message when the analysis completes.
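The rules for the driving candidate list in parameter (6) can be sketched as a small validation routine. This is a hypothetical illustration rather than APPEX code: the function name is invented, and dropping duplicate identifiers is an assumption that the manual does not state.

```python
import re

MAX_CANDIDATES = 200  # limit stated for IntransMulti


def parse_candidates(raw, known_ids):
    """Split a pasted list on carriage returns or new lines and validate it."""
    names = [s.strip() for s in re.split(r"[\r\n]+", raw) if s.strip()]
    unique = list(dict.fromkeys(names))  # assumption: drop duplicates, keep order
    if len(unique) > MAX_CANDIDATES:
        raise ValueError(f"at most {MAX_CANDIDATES} driving candidates are allowed")
    missing = [n for n in unique if n not in known_ids]
    if missing:
        raise ValueError(f"identifiers not found in the data matrix: {missing}")
    return unique


# Identifiers assumed to be present in the uploaded data matrix.
known_ids = {"E2F1", "S100A8", "CCNB1", "FOXM1", "GAPDH"}

candidates = parse_candidates("E2F1\r\nS100A8\nCCNB1\rFOXM1\n", known_ids)
print(candidates)
```

Splitting on both ‘\r’ and ‘\n’ lets the same routine accept lists pasted from Windows, Unix, or classic Mac text editors.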
After you click “Perform analysis!”, APPEX shows an analysis progress tab that includes a summary of the user’s data and input parameters (Figure 22).
Figure 22. Summary of your analysis and a progress bar in IntransMulti
IntransMulti is a time-consuming job whose duration depends on the size of the uploaded data. When the analysis completes, you receive an email message that includes the connection ID and analysis ID needed to access the result. The APPEX analyzer presents a table showing, for each driving candidate entered by the user, its prognostic value and the number of in-trans genes correlated with it (Figure 23). You can download the table by clicking “Click to download table”. When you click “Survival Curve” in a table row, APPEX runs IntransSingle for the selected molecule (Figure 20). The table view in Figure 23 was obtained with four genes (E2F1, S100A8, CCNB1, and FOXM1) entered as driving candidates.
Figure 23. Table view of estimated prognostic values of user-inputted molecules as disease driver
“SuperPC” is a method for selecting molecules significantly associated with patient survival. It carries out prediction by “supervised principal components”, and can predict either a censored survival outcome or a quantitative outcome. It is especially useful for correlating patient survival or other quantitative parameters with gene expression data. The methodology is described in detail in (17). SuperPC requires row-oriented, tab-delimited text data. For more information on the row-oriented data format, please click the link. Because SuperPC includes cross-validation and prediction steps, the user must prepare a single dataset that contains both a training set and a validation set. APPEX asks the user to select the start column of the training set and of the validation set. The uploaded data must also contain censor, follow-up time information, and genome-wide (multiple genes) expression data.
As the first step, click the “SuperPC” button in the APPEX analyzer. You can also select the menu item “Super-PC analysis” from the main menu. The APPEX analyzer then shows a panel for data uploading, as in the following figure:
Figure 24. Dialog of data uploading for SuperPC
In the upload dialog, you can upload a text file by clicking the “Browse…” button in the upper file uploading panel. You can also choose one of your datasets previously stored on the APPEX server by double-clicking an item in the lower tree panel (Figure 24). The uploaded file must be in the row-oriented text format. When you click “Upload Data” or double-click an item in the stored list, the selected data is uploaded to the APPEX server and the APPEX analyzer shows a dialog for configuring your data properties, as in the following figure:
Figure 25. Line identification and parameter setup for SuperPC
In the parameter setup dialog, select a property for each row of your data. At least four lines must be designated: “Patient ID”, “Survival Time”, “Censor (death:1/alive:0)”, and “Data start line” (Figure 25). In particular, the clinical information (patient ID, censor, and survival time) must be located above the data start line. In addition, two columns must be designated as “Training-set start column” and “Test-set start column” so that APPEX can identify the two datasets within the uploaded data. You should also select the molecule type (such as gene symbol or RefSeq ID) and enter your email address to receive a message when the analysis completes. After you click “Perform analysis!”, APPEX shows an analysis progress tab that includes a summary of the user’s data and input parameters (Figure 26).
Figure 26. Summary of your analysis and a progress bar in SuperPC
SuperPC is a time-consuming job whose duration depends on the number of molecules in the uploaded data. When the analysis completes, you receive an email message that includes the connection ID and analysis ID needed to access the result. The APPEX analyzer presents several charts produced during the SuperPC process and a table of the highly significant molecules correlated with patients’ survival (Figure 27). You can download the table by clicking “Click to download table”. When you click “Survival Curve” in a table row, APPEX runs CoxSingle for the selected molecule (Figure 12).
Figure 27. Charts and a table obtained from SuperPC process
“TimeRoc” stands for time-dependent ROC curves for censored survival data and a diagnostic marker (19). ROC curves are a popular way to display the sensitivity and specificity of a diagnostic marker. Many disease outcomes, including cancer, are time dependent, which means that ROC curves may differ from one time point to another. TimeRoc calculates a ROC curve, with its sensitivities and specificities, at a specific time point (e.g. 3 years or 36 months), and then estimates prognostic values for all molecules in a genome-wide expression dataset. TimeRoc requires row-oriented, tab-delimited text data. The user uploads a file that contains censor, follow-up time information, and genome-wide (multiple genes) expression data. For more information on the row-oriented data format, please click the link.
As the first step, click the “TimeRoc” button in the APPEX analyzer. You can also select the menu item “Time-dependent ROC analysis” from the main menu. The APPEX analyzer then shows a panel for data uploading, as in the following figure:
Figure 28. Dialog of data uploading for TimeRoc
In the upload dialog, you can upload a text file by clicking the “Browse…” button in the upper file uploading panel. You can also choose one of your datasets previously stored on the APPEX server by double-clicking an item in the lower tree panel (Figure 28). The uploaded file must be in the row-oriented text format. When you click “Upload Data” or double-click an item in the stored list, the selected data is uploaded to the APPEX server and the APPEX analyzer shows a dialog for configuring your data properties, as in the following figure:
Figure 29. Line identification and parameter setup for TimeRoc
In the parameter setup dialog, select a property for each row of your data. At least four lines must be designated: “Patient ID”, “Survival Time”, “Censor (death:1/alive:0)”, and “Data start line” (Figure 29). In particular, the clinical information (patient ID, censor, and survival time) must be located above the data start line. In addition, enter the parameters needed to perform TimeRoc, as follows:
(1) Survival estimation method: the method used to estimate survival. Nearest Neighbor Estimation (NNE) or Kaplan-Meier (KM) can be selected.
(2) Time point: the time point at which survival is estimated. Enter a value in the same time scale as your data (e.g. 3 years or 36 months).
(3) AUC value: the cut-off value of the area under the curve (AUC) used to select significant molecules. A value from 0 to 1 is valid.
(4) Molecule Id type: select one of the molecule types. APPEX handles the following identifiers: Gene symbol, Entrez Gene ID, RefSeq, Unigene, Affymetrix ID, Illumina ID, and Agilent ID.
(5) Email: your email address, used to send a message when the analysis completes.
After you click the “Perform analysis!” button, APPEX shows an analysis progress tab that includes
a summary of your data and input parameters (Figure 30).
Figure 30.
Summary of your analysis and a progress bar in TimeRoc
TimeRoc is a time-consuming job whose duration depends on the number of
molecules in the uploaded data. After the analysis is complete, you will receive
an email message that includes the connection ID and analysis ID needed to
access the result. The APPEX analyzer presents a table of the statistically
significant molecules correlated with patients’ survival (Figure 31). You can
download the table by clicking the “Click to download table” button. When you
click the “Survival Curve” button in the table, APPEX carries out the CoxSingle
process for the selected molecule (Figure 12).
Figure 31.
Table view of significant molecules by TimeRoc
“Multivariate” performs a multivariate analysis, in which multiple clinical
factors such as age, gender, stage, grade, or drug treatment can be handled
together. The aim of multivariate analysis is to identify associations between
clinical factors and to estimate how robust a factor (molecule) is for survival
prediction once several clinical factors are considered together with it.
Multivariate in APPEX is based on the Cox proportional hazard model (16).
Multivariate analysis is a simple, fast, and widely used survival analysis
method in clinical investigation. To perform a Multivariate process, APPEX
requires column-oriented, tab-delimited text data. You can either copy and paste
the data or upload a file that contains the survival time, censor, and the
factors of interest. The values of the factors of interest should be binary
(0 or 1). For more information on the column-oriented data format, please click
the link.
As the first step, click the button named “Multivariate” on the APPEX analyzer.
You can also select the menu item “Multivariate Cox regression analysis” from
the main menu. The APPEX analyzer then shows a panel for data uploading, as in
the following figure:
Figure 32.
Dialog of data uploading for Multivariate
In the data-uploading dialog, you can copy and paste your data into the upper
text area (Figure 33) or upload a text file by clicking the “Browse…” button in
the lower file-uploading panel (Figure 34). The data must be in the
column-oriented text format.
Figure 33.
Copy and pasted text area on a dialog
Figure 34.
File uploading on a dialog
When you click the “Go to next step” button (copy and paste) or the “Upload
Data” button, your data are uploaded to the APPEX server, and the APPEX analyzer
shows a dialog for configuring your data properties, as in the following figure:
Figure 34.
Column identification and parameter setup for multivariate analysis
In the parameter-setup dialog, you should select the property of each column. At
least three columns should be designated as “Survival Time”, “Censor
(death:1/alive:0)”, and “Interest factor” (Figure 34). More than one column can
be designated as “Interest factor”. In addition, you have to indicate whether
your data contain a header line by using the check box on the parameter-setup
panel. After configuring Multivariate, click the “Perform analysis!” button to
perform a multivariate analysis based on the Cox proportional hazard model.
APPEX shows a small progress panel for a few seconds and then presents an
analysis result tab that includes a summary of your input parameters and a
result table of the multivariate analysis. In this analysis, APPEX provides the
hazard ratio, 95% confidence interval (lower and upper values), and p-value from
the Cox regression analysis for each estimated factor (Figure 35). You can
download the table by clicking the “Click to download table” button.
Figure 35.
An example of analysis result of multivariate analysis based on Cox
proportional hazard model
We have collected public datasets that contain numeric intensities and
follow-up time information from NCBI GEO, a public data repository. You can
select one of the datasets stored in the APPEX database to perform a survival
analysis on the APPEX analyzer. To select a dataset and apply it to an analysis,
click the button named “Public datasets” on the APPEX analyzer. You can also
select the menu item “Open public datasets” from the main menu. The APPEX
analyzer then shows a panel listing the public datasets, as in the following
figure:
Figure 36.
Dialog of public dataset list
In the public-datasets dialog, you can select one of the datasets and click the
“Perform analysis!” button in the right column. APPEX then shows a pop-up menu
in which you can choose an analysis method to perform. When you click one of the
analysis methods, APPEX takes a moment to load the dataset and then presents the
configuration dialog associated with the selected analysis method. For the
configuration options and flow of each analysis, please refer to the earlier
section on that analysis.
To provide flexibility and easy access, APPEX defines two simple data formats:
column-oriented and row-oriented datasets. The APPEX analyzer handles only these
two formats, so users should prepare their data in one of them before applying
the data to APPEX. A data file uploaded to APPEX must be a text file, not a
binary file.
A column-oriented dataset is a text file in which each column contains one type
of data. Columns should be delimited by the tab (‘\t’) character. The number of
columns is not limited, so users can upload data with any number of columns,
even if the data contain redundant columns. The column-oriented dataset is
applicable to the “CoxSingle” and “Multivariate” processes on the APPEX analyzer.
To perform a CoxSingle process, at least four columns of the data should be
designated as “Patient ID”, “Survival Time”, “Censor (death:1/alive:0)”, and
“Intensity value”. For multivariate analysis, at least three columns should be
designated as “Survival Time”, “Censor (death:1/alive:0)”, and “Interest
factor”. Including a header line is optional, since APPEX asks whether a header
is present when the analysis is configured. A typical example of a
column-oriented dataset is illustrated in Figure 37.
Figure 37.
An example of column-oriented dataset
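As a minimal sketch of how a column-oriented file could be read outside APPEX, the Python snippet below parses tab-delimited text with a header line and pulls out columns by position. The column names, the sample values, and the function name `read_column_oriented` are illustrative assumptions and are not part of APPEX.

```python
import csv
import io

# Hypothetical column-oriented data: one patient per line, tab-delimited.
SAMPLE = (
    "PatientID\tSurvivalTime\tCensor\tIntensity\n"
    "P001\t34.5\t1\t7.82\n"
    "P002\t60.0\t0\t5.10\n"
    "P003\t12.3\t1\t9.47\n"
)

def read_column_oriented(text, has_header=True):
    """Return (header, columns), where columns[i] lists the values of column i."""
    rows = [row for row in csv.reader(io.StringIO(text), delimiter="\t") if row]
    header = rows[0] if has_header else None
    body = rows[1:] if has_header else rows
    columns = [list(col) for col in zip(*body)]
    return header, columns

header, columns = read_column_oriented(SAMPLE)
survival_time = [float(v) for v in columns[1]]
censor = [int(v) for v in columns[2]]      # death:1 / alive:0
intensity = [float(v) for v in columns[3]]
```

The `has_header` flag mirrors the header-line choice that APPEX asks for during setup; which column plays which role is chosen by position here purely for illustration.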
A row-oriented dataset is a text file in which each line contains one type of
data. All columns should be delimited by the tab (‘\t’) character, and the first
column should contain the title of each line. The row-oriented dataset is
applicable to the “CoxMulti”, “IntransSingle”, “IntransMulti”, “SuperPC”, and
“TimeRoc” processes on the APPEX analyzer. To perform each process, at least
four lines should be designated as “Patient ID”, “Survival Time”, “Censor
(death:1/alive:0)”, and “Data start line”. In particular, all clinical
information, such as patient ID, censor, and survival time, should be located
above the data start line. A typical example of a row-oriented dataset is
illustrated in Figure 38.
Figure 38.
An example of row-oriented dataset
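The row-oriented layout can be sketched in the same way. The Python snippet below is an illustration under stated assumptions: the line titles, gene names, values, and the function `read_row_oriented` are hypothetical, and the 1-based line numbers simply imitate the idea of designating lines in a setup dialog.

```python
import csv
import io

# Hypothetical row-oriented data: the first column is the line title,
# clinical lines come first, and expression data start at the fourth line.
SAMPLE = (
    "PatientID\tP001\tP002\tP003\n"
    "SurvivalTime\t34.5\t60.0\t12.3\n"
    "Censor\t1\t0\t1\n"
    "TP53\t7.82\t5.10\t9.47\n"
    "BRCA1\t6.01\t6.55\t4.98\n"
)

def read_row_oriented(text, id_line, time_line, censor_line, data_start_line):
    """Split a row-oriented table into clinical vectors and an expression dict.

    Line numbers are 1-based.
    """
    rows = [row for row in csv.reader(io.StringIO(text), delimiter="\t") if row]
    if max(id_line, time_line, censor_line) >= data_start_line:
        raise ValueError("clinical lines must be above the data start line")
    patients = rows[id_line - 1][1:]
    times = [float(v) for v in rows[time_line - 1][1:]]
    censor = [int(v) for v in rows[censor_line - 1][1:]]
    expression = {row[0]: [float(v) for v in row[1:]]
                  for row in rows[data_start_line - 1:]}
    return patients, times, censor, expression

patients, times, censor, expression = read_row_oriented(SAMPLE, 1, 2, 3, 4)
```

The explicit check reflects the rule that clinical lines sit above the data start line; everything from that line onward is treated as one molecule per line.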
To support users who wish to analyze previously published datasets on APPEX and
find significant prognostic or predictive signatures in cancers, we have
collected public datasets that contain numeric intensities and patients’
follow-up time information from NCBI GEO, a public data repository. Currently,
we have collected a total of 263 datasets from GEO and constructed a database to
explore and analyze them on the APPEX system. When a user chooses one of the
datasets in the APPEX dataset explorer and clicks an analysis method in the
pop-up menu (Figure 36), the APPEX analyzer generates a row-oriented file from
the selected public dataset and applies it to the analysis method selected by
the user. The generated file is automatically saved in the user storage area on
APPEX, which is controlled by the connection ID. The APPEX curation team
maintains the database of public datasets through regular updates (once every
three months). To learn how to use public datasets in APPEX, please refer to the
“public datasets” subsection of the “How to use” section.
This section briefly describes the methodology of the analysis methods supported
by the APPEX analyzer. For a full description of each methodology, please refer
to the reference cited in the corresponding subsection.
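Survival analysis typically examines the relationship of the survival
distribution to covariates. Most commonly, this examination entails the
specification of a linear-like model for the log hazard. For example, a
parametric model based on the exponential distribution may be written as
$$\log h_i(t) = \alpha + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik}$$
or, equivalently,
$$h_i(t) = \exp(\alpha + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik})$$
that is, as a linear model for the log-hazard or as a multiplicative model for
the hazard. Here, $i$ is a subscript for observation, and the $x$'s are the
covariates. The constant $\alpha$ in this model represents a kind of
log-baseline hazard, since $\log h_i(t) = \alpha$ [or $h_i(t) = e^{\alpha}$]
when all of the $x$'s are zero.
The Cox model leaves the baseline hazard function $\alpha(t) = \log h_0(t)$
unspecified:
$$\log h_i(t) = \alpha(t) + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik}$$
or, again equivalently,
$$h_i(t) = h_0(t) \exp(\beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik})$$
This model is semi-parametric because while the baseline hazard can take any
form, the covariates enter the model linearly. Consider, now, two observations
$i$ and $i'$ that differ in their $x$-values, with the corresponding linear
predictors
$$\eta_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik}$$
and
$$\eta_{i'} = \beta_1 x_{i'1} + \beta_2 x_{i'2} + \cdots + \beta_k x_{i'k}$$
The hazard ratio for these two observations,
$$\frac{h_i(t)}{h_{i'}(t)} = \frac{h_0(t)\, e^{\eta_i}}{h_0(t)\, e^{\eta_{i'}}} = \frac{e^{\eta_i}}{e^{\eta_{i'}}}$$
is independent of time $t$. Consequently, the Cox model is a
proportional-hazards model.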
Remarkably, even though the baseline
hazard is unspecified, the Cox model can be estimated by the method of partial
likelihood, developed by Cox in the paper in which he introduced the Cox model (16). Although the resulting estimates are not as efficient
as maximum-likelihood estimates for a correctly specified parametric hazard
regression model, not having to make arbitrary, and possibly incorrect,
assumptions about the form of the baseline hazard is a compensating virtue of
Cox’s specification. Having fit the model, it is possible to extract an estimate
of the baseline hazard.
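To make the partial-likelihood idea concrete, the following Python sketch evaluates the log partial likelihood for a single covariate and maximizes it with a simple ternary search. It is an illustration only, not the routine APPEX runs: it assumes there are no tied event times, the toy data and the names `cox_partial_loglik` and `fit_beta` are hypothetical, and the search interval [-5, 5] is an arbitrary choice. Note that the baseline hazard never appears in the calculation.

```python
import math

def cox_partial_loglik(beta, times, events, x):
    """Log partial likelihood for one covariate, assuming no tied event times."""
    loglik = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if d_i != 1:
            continue  # censored cases contribute only through the risk sets
        # Risk set: everyone still under observation at time t_i.
        risk = [math.exp(beta * x[j]) for j, t_j in enumerate(times) if t_j >= t_i]
        loglik += beta * x[i] - math.log(sum(risk))
    return loglik

def fit_beta(times, events, x, lo=-5.0, hi=5.0, iters=200):
    """Maximize the concave log partial likelihood by ternary search on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if cox_partial_loglik(m1, times, events, x) < cox_partial_loglik(m2, times, events, x):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2

# Toy data (hypothetical): event 1 = death, 0 = censored; x is a binary factor.
times = [5, 8, 12, 20, 25, 30]
events = [1, 1, 1, 0, 1, 1]
x = [1, 1, 0, 1, 0, 0]

beta_hat = fit_beta(times, events, x)
hazard_ratio = math.exp(beta_hat)  # hazard of x = 1 relative to x = 0
```

In practice a Newton-type optimizer and a tie-handling rule would be used; the search here just keeps the sketch short.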
The aim of the in-trans correlation approach is to estimate the prognostic value
of a molecule (driver) and its associated molecules (effectors). On APPEX, a
gene set consisting of a disease-driving candidate and its associated genes is
handled as a signature to predict cancer behavior. To generate the in-trans gene
set correlated with a gene feature, the Pearson correlation test is applied.
Using the expression data of the genes highly correlated with the gene feature,
a hierarchical clustering analysis is performed as described in Eisen et al.
(20). According to the clustering, patients are divided into two sub-groups, and
the time to the survival event of the patients in each sub-group is estimated.
The Kaplan-Meier method is used to calculate the time to survival, and the
difference in survival between the two groups is assessed using log-rank
statistics. In addition, to estimate the prognostic values of multiple in-trans
gene sets in the IntransMulti process of APPEX, the Pearson correlation test,
hierarchical clustering, Kaplan-Meier method, and log-rank test are iterated
sequentially for each user-supplied disease-driving candidate present in the
gene expression data. The in-trans correlation approach has been applied
successfully in previous investigations (7, 18).
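The first step of this approach, collecting the genes correlated with a driver, can be sketched in Python as follows. This is an assumption-laden illustration: the gene names and values are made up, the cut-off of 0.8 on the absolute Pearson coefficient is an arbitrary choice (the text does not state the threshold APPEX uses or whether negative correlations are kept), and `in_trans_gene_set` is a hypothetical helper.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)  # undefined for constant profiles

def in_trans_gene_set(expression, driver, r_cutoff=0.8):
    """Genes whose |r| with the driver reaches the (assumed) cut-off."""
    driver_profile = expression[driver]
    return sorted(
        gene for gene, profile in expression.items()
        if gene != driver and abs(pearson(driver_profile, profile)) >= r_cutoff
    )

# Hypothetical expression values across five patients.
expression = {
    "DRIVER": [1.0, 2.0, 3.0, 4.0, 5.0],
    "GENE_A": [2.0, 4.0, 6.0, 8.0, 10.5],
    "GENE_B": [5.0, 4.0, 3.0, 2.0, 1.0],
    "GENE_C": [3.0, 1.0, 4.0, 1.0, 5.0],
}
gene_set = in_trans_gene_set(expression, "DRIVER")
```

The selected genes would then feed the clustering, Kaplan-Meier, and log-rank steps described above, which are omitted here.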
SuperPC stands for “supervised principal components”. It can predict a censored
survival outcome or a quantitative outcome. It is especially useful for
correlating patient survival or other quantitative parameters with gene
expression data. “Supervised principal components” is a generalization of
principal components regression. The first (or first few) principal components
are the linear combinations of the features that capture the directions of
largest variation in a dataset. But these directions may or may not be related
to an outcome variable of interest. To find linear combinations that are related
to an outcome variable, SuperPC computes univariate scores for each gene and
then retains only those features whose score exceeds a threshold. A principal
components analysis is carried out using only the data from these selected
features.
Finally, these “supervised principal components” are used in a regression model
to predict the outcome. To summarize, the steps are:
(1) Compute (univariate) standard regression coefficients for each feature
(2) Form a reduced data matrix consisting of only those features whose univariate coefficient exceeds a threshold theta in absolute value (theta is estimated by cross-validation)
(3) Compute the first (or first few) principal components of the reduced data matrix
(4) Use these principal component(s) in a regression model to predict the outcome
This idea can be used in standard
regression problems with a quantitative outcome, and also in generalized
regression problems such as survival analysis. In the latter problem, the
regression coefficients in step (1) are obtained from a proportional hazards
model.
There is one more important point: the features (e.g. genes) that are important
in the prediction are not necessarily the ones that passed the screen in step
(2). There may be other features that have as high a correlation with the
supervised PC predictor. So SuperPC computes an importance score for each
feature equal to its correlation with the supervised PC predictor. A reduced
predictor is formed by soft-thresholding the importance scores and using these
shrunken scores as weights. The soft-thresholding sets the weight of some
features to zero, hence throwing them out of the model. The amount of shrinkage
is determined by cross-validation. The reduced predictor often performs as well
as or better than the supervised PC predictor, and is more interpretable. For
more information about SuperPC, please refer to its methodology paper (17).
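The soft-thresholding of importance scores can be illustrated with a few lines of Python. The scores, the shrinkage amount of 0.5, and the function name `soft_threshold` are hypothetical; in SuperPC the shrinkage is chosen by cross-validation rather than fixed by hand.

```python
def soft_threshold(score, delta):
    """Shrink score toward zero by delta; scores within delta of zero become 0."""
    magnitude = max(abs(score) - delta, 0.0)
    return magnitude if score >= 0 else -magnitude

# Hypothetical importance scores (correlations with the supervised PC predictor).
importance = {"GENE_A": 0.9, "GENE_B": -0.6, "GENE_C": 0.2}

weights = {gene: soft_threshold(score, 0.5) for gene, score in importance.items()}
kept = [gene for gene, weight in weights.items() if weight != 0.0]
```

Here GENE_C receives a weight of zero and drops out of the reduced predictor, while the two remaining genes keep shrunken weights with their original signs.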
The ROC curve is a popular method for displaying the sensitivity and specificity
of a continuous diagnostic marker, X, for a binary disease variable, D. However,
many disease outcomes are time dependent, D(t), and ROC curves that vary as a
function of time may be more appropriate. A common example of a time-dependent
variable is vital status, where D(t) = 1 if a patient has died prior to time t
and zero otherwise. The time-dependent ROC method summarizes the discrimination
potential of a marker X, measured at baseline (t = 0), by calculating ROC curves
for cumulative disease or death incidence by time t, denoted ROC(t). A typical
complication with survival data is that observations may be censored. Two ROC
curve estimators have been proposed that can accommodate censored data. A simple
estimator is based on using the Kaplan-Meier estimator for each possible subset
X > c. However, this estimator does not guarantee the necessary condition that
sensitivity and specificity are monotone in X. An alternative estimator that
does guarantee monotonicity is based on a nearest neighbor estimator for the
bivariate distribution function of (X, T), where T represents survival time. For
more information about time-dependent ROC curves, please refer to the
methodology paper (19).
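A minimal Python sketch of the simple Kaplan-Meier based estimator is given below. It uses Bayes' theorem to write sensitivity as {1 - S(t | X > c)} P(X > c) / {1 - S(t)} and specificity as S(t | X <= c) P(X <= c) / S(t), with each survival term replaced by a Kaplan-Meier estimate. The function names and toy data are hypothetical, this is not the code APPEX runs, and the sketch fails with a division by zero when the overall S(t) is exactly 0 or 1.

```python
def km_survival(times, events, t):
    """Kaplan-Meier estimate of S(t) = P(T > t); event 1 = death, 0 = censored."""
    surv = 1.0
    death_times = sorted({ti for ti, di in zip(times, events) if di == 1 and ti <= t})
    for u in death_times:
        at_risk = sum(1 for ti in times if ti >= u)
        deaths = sum(1 for ti, di in zip(times, events) if ti == u and di == 1)
        surv *= 1.0 - deaths / at_risk
    return surv

def km_sens_spec(times, events, marker, cutoff, t):
    """Sensitivity and specificity of the rule 'marker > cutoff' for death by time t."""
    n = len(times)
    high = [i for i in range(n) if marker[i] > cutoff]
    low = [i for i in range(n) if marker[i] <= cutoff]
    s_all = km_survival(times, events, t)
    s_high = km_survival([times[i] for i in high], [events[i] for i in high], t)
    s_low = km_survival([times[i] for i in low], [events[i] for i in low], t)
    sens = (1.0 - s_high) * (len(high) / n) / (1.0 - s_all)
    spec = s_low * (len(low) / n) / s_all
    return sens, spec

# Toy data without censoring: the two highest markers die first.
times = [1, 2, 3, 4]
events = [1, 1, 1, 1]
marker = [4.0, 3.0, 2.0, 1.0]
sens, spec = km_sens_spec(times, events, marker, cutoff=2.5, t=2.5)
```

Without censoring the estimator reduces to the ordinary empirical proportions, which is why this toy marker separates the two groups perfectly. Sweeping the cut-off over all marker values would trace out ROC(t), and the area under that curve is the AUC used for selecting molecules.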
The APPEX system consists of several software frameworks that handle multiple concurrent analysis jobs steadily and robustly. APPEX is implemented in Java as the host language. To provide user-friendly, interactive interfaces, the Google Web Toolkit (GWT, ver. 2.5.0, https://developers.google.com/web-toolkit) and GWT extended (GXT, ver. 3.0.1, http://www.sencha.com/products/gxt) frameworks were used. The dialog-based interfaces of APPEX were built with the GWT and GXT libraries. Data transport between the client and the APPEX server is handled by the GWT remote procedure call (RPC) mechanism. All statistical analysis methods of the APPEX analyzer were implemented in the R scripting language (ver. 3.0.1, http://www.r-project.org) with Bioconductor packages (ver. 2.12, http://www.bioconductor.org). Calls to R modules from the host language are managed by the RCaller framework (ver. 2.1.1, https://code.google.com/p/rcaller). To handle multiple time-consuming jobs concurrently, the Quartz job-scheduling framework (ver. 2.1.6, http://quartz-scheduler.org) was integrated with APPEX. The MySQL database management system (ver. 5.5.11, http://dev.mysql.com) is used to store and handle the public datasets from NCBI GEO. In addition, queries to MySQL from the host language are handled by MyBatis, an XML-based SQL mapping framework (ver. 3.1.1, https://code.google.com/p/mybatis). All APPEX services are hosted on an Apache Tomcat web server (ver. 6.0.26, http://tomcat.apache.org). The following figure is a schematic diagram of the APPEX system architecture.
Figure 39.
APPEX system architecture
To keep the APPEX system working steadily and to provide as much flexibility as
possible, we have established the following operating criteria:
(1) Connection ID
When an anonymous user accesses the APPEX system, a connection ID that controls
the user's session is generated automatically. All materials produced by user
activities on APPEX, such as uploaded files or analysis results, are managed
under the connection ID. Users who remember the connection ID of a previous
session can replace the current connection ID with the previous one and access
their previous results or uploaded data on the APPEX analyzer. Managing a
connection ID, including keeping it and saving or removing its data, is the
user's responsibility. APPEX's only role is to generate a new connection ID when
a user accesses the APPEX website.
(2) Supporting data formats
APPEX supports two data formats: column-oriented and row-oriented datasets. The
elements of both should be delimited by a tab (‘\t’) character. The
column-oriented dataset format is applicable to the CoxSingle and Multivariate
processes on APPEX. The row-oriented dataset format is applicable to the
CoxMulti, IntransSingle, IntransMulti, SuperPC, and TimeRoc processes. The
row-oriented dataset format is also used for public dataset processing. When a
user selects one of the stored public datasets, APPEX generates a row-oriented
file and saves it to the user area associated with the connection ID.
(3) No personal information required
APPEX does not require, and never tries to save, any personal information about
the user. The only thing APPEX requires for each analysis process is an email
address, used to announce the completion of a long-running analysis. APPEX does
not keep the address after sending the completion notice.
(4) Maintenance of analysis history
APPEX keeps a user's analysis history for two months after the processing date.
During that period, APPEX never does anything with the user’s own data. After
the two months, APPEX removes all of these contents from the APPEX database.
Using this dataset, you can perform a survival analysis with the Cox
proportional hazard model (CoxSingle). The data contain the signal intensities
of one molecule obtained from 102 tumor patients, as well as clinical
information including follow-up time and censor (alive:0/death:1). The following
figure shows this example data in Microsoft Excel:
Figure 40.
An example dataset for CoxSingle process
Using this dataset, you can run the various genome-wide (or molecule-wide)
approaches, including the Cox proportional hazard model (CoxMulti), in-trans
correlation (IntransSingle and IntransMulti), SuperPC, and time-dependent ROC
(TimeRoc) analyses. This dataset is a typical example for analysis on the APPEX
platform. The data contain genome-wide expression data (24,996 genes) obtained
from 100 cancer patients, as well as clinical information including follow-up
time and censor (alive:0/death:1). The following figure shows part of this
example data in Microsoft Excel:
Figure 41.
An example dataset formatted by row-oriented dataset
This dataset, in the column-oriented format, contains several clinical factors
together with a prediction result classified by a molecule. The first column
contains unique patient identifiers, the second column holds censor information
(0: alive and 1: death), and the third column holds the follow-up time of each
patient. In addition, this dataset contains gender and disease stage. For the
“Multivariate” process in APPEX, the values of all variables should be numeric.
Therefore, data of string or character type should be converted to numeric data
(e.g. M and F in gender should be converted to 0 and 1). This dataset was
obtained from 268 tumor patients. The following figure shows part of this
example data in Microsoft Excel:
Figure 42.
An example dataset formatted by column-oriented dataset for multivariate Cox
regression analysis
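The conversion of character values to numeric codes before upload can be done with any tool; the short Python sketch below shows one way. The function name `recode_binary` and the sample values are hypothetical, and the coding M = 0, F = 1 simply follows the example in the text.

```python
def recode_binary(values, mapping):
    """Replace each category with its numeric code; fail loudly on unknown values."""
    try:
        return [mapping[v] for v in values]
    except KeyError as err:
        raise ValueError(f"unmapped category: {err.args[0]!r}") from None

# Hypothetical gender column taken from a clinical table.
gender = ["M", "F", "F", "M"]
gender_numeric = recode_binary(gender, {"M": 0, "F": 1})
```

Raising an error for unmapped values avoids silently passing a stray label (for example a blank or "unknown") into the numeric column.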