10. New version: Identification of enriched genes from multiple replicate analyses of Affymetrix microarrays.
We used Affymetrix GeneChips to search for differentially expressed genes in the mouse brain.
About 34,000 genes and ESTs were interrogated with the Mu11kA, Mu11kA, Mu11kB,
Mu19kB and Mu19kC arrays. We compared the following regions: amygdala,
cerebellum, hippocampus, olfactory bulb and periaqueductal gray (PAG). To search for
region-specific genes in each area, we wrote custom code in Matlab (The MathWorks).
Average difference (D) values for each gene are used for comparisons. The code searches for genes that are highly
enriched in the reference sample with respect to all others. Two criteria are
applied to identify enriched genes: (1) the D value for the gene in the reference sample; and (2) the ratios (fold-differences) of the D value for this
gene in the reference sample to its D value in each of the other samples.
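Formally, a given gene $j$, with an average difference value in the reference sample
of $D_{j,\mathrm{ref}}$, is considered to be enriched in the reference sample relative
to the other samples examined if:
i) $D_{j,\mathrm{ref}} > \mathrm{minimum}$
ii) $D_{j,\mathrm{ref}} / D_{j,\mathrm{other}} \geq \mathrm{threshold}$ or $D_{j,\mathrm{ref}} / D_{j,\mathrm{other}} < 0$, for all other samples
With a positive minimum, a negative ratio means that the D value in the other sample is negative.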
The minimum value and the threshold ratio can be arbitrarily chosen
by the user. Thus, all genes that fulfill both the minimum value and
threshold ratio conditions are identified as being enriched in the
reference sample.
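The two conditions above can be sketched in a few lines of Python. This is only an illustration of the stated criteria, not the Matlab program itself: the function names, the 0-based column index and the handling of a D value of exactly zero in another sample (treated here as passing, since the ratio is undefined) are all assumptions.

```python
def is_enriched(row, ref, threshold, minimum):
    """Return True if the gene in `row` passes both criteria.

    `row` holds the D values of one gene across all samples and
    `ref` is the 0-based column of the reference sample (hypothetical API).
    """
    d_ref = row[ref]
    # Criterion (i): the reference D value must exceed the minimum.
    if d_ref <= minimum:
        return False
    # Criterion (ii): must hold against every other sample.
    for k, d_other in enumerate(row):
        if k == ref:
            continue
        if d_other == 0:
            # Assumption: the source does not say what happens here;
            # an undefined ratio is treated as passing in this sketch.
            continue
        ratio = d_ref / d_other
        if not (ratio >= threshold or ratio < 0):
            return False
    return True


def find_enriched(data, ref, threshold, minimum):
    """Return the 0-based row indices of all enriched genes."""
    return [i for i, row in enumerate(data)
            if is_enriched(row, ref, threshold, minimum)]


# Small made-up matrix: rows are genes, columns are samples.
example = [
    [1000, 100, 200, 150],  # ratios 10, 5, 6.7: enriched
    [1000, 400, 100, 100],  # ratio 2.5 against sample 2: rejected
    [50, 1, 1, 1],          # reference value below the minimum: rejected
    [800, -20, 100, 50],    # negative ratio counts as passing: enriched
]
enriched_rows = find_enriched(example, ref=0, threshold=3.5, minimum=120)
```

With these made-up values, `enriched_rows` contains rows 0 and 3.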
In our case, based on in situ hybridization experiments, a threshold ratio of 3.5
was optimal. Using higher threshold ratios (e.g., 5- or 6-fold) failed to
identify many genes found by the lower-stringency search, whose differential
expression could be validated by in situ hybridization. Conversely, lowering the threshold below 3.5 was more likely to
identify genes whose expression was not region-specific. We set the minimum
value to 1/10 of the mean of D values for all genes on each array (which in our case was ~120).
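As a small illustration of that rule, the minimum for one array could be computed as follows. The function name is hypothetical and the input values are invented for the example; they are not taken from the experimental data.

```python
def default_minimum(d_values):
    """Return one tenth of the mean D value of a single array."""
    if not d_values:
        raise ValueError("no D values given")
    return sum(d_values) / len(d_values) / 10.0


# Invented D values for three genes on one array: the mean is 200.
minimum = default_minimum([100.0, 200.0, 300.0])
```

For the invented values above, the mean is 200 and the resulting minimum is 20.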
The program can be downloaded here.
Instructions for use:
You need Matlab (The MathWorks) to run the program.
Enter the data as a .txt file, formatted as follows:
The data should be in matrix form and contain average difference (D) values for all genes across all samples
compared. The number of columns equals the number of samples to compare, and
the number of rows equals the number of genes, so that
each column contains the D values for all genes in a given sample. (Make sure the genes
are sorted in the same order, e.g. alphabetically, in all samples.)
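To check a data file against the layout just described before loading it, something like the following Python sketch may help. It assumes whitespace- or tab-separated numbers with no header row or gene-name column; the source does not state the delimiter, so that is an assumption, and `parse_matrix` is a hypothetical helper, not part of the program.

```python
def parse_matrix(text):
    """Parse a plain-text matrix: one gene per row, one sample per column."""
    rows = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            rows.append([float(x) for x in line.split()])
        except ValueError as err:
            raise ValueError(f"line {line_no}: non-numeric entry") from err
    # Every gene must have a D value for every sample.
    if len({len(r) for r in rows}) > 1:
        raise ValueError("rows have different numbers of columns")
    return rows


# Two genes (rows) measured in three samples (columns); values are invented.
sample_text = "120.5\t30\t-4\n900\t850\t910\n"
matrix = parse_matrix(sample_text)
n_genes = len(matrix)
n_samples = len(matrix[0])
```

To check a real file, read its contents with `open(path).read()` and pass the string to `parse_matrix`.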
1- Save the data in the format described above before running the program.
2- Open Matlab.
3- Change directory to the one where you have downloaded the files.
4- Type start_gs. The following GUI should appear:

LOAD DATA MATRIX:
Click the Load Data Matrix button and browse to select the file containing your data.
REF. SAMPLE
Enter here the number of the sample that you want to use as the reference in your
comparison. (For example, enter 4 if your reference sample is the fourth
column of your data matrix.)
THRESHOLD
Enter here the value for the threshold ratio. (For example, if you enter 6 here, you
will only identify genes that are enriched in your reference sample compared to
all other samples by at least 6-fold.)
MINIMUM
Enter here the minimum D value that an enriched gene can have in your reference sample.
OUTPUT FILE
The output of this program contains the following:
- indices of the rows with enriched genes
- a matrix formed by the D values of enriched genes only.
A .txt file containing this output will be created in the c:/temp directory.
Enter here the name you want to give to this text file.
In addition, the conditions you set for the comparison, together with the number of
genes and samples analyzed, will be automatically included in this file, which
may be useful for future reference.
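For readers who want to reproduce a similar record outside Matlab, the Python sketch below assembles the same kinds of information into one text report. The layout, the labels and the function name are hypothetical; the actual file written by the program may be organized differently. Row and sample numbers are printed 1-based, as Matlab would count them.

```python
def format_report(data, enriched, ref, threshold, minimum):
    """Build a text report from 0-based enriched row indices (hypothetical layout)."""
    n_samples = len(data[0]) if data else 0
    lines = [
        f"genes analyzed: {len(data)}",
        f"samples analyzed: {n_samples}",
        f"reference sample: {ref + 1}",
        f"threshold ratio: {threshold}",
        f"minimum D value: {minimum}",
        "enriched rows: " + " ".join(str(i + 1) for i in enriched),
    ]
    # Append the D values of the enriched genes only, one gene per line.
    for i in enriched:
        lines.append("\t".join(str(v) for v in data[i]))
    return "\n".join(lines) + "\n"


# Invented two-gene, two-sample matrix in which only the first gene is enriched.
report = format_report([[1000, 100], [50, 40]], [0],
                       ref=0, threshold=3.5, minimum=120)
```

The returned string can then be written to a file of your choice with `open(name, "w").write(report)`.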
Click here if you want to see an example of how the program runs with a simple dataset.
Click here to download this example dataset.
We recommend that you try running this code with several different settings to
get an idea of what your data look like. The default settings are the ones we
recommend for initial comparisons. It takes less than a minute to run this
program for 5 samples and 34,000 genes.
Experimental datasets are available upon request from Mariela Zirlinger. The code runs under Matlab.
We appreciate your feedback:
Mariela Zirlinger
216-76 Caltech
Pasadena, CA 91125