10. New version: Identification of enriched genes from multiple replicate analyses of Affymetrix microarrays.
We used Affymetrix GeneChips to search for differentially expressed genes in the mouse brain. About 34,000 genes and ESTs were interrogated with the Mu11kA, Mu11kB, Mu19kA, Mu19kB and Mu19kC arrays. We compared the following regions: amygdala, cerebellum, hippocampus, olfactory bulb and periaqueductal gray (PAG). To search for region-specific genes in each area, we wrote custom code in Matlab (The MathWorks).
Average difference (D) values for each gene are used for comparisons. The code searches for genes that are highly enriched in the reference sample with respect to all others. Two criteria are applied to identify enriched genes: (1) the D value for the gene in the reference sample; and (2) the ratios (fold-differences) of the D value for this gene in the reference sample to its D value in each of the other samples.
Formally, a given gene j, with an average difference value in the reference sample of Dj,ref, is considered to be enriched in the reference sample relative to the other samples examined if:
i) Dj,ref > minimum
ii) Dj,ref/Dj,other ≥ threshold or Dj,ref/Dj,other < 0, for all other samples
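The minimum value and the threshold ratio are chosen by the user. A negative ratio, which arises when the D value in the other sample is negative while Dj,ref is positive, also satisfies condition ii. All genes that fulfill both the minimum value and threshold ratio conditions are identified as being enriched in the reference sample.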
In our case, based on in situ hybridization experiments, a threshold ratio of 3.5 was optimal. Higher threshold ratios (e.g., 5- or 6-fold) failed to identify many genes that were found by the lower-stringency search and whose differential expression could be validated by in situ hybridization. Conversely, lowering the threshold below 3.5 was more likely to identify genes whose expression was not region-specific. We set the minimum value to 1/10 of the mean of D values for all genes on each array (which in our case was ~120).
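As a small illustration of the minimum-value rule, the Python sketch below computes one tenth of the mean of a list of D values. The function name one_tenth_of_mean is hypothetical and the numbers are made up; which set of D values to average is left to the user.

```python
def one_tenth_of_mean(d_values):
    """Return 1/10 of the mean of the given D values."""
    if not d_values:
        raise ValueError("need at least one D value")
    return sum(d_values) / len(d_values) / 10.0


# Made-up D values with a mean of 1200, giving a minimum of 120.
print(one_tenth_of_mean([1000, 1400, 1200]))  # 120.0
```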
The program can be downloaded here.
Instructions for use:
You need Matlab (The MathWorks) to run the program.
You will need to enter the data as a .txt file. Consider the following for formatting the data:
The data should be in matrix form and contain average difference (D) values for all genes across all samples compared. The number of columns equals the number of samples to compare, and the number of rows equals the number of genes, so that each column contains the D values for all genes in a given sample. Make sure the genes are sorted in the same order (e.g. alphabetically) across all samples.
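To check a data file before loading it, the layout described above can be validated with a short script. The Python sketch below is only an illustration, not part of the Matlab program; it assumes the file holds numbers only, separated by spaces or tabs, with one gene per line. The function name parse_matrix is hypothetical.

```python
def parse_matrix(text):
    """Parse whitespace-delimited text into a list of rows of floats.

    Each non-empty line is one gene; each column is one sample.
    Raises ValueError if rows differ in length or a value is not numeric.
    """
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        rows.append([float(x) for x in line.split()])
    if len({len(r) for r in rows}) > 1:
        raise ValueError("rows have differing numbers of columns")
    return rows


# Two genes measured in three samples (made-up values).
data = parse_matrix("400 100 50\n90\t10\t5\n")
print(len(data), "genes x", len(data[0]), "samples")
```

To check a file on disk, read its contents first, for example with open(path).read(), and pass the resulting string to parse_matrix.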
1- Save the data in the format described above before running the program.
2- Open Matlab.
3- Change directory to the one where you have downloaded the files.
4- Start the program by typing start_gs. The following GUI should appear:
LOAD DATA MATRIX:
Click the Load Data Matrix button and browse to select the file containing your data.
Enter here the sample number that you want to consider as reference in your comparison. (For example, enter 4 if your reference sample will be the fourth column of your data matrix).
Enter here the value for the threshold ratio. (For example, if you enter 6 here, you will only identify genes that are enriched at least 6-fold in your reference sample compared to every other sample.)
Enter here the minimum D value that an enriched gene can have in your reference sample.
The output of this program contains the following:
- indices of the rows with enriched genes
- a matrix formed by D values of enriched genes only.
A .txt file containing this output will be created in the c:/temp directory.
Enter here the name you want to give to this text file.
In addition, the conditions you set for the comparison, together with the number of genes and samples analyzed, are automatically included in this file, which may be useful for future reference.
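The kind of information stored in the output file can be sketched as follows. This Python example is an illustration only: the actual layout of the file written by the Matlab program is not documented here, so the line labels, the ordering and the function name format_report are all assumptions.

```python
def format_report(matrix, enriched_rows, ref, threshold, minimum):
    """Build a plain-text report: settings, enriched row indices, D values.

    enriched_rows and ref are 0-based here; they are written 1-based to
    match Matlab-style row and column numbering.
    """
    n_genes = len(matrix)
    n_samples = len(matrix[0]) if matrix else 0
    lines = [
        f"genes analyzed: {n_genes}",
        f"samples analyzed: {n_samples}",
        f"reference sample (column): {ref + 1}",
        f"threshold ratio: {threshold}",
        f"minimum D value: {minimum}",
        "enriched rows (1-based): " + " ".join(str(i + 1) for i in enriched_rows),
    ]
    # One tab-separated line of D values per enriched gene.
    for i in enriched_rows:
        lines.append("\t".join(str(v) for v in matrix[i]))
    return "\n".join(lines) + "\n"


# Made-up data: rows 1 and 3 are taken to be the enriched genes.
matrix = [[400, 100, 50], [90, 10, 5], [400, -20, 50]]
print(format_report(matrix, [0, 2], ref=0, threshold=3.5, minimum=120))
```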
Click here if you want to see an example of how the program runs with a simple dataset.
Click here to download this example dataset.
We recommend running the program with several different settings to get an idea of what your data look like. The default settings are the ones we recommend for initial comparisons. The program takes less than a minute to run for 5 samples and 34,000 genes.
Experimental datasets are available upon request from Mariela Zirlinger.
We appreciate your feedback:
Pasadena, CA 91125