
Download preprocessed TGP data

Data set description:
Data for the four studies are zip-compressed and available for download through the links below. Each zip file contains four files in CSV (comma-separated values) format: the FARMS-summarized gene expression values per gene (exprs_*.csv), the informative/non-informative (I/NI) call per gene (ini_*.csv), the sample names (sampleNames_*.csv) and the gene names (geneNames_*.csv). Each sample corresponds to one drug measurement. In the gene expression matrix, the columns are genes and the rows are samples. The I/NI call is a filter criterion for detecting information-carrying genes (e.g., genes with an I/NI call below 0.5; smaller I/NI calls mean more information). Replicate measurements were collapsed to one measurement per gene.
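The I/NI filter described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's own code. It assumes the CSV files have no header row, that the ini file holds one I/NI value per gene in the same order as the columns of the expression matrix, and it uses small in-memory strings with hypothetical gene names (g1, g2, g3) in place of the real files.

```python
import csv
import io


def read_column(text):
    """Read a one-column CSV (no header assumed) into a list of strings."""
    return [row[0] for row in csv.reader(io.StringIO(text)) if row]


def read_matrix(text):
    """Read a headerless numeric CSV into a list of rows (samples x genes)."""
    return [[float(v) for v in row] for row in csv.reader(io.StringIO(text)) if row]


def informative_columns(ini_calls, threshold=0.5):
    """Return the column indices of genes whose I/NI call is below the threshold."""
    return [j for j, call in enumerate(ini_calls) if call < threshold]


def select_columns(matrix, columns):
    """Keep only the given gene columns in every sample row."""
    return [[row[j] for j in columns] for row in matrix]


# Hypothetical stand-ins for exprs_*.csv, ini_*.csv and geneNames_*.csv.
exprs_text = "0.1,2.0,-0.3\n0.0,1.5,0.4\n"  # 2 samples x 3 genes
ini_text = "0.9\n0.2\n0.45\n"               # one I/NI call per gene (assumed layout)
genes_text = "g1\ng2\ng3\n"

exprs = read_matrix(exprs_text)
ini = [float(v) for v in read_column(ini_text)]
genes = read_column(genes_text)

keep = informative_columns(ini)             # genes with I/NI call < 0.5
filtered = select_columns(exprs, keep)
kept_genes = [genes[j] for j in keep]
```

To use the sketch on the downloaded data, the in-memory strings would be replaced by the contents of the extracted files; if the real files carry a header row or row names, the readers need to skip them.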

TGP drug info and pathological findings (CSV, EXCEL format)

Study – rat in vivo single (CSV format)

Study – rat in vivo repeated (CSV format)

Study – rat in vitro single (CSV format)

Study – human in vitro (CSV format)

Download rat in vitro study (LIBSVM format)

Data set description:
The example classification data sets below were built using the drug information from "Drug Information.csv" and the expression data from the rat in vivo single study (CSV format). The example data sets contain the gene expression values (FARMS preprocessed) and, as labels, the drug-induced liver injury (DILI) classes ("-1", "+1"). The data are stored in LIBSVM format for different time points (2h, 8h, and 24h) and dose levels (low, middle, and high). These binary classification data sets are ready to be analysed using the LIBSVM package. Samples (drugs) of no DILI concern were labeled "-1" and those of most DILI concern "+1". For more details regarding the categorization of DILI see here.
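For readers who want to inspect the files without the LIBSVM tools, the format is simple to parse: each line holds a label followed by sparse `index:value` pairs with 1-based feature indices. The sketch below is a minimal reader under that assumption; the two sample lines are made up for illustration and are not taken from the TGP files.

```python
def parse_libsvm_line(line):
    """Parse one LIBSVM line into (label, {feature_index: value})."""
    parts = line.split()
    label = int(float(parts[0]))  # accepts "+1", "-1", "1.0"
    features = {}
    for item in parts[1:]:
        index, value = item.split(":", 1)
        features[int(index)] = float(value)  # indices are 1-based in LIBSVM files
    return label, features


def parse_libsvm(text):
    """Parse a whole LIBSVM-formatted string, skipping blank lines."""
    return [parse_libsvm_line(line) for line in text.splitlines() if line.strip()]


# Hypothetical two-sample data set: one "+1" (most DILI concern), one "-1" (no DILI concern).
sample_text = "+1 1:0.25 3:-1.5\n-1 2:0.75\n"

data = parse_libsvm(sample_text)
labels = [label for label, _ in data]
```

If scikit-learn is installed, `sklearn.datasets.load_svmlight_file` reads the same format directly into a sparse matrix and a label vector.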

Example data sets

Description data preprocessing

The Japanese Toxicogenomics Project (TGP) includes gene expression data, toxicological information and pathological data of 131 compounds in vitro and in vivo screened for toxicity in rat and in vitro screened for toxicity in human.

Upper panel: The y-axis shows the log expression values of the fatty acid-binding protein 1 (Fabp1) estimated by FARMS after quantile normalization, while the grouped compounds are shown on the x-axis. The time points are encoded by orange, green and blue for 2h, 8h and 24h, respectively. The plot shows strong cell-culture effects, within the three time points and compounds, which could not be removed by the quantile normalization.
Lower panel: Same as upper panel but batch corrected. The correction with the matched control within cell-culture clearly reduces the cell-culture effects, while compound-induced expression changes are preserved.

The standard microarray preprocessing procedure consists of normalization, summarization and filtering. However, the standard preprocessing pipeline cannot be applied to these data sets, as the initial quality control of the microarray data revealed severe effects between the cell-cultures (see upper panel). To remove these effects, the probe-level data of the microarrays were first quantile normalized. Second, a compound batch correction was made by calculating probe intensity ratios, using the corresponding control measurement for the cell-culture (vehicle only, without compound) as reference. For the next preprocessing step, summarization, probe sets corresponding to genes were defined using alternative CDFs (Version 15.1.0, ENTREZG) from Brainarray [2], and FARMS [1] was applied to summarize the intensity ratios at probe set level, yielding expression values per gene. For the last preprocessing step, gene filtering, the FARMS-based informative/non-informative (I/NI) call [3] was applied to identify all non-informative probe sets.
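The first two steps (quantile normalization, then a ratio against the matched control) can be illustrated with a small Python sketch. It is a simplified stand-in, not the pipeline used for the TGP data: ties are broken by position rather than averaged, the ratio is taken on the log2 scale (an assumption; the text only says "ratios"), the probe intensities are made up, and the FARMS summarization and I/NI filtering steps are not covered.

```python
import math


def quantile_normalize(arrays):
    """Quantile normalize equal-length intensity lists (one list per microarray).

    Each value is replaced by the mean, across arrays, of the values sharing
    its rank. Ties are broken by position, a simplification.
    """
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [sum(s[i] for s in sorted_arrays) / len(arrays) for i in range(n)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # probe indices by ascending intensity
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = rank_means[rank]
        normalized.append(out)
    return normalized


def log_ratio_to_control(treated, control):
    """Per-probe log2 ratio of a treated array to its matched vehicle control."""
    return [math.log2(t / c) for t, c in zip(treated, control)]


# Hypothetical probe intensities for one treated array and its matched control.
treated = [5.0, 2.0, 3.0]
control = [4.0, 1.0, 6.0]

norm_treated, norm_control = quantile_normalize([treated, control])
ratios = log_ratio_to_control(norm_treated, norm_control)
```

After normalization both arrays share the same set of values, so any remaining per-probe difference reflects a change in rank between the treated array and its control.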


  1. Hochreiter S, Clevert DA, and Obermayer K (2006). A new summarization method for Affymetrix probe level data, Bioinformatics, 22(8):943-949
  2. Dai M, Wang P, Boyd AD, et al. (2005). Evolving gene/transcript definitions significantly alter the interpretation of GeneChip data, Nucleic Acids Res., 33(20):e175
  3. Talloen W, Clevert DA, Hochreiter S, et al. (2007). I/NI-calls for the exclusion of non-informative genes: a highly effective feature filtering tool for microarray data, Bioinformatics, 23(21):2897-2902

Chris Sander, PhD
Memorial Sloan Kettering Cancer Center

Temple F. Smith, PhD
Boston University

Jun Wang, PhD
Beijing Genome Institute (BGI)

Extended Abstract Proposals Due: 20 May 2014
Abstract Deadline for Poster Submission: 25 May 2014
Notification of Accepted Contributions: 30 May 2014
Early Registration Closes: 7 Jun 2014
CAMDA2014 Conference: 11–12 Jul 2014
ISMB 2014 Conference: 12–15 Jul 2014
Full Paper Submission: 25 Sep 2014



Landes Bioscience