The CAMDA Challenges
As is traditional in CAMDA contests, neither we nor the producers of the data can advise individuals on the datasets, because dealing with the files forms part of the analysis challenge. There is, however, an open forum for the free discussion of the contest data sets and their analysis, in which you are encouraged to participate. For CAMDA 2014, we have compiled the following exciting contests:
- The Systems Biology challenge of meaningfully integrating multi-track -omics data. We provide data sets from two application domains:
  - A selection of large-scale cancer studies of less well-studied diseases from the latest data release of the International Cancer Genome Consortium (ICGC), including matched gene and microRNA expression profiles from RNA-Seq, somatic CNV, methylation, and protein expression profiles. Only processed data are provided due to access restrictions by ICGC.
  - Dual dose response profiles for 14 unknown and 2 known compounds from the InnoMed PredTox project of the EU FP7 program, including matched transcriptomics, proteomics, and metabolomics from multiple platforms and the respective data on liver and kidney damage. Both raw and processed data are provided for contestants.
- The Toxicogenomics challenge based on the Japanese Toxicogenomics Project – the largest publicly available data set in the field covering over 100 drugs by more than 21,000 Affymetrix chips, and featuring multi time-point multi-dosage gene expression profile responses of rat in-vitro, rat in-vivo, and human in-vitro. A key challenge is the prediction of human in-vivo clinical data. The data set has been compiled and extended by additional clinical data in collaboration with Dr Weida Tong of the FDA.
Please note that CAMDA challenges are not limited to the questions proposed here. We look forward to a lively contest!
Dataset 1: ICGC Cancer Genomes
Drawing on the comprehensive description of genomic, transcriptomic and epigenomic changes provided by ICGC, the main goal of this challenge is to gain novel biological insights into the less well-studied cancers selected here. However, we are not merely looking for 'old paradigm' cancer subtype classification!
Data Description and Download
For this challenge, only processed data are provided. These cancers all have matched gene expression, microRNA expression, protein expression profiles, somatic CNV, and methylation.
- Question 1: What are disease causal changes? Can the integration of comprehensive multi-track -omics data give a clear answer?
- Question 2: Can personalized medicine and rational drug treatment plans be derived from the data? And how can we validate them down the road?
Dataset 2: InnoMed PredTox
The InnoMed PredTox project of the EU FP7 program was launched with the goal of improving decision making earlier in preclinical safety evaluation by combining results from ‘omics technologies with conventional toxicology measurements. Towards this goal, in vivo studies were conducted in male Wistar rats with 14 failed proprietary compounds and 2 reference toxicants (troglitazone for liver and gentamicin for kidney), covering liver, kidney, serum, and urine samples over multiple time points.
For this challenge, raw and processed data are provided as separate packages. The data packages contain metadata files and either processed or raw data folders, as follows:
- PredTox Overview (PDF file) - the main publication that describes the experimental setup, data collection, and outcome.
- PredTox 16 compounds (Excel table) - it summarizes the assays of each compound/study.
- metadata (folder) - it contains microarray annotation and MAS5 QC files, and assays for each compound/study.
- processed data (folder) - each compound has transcriptome, proteome, and metabolome data folders. OR
- raw data (folder) - each compound has transcriptome, proteome, metabolome, and clinical chemistry data folders.
- Question 1: Why are these compounds toxic? The 14 proprietary compounds included were discontinued at certain stages of preclinical development due to toxicological findings in liver and/or kidney from in-life studies. What can we learn from transcriptome and metabolome data to explain their toxicity?
- Question 2: Can we predict drug toxicity? Toxicity is dose and time dependent. Can it be predicted at an earlier stage of development?
Dataset 3: TGP dataset from the Japanese Toxicogenomics Project
The TGP dataset contains more than 21,000 arrays for rats treated mainly with human drugs and profiled using the Affymetrix RAE230_2.0 GeneChip®. The main target organ profiled is liver.
In this project, only the data for liver are provided. The data package contains the following files:
- TGP Description (Word document) – it provides a brief introduction to the TGP data and the human hepatotoxic potential of each drug. More information is available from the two references below: Citation 1: Uehara T, Ono A, Maruyama T, Kato I, Yamada H, Ohno Y, Urushidani T. The Japanese toxicogenomics project: application of toxicogenomics. Mol Nutr Food Res. 2010;54(2):218-27. Citation 2: Chen M, et al. FDA-approved drug labeling for the study of drug-induced liver injury (DILI). Drug Discov Today. 2011;16(15-16):697-703.
- Drug Information (Excel table) – the basic information about individual drugs is extracted from DrugBank. The last three columns contain human hepatotoxicity data for each drug, as described in the paper by Chen et al. (citation 2 above).
- Pathology Data (Excel table) – a significant portion of the TGP data is derived from in vivo assays using two different treatment protocols (i.e., single treatment and daily repeated treatment). Pathology and clinical chemistry data for each rat (each anchored to an array) are summarized in this table.
- Array Metadata (csv format) – metadata (e.g., dose, time, sacrifice time, etc.) for each array are summarized. Phenotypic data anchored to each array are available from the “Pathology Data” table mentioned above.
- MAS5 data (folder) – it contains the MAS5 summarized array data
- FARMS data (csv format) – it contains the FARMS summarized array data
- RAW data (folder) – it contains all the array data in the CEL format
- Example data (LIBSVM format) - ready to use for binary classification of DILI
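For readers unfamiliar with the LIBSVM format mentioned in the last item, the following is a minimal sketch of how such a file can be read with plain Python. Each line holds a class label followed by sparse `index:value` pairs. The two sample lines and the +1/-1 label encoding are made up for illustration; the actual example file in the data package may use different labels and feature indices.

```python
from typing import Dict, Iterable, List, Tuple


def parse_libsvm_line(line: str) -> Tuple[float, Dict[int, float]]:
    """Parse one LIBSVM-format line: '<label> <index>:<value> ...'."""
    # Drop an optional trailing comment introduced by '#'.
    line = line.split("#", 1)[0].strip()
    tokens = line.split()
    label = float(tokens[0])
    features: Dict[int, float] = {}
    for token in tokens[1:]:
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return label, features


def parse_libsvm(lines: Iterable[str]) -> Tuple[List[float], List[Dict[int, float]]]:
    """Parse many lines, skipping blank lines and comment-only lines."""
    labels: List[float] = []
    rows: List[Dict[int, float]] = []
    for raw in lines:
        stripped = raw.strip()
        if not stripped or stripped.startswith("#"):
            continue
        label, features = parse_libsvm_line(raw)
        labels.append(label)
        rows.append(features)
    return labels, rows


# Hypothetical sample lines (NOT taken from the contest data).
sample = [
    "+1 1:0.5 3:1.25",
    "-1 2:2.0 3:-0.75",
]
labels, rows = parse_libsvm(sample)
```

Features that are absent from a line are implicitly zero, which is why each row is kept as a sparse dictionary rather than a fixed-length list.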
This is a typical toxicogenomics dataset. It can be used to address two of the most important questions in toxicology and safety evaluation:
- Question 1: Can we replace the animal study with in vitro assays? Current safety assessment relies largely on the animal model, which is time-consuming, labor-intensive, and at odds with animal-rights concerns. There is a paradigm shift in toxicology to explore the possibility of replacing the animal model with in vitro assays coupled with toxicogenomics. The TGP data contain both in vitro and animal data, which is essential to address this question.
- Question 2: Can we predict liver injury in humans using toxicogenomics data from animals? Around 40% of drug-induced liver injury (DILI) cases are not detected in preclinical studies using the conventional indicators (such as pathology and clinical chemistry data). It has been hypothesized that genomic biomarkers will be more sensitive than conventional markers in detecting human hepatotoxicity signals in preclinical studies (i.e., in vitro and in vivo assays). In this project, we provide the human hepatotoxicity data for most of the drugs (the last three columns in the table named “Drug Information”). Contestants can explore the possibility of predicting the DILI potential in humans using the in vitro data from rat primary hepatocytes or human primary hepatocytes, or the animal data from two different treatment protocols. Alternatively, these data can be combined to enhance the predictive power for the human hepatotoxic potential.
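Since Question 2 is framed in terms of how many DILI cases a marker detects, one possible way to score a binary DILI predictor is by sensitivity and specificity. The sketch below is only an illustration: the labels are invented toy values (1 = hepatotoxic in humans, 0 = not), chosen so that 2 of 5 positive drugs are missed, mirroring the roughly 40% figure quoted above. It is not an official CAMDA evaluation metric.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary labels coded as 1/0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # toxic drugs detected
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # toxic drugs missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # safe drugs cleared
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # safe drugs flagged
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy labels for ten hypothetical drugs (NOT contest data).
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
```

A predictor built on genomic biomarkers would improve on the conventional indicators in this framing if it raised sensitivity without a large loss of specificity.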