Data Collection

Traditionally, chromosomal CGH data has been collected from the rev ish karyotype annotations included in the articles' PDFs or supplementary files. We have developed dedicated software to

  • normalize the diverse annotation formats and resolutions to a standard ISCN 1995 "rev ish" format
  • detect and fix obvious errors (e.g. non-existing bands due to typos ...)
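The error detection step can be illustrated with a minimal sketch. This is not our actual software: the function name `check_rev_ish`, the tiny `KNOWN_BANDS` table and the example annotation strings are all assumptions made up for illustration. A real check would use a complete cytoband reference at the appropriate resolution.

```python
import re

# Toy band table: a tiny, hypothetical subset for illustration only.
# A real check would load a complete cytoband reference.
KNOWN_BANDS = {
    "1": {"p36", "p13", "q21", "q31", "q44"},
    "8": {"p23", "q11", "q24"},
    "9": {"p24", "p21", "q34"},
}

GROUP_RE = re.compile(r"(enh|dim)\(([^)]*)\)")       # enh(...) or dim(...)
CHROMOSOME_RE = re.compile(r"^(\d{1,2}|[XY])(.*)$")  # chromosome, then the rest
BAND_RE = re.compile(r"[pq]\d+(?:\.\d+)?")           # numbered bands, e.g. q21 or p13.3


def check_rev_ish(annotation):
    """Return (kind, segment, problem) tuples for suspicious segments."""
    problems = []
    for kind, body in GROUP_RE.findall(annotation):
        for segment in body.split(","):
            segment = segment.strip()
            if not segment:
                continue
            match = CHROMOSOME_RE.match(segment)
            if match is None or match.group(1) not in KNOWN_BANDS:
                problems.append((kind, segment, "unknown chromosome"))
                continue
            chromosome, rest = match.groups()
            # Whole arms ("8q") and "pter"/"qter" carry no numbered band,
            # so they are not checked against the table.
            for band in BAND_RE.findall(rest):
                major = band.split(".")[0]  # compare at the table's resolution
                if major not in KNOWN_BANDS[chromosome]:
                    problems.append((kind, segment, "unknown band " + chromosome + band))
    return problems
```

Calling `check_rev_ish("rev ish enh(1q27q31),dim(9p21pter)")` would flag `1q27`, a band that does not exist on chromosome 1 and is typical of a typo. Anything flagged this way still needs a human look before it is "fixed".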

With more than 800 articles processed, we have seen all possible errors and more, up to column switching of gain/loss data in the printed article and channel switching in GEO data sets.

This leads us to genomic array data sets.

We collect data from articles' supplementary data, GEO, TCGA, ArrayExpress ... We assume that the best data is the called data provided by the original group, which, of course, is rarely available. Our strategies are:

  1. look for called aCGH data, in GP annotation or tabular format (article, supplementary tables)
  2. if there is none, check supplementary files for probe-specific data tables
  3. check the article for links to GEO/ArrayExpress datasets
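The order of preference above amounts to a simple fallback. A minimal sketch follows, assuming a hypothetical per-article record that notes which kinds of data were found. The key names and the function `pick_source` are invented for illustration and do not describe our internal tooling.

```python
from typing import Mapping, Optional

# Hypothetical labels for the three strategies, in order of preference.
PRIORITY = ("called_data", "probe_tables", "repository_links")


def pick_source(available: Mapping[str, bool]) -> Optional[str]:
    """Return the most preferred kind of data found for an article, or None."""
    for kind in PRIORITY:
        if available.get(kind):
            return kind
    return None
```

For an article offering only probe tables and a GEO link, `pick_source({"probe_tables": True, "repository_links": True})` returns `"probe_tables"`.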

Independent of the article-based data collection, we take the reverse approach of mining public array data resources (GEO ...) for available oncogenomic data. This is a complicated story involving raw data processing, quality control, segmentation etc., which we will publish one day, I guess.