How is the data in Nephroseq prepared?
Datasets are identified for inclusion from both literature searches and suggestions from our users. Good candidates for Nephroseq are datasets with raw gene expression data available for more than five kidney disease samples.
If a dataset appears to be a good candidate for Nephroseq, the next step is to map any available sample metadata to the Nephroseq ontology. In many cases, the data curators reach out to the study authors to request additional data or clarification.
For microarray datasets, gene expression data is log2 transformed and median centered per sample. For RNA-Seq datasets, raw sequencing data is aligned with TopHat v2.0.9 against Ensembl GRCh38, and expression normalization and differential expression analysis are performed in R with the Bioconductor package edgeR. Normalized data values are reported as log2(Counts Per Million aligned reads), and differential analyses are reported as F-tests. (The dataset detail screen shows more details about the processing and analysis of individual datasets.) For more information on RNA-Seq processing, see Kim D et al., "TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions" (Genome Biology 2013), and Robinson MD et al., "edgeR: a Bioconductor package for differential expression analysis of digital gene expression data" (Bioinformatics 2010).
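To make the two normalizations above concrete, here is a minimal Python sketch of what "log2 transformed and median centered per sample" and "log2(Counts Per Million)" mean for a single sample. It is an illustration only, not the Nephroseq pipeline (which runs in R with edgeR). The function names and the `pseudocount` parameter are assumptions introduced for this example, and the offset used here to avoid log2(0) is a simplification rather than edgeR's exact calculation.

```python
import math
from statistics import median


def log2_median_center(sample_intensities):
    """Log2-transform one sample's intensities, then subtract that sample's median.

    Assumes every intensity is strictly positive (hypothetical input format).
    """
    logged = [math.log2(value) for value in sample_intensities]
    center = median(logged)
    # After centering, the sample's median log2 value is 0.
    return [value - center for value in logged]


def log2_cpm(sample_counts, pseudocount=0.5):
    """Return log2 counts-per-million for one sample's read counts.

    `pseudocount` is an illustrative offset that keeps zero counts finite;
    it is an assumption of this sketch, not a documented Nephroseq setting.
    """
    library_size = sum(sample_counts)  # total aligned reads in this sample
    return [
        math.log2((count + pseudocount) / library_size * 1_000_000)
        for count in sample_counts
    ]
```

Each function takes the values for one sample (one column of an expression matrix), so applying it to every sample in turn gives the per-sample behavior described above.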
Dataset expression values are mapped to our gene model, which allows users to compare multiple datasets regardless of the original microarray or RNA-Seq platform.
Once the dataset has been mapped to the Nephroseq ontology and gene model, the dataset is run through our internal analysis engine to generate differential expression, co-expression, and outlier analyses. After analyses are generated, the dataset undergoes a scientific review process to ensure the quality of the resulting data.