O-miner is an online analytical suite for the automated integrative analysis and annotation of complex -omics data. We provide an easy-to-use tool to analyse both in-house and publicly available data from transcriptomic, genomic and methylation platforms. A generic system is used to provide detailed data analysis reports and figures for all of the supported data types. O-miner can also analyse multiple datasets together in a meta-analysis, increasing the statistical power of publicly available data. O-miner uses R packages for all analytical workflows.
Array-based transcriptomics workflow
O-miner provides analysis workflows for raw and normalised data from Affymetrix expression platforms (human and mouse genomes), normalised or unnormalised expression data from Illumina expression platforms, raw data from Affymetrix Exon arrays, and raw and unnormalised data from the Affymetrix multispecies miRNA array. If raw data is uploaded to O-miner, the following analytical steps are applied: quality control, normalisation, filtering of the normalised expression matrix, and differential gene expression analysis with the R package LIMMA1. Probes that are differentially expressed and meet user-supplied cutoffs for log2 fold change and adjusted p-values are reported. Optional analyses are available to identify statistically significant Gene Ontology terms among the differentially expressed results. All of the data generated is available for download in a zipped format. Survival analyses are available for all platforms. For data from Affymetrix gene expression platforms, estimates of tumour purity are available. Meta-analyses can be run using O-miner for datasets from the same platform; the COMBAT2 algorithm may be used to correct for batch effects between datasets.
O-miner is able to analyse pre-processed data from RNA-seq experiments. Data must be provided to O-miner as a matrix of either raw read counts or normalised read counts (RPKM values). Users are able to choose between two methods to calculate differential expression: LIMMA for both normalised and unnormalised data, and edgeR for unnormalised read counts. Results that meet user-supplied cutoffs for fold change and adjusted p-values are reported. Optionally, users may choose to run Gene Ontology analysis to identify statistically significant Gene Ontology terms. All of the analyses generated are available for download in a zipped format. Meta-analyses may be performed using post-processed RNA-seq data from more than one experiment. To do this, users need to combine the data from each experiment into one large data matrix and upload it to O-miner.
The home page allows users to choose the type of data they want to analyse: transcriptomics, genomics or methylomics. Once selected, the user interface for the selected data type will appear.
Users need to provide a name for each project. They may also provide an email address, which is optional and is used to inform them about the progress of the submission. A unique project ID is created for each project submitted to O-miner. Input details requested for the analysis of transcriptomics data are described next.
Array technology and platform
Array technology needs to be chosen. The options that are displayed for each section are specific to the array technology selected. O-miner supports the analysis of raw as well as normalised data from Affymetrix expression arrays (Genechip Human Genome U133 Plus 2.0, U133A, U133B, U133A 2.0, U95 version 2, U95A, U95B, U95C, U95D and U95E, Genechip Mouse Genome 430 2.0 and Human Gene 1.1ST), Affymetrix miRNA arrays (Genechip miRNA 2.0 and 3.0), and raw data from Affymetrix Exon Array (Genechip Human Exon 1.0ST). The tool also supports normalised or unnormalised data from Illumina expression arrays (HumanHT-12 v3 and v4 Expression BeadChip, MouseRef-8 V2.0 Expression BeadChip). It also supports the post-processing analysis of data from RNA-seq experiments, where the input can be a matrix of either raw or normalised read counts.
Data Type
Data type refers to either raw or normalised/unnormalised data, depending on which platform is chosen. Raw data are CEL files that are uploaded to the server. A minimum of four raw .CEL files must be submitted to O-miner for analysis. All files must be compressed into a single archive (.zip or .tar), which should not exceed 2GB. Once the other options are set, users are prompted to upload the data in the 'Data Source' section. After the archive of data files has been uploaded to the server, the file names will appear in a File Organiser window in the browser. At this point users are prompted to assign additional information such as the sample name and the biological source/state of each sample in the dataset.
Normalised or unnormalised data can be uploaded to the server in a tab-delimited file format. This file must be uploaded as a zipped archive and must not exceed 2GB. The data can be generated using O-miner or any other software. The normalised data matrix must contain the log2 expression values for each of the probes on the array per sample. When the other options have been set, the file can be uploaded in the 'Data Source' section. Once the data has been uploaded to the server and the information for each sample has been extracted, the File Organiser window will display the sample names as extracted from the column names of the uploaded data matrix. Users are prompted here to define the biological source/state of each sample. Any incorrect samples can be removed at this stage so that they are excluded from further analysis.
An example of a normalised matrix can be found here.
Analysis Type
Both paired and unpaired analyses are supported by O-miner.
O-miner displays the available .CEL files in the File Organiser window so that the user can define the sample names and the biological source/state of each array. Samples that have been uploaded for analysis can be removed at this stage by checking the checkbox on the right-hand side of each sample and clicking the 'Delete selected' button. Users must define at least two different groups to proceed to the next stage of the analysis workflow.
Figure: Unpaired analysis - Before setting up biological groups
Figure: Unpaired analysis - After setting up biological groups
Paired analysis can be carried out on a dataset containing samples from the same subject under two different conditions (e.g. tumour and normal samples from the same patient). Once the data is uploaded and extracted on the server, the File Organiser window allows the user to make up the pairs by transferring samples to the box on the left-hand side using the "->" button.
Figure: Paired analysis - Before setting up pairs
Figure: Paired analysis - After setting up pairs
Figure: Paired analysis - After setting up biological groups
Technical replicates
The technical replicates option indicates whether samples from the same condition are taken from the same sample. An additional 'Replicate' column appears in the File Organizer window for the analysis of technical replicates. Samples that are replicates of one another are assigned the same number in this column.
The batch-effect correction option indicates whether samples from different studies are being analysed together. If chosen, an additional column headed Study will appear in the File Organizer window. The example below shows an unpaired analysis using the batch effect correction algorithm (COMBAT)1 on two different datasets from the Gene Expression Omnibus that share the same array platform. Users are required to fill in the biological source/state for each sample as well as the Study column.
Figure: Batch-effect correction - Before setting up study id
Figure: Batch-effect correction - After setting up study id
ESTIMATE (Affymetrix Expression data only)
If the data to be analysed is from Affymetrix Expression Human Genome GeneChip platforms, then an option is available to run the ESTIMATE3 algorithm for estimating Tumour purity. Users are required to select the Estimate option before the data is uploaded to the server.
Survival analysis is available for all of the transcriptomics platforms supported by O-miner. If survival data is available, then select 'yes' for this option. Once data has been uploaded to the server and the biological source/state has been assigned for each sample through the File Organizer window, the user needs to click the 'Survival data' button. Another table will appear where users can provide survival information for each of the samples.
Figure: Survival analysis - After setting up survival data
The name of each sample is displayed in the first column. The time-to-event, in months, refers to the overall survival for each sample. For example, if the dataset is from patients, overall survival could refer to the time from diagnosis to the time of last follow-up. Event status is required for each of the samples and indicates whether an event has occurred: 1 if an event has occurred (e.g. death) or 0 if an event has not occurred.
Upload target matrix option
Users can upload their own target matrix files. It is recommended that users select 'No' for this option, proceed to upload their data and complete the information that is requested in the 'File Organizer' window. For exceptionally large datasets, where it may become tedious to fill in all the information required for each sample on the user interface, 'Yes' can be chosen for this option. The target matrix file must be prepared as a tab-delimited text file and uploaded to O-miner at the same time as the data. The format of the file depends on the type of analysis that is chosen.
To analyse raw data from Affymetrix expression arrays, miRNA arrays and exon arrays, three columns are required with the headers FileName, Name and Group. FileName MUST contain the exact names of each of the CEL files that are uploaded for the analysis of raw data. If normalised data is uploaded, then the names in the FileName column must be EXACTLY the same as the column names of the normalised matrix that is uploaded to O-miner. The Name column must contain a unique identifier for each sample and the Group column must state the biological source/state of each sample. An example of a target matrix file for unpaired analysis on Affymetrix expression/miRNA/exon arrays is found here.
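As an illustration only, a target matrix of this shape could be prepared with a short script rather than by hand. The sketch below is written in Python and is not part of O-miner; the helper name `build_target_matrix` and the example file names are hypothetical, and only the FileName, Name and Group headers come from the description above. The check for at least two groups mirrors the requirement stated for the File Organiser window.

```python
import csv
import io

# Column headers required for an unpaired analysis, as described above.
REQUIRED_HEADERS = ["FileName", "Name", "Group"]


def build_target_matrix(samples):
    """Return tab-delimited target matrix text.

    `samples` is a list of (file_name, sample_name, group) tuples.
    """
    names = [name for _, name, _ in samples]
    if len(set(names)) != len(names):
        raise ValueError("Name column must hold a unique identifier per sample")
    groups = {group for _, _, group in samples}
    if len(groups) < 2:
        raise ValueError("at least two biological groups are required")

    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(REQUIRED_HEADERS)
    writer.writerows(samples)
    return buffer.getvalue()


# Hypothetical example: four CEL files split into two groups.
example = build_target_matrix([
    ("tumour_1.CEL", "T1", "Tumour"),
    ("tumour_2.CEL", "T2", "Tumour"),
    ("normal_1.CEL", "N1", "Normal"),
    ("normal_2.CEL", "N2", "Normal"),
])
print(example)
```

The returned text can be saved to a .txt file and uploaded together with the data archive.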
To analyse normalised and unnormalised data from Illumina expression arrays and RNA-seq (post-processing), two columns are required: a Name column with the sample names (these must match the column names in the matrix) and a Target column stating the biological source/state of each sample. An example of a target matrix file for unpaired analysis on Illumina/RNA-seq post-processing data is found here.
To conduct a paired analysis, a fourth column with the header Pairs is required in addition to the FileName, Name and Group columns. This column should contain the same number in the two rows of the samples that form a pair. An analysis that includes replicates requires another column with the header Replicate. This column should contain the same number in each row of the samples that are considered to be replicates. An example of a target matrix file for paired/replicate analysis is found here.
To conduct a survival analysis, two columns with the headers Surv_Period and Surv_Status are required in addition to the FileName, Name and Group columns and any Pairs or Replicate columns. Surv_Period represents the overall survival in months for each sample and Surv_Status represents whether an event has occurred (0 if no event has occurred and 1 if an event has occurred). An example of a target matrix file for survival analysis is found here.
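Before uploading a hand-written target matrix, it can help to check the optional columns for obvious mistakes. The following Python sketch is an illustration under stated assumptions, not an O-miner tool: the function name `check_target_matrix` and the wording of the messages are hypothetical, while the column headers and the rules (two rows per pair number, event status of 0 or 1, survival period in months) follow the description above.

```python
import csv
import io
from collections import Counter


def check_target_matrix(text):
    """Return a list of problems found in tab-delimited target matrix text."""
    rows = list(csv.DictReader(io.StringIO(text), delimiter="\t"))
    problems = []
    if not rows:
        return ["target matrix contains no samples"]

    # Paired analysis: every pair number should label exactly two rows.
    if "Pairs" in rows[0]:
        counts = Counter(row["Pairs"] for row in rows)
        for pair_id, n in counts.items():
            if n != 2:
                problems.append(f"pair {pair_id} has {n} sample(s), expected 2")

    # Survival analysis: status is 0 or 1, period is a non-negative number.
    if "Surv_Status" in rows[0]:
        for row in rows:
            if row["Surv_Status"] not in ("0", "1"):
                problems.append(f"{row['Name']}: Surv_Status must be 0 or 1")
            try:
                if float(row["Surv_Period"]) < 0:
                    problems.append(f"{row['Name']}: Surv_Period is negative")
            except (KeyError, TypeError, ValueError):
                problems.append(f"{row['Name']}: Surv_Period is not a number")
    return problems


# Hypothetical paired dataset with survival information.
good = (
    "FileName\tName\tGroup\tPairs\tSurv_Period\tSurv_Status\n"
    "t1.CEL\tT1\tTumour\t1\t24\t1\n"
    "n1.CEL\tN1\tNormal\t1\t24\t1\n"
    "t2.CEL\tT2\tTumour\t2\t60.5\t0\n"
    "n2.CEL\tN2\tNormal\t2\t60.5\t0\n"
)
print(check_target_matrix(good))
```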
Users can upload their own datasets to O-miner or use publicly available datasets from the Gene Expression Omnibus (GEO). When uploading their own dataset, users need to select the "Compressed file/folder" option and upload the files in zipped format. Publicly available datasets can be retrieved from GEO by choosing the "GEO dataset" option as the data source.
If GEO datasets are chosen, additional fields will appear on the interface to complete the process. The user first needs to enter the GEO dataset identifiers. The user then needs to choose whether all samples in the dataset are required or only a subset, in which case the GSM numbers of the required samples need to be entered.
O-miner is able to automatically extract phenotypic information from GEO when a GEO dataset is chosen as the data source. To use this feature, users need to select the 'Extract suggestion for target matrix from GEO' option, in which case a provisional File Organizer window will appear similar to the image below:
From this interface, the user needs to assign the column they want to use as the Sample Name and Group using the drop-down menus at the top of each column. Please note that O-miner displays this information separately for each of the GEO datasets that are uploaded and this step needs to be done for each of the GEO datasets shown on the interface separately.
Once these columns have been assigned in the interface, the 'Proceed' button has to be clicked and a final File Organizer window will appear.
Users can then proceed to the next step of input for setting up the LIMMA analysis.
The options for the analysis parameters depend on the platform selected for data analysis. Once the user chooses the platform, most of these options will change dynamically.
Quality control
Quality control options are offered only for data from Affymetrix expression array and Illumina expression array platforms.
ArrayMvout4 is an R package that is run by default on all raw data submitted to O-miner from these platforms; outliers are automatically removed from the data. When ArrayMvout is run, nine quality control criteria are applied to the data to identify outlier samples. These are: average background, scale factor, percent of present calls, actin 3'/5' ratio, GAPDH 3'/5' ratio, median normalised unscaled standard error (NUSE), median relative log expression (RLE), RLE-IQR (interquartile range of RLE per array, a measure of the variability in RLE) and the slope of the RNA degradation measure. These criteria are summarised by principal components analysis using principal components 1, 2 and 3, followed by parametric multivariate outlier detection with calibration for multiple testing. A false positive rate of 0.01 is used for outlier detection, adjusting for multiple comparisons according to Caroni and Prescott's adaptation of Rosner6.
ArrayQualityMetrics5 is an R package that users can select to run further quality control analysis. O-miner will report results from ArrayQualityMetrics but will not remove any samples found to be outliers. Any samples found to be outliers can be manually removed from further analyses.
Exon array analysis
NUSE and RLE plots are generated for the data from the exon array platform. Samples are not automatically removed from analysis. Users can look at these plots and see whether any samples look like outliers.
Options for normalisation depend on the platform that is being analysed.
Affymetrix Expression Arrays
O-miner provides a choice of three different normalisation methods for Affymetrix expression array platforms. These are:
Robust Multi-array Average (RMA)
RMA generates log2-scale normalised expression measurements for each probe from each sample and comprises three pre-processing steps: background correction using a perfect-match (PM)-only estimation procedure, quantile normalisation, and summarisation using median polishing7.
GCRMA converts .CEL files into an expression set using RMA and taking into account the GC-content of the probe sequence for background correction8.
The default summarisation method of RMA and GCRMA is based on the median polish algorithm, and it has been suggested that this could be problematic when there is a small and unequal number of samples. O-miner offers tRMA, a proposed correction to the RMA/GCRMA summarisation procedure that reduces inter-array correlation artefacts9.
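To make the quantile normalisation step of RMA more concrete, here is a minimal Python sketch of the general technique. It is only an illustration: O-miner itself relies on R packages, the function name `quantile_normalise` is hypothetical, and ties are broken by row order here rather than averaged, which is a simplification.

```python
def quantile_normalise(matrix):
    """Quantile-normalise a matrix given as a list of rows.

    Rows are probes and columns are samples. After normalisation every
    column contains the same set of values (the mean of each rank).
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_cols)]

    # Mean of the k-th smallest value across all samples, for each rank k.
    sorted_columns = [sorted(column) for column in columns]
    rank_means = [
        sum(column[k] for column in sorted_columns) / n_cols
        for k in range(n_rows)
    ]

    # Replace each value by the mean that belongs to its rank in its column.
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for j, column in enumerate(columns):
        order = sorted(range(n_rows), key=lambda i: column[i])
        for rank, i in enumerate(order):
            result[i][j] = rank_means[rank]
    return result


# Toy matrix: 4 probes (rows) measured in 3 samples (columns).
normalised = quantile_normalise([
    [5.0, 4.0, 3.0],
    [2.0, 1.0, 4.0],
    [3.0, 4.0, 6.0],
    [4.0, 2.0, 8.0],
])
for row in normalised:
    print(row)
```

After this step all samples share the same empirical distribution, so remaining differences between arrays reflect the ranking of probes rather than overall intensity shifts.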
Illumina Expression Arrays
All of the normalisation methods for Illumina expression arrays are performed using the R package lumi10.
Robust Spline Normalisation (RSN)
This algorithm is a combination of the features of quantile and loess normalisation. It is designed to normalize variance-stabilized data.
Variance stabilisation normalisation
This method performs variance stabilization and background correction in the same transformation. Instead of using negative controls, the within-array standard errors calculated from the replicate beads are used to remove the relationship between intensity and signal variability that typically exists.
Simple Scaling normalisation (SSN)
Samples are adjusted to the same background level and data is then optionally scaled to the same foreground level. Data adjustment is based on the raw scale data. This method is more conservative than other normalisation methods such as quantile and curve-fitting methods. It is assumed that all samples have the same background level and the same scale.
This method estimates the quantiles of the underlying distribution based on one or two order statistics.
If unnormalised data such as raw read counts are uploaded to O-miner, the default option for data normalisation is the Voom option within LIMMA.
Filter method
Filtering helps to reduce the dimensionality of the data by eliminating probes with low intergroup variability, as these probes are likely to be uninformative. Such probes are removed by the application of either a variance filter such as interquartile range and standard deviation, or an intensity based filter, based on signal probe intensity values11.
Interquartile range (IQR)
Selection of IQR as a filter allows one of three threshold values to be applied to the normalised data. The threshold values available are 0.1, 0.25 and 0.5. These values correspond to a soft, intermediate and robust filter, respectively.
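The idea behind an IQR filter can be sketched in a few lines of Python. This is an illustration only and not O-miner's R implementation: the function names are hypothetical, and it is an assumption here that probes are kept when their IQR across samples is at least the chosen threshold. The quartiles use the "inclusive" method of Python's `statistics.quantiles`, which interpolates linearly between data points.

```python
import statistics


def probe_iqr(values):
    """Interquartile range of one probe's expression values across samples."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


def iqr_filter(matrix, threshold):
    """Keep probes whose IQR is at least `threshold` (assumed rule).

    `matrix` maps a probe ID to its list of log2 expression values.
    """
    return [probe for probe, values in matrix.items()
            if probe_iqr(values) >= threshold]


# Hypothetical normalised data: one variable probe and one flat probe.
expression = {
    "probe_variable": [1.0, 2.0, 3.0, 4.0, 5.0],
    "probe_flat": [7.0, 7.05, 7.0, 7.05, 7.0],
}
print(iqr_filter(expression, threshold=0.25))
```

A larger threshold removes more probes, which is why 0.5 is described as the most stringent of the three settings.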
Standard deviation calls can be applied as a variance filter to isolate either the top 5% or the top 10% of the most variable probes.
Intensity filtering is based on the gene having an expression measurement greater than a defined value, in this case above 100 fluorescence units in at least 25% or 50% of samples.
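The standard deviation and intensity filters can be sketched in the same way. The Python below is a hedged illustration rather than O-miner's code: the function names are hypothetical, rounding the number of retained probes down (with a minimum of one) is an assumption, and the intensity filter assumes values on the raw fluorescence scale rather than the log2 scale.

```python
import math
import statistics


def top_variable_probes(matrix, fraction):
    """Return the `fraction` of probes with the largest standard deviation.

    `matrix` maps a probe ID to its list of expression values.
    """
    sds = {probe: statistics.stdev(values) for probe, values in matrix.items()}
    n_keep = max(1, math.floor(len(sds) * fraction))  # assumed rounding rule
    ranked = sorted(sds, key=sds.get, reverse=True)
    return ranked[:n_keep]


def intensity_filter(matrix, min_intensity=100.0, min_fraction=0.25):
    """Keep probes above `min_intensity` in at least `min_fraction` of samples."""
    kept = []
    for probe, values in matrix.items():
        n_above = sum(1 for value in values if value > min_intensity)
        if n_above >= min_fraction * len(values):
            kept.append(probe)
    return kept


# Hypothetical raw-scale intensities for three probes in four samples.
intensities = {
    "probe_a": [150.0, 50.0, 50.0, 50.0],
    "probe_b": [60.0, 55.0, 58.0, 52.0],
    "probe_c": [300.0, 20.0, 250.0, 40.0],
}
print(top_variable_probes(intensities, fraction=0.05))
print(intensity_filter(intensities, min_fraction=0.5))
```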
Differential expression method
For all analysis platforms apart from RNA-seq (post-processing), differential expression analysis is performed using the R package LIMMA, which is the only option available. Once the user has arranged the samples/files in the File Organizer window, O-miner will automatically refresh to display a 'LIMMA' comparison section. This provides a list of the predefined biological groups, allowing the user to define the contrast and design matrices that are required by LIMMA. In practice, these are derived from the comparisons that the user selects between the predefined biological groups.
The user is able to select how probes and contrasts are to be combined together in a multiple testing strategy. The choices available are: Separate, Global and NestedF12. The multiple testing correction methods available for adjusting the p-values are Holm13, Benjamini and Hochberg (BH)14 (also known as FDR) or Benjamini and Yekutieli (BY)15. Selecting none here will allow the data to be analysed without any multiple testing correction being applied to the p-values.
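For readers unfamiliar with these corrections, the following Python sketch shows how Holm and Benjamini-Hochberg adjusted p-values are computed from a list of raw p-values. It is a generic illustration of the two standard procedures, not code taken from O-miner or LIMMA; the function names are hypothetical and the BY method is omitted for brevity.

```python
def holm_adjust(pvalues):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):  # rank 0 is the smallest p-value
        candidate = min(1.0, (m - rank) * pvalues[i])
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


def bh_adjust(pvalues):
    """Benjamini-Hochberg (FDR) adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]
print(holm_adjust(raw))
print(bh_adjust(raw))
```

Holm controls the family-wise error rate and is therefore more conservative than BH, which controls the false discovery rate.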
EdgeR (RNA-seq post processing only)
For the differential expression analysis of post-processed data from RNA-seq, the user is offered a choice of two methods: LIMMA and edgeR. If LIMMA is chosen, then the options that are available are exactly the same as above. EdgeR analysis is run within O-miner using the R package edgeR16. If this option is chosen, then a choice of adjustment methods is offered to the user and these are the same as those offered for LIMMA.
Output options
Significance thresholds defined by the user can be applied to the data from this section. Users can provide adjusted p-value thresholds and/or log fold change threshold values. In addition, the 'Additional output' selection gives users the opportunity to include statistically significant Gene Ontology terms and/or generate Venn diagrams in the result. If the option to view Gene Ontology terms is chosen then the R package GOstats17 is used to perform a statistical analysis on the genes that are differentially expressed and those that are found to be statistically significantly differentially expressed for each of the comparisons.
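As a rough illustration of how such thresholds act on a results table, the Python sketch below keeps only the probes that pass both cutoffs. The record type `ProbeResult`, the function name and the default values are hypothetical, and whether O-miner treats the cutoffs as inclusive is an assumption made here.

```python
from typing import NamedTuple


class ProbeResult(NamedTuple):
    probe_id: str
    log2_fc: float  # log2 fold change for one comparison
    adj_p: float    # p-value after multiple testing correction


def apply_thresholds(results, max_adj_p=0.05, min_abs_log2_fc=1.0):
    """Keep results that pass both cutoffs (inclusive comparison assumed)."""
    return [
        r for r in results
        if r.adj_p <= max_adj_p and abs(r.log2_fc) >= min_abs_log2_fc
    ]


# Hypothetical results for a single comparison.
table = [
    ProbeResult("probe_1", 2.3, 0.001),   # up-regulated, significant
    ProbeResult("probe_2", -1.8, 0.020),  # down-regulated, significant
    ProbeResult("probe_3", 0.4, 0.001),   # significant but small change
    ProbeResult("probe_4", 3.0, 0.300),   # large change but not significant
]
significant = apply_thresholds(table)
up = [r.probe_id for r in significant if r.log2_fc > 0]
down = [r.probe_id for r in significant if r.log2_fc < 0]
print(up, down)
```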
O-miner provides a common format for the output of results from each of the analysis workflows that are supported. The results from each analysis are divided into five categories: Summary, Quality Control, Differential Expression, Gene Ontology and Visualisation. If survival analysis has been chosen, then it will appear as an additional category on the results page. Most of the methods used in the workflows to calculate the results are the same across the different types of data that are analysed. However, some types of data differ in the methods used at certain stages of the workflows. This guide to the output of results from Transcriptomic workflows within O-miner is divided into sections that correspond to each section of the results pages and each stage of the analytical workflows.
Every results page contains a summary page describing the platform, normalisation algorithm and filtering method applied to the data. All of the results displayed in different categories are available to download from here.
ArrayMvout Report
All analysis of Affymetrix expression array data from raw CEL files will generate a quality control page composed of an ArrayMvout Report and a Cluster Diagram by default. A summary report of the output of ArrayMvout is displayed in this section. Any samples detected as outliers are shown in red and are automatically removed from subsequent analyses.
A summary plot of the average background, scale factor, percent of present calls, actin 3'/5' ratio and GAPDH 3'/5' ratio is provided. Similarly, box plots of the Relative Log Expression (RLE) and Normalised Unscaled Standard Error (NUSE) are produced. Arrays that show a large variation or deviate significantly from 0 or 1 (for RLE and NUSE, respectively) are often identified as outliers. Finally, RNA degradation plots are generated. These display the degradation slope for each probeset by displaying the RNA quality of each probe (5' to 3').
ArrayQualityMetrics Report
If the ArrayQualityMetrics option is chosen at the input stage, then an ArrayQualityMetrics Report is generated. In this section, the array metadata and outlier detection view is displayed, which can be used to detect outliers within the dataset. Samples with an 'x' in any one of the columns 1-6 could be possible outliers. Samples that may be outliers according to ArrayQualityMetrics are not removed automatically from the analysis at this stage.
LUMI Report
For quality control analysis of Illumina expression arrays the R package lumi is used. All plots are available for download. These comprise: a density plot of the signal intensities for each of the samples; a principal components analysis (PCA) plot of the unnormalised intensity values for each sample; a cluster plot showing the relationship between samples, based on probes with a large coefficient of variation (standard deviation/mean), using hierarchical clustering methods; and a MultiDimensional Scaling (MDS) plot, which displays the relationship between samples based on the same set of highly variable probes. A summary report is also produced. This displays the sample name, mean (the gene expression mean for one chip), standard deviation (the gene standard deviation for one chip), detection rate and distance to mean (distance to the sample mean).
Affymetrix exon array QC plots
O-miner uses the R package aroma.affymetrix18 to pre-process the raw CEL file images. Boxplots of NUSE and RLE are produced, but no samples are removed from subsequent analysis.
Unsupervised hierarchical clustering of samples based on the Euclidean distance matrix and the average linkage algorithm is performed on the data. This shows clustering of the samples labelled according to the biological groups that are designated by the user.
Estimation of Stromal and Immune cells in MAlignant Tumours using Expression data (ESTIMATE)
O-miner uses the R package ESTIMATE to generate a Tumour Purity ESTIMATE Report for data from Affymetrix platforms only. Single-sample gene set enrichment analysis (ssGSEA) is used to calculate stromal and immune scores to predict the levels of infiltrating stromal and immune cells. From these values, ESTIMATE scores to infer tumour purity in tissues are generated. Histograms of stroma score, immune score and the estimate score are displayed. Results are also displayed in a tabular format with columns for Sample name, Stromal score, Immune score, ESTIMATE score and Tumour purity.
All platforms except Affymetrix exon arrays
Differential expression statistics, based on user defined significance thresholds for log-fold change and adjusted p-values, are provided in a tabular format. These statistics will be generated for each LIMMA comparison selected by the user. In addition, generation of a Venn diagram allows the user to visualise the number of differentially expressed genes that are unique and common to the biological groups.
Each comparison chosen by the user is displayed as an expandable tab on the differential expression results page, provided that statistically significant genes are found for that comparison. For each comparison, the summary box displays the total number of probes found to be up- and down-regulated. The distribution of raw p-values is also displayed as a histogram that is available for download. The distribution of the raw p-values determines whether they can be adjusted for multiple testing to generate adjusted p-values.
Only the top ten differentially expressed probes are displayed in the interface for each comparison. The full list of differentially expressed probes for each comparison is available for download in Excel format at the top of the results page. On the left-hand side of each probe there is a boxplot, which displays the expression levels of that probe across all of the pre-defined biological groups. Each probeset ID is mapped to its chromosomal location and gene name; the log fold-change and adjusted p-values are also displayed.
Affymetrix exon arrays
Differential expression results are reported for three levels from exon array analysis - differentially spliced regions (splicing tab), differentially expressed genes (gene level) and differentially expressed exons.
Figure: Differential Expression Analysis - at splicing level
Figure: Differential Expression Analysis - at gene level
Figure: Differential Expression Analysis - at exon level
Statistically significant Gene Ontology terms that are either over- or under-represented in each comparison are displayed in an expandable tab for each of the Gene Ontologies: Cellular Component, Biological Process and Molecular Function. When the tab for an ontology is expanded, the Gene Ontology ID, p-value, OddsRatio, Expected count, Observed count, Size and a description of the GO term are displayed.
All survival analysis within O-miner is performed using the R package survival19 and allows for the Cox proportional hazards regression model to be fitted to the data. Kaplan-Meier plots are produced for the data at 5, 10 and 15 year intervals. For example, the Kaplan-Meier plot below is for a survival time of ten years, where two risk-groups are shown with risk-group 1 showing a higher rate of survival compared to risk-group 2. The calculated logrank p-value is also shown.
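The curve drawn in a Kaplan-Meier plot can be computed from the time-to-event and event status values alone. The Python sketch below implements the standard product-limit estimator as an illustration; it is not the code of the R survival package, and the function name `kaplan_meier` and the toy data are hypothetical.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    `times` are survival times in months and `events` are 1 (event
    occurred) or 0 (censored). Returns a list of (time, survival
    probability) pairs, one for each distinct time with at least one event.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        current_time = data[i][0]
        n_events = 0
        n_leaving = 0
        # Group all samples that share the same time point.
        while i < len(data) and data[i][0] == current_time:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events:
            survival *= 1 - n_events / n_at_risk
            curve.append((current_time, survival))
        n_at_risk -= n_leaving
    return curve


# Toy data: four samples, one of them censored at 10 months.
km_curve = kaplan_meier([5, 10, 10, 20], [1, 0, 1, 1])
for time_point, probability in km_curve:
    print(time_point, round(probability, 3))
```

Censored samples (event status 0) do not cause a step in the curve, but they reduce the number of samples at risk for later time points.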
Expression plot
This feature allows the users to enter a probe or a gene symbol of their choice and generate expression boxplots to compare the expression levels of the gene(s)/probe(s) of interest across all the user defined biological groups. This option is available for all the probes on the array.
Correlation plot
A gene symbol or probe ID may be entered here to calculate the Pearson's correlation coefficient and associated p-value for the gene/probe ID of interest against all the probes on the array. Only the top ten most significant correlations, sorted by p-value, are displayed on the interface. All of the correlations are available to download as a text file.
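The ranking described above can be illustrated with a small Python sketch. It is not O-miner's implementation, and the function names and toy data are hypothetical. The p-value calculation is left out; because every probe is measured on the same number of samples, sorting by the absolute correlation coefficient gives the same order as sorting by p-value.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long value lists.

    Constant input (zero variance) is not supported in this sketch.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def top_correlations(matrix, query_probe, top_n=10):
    """Rank all other probes by absolute correlation with `query_probe`."""
    query = matrix[query_probe]
    scores = [
        (probe, pearson(query, values))
        for probe, values in matrix.items()
        if probe != query_probe
    ]
    scores.sort(key=lambda item: abs(item[1]), reverse=True)
    return scores[:top_n]


# Hypothetical expression values for four probes in five samples.
probes = {
    "query": [1.0, 2.0, 3.0, 4.0, 5.0],
    "same_trend": [2.0, 4.0, 6.0, 8.0, 10.0],
    "opposite": [9.0, 7.5, 6.0, 3.0, 1.0],
    "noisy": [3.0, 1.0, 4.0, 1.0, 5.0],
}
for probe, r in top_correlations(probes, "query", top_n=2):
    print(probe, round(r, 3))
```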
Survival plot
The effect of the expression of a probe ID/gene of interest on survival can be investigated using this feature. The user-supplied gene/probe ID is mapped to the corresponding EntrezGene identifier and a univariate model is applied to the data. Samples are assigned to risk groups based on the median dichotomisation of the mRNA expression intensities of the gene of interest.
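Median dichotomisation simply splits the samples at the median expression value of the gene of interest. A minimal Python sketch follows; the function name and group labels are hypothetical, and placing samples that are exactly equal to the median in the "low" group is an assumption.

```python
import statistics


def assign_risk_groups(expression):
    """Split samples into two groups at the median expression value.

    `expression` maps a sample name to the expression intensity of the
    gene of interest. Values equal to the median go to "low" (assumed).
    """
    cutoff = statistics.median(expression.values())
    return {
        sample: ("high" if value > cutoff else "low")
        for sample, value in expression.items()
    }


# Hypothetical expression intensities for one gene in four samples.
risk_groups = assign_risk_groups({"s1": 6.2, "s2": 8.9, "s3": 7.1, "s4": 9.4})
print(risk_groups)
```

The two resulting groups are the ones compared in the survival plot.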
The genomics workflow in O-miner allows the analysis and visualisation of Affymetrix SNP array data. Two analysis pipelines are offered. These are the Copy Number Binary Segmentation (CBS) pipeline and the Allele Specific copy number analysis of Tumours (ASCAT) pipeline. All results generated are available to download in a zipped format.
The CBS pipeline generates information regarding regions of gain and loss. The CBS workflow comprises several steps, depending on the input type. It takes as input raw image CEL files, log2ratios, segmented or binary coded data for a number of Affymetrix SNP arrays. Aroma.affymetrix is applied to the raw CEL files for copy number estimation, data normalisation and quality control. Segmentation is applied using the CBS model. The data threshold is obtained via the quartile regression framework. Regions of gain and loss are generated and annotated from multiple sources. Minimal Common Regions can be calculated using the CGHregions20 algorithm.
The ASCAT pipeline is recommended for the analysis of cancer datasets, because results generated from this pipeline include information regarding loss of heterozygosity (LOH) regions as well as regions of gain and loss. The ASCAT workflow within O-miner accepts only raw data files as input.
Log2ratios and B-allele frequencies are calculated for each of the samples in the datasets using the R package CalMaTe. These are fitted to an Allele-Specific Piecewise Constant Fitting (ASPCF) model and the ASCAT21 algorithm is used to estimate aberrant cell fraction, tumour ploidy and absolute allele-specific copy number calls. Results are then annotated and presented graphically to users as frequency plots and aberration plots.
Genome-seq (post-processing) workflow
The Genome-seq (post-processing) workflow accepts as input pre-processed files from exome sequencing and whole genome sequencing data. Data from each pair of samples (i.e. normal and tumour) needs to be subjected to the processing steps of alignment to the human genome, BAM sorting and PCR duplicate marking, base quality recalibration and analysis with VarScan to produce a summary of reads for each of the samples. The output from VarScan for each of the samples needs to be processed using the Perl script supplied here. A text file is produced for each of the samples, containing the chromosome number, position, number of reference reads in normal, number of total reads in normal, variant allele frequency (VAF) in normal, number of reference reads in tumour, number of total reads in tumour and the variant allele frequency in tumour. The workflow within O-miner takes this input and calculates the log2 ratios (LRR) and B-allele frequency (BAF) for normal and tumour samples, which are used as input to the ASCAT algorithm to estimate copy numbers between tumour and normal samples. Data is then subjected to the same analytical processes as in the ASCAT workflow.
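The exact formulas used by O-miner for LRR and BAF are not given in this guide. The Python sketch below shows one common convention, purely as an assumption: BAF is taken as the fraction of non-reference reads at a position, and LRR as the log2 ratio of tumour to normal read depth, centred on the median across positions. The function names and read counts are hypothetical.

```python
import math
import statistics


def b_allele_frequency(ref_reads, total_reads):
    """Fraction of reads at a position that do not support the reference."""
    return (total_reads - ref_reads) / total_reads


def log_r_ratios(tumour_depths, normal_depths):
    """Median-centred log2(tumour depth / normal depth) per position.

    Centring on the median is an assumed normalisation step.
    """
    raw = [math.log2(t / n) for t, n in zip(tumour_depths, normal_depths)]
    centre = statistics.median(raw)
    return [value - centre for value in raw]


# Hypothetical read counts at three positions.
tumour_total = [20, 40, 80]
normal_total = [20, 20, 20]
tumour_ref = [10, 30, 80]

print(log_r_ratios(tumour_total, normal_total))
print([b_allele_frequency(r, t) for r, t in zip(tumour_ref, tumour_total)])
```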
The CBS workflow within O-miner accepts raw data files (.CEL files) or partially processed (normalised, segmented or binary) data files, while the ASCAT workflow accepts only raw data files as input. Depending on the data input type different options will become available.
Raw CEL files
If raw CEL files are selected as the data type, the relevant options are displayed for selection as described below.
The user is required to indicate on which Affymetrix SNP arrays the data was generated. These include the 10K, 100K, 500K, SNP5 and SNP6 platforms in addition to the individual enzyme arrays (50K and 250K).
The selection indicates whether users will supply their own normals (user-defined normals) to generate a baseline against which all other samples are compared, or whether they will select a HapMap population. (N.B. the HapMap option is not available for 10K samples).
Here, the user is required to specify whether paired or unpaired samples are being submitted for analysis.
For a paired analysis each of the samples uploaded by the user is compared against a reference sample provided by the user.
An unpaired analysis involves samples being compared to a common reference that can be either provided by the user or is comprised of one of the HapMap reference sets provided by O-miner. If the latter is chosen, the user will be prompted to select from one of the four human populations: African YRI (from Yoruba in Ibadan, Nigeria), Japanese JPT (from Tokyo, Japan), Han Chinese CHB (from Beijing, China) and European CEU (from Utah, USA with ancestry from Northern and Western Europe).
Data source
The raw data files (.CEL files), including those corresponding to normal samples, must be compressed into a single archive (.zip or .tar) no larger than 2GB. It is possible at this stage to give a GEO series number (similar to the transcriptomics workflow) and O-miner will automatically retrieve the data from GEO. After extracting the compressed archive, O-miner displays the available .CEL files in a File Organiser window for the user to create Sample/Reference attributes and define Sample names and Biological group lists. After the submission, O-miner automatically builds up the required directory structure and annotation files to run the methods implemented in the aroma.affymetrix framework.
In summary, O-miner performs initial quality control checks, background correction, allelic cross talk calibration, nucleotide-position probe sequence effects normalisation, probe-level summarisation using robust average (for SNP 5.0 and 6.0 arrays) or log-additive model (for 10K, 100K and 500K arrays), PCR fragment-length effects normalisation and calculates raw copy number estimates (log2ratios) relative to the chosen reference. These normalised estimates are used as input for segmentation methods to identify copy number regions and further subsequent analysis.
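The difference between a paired and an unpaired baseline can be sketched as follows. This is a simplified illustration only: the function name is hypothetical, and the use of a probe-wise median over the reference arrays is an assumption rather than the robust reference actually built by aroma.affymetrix.

```python
import math
from statistics import median

def log2_ratios(sample, references, paired_index=None):
    """Log2 ratio of each probe intensity in `sample` against a baseline.

    sample: list of probe intensities for one test array.
    references: list of reference arrays (each a list of intensities).
    paired_index: index of the matched reference (paired analysis); when
    None, a probe-wise median over all references is used (unpaired).
    """
    if paired_index is not None:
        baseline = references[paired_index]
    else:
        baseline = [median(values) for values in zip(*references)]
    return [math.log2(s / b) for s, b in zip(sample, baseline)]

refs = [[100.0, 200.0, 400.0], [100.0, 200.0, 400.0], [120.0, 180.0, 400.0]]
test = [200.0, 100.0, 400.0]
unpaired = log2_ratios(test, refs)      # common pooled baseline
paired = log2_ratios(test, refs, 0)     # matched reference sample
```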
Examples of data upload
Unpaired analysis on the single-enzyme or genome-wide SNP arrays
Once all the files have been uploaded to O-miner and are visible within the File Organiser window, they need to be organized into appropriate Samples and References categories. If the control samples are supplied by the user, they should go into the References category. When using a HapMap reference population as a normal baseline against which the different biological groups are being compared, the uploaded files will appear in the Samples category and do not need to be altered. Alternatively, samples can be removed from further analysis by selecting them and clicking the "Delete selected" button. The user can further categorise the data by entering a biological group attribute to define the biological source/state at the origin of each array (for example, subtypes of a certain disease).
Paired data generated on 100K/500K enzyme arrays
Once the uploaded files are available within the File Organiser window, they need to be organized into the appropriate enzyme arrays that are used. Samples belonging to the control group need to be highlighted and moved to the References category. It is important for the effective data analysis that all samples are ordered based on the individual from which the normal and diseased DNA samples were obtained. Samples that have been matched incorrectly can be moved using the up/down arrow buttons. Alternatively, samples can be removed from further analysis at this stage by selecting them and clicking the "Delete selected" button. The user is also able to amend the Group name to represent the different biological/experimental groups and/or subgroups being investigated.
Normalised data
Users can select the 'Normalised' option when normalised log2ratios for both the test and reference samples are available. This data will be smoothed by the application of the CBS algorithm.
The text file to be uploaded needs to be in a specific format, comprising probe identifiers, chromosome number and the position of each probe. This file needs to be compressed into a zipped file format and must not exceed the file size limit of 2GB. Users need to map probe positions to the hg19 version of the human genome. Users should avoid purely numeric sample identifiers and should also avoid using non-alphanumeric characters.
Once the data is uploaded, the File Organiser window shows the sample names extracted from the column heading of the uploaded file and offers the option to enter a biological group attribute to define the biological source/state at the origin of each sample for further subgroup analysis. NOTE: Please make sure that your probe ID is numeric AND that there are no duplicated probes in your file.
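The format requirements above can be checked before upload with a short script. The sketch below is an assumption-laden illustration: it assumes a tab-separated file with a header row and the columns probe ID, chromosome, position, followed by one column per sample; the function name and the column order are hypothetical.

```python
import csv
import io
import re

def validate_matrix(text):
    """Return a list of problems found in a tab-separated log2ratio table.

    Assumed layout (hypothetical): probe ID, chromosome, position,
    then one column per sample, with a header row.
    """
    problems = []
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    header, body = rows[0], rows[1:]
    # Sample names: not purely numeric, alphanumeric characters only.
    for name in header[3:]:
        if name.isdigit():
            problems.append(f"sample name '{name}' is purely numeric")
        elif not re.fullmatch(r"[A-Za-z0-9]+", name):
            problems.append(f"sample name '{name}' has non-alphanumeric characters")
    # Probe IDs: numeric and unique.
    seen = set()
    for row in body:
        probe = row[0]
        if not probe.isdigit():
            problems.append(f"probe ID '{probe}' is not numeric")
        if probe in seen:
            problems.append(f"probe ID '{probe}' is duplicated")
        seen.add(probe)
    return problems

example = (
    "probe\tchr\tpos\tT1\t42\tT-3\n"
    "1\t1\t1000\t0.1\t0.2\t0.3\n"
    "1\t1\t2000\t0.0\t0.1\t0.2\n"
    "rs9\t2\t500\t0.4\t0.5\t0.6\n"
)
issues = validate_matrix(example)
```

The example table deliberately contains one numeric sample name, one sample name with a hyphen, one duplicated probe ID and one non-numeric probe ID.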
Segmented data
The uploaded file must be in a tab-separated format and include the segmented data for each sample and the genomic location of all of the probes, generated using hg19 assembly of the human genome. This file must be in a zip format prior to upload and should not exceed the limit of 2GB size. Please make sure that your probe IDs are numeric and also that there are no duplicated probes in your file.
Once the data is uploaded, the File Organiser window shows the sample names from the column headings of the uploaded file and offers the option to enter a biological group attribute to define the biological source/state at the origin of each sample for further subgroup analysis.
The O-miner CBS workflow differs from the genomics analysis pipeline in our previous release in the segmentation stage. Only one segmentation algorithm is applied to the data, that is the Circular Binary Segmentation (CBS) model. This is an improvement on the previous version as it allows the faster processing of data and reduces the complexity of the analytical workflow while retaining the precision.
Data annotations
Users can select gene annotations from four major resources: UCSC22, RefSeq23, Ensembl24 and VEGA25. The annotations are presented in a tabular format in the results, allowing users to identify and visualise genes within regions of genomic alteration. If RefSeq gene annotations are chosen, a 'gene centric' output view is generated displaying all of the genes found within regions of gain and loss in any given target group. In addition, O-miner allows the investigation of regulatory elements, such as Conserved Transcription Factor Binding Sites and miRNA, by determining whether they map to altered regions. Selection of CytoBand provides the user with a rapid overview of the cytogenetic mapping of copy number alterations.
Minimal common regions
For the CBS workflow, users have the option of running a Minimum Common Regions (MCR) analysis on their data. O-miner provides the analysis option of identifying recurrent regions of copy number alterations within the biological groups that are investigated. These Minimum Common Regions can be calculated using CGHregions.
Output options
After segmentation, the data is ready to be processed to determine the regions of gain and loss according to the user-defined cut-offs based on the log2ratio threshold value, consecutive number of SNPs that form a copy number region and the frequency of samples where a copy number event was observed.
Once thresholds are applied to the data, the data becomes binary coded (0: no changes, 1: copy number gain, -1: copy number loss) for subsequent analyses. Users may choose to start their analysis from this level by submitting a binary coded file.
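The binary coding step can be sketched as below. The function name is hypothetical and the use of a single symmetric threshold for gains and losses is an assumption for illustration.

```python
def call_states(segment_means, threshold):
    """Code each segmented log2ratio as 1 (gain), -1 (loss) or 0 (no change)."""
    states = []
    for value in segment_means:
        if value >= threshold:
            states.append(1)       # copy number gain
        elif value <= -threshold:
            states.append(-1)      # copy number loss
        else:
            states.append(0)       # no change
    return states

calls = call_states([0.05, 0.42, -0.31, -0.10, 0.25], threshold=0.25)
```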
Number of consecutive SNPs
The consecutive number of SNPs is set to a default value of 15, thereby only reporting genomic regions with 15 or more consecutive SNPs of a similarly altered copy number status. Alternatively, the user may choose a different value for the number of consecutive SNPs. If this value is zero, then the submitted data will not be filtered.
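A minimal sketch of such a filter on binary coded calls is shown below. The function name is hypothetical, and for brevity the example treats the calls as one ordered run of SNPs on a single chromosome.

```python
from itertools import groupby

def filter_short_runs(states, min_snps=15):
    """Reset altered runs shorter than min_snps to 0.

    A min_snps of 0 disables filtering and returns the calls unchanged.
    """
    if min_snps <= 0:
        return list(states)
    filtered = []
    for state, run in groupby(states):
        run = list(run)
        if state != 0 and len(run) < min_snps:
            filtered.extend([0] * len(run))   # too short: drop the call
        else:
            filtered.extend(run)
    return filtered

states = [1, 1, 0, -1, -1, -1, 0, 1]
kept = filter_short_runs(states, min_snps=3)
```

With a cut-off of 3, only the run of three consecutive losses survives; the shorter gains are reset to "no change".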
Calls of gains and losses
O-miner offers users an option to predict the log2ratio threshold based on the quantile distribution of the median segmented copy numbers. Alternatively the user may enter the required threshold value.
Sample frequency display
An indicator can be incorporated into the output graphics to signal a user defined sample frequency (event observed in at least 20% of samples by default).
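The sample frequency behind this indicator can be sketched as follows: for each probe, the fraction of samples carrying a gain or a loss is computed and compared with the chosen cut-off. The function names are hypothetical.

```python
def event_frequencies(calls_by_sample):
    """Per-probe fraction of samples with a gain and with a loss.

    calls_by_sample: list of per-sample call lists coded 1, 0 or -1.
    """
    n_samples = len(calls_by_sample)
    gains, losses = [], []
    for probe_calls in zip(*calls_by_sample):
        gains.append(sum(c == 1 for c in probe_calls) / n_samples)
        losses.append(sum(c == -1 for c in probe_calls) / n_samples)
    return gains, losses

def recurrent(frequencies, min_frequency=0.20):
    """Indices of probes altered in at least min_frequency of samples."""
    return [i for i, f in enumerate(frequencies) if f >= min_frequency]

calls = [
    [1, 0, -1, 0],
    [1, 0, 0, 0],
    [0, 0, -1, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
]
gain_freq, loss_freq = event_frequencies(calls)
recurrent_gains = recurrent(gain_freq)
recurrent_losses = recurrent(loss_freq)
```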
After setting up all the parameters, users can click on the "Submit" button to proceed. Once the analysis completes, an e-mail is sent to the user, if an address was given, with the URL where the results are available. The user will then have access to results generated at each stage of the analysis workflow. Results may also be downloaded as a single compressed (.zip) file. It is recommended that results are downloaded locally when they are ready.
The results from each analysis are divided into several categories: Summary, Quality Control, MCR (if selected), Sample View and Group View.
Each result page contains a summary view describing the platform, details of the analysis parameters applied to the data as well as intermediate data files generated at each stage of the workflow. All of the results displayed in the different categories are available to download as a zipped file from here.
Density plots produced by the R package aroma.affymetrix are provided for the user to download.
The unsupervised hierarchical clustering of samples based on the Euclidean distance matrix and the average linkage algorithm is performed on the data. This shows clustering of the samples labelled according to the biological groups that are designated by the user.
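For readers unfamiliar with the method, the sketch below shows average-linkage agglomerative clustering on Euclidean distances in its simplest form. It is a plain illustration of the technique, not the R implementation used by O-miner; the function names and the toy profiles are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equally long profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(profiles):
    """Agglomerative clustering with average linkage.

    profiles: dict mapping sample name -> list of values.
    Returns the merges in order as (members_a, members_b, distance).
    """
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average linkage: mean of all pairwise sample distances.
                pairs = [euclidean(profiles[a], profiles[b])
                         for a in clusters[i] for b in clusters[j]]
                dist = sum(pairs) / len(pairs)
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        dist, i, j = best
        merges.append((clusters[i], clusters[j], dist))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

profiles = {
    "tumour_A": [0.0, 0.0],
    "tumour_B": [0.0, 1.0],
    "normal_A": [5.0, 5.0],
    "normal_B": [5.0, 6.0],
}
merges = average_linkage(profiles)
```

In this toy example the two tumour profiles merge first, then the two normal profiles, and the two groups are joined last, which is the pattern a dendrogram labelled by biological group would display.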
Copy number alterations for each of the samples can be viewed either graphically or in a tabular format. A summary table of the altered copy number regions can be viewed for each of the samples that are submitted. Users can browse through the results obtained for each individual sample including log2ratio plots and also annotated regions of gains and losses that can be viewed as a track in the Ensembl Genome Browser alongside a rich collection of annotations.
Regions of gains and losses are available for each of the individual samples. These may be viewed in HTML format or exported as an Excel file. Information such as the number of SNPs within the aberrant regions and the log2ratio values is also provided. The different columns listed in the results can be sorted to reorder the results. Each individual segment can be visualised in greater detail from the interface by selecting the links to the Ensembl genome browser. This will open a new browser window. A gene-centric summary file (in Excel format) is produced which lists the mapping of genes to the regions with genomic aberrations in the analysed samples.
Ensembl genome Browser
For each of the samples, a hyperlink to the Ensembl genome browser is provided to allow results to be viewed along with the rich collection of annotation tracks that are available through the genome browser. This gives the user the option of building multiple custom tracks to visualise the analysis of multiple biological groups simultaneously.
Log2ratio plots (unfiltered and filtered)
Sample-specific views of log2ratio intensities are available for each of the samples. The y-axis of each of the plots represents the log2 ratios of the test to baseline signal intensity. O-miner produces log2ratio plots both with and without filtering.
O-miner allows the different biological groups to be compared and visualised. Results of analyses may be viewed either on a single-group basis or as an inter-group comparison. The Group view summarises results based on biological groups originally defined by the user, including frequency plots and a gene level view to summarise the gene content within copy number alterations.
Frequency plots (unfiltered/filtered)
Frequency plots are generated, which display the frequency of all gain and loss events. Individual chromosomal displays and a summary view are provided for each of the groups allowing users to highlight regions of possible interest. Frequency plots are available both before and after application of the default filter or a user-defined filter, which helps to remove variation artefacts. These regions may then be compared in greater depth between groups via the chromosome view facilitating the determination of trends which are both unique and common to each of the categories.
Frequency plots are generated for each chromosome. These highlight common regions of genomic alteration between samples from each of the biological groups. Where multiple biological or experimental groups/subgroups have been defined in the target group naming, the results obtained from each group may be visualised simultaneously. This allows a quick and easy assessment of copy number alterations that are common and unique to these groups.
Regions of recurrent copy number aberrations common to samples within each of the biological groups specified in the target file are identified using a method such as CGHregions. Results for minimum common region analysis are available as frequency plots, tables, and exportable files from both MCR and BED folders located within the MCR tab.
The results from each analysis are divided into several categories: Summary, Quality Control, Sample View and Group View.
The summary page gives information regarding the platform and workflow that the user has chosen for analysis. Log2ratio and B-allele frequency (BAF) data are available to download for each of the samples. The log2 ratios and BAF ratios for the subsets of normal and tumour samples are available separately for download (in a paired analysis). All information that is available can be downloaded as a zipped file.
Density plots produced by the R package aroma.affymetrix are displayed, which show the distribution of the normalised signal intensity for each of the samples.
The aberration plot shows a graphical representation of gains and losses across each of the samples, with gains shown in red and losses in blue. This plot is produced using the R package copynumber26.
For each of the samples, the BAF and log2ratio plots and ASCAT profile plots are provided. ASCAT profiles allow users to investigate copy number neutral and LOH events. An estimated value is given for both ploidy and aberrant cell fraction. O-miner provides annotated regions of gain, loss and copy neutral LOH to download as an excel spreadsheet.
Frequency plots for target (unpaired analysis) or target and baseline samples (paired analysis) are generated using the R package copynumber. It is possible to view a frequency plot for all of the chromosomes and individual plots for each chromosome. Genes that have been significantly altered are available to download in an annotated format as an Excel spreadsheet.
O-miner provides analysis workflows for raw (IDAT files) and normalised data from Illumina methylation array platforms. Quality control is performed on the raw data using the R package ChAMP27. There is a choice of three normalisation methods: BMIQ, SWAN and PBC. The normalised matrix is then filtered and differential methylation analysis is performed using LIMMA. User-defined thresholds for the ΔB value and adjusted p-values are applied to the data. Differentially methylated regions are annotated and users can choose to find statistically significant Gene Ontology terms. All of the results are available to download in a zipped format.
The O-miner methylation workflow accepts both raw (IDAT) files and normalised files as input, from the Infinium HumanMethylation27 BeadChip and the Infinium HumanMethylation450 BeadChip arrays. Data may be uploaded using a GSE series identifier from GEO or as compressed files to the O-miner server (not exceeding 2GB).
Analysis may be run as either paired or unpaired, where a paired analysis is done when both target and baseline samples are taken from the same patient. The COMBAT algorithm can be run if samples from more than one study are analysed. If this option is chosen, users then need to fill out the Study column that will appear in the File Organiser window once data files are uploaded to the server. If technical replicates are present within the dataset, users need to indicate which samples are replicates in the Replicates column that will also appear in the File Organiser window.
For more information on the options for uploading data, please see sub-section 2.1.1, as the upload data options are the same for both transcriptomics and methylomics.
If the user wishes to run an analysis from IDAT files then both the "red colour" (Cy5) scan and the "green colour" (Cy3) scan files need to be uploaded for each sample. Once these have been uploaded to the server, O-miner will automatically format these files for analysis and only one file for each sample will be shown in the File Organiser window to assign biological groups/state.
Quality control is run in the O-miner workflow using the ChAMP method by default. Normalisation methods available are BMIQ (Beta MIxture Quantile dilation)28, SWAN (Subset-quantile Within Array Normalisation)29 and PBC (Peak-Based Correction)30. The filter method option determines how the normalised matrix is filtered; the available options are interquartile range, standard deviation and intensity. LIMMA is used to identify differentially methylated probes. The options available for these sections are the same as for transcriptomics data input.
Thresholds for adjusted p-values and ΔB value are used. Only differentially methylated regions passing these thresholds will be reported. Venn diagrams can be generated to show probes that are common or different between the different groups of samples that are analysed.
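The way the two thresholds act together can be sketched as below. The function name, probe identifiers and cut-off values are hypothetical and chosen only for illustration; they are not O-miner defaults.

```python
def classify_probes(results, delta_beta_cutoff=0.2, adj_p_cutoff=0.05):
    """Split probes passing both thresholds into hyper- and hypomethylated lists.

    results: iterable of (probe_id, delta_beta, adjusted_p) tuples.
    The cut-off defaults are illustrative assumptions.
    """
    hyper, hypo = [], []
    for probe_id, delta_beta, adj_p in results:
        if adj_p > adj_p_cutoff or abs(delta_beta) < delta_beta_cutoff:
            continue  # fails at least one of the two thresholds
        (hyper if delta_beta > 0 else hypo).append(probe_id)
    return hyper, hypo

results = [
    ("cg0001", 0.35, 0.001),
    ("cg0002", -0.28, 0.020),
    ("cg0003", 0.40, 0.300),   # large change but not significant
    ("cg0004", 0.05, 0.001),   # significant but small change
]
hyper, hypo = classify_probes(results)
```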
The results from each analysis are divided into several categories: Summary, Quality Control, Differential Methylation, Gene Ontology and Visualisation. The summary, gene ontology and visualisation categories work exactly the same as the transcriptomics workflow.
QC results are displayed in three ways: Sample quality, QC plots and cluster diagram.
The data is filtered on the detection p-value (< 0.01). A table is produced showing the fraction of failed probes for each of the samples in the dataset. O-miner does not automatically remove samples from the analysis. Users may wish to remove samples that have a fraction of failed probes greater than 0.05.
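The sample quality table can be reproduced in outline as follows. The sketch counts a probe as failed when its detection p-value is not below 0.01 and flags samples whose failed fraction is above 0.05, on the assumption that a high failed fraction indicates a poor-quality sample. Function and sample names are hypothetical.

```python
def failed_fraction(detection_p, cutoff=0.01):
    """Fraction of probes in one sample whose detection p-value is not below cutoff."""
    failed = sum(1 for p in detection_p if p >= cutoff)
    return failed / len(detection_p)

def flag_samples(detection_by_sample, max_failed=0.05):
    """Names of samples whose failed fraction exceeds max_failed."""
    return [name for name, pvals in detection_by_sample.items()
            if failed_fraction(pvals) > max_failed]

detection = {
    "sample_1": [0.001] * 98 + [0.5] * 2,    # 2% of probes failed
    "sample_2": [0.001] * 90 + [0.5] * 10,   # 10% of probes failed
}
flagged = flag_samples(detection)
```

Flagged samples are only candidates for removal; the decision remains with the user, since O-miner does not remove samples automatically.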
Four plots are produced and available to view by downloading:
Raw density plot, which shows the density of unnormalised beta values for each of the samples in the dataset;
Normalised density plot, which shows the values of the normalised beta values for each of the samples;
Raw MDS plot, which is a multidimensional scaling plot and shows the similarity of samples based on the top 1000 most variable probes amongst all of the samples considering the unnormalised beta values;
Normalised MDS plot, which is also a multidimensional scaling plot based on the top 1000 most variable probes, considering the normalised beta values.
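The probe selection that precedes the MDS plots can be sketched as below. Only the selection of the most variable probes is shown, not the scaling itself; the function name and probe identifiers are hypothetical, and ranking by variance is an assumption about how "most variable" is measured.

```python
from statistics import pvariance

def top_variable_probes(beta, n_top=1000):
    """Return the IDs of the n_top probes with the highest variance across samples.

    beta: dict mapping probe ID -> list of beta values (one per sample).
    """
    ranked = sorted(beta, key=lambda probe: pvariance(beta[probe]), reverse=True)
    return ranked[:n_top]

beta = {
    "cg0001": [0.10, 0.90, 0.15, 0.85],   # highly variable
    "cg0002": [0.50, 0.52, 0.49, 0.51],   # almost constant
    "cg0003": [0.20, 0.60, 0.25, 0.55],   # moderately variable
}
selected = top_variable_probes(beta, n_top=2)
```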
An unsupervised hierarchical clustering algorithm is used on the filtered matrix of beta values to generate a clustering plot of the samples in the dataset.
For each of the comparisons performed, an expandable section appears under the differential methylation category. Each section reports summary information giving the number of hypermethylated and hypomethylated regions for the comparison. A histogram of raw p-values is also available for download. For each of the differentially methylated probes, the following information is shown in tabular format: the probeset ID, chromosomal location, gene name, gene description, whether the region is annotated as differentially methylated, CpG island location within the differentially methylated region, ΔB value and the adjusted p-value.
1.Smyth, G.K. (2004) Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol,3,Article3.
2.Johnson, WE, Rabinovic, A, and Li, C (2007). Adjusting batch effects in microarray expression data using Empirical Bayes methods. Biostatistics 8(1):118-127.
3.Yoshihara K, Shahmoradgoli M, Martínez E, Vegesna R, Kim H, Torres-Garcia W, et al.: Inferring tumour purity and stromal and immune cell admixture from expression data. Nat Commun 2013, 4:2612.
4.Asare, A.L., Gao, Z., Carey, V.J., Wang, R. and Seyfert-Margolis, V. (2009) Power enhancement via multivariate outlier testing with gene expression arrays. Bioinformatics, 25, 48-53.
5.Kauffmann, A., Gentleman, R. and Huber, W. (2009) arrayQualityMetrics - a bioconductor package for quality assessment of microarray data. Bioinformatics, 25, 415-416.
6.Rosner, B. (1983) Percentage points for a generalized ESD many-outlier procedure. Technometrics, 25, 165-172.
7.Irizarry, R.A., Hobbs, B., Collin, F., Beazer-Barclay, Y.D., Antonellis, K.J., Scherf, U. and Speed, T.P. (2003) Exploration, normalisation, and summaries of high density oligonucleotide array probe level data. Biostatistics, 4, 249-264.
8.Wu, Z. and Irizarry, R.A. (2004) Preprocessing of oligonucleotide array data. Nat Biotechnol, 22, 656-658; author reply 658.
9.Giorgi, F.M., Bolger, A.M., Lohse, M. and Usadel, B. (2010) Algorithm-driven artifacts in median polish summarization of microarray data. BMC Bioinformatics, 11, 553.
10.Du, P., Kibbe, W.A. and Lin, S.M. (2008) lumi: a pipeline for processing Illumina microarray. Bioinformatics, 24(13), 1547-1548.
11.Gentleman, R., Carey, V., Huber, W. and Hahne, F. http://www.bioconductor.org/packages/release/bioc/html/genefilter.html.
12.Limma User Guide, www.bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf
13.Holm, S. (1979) A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6, 65-70.
14.Benjamini, Y. and Hochberg, Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B, 57, 289-300.
15.Benjamini, Y. and Yekutieli, D. (2001) The control of the false discovery rate in multiple testing under dependency. Annals of Statistics, 29, 1165-1188.
16.Robinson MD, McCarthy DJ and Smyth GK (2010). “edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.” Bioinformatics, 26(1), pp. 139-140.
17.Falcon, S. and Gentleman, R. (2007) Using GOstats to test gene lists for GO term association. Bioinformatics, 23, 257-258.
18.Bengtsson, H., Simpson, K., Bullard, J. and Hansen, K. (2008) aroma.affymetrix: A generic framework in R for analyzing small to very large Affymetrix data sets in bounded memory. Tech Report 745, Department of Statistics, University of California, Berkeley, February 2008.
19.Therneau T (2015). A Package for Survival Analysis in S. version 2.38, http://CRAN.R-project.org/package=survival.
20.van de Wiel, M.A. and van Wieringen, W.N. (2007) CGHregions: dimension reduction for array CGH data with minimal information loss. Cancer Inform, 3, 55-63.
21.Van Loo P., Nordgard S.H., Lingjærde O.C., Russnes H.G., Rye I.H., Sun W., Weigman V.J., Marynen P., Zetterberg A., Naume B., et al. Allele-specific copy number analysis of Tumours. Proc. Natl. Acad. Sci. U.S.A. 2010;107:16910-16915.
22.Kent, W.J., Sugnet, C.W., Furey, T.S., Roskin, K.M., Pringle, T.H., Zahler, A.M. and Haussler, D. (2002) The human genome browser at UCSC. Genome Research, 12, 996-1006.
23.Pruitt, K.D., Tatusova, T. and Maglott, D.R. (2005) NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Research, 33, D501-504.
24.Flicek, P., Amode, M.R., Barrell, D., Beal, K., Brent, S., Carvalho-Silva, D., Clapham, P., Coates, G., Fairley, S., Fitzgerald, S. et al. (2012) Ensembl 2012. Nucleic Acids Research, 40, D84-90.
25.Ashurst, J.L., Chen, C.K., Gilbert, J.G., Jekosch, K., Keenan, S., Meidl, P., Searle, S.M., Stalker, J., Storey, R., Trevanion, S. et al. (2005) The Vertebrate Genome Annotation (Vega) database. Nucleic Acids Research, 33, D459-465.
26.Nilsen G, Liestol K, Loo PV, Vollan H, Eide M, Rueda O, Chin S, Russell R, Baumbusch L, Caldas C, Borresen-Dale A and Lingjaerde O (2012). “Copynumber: Efficient algorithms for single- and multi-track copy number segmentation.” BMC Genomics, 13(1), pp. 591.
27.Morris TJ, Butcher LM, Teschendorff AE, Chakravarthy AR, Wojdacz TK and Beck S (2014). “ChAMP: 450k Chip Analysis Methylation Pipeline.” Bioinformatics, 30(3), pp. 428-430.
28.Teschendorff,A. et al. (2013) A beta-mixture quantile normalisation method for correcting probe design bias in illumina infinium 450k DNA methylation data. Bioinformatics, 29, 189–196
29.Maksimovic,J. et al. (2012) Swan: subset-quantile within array normalisation for illumina infinium humanmethylation450 beadchips. Genome Biol., 13, R44.
30.Dedeurwaerder, S. et al. (2011) Evaluation of the Infinium Methylation 450K technology. Epigenomics, 3(6), 771–84.