Preprocessing

Removal of SNP-enriched Probes

139721 sites were removed because they overlap with SNPs. The list of removed probes is available in a dedicated table accompanying this report.

Removal of Cross-reactive Probes

34264 probes were removed because their sequences are non-specific and have a high likelihood of cross-hybridization [1]. The list of removed probes is available in a dedicated table accompanying this report.

Greedycut

The Greedycut algorithm iteratively removes the probes and samples of highest impurity from the dataset. These correspond to the rows and columns of the detection p-value table that contain the largest fraction of unreliable measurements. This section summarizes the results of applying Greedycut to the analyzed dataset.

Unreliable Measurements

We considered a β value unreliable when its corresponding detection p-value was not below the threshold T = 0.05.
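
The rule above can be illustrated with a minimal Python sketch. It assumes the detection p-values are held as a list of rows (probes) with one entry per sample; the function names and the example values are hypothetical and are not part of RnBeads.

```python
T = 0.05  # detection p-value threshold stated in the report

def unreliable_mask(pvalues, threshold=T):
    """Flag every entry whose detection p-value is not below the threshold."""
    return [[p >= threshold for p in row] for row in pvalues]

def unreliable_counts(mask):
    """Return (per-probe counts, per-sample counts) of unreliable entries."""
    per_probe = [sum(row) for row in mask]
    per_sample = [sum(col) for col in zip(*mask)]
    return per_probe, per_sample

# Hypothetical detection p-values: 3 probes (rows) x 4 samples (columns).
pvalues = [
    [0.001, 0.20, 0.010, 0.030],
    [0.060, 0.07, 0.500, 0.049],
    [0.000, 0.00, 0.050, 0.001],
]
mask = unreliable_mask(pvalues)
per_probe, per_sample = unreliable_counts(mask)
print(per_probe)   # [1, 3, 1]
print(per_sample)  # [1, 2, 2, 0]
```

Note that a p-value exactly equal to the threshold counts as unreliable, because the criterion is "not below".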

The figure below summarizes the observed number of unreliable measurements per probe and per sample.

Figure 1

Cumulative distribution function of the number of unreliable values per probe and per sample.

Filtered Probes and Samples

RnBeads executed Greedycut using the threshold given above and applied all its steps. Briefly, Greedycut is an iterative algorithm that removes, one at a time, the probe or sample with the highest fraction of unreliable measurements. Every iteration of the algorithm therefore produces a matrix of retained measurements and a set of removed ones.

We calculated the false positive rate (α) and the sensitivity (s) obtained when the retained measurements are treated as predictions of the reliable ones. Among all matrices produced by Greedycut, we selected the one that maximizes the expression s + 1 - α, thereby giving equal weight to sensitivity (s) and specificity (1 - α). On a ROC curve, this is the point farthest from the diagonal. The results of the Greedycut procedure and the selected iteration are presented in the figure below.

Figure 2

Change of table dimensions / metric related to accuracy as Greedycut progressively removes probes and samples. Accuracy is calculated by treating the retained entries as predictive of reliable measurements. The red circle, if present, marks the last iteration that was executed.
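
The procedure and the selection criterion described above can be sketched in Python as follows. This is an illustrative reading of the description, not the RnBeads implementation: the function name, the tie-breaking rule (prefer removing a probe when a probe and a sample are equally impure) and the stopping rule (stop once no unreliable entry is retained) are assumptions.

```python
def greedycut(unreliable):
    """Greedy filtering sketch on a probes x samples table of booleans.

    Returns (score, kept_probe_indices, kept_sample_indices) for the
    iteration that maximizes s + 1 - alpha.
    """
    rows = set(range(len(unreliable)))
    cols = set(range(len(unreliable[0]))) if unreliable else set()
    total_bad = sum(sum(1 for v in row if v) for row in unreliable)
    total_good = len(rows) * len(cols) - total_bad
    best = None
    while True:
        # Unreliable entries still retained, per probe and per sample.
        row_bad = {r: sum(1 for c in cols if unreliable[r][c]) for r in rows}
        col_bad = {c: sum(1 for r in rows if unreliable[r][c]) for c in cols}
        bad_kept = sum(row_bad.values())
        good_kept = len(rows) * len(cols) - bad_kept
        # Retained entries play the role of "predicted reliable".
        s = good_kept / total_good if total_good else 0.0
        alpha = bad_kept / total_bad if total_bad else 0.0
        score = s + 1 - alpha
        if best is None or score > best[0]:
            best = (score, sorted(rows), sorted(cols))
        if bad_kept == 0:  # assumed stopping rule
            break
        worst_row = max(sorted(rows), key=lambda r: row_bad[r])
        worst_col = max(sorted(cols), key=lambda c: col_bad[c])
        row_frac = row_bad[worst_row] / len(cols)
        col_frac = col_bad[worst_col] / len(rows)
        if row_frac >= col_frac:  # assumed tie-break: remove the probe
            rows.remove(worst_row)
        else:
            cols.remove(worst_col)
    return best

# Hypothetical table: probe 2 is unreliable in every sample.
table = [
    [False, False, False],
    [False, False, False],
    [True, True, True],
    [False, False, False],
]
result = greedycut(table)
print(result)  # (2.0, [0, 1, 3], [0, 1, 2])
```

In this toy table, removing probe 2 retains every reliable entry (s = 1) and no unreliable one (α = 0), so the selected iteration reaches the maximum possible score of 2.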

Based on the criteria described above, 111193 probes and 0 samples were filtered out. Links to the lists of removed items are given below.

Type Removed Table
Probes 111193 removed_sites_greedycut.csv
Samples 0

Filtering Summary I

As a final outcome of the filtering procedures, 285178 probes and 0 samples were removed (30 samples and 581717 probes were retained). These statistics are presented in a dedicated table that accompanies this report and visualized in the figure below.

Figure 3

Fractions of removed values in the dataset after applying filtering procedures.

The figure below compares the distributions of the removed methylation β values and of the retained ones.

Figure 4

Comparison of removed and retained β values. Both distributions are estimated by randomly sampling 1000000 values in each group.

Normalization

The data were normalized using the dasen method [2].

Effect of Correction

This section shows the influence of the applied normalization procedure on CpG methylation values. The following figure compares the distributions of the β values before and after performing normalization.

Figure 5

Comparison of β values before and after correction. Both distributions are estimated by randomly sampling 1000000 values in each group.

The next figure indicates the magnitude of the correction by showing the distribution of shifts, i.e. the amounts by which the raw methylation values were modified.

Figure 6

Histogram of observed magnitude of β value correction.
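
As a small illustration of how such a histogram of shifts can be derived, the Python sketch below computes one shift per measurement and counts the shifts per bin. The sign convention (normalized minus raw), the bin width, the function names and the example values are assumptions made for this sketch; the report does not state them.

```python
import math
from collections import Counter

def shifts(raw, normalized):
    """Per-measurement shift, taken here as normalized β minus raw β."""
    if len(raw) != len(normalized):
        raise ValueError("raw and normalized must have the same length")
    return [n - r for r, n in zip(raw, normalized)]

def bin_shifts(values, width=0.05):
    """Count shifts per bin; key k labels the interval [k*width, (k+1)*width)."""
    return Counter(math.floor(v / width) for v in values)

# Hypothetical β values before and after normalization.
raw = [0.50, 0.13, 0.80, 0.25]
normalized = [0.52, 0.10, 0.87, 0.25]

delta = shifts(raw, normalized)
counts = bin_shifts(delta)
print(sorted(counts.items()))  # [(-1, 1), (0, 2), (1, 1)]
```

Pairing each raw value with its shift, instead of binning the shifts alone, gives the data behind a 2D histogram of raw values against corrections.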

The figure below gives a more detailed view. This color-coded 2D histogram shows the uncorrected β values and their respective shifts after performing the normalization procedure.

Figure 7

2D histogram showing the raw β values and the magnitude of the corrections.

Sample Mean Methylations

Sample average methylation cannot be visualized because no valid Sentrix ID and Sentrix Position information could be extracted from the sample annotation table.

Region Annotations

In addition to CpG sites, there are 4 sets of genomic regions to be covered in the analysis. The table below gives a summary of these annotations.

Annotation   Description                                                   Regions in the Dataset
tiling       Genome tiling regions of length 5000                          199057
genes        Ensembl genes, version Ensembl Genes 75                       31322
promoters    Promoter regions of Ensembl genes, version Ensembl Genes 75   39940
cpgislands   CpG island track of the UCSC Genome browser                   24918

Context-specific Probe Removal

The studied dataset contained a total of 649 probes in the specified contexts, all of which were removed. The list of removed probes is available in a dedicated table accompanying this report. The table below summarizes the number of removed probes per context.

Context Probes
CC 0
CAG 534
CAH 59
CTG 6
CTH 0
Other 50

Removal of Probes on Sex Chromosomes

12946 probes on sex chromosomes were removed at this step. The list of removed probes is available in a dedicated table accompanying this report.

Removal of Probes with (Many) Missing Values

153 probes were removed because they contain more than 15 missing values in the methylation table. This threshold corresponds to 50% of all samples. The total number of missing values in the methylation table before this filtering step was 5088. A dedicated table of all removed probes is attached to this report.
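
The filtering rule can be sketched in Python as follows. The sketch assumes that missing values are represented as None and that the methylation table is a list of rows (probes); the function name and the toy data are hypothetical.

```python
def filter_missing(beta, max_missing):
    """Split probe indices into (kept, removed).

    A probe is removed when it has MORE than max_missing missing values,
    so a probe with exactly max_missing missing values is kept.
    """
    kept, removed = [], []
    for index, row in enumerate(beta):
        n_missing = sum(1 for value in row if value is None)
        (removed if n_missing > max_missing else kept).append(index)
    return kept, removed

# In the report, the threshold of 15 is 50% of the 30 samples.
assert 0.5 * 30 == 15

# Hypothetical table with 4 samples, so the 50% threshold is 2.
beta = [
    [0.10, None, 0.30, 0.40],  # 1 missing value: kept
    [None, None, None, 0.20],  # 3 missing values: removed
    [None, None, 0.50, 0.60],  # 2 missing values: kept (not above threshold)
]
kept, removed = filter_missing(beta, max_missing=2)
print(kept, removed)  # [0, 2] [1]
```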

The figure below shows the distribution of missing values per probe.

Figure 8

Histogram of the number of missing values per probe. The vertical line, if visible, denotes the applied threshold.

Filtering Summary II

As a final outcome of the filtering procedures, 13748 probes and 0 samples were removed (30 samples and 567969 probes were retained). These statistics are presented in a dedicated table that accompanies this report and visualized in the figure below.

Figure 9

Fractions of removed values in the dataset after applying filtering procedures.
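
The counts reported in the two filtering summaries can be cross-checked with a few lines of Python. The sketch assumes that the filtering steps were applied sequentially, so that the sets of removed probes do not overlap; the dictionary labels are shorthand for the section titles, and all numbers are copied from this report.

```python
# Probes removed per step, as stated in the corresponding sections.
stage_one = {"SNP-enriched": 139721, "cross-reactive": 34264, "Greedycut": 111193}
stage_two = {"context-specific": 649, "sex chromosomes": 12946, "missing values": 153}
contexts = {"CC": 0, "CAG": 534, "CAH": 59, "CTG": 6, "CTH": 0, "Other": 50}

retained_after_one = 581717  # Filtering Summary I
retained_after_two = 567969  # Filtering Summary II

removed_one = sum(stage_one.values())
removed_two = sum(stage_two.values())

print(removed_one)                       # 285178, matches Filtering Summary I
print(removed_two)                       # 13748, matches Filtering Summary II
print(sum(contexts.values()))            # 649, matches the context table total
print(retained_after_one - removed_two)  # 567969, matches Filtering Summary II
print(retained_after_one + removed_one)  # 866895 probes before any filtering
```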

The figure below compares the distributions of the removed methylation β values and of the retained ones.

Figure 10

Comparison of removed and retained β values. The distribution of retained betas is estimated by randomly sampling 1000000 values.

References

  1. Pidsley, R., Zotenko, E., Peters, T.J., Lawrence, M.G., Risbridger, G.P., Molloy, P., Van Djik, S., Muhlhausler, B., Stirzaker, C., Clark, S.J. (2016) Critical evaluation of the Illumina MethylationEPIC BeadChip microarray for whole-genome DNA methylation profiling. Genome Biology, 17:208
  2. Pidsley, R., Wong, C., Volta, M., Lunnon, K., Mill, J., Schalkwyk, L. (2013) A data-driven approach to preprocessing Illumina 450K methylation array data. BMC Genomics, 14:293