Detection of Copy Number Variation using Shallow Whole Genome Sequencing Data to replace Array-Comparative Genomic Hybridization Analysis


Master's Thesis

Detection of Copy Number Variation using Shallow Whole Genome Sequencing Data to replace Array-Comparative Genomic Hybridization Analysis

Author: Daphne M. van Beek (daphne@daphnevanbeek.nl)
Student number
Date: October 23, 2012

Thesis supervisors: Prof.dr.ir. M.J.T. Reinders; Dr. M.M. Weiss; Dr. E.A. Sistermans

Thesis Committee: Prof.dr.ir. M.J.T. Reinders; Dr.ir. J. de Ridder; Dr. C.P. Botha, MSc; Dr. E.A. Sistermans; Dr. M.M. Weiss; Ir. J. Nijkamp

Preface

This report was written as part of the Master's Thesis project of the master Computer Science, track Bioinformatics, at Delft University of Technology. The main focus of this document lies on the paper that resulted from my research on the detection of copy number variations using next generation sequencing data. It is likely that next generation sequencing techniques will in the future replace the array-comparative genomic hybridization technique that is currently used in clinics. For next generation sequencing data to replace this technique, the minimal coverage required for competitive detection should be known; this was the focus of my work.

Besides the main paper, a supplement is provided that gives additional information about the research that was done. A work document is also included, in which the progress and the observations made during the project are recorded. The Master's thesis project was done at the Bioinformatics Lab at Delft University of Technology in collaboration with the department of Clinical Genetics of the VU Medical Center in Amsterdam.

Acknowledgements

Thanks to Janneke Weiss, Marcel Reinders and Erik Sistermans for their advice, supervision and all the interesting discussions. Thanks to Desiree Steenbeek, Daoud Sie, Quinten Waisfisz and Bauke Ylstra for the tips and their help with both the aCGH and NGS issues I encountered during the project. Thanks to Roy Straver for the collaboration in the starting phase of our thesis projects. It was great that we could merge our ideas and even invent our own, new alignment method (now we only have to make sure the code works)!

Master's Thesis, October 23, 2012, Pages 1-21

Detection of Copy Number Variation using Shallow Whole Genome Sequencing Data to replace Array-Comparative Genomic Hybridization Analysis

D.M. van Beek (1,*), M.M. Weiss (2), D. Sie (3), B. Ylstra (3), M.J.T. Reinders (1) and E.A. Sistermans (2)

(1) Delft Bioinformatics Laboratory, Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, Mekelweg 4, 2628 CD, Delft, The Netherlands
(2) Department of Clinical Genetics, VU University Medical Center, Van der Boechorststraat 7, 1081 BT Amsterdam, The Netherlands
(3) Department of Pathology, VU University Medical Center, De Boelelaan 1117, 1081 HV Amsterdam, The Netherlands

Defence date: October 23, 2012
Supervisors: M.J.T. Reinders, M.M. Weiss and E.A. Sistermans

ABSTRACT

Motivation: Copy Number Variations (CNVs) are known to be involved in various disease phenotypes. In the clinic, CNVs are detected by aCGH analysis. Next Generation Sequencing (NGS) techniques are a promising replacement for this analysis because of their higher resolution. To make NGS-based CNV detection applicable in the clinic, the method should be competitive with aCGH. In terms of costs, this could be reached by shallow whole genome sequencing (< 0.5x coverage). It is not yet known what minimal coverage is required for a competitive and reliable analysis. Here we present a thorough comparison of the CNVs detected by aCGH and by NGS for different coverages of the NGS data.

Results: Analysis of NGS data with a coverage of 0.4x after alignment and pre-processing produces a list of CNVs that largely overlaps with the regions detected by aCGH and that contains few additional detected CNVs. A lower coverage generates many more CNVs, which are probably erroneous detections. Dilution experiments show that an even lower coverage (0.23x) can probably be used.
In-depth analysis also revealed that it is essential to consider the read-depth plots of both the test and the control sample, besides their log2 ratio, as these provide valuable information about the origin of the CNV. To reduce the number of false positives considerably, we introduced a rule-filter that removes on average 80% of the unreliable CNVs. The best choice for the parameters of the CNV detection tools is very variable and depends largely on the coverage of both the test and the control sample. Therefore, we conclude that for application in the clinic a quality control system should be developed.

Contact: daphne@daphnevanbeek.nl

* To whom correspondence should be addressed.

1 INTRODUCTION

Copy number changes may indicate (potentially) pathogenic insertions and deletions that can explain a specific phenotype (for example intellectual disability or forms of hereditary cancer [Vissers et al., 2009], [Campbell et al., 2008]). CNVs in specific regions of the genome may influence the regulation and expression of genes, which makes detection of these aberrations important in localizing the genes that cause a patient's phenotype.

There are numerous methods available to detect copy number variations (CNVs) in the genome. The current standard for detecting CNVs in clinics is array-comparative genomic hybridization (aCGH) analysis [Pinkel et al., 1998], [Holstege, 2010]. Next Generation Sequencing (NGS) is becoming less expensive and more widely applicable, and can achieve a higher resolution: NGS makes it possible to detect smaller CNVs and is not limited to predefined probe positions. Therefore, the use of NGS data for the detection of these disease-related CNVs is an attractive option to consider. The NGS methods developed so far mainly focus on the detection of aberrations in cancer samples and on sequencing data with high coverage (>10x). To improve the detection of disease-related CNVs, other methods need to be evaluated.
The CNVs that need to be detected are generally smaller (aCGH resolution is around 50 kb, but smaller CNVs may be involved) and can be situated anywhere on the genome.

Early methods for the detection of CNVs using NGS read data were read-depth based and mainly focused on the detection of copy number changes and chromosomal rearrangements in cancer [Campbell et al., 2008]. Numerous aberrations (both insertions and deletions) can be observed over a chromosome of a genome affected by cancer. Read-depth methods assume that reads are mapped to the genome according to a distribution (for example, a Poisson distribution [Yoon et al., 2009]). As a duplicated region exists twice, it will have twice as many reads sequenced; alignment to a reference genome will thus result in a higher number of reads aligned in this region (and, conversely, in a lower number for a deletion) [Bailey et al., 2002]. Later read-depth methods also focused on finding CNVs in other types of tissue, as it became known that CNVs play a large role in the development of diseases.

Paired-end methods using NGS were first mentioned in 2007 [Korbel et al., 2007] and were also mainly used for CNV detection in cancer samples. Currently the paired-end approach is the most-used NGS method for the detection of CNVs, as the detection itself is less dependent on the natural variation of coverage that occurs when using whole genome sequencing. Paired-end sequencing generates a read at each end of a longer DNA strand ( bases). As the length of the total strand is known, the distance between the mapping positions of the two paired reads should match this length. If this distance is longer on the reference genome, a deletion is located in the sample; if it is shorter, an insertion has taken place. Other aberrations can also be detected: if one of the paired reads is mapped to the reference genome in the opposite direction, the corresponding region is inverted.

To compete with aCGH, the costs of the NGS method should be similar or lower. Paired-end sequencing takes twice as long and is about twice as expensive as single-end sequencing. This is the main reason why we chose to work with single-end shallow whole genome sequencing. Shallow sequencing means that not the whole genome is covered, but only a small part of it, which reduces sequencing costs. Currently, the cost of shallow sequencing 20% of the genome (indicated by 0.2x) is comparable to the cost of one aCGH analysis. With 0.2x coverage, on average one read per 255 base pairs is aligned. When looking at CNV regions that are larger than this minimum, using less than 1x coverage is still reliable.

To decide on replacing aCGH with NGS in the clinic, it is necessary to know the minimum coverage that is needed to detect the same CNVs that are currently detected using the aCGH Agilent 180k oligo microarray. The probes on this microarray are 60 base pairs in length and on average 1 probe per base pairs is encountered, covering about 3.5% of the genome. As NGS can achieve a higher resolution and is not limited by predefined probe positions, we are also interested in whether we can detect CNVs that are smaller than the smallest CNVs found by aCGH (which can detect CNVs of >50 kb in length).
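The coverage figures quoted above follow from simple arithmetic, which the minimal sketch below illustrates. The read length of 51 bases is an assumption: it is not stated in this part of the text, but it is the value for which 0.2x coverage yields one read per 255 base pairs. The function names are illustrative only.

```python
def mean_read_spacing(read_length: int, coverage: float) -> float:
    """Average number of reference bases per aligned read."""
    if coverage <= 0:
        raise ValueError("coverage must be positive")
    return read_length / coverage


def expected_reads_in_region(region_length: int, read_length: int,
                             coverage: float) -> float:
    """Expected number of reads that start inside a region of given length."""
    return region_length * coverage / read_length


READ_LENGTH = 51  # assumed read length in bases, not taken from the text

# Spacing between reads at 0.2x coverage (about 255 bases).
print(mean_read_spacing(READ_LENGTH, 0.2))

# Expected reads in a 50 kb region (the aCGH resolution) at 0.2x coverage.
print(expected_reads_in_region(50_000, READ_LENGTH, 0.2))
```

Under these assumptions a 50 kb region still collects on the order of two hundred reads, which is why read-depth analysis of regions at that scale remains feasible at less than 1x coverage.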
To achieve this resolution, a suitable CNV detection tool that meets our requirements should be selected.

2 RESULTS

2.1 Approach

A schematic overview of the CNV detection pipelines is displayed in figure 1 and is discussed in this section. In short, the NGS pipeline (figure 1A) consists of the alignment of single-end reads to a reference genome and a pre-processing step that reduces read-depth biases resulting from repetitive regions (the effects of read-towers are reduced by applying the REad-Tower RemOval filter, RETRO) and from PCR duplicates. After these biases have been reduced, the mapped data is analyzed by a CNV detection tool. The output is further processed by a rule-filter to reduce the number of false positives in the data.

Array CGH (figure 1B) is a microarray technology that makes use of a control and a test sample, each labeled with a different fluorescent dye. After hybridization of the sample DNA to the probes on the microarray, differences in color brightness indicate which sample has more DNA attached, and copy number changes can be determined from this information. Aberrated regions are detected by a tool that segments and merges similar log2 ratios. We applied Nexus Copy Number [Darvishi, 2010].

To replace the current aCGH practice by NGS methods, it is important to establish the minimal coverage at which reliable results are generated. To determine this minimal coverage one can dilute a high coverage data set (>100x) and find the dilution at which regions found in the high coverage data set are no longer detected. Alternatively, one can compare regions detected using low coverage data to the aCGH results.
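The dilution idea can be sketched as random subsampling of reads. This is only an illustration under the assumption that every read is kept independently with probability target/source coverage; the text does not describe how the dilutions were actually produced, and `dilute_reads` is a hypothetical helper.

```python
import random


def dilute_reads(reads, source_coverage, target_coverage, seed=0):
    """Randomly subsample reads to simulate a lower sequencing coverage.

    Each read is kept independently with probability
    target_coverage / source_coverage (an assumed dilution scheme).
    """
    if not 0 < target_coverage <= source_coverage:
        raise ValueError("target coverage must be in (0, source coverage]")
    keep_fraction = target_coverage / source_coverage
    rng = random.Random(seed)  # fixed seed keeps the dilution reproducible
    return [read for read in reads if rng.random() < keep_fraction]


# Toy example: reads are represented by their mapped start positions.
reads = list(range(0, 1_000_000, 10))
diluted = dilute_reads(reads, source_coverage=0.4, target_coverage=0.2)
print(len(reads), len(diluted))
```

Running the CNV detection on a series of such dilutions and recording when known regions disappear gives an estimate of the minimal usable coverage.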
Non-tumor human samples of over 100x coverage sequenced on an Illumina HiSeq2000 were not available online. We restricted ourselves to the Illumina HiSeq2000 because sequencing platforms influence sequencing results through different biases and error rates [Lam et al., 2012], and we do not know to what extent data from another platform would influence our CNV detection. Sequencing a sample at 100x coverage was too expensive, which left us with the alternative low coverage approach, in which the aCGH output is defined as the standard of truth.

The available NGS sample data is processed by four different CNV detection tools. These tools all use different algorithms and can be divided into two groups: read-depth based and control-based. The read-depth based tools look at the positioning of the reads after mapping and distinguish insertions and deletions in the sample by a deviating number of reads aligned in a specific region. The control-based tools also apply this read-depth method, but in addition look at a control sample that is sequenced in the same manner as the test sample. As it is assumed that some of the biases in the test sample also occur in the control sample, the control sample is used to reduce the effects of these biases, mainly by taking the ratio between the test and control sample reads.

To choose an optimal NGS tool for our purpose, scoring methods were developed to compare the output of the aCGH and NGS methods. Four different NGS CNV detection tools were evaluated; two are read-depth based and two are control-based. For the control-based tools we experimented with several control samples at different coverages as input. To reliably compare aCGH output to NGS output, the control sample used in the aCGH experiments was also used here.
This control sample was sequenced in three different runs, so it can also be used to see whether there are noticeable differences in the results when the coverage of the control sample changes (see figure 8 for details). An external control sample was also used as input: a sample pool of DNA obtained from pregnant women (in which about 5% of the reads are of fetal origin [Lo et al., 1997]), to determine whether it is possible to use an external (non-commercial) control sample.

After extensively comparing NGS and aCGH outcomes, we came to the conclusion that the aCGH output could not be considered the standard of truth. The aCGH method did not find all CNVs, and some of the detected aCGH regions turned out to be wrongly detected after close inspection. Therefore, we manually annotated both the aCGH and NGS outcomes and calculated statistics to determine distinguishing features of the detected regions that are considered true CNVs. Using these features a rule-filter was created, which aims to reduce the number of false positives generated by the CNV detection tool. In the case of large aberrations, the NGS tool may return detected regions that overlap. These regions are merged after application of the rule-filter to make sure that the number of detected regions resulting from the tool is correct.

There are multiple factors that can influence how a sample is sequenced, for example temperature and the chemical composition of the sample that is to be sequenced. Small differences in these factors can introduce a bias in the read-depth, but as the exact extent of these influences is not known, we needed to determine the importance of sequencing the control sample on the same lane as the sample. This is done by comparing results between multiplexed and stand-alone test and control samples.

[Figure 1. Panel A, Next Generation Sequencing Pipeline: Read data, Reference data, Alignment (BWA), Pre-processing, CNV detect (R1, R2, R3, R1,2,3), Rule-filtering, Validation (manual reannotation calls, aCGH calls). Panel B, Array CGH Pipeline: Labeling Sample, Labeling Reference, Hybridization on Microarray, Scanning Microarray, Analysis by Feature Extraction, Calls generated by Nexus.]

Fig. 1. Next generation sequencing and aCGH pipelines for the detection of CNVs. The steps are explained in more detail in the methods and results sections. In figure A, orange indicates a list of choices that is evaluated. For example, four CNV detection tools are tested and an optimal tool and optimal settings are chosen based on ROC-analysis; the alignment of the reads is optimized to allow for SNPs. A manual annotation of all the CNVs generated by aCGH and the NGS pipeline is performed and the results are used to develop a rule-filter to reduce false positives. In figure B, the general overview of the aCGH technology is displayed.

2.2 Comparing aCGH and NGS results

We would like the CNV regions detected from NGS data to resemble the regions detected by aCGH as closely as possible. Therefore we define the following situations. True positives (TP) are defined as the number of CNV regions found by aCGH that are also found by the NGS CNV detection tools (an overlap of the detected regions is observed). If multiple NGS regions overlap one aCGH region, this is counted as one TP. False positives (FP) are detected regions that are found by the NGS tool, but not by aCGH. False negatives (FN) are aCGH regions not found by the NGS tool.
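The TP, FP and FN definitions above can be sketched as a small overlap-counting routine. This is a minimal illustration, not the scoring code used in the study: the `Region` type, the inclusive coordinates and the rule that any overlap of at least one base counts are assumptions.

```python
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int  # inclusive
    end: int    # inclusive


def overlaps(a: Region, b: Region) -> bool:
    """True if two regions share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def score_calls(acgh_calls, ngs_calls):
    """Count TP, FP and FN with the aCGH calls as reference.

    TP: aCGH regions overlapped by at least one NGS region (several NGS
        regions on the same aCGH region still count as one TP).
    FP: NGS regions that overlap no aCGH region.
    FN: aCGH regions that no NGS region overlaps.
    """
    tp = sum(1 for a in acgh_calls if any(overlaps(a, n) for n in ngs_calls))
    fp = sum(1 for n in ngs_calls
             if not any(overlaps(n, a) for a in acgh_calls))
    return {"TP": tp, "FP": fp, "FN": len(acgh_calls) - tp}


acgh = [Region("chr1", 100, 200), Region("chr2", 500, 900)]
ngs = [Region("chr1", 150, 160), Region("chr1", 180, 250),
       Region("chr3", 10, 20)]
print(score_calls(acgh, ngs))  # {'TP': 1, 'FP': 1, 'FN': 1}
```

Note that TP and FN are counted over aCGH regions while FP is counted over NGS regions, so TP + FP does not have to equal the number of NGS calls when several NGS regions hit one aCGH region.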
The first and second criteria for selection (listed in section 2.3) are slightly correlated: the more regions the NGS tool detects (TP and FP combined), the higher the chance that the detected regions overlap randomly with the aCGH data (TP).

2.3 DWAC-seq is the best CNV detection tool for analyzing the NGS data

Two types of tools are evaluated in this section: read-depth based (CNVnator [Abyzov et al., 2011], RDXplorer [Yoon et al., 2009]) and control-based (DWAC-seq [Koval and Guryev, 2011], CNV-seq [Xie and Tammi, 2009]). We compare these four CNV detection tools for the NGS data using the following criteria:

1. Calls as reported by the aCGH analysis should be found using the NGS data.
2. The number of false positives and false negatives should be low, as a large number indicates that the tool's performance is not good enough.
3. Run time should be no more than a day.

These four CNV detection tools are applied to four samples (see Methods for details) and table 1 shows an overview of the outcomes for each of these four samples. The first striking observation is that the number of total calls (see footnote 1) differs quite a lot between the tools (from about 50 to even ). A high number of false positives is, however, not trustworthy: these calls likely do not indicate true CNV regions (see footnote 2). For that reason DWAC-seq and RDXplorer are the best candidates, as they generate the lowest numbers of calls. Of these two, DWAC-seq has the highest number of true positives (48% over 25%) and is, therefore, selected as the best CNV detection tool.

The second observation is that samples 1 and 2 behave differently from samples 3 and 4. The number of total calls for samples 3 and 4 is a lot higher (twice as high or more). This is due to the difference in coverage: samples 1 and 2 have a coverage of 0.4 and 0.5 fold respectively, while samples 3 and 4 have a much higher coverage of 1.4 and 3.6 fold respectively.

The detected CNV regions within the aCGH data have been subjected to clinical analysis.
Some of these regions have been indicated as being relevant for diagnosis; they are classified into the following types: Type II, probably/possibly benign; Type III, probably/possibly benign and Type IV, pathogenic. These regions should thus not be missed when using the NGS data for CNV detection. The calls overlapping with these regions can be found between parentheses in the columns of table 1. As can be seen, CNVnator performs best: it finds all of these regions. DWAC-seq performs second best, missing only two regions. As the number of false positives is very high for CNVnator, we still choose DWAC-seq (see supplement section 6.9 for more details about the two missed clinical calls).

2.4 Finding the best parameter settings for DWAC-seq

DWAC-seq is optimized by considering two parameters of the tool that can be varied: the threshold, T, that the method uses for deciding whether the signal is strong enough to make a CNV call, and the size of the window, W, which is defined in number of reads (instead of number of bases) and drives the minimal size of the regions that can be detected (see Methods for details). We varied the threshold between 0.1 and 0.5 and the window size between 100 and 5000 reads, and for each setting we compared the results with the regions detected according to the aCGH data. The resulting ROC-curves and the accompanying analysis can be found in the Supplement (6.6). These ROC-curves show that the results are not very sensitive to the setting of parameter T, whereas varying the window size showed large variability over all samples. Based on these observations, we set T equal to 0.25 and chose the smallest window size for which the number of true positives is nearly maximal while the number of false positives is still relatively low (W = 100 reads per window). The window size is dependent on the coverage of the sample, but we also observed that the window size heavily depends on the coverage of the control (table 5). This is because DWAC-seq uses the control sample for determining the location of the windows (see supplement section 6.5.3).

Footnote 1: To indicate a detected CNV region the term call is used.

Footnote 2: As the real CNV regions are not known, care should be taken with drawing the conclusion that false positive calls are not real CNVs. However, as the aCGH analysis and also two of the four CNV detection tools indicate a low number of CNV regions, it is likely that those regions are indeed not true CNV regions.

Table 1. Number of true positives (TP), total number of calls (TP + FP) and the run time for the four selected NGS CNV detection tools. The control sample used by DWAC-seq and CNV-seq is A R1 (supplement S2). The test samples are aligned to the reference genome using the strict setting (see section 5.2). The first columns indicate the total number of calls made when using the aCGH data (CGH calls) and the number of regions that were found to be relevant for diagnosis after clinical analysis (Clinical calls). The results of CNVnator are calculated for a window size of 1 and 10 kb respectively. Between parentheses is the number of calls that overlap with the clinically relevant calls as found by using the aCGH data. For CNVnator and RDXplorer only calls larger than 1 kb are kept. CNV-seq calculates an optimal window size for each sample. The (t) column states the total number of call regions that are found and the (c) column states the number of calls after combining regions. The run time is an approximation of the running time of the algorithm on data with a coverage of approximately 1x.

Columns: Sample; CGH calls; Clinical calls; CNVnator (1 kb, 10 kb); RDXplorer (100 bp); DWAC-seq (100 bp); CNV-seq ((t), (c))
TP:
  (1) 17 (1) 1 (0) 17 (1) 15 (1)
  (1) 27 (1) 3 (0) 13 (1) 17 (0)
  (4) 30 (4) 8 (1) 25 (2) 26 (2)
  (5) 20 (4) 9 (0) 18 (5) 19 (4)
Total calls:
Run time (1x): ± 2 hours; ± 4 hours; ± 10 hours; ± 30 min
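Because W counts reads in the control sample rather than bases, the genomic span of one window shrinks as the control coverage grows. The sketch below makes that relation explicit. It is an approximation for evenly spread reads, the read length of 51 bases is an assumption, and the helper name is hypothetical; it is not DWAC-seq's own windowing code.

```python
def window_span_bases(window_reads: int, read_length: int,
                      control_coverage: float) -> float:
    """Approximate genomic span (in bases) of a window of `window_reads`
    control reads, assuming reads are spread evenly over the genome."""
    if control_coverage <= 0:
        raise ValueError("coverage must be positive")
    return window_reads * read_length / control_coverage


READ_LENGTH = 51  # assumed read length in bases, not taken from the text
W = 100           # window size in reads, as chosen in section 2.4

for coverage in (1.3, 4.5):
    span = window_span_bases(W, READ_LENGTH, coverage)
    print(f"control coverage {coverage}x: about {span:.0f} bases per window")
```

Under these assumptions the same W = 100 setting corresponds to windows of roughly 3.9 kb at 1.3x and roughly 1.1 kb at 4.5x, so a fixed W effectively becomes a finer and noisier setting as control coverage increases.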
As can be seen in table 5, when the coverage of the control sample increases from 1.3 to 4.5 (3.5x higher coverage), the number of calls rises considerably (from 69 to 211 for sample 3) when the window size is kept constant. Note that the window size is defined by the number of reads, which means that when the coverage increases, a window of a fixed number of reads becomes smaller in terms of bases. This implies that smaller regions can be detected, which in turn increases the chance of false positives. Hence, the window size needs to be set depending on the coverage of the control sample (i.e. a larger window size when the coverage is higher).

Previous studies revealed that an average of 12 copy number variants per individual are present [Feuk et al., 2006]. We have used this to reason that when a method finds much higher numbers, many of the calls are likely to be false. In the aCGH results, the average number of CNVs found is 28, which complies with the findings of Feuk et al. As the resolution of NGS is higher than that of aCGH, we expect to find more aberrated regions. Additionally, DWAC-seq tends to separate large aberrated regions into smaller ones, which also generates more detected regions. With the chosen settings of DWAC-seq, we find around 50 aberrated regions, which we think is in good agreement with what we would expect reasoning from the findings of Feuk et al.

2.5 The composition of the control sample has a large influence on the found CNVs

To reliably compare the aCGH and NGS results, the control sample that is used in the aCGH analysis is also used as the control sample for DWAC-seq. In general, the higher the coverage of the control sample, the more accurate the results produced by DWAC-seq. Control sample A (see table ??), which is used for both pipelines, is a commercial product by Kreatech Diagnostics [Kre, 2010]. The control sample is genomic DNA isolated from whole blood samples of 100 anonymous female donors.
To reliably perform CNV detection, the control sample should be diverse enough to exclude CNVs that are particular to the heritage of just one person. As DWAC-seq uses the control sample to detect regions that have an aberrated copy number, variations can be found when the test sample varies, but also when the control sample varies. An example is depicted in figure 2. An explanation for the decrease in the read-depth of the control sample can be that there is a deletion (with respect to the reference genome) located at this position in the control samples. Note that this deletion should be consistent over the 100 persons from which the control sample is created, indicating a population bias in the control sample. Hence, when the test sample is not from the same population we will make false calls, as is the case for this example. Ideally for these experiments, the control should consist of DNA from a large number of individuals from all sorts of backgrounds. Creating such a control sample is, however, very expensive. Therefore, we conclude that it is essential to visualize the read-depths of both the test sample and the control sample for every call made by the CNV detection tools.

[Figure 2: panels from top to bottom are log2 ratio, read-depth of the test sample, read-depth of the control sample, and DWAC-seq output signal.]

Fig. 2. A CNV (false positive) call made by DWAC-seq. This CNV is located on chromosome 1. Shown are the log2 ratio (top panel); the read-depth of the test sample (2nd panel from the top); the read-depth of the control sample (3rd panel from the top) and the DWAC-seq output signal (bottom panel). The DWAC-seq output signal is the ratio test/ref corrected by the median for each segment (which is the region between two breakpoints). The decrease in read-depth in the control sample indicates that all control subjects appear to have a deletion at this position with respect to the reference genome, indicating a population bias. From the log2 ratio (top panel) one cannot discern the origin of the CNV that is displayed.

2.6 The strictness of the alignment of the reads does not have a large influence on the found CNVs

Most experiments are performed using both the strict (no mismatches allowed) and the lenient (one mismatch allowed) alignment method (see Methods for details). Both alignment methods are compared to see which one obtains an optimal result. The run times of the two alignments do not differ much. Table 2 gives an overview of the CNV detection results when using both alignment methods. For one of the alignment methods to be superior to the other, its number of unique false negatives should be low and its number of unique true positives high. Unique in this context means that the call is not found by the other alignment method. One of the low and one of the high coverage samples (samples 2 and 4) display a better performance for the lenient alignment: the number of unique false negatives is lower than for the strict approach and the number of unique true positives is higher or the same. The other two samples (samples 1 and 3), however, show the opposite result. Based on this knowledge and on the fact that a lenient approach allows for SNPs, the lenient setting is chosen as optimal.

It is interesting to see what causes the difference between the two alignment methods. A large number of the calls that are classified as unique lie in a region in which the test sample has very low coverage; lie in a region where the control sample contains the expected number of reads but the test sample contains no reads; or lie in a region bordering another CNV. Calls that have a very low coverage as compared to the average in both the test and the control sample are considered not reliable. As most unique calls can be considered unreliable due to low coverage, this is probably the reason why some calls are found using one setting and not the other. The mediocre performance of the lenient alignment setting for sample 3 may be explained by the large number of total calls that are made, but additional factors (for example, the window size being less ideal for this sample) may also play a role.

2.7 Detailed comparison of aCGH and NGS results

After the optimization of DWAC-seq, we can compare the results for the aCGH and NGS data in more detail. We do so by plotting the read-depth of both the test and the control sample and the log2 ratio for the detected regions. The read-depth is calculated by dividing the mapped reads into bins of 1000 bases. The log2 ratio is calculated as $\log_2(RD_{test} / RD_{ref})$. Figure 3 shows an example of such a plot. It shows a region detected by DWAC-seq (bottom panel) that is caused by a deletion in the test sample (2nd panel from the top), which also results in a negative log2 ratio (top panel). This is an example of a TP, as the region is detected based on both the aCGH data (orange lines) and the NGS data (red lines).

Close inspection of a number of these plots, however, revealed that: 1) some FNs are not true CNVs, i.e. the NGS data clearly does not indicate an aberration and the aCGH data seems to be misanalyzed (figure 6 shows an example); 2) not all FP calls are false, i.e. they clearly show an aberration in the NGS data that is not picked up on the basis of the aCGH data (figure 5 shows such an example); and 3) not all TP calls are true CNVs, i.e. the aCGH data is misanalyzed (figure 4 shows such an example).

[Figure 3: panels include Normalized RD Sample and Normalized RD Control, with the median indicated.]

Fig. 3. A true positive found in sample 1 using the optimized DWAC-seq settings. For an explanation of the figure see figure 2.
The red vertical lines indicate the start and stop positions of the region detected based on the NGS data; the yellow vertical lines indicate the start and stop positions of the region as detected based on the aCGH data. The figure clearly displays a deletion in the test sample. Comparing the read-depths of the test sample and the control sample explains the resulting log2 ratio.
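The read-depth binning and the log2 ratio described in section 2.7 can be sketched as follows. The bin size of 1000 bases comes from the text; the normalization by total read count is an assumption (the plots are labeled "Normalized RD", but the exact normalization is not given here), and returning `None` for bins where either sample is empty is an illustrative choice.

```python
import math
from collections import Counter

BIN_SIZE = 1000  # bases per bin, as stated in the text


def bin_read_depth(read_starts, bin_size=BIN_SIZE):
    """Count reads per bin from their mapped start positions (0-based)."""
    return Counter(pos // bin_size for pos in read_starts)


def normalize(depth):
    """Scale bin counts by the total number of reads (assumed scheme)."""
    total = sum(depth.values())
    if total == 0:
        raise ValueError("sample contains no reads")
    return {bin_index: count / total for bin_index, count in depth.items()}


def log2_ratio(test_depth, ref_depth, n_bins):
    """log2(RD_test / RD_ref) per bin; None where either bin is empty."""
    ratios = []
    for bin_index in range(n_bins):
        test = test_depth.get(bin_index, 0)
        ref = ref_depth.get(bin_index, 0)
        ratios.append(math.log2(test / ref) if test > 0 and ref > 0 else None)
    return ratios


test_reads = [10, 900, 1500, 2100, 2200, 2900]
control_reads = [50, 700, 1100, 1800, 2500, 2600]
test = normalize(bin_read_depth(test_reads))
control = normalize(bin_read_depth(control_reads))
print(log2_ratio(test, control, n_bins=4))
```

The ratio alone cannot tell whether a negative value stems from a loss in the test sample or a gain in the control, and it is undefined for empty bins, which is why the text recommends inspecting both read-depth tracks next to it.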

Table 2. Results after comparison of the strict (no mismatch allowed) and lenient (1 mismatch per read allowed) alignment methods for the four samples. The alignment type is noted for each sample by a 0 or 1 for allowing 0 and 1 mismatches respectively. The number stated is the number of calls made for each class TP, FP and FN. The number between parentheses gives the number of unique calls for that sample and alignment type that are not called by the other alignment type. DWAC-seq is run with control sample A R1, which has a coverage of 1.2x for the strict alignment and 1.3x for the lenient setting.

            Sample 1            Sample 2            Sample 3            Sample 4
Alignment   0         1         0         1         0         1         0         1
Coverage    0.318x    0.431x    0.378x    0.510x    2.804x    3.634x    1.243x    1.374x
TP          20 (3)    23 (4)    16 (2)    17 (4)    44 (1)    34 (0)    18 (2)    21 (2)
FP          28 (14)   25 (12)   26 (15)   23 (13)   28 (9)    40 (19)   26 (4)    29 (7)
FN          10 (1)    12 (3)    14 (4)    11 (1)    10 (0)    11 (1)    7 (3)     6 (2)
Clinical    1/1       1/1       1/1       1/1       2/4       2/4       5/5       5/5
Total

[Figure 4: panels include Normalized RD Sample and Normalized RD Control, with the median indicated.]

Fig. 4. A true positive found in sample 2 using the optimal DWAC-seq settings. The log2 ratio is relatively variable, which is the result of the low read-depth of both the test and the control sample. The resulting call is, therefore, not very reliable, as both the test and the control show a similar decrease in read-depth, and no clear increase or decrease can be observed in the log2 ratio.

Fig. 5. An example of a call found by DWAC-seq that does not overlap with an aCGH call (sample 4). The plots show a deletion in the test sample around 65.2 Mb, which is indicated by vertical red lines. Even though this call does not overlap with the aCGH data, it is considered to be a true deletion, due to the clear decrease in test sample read-depth as compared to the control sample. The call has been missed in the aCGH data due to resolution.
From these inspections we conclude that using the aCGH calls as the standard of truth does not verify the results properly, because the higher resolution of the NGS data allows a better-motivated detection. This also hints at an overinterpretation of the aCGH data caused by its lack of resolution. An unreliable CNV can be recognized by a variable log2 ratio in the region. The high variability is generally caused by a low read-depth in both the test and the control sample, which makes the decision on whether the region is aberrant unreliable. In the extreme case, a number of detected regions contain a high number of empty bins (i.e. bins with no reads), again causing unreliable calls. Repetitive regions can also overlap with the bins, which makes the mapping of reads in such a region, and consequently the detection of aberrations, unreliable. It is important to note that calls made by aCGH in these regions are also unreliable, as the repetitiveness of the region also influences the hybridization of the DNA to the probes on the array. The visual inspections also revealed that looking only at the log2 ratio is not enough to reach an accurate conclusion about a call. The read-depths of the test and control samples provide important additional information. Taken together, the aCGH calls are no longer considered the standard of truth. Another approach is considered in section

[Plot panels: Normalized RD Sample; Normalized RD Control; median]

Fig. 6. An aCGH call, depicted by the vertical yellow lines, that is not found by DWAC-seq (sample 2). The plots show that a large number of bins in the plotted region are empty (indicating that no reads have been mapped to this area). The variance in the read-depth is high in both the test sample and the control sample, and the log2 ratio does not show a clear increase or decrease. Although the log2 ratio of the aCGH data indicates a deletion (see figure 7), the evidence in the aCGH data is not very convincing either (only a few probes, which also show a variable signal). We conclude that in this case the NGS analysis is more reliable than the aCGH data and that this region cannot be considered a CNV.

2.8 Filtering the calls generated by DWAC-seq on quality improves the reliability of the calls

As the aCGH calls are an inconsistent standard, a more reliable standard of truth was constructed by manually annotating all calls generated by both the aCGH and the NGS pipelines. Table 3 shows an overview of the detections based on the NGS data compared to this manually annotated set of regions. The annotation revealed two features that distinguish the positive from the negative set: bins that remain empty because no reads can be mapped in the region, and bins with a very low coverage compared to the average coverage of the chromosome. We used these features to develop a rule-filter that is applied as a post-processing step after running the DWAC-seq analysis (see Methods for details). The results of the manual annotation and the rule-filtering are displayed in table 3.
Using this rule-filter, good calls are distinguished from bad calls based on two considerations: 1) homozygous deletions result in complete coverage in the control sample and mostly empty bins in the test sample, and 2) good calls generally have a low percentage of empty bins, as coverage is needed to accurately define a call as a CNV. Hence, if 100% of the bins in a call are covered, the call is likely to be correct, and as this percentage drops, so does the reliability of the call.

Fig. 7. The aCGH data for the region discussed in figure 6 (top to bottom: log2 ratio (blue); test sample signal (green); control sample signal (red); and p-value of the log2 ratio (black)). The blue line (top panel) indicates the threshold for calling a gain, the pink line the threshold for a loss. There are clearly two probes with a log2 ratio below the threshold value. Comparing this figure with figure 6, we conclude that the region cannot be considered a true CNV, as the coverage is low (which could be an indication of repetitive sequences) and a large number of empty bins is observed around the region of interest.

Table 3 shows that the rule-filtering step eliminates, on average, 80% of the negative CNV set from the calls. On average 30% of the total number of calls is deleted, which reduces the number of false positives considerably. Some calls from the positive set are also deleted by the filtering, but this is limited to 6% on average. Making the rules stricter deletes more calls from the positive set, which can be an option if the total number of calls is still too large. After the rule-filtering, overlapping calls are merged. The merged set gives a more reliable overview of what is actually found with DWAC-seq. Merging reduces the total number of passed calls, because some calls overlap slightly at their beginning and end. We merge these regions because they represent one large CNV.
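The merging of overlapping calls can be illustrated with a standard interval merge. The following is a minimal sketch and not the code used in this study: the function name `merge_calls` and the `(chromosome, start, end)` tuple layout are assumptions made for illustration, and the sketch ignores whether a call is a gain or a loss.

```python
def merge_calls(calls):
    """Merge overlapping calls that lie on the same chromosome.

    calls: iterable of (chromosome, start, end) tuples (hypothetical layout).
    Returns a sorted list of merged (chromosome, start, end) tuples.
    """
    merged = []
    for chrom, start, end in sorted(calls):
        # A call overlaps the previous one if it starts before that one ends.
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


example_calls = [
    ("chr1", 100, 500),
    ("chr1", 450, 900),    # overlaps the end of the first call
    ("chr1", 2000, 2500),
    ("chr2", 100, 500),    # same coordinates, other chromosome: not merged
]
print(merge_calls(example_calls))
```

In this example the first two calls on chr1 collapse into one region from 100 to 900, while the remaining calls are left unchanged.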
After filtering and merging, the detected calls range in length from 17 kb to 33.7 Mb; these are the smallest and largest regions detected by the NGS pipeline in the four available test samples. The actual detection limits may lie outside this range: it may be possible to detect regions smaller than 17 kb and larger than 33.7 Mb, but this cannot be shown with the available data. The detection limit depends on the window size, which in turn depends on the coverage of the control sample (in this case around 4000 bases per window, assuming a uniform distribution of the reads). The lower detection limit therefore lies somewhere between 4 and 17 kb, as regions smaller than one window cannot be detected.
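The figure of around 4000 bases per window can be checked with a short calculation. The sketch below assumes uniformly distributed reads, a window of 100 control reads and a read length of 50 bases. The window size of 100 reads is an assumption inferred from the window sizes reported for the other control runs, and the function name is hypothetical.

```python
def window_size_bases(reads_per_window, read_length, coverage):
    """Expected window length in bases when reads are placed uniformly.

    With coverage c and read length L, a read starts on average every L / c
    bases, so a window holding W reads spans about W * L / c bases.
    """
    if coverage <= 0:
        raise ValueError("coverage must be positive")
    return reads_per_window * read_length / coverage


# Control sample A R1: about 1.2x (strict) and 1.3x (lenient) coverage.
for coverage in (1.2, 1.3):
    print(coverage, round(window_size_bases(100, 50, coverage)))
```

Both values lie close to 4000 bases, which is consistent with the lower bound of about 4 kb on the detectable region size.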

Table 3. Results of the rule-filter applied to the regions detected from the NGS data, compared to the manually annotated set of calls. The manual annotation split the combined calls from the NGS and aCGH data into three classes: a positive set (actual CNVs), a negative set (no CNVs, or too low coverage), and an undetermined set (hard to determine the class). The numbers in the columns Pos. (positive), Neg. (negative) and Und. (undetermined) are the detected regions assigned to the corresponding set. All calls generated by aCGH and NGS are assigned to a class. The Total row displays the total number of calls and how these calls are manually annotated. The Passed row shows how many calls of the three sets pass the filter for this sample, the Deleted row the calls that do not pass the filter. The total number of calls after rule-filtering and merging is given in the first column.

Sample                     Class     Count         Pos.   Neg.   Und.
Sample 1 (Merged: 30)      Total
                           Passed    43 (71.7%)
                           Deleted   17 (28.3%)
Sample 2 (Merged: 23)      Total
                           Passed    30 (58.8%)
                           Deleted   21 (41.2%)
Sample 3 (Merged: 43)      Total
                           Passed    71 (83.5%)
                           Deleted   14 (16.5%)
Sample 4 R1 (Merged: 40)   Total
                           Passed    43 (75.4%)
                           Deleted   14 (24.6%)

Determining the minimal coverage

To determine the minimal coverage at which the NGS data still allows reliable detection of aberrant regions, we set up a simulation in which the coverage is artificially reduced before applying the detection tools. This is done using samples 3 and 4, of which sample 4 is sequenced in two different runs (sample 4 R1 and sample 4 R2). The original coverages of 3.6x, 1.2x and 1.0x, respectively, are reduced by uniform sampling, in steps of two, down to a coverage of approximately 0.1x. The resulting calls are analyzed and the observations can be found in table 4. The lower coverages show an increase in the total number of calls (an increase in false positives).
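The uniform sampling used to reduce the coverage can be sketched as follows. This is an illustrative sketch under stated assumptions, not the script used for the experiments: each read is kept independently with probability target/original, the function name `downsample_reads` is hypothetical, and the `seed` parameter is only added so that repeated reductions can be reproduced.

```python
import random


def downsample_reads(reads, original_coverage, target_coverage, seed=None):
    """Keep each read with probability target_coverage / original_coverage.

    reads: list of reads in any representation; their order is preserved.
    """
    if not 0 < target_coverage <= original_coverage:
        raise ValueError("target coverage must be in (0, original coverage]")
    keep_probability = target_coverage / original_coverage
    rng = random.Random(seed)
    return [read for read in reads if rng.random() < keep_probability]


# Toy example: reduce a 3.63x sample to roughly 0.45x, twice, with
# different seeds to obtain two independent reductions.
reads = list(range(200000))
reduced_a = downsample_reads(reads, 3.63, 0.45, seed=1)
reduced_b = downsample_reads(reads, 3.63, 0.45, seed=2)
print(len(reduced_a) / len(reads), len(reduced_b) / len(reads))
```

Because every read has the same probability of being kept, this procedure scales the read-depth evenly along the genome, which is exactly the simplification that makes it only an approximation of sequencing at a lower depth.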
As the same control sample is used, the positioning of the windows and the window size are exactly the same for all runs. There does not seem to be a large decrease in the quality of the output (the number of false negatives) for a coverage between 0.1 and 0.35, but a large increase in false positives is observed. Note that a large part of the false positives appear on the X-chromosome, which could be due to the use of a female control sample while samples 3 and 4 are male. The clinically relevant CNVs (i.e. those annotated with Type II to IV) are found for all coverages displayed in table 4, with the exception of two calls in sample 3 that were also missed in the previous experiments (see the supplement for a description of the missed CNVs). The lowest coverage tested in the previous experiment is 0.43x after the mapping and pre-processing steps. A lower coverage could also be possible: the reduction experiments show that a reasonably low number of false positives is still encountered at lower coverage. There seems to be a turning point for sample 3, sample 4 R1 and sample 4 R2 between , and respectively. This turning point is determined by looking at the number of calls after merging, which should stay below 50. From this reduction experiment we conclude that the lowest coverage that can be used for the detection of CNVs with reduced NGS data lies at 0.23x.

The coverage for the control sample should be close to the coverage of the test sample

Sample 4 and the control sample are sequenced in multiple runs (see figure 8 for more details), which allows the results of two independent runs to be compared. Table 5, first row, shows the results; the performance for both runs of sample 4 is more or less the same. Even when the data of the two runs are combined, the performance stays the same.
We also analyzed the runs using all measurements of the combined control sample, A R1,2,3 (which increases the control sample coverage to 4.5x). The result is shown in the second row of table 5, which shows that the number of false positives increases strongly. This is probably because the window size is no longer appropriate: the window size is expressed as the number of reads aligned per window, and the coverage of the control sample increased drastically. We therefore tried two alternative settings for the window size, 350 and 700. The number of false positives does drop, but at the same time the number of true positives also drops, so that the performance is eventually not improved by using the combined data for the control sample. The number of merged calls (last column in table 5) also shows that combining data for the control sample does not help: the number of merged calls obtained when the control sample has the same coverage as the test sample is closest to the expected number of calls (somewhere between 30 and 50, see section 2.4).

Multiplexing test and control sample

As multiple factors can influence the sequencing process, we expect the best results when the test sample and the control sample are sequenced in the same lane (multiplexing, for details see figure 8). The settings in which the corresponding test and control sample are multiplexed are colored red in table 6. We expect that multiplexing reduces biases caused by differences in sequencing conditions, such as differences in temperature or in the chemical composition of the genetic material. Other bias can result from GC-content: GC-rich regions are known to influence the number of reads that are sequenced in a region [Benjamini and Speed, 2012].
In table 6, sample 4 R2 shows the exact opposite of what we expect: a higher number of false negatives is observed than when using a control sample from another sequencing run (so no multiplexing). Sample 4 R1 does show the expected results: a low number of false negatives and a relatively low number of false positives when the control sample is multiplexed with the test sample. As stated before, the numbers of TP, FP and FN are based on the aCGH analysis and are not very reliable. Looking at the results of the rule-filtering, for both samples the least data is deleted when using control sample A R2, and the lowest number of calls passes. After merging, the number of calls for sample 4 R2 in combination with control sample A R2 is very low, even lower than the number of aCGH calls. We conclude that the advantages of multiplexing test and control samples cannot be directly derived from table 6. It seems that a non-multiplexed control sample can perform as well as a multiplexed sample. As the extent of the bias resulting from sequencing in different runs is not yet known in detail, more data should be generated to determine whether it is necessary to multiplex the test and reference samples.

Table 4. Detection results for samples 3 and 4 when the coverage is artificially reduced by uniform sampling. A R1 is used as control sample (coverage of 1.3x). "X:" gives the number of false positives called on the X-chromosome. The total number of calls made by aCGH can be calculated by adding the numbers of TP and FN. The sum of TP, FP and FN differs from the sum of the passed and filtered calls, because a TP is counted only once, even if more than one NGS call overlaps an aCGH call. Samples 1 and 2 are added for reference purposes.

Sample 3        Coverage   TP   FP   FN   Passed   Filtered   Merged
No reduction    3.63x      (X: 8)
Reduction 1a    0.45x      (X: 7)
Reduction 1b    0.45x      (X: 9)
Reduction 2a    0.23x      (X: 11)
Reduction 2b    0.23x      (X: 14)
Reduction 3a    0.11x      (X: 45)
Reduction 3b    0.11x      (X: 45)

Sample 4 R1     Coverage   TP   FP   FN   Passed   Filtered   Merged
No reduction    1.24x      (X: 7)
Reduction 1a    0.35x      (X: 11)
Reduction 1b    0.35x      (X: 13)
Reduction 2a    0.17x      (X: 16)
Reduction 2b    0.17x      (X: 18)
Reduction 3a    0.09x      (X: 39)
Reduction 3b    0.09x      (X: 42)

Sample 4 R2     Coverage   TP   FP   FN   Passed   Filtered   Merged
No reduction    1.05x      (X: 7)
Reduction 1a    0.26x      (X: 22)
Reduction 1b    0.26x      (X: 28)
Reduction 2a    0.13x      (X: 32)
Reduction 2b    0.13x      (X: 26)
Reduction 3a    0.07x      (X: 53)
Reduction 3b    0.07x      (X: 42)

Sample x (X: 1)
Sample x (X: 0)

Table 5. The effect of a change in coverage of the test and control sample. As sample 4 is sequenced in two runs and the control sample in three runs (see figure 8 for details), the data can be used separately and merged to see the effect of the changing coverage on the results. Combinations are made using the two runs of sample 4, control sample A R1 and control sample A R1,2,3 (see section S2). For control sample A R1,2,3 three different window sizes are used.

Control                Sample          TP   FP   FN   Passed     Filtered      Merged
A R1                   Sample 4 R1                    (75.4%)    14 (24.6%)    31
                       Sample 4 R2                    (68.5%)    17 (31.5%)    28
                       Sample 4 R1,2                  (73.8%)    16 (26.2%)    34
A R1,2,3 (W = 100)     Sample 4 R1                    (75.1%)    70 (24.9%)    161
                       Sample 4 R2                    (75.2%)    79 (24.8%)    171
                       Sample 4 R1,2                  (76.6%)    54 (23.4%)    131
A R1,2,3 (W = 350)     Sample 4 R1                    (67.0%)    28 (33.0%)    47
                       Sample 4 R2                    (59.3%)    35 (40.7%)    36
                       Sample 4 R1,2                  (65.4%)    27 (34.6%)    40
A R1,2,3 (W = 700)     Sample 4 R1                    (64.8%)    19 (35.2%)    27
                       Sample 4 R2                    (56.9%)    22 (43.1%)    19
                       Sample 4 R1,2                  (62.5%)    18 (37.5%)    22

Table 6. True positives, false positives, false negatives, and calls that passed (P) and failed (F) the rule-filtering for all combinations of the two runs of sample 4 (sample 4 R1 and sample 4 R2) and the three runs of the control sample (A R1, A R2 and A R3), as well as all three runs taken together (A R1,2,3 using W = 350). The merged (M) column displays the calls that passed the filter after the merging step. Run 3 of the control sample has a coverage that is about twice as high as runs 1 and 2, so the window size was set at W = 200. The rows with the text colored red are the settings in which sample 4 was multiplexed with the control sample.

                  Sample 4 R1                   Sample 4 R2
Control sample    TP   FP   FN   P   F   M      TP   FP   FN   P   F   M
A R1
A R2
A R3
A R1,R2,R3

3 DISCUSSION

3.1 The control group should be genetically diverse

The control sample used in the analyses is of commercial origin and is suitable because it contains a pool of 100 test subjects, which reduces the influence of individual CNVs that may occur in one of the subjects. As seen in section 2.5, the control sample contains inconsistencies, but these can be revealed by visualizing the read-depth of both the test sample and the control sample along with the log2 ratio. However, it would be better to construct a new control sample. As the heritage of the test subjects in the control group is not known, the subjects chosen may not have been diverse enough, resulting in a population bias. A newly constructed control sample should include a larger variety of people.

3.2 Alignment

Allowing mismatches when aligning reads to the reference genome results in less data loss than aligning in a stricter fashion, and allows for SNPs to occur. The analysis itself showed only a small preference for this setting; more samples are needed to make this conclusion more robust.

3.3 Read-tower removal

After aligning the reads to the reference genome, one of the filtering steps was to reduce the influence of extremely large stacks of reads (mostly near centromere and telomere regions), which we refer to as read-towers.
We created a filter that aims to delete these towers by using the fact that the reads in a tower largely overlap (details can be found in Methods section 5.3). These towers probably result from the way the reference genome is constructed. Repetitive regions are avoided in the reference genome, which is why the centromere and telomere regions are not included. When one of the repetitive regions located in the centromeres and telomeres is nevertheless included in the reference genome, a large number of reads sequenced from all over the centromeres and telomeres align to it, because they have only one location to map to.

3.4 Limitations of aCGH and NGS analysis

Both aCGH and NGS detection of CNV regions have some limitations [Holstege, 2010]. A duplication can occur in tandem with the original, but can also be located on another chromosome or at another position, and this distinction could be of interest. Neither aCGH nor the NGS pipeline distinguishes between these types of duplication, as both map to a reference genome. The NGS read data can be used to determine the orientation of the duplication, but if the orientation is the same as that of the original, it offers no more information than the aCGH analysis. An aberration that does not result in a copy number change will not be detected by either method. Examples are a translocation, where a region is transferred to another chromosome or another location, and an inversion, where the region is inverted. As both methods look at short DNA sequences, they cannot determine the origin of a sequence. As DWAC-seq determines a CNV region by calculating breakpoints (changes) in coverage using the statistics of the whole chromosome, the detection of large chromosomal aberrations is limited (detection of whole-chromosome aberrations is not possible).
If such detection is diagnostically relevant, other algorithms should be applied, for example WISECONDOR [Straver et al., 2012], which detects large (whole-chromosome or subchromosomal) copy number changes.

3.5 Reliability of the proposed rule-filtering

Four samples were used to generate rules to reduce the number of false positives. The rule-filtering has not been tested on an independent test sample, and it was probably also constructed with too little data to reliably state that the rules work optimally on other test samples. More samples are needed to make these rules suitable for all patients.

3.6 Reducing coverage experiments

Simulating a reduced coverage by uniformly sampling reads does not take into account all effects that occur when sequencing a sample. The read-towers are a good example of why uniform sampling is probably incorrect, as the height of a tower has no correlation with the sequencing depth.

3.7 Effects of changing the coverage of test and control sample

It is not easy to determine an optimal window size that takes into account the coverage of both the test sample and the control sample. DWAC-seq sets its windows by counting reads in the control sample, fixing the window, and then counting the reads in the same window in the test sample. When the coverage of the control sample is increased, multiplying the window size by the factor of the increase in coverage is not enough. An explanation could be that other parameters also influence the best settings; for example, the threshold may not be optimal in this specific case. The difference in output can also be caused by the increasing difference in coverage between the test sample and the control sample. The normalization is not good enough to point out differences between the high-coverage control sample and the low-coverage test sample.

3.8 Recognizing clinical calls

To determine whether a detected aCGH region is relevant, an analyst investigates whether the CNV overlaps (part of) a gene or another biologically relevant genomic region. In such cases, multiple literature databases are consulted to see whether publications are available that link the gene to the symptoms of the patient. Hence, the clinical relevance of a call cannot be stated directly upon detection of a region, as this relevance is not captured in the signal. Some of the regions detected by aCGH can be dismissed as noise, but there is no well-defined protocol for the exact definition of this noise; it largely depends on the experience of the analyst. This makes it hard to determine what makes a detected region good, as it involves interpretation by the analyst. It would be interesting to investigate the differences in reasoning between clinicians. Clearly, this also influences the evaluation of NGS data for making CNV calls. We advise setting up a larger screen that includes multiple independent clinical evaluations of the detected regions to come to a well-grounded conclusion.

3.9 Recommendations for the clinic

For application of the proposed method in the clinic, some factors need to be considered and precautions need to be taken. The first hypothesis, that the control sample should be sequenced multiplexed with the test sample, which could eliminate temperature and GC-content biases, still holds. However, it is expensive to sequence a (relatively) high-coverage control sample with each test sample. In this paper we have shown that it is possible to use a previously sequenced control sample for the analysis.
However, before applying this in the clinic, the differences between multiplexing the control sample and using a pre-sequenced standard control sample should be studied further. When a pre-sequenced standard pool is chosen, the extent of the variance that occurs during sequencing should be described and kept in mind. In the experiments described, a female control sample is used for test samples of both male and female patients. For future applications we advise using a male control for male patients, because the analysis of the X- and Y-chromosomes will then be more reliable. In the male samples used in this study, relatively more aberrations were found on the X-chromosome. The number of false positives on the X-chromosome for male patients is expected to decrease considerably when a male control sample is used. To see accurately whether a CNV originates from the test or the control sample, the read-depth data as well as the log2 ratio have to be used. Using only the log2 ratio can create the illusion that a CNV occurs in the test sample while it is actually situated in the control sample. Another factor that can be taken into account using all three plots is the read-depth in the CNV region. If this is very low, the CNV is not reliable. This often results in a very variable log2 ratio. Most of these cases will be removed by the rule-filtering. The parameters of DWAC-seq are optimized using the data that was available, which results in a bias towards the four samples that were used. To reduce the effects of this bias, the best parameters should be determined again once a standard coverage for the test and control sample has been set for use in the clinic. The proposed rule-filtering was based on only four test samples, again resulting in a bias towards these specific samples. More samples should be analyzed to see whether the rules are correct and whether they can be fine-tuned.
When introducing the technique in the clinic, both the aCGH and NGS analyses should run in parallel for some period of time. This will generate a lot of data that can be used to increase the knowledge of the exact behavior of the NGS method. Using this data, a training and test set can be made and the rule-filter can be validated and fine-tuned. For the detection of whole-chromosome aberrations, additional detection strategies should be included in the NGS pipeline. A quality control system should be developed to check the output of the NGS pipeline. As shown in the results section, the optimal window size is not very stable and depends on the coverage of both the test and the control sample. To make sure the right parameters are used, the total number of detected regions should be low, so that a low number of false positives can be expected.

4 CONCLUSION

We have shown that shallow NGS data can be used to replace aCGH data for clinical diagnostics. We proposed a pipeline that uses DWAC-seq to detect copy number aberrations, and we proposed a rule-filter that removes unreliable calls based on the number of reads mapped to these areas. The study is a first proof-of-principle of adopting NGS for diagnostic purposes. Before application in the clinic many more samples need to be analyzed, especially to reveal the right settings (parameters and coverage) but also to optimize the rule-filtering. An important conclusion from this work is also that, even for aCGH data, the visualization of the signals of the test and control samples is extremely informative about the underlying phenomena that resulted in a detected aberration. This also showed that the gain in resolution of the NGS data with respect to the aCGH data is the greatest benefit, and that it will eventually improve clinical diagnostics.

5 METHODS

5.1 Data

The blood samples of the patients are prepared for sequencing using the TruSeq protocol of Illumina [Illumina, Inc., 2011].
A single-end 50 bp run is performed on an Illumina HiSeq 2000 sequencer. Four test samples are used with coverage ranging from 0.2 to 2.4 (see table S2), and two control samples were available as input for the two control-based tools. The first control sample contains a pool of 27 samples of healthy pregnant women. The second control sample is the same control sample used for the aCGH experiments. This second control sample is of commercial origin: control DNA of 100 anonymous females created by Kreatech Diagnostics [Kre, 2010]. This second control set is more reliable than the first, because in the first set around 5% of the reads will be of fetal origin [Lo et al., 1997] and because the detection of CNVs is not limited to pregnant women or their unborn children. The second control sample is sequenced multiplexed with one of the test samples, as shown in figure 8. This results in multiple control samples and two runs of sample 4 (see table S2 for more details).

[Diagram: run R1 contains S R1 and A R1; run R2 contains S R2 and A R2; run R3 contains A R3]

Fig. 8. Illustration of the way sample 4 (S) is multiplexed with the control sample (A). In two of the three runs (R1 and R2), sample 4 (red) is multiplexed with the control sample (green). In one run only the control sample is sequenced. Besides control sample A, an external control sample is used, depicted by (B). This control sample is a pool of DNA of 27 pregnant women.

5.2 Mapping

Alignment of the data is done using BWA [Li and Durbin, 2009]. The reads are aligned in two different ways: the first allows no mismatches (strict alignment), the second allows one mismatch per read (lenient alignment). The reference genome used for alignment is the 1000 Genomes Project GRCh37 reference genome [1000 Genomes Project, 2009]. This reference genome is chosen for its compatibility with BWA, as it contains only one copy of the Y-chromosome. The Ensembl reference genome assembly [Ensembl, 2012] contains the Y-chromosome in two parts, which resulted in problems running BWA and samtools. BWA places reads that can be mapped to different regions of the genome at one of those regions at random and assigns a mapping quality (MQ) of zero to these reads. These non-unique reads are removed using samtools [Li et al., 2009] to reduce their influence on the read-depth. Samtools is also used to remove PCR duplicates, to exclude the possible effect of PCR on our data analysis.

5.3 Removal of read-towers

The aligned data sets showed large stacks of reads at specific locations, especially around the centromeres and telomeres, known as read-towers. The towers are made of many reads (in some samples over reads) that overlap to a very large extent. First the towers are detected by grouping reads whose starting positions are at most N bases from each other. Towers are deleted when the number of reads in a tower exceeds a threshold T tower. The algorithm is described in Algorithm 1.
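The tower-removal procedure described above can be sketched in Python as follows. This is a minimal sketch rather than the implementation used in this study: reads are represented as hypothetical `(chromosome, start)` tuples that are already sorted, consecutive reads are grouped while their start positions differ by at most N bases, and the threshold value in the example is arbitrary.

```python
def _collapse(group, tower_threshold):
    """Return only the first read of a tower, or the whole group otherwise."""
    return group[:1] if len(group) > tower_threshold else group


def remove_read_towers(reads, max_shift, tower_threshold):
    """Remove read-towers from sorted reads.

    reads: list of (chromosome, start) tuples sorted by chromosome and start.
    max_shift: N, the largest start-position distance between neighbouring
        reads that still belong to the same stack.
    tower_threshold: T tower, stacks with more reads than this are towers.
    """
    output, group = [], []
    for chrom, start in reads:
        # Close the current stack when the chromosome changes or the gap
        # to the previous read is larger than max_shift.
        if group and (chrom != group[-1][0] or start - group[-1][1] > max_shift):
            output.extend(_collapse(group, tower_threshold))
            group = []
        group.append((chrom, start))
    output.extend(_collapse(group, tower_threshold))
    return output


reads = [("chr1", 100), ("chr1", 101), ("chr1", 103), ("chr1", 104),
         ("chr1", 500), ("chr1", 900), ("chr2", 102)]
# N = 4 as in the text; the threshold of 3 is only for this toy example.
print(remove_read_towers(reads, max_shift=4, tower_threshold=3))
```

In the toy example the four reads starting at positions 100 to 104 form a tower and are reduced to the first read, while the isolated reads are kept.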
To remove only the towers, the setting of the parameters is crucial. Both parameters depend on the coverage of the sample, and supplement 6.4 shows a way to calculate them for a given coverage. Here the following settings are used: N = 4, and T tower is calculated for each sample separately as described in supplement section

CNV detection tools

Four tools were used for detecting CNVs in the NGS data. The tools are selected such that both read-depth and control-based methods can be studied. CNVnator [Abyzov et al., 2011] and RDXplorer [Yoon et al., 2009] are based on read-depth analysis and only require the test sample. CNVnator requires a window size to be set and does not need additional parameters. RDXplorer needs the reference genome that is used for alignment of the sample as input. Again a window size needs to be set, but only window sizes of 10 and 100 bases are readily available. Both tools run per chromosome and do not calculate statistics for the whole genome. DWAC-seq [Koval and Guryev, 2011] and CNV-seq [Xie and Tammi, 2009] are both based on the use of a control sample. DWAC-seq has multiple parameters that can be set.
Algorithm 1: Algorithm to remove read-towers that appear after the reads are aligned to the reference genome (see supplement figure S1 for an example).

Data: Aligned and sorted reads and their positions
Define: Minimum shift (N) and Threshold (T tower)
Result: Reads located in read-towers are deleted
for read in sample do
    if chromosome of current read == chromosome of previous read AND start position of current read - start position of previous read <= N then
        save read in list
    else
        if size of list > T tower then
            delete reads in list except for the first read
        end
        print reads in list to output
        start a new list with the current read
    end
end

Here too a window size needs to be set, but it is important to note that the window is dynamic in terms of the number of bases, as the window size is specified by the number of reads per window in the control sample. The other parameters that need to be set are a threshold and a confidence level that is used in finding the breakpoints. The breakpoints are validated by bootstrapping over a sliding window and allowing breakpoints only if the confidence level is met. This is the only tool that filters on the uniqueness of a read placement in the alignment (this is not used, as the data is preprocessed to contain only uniquely aligned reads). CNV-seq needs no additional parameters. Some parameters can be changed, such as the threshold for calling, the p-value and the window size. The algorithm tries to determine the window size for each sample by using the log2 threshold and the p-value. All four tools differ in complexity, user friendliness and processing of the data (see table S3 for all settings used). Some need additional preprocessing of the input data and some adaptation of the code.
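The idea of a window that is dynamic in the number of bases can be illustrated with a small sketch. It is not the DWAC-seq implementation: the function names are hypothetical, read start positions on a single chromosome are assumed to be sorted and strictly increasing, and the log2 ratio is normalized by the total read counts of the two samples, which is an assumption made for this example.

```python
import bisect
import math


def dynamic_windows(control_starts, reads_per_window):
    """Half-open windows that each hold reads_per_window control reads."""
    windows = []
    last = len(control_starts) - reads_per_window
    for i in range(0, last + 1, reads_per_window):
        j = i + reads_per_window
        # A window ends where the next one starts; the final window ends
        # just after the last control read.
        end = control_starts[j] if j < len(control_starts) else control_starts[-1] + 1
        windows.append((control_starts[i], end))
    return windows


def window_log2_ratios(test_starts, control_starts, reads_per_window):
    """Log2 ratio of test to control read counts per dynamic window."""
    scale = len(control_starts) / len(test_starts)
    ratios = []
    for start, end in dynamic_windows(control_starts, reads_per_window):
        test_count = bisect.bisect_left(test_starts, end) - bisect.bisect_left(test_starts, start)
        if test_count == 0:
            ratios.append((start, end, None))  # empty bin: ratio undefined
        else:
            ratios.append((start, end, math.log2(scale * test_count / reads_per_window)))
    return ratios


# Toy data: one control read every 10 bases; the test sample has lost half
# of its reads between positions 400 and 600.
control = list(range(0, 1000, 10))
test = [p for p in control if not (400 <= p < 600 and (p // 10) % 2)]
for window in window_log2_ratios(test, control, reads_per_window=20):
    print(window)
```

With uniformly spaced control reads every window has the same length, but in regions where the control sample has fewer reads the windows automatically become longer. In the toy data the third window shows a clearly negative log2 ratio, while the other windows stay close to zero.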
Other tools that were considered (cn.mops [Klambauer et al., 2012], CNAnorm [Gusnanto et al., 2012] and BIC-seq [Xi et al., 2011]) were developed with another goal in mind (mostly detecting copy number changes in tumor material); we were either unable to run them correctly, or they gave results that could not be interpreted.

5.5 Determining correctness of results

The copy number changes (calls) found in the aCGH data are considered the standard of truth and are used to classify the NGS calls as true positives, TP (NGS calls overlap with aCGH calls), false positives, FP (NGS calls do not overlap aCGH calls), false negatives, FN (no NGS call is found for a region that is called by aCGH), or true negatives, TN (neither aCGH nor NGS calls a region). Determining the overlap between detected regions is not trivial. Therefore four different methods are considered to determine the right class. These methods are illustrated in figure 9 and discussed in more detail in the supplement. To find the best settings for DWAC-seq, the region-based nucleotide method is used. This method takes into account that regions detected by NGS are often smaller and may overlap one aCGH-detected region. The best parameter settings were determined by analyzing the receiver operating characteristic (ROC) curve of this method (see 6.6 for details). The correctness of the regions detected by DWAC-seq is determined with the region-based method, because its results can easily be compared with the regions detected by both the aCGH and NGS pipelines.
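As an illustration of the nucleotide-based method, the sketch below counts TP, FP, FN and TN nucleotides for a single chromosome. It is a simplified sketch and not the scoring code of this study: calls are assumed to be half-open (start, end) intervals, the type of the call (gain or loss) is ignored, and all function names are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical call representation: half-open interval [start, end) on one chromosome.
Interval = Tuple[int, int]


def merge(intervals: List[Interval]) -> List[Interval]:
    """Sort the intervals and merge those that overlap or touch."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered(intervals: List[Interval]) -> int:
    """Number of nucleotides covered by at least one interval."""
    return sum(end - start for start, end in merge(intervals))


def overlap(a: List[Interval], b: List[Interval]) -> int:
    """Number of nucleotides covered by both interval sets."""
    a, b = merge(a), merge(b)
    i = j = total = 0
    while i < len(a) and j < len(b):
        total += max(0, min(a[i][1], b[j][1]) - max(a[i][0], b[j][0]))
        # Advance the interval that ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def nucleotide_counts(acgh: List[Interval], ngs: List[Interval],
                      chrom_length: int) -> Dict[str, int]:
    """Label every nucleotide of the chromosome, with aCGH as the standard of truth."""
    tp = overlap(acgh, ngs)        # called by both
    fp = covered(ngs) - tp         # called by NGS only
    fn = covered(acgh) - tp        # called by aCGH only
    tn = chrom_length - tp - fp - fn
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}
```

Because every nucleotide receives exactly one label, the four counts add up to the chromosome length; this does not hold for methods that score per call.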

Fig. 9. Four methods that are used to determine whether NGS calls are right or wrong for each position of the genome. Each panel compares the aCGH and NGS call tracks and labels the outcome as TP, FP, FN or TN. Top panel: the nucleotide-based method. This method considers each nucleotide separately and labels each position with the resulting class. Panel in the 2nd row: region-based nucleotide method. In this method, the class is determined for the longest possible stretch of overlapping calls. The overlap should be larger than 50% of the smallest call, otherwise the nucleotide-based method is used. Panel in the 3rd row: gene-based method. This method is similar to the nucleotide-based method, but only considers the part of the genome that lies in gene regions; positions without a gene are not considered. The RefSeq gene track from UCSC is used to determine these regions (downloaded July 30, 2012). Bottom panel: region-based method. This method scores regions by counting how many CNV regions overlap. It does not score per nucleotide but per call, so the true and false positives and negatives do not add up to the total number of nucleotides in the genome.

5.6 Rule-filtering

The regions detected in the NGS data by DWAC-seq are filtered to separate reliable calls from unreliable calls, which are probably not true copy number changes. Keeping only the reliable calls limits the number of false positives.
The filtering is based on a set of rules derived from manual annotation of the aCGH and NGS calls, for which the signals of the test sample and control sample were inspected. The combined set of aCGH and NGS calls is divided into three classes: a positive set (true CNV), a negative set (false CNV), and an undetermined set (uncertain whether true or false CNV). Note that this negative set differs slightly from the negatives obtained when comparing aCGH and NGS calls: here one of the methods has called the region (positive), but after manual annotation it was decided that there is not enough evidence to call it. The undetermined set contains calls that are hard to assign to the positive or negative set because of differences in signal (for example, one part shows an increase in signal while the rest of the call has a very variable log2 ratio because of low coverage).

We looked at several statistics for each class to find distinguishing properties. These statistics included, amongst others: the read-depth of the test sample and control sample compared to the average signal of the chromosome and the whole genome; the variance of the log2 ratio compared to the average variance of the whole chromosome and genome; and the variance of the read-depth signal. An important statistic turned out to be the number of empty bins. An empty bin is a bin (a part of the genome, in our case of 1000 bases) that contains no mapped reads; it can be a measure for regions that are unmappable due to repetitive sequences. Regions in the negative class appeared to have many empty bins. The positive set also contains CNVs with empty bins, but these CNVs are homozygous deletions or duplications, with decent coverage in the control sample but almost no coverage in the test sample.

The rule filter for the NGS calls then becomes:
(1) If test and control sample do not contain any empty bins, assign the call to the positive call set.
(2) If both test and control sample contain empty bins: if the total percentage of empty bins in both test and control sample is less than 20%, assign the call to the positive call set; otherwise assign it to the negative call set.
(3) If only one of the test and control samples contains empty bins: if the percentage of coverage of the control sample is higher than 50%, assign the call to the positive call set; otherwise assign it to the negative call set.

5.7 Merging of calls

Sometimes, when a large aberration is present, DWAC-seq splits this aberration into multiple calls that slightly overlap. After application of the rule filter, overlapping calls are merged to give a more reliable view of the true number of calls made by the CNV detection tool. In the Results section the effect of merging is often displayed in a separate "merged" column, to show the difference between the number of calls that pass the filter and the actual number of calls after merging.

ACKNOWLEDGEMENT

We would like to acknowledge Desiree Steenbeek for assistance with all data requests regarding aCGH, Quinten Waisfisz for assistance with the generation of the NGS data, Daoud Sie and Bauke Ylstra for the inspiration ([Smeets et al., 2011]) and Roy Straver for the collaboration on the development of the RETRO filter and inspiring brainstorm sessions.

REFERENCES

1000 Genomes Project. Sanger FTP 1000 genomes reference, 10th October.
Alexej Abyzov, Alexander E. Urban, Michael Snyder, and Mark Gerstein. CNVnator: An approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Research, 21.
Agilent Technologies. Agilent Scan Control 8.5.1, a.
Agilent Technologies. Feature Extraction 11.0, b.
Jeffrey A. Bailey, Zhiping Gu, Royden A. Clark, Knut Reinert, Rhea V. Samonte, Stuart Schwartz, Mark D. Adams, Eugene W. Myers, Peter W. Li, and Evan E. Eichler. Recent segmental duplications in the human genome.
Science, 297(5583), August.
Yuval Benjamini and Terence P. Speed. Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acids Research, February.
Peter J. Campbell, Philip J. Stephens, Erin D. Pleasance, Sarah O'Meara, Heng Li, Thomas Santarius, Lucy A. Stebbings, Sarah Edkins, Claire Hardy, Jon W. Teague, Andrew Menzies, Ian Goodhead, Daniel J. Turner, Christopher M. Clee, Michael A. Quail, Antony Cox, Clive Brown, Richard Durbin, Matthew E. Hurles, Paul A.W. Edwards, Graham R. Bignell, Michael R. Stratton, and P. Andrew Futreal. Identification of somatically acquired rearrangements in cancer using genome-wide massively parallel paired-end sequencing. Nature Genetics, 40(6).
CERN. ROOT package. URL root.cern.ch.
K. Darvishi. Application of Nexus Copy Number software for CNV detection and analysis. Current Protocols in Human Genetics, April.
Ensembl. Ensembl human genome sequence FTP, January.
ENZO Life Sciences, Inc. Product data sheet ENZ CGH labeling kit for oligo arrays.
Lars Feuk, Andrew R. Carson, and Stephen W. Scherer. Structural variation in the human genome. Nature Reviews Genetics, 7:85-97, February.
Arief Gusnanto, Henry M. Wood, Yudi Pawitan, Pamela Rabbitts, and Stefano Berri. Correcting for cancer genome size and tumour cell content enables better estimation of copy number alterations from next-generation sequence data. Bioinformatics, 28(1):40-47.
Henne Holstege. Gebruik van array in prenatale diagnostiek in Nederland.
John Hunter, Darren Dale, and Michael Droettboom. Matplotlib.
Illumina, Inc. De novo assembly using Illumina reads. URL technote_denovo_assembly_ecoli.pdf.
Illumina, Inc. TruSeq(TM) RNA and DNA sample preparation kits v2, April. URL datasheets/datasheet_truseq_sample_prep_kits.pdf.
Günter Klambauer, Karin Schwarzbauer, Andreas Mayr, Djork-Arné Clevert, Andreas Mitterecker, Ulrich Bodenhofer, and Sepp Hochreiter. cn.mops: Mixture of Poissons for discovering copy number variations in next generation sequencing data. Nucleic Acids Research, 40.
Jan O. Korbel, Alexander E. Urban, Jason P. Affourtit, Brian Godwin, Fabian Grubert, Jan Fredrik Simons, Philip M. Kim, Dean Palejev, Nicholas J. Carriero, Lei Du, Bruce E. Taillon, Zhoutao Chen, Andrea Tanzer, A.C. Eugenia Saunders, Jianxiang Chi, Fengtang Yang, Nigel P. Carter, Matthew E. Hurles, Sherman M. Weissman, Timothy T. Harkins, Mark B. Gerstein, Michael Egholm, and Michael Snyder. Paired-end mapping reveals extensive structural variation in the human genome. Science, 328, October.
Slavik Koval and Victor Guryev. DWAC-seq: dynamic window approach for CNV detection using sequencing tag density, February 10.
User Guide Megapool Reference DNA EA-100M (male) & EA-100F (female). Kreatech Diagnostics, Vlierweg 20, 1032 LG Amsterdam, version 1.0 edition, March.
Hugo Y.K. Lam, Michael J. Clark, Rui Chen, Rong Chen, Georges Natsoulis, Maeve O'Huallachain, Frederick E. Dewey, Lukas Habegger, Euan A. Ashley, Mark B. Gerstein, Atul J. Butte, Hanlee P. Ji, and Michael Snyder. Performance comparison of whole-genome sequencing platforms. Nature Biotechnology, 30:78-82, June.
H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, R. Durbin, and 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map (SAM) format and SAMtools. Bioinformatics, 25:2078-9.
Heng Li and Richard Durbin. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25(14).
Y.M. Lo, N. Corbetta, P.F. Chamberlain, V. Rai, I.L. Sargent, C.W.G. Redman, and J.S. Wainscoat. Presence of fetal DNA in maternal plasma and serum. The Lancet, 350(9076).
Daniel Pinkel, Richard Segraves, Damir Sudar, Steven Clark, Ian Poole, David Kowbel, Colin Collins, Wen-Lin Kuo, Chira Chen, Ye Zhai, Shanaz H. Dairkee, Britt-marie Ljung, Joe W. Gray, and Donna G. Albertson. High resolution analysis of DNA copy number variation using comparative genomic hybridization to microarrays. Nature Genetics, 20, October.
James T. Robinson, Helga Thorvaldsdóttir, Wendy Winckler, Mitchell Guttman, Eric S. Lander, Gad Getz, and Jill P. Mesirov. Integrative genomics viewer. Nature Biotechnology, 29:24-26.
Serge J. Smeets, Ulrike Harjes, Wessel N. van Wieringen, Daoud Sie, Ruud H. Brakenhoff, Gerrit A. Meijer, and Bauke Ylstra. To DNA or not to DNA? That is the question, when it comes to molecular subtyping for the clinic! Clinical Cancer Research, 17, June.
R. Straver, H. Holstege, E.A. Sistermans, C.B.M. Oudejans, and M.J.T. Reinders. Detection of fetal copy number aberrations by shallow sequencing of maternal blood samples. Unpublished.
Lisenka E.L.M. Vissers, Bert B.A. de Vries, and Joris A. Veltman. Genomic microarrays in mental retardation: from copy number variation to gene, from research to diagnosis. Journal of Medical Genetics, 47.
Ruibin Xi, Angela G. Hadjipanayis, Lovelace J. Luquette, Tae-Min Kim, Eunjung Lee, Jianhua Zhang, Mark D. Johnson, Donna M. Muzny, David A. Wheeler, Richard A. Gibbs, Raju Kucherlapati, and Peter J. Park. Copy number variation detection in whole-genome sequencing data using the Bayesian information criterion. Proceedings of the National Academy of Sciences, 108(46):E1128-E1136.
Chao Xie and Martti T. Tammi. CNV-seq, a new method to detect copy number variation using high-throughput sequencing. BMC Bioinformatics, 10(80).
Seungtai Yoon, Zhenyu Xuan, Vladimir Makarov, Kenny Ye, and Jonathan Sebat. Sensitive and accurate detection of copy number variants using read depth of coverage. Genome Research, 19.

6 SUPPLEMENT

6.1 Data generation on aCGH

All samples used in this research were blood samples. The samples were analyzed using 4x180k Agilent microarrays following the standard operating procedure (SOP) of the VU medical center. The concentration of DNA is verified before analysis. Labeling is done using the ENZO DNA labeling kit [ENZO Life Sciences, Inc.]. After preparation, the array is scanned and processed using Agilent Scan Control [Agilent Technologies, a] and Feature Extraction [Agilent Technologies, b].

6.2 NGS data generation

The samples were prepared using the Illumina TruSeq protocol [Illumina, Inc., 2011] and sequenced on an Illumina HiSeq 2000. The reads were filtered on Illumina chastity [Illumina, Inc., 2010] to ensure high read quality. The chastity filter looks at the cluster of the read on the flow cell and calculates a score per base (the intensity of the highest signal divided by the sum of the two highest signals). It is a measure of how well the cluster can be distinguished from the surrounding clusters. If there has been interference, the read will not pass the filter. The read length was the same for all samples (51 base pairs). When coverage is mentioned, the following equation is used:

\[ \text{Coverage} = \frac{\text{Number of reads} \times \text{Length of read}}{\text{Size of genome}} \]

Unless stated otherwise, the coverage is calculated after quality filtering, PCR duplicate removal and tower removal. The size of the genome is determined by counting all A, T, C and G bases in the chromosomes of the 1000 Genomes Project reference genome GRCh37 (hg19).

The samples contain certain CNVs that are of special interest (clinical calls) because of potentially pathogenic properties; these are described in table S1. All samples were obtained in the clinic. The patients had symptoms of intellectual disability and the array results were generated to find CNVs potentially related to the phenotype.
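The coverage definition used in this section can be illustrated with a small sketch. The function names are hypothetical, and the genome size in the example is an arbitrary illustrative number, not the GRCh37 base count used in this study.

```python
import math


def coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Coverage = (number of reads * length of read) / size of genome."""
    return n_reads * read_length / genome_size


def reads_for_coverage(target: float, read_length: int, genome_size: int) -> int:
    """Smallest number of reads that reaches at least the target coverage."""
    return math.ceil(target * genome_size / read_length)


if __name__ == "__main__":
    # Illustrative values only: 51 bp reads (as in this study) and a
    # made-up genome size of 51 million bases.
    print(coverage(1_000_000, 51, 51_000_000))
    print(reads_for_coverage(0.5, 51, 51_000_000))
```

The second function inverts the definition, which is convenient when deciding how many reads to keep to simulate a given shallow coverage.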
6.3 Alignment

Alignment of the reads is done using BWA [Li and Durbin, 2009]. Two settings were used: conservative alignment, which allows no mismatches, and less stringent alignment, which allows one mismatch per read. Allowing one mismatch makes it possible to map reads that contain a SNP. The reference genome used for alignment is GRCh37, obtained from the 1000 Genomes Project via the Sanger FTP [1000 Genomes Project, 2009].

PCR duplicates can be observed as a side-effect of the construction of the library for sequencing. The duplicates result from the amplification bias that is introduced by the polymerase chain reaction, but they can also result from simply sequencing the same region twice. Because their origin is uncertain, exact duplicate reads were removed using Samtools rmdup [Li et al., 2009].

BWA aligns a read that can be mapped to more than one location on the genome to one of these locations at random and gives such a read a mapping quality (MQ) of 0. As the origin of these reads is unknown and their random mapping provides no additional information for our purpose, these non-unique reads are deleted. This is done with Samtools, making use of the MQ of 0. The resulting BAM files contain only uniquely mapped reads. The coverage after mapping and filtering is listed for all test and control samples in table S2. The Integrative Genomics Viewer (IGV) [Robinson et al., 2011] is used to check the alignment and read-depth between the described postprocessing steps.

6.4 Read-tower removal

After the removal of the PCR duplicates and non-unique reads, several peaks can be observed in the read-depth (figure S1). These peaks, referred to as read-towers, are stacks of reads that largely overlap. They are situated mostly near centromeres and telomeres, but can also be observed at other locations in several chromosomes.
The formation of read-towers strongly influences the statistics of the sample: a tower may contain a large number of reads, which influences the normalization of the rest of the chromosome. The centromeres and telomeres contain large repeat regions. These repeat regions cannot be assembled correctly and are not included in the reference genome. A possible explanation of the ill-understood phenomenon of read-tower formation is that the reference genome still contains a part of a repeat region. This region is sequenced several times, but the reads have only one location to be aligned to, so the stack at this location becomes very high. The positions of the read-towers vary strongly between samples, so an approach had to be developed that reduces the influence of the towers based on the statistics of the data.

RETRO filter

The REad-Tower RemOval filter, RETRO, was developed to reduce the influence of the read-towers on the data. As the samples are shallowly sequenced and have a low coverage, the filter can make use of two distinguishing properties of read-towers: the reads largely overlap, and many reads are assembled in a small region (the height of the tower). The pseudo-code of the filter can be found in Algorithm 1 (in the main text). In words: a read with a starting position within N base pairs after the current read is added to the current stack of reads. This step is repeated with the next read until a read is encountered whose starting point lies more than N base pairs further. The resulting stack is kept if the number of reads in it is below the threshold T; otherwise all reads but the first are deleted from the data set.

Optimizing parameters

To determine the optimal settings for the RETRO filter, first the chance that a tower occurs in a random data set is calculated. We assume that the reads are spread uniformly over the genome.
To determine the chance for any read to be within N base pairs of the next read, the distance between the current and the next read is modeled using the Poisson distribution. The expected distance between any two successive reads can be calculated by dividing the length of the genome by the total number of reads. λ (the expected value) can then be described as in Equation S1.

\[ \lambda = \frac{\text{Read count}}{\text{Genome length}} \qquad \text{(S1)} \]

Table S1. Overview of the CNVs of interest for the clinic (clinical calls) that are located in the available samples. The column Detected states whether the region is detected by DWAC-seq when the best parameters for the sample are used. One CNV in sample 3 was found as 7 separate calls by aCGH and was merged manually.

Sample   | Location     | Type                | Length  | Probes       | Probe Median | Gene          | Detected
Sample 1 | 22q12.3      | Deletion            | 60 kb   |              |              | LARGE         | Yes
Sample 2 | 3q12.3       | Duplication         | 65 kb   |              |              | ZPLD1         | Yes
Sample 3 | 5p15.1-p13.3 | Triplication        | 12.9 Mb | 752 (7 reg.) |              | Multiple      | Yes
         | 5p13.2       | Duplication         | 78 kb   |              |              | C1QTNF3-AMACR | No
         | 5p13.2       | Duplication         | 148 kb  |              |              | No genes      | No
         | 15q24.3      | Deletion            | 182 kb  |              |              | SCAPER        | Yes
Sample 4 | 4p16.1       | Duplication         | 110 kb  |              |              | SORCS2        | Yes
         | 5q15-q23.3   | Deletion            | 34 Mb   |              |              | Multiple      | Yes
         | 7p14.1       | Deletion            | 238 kb  |              |              | C7orf10       | Yes
         | Xp11.22      | Duplication         | 69 kb   |              |              | Multiple      | Yes
         | Xq21.2       | Homozygous Deletion | 70 kb   |              |              | DACH2         | Yes

Table S2. Coverage (number of times the genome is sequenced) of the samples right after alignment ("Aligned") and after quality filtering, PCR duplicate removal and tower removal ("Filtered"). Two different alignment settings are used (0/1 MM). Control A is sequenced on three different lanes, in two of these lanes multiplexed with sample 4 (see 8 for details). Control B is an external control set created from data of the VUmc project to detect aberrations of fetal DNA in the blood of the mother, and contains data of all healthy samples from two runs (27 samples in total).
Columns: Aligned (0MM), Aligned (1MM), Filtered (0MM), Filtered (1MM). Rows: the test samples (sample 4 per run and combined), control A (per run and combined) and control B.

The chance for any read to have the next read starting within N base pairs after its own starting point is described by Equation S2.

\[ P(n \leq N) = \sum_{i=0}^{N} \frac{\lambda^{i} e^{-\lambda}}{i!} \qquad \text{(S2)} \]

Here n is the actual distance between any two successive reads, and N is the parameter set in advance that indicates the maximum distance between the starting points of reads that are stacked as described above. The chance that a stack occurs with a height (the number of reads in the stack) of T or higher is given by Equation S3.

\[ P(t \geq T) = P(n \leq N)^{T-1} \qquad \text{(S3)} \]

In this equation T is the threshold that is used for deleting a read-tower from the data, and t is the current stack height. The -1 in the exponent corrects for the fact that the calculation starts with a current read, so that a stack always has a height of at least 1. Combining all of the above results in Equation S4.

\[ P(t \geq T) = \left( \sum_{i=0}^{N} \frac{\lambda^{i} e^{-\lambda}}{i!} \right)^{T-1} \qquad \text{(S4)} \]

Using Equation S3, the expected number of times that a removed read-stack was formed randomly instead of being part of an actual read-tower can be approximated for each combination of N and T. As it is preferred to remove only read-towers (and not stacks of reads that occur randomly), this expected number of removals of randomly created towers should be as close to zero as possible. The chance to have zero reads as part of a read tower should be as great as possible. Equations S5 and S6 describe this:

\[ E = P(t \geq T) \cdot R \qquad \text{(S5)} \]

with E the expected number of towers found in a sample containing R reads, and

\[ C = P(t \geq T)^{R} \qquad \text{(S6)} \]

with C the chance for a read tower to occur anywhere in a sample containing R reads. Settings should be picked keeping E
between 0 and 1, making the chance that a random set of reads is removed very low while keeping the script sensitive enough to remove abnormal behavior.

Fig. S1. The formation of read-towers can be clearly observed in IGV. The stacked reads overlap largely and are mainly positioned around the centromere and telomere regions. The lower part of the figure shows the individual reads in this read-tower. Colors indicate a position where a read that was aligned allowing one mismatch differs from the reference genome (depicted as the straight grey line from left to right). The read-depth signal is depicted above the line that represents the reference genome. This signal is cut off by IGV because of the very large number of reads aligned in this region.

6.5 CNV detection

Detection of CNVs is done using several NGS tools, each of which is discussed below in more detail. Most tools take the BAM file [Li et al., 2009] of the sample as input. Some tools also need a BAM file of the control sample that is used to determine CNVs. The parameters that can be tuned for each tool can be found in table S3.

CNVnator

CNVnator [Abyzov et al., 2011] uses the read-depth of the test sample to find CNVs in the data. The tool depends on the ROOT package ([CERN], version 5.32/01) and Samtools. Using an intermediary .root file, the tool analyzes the data and returns a list of CNVs. The tool divides the data into bins of equal size and counts the reads in each bin, creating a read-depth signal. This signal is then corrected for the bias introduced by GC-content. Partitioning of this signal is performed using a mean-shift technique (described in [Abyzov et al., 2011]). Adjacent regions are merged if the signal is similar.
A CNV is called if the mean read-depth signal of the segment deviates more than 25% from the genomic average.

RDXplorer

RDXplorer [Yoon et al., 2009] version 2.0, release 3, is run at a window size of 100 bases. The tool uses these windows to assemble a read-depth signal by counting the mapped reads in each window. Only two window sizes can be chosen: 10 and 100 bases. The testing in the paper was performed on high coverage data (30x), and the authors assume that at this coverage the number of reads positioned in a window approaches a normal distribution. When the tool is applied to low coverage data it still assumes this distribution, which might cause misbehavior of RDXplorer. For each window a GC-content correction is applied using

\[ \tilde{r}_i = r_i \cdot \frac{m}{m_{GC}} \]

where r_i is the read count in window i, m the median read count over all windows and m_GC the median read count of the windows with a GC-content similar to that of window i. This corrected read-depth is then analyzed using event-wise testing. The read count of a window is converted into a Z-score, (r_i - μ)/σ, and the upper- and lower-tail probabilities are determined. A window is called by looking at a set of consecutive windows. For each window in this set the upper- and lower-tail probabilities that lie below (FPR / (L/l))^(1/l) are listed, and the maximum of the resulting probabilities is selected as an unusual event. The FPR can be set in advance, L is the number of windows in the chromosome and l the number of windows in the consecutive set. The size of the consecutive set is increased in multiple iteration steps. Some filtering steps are performed to remove small deletions and duplications before the results are returned.

DWAC-seq

DWAC-seq [Koval and Guryev, 2011] version 0.56 is used to detect CNVs in the sample data using a control sample. The input data should be in BAM format. As the tool works on each chromosome separately, a script was made to analyze the whole genome.
The output consists of 5 different files; the file with detected CNVs is sorted by relevance. For each CNV the start and stop position are stated, as well as the number of reads in the test and control sample. The score given to each CNV is the ratio of test and control counts divided by the median of the average reads per segment in between breakpoints. The window size is dynamic and determined by a constant number of reads per window. Static windows have the disadvantage that the same window is used for regions with low and high coverage. A dynamic window reduces the influence of GC-content, as the window automatically becomes larger in low coverage regions. The windows are set by counting the number of reads in the control sample. After a window is set, the number of reads in the corresponding window in the test sample is counted. The next step is the segmentation of the data, which is done using the CUSUM Monte-Carlo method. This method finds breakpoints by bootstrapping: a cumulative sum is calculated over random values of the read-depths in a sliding window (containing multiple windows). Each element is corrected by a weight (the average ratio of the read-depth in the set of windows), and when the cumulative sum reaches a threshold value, a breakpoint is found. Regions between breakpoints with a similar signal are merged, and the resulting CNV regions are fine-tuned by looking at all the relevant start and stop positions.

CNV-seq

CNV-seq [Xie and Tammi, 2009] was selected because it can include a control sample in the analysis of the test sample. The read-depth data of both test and control sample are analyzed using a sliding window defined by a number of bases. A statistical model is used to reduce the effect of sequencing biases such as GC-content. The size of the sliding window is calculated by the tool itself on a per-sample basis. The number of reads in a specific
window depends on the total number of reads, the length of the genome and the size of the window, and it is assumed that the number of reads is Poisson distributed. The test/control ratio per window is corrected by the total read-depth of both samples. The probability of this copy number ratio occurring randomly is calculated and the p-value is computed (see [Xie and Tammi, 2009] for the exact procedure), after which the calls are written to the output file.

6.6 Finding the best parameters for DWAC-seq

To find the parameters that result in good performance of DWAC-seq, four scoring methods were developed (figure 9). Two parameters were analyzed and optimized: the calling threshold, T, and the window size, W. These parameters were varied while the other parameters were kept constant at their default values. The threshold was varied between 0.1 and 0.5, the window size from 100 to 5000 reads per window. Varying these parameters made it possible to construct a receiver operating characteristic (ROC) curve, from which the optimal setting could be derived. The ROC curve plots the false positive and true positive rates, which are defined as follows:

\[ \text{False positive rate} = \frac{FP}{FP + TN} \qquad \text{True positive rate} = \frac{TP}{TP + FN} \]

The plots in figures S2 and S3 show the four different scoring measures used to score the parameter settings. These measures all have advantages and disadvantages. In theory the data point closest to (0,1) is considered optimal. Not all scoring methods show ROC behavior that approaches this ideal point. The region-based method shows a diagonal rather than a curve in both figures S2 and S3. This shows that with this measure no trade-off can be made between the number of false positives and true positives. The gene-based method is based on the idea that only CNVs in gene regions will be relevant in the clinic. Few CNVs are expected to lie in these regions, which explains the abrupt leveling of the curve in both plots.
From these observations we conclude that these two measures seem least reliable for evaluating the CNV detection tools. In figure S2, the nucleotide-based and region-based nucleotide methods both show behavior that approaches the (0,1) point. Both evaluation methods point to the same optimal threshold setting, although the curve of the nucleotide-based method bends more markedly towards the (0,1) point. Experiments with window sizes of 100, 500, 1000, 2000, 3000, 4000 and 5000 (and fixing T = 0.8) show that a larger window size results in a lower number of calls. The results of this experiment can be found in figure S3. The same observation can be made for the different evaluation methods. In this case the nucleotide-based method is very variable and difficult to use for determining an optimal window size. In general, the window size shows very variable results, which is due to differences in coverage, but also to the type, length and depth of the CNVs that can occur. There seems to be no single optimal window size across the samples. When looking at the best scoring window size per sample, a window size of 1000 seems to give the best results. When the data is run with the best parameters T = 0.25 and W = 1000, the number of calls that are made is low. The numbers for samples 1 to 4 are 1, 4, 9 and 8 calls. The numbers of aCGH calls that overlap with these calls are 3, 4, 5 and 9. Hence, almost all calls are true positives, but many calls are missed. Therefore, we decided to decrease the window size to 100, where more calls are generated (and the ROC performance is still relatively good). These calls are further analyzed and filtered to obtain a data set with probable CNVs. It should be noted that this setting was chosen using control sample A R1. Changing the coverage of the test and control sample has a large influence on the choice of the window size.
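The selection of the data point closest to (0,1) can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the counts in the usage example are invented, and the Euclidean distance to (0,1) is one possible way to define "closest".

```python
import math
from typing import Dict, Hashable, Tuple

# Hypothetical container: (TP, FP, FN, TN) counts for one parameter setting.
Counts = Tuple[int, int, int, int]


def rates(tp: int, fp: int, fn: int, tn: int) -> Tuple[float, float]:
    """Return (false positive rate, true positive rate)."""
    fpr = fp / (fp + tn) if fp + tn else 0.0
    tpr = tp / (tp + fn) if tp + fn else 0.0
    return fpr, tpr


def closest_to_ideal(results: Dict[Hashable, Counts]) -> Hashable:
    """Return the setting whose ROC point lies closest to the ideal point (0, 1)."""
    def distance(setting: Hashable) -> float:
        fpr, tpr = rates(*results[setting])
        return math.hypot(fpr, 1.0 - tpr)
    return min(results, key=distance)


if __name__ == "__main__":
    # Invented counts for three threshold settings, for illustration only.
    example = {
        0.10: (90, 50, 10, 50),
        0.25: (80, 10, 20, 90),
        0.50: (40, 1, 60, 99),
    }
    print(closest_to_ideal(example))
```

When a scoring method produces a diagonal instead of a curve, all points lie at a similar distance from (0,1) and such a selection is not informative, in line with the observations above.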
6.7 Visualization

Visualization of the read depths of the test sample and control sample is done using Matplotlib [Hunter et al.]. The genome is divided into bins of 1000 base pairs and the number of reads in both the test and control sample is counted per bin (a read is counted in the bin that contains its start position). The read depths of the bins are divided by the total number of reads in the sample. When an empty bin occurs during calculation of the log2 ratio signal (no reads are aligned in this region), the bin is skipped and removed from the log2 ratio signal. The resulting plot contains the read depth of the test and control sample as well as the log2 ratio. In addition, the number of empty bins over a region is plotted, which clearly shows which regions are unmappable.

The visualization of the aCGH output is done in a similar way. The signals of the test and control sample (green and red fluorescence signals) are plotted. As the actual aCGH analysis is performed on the log2 ratio, $\mathrm{ratio}_i = \log_2(\mathrm{signal}_{g,i} / \mathrm{signal}_{r,i})$, this ratio is plotted as well. Thresholds for calling are also plotted for "big loss", and at -0.3 ("loss"), 0.3 ("gain") and 1.0 ("high gain"). We also plot the probe density, to show which regions of our calls contain (almost) no probes. If the NGS tool calls a region in which no probes are located, this could explain the differences in output. For comparative purposes, the aCGH plots are displayed in the supplementary figures, starting at figure S4.

6.8 Rule-filtering

Manual annotation of the combined calls generated by analyzing the aCGH and NGS data resulted in three sets of CNVs: a positive, a negative and an undetermined set. The annotation was performed by inspecting the signal information of the test and control sample. The separation between the negative and positive set mainly lies in the coverage in the region of the CNV: if there is (almost) no coverage, the CNV cannot be trusted.
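The binning and log2 ratio computation described in section 6.7 can be sketched as follows. This is a minimal sketch under stated assumptions, not the code used in this work: read start positions are given as plain lists of 0-based integers, the totals are taken over the region that is passed in, and the names `bin_counts` and `log2_ratio_signal` are hypothetical. A bin is skipped when it is empty in either sample, which is one possible reading of the text. Plotting is left out.

```python
import math
from collections import Counter

BIN_SIZE = 1000  # bin width in base pairs, as in the text


def bin_counts(read_starts, region_length, bin_size=BIN_SIZE):
    """Count reads per bin; a read belongs to the bin holding its start position."""
    n_bins = math.ceil(region_length / bin_size)
    counts = Counter(pos // bin_size for pos in read_starts)
    return [counts.get(i, 0) for i in range(n_bins)]


def log2_ratio_signal(test_starts, control_starts, region_length,
                      bin_size=BIN_SIZE):
    """Return ({bin index: log2 ratio}, [indices of skipped bins]).

    Counts are normalized by the total number of reads per sample.
    Assumption: a bin is skipped if it is empty in either sample.
    """
    test = bin_counts(test_starts, region_length, bin_size)
    control = bin_counts(control_starts, region_length, bin_size)
    test_total = sum(test)
    control_total = sum(control)

    ratios, skipped = {}, []
    for i, (t, c) in enumerate(zip(test, control)):
        if t == 0 or c == 0:
            skipped.append(i)  # empty bin: left out of the ratio signal
            continue
        ratios[i] = math.log2((t / test_total) / (c / control_total))
    return ratios, skipped


# Hypothetical read start positions on a 3000 bp region (illustrative only).
test_reads = [10, 20, 1500, 1600, 1700, 1800]
control_reads = [5, 15, 1100, 1200, 2500]
ratios, skipped = log2_ratio_signal(test_reads, control_reads, 3000)
```

The list of skipped bins doubles as the empty-bin track mentioned in the text: long runs of skipped bins point at regions where no reads could be mapped.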
Based on this coverage observation, rules were developed to separate the two classes. These rules can be found in the main text. The rule filter mainly deletes calls that contain empty bins and have less than 20% coverage compared to the overall coverage of the chromosome. If empty bins are located in only one of the two samples, a homozygous deletion could be present. If the coverage of the control sample is above 50% of the expected coverage, this call is kept. The filter effectively deletes low-coverage and unreliable calls from our set, reducing the number of false positives considerably.

6.9 Relevance of CNV calls

The tools that are used all apply different algorithms to produce CNV calls. The number of calls that are based on NGS data
