# Detection of Copy Number Variation using Shallow Whole Genome Sequencing Data to replace Array-Comparative Genomic Hybridization Analysis

Master's Thesis

Author: Daphne M. van Beek

Thesis supervisors: Prof.dr.ir. M.J.T. Reinders; Dr. M.M. Weiss; Dr. E.A. Sistermans

Thesis committee: Prof.dr.ir. M.J.T. Reinders; Dr.ir. J. de Ridder; Dr. C.P. Botha, MSc; Dr. E.A. Sistermans; Dr. M.M. Weiss; Ir. J. Nijkamp

Date: October 23, 2012

## Preface

This report was written as part of the Master's thesis project for the master Computer Science, track Bioinformatics, at Delft University of Technology. Its main focus is the paper that resulted from my research on the detection of copy number variations using next generation sequencing data. In the future, next generation sequencing is likely to replace the array-comparative genomic hybridization (aCGH) technique that is currently used in clinics. For that to happen, the minimal coverage required for competitive detection must be known; this was the focus of my work. Besides the main paper, a supplement provides additional information about the research. A work document is also included, in which the progress and observations made during the project are recorded. The Master's thesis project was carried out at the Bioinformatics Lab at Delft University of Technology in collaboration with the department of Clinical Genetics of the VU Medical Center in Amsterdam.

## Acknowledgements

Thanks to Janneke Weiss, Marcel Reinders and Erik Sistermans for their advice, supervision and all the interesting discussions. Thanks to Desiree Steenbeek, Daoud Sie, Quinten Waisfisz and Bauke Ylstra for the tips and their help with the aCGH and NGS issues I encountered during the project. Thanks to Roy Straver for the collaboration in the starting phase of our thesis projects. It was great that we could merge our ideas and even invent our own, new alignment method (now we only have to make sure the code works)!

Detection of CNVs using Shallow Whole Genome NGS Data to replace aCGH Analysis

[Figure 1. Panel A, Next Generation Sequencing Pipeline: read data, reference data (R1, R2, R3, R1,2,3), alignment (BWA), pre-processing, CNV detection, rule-filtering, manual reannotation of calls, validation against aCGH calls. Panel B, Array CGH Pipeline: labeling sample, labeling reference, hybridization on microarray, scanning microarray, analysis by Feature Extraction, calls generated by Nexus.]

Fig. 1. Next generation sequencing and aCGH pipelines for the detection of CNVs. The steps are explained in more detail in the Methods and Results sections. In panel A, orange indicates a list of choices that is evaluated. For example, four CNV detection tools are tested and an optimal tool with optimal settings is chosen based on ROC analysis, and the alignment of the reads is optimized to allow for SNPs. All CNVs generated by aCGH and by the NGS pipeline are manually annotated, and the results are used to develop a rule-filter that reduces false positives. Panel B gives a general overview of the aCGH technology.

the sample. This is done by comparing results between multiplexed and stand-alone test and control samples.

### 2.2 Comparing aCGH and NGS results

We want the CNV regions detected from the NGS data to resemble the regions detected by aCGH as closely as possible. We therefore define the following outcomes. True positives (TP) are CNV regions found by aCGH that are also found by the NGS CNV detection tools, meaning that the detected regions overlap. If multiple NGS regions overlap one aCGH region, they are counted as one TP. False positives (FP) are regions detected by the NGS tool but not by aCGH. False negatives (FN) are aCGH regions not found by the NGS tool.
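The counting rules of section 2.2 can be sketched in a few lines of Python. This is a minimal illustration, not the thesis code: the `Region` tuple layout, the half-open coordinates and the function names are assumptions made for the example, and "overlap" is taken to mean sharing at least one base.

```python
from typing import List, Tuple

# Hypothetical layout: (chromosome, start, end), with end exclusive.
Region = Tuple[str, int, int]


def overlaps(a: Region, b: Region) -> bool:
    """True if both regions lie on the same chromosome and share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def compare_calls(acgh: List[Region], ngs: List[Region]) -> Tuple[int, int, int]:
    """Return (TP, FP, FN) using the definitions of section 2.2."""
    # An aCGH region counts as one TP, however many NGS calls overlap it.
    tp = sum(1 for a in acgh if any(overlaps(a, n) for n in ngs))
    # aCGH regions without any overlapping NGS call are false negatives.
    fn = len(acgh) - tp
    # NGS calls that overlap no aCGH region are false positives.
    fp = sum(1 for n in ngs if not any(overlaps(n, a) for a in acgh))
    return tp, fp, fn


acgh_calls = [("chr1", 1000, 5000), ("chr2", 200, 900)]
ngs_calls = [("chr1", 1500, 2000), ("chr1", 4000, 6000), ("chr3", 10, 50)]
print(compare_calls(acgh_calls, ngs_calls))  # (1, 1, 1)
```

Because a TP is counted per aCGH region while an FP is counted per NGS call, TP + FP is not necessarily equal to the number of NGS calls.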
The first and second criteria for selection are slightly correlated: the more regions the NGS tool detects (TP and FP combined), the higher the chance that detected regions overlap the aCGH regions by chance (TP).

### 2.3 DWAC-seq is the best CNV detection tool for analyzing the NGS data

Two types of tools are evaluated in this section: read-depth based (CNVnator [Abyzov et al., 2011], RDXplorer [Yoon et al., 2009]) and control-based (DWAC-seq [Koval and Guryev, 2011], CNV-seq [Xie and Tammi, 2009]). We compare these four CNV detection tools on the NGS data using the following criteria:

1. Calls reported by the aCGH analysis should be found using the NGS data.
2. The number of false positives and false negatives should be low, as a large number indicates that the tool's performance is not good enough.
3. Run time should be no more than a day.

The four CNV detection tools are applied to four samples (see Methods for details), and table 1 gives an overview of the outcomes for each sample. The first striking observation is that the total number of calls (footnote 1) differs considerably between the tools (from about 50 to even ). A high number of false positives is, however, not trustworthy, and these calls likely do not indicate true CNV regions (footnote 2). For that reason DWAC-seq and RDXplorer are the best candidates, as they generate the lowest numbers of calls. Of these two, DWAC-seq has the highest number of true positives (48% versus 25%) and is therefore selected as the best CNV detection tool.

The second observation is that samples 1 and 2 behave differently from samples 3 and 4. The total number of calls for samples 3 and 4 is much higher (twice as high or more). This is due to the difference in coverage: samples 1 and 2 have a coverage of 0.4 and 0.5 fold respectively, while samples 3 and 4 have a much higher coverage of 1.4 and 3.6 fold respectively. The CNV regions detected in the aCGH data have been subjected to clinical analysis.
Some of these regions have been marked as relevant for diagnosis and are classified into the following types: Type II, probably/possibly benign; Type III, probably/possibly benign; and Type IV, pathogenic. These regions should not be missed when using the NGS data for CNV detection. The calls overlapping these regions are given in parentheses in the columns of table 1. CNVnator performs best here: it finds all of these regions. DWAC-seq performs second best, missing only two regions. As the number of false positives is very high for CNVnator, we still choose DWAC-seq (see supplement section 6.9 for more details about the two missed clinical calls).

Footnote 1: The term call is used to indicate a detected CNV region.

Footnote 2: As the real CNV regions are not known, care should be taken in concluding that false positive calls are not real CNVs. However, as the aCGH analysis and two of the four CNV detection tools indicate a low number of CNV regions, it is likely that those regions are not true CNV regions.

D.M. van Beek et al

Table 1. Number of true positives (TP), total number of calls (TP + FP) and the run time for the four selected NGS CNV detection tools. The control sample used by DWAC-seq and CNV-seq is A R1 (supplement S2). The test samples are aligned to the reference genome using the strict setting (see section 5.2). The first columns give the total number of calls made using the aCGH data (CGH calls) and the number of regions found to be relevant for diagnosis after clinical analysis (Clinical calls). The results of CNVnator are calculated for window sizes of 1 kb and 10 kb respectively. The numbers in parentheses are the calls that overlap the clinically relevant calls found using the aCGH data. For CNVnator and RDXplorer only calls larger than 1 kb are kept. CNV-seq calculates an optimal window size for each sample. The (t) column states the total number of call regions found and the (c) column the number of calls after combining regions. The run time is an approximation for running the algorithm on data with a coverage of approximately 1x.

Column headers: Sample; CGH calls; Clinical calls; CNVnator; RDXplorer; DWAC-seq; CNV-seq. Sub-headers: 1 kb; 10 kb; 100 bp; 100 bp; (t); (c).

TP, one row per sample:

(1) 17 (1) 1 (0) 17 (1) 15 (1)

(1) 27 (1) 3 (0) 13 (1) 17 (0)

(4) 30 (4) 8 (1) 25 (2) 26 (2)

(5) 20 (4) 9 (0) 18 (5) 19 (4)

Total calls

Run time (1x): ± 2 hours; ± 4 hours; ± 10 hours; ± 30 min

### 2.4 Finding the best parameter settings for DWAC-seq

DWAC-seq is optimized by varying two parameters of the tool: the threshold, T, that the method uses to decide whether the signal is strong enough to make a CNV call, and the window size, W, which is defined in number of reads (instead of number of bases) and determines the minimal size of the regions that can be detected (see Methods for details). We varied the threshold between 0.1 and 0.5 and the window size between 100 and 5000 reads, and for each setting we compared the results with the regions detected from the aCGH data. The resulting ROC curves and the accompanying analysis can be found in the Supplement (6.6). These ROC curves show that the results are not very sensitive to the setting of T, whereas varying the window size produced large variability over all samples. Based on these observations, we set T to 0.25 and chose the smallest window size for which the number of true positives is nearly maximal while the number of false positives is still relatively low (W = 100 reads per window). The window size depends on the coverage of the sample, but we also observed that it depends heavily on the coverage of the control (table 5). This is because DWAC-seq uses the control sample to determine the location of the windows (see supplement section 6.5.3).
As can be seen in table 5, when the coverage of the control sample increases from 1.3x to 4.5x (3.5 times higher), the number of calls rises considerably (from 69 to 211 for sample 3) when the window size is kept constant. Note that the window size is defined by the number of reads, which means that when the coverage increases, a window of a fixed number of reads spans fewer bases. Smaller regions can then be detected, which in turn increases the chance of false positives. Hence, the window size needs to be set according to the coverage of the control sample (i.e. a larger window size when coverage is higher).

Previous studies revealed that an average of 12 copy number variants are present per individual [Feuk et al., 2006]. We use this to reason that when a method finds much higher numbers, many of the calls are likely to be false. In the aCGH results, the average number of CNVs found is 28, which complies with the findings of Feuk et al. As the resolution of NGS is higher than that of aCGH, we expect to find more aberrant regions. Additionally, DWAC-seq tends to split large aberrant regions into smaller ones, which also generates more detected regions. With the chosen settings of DWAC-seq, we find around 50 aberrant regions, which we think is in good agreement with what we would expect from the findings of Feuk et al.

### 2.5 The composition of the control sample has a large influence on the found CNVs

To reliably compare the aCGH and NGS results, the control sample used in the aCGH analysis is also used as the control sample for DWAC-seq. Hence, the higher the coverage of the control sample, the more accurate the results produced by DWAC-seq. Control sample A (see table ??), which is used for both pipelines, is a commercial product by Kreatech Diagnostics [Kre, 2010]. The control sample is genomic DNA isolated from whole blood samples of 100 anonymous female donors.
To reliably perform CNV detection, the control sample should be diverse enough to exclude CNVs that are particular to the heritage of just one person. As DWAC-seq uses the control sample to detect regions with an aberrant copy number, variations can be found when the test sample varies, but also when the control sample varies. An example is depicted in figure 2. One explanation for the decrease in the read-depth of the control sample is a deletion (with respect to the reference genome) at this position in the control sample. Note that this deletion would have to be consistent over the 100 persons from which the control sample is created, indicating a population bias in the control sample. Hence, when the test sample is not from the same population, false calls will be made, as is the case in this example. Ideally, the control for these experiments should consist of DNA from a large number of individuals from all sorts of backgrounds. Creating such a control sample is, however, very expensive. Therefore, we conclude that it is essential to visualize the read-depths of both the test sample and the control sample for every call made by the CNV detection tools.

### 2.6 The strictness of the alignment of the reads does not have a large influence on the found CNVs

Most experiments are performed using both the strict (no mismatches allowed) and lenient (one mismatch allowed) alignment methods (see Methods for details). Both alignment methods are compared to see which one gives the better result. The run times of the two alignments do not differ much. Table 2 gives an overview of the CNV detection results for both alignment methods. Ideally for one of the alignment methods to be superior to the other,

[Figure 6. Plot panels: Normalized RD Sample; Normalized RD Control. Legend: median.]

Fig. 6. An aCGH call, marked by the vertical yellow lines, that is not found by DWAC-seq (sample 2). The plots show that a large number of bins in the plotted region are empty (indicating that no reads have been mapped to this area). The variance in the read-depth is high in both the test sample and the control sample, and the log2 ratio does not show a clear increase or decrease. Although the log2 ratio of the aCGH data indicates a deletion (see figure 7), the evidence in the aCGH data is not very convincing either (only a few probes, which also show a variable signal). We conclude that in this case the NGS analysis is more reliable than the aCGH data and that this region cannot be considered a CNV.

### 2.8 Filtering the calls generated by DWAC-seq on quality improves the reliability of the calls

As the aCGH calls are not consistent enough to serve as ground truth, a more reliable standard of truth was created by manually annotating all calls generated by both the aCGH and the NGS pipelines. Table 3 gives an overview of the detections based on the NGS data compared with this manually annotated set of regions. The annotation revealed features that distinguish the positive from the negative set: bins that remain empty because no reads can be mapped in the region, and bins with a very low coverage compared to the average coverage of the chromosome. We used these features to develop a rule-filter that is applied as a post-processing step after the DWAC-seq analysis (see Methods for details). The results of the manual annotation and the rule-filtering are displayed in table 3.
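The rule-filter itself is specified in section 5.6 (Rule-filtering). A minimal sketch of those rules, assuming each call comes with per-bin read counts for the test and the control sample, could look as follows. The function names and input format are hypothetical, and two readings of the rules are assumptions: "total percentage of empty bins in both test and control sample" is taken as the pooled fraction over both samples, and "percentage of coverage of the control sample" as the fraction of non-empty control bins.

```python
from typing import Sequence


def count_empty(bins: Sequence[int]) -> int:
    """Number of bins in a call that contain no mapped reads."""
    return sum(1 for reads in bins if reads == 0)


def passes_rule_filter(test_bins: Sequence[int], control_bins: Sequence[int]) -> bool:
    """Return True if a call is assigned to the positive call set."""
    if not test_bins or not control_bins:
        raise ValueError("a call must span at least one bin in each sample")
    test_empty = count_empty(test_bins)
    control_empty = count_empty(control_bins)

    # Rule 1: no empty bins in either sample.
    if test_empty == 0 and control_empty == 0:
        return True
    # Rule 2: both samples have empty bins; pooled empty fraction must be below 20%.
    # (Assumption: "total percentage" means pooled over both samples.)
    if test_empty > 0 and control_empty > 0:
        pooled = (test_empty + control_empty) / (len(test_bins) + len(control_bins))
        return pooled < 0.20
    # Rule 3: only one sample has empty bins; control must be more than 50% covered.
    # (Assumption: "coverage" means the fraction of non-empty control bins.)
    control_covered = 1 - control_empty / len(control_bins)
    return control_covered > 0.50


# A homozygous deletion: the control is fully covered, the test sample is empty.
print(passes_rule_filter([0, 0, 0, 0], [6, 5, 7, 9]))        # True
# A likely unmappable region: both samples are mostly empty.
print(passes_rule_filter([0, 0, 3, 0, 2], [0, 0, 4, 0, 3]))  # False
```

The third rule is what keeps homozygous deletions in the positive set even though the test sample consists almost entirely of empty bins.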
Using this rule-filter, good calls are distinguished from bad calls based on two considerations: 1) homozygous deletions result in complete coverage in the control sample and mostly empty bins in the test sample, and 2) good calls generally have a low percentage of empty bins, as coverage is needed to accurately define a call as a CNV. Hence, if the coverage is 100%, the call is likely to be correct, and as the coverage drops, the reliability of the call drops as well.

Fig. 7. The aCGH data for the region discussed in figure 6 (top to bottom: log2 ratio (blue); test sample signal (green); control sample signal (red) and p-value of the log2 ratio (black)). The blue line (top panel) indicates the threshold for calling a gain, the pink line the threshold for a loss. There are clearly two probes with a log2 ratio below the threshold value. Comparing this figure with figure 6, we conclude that the region cannot be considered a true CNV, as the coverage is low (which could be an indication of repetitive sequences) and a large number of empty bins are observed around the region of interest.

Table 3 shows that the rule-filtering step eliminates, on average, 80% of the negative CNV set from the calls. Of the total number of calls, an average of 30% is deleted, which reduces the number of false positives considerably. Some calls from the positive set are also deleted by the filtering, but this is limited to 6% on average. Making the rules stricter deletes more calls from the positive set, which can be an option if the total number of calls is still too large.

After the rule-filtering, overlapping calls are merged. The merged set gives a more reliable overview of what is actually found with DWAC-seq. Merging reduces the total number of passed calls, because some calls overlap slightly at their end and beginning. We merge these calls because together they represent one large CNV.
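The merging step described above is a standard interval merge. The sketch below is an illustration under stated assumptions, not the thesis code: calls are hypothetical `(chromosome, start, end)` tuples with an exclusive end, and only calls that truly overlap (share at least one base) are merged, so directly adjacent calls stay separate.

```python
from typing import List, Tuple

# Hypothetical layout: (chromosome, start, end), with end exclusive.
Call = Tuple[str, int, int]


def merge_calls(calls: List[Call]) -> List[Call]:
    """Merge overlapping calls on the same chromosome into single regions."""
    merged: List[Call] = []
    # Sorting by chromosome and start puts overlapping calls next to each other.
    for chrom, start, end in sorted(calls):
        if merged and merged[-1][0] == chrom and start < merged[-1][2]:
            previous = merged[-1]
            # Extend the previous call; max() handles calls nested inside it.
            merged[-1] = (chrom, previous[1], max(previous[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


passed = [("chr1", 450, 900), ("chr1", 100, 500), ("chr1", 1200, 1500), ("chr2", 100, 200)]
print(merge_calls(passed))
# [('chr1', 100, 900), ('chr1', 1200, 1500), ('chr2', 100, 200)]
```

Here four passed calls become three merged calls, which mirrors the difference between the "Passed" and "Merged" counts reported in the tables.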
After the filtering and merging, the detected calls range in length from 17 kb to 33.7 Mb, which is the detection range of the NGS pipeline when applied to the four available test samples. The actual detection limits can lie below and above these numbers: it may be possible to detect regions smaller than 17 kb and larger than 33.7 Mb, but this cannot be shown with the data available. The detection limit depends on the size of the window that is used, which in turn depends on the coverage of the control sample (in this case around 4000 bases per window, assuming a uniform distribution of the reads). The lower detection limit lies somewhere between 4 and 17 kb, as it is not possible to detect regions smaller than one window.
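The relation between a window size given in reads and its span in bases can be made concrete with a small calculation. The sketch below assumes uniformly distributed reads, as the text does, and a read length of 50 bases; that read length is a hypothetical value chosen for illustration and is not stated in this section.

```python
def window_span_bases(window_reads: int, coverage: float, read_length: int) -> float:
    """Expected genomic span (in bases) of a window holding `window_reads` reads.

    With uniformly distributed reads, coverage = reads * read_length / genome_size,
    so on average one read starts every read_length / coverage bases.
    """
    if window_reads <= 0 or coverage <= 0 or read_length <= 0:
        raise ValueError("all arguments must be positive")
    return window_reads * read_length / coverage


READ_LENGTH = 50  # hypothetical read length in bases (assumption, not from the text)

for coverage, window in [(1.3, 100), (4.5, 100), (4.5, 350), (4.5, 700)]:
    span = window_span_bases(window, coverage, READ_LENGTH)
    print(f"control {coverage}x, W={window}: about {span:.0f} bases per window")
```

Under this assumption, W = 100 at a control coverage of 1.3x spans roughly 3.8 kb, close to the 4000 bases quoted above. At 4.5x the same W shrinks to about 1.1 kb, and W = 350 brings the span back to about 3.9 kb.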

Table 3. Results of the rule-filter applied to the regions detected from the NGS data, compared with the manually annotated set of calls. The manual annotation split the combined NGS and aCGH calls into three classes: a positive set (actual CNVs), a negative set (no CNVs, or too low coverage), and an undetermined set (hard to determine the class). The numbers in the columns Pos. (positive), Neg. (negative) and Und. (undetermined) are the detected regions assigned to the corresponding set. All calls generated by aCGH and NGS are assigned to a class. The Total row displays the total number of calls and how these calls are manually annotated. The Passed row shows how many calls of the three sets pass the filter for this sample, the Deleted row the calls that do not pass the filter. The total number of calls after rule-filtering and merging is given in the first column.

| Sample | Merged | Passed (count) | Deleted (count) |
|---|---|---|---|
| Sample 1 | 30 | 43 (71.7%) | 17 (28.3%) |
| Sample 2 | 23 | 30 (58.8%) | 21 (41.2%) |
| Sample 3 | 43 | 71 (83.5%) | 14 (16.5%) |
| Sample 4 R1 | 40 | 43 (75.4%) | 14 (24.6%) |

### Determining the minimal coverage

To determine the minimal coverage at which the NGS data still allow reliable detection of aberrant regions, we set up a simulation in which the coverage is artificially reduced before applying the detection tools. This is done using samples 3 and 4, of which sample 4 was sequenced in two different runs (sample 4 R1 and sample 4 R2). The original coverages of 3.6x, 1.2x and 1.0x, respectively, are reduced by uniform sampling, in steps of two, to a coverage between 0.1 and  . The resulting calls are analyzed and the observations can be found in table 4. The lower coverages show an increase in the total number of calls (increase in false positives).
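Reducing coverage by uniform sampling can be sketched as keeping every read independently with a fixed probability. The code below is an illustration only: the function name, the use of integers as stand-ins for aligned reads, and the modelling of the "a" and "b" replicates of table 4 as two different random seeds are assumptions.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def downsample_reads(reads: Sequence[T], original_coverage: float,
                     target_coverage: float, seed: Optional[int] = None) -> List[T]:
    """Keep each read independently with probability target / original coverage."""
    if not 0 < target_coverage <= original_coverage:
        raise ValueError("target coverage must be positive and at most the original coverage")
    keep_probability = target_coverage / original_coverage
    rng = random.Random(seed)  # a fixed seed makes a replicate reproducible
    return [read for read in reads if rng.random() < keep_probability]


reads = list(range(200_000))  # stand-ins for aligned reads
replicate_a = downsample_reads(reads, 3.63, 0.45, seed=1)
replicate_b = downsample_reads(reads, 3.63, 0.45, seed=2)
print(len(replicate_a) / len(reads), len(replicate_b) / len(reads))
```

Both replicates retain close to 0.45 / 3.63, or about 12%, of the reads, but they contain different reads, which is why two replicates at the same coverage can give different call counts.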
As the same control sample is used, the positioning of the windows and the window size are exactly the same for all runs. There does not seem to be a large decrease in the quality of the output (the number of false negatives) for a coverage between 0.1 and 0.35, but a large increase in false positives is observed. Note that a large part of the false positives appear on the X-chromosome, which could be due to the use of a female control sample while samples 3 and 4 are male. The clinically relevant CNVs (i.e. those annotated as Type II to IV) are found for all coverages displayed in table 4, with the exception of two calls in sample 3 that were also missed in the previous experiments (see the supplement for a description of the missed CNVs).

The lowest coverage tested in the previous experiment is 0.43x after the mapping and pre-processing steps. Lower coverage could also be possible: the reduction experiments show that a reasonably low number of false positives is still encountered at lower coverage. There seems to be a turning point for sample 3, sample 4 R1 and sample 4 R2 between , and respectively. This turning point is determined from the number of calls after merging, which is kept below 50. From this reduction experiment we conclude that the lowest coverage that can be used for the detection of CNVs with reduced NGS data lies at 0.23x.

### The coverage for the control sample should be close to the coverage of the test sample

Sample 4 and the control sample are sequenced in multiple runs (see figure 8 for more details), which allows us to compare the results of two independent runs. Table 5, first row, shows the results: the performance for both runs of sample 4 is more or less the same. Even when the data of the two runs are combined, the performance stays the same.
We also analyzed the runs using all measurements of the combined control sample, A R1,2,3 (which increases the control sample coverage to 4.5x). The result is shown in the second row of table 5: the number of false positives increases strongly. This is probably because the window size is no longer appropriate, as it is expressed in the number of reads aligned per window and the coverage of the control sample increased drastically. We therefore tried two alternative settings for the window size, 350 and 700. The number of false positives does drop, but the number of true positives drops as well, so the performance is ultimately not improved by using the combined data for the control sample. The number of merged calls (last column in table 5) also shows that combining data for the control sample does not help: the number of merged calls is closest to the expected number of calls (somewhere between 30 and 50, see section 2.4) when the coverage of the control sample is the same as that of the test sample.

### Multiplexing test and control sample

As multiple factors can influence the sequencing process, we expect the best results when the test sample and the control sample are sequenced in the same lane (multiplexing; for details see figure 8). The settings in which the corresponding test and control samples are multiplexed are colored red in table 6. We expect multiplexing to reduce biases caused by differences in sequencing conditions, such as differences in temperature or in the chemical composition of the genetic material. Other bias can result from GC-content: GC-rich regions are known to influence the number of reads that are sequenced in a region [Benjamini and Speed, 2012].
In table 6, sample 4 R2 shows the exact opposite of what we expect: a higher number of false negatives is observed than when using a control sample from another sequencing run (so no multiplexing). Sample 4 R1 does show the expected results: a low number of false negatives and a relatively low number of false positives when the control sample is multiplexed with the test sample. As stated before, the numbers of TP, FP and FN are based on the aCGH analysis and are not very reliable. Looking at the results of the rule-filtering, for both samples the least data is deleted when using control sample A R2, and the lowest number of calls is passed. After merging, the number of calls for sample 4 R2 in combination with control sample A R2 is very low, even lower than the number of aCGH calls. We conclude that the advantages of multiplexing test and control samples cannot be derived directly from table 6. It seems that the use of a non-multiplexed control sample can perform as well as a multiplexed sample. As the extent of the bias resulting from sequencing in different runs is not yet known in detail, more data should be generated to determine whether test and reference sample need to be multiplexed.

Table 4. Detection results for samples 3 and 4 when the coverage is artificially reduced by uniform sampling. A R1 is used as control sample (coverage of 1.3x). The X: value states the number of false positives that are called on the X-chromosome. The total number of calls made on the aCGH can be calculated by adding the numbers of TP and FN. The sum of TP, FP and FN differs from the sum of the passed and filtered calls, because a TP is counted only once, even if more than one NGS call overlaps an aCGH call. Samples 1 and 2 are added for reference purposes.

Columns in the original table: Coverage, TP, FP (with X: count), FN, Passed, Filtered, Merged.

| Sample | Reduction | Coverage | FP on X-chromosome (X:) |
|---|---|---|---|
| Sample 3 | No reduction | 3.63x | 8 |
| Sample 3 | Reduction 1a | 0.45x | 7 |
| Sample 3 | Reduction 1b | 0.45x | 9 |
| Sample 3 | Reduction 2a | 0.23x | 11 |
| Sample 3 | Reduction 2b | 0.23x | 14 |
| Sample 3 | Reduction 3a | 0.11x | 45 |
| Sample 3 | Reduction 3b | 0.11x | 45 |
| Sample 4 R1 | No reduction | 1.24x | 7 |
| Sample 4 R1 | Reduction 1a | 0.35x | 11 |
| Sample 4 R1 | Reduction 1b | 0.35x | 13 |
| Sample 4 R1 | Reduction 2a | 0.17x | 16 |
| Sample 4 R1 | Reduction 2b | 0.17x | 18 |
| Sample 4 R1 | Reduction 3a | 0.09x | 39 |
| Sample 4 R1 | Reduction 3b | 0.09x | 42 |
| Sample 4 R2 | No reduction | 1.05x | 7 |
| Sample 4 R2 | Reduction 1a | 0.26x | 22 |
| Sample 4 R2 | Reduction 1b | 0.26x | 28 |
| Sample 4 R2 | Reduction 2a | 0.13x | 32 |
| Sample 4 R2 | Reduction 2b | 0.13x | 26 |
| Sample 4 R2 | Reduction 3a | 0.07x | 53 |
| Sample 4 R2 | Reduction 3b | 0.07x | 42 |

Reference rows: Sample x (X: 1); Sample x (X: 0).

Table 5. The effect of the change in coverage of test and control sample. As sample 4 is sequenced in two runs and the control sample in three runs (see figure 8 for details), the data can be used separately and merged to see the effect of the changing coverage on the results. Combinations are made using the two runs of sample 4, control sample A R1 and control sample A R1,2,3 (see section S2). For control sample A R1,2,3 three different window sizes are used.

Columns in the original table: TP, FP, FN, Passed, Filtered, Merged.

| Control | W | Sample | Passed (%) | Filtered | Merged |
|---|---|---|---|---|---|
| A R1 | | Sample 4 R1 | 75.4% | 14 (24.6%) | 31 |
| A R1 | | Sample 4 R2 | 68.5% | 17 (31.5%) | 28 |
| A R1 | | Sample 4 R1,2 | 73.8% | 16 (26.2%) | 34 |
| A R1,2,3 | 100 | Sample 4 R1 | 75.1% | 70 (24.9%) | 161 |
| A R1,2,3 | 100 | Sample 4 R2 | 75.2% | 79 (24.8%) | 171 |
| A R1,2,3 | 100 | Sample 4 R1,2 | 76.6% | 54 (23.4%) | 131 |
| A R1,2,3 | 350 | Sample 4 R1 | 67.0% | 28 (33.0%) | 47 |
| A R1,2,3 | 350 | Sample 4 R2 | 59.3% | 35 (40.7%) | 36 |
| A R1,2,3 | 350 | Sample 4 R1,2 | 65.4% | 27 (34.6%) | 40 |
| A R1,2,3 | 700 | Sample 4 R1 | 64.8% | 19 (35.2%) | 27 |
| A R1,2,3 | 700 | Sample 4 R2 | 56.9% | 22 (43.1%) | 19 |
| A R1,2,3 | 700 | Sample 4 R1,2 | 62.5% | 18 (37.5%) | 22 |

## 3 DISCUSSION

### 3.1 The control group should be genetically diverse

The control sample used in the analyses is of commercial origin and is suitable because it contains a pool of 100 test subjects,

[Figure 9. Schematic of four scoring methods, each comparing an aCGH track with an NGS track and labeling stretches as TP, TN, FP or FN: 1) nucleotide-based method; 2) region-based nucleotide method (TP if overlap > 50%); 3) gene-based method (with gene track; no gene: not considered); 4) region-based method (each call scored once, 1x).]

Fig. 9. Four methods that are used to determine whether NGS calls are right or wrong for each position of the genome. Top panel: the nucleotide-based method. This method considers each nucleotide separately and labels each position with the resulting class. Panel in the 2nd row: the region-based nucleotide method. In this method, the class is determined for the longest possible stretch of overlapping calls. The overlap should be larger than 50% of the smallest call, otherwise the nucleotide-based method is used. Panel in the 3rd row: the gene-based method. This method is similar to the nucleotide-based method, but only considers the part of the genome that lies in genomic regions. The RefSeq gene track from UCSC is used to determine genomic regions (downloaded July 30, 2012). Bottom panel: the region-based method. This method scores regions by counting how many CNV regions overlap. It does not score per nucleotide but per call. This means that the sum of all false and true positives and negatives does not equal the total number of nucleotides in the genome.

### 5.6 Rule-filtering

The regions detected from the NGS data with DWAC-seq are filtered to determine which calls are reliable and which are not, the latter probably not being true copy number changes. Filtering on the reliability of the calls (i.e. keeping only the reliable calls) limits the number of false positives.
The filtering is based on a set of rules derived from manually annotating the aCGH and NGS calls by looking at the signals of the test sample and the control sample. The combined set of aCGH and NGS calls is divided into three classes: a positive set (true CNV), a negative set (false CNV; footnote 3), and an undetermined set (uncertain whether true or false CNV). The undetermined set contains calls that are hard to assign to the positive or negative set because of differences in signal (for example, one part shows an increase in signal, while the rest of the call has a very variable log2 ratio because of low coverage).

Footnote 3: Note that this negative set is slightly different from the negatives obtained when comparing aCGH and NGS calls. Here one of the methods has called this region (positive), but after manual annotation it was decided that there is not enough evidence to call this region.

We looked at several statistics for each class to find distinguishing properties. These statistics included, amongst others: the read-depth of the test sample and the control sample compared to the average signal of the chromosome and of the whole genome; the variance of the log2 ratio compared to the average variance of the whole chromosome and genome; and the variance of the read-depth signal. An important statistic turned out to be the number of empty bins. An empty bin is a bin (a part of the genome, in our case of 1000 bases) that contains no mapped reads; it can be a measure of regions that are unmappable due to repetitive sequences. Regions in the negative class appeared to have many empty bins. The positive set also contains CNVs with empty bins, but these CNVs are homozygous deletions or duplications, with a decent control sample coverage but almost no coverage in the test sample. The rule-filter for the NGS calls then becomes:

- If test and control sample do not contain any empty bins, assign the call to the positive call set.
- If both test and control sample contain empty bins:
  - If the total percentage of empty bins in both test and control sample is less than 20%, assign the call to the positive call set.
  - Otherwise, assign the call to the negative call set.
- If either the test or the control sample contains empty bins, but the other sample contains no empty bins:
  - If the percentage of coverage of the control sample is higher than 50%, assign the call to the positive call set.
  - Otherwise, assign the call to the negative call set.

### 5.7 Merging of calls

Sometimes, if a large aberration is present, DWAC-seq splits this aberration into multiple calls that slightly overlap. After application of the rule-filter, overlapping calls are merged to give a more reliable view of the true number of calls made by the CNV detection tool. In the Results section the effect of the merging is often displayed in a separate merged column, to show the difference between the calls that pass the filter and the actual number of calls after merging.

## ACKNOWLEDGEMENT

We would like to acknowledge Desiree Steenbeek for assistance with all data requests regarding aCGH, Quinten Waisfisz for assistance with the generation of the NGS data, Daoud Sie and Bauke Ylstra for the inspiration ([Smeets et al., 2011]) and Roy Straver for the collaboration on the development of the RETRO filter and inspiring brainstorm sessions.

## REFERENCES

1000 Genomes Project. Sanger ftp 1000 genomes reference, 10th October

Alexej Abyzov, Alexander E. Urban, Michael Snyder, and Mark Gerstein. CNVnator: An approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Research, 21: ,

Agilent Technologies. Agilent scan control 8.5.1, a.

Agilent Technologies. Feature extraction 11.0, b.

Jeffrey A. Bailey, Zhiping Gu, Royden A. Clark, Knut Reinert, Rhea V. Samonte, Stuart Schwartz, Mark D. Adams, Eugene W. Myers, Peter W. Li, and Evan E. Eichler. Recent segmental duplications in the human genome.
Science, 297(5583): , August
Yuval Benjamini and Terence P. Speed. Summarizing and correcting the gc content bias in high-throughput sequencing. Nucleic Acids Research, February
Peter J. Campbell, Philip J. Stephens, Erin D. Pleasance, Sarah O'Meara, Heng Li, Thomas Santarius, Lucy A. Stebbings, Sarah Edkins, Claire Hardy, Jon W. Teague, Andrew Menzies, Ian Goodhead, Daniel J. Turner, Christopher M. Clee, Michael A. Quail, Antony Cox, Clive Brown, Richard Durbin, Matthew E. Hurles, Paul A.W. Edwards, Graham R. Bignell, Michael R. Stratton, and P. Andrew Futreal.

Identification of somatically acquired rearrangements in cancer using genome-wide massively parallel paired-end sequencing. Nature Genetics, 40(6): ,
CERN. Root package. URL root.cern.ch.
K. Darvishi. Application of nexus copy number software for cnv detection and analysis. Current Protocols in Human Genetics, April
Ensembl. Ensembl human genome sequence ftp, January
ENZO Life Sciences, Inc. Product data sheet enz cgh labeling kit for oligo arrays.
Lars Feuk, Andrew R. Carson, and Stephen W. Scherer. Structural variation in the human genome. Nature Reviews Genetics, 7:85-97, February
Arief Gusnanto, Henry M. Wood, Yudi Pawitan, Pamela Rabbitts, and Stefano Berri. Correcting for cancer genome size and tumour cell content enables better estimation of copy number alterations from next-generation sequence data. Bioinformatics, 28(1):40-47,
Henne Holstege. Gebruik van array in prenatale diagnostiek in nederland
John Hunter, Darren Dale, and Michael Droettboom. Matplotlib.
Illumina, Inc. De novo assembly using illumina reads, URL technote_denovo_assembly_ecoli.pdf.
Illumina, Inc. Truseq(TM) rna and dna sample preparation kits v2, April URL datasheets/datasheet_truseq_sample_prep_kits.pdf.
Günter Klambauer, Karin Schwarzbauer, Andreas Mayr, Djork-Arné Clevert, Andreas Mitterecker, Ulrich Bodenhofer, and Sepp Hochreiter. cn.mops: Mixture of poissons for discovering copy number variations in next generation sequencing data. Nucleic Acids Research, 40,
Jan O. Korbel, Alexander E. Urban, Jason P. Affourtit, Brian Godwin, Fabian Grubert, Jan Fredrik Simons, Philip M. Kim, Dean Palejev, Nicholas J. Carriero, Lei Du, Bruce E. Taillon, Zhoutao Chen, Andrea Tanzer, A.C. Eugenia Saunders, Jianxiang Chi, Fengtang Yang, Nigel P. Carter, Matthew E. Hurles, Sherman M. Weissman, Timothy T. Harkins, Mark B. Gerstein, Michael Egholm, and Michael Snyder. Paired-end mapping reveals extensive structural variation in the human genome.
Science, 328: , October
Slavik Koval and Victor Guryev. Dwac-seq dynamic window approach for cnv detection using sequencing tag density, February 10,
User Guide Megapool Reference DNA EA-100M (male) & EA-100F (female). Kreatech Diagnostics, Vlierweg 20, 1032 LG Amsterdam, version 1.0 edition, March
Hugo Y.K. Lam, Michael J. Clark, Rui Chen, Rong Chen, Georges Natsoulis, Maeve O'Huallachain, Frederick E. Dewey, Lukas Habegger, Euan A. Ashley, Mark B. Gerstein, Atul J. Butte, Hanlee P. Ji, and Michael Snyder. Performance comparison of whole-genome sequencing platforms. Nature Biotechnology, 30:78-82, June
H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, R. Durbin, and 1000 Genome Project Data Processing Subgroup. The sequence alignment/map (sam) format and samtools. Bioinformatics, 25:2078-9,
Heng Li and Richard Durbin. Fast and accurate short read alignment with burrows-wheeler transform. Bioinformatics, 25(14): ,
Y.M. Lo, N. Corbetta, P.F. Chamberlain, V. Rai, I.L. Sargent, C.W.G. Redman, and J.S. Wainscoat. Presence of fetal dna in maternal plasma and serum. The Lancet, 350(9076): ,
Daniel Pinkel, Richard Segraves, Damir Sudar, Steven Clark, Ian Poole, David Kowbel, Colin Collins, Wen-Lin Kuo, Chira Chen, Ye Zhai, Shanaz H. Dairkee, Britt-marie Ljung, Joe W. Gray, and Donna G. Albertson. High resolution analysis of dna copy number variation using comparative genomic hybridization to microarrays. Nature Genetics, 20: , October
James T. Robinson, Helga Thorvaldsdóttir, Wendy Winckler, Mitchell Guttman, Eric S. Lander, Gad Getz, and Jill P. Mesirov. Integrative genomics viewer. Nature Biotechnology, 29:24-26,
Serge J. Smeets, Ulrike Harjes, Wessel N. van Wieringen, Daoud Sie, Ruud H. Brakenhoff, Gerrit A. Meijer, and Bauke Ylstra. To dna or not to dna? that is the question, when it comes to molecular subtyping for the clinic! Clinical Cancer Research, 17: , June 2011.
R. Straver, H. Holstege, E.A. Sistermans, C.B.M. Oudejans, and M.J.T.
Reinders. Detection of fetal copy number aberrations by shallow sequencing of maternal blood samples. Unpublished,
Lisenka E.L.M. Vissers, Bert B.A. de Vries, and Joris A. Veltman. Genomic microarrays in mental retardation: from copy number variation to gene, from research to diagnosis. Journal of Medical Genetics, 47: ,
Ruibin Xi, Angela G. Hadjipanayis, Lovelace J. Luquette, Tae-Min Kim, Eunjung Lee, Jianhua Zhang, Mark D. Johnson, Donna M. Muzny, David A. Wheeler, Richard A. Gibbs, Raju Kucherlapati, and Peter J. Park. Copy number variation detection in whole-genome sequencing data using the bayesian information criterion. Proceedings of the National Academy of Sciences, 108(46):E1128-E1136,
Chao Xie and Martti T. Tammi. Cnv-seq, a new method to detect copy number variation using high-throughput sequencing. BMC Bioinformatics, 10(80),
Seungtai Yoon, Zhenyu Xuan, Vladimir Makarov, Kenny Ye, and Jonathan Sebat. Sensitive and accurate detection of copy number variants using read depth of coverage. Genome Research, 19: ,
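
As an illustration of the rule filter for the NGS calls and the merging step of Section 5.7, the following Python sketch shows one possible reading of the rules. It is not the code used in this study. The names `Call`, `passes_rule_filter` and `merge_overlapping` are hypothetical. Two interpretations are assumptions: the "total percentage of empty bins in both test and control sample" is taken as the pooled fraction over both samples, and the "percentage of coverage of the control sample" is taken as the percentage of non-empty control bins within the call. Merging is assumed to join calls on the same chromosome whose coordinates overlap, without distinguishing gains from losses.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Call:
    """A CNV call with per-sample empty-bin counts (hypothetical layout)."""
    chrom: str
    start: int          # first base of the call (inclusive)
    end: int            # last base of the call (inclusive)
    test_empty: int     # number of empty bins in the test sample
    control_empty: int  # number of empty bins in the control sample
    n_bins: int         # number of bins spanned by the call (must be > 0)


def passes_rule_filter(call: Call) -> bool:
    """Return True if the call is assigned to the positive call set."""
    # Rule 1: no empty bins in either sample.
    if call.test_empty == 0 and call.control_empty == 0:
        return True
    # Rule 2: both samples contain empty bins.
    if call.test_empty > 0 and call.control_empty > 0:
        # Assumption: percentage pooled over the bins of both samples.
        pooled = call.test_empty + call.control_empty
        return 100.0 * pooled / (2 * call.n_bins) < 20.0
    # Rule 3: exactly one of the two samples contains empty bins.
    # Assumption: coverage = percentage of non-empty control bins.
    covered = call.n_bins - call.control_empty
    return 100.0 * covered / call.n_bins > 50.0


Interval = Tuple[str, int, int]  # (chromosome, start, end)


def merge_overlapping(intervals: List[Interval]) -> List[Interval]:
    """Merge intervals on the same chromosome that overlap."""
    merged: List[Interval] = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps the previous interval: extend it.
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


def filter_and_merge(calls: List[Call]) -> List[Interval]:
    """Apply the rule filter first, then merge the surviving calls."""
    kept = [(c.chrom, c.start, c.end) for c in calls if passes_rule_filter(c)]
    return merge_overlapping(kept)
```

Under these assumptions, a homozygous deletion with many empty bins in the test sample and none in the control sample passes via the third rule, since the control coverage is 100%. A region with empty bins in both samples is kept only while the pooled percentage stays below 20%.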

