Open Access. CC BY 4.0 license. Published by De Gruyter, September 8, 2023

Ultra-short cell-free DNA fragments enhance cancer early detection in a multi-analyte blood test combining mutation, protein and fragmentomics

  • Fenfen Wang, Xinxing Li, Mengxing Li, Wendi Liu, Lingjia Lu, Yang Li, Xiaojing Chen, Siqi Yang, Tao Liu, Wen Cheng, Li Weng, Hongyan Wang, Dongsheng Lu, Qianqian Yao, Yingyu Wang, Johnny Wu, Tobias Wittkop, Malek Faham, Huabang Zhou, Heping Hu, Hai Jin, Zhiqian Hu, Ding Ma and Xiaodong Cheng

Abstract

Objectives

Cancer morbidity and mortality can be reduced if the cancer is detected early. Cell-free DNA (cfDNA) fragmentomics has emerged as a novel epigenetic biomarker for early cancer detection; however, it is still in its infancy and requires technical improvement. We sought to apply a single-strand DNA sequencing technology to measure genetic and fragmentomic features of cfDNA and to evaluate its performance in detecting multiple cancers.

Methods

Blood samples of 364 patients from six cancer types (colorectal, esophageal, gastric, liver, lung, and ovarian cancers) and 675 healthy individuals were included in this study. Circulating tumor DNA mutations, cfDNA fragmentomic features and a set of protein biomarkers were assayed. Sensitivity and specificity were reported by cancer types and stages.

Results

Circular Ligation Amplification and sequencing (CLAmp-seq), a single-strand DNA sequencing technology, yielded a larger population of ultra-short fragments (<100 bp) than double-strand DNA preparation protocols and revealed a larger size difference between cancer and healthy cfDNA fragments (25.84 bp vs. 16.05 bp). Analysis of the subnucleosomal peaks in ultra-short cfDNA fragments indicates that these peaks are regulatory element “footprints” that correlate with gene expression and cancer stage. At 98 % specificity, a prediction model using ctDNA mutations alone showed an overall sensitivity of 46 %; sensitivity reached 60 % when protein was added and increased further to 66 % when fragmentomics was also integrated. Greater improvements were observed for samples representing earlier cancer stages than for later ones.

Conclusions

These results suggest synergistic properties of protein, genetic and fragmentomics features in the identification of early-stage cancers.

Introduction

Cancer is a leading cause of death worldwide, with overall cancer death rates exceeding two per 1,000 males and one per 1,000 females in some countries [1, 2]. Although overall cancer mortality rates have steadily fallen over the past decades, absolute numbers of cancer deaths continue to rise due to growing and aging populations [3]. Cancer morbidity and mortality can be reduced if the cancer is detected early, before the initial tumor metastasizes [4]. The development of multi-cancer early detection tests could reduce all cancer-related deaths by as much as 26 % [5].

Tumor cells shed proteins, micro-vesicles, RNA, DNA and other molecules into the bloodstream [6], enabling cancer detection by minimally invasive liquid biopsy tests. Although protein biomarkers such as cancer antigen 125, carcinoembryonic antigen, and carbohydrate antigen 19-9 have been used to monitor cancer progression and recurrence, the use of such markers for cancer early detection has been precluded by their poor specificity in the screening setting [7]. Even prostate-specific antigen, which is widely used to screen for prostate cancer, has a high false-positive rate [8]. Circulating tumor DNA (ctDNA) carrying cancer specific mutations has proven to be a remarkable biomarker for cancer treatment selection, cancer prognosis and monitoring [9], [10], [11], [12]. The amount of ctDNA present in the blood varies depending on cancer types, and it correlates well with cancer stages [13]. As recurring cancer mutations encompass only a small portion of the genome, a small panel can be used to effectively capture mutation signals from ctDNA and be part of a multi-analyte test to boost performance for cancer early detection [14], [15], [16], [17].

In recent years, cell-free DNA (cfDNA) fragmentomics, or the analysis of nonrandom cfDNA fragmentation patterns, has emerged as a novel source of biomarkers that offers additional biological insights [18], [19], [20], [21], [22]. Most of these studies used conventional double-stranded DNA (dsDNA) library preparation, which mainly recovers double-stranded cfDNA fragments >100 bp after end repair, and inferred cfDNA fragmentation through analysis of nucleosome occupancy. Single-stranded DNA (ssDNA) library preparation methods have been shown to capture a more complete fragmentome than dsDNA library preparation methods and have uncovered a population of ultra-short cfDNA fragments (<100 bp) derived from subnucleosomal protein binding events [23], [24], [25], [26], [27]. However, most ssDNA library preparation workflows are laborious, involving multiple rounds of ligation and purification, which may lead to significant sample loss [26, 28].

We previously developed an efficient, single-strand based next generation sequencing assay for detecting actionable tumor mutations in ctDNA, called CLAmp-seq (Circular Ligation Amplification and sequencing) [29]. It uses concatemer error correction to remove both sequencing and polymerase errors. CLAmp-seq uses rolling circle amplification (RCA) to make copies of circularized ssDNA molecules before sequencing. Unlike PCR, which can amplify polymerase errors from error-containing copies, RCA creates each replicate independently from the original molecule and hence does not propagate polymerase errors. In addition, CLAmp-seq offers a highly efficient and streamlined ssDNA library preparation workflow that extracts both mutation and fragmentomics information from the same analyte.
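The idea behind concatemer error correction can be illustrated with a small sketch. It assumes that the tandem copies read from one rolling-circle concatemer have already been split and aligned to equal length; the function name `concatemer_consensus` and the `min_agreement` parameter are hypothetical and are not taken from the CLAmp-seq pipeline.

```python
from collections import Counter

def concatemer_consensus(copies, min_agreement=1.0):
    """Collapse tandem copies of one original molecule into a consensus read.

    copies: equal-length strings, one per tandem copy (assumed pre-aligned).
    min_agreement: fraction of copies that must share a base for it to be
    accepted; 1.0 (unanimity) is an illustrative default, not a published value.
    Positions that fail the threshold are masked with 'N'.
    """
    consensus = []
    for column in zip(*copies):
        base, count = Counter(column).most_common(1)[0]
        consensus.append(base if count / len(column) >= min_agreement else "N")
    return "".join(consensus)

# Because each copy is made independently from the original circle,
# a polymerase or sequencing error tends to appear in only one copy.
print(concatemer_consensus(["ACGT", "ACGT", "ACTT"]))
```

With the default threshold, the third position is masked because one copy disagrees; a true variant would appear in every copy and survive.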

In this study, we explored a highly sensitive and specific multi-analyte blood test that incorporates cfDNA and protein for cancer screening. We co-opted CLAmp-seq for the ctDNA mutation and cfDNA fragmentomic analysis, demonstrating that this method not only detects cancer mutations at high sensitivity and specificity, but also preserves short cfDNA fragments that carry unique cancer signals, which are not observed with dsDNA library preparation. Overall, our proof-of-concept data highlight the utility of cfDNA fragmentomics and mutation detection by CLAmp-seq in combination with protein markers for multi-cancer early detection.

Materials and methods

CLAmp-seq

Plasma cfDNA sequencing was performed using CLAmp-seq as described previously [29]. Briefly, single-stranded cfDNA molecules were denatured and circularized by intra-molecular ligation, followed by RCA using primers for target genes. The CLAmp-seq libraries were processed at the Shanghai Yunsheng Medical Laboratory Co. Ltd and sequenced on MGISEQ-2000 (MGI Tech Co. Ltd, Shenzhen, P.R. China) in single-end 400 bp mode.

Variant calling, filtering, and annotation

Variant call was described previously [29]. Briefly, the reads were aligned to the human reference genome (hg19) and the following three criteria were used when calling a variant: (1) the base call is different from the reference at the position of interest; (2) the difference is consistent between the tandem copies of the fragment sequence; and (3) the difference is supported by at least two molecules. Potential false positive variants could be further filtered out based on a baseline model to increase specificity of variant calling. A position-specific background model of how frequently non-reference alleles were observed in controls was built based on the healthy samples in the “baseline” group. Scoring for each variant call was then performed based on comparison to the background model. Conceptually, higher scores would be assigned if a non-reference allele was observed with sufficient coverage at a position where that allele was rarely observed in controls.
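The three calling criteria above can be sketched as follows. This is only an illustration under stated assumptions: the input layout (one tuple per molecule and position, listing the base seen in each tandem copy) and the function name `call_variants` are hypothetical, and the background-model scoring step is not shown.

```python
from collections import defaultdict

def call_variants(observations, reference, min_molecules=2):
    """Apply the three calling criteria to per-molecule observations.

    observations: iterable of (molecule_id, position, copy_bases), where
        copy_bases lists the base read at `position` in each tandem copy
        of one concatemer (hypothetical input layout).
    reference: dict mapping position -> reference base.
    Returns {(position, alt_base): number of supporting molecules}.
    """
    support = defaultdict(set)
    for molecule_id, position, copy_bases in observations:
        if not copy_bases:
            continue
        # Criterion 2: the tandem copies must agree with each other.
        if len(set(copy_bases)) != 1:
            continue
        base = copy_bases[0]
        # Criterion 1: the base must differ from the reference.
        if base == reference[position]:
            continue
        support[(position, base)].add(molecule_id)
    # Criterion 3: at least `min_molecules` distinct molecules support the call.
    return {key: len(mols) for key, mols in support.items()
            if len(mols) >= min_molecules}
```

A candidate seen consistently in two molecules is reported, whereas one that is inconsistent across copies or supported by a single molecule is dropped.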

White blood cell (WBC) sequencing of samples with positive mutation scores allowed us to filter out somatic mutations from clonal hematopoiesis of indeterminate potential. If a cfDNA variant was also observed in genomic DNA (gDNA) of WBCs, it was kept for subsequent analyses only if its allelic fraction (AF) in cfDNA was at least 10-fold higher than its AF in WBCs. The remaining variants were annotated with a custom pipeline, which compared the variants to those in the Single Nucleotide Polymorphism Database (dbSNP) and TCGA. These were then used as features for predictive modeling, as described in the next section.
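The 10-fold allelic fraction rule can be written as a one-line check. The sketch below assumes that a variant not observed in WBC gDNA is always kept, which the text implies but does not state outright; the function name and argument names are hypothetical.

```python
def keep_after_wbc_filter(cfdna_af, wbc_af, min_fold=10.0):
    """Decide whether a cfDNA variant survives the WBC (clonal hematopoiesis) filter.

    cfdna_af: allelic fraction of the variant in plasma cfDNA.
    wbc_af: allelic fraction of the same variant in matched WBC gDNA,
        or None/0 if it was not observed there.
    """
    if not wbc_af:
        # Assumption: variants absent from WBC gDNA are kept.
        return True
    # Keep only if the cfDNA AF is at least `min_fold` times the WBC AF.
    return cfdna_af >= min_fold * wbc_af

print(keep_after_wbc_filter(0.02, 0.001))   # cfDNA AF is 20-fold higher
print(keep_after_wbc_filter(0.005, 0.001))  # only 5-fold higher
```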

Predictive modeling

The cancer samples were divided into two groups: training and validation. Cancer types comprised colorectal cancer, gastric cancer, hepatocellular carcinoma, and lung cancer. The healthy samples were divided into three groups: baseline, training, and validation. The baseline group was necessary for establishing the position-specific background mutation model as described in the previous section. The training and validation groups were used for modeling and validating the models.

The predictive model was first built based on a training set of samples. The training set was randomly selected from the set of all samples, with the positive and control samples being age matched. Each sample was annotated with features derived based on the mutation, protein, fragmentomics sequencing, and fragment size data. Among the features, we considered whether a sample had variants previously reported in dbSNP, TCGA, and the Catalogue Of Somatic Mutations In Cancer databases. Additionally, the normalized depth of fragments with multiple size ranges was estimated for each site. A set of 263 samples was used to train, parameterize, and evaluate the robustness of a model using these features based on our platform. An independent set of 258 samples was used to test and estimate the performance of our final classifier.

Initial classification using protein and mutation information was performed using logistic regression. Protein and mutation features were standardized using the mean and standard deviation derived from the training data alone. The same standardization and trained model were then applied to the validation set. For the final classifier, we combined the prediction probabilities from the logistic regression with fragmentomics features on the subset of data for which we collected the additional information. We conducted supervised training with a multivariate gradient boosting classifier, an ensemble method that combines and optimizes weaker models [30, 31], to differentiate between healthy and cancer samples. The performance of the models was assessed based on receiver operating characteristic (ROC) curves. Feature selection of fragmentomic depths was performed by choosing the feature set that maximized sensitivity at a fixed specificity level. The final model was validated using a separate validation set of samples.
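Two steps in this paragraph lend themselves to a short illustration: standardizing with training-set statistics only, and reading off sensitivity at a fixed specificity. The sketch below is not the authors' code; the function names are hypothetical, the use of the population standard deviation is an assumption, and a sample is called positive when its score exceeds the cutoff.

```python
from statistics import mean, pstdev

def fit_standardizer(train_values):
    """Estimate (mean, sd) from training values only."""
    mu = mean(train_values)
    sd = pstdev(train_values)  # population SD; an assumption of this sketch
    # Guard against constant features; falling back to 1.0 is also an assumption.
    return mu, (sd if sd > 0 else 1.0)

def standardize(values, mu, sd):
    """Apply training-derived statistics to any set (training or validation)."""
    return [(v - mu) / sd for v in values]

def cutoff_at_specificity(healthy_scores, target_specificity):
    """Smallest observed healthy score such that the fraction of healthy
    samples scoring at or below it reaches the target specificity."""
    ordered = sorted(healthy_scores)
    n = len(ordered)
    for cutoff in ordered:
        true_negatives = sum(1 for s in ordered if s <= cutoff)
        if true_negatives / n >= target_specificity:
            return cutoff
    return ordered[-1]

def sensitivity(cancer_scores, cutoff):
    """Fraction of cancer samples scoring above the cutoff."""
    return sum(1 for s in cancer_scores if s > cutoff) / len(cancer_scores)
```

Fitting the standardizer on training data alone avoids leaking information from the validation set into the model, which is why the same `mu` and `sd` are reused for validation samples.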

Results

Performance of mutation and protein for cancer detection

Blood samples were collected from a multi-cancer cohort representing six cancer types (colorectal, esophageal, gastric, liver, lung, and ovarian cancers) and healthy individuals (Supplemental Table S1). A total of 1,039 subjects’ blood samples were evaluated in this study: 364 with untreated cancer and 675 age-matched healthy controls (Figure 1). Patient characteristics are described in Table 1. cfDNA samples from 120 healthy controls were used for establishing baseline for mutation detection (Figure 1). The remaining 919 participants were randomly assigned into two groups: a training set (n=444) of 185 healthy controls and 259 cancer patients (stage I: 32.43 %; stage II: 25.87 %; stage III: 33.98 %; stage IV: 7.72 %) and an independent validation set (n=475), including 105 cancer patients (stage I: 33.33 %; stage II: 27.62 %; stage III: 29.52 %; stage IV: 9.52 %) and 370 healthy subjects. Training and validation sets were generally comparable with respect to age, gender, stage, and cancer type (Table 1).

Figure 1: 
Design of the study scheme. For the mutation and protein test model, enrolled healthy subjects were randomly assigned into baseline, training and validation set. Patient samples from six different cancer types across stage I–IV were randomly split into training and validation sets. For testing the performance of fragmentomics in combination with protein and mutation, a subset of the healthy and cancer samples was randomly picked and split into training and validation sets. For cancer patients the blood was collected before surgery.

Table 1:

Patient characteristics of the multi-cancer cohort.

Columns give healthy controls and cancer patients for all subjects (n=919), the training cohort (n=444) and the validation cohort (n=475). Rows for cancer stage and cancer type apply to cancer patients only.

Variable | All: healthy controls (n=555) | All: cancer patients (n=364) | Training: healthy controls (n=185) | Training: cancer patients (n=259) | Validation: healthy controls (n=370) | Validation: cancer patients (n=105)
Gender, n (%)
 Female | 320 (57.66) | 140 (38.46) | 106 (57.30) | 96 (37.07) | 214 (57.84) | 44 (41.90)
 Male | 235 (42.34) | 224 (61.54) | 79 (42.70) | 163 (62.93) | 156 (42.16) | 61 (58.10)
Age, years
 Mean (SD) | 56.46 (9.36) | 59.82 (11.66) | 57.83 (9.24) | 59.81 (11.60) | 55.77 (9.36) | 59.85 (11.87)
 <50, n (%) | 96 (17.30) | 61 (16.76) | 23 (12.43) | 44 (16.99) | 73 (19.73) | 17 (16.19)
 50≤Age<60, n (%) | 269 (48.47) | 115 (31.59) | 91 (49.19) | 81 (31.27) | 178 (48.11) | 34 (32.38)
 60≤Age<70, n (%) | 149 (26.85) | 105 (28.85) | 53 (28.65) | 81 (31.27) | 96 (25.95) | 24 (22.86)
 ≥70, n (%) | 41 (7.39) | 83 (22.80) | 18 (9.73) | 53 (20.46) | 23 (6.22) | 30 (28.57)
Cancer stage, n (%)
 I | - | 119 (32.69) | - | 84 (32.43) | - | 35 (33.33)
 II | - | 96 (26.37) | - | 67 (25.87) | - | 29 (27.62)
 III | - | 119 (32.69) | - | 88 (33.98) | - | 31 (29.52)
 IV | - | 30 (8.24) | - | 20 (7.72) | - | 10 (9.52)
Cancer type, n (%)
 Colorectal cancer | - | 69 (18.96) | - | 56 (21.62) | - | 13 (12.38)
 Esophageal cancer | - | 42 (11.54) | - | 27 (10.42) | - | 15 (14.29)
 Gastric cancer | - | 83 (22.80) | - | 57 (22.01) | - | 26 (24.76)
 Hepatocellular carcinoma | - | 57 (15.66) | - | 42 (16.22) | - | 15 (14.29)
 Lung cancer | - | 65 (17.86) | - | 44 (16.99) | - | 21 (20.00)
 Ovarian cancer | - | 48 (13.19) | - | 33 (12.74) | - | 15 (14.29)
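As a reading aid, the cohort sizes reported in the text and in Table 1 can be cross-checked with a few lines of arithmetic. The counts below are copied from the paper; the variable names and the helper `percent` are only illustrative.

```python
# Counts as reported in the text and in Table 1.
total_subjects = 1039
cancer_patients = 364
healthy_controls = 675
baseline_healthy = 120          # used only for the background mutation model
training = {"healthy": 185, "cancer": 259}
validation = {"healthy": 370, "cancer": 105}
stage_counts_training = {"I": 84, "II": 67, "III": 88, "IV": 20}

def percent(count, total):
    """Percentage rounded to two decimals, as in Table 1."""
    return round(100 * count / total, 2)

# Subjects remaining after the baseline group is set aside.
remaining = total_subjects - baseline_healthy
print(remaining, sum(training.values()), sum(validation.values()))
print({stage: percent(n, training["cancer"])
       for stage, n in stage_counts_training.items()})
```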

cfDNA was extracted from at least 3 mL of plasma samples. The median concentration of plasma cfDNA from cancer patients was significantly higher than that observed in healthy individuals (p<0.001, Supplemental Figure S1A). Among the six cancer types, concentration of plasma cfDNA was highest in liver cancer and lowest in esophageal cancer (Supplemental Figure S1B).

cfDNA was sequenced with the 18-gene panel using CLAmp-seq as described, an approach that has been shown to detect variants with high sensitivity and specificity [29]. For plasma samples in which variants were detected, matched gDNA from WBCs was sequenced to filter out somatic mutations from clonal hematopoiesis of indeterminate potential (Supplemental Table S1). The protein biomarkers were measured on the Roche Cobas and Millipore Luminex platforms using a total of 0.5 mL of each plasma sample. Logistic regression-based prediction models were trained on mutation and protein data from the training set and tested on the validation set (Supplemental Table S1). The model achieved consistent performance between the training and validation sets. The ROC curve showed an area under the curve (AUC) of 89.2 % (95 % CI: 86.9–91.5 %) (Figure 2A). The observed specificity of the model was similar between the training (98.92 %, 95 % CI: 96.15–99.87 %) and validation sets (97.57 %, 95 % CI: 95.43–98.88 %). Sensitivity was also consistent between the training and validation sets in all cancer types. Overall, the test had 60.71 % sensitivity (95 % CI: 55.49–65.76 %) at 98.02 % specificity (95 % CI: 96.48–99.01 %) (Figure 2B). Sensitivities by cancer stage were generally comparable between the training and validation sets (Figure 2C). In the entire cohort, the mutation plus protein model detected 35.29 % (95 % CI: 26.76–44.58 %) of stage I cancers, 67.71 % (95 % CI: 57.39–76.90 %) of stage II cancers, 73.95 % (95 % CI: 65.11–81.56 %) of stage III cancers, and 86.67 % (95 % CI: 69.28–96.24 %) of stage IV cancers. Sensitivities by cancer type ranged from 38.46 % (lung cancer) to 96.49 % (hepatocellular carcinoma) at 98.02 % specificity (Figure 2D). Sensitivity of individual cancer types by stage is shown in Supplemental Figure S2.

Figure 2: 
Performance of the combined ctDNA mutation and protein model for early cancer detection. (A) Receiver operating characteristic (ROC) curve of the multi-cancer cohort. (B) Comparison of performance for different datasets. (C) Sensitivities of the training and validation sets by stage. (D) Sensitivity of the whole cohort by cancer type. HCC, hepatocellular carcinoma; OC, ovarian cancer; CRC, colorectal cancer; EC, esophageal cancer; GC, gastric cancer; LC, lung cancer.

Figure 3A shows the sensitivity for each cancer type by stage. Most of the cancers in the lung cancer subgroup were stage I and stage II cancers (72.3 %) and were detected with relatively low sensitivity, prompting us to explore test performance by lung cancer subtype. Early-stage lung squamous cell carcinoma (LSCC) has been reported to be more easily detected than early-stage lung adenocarcinoma (LUAD) [12, 32]. The ctDNA mutation detection rate was lower in LUAD (28.57 %) than in LSCC and other lung cancer types, including small cell lung cancer (60.87 %, p=0.011) (Figure 3B, Supplemental Table S2). The relatively low performance in lung cancer is likely due to the predominance of LUAD (64.6 %) in our lung cancer cohort, as sensitivity was lower for LUAD than for LSCC and other lung cancer types (28.57 vs. 56.52 %, p=0.027, Figure 3C, Supplemental Table S2).

Figure 3: 
Sensitivity of individual cancer type by stage. (A) Sensitivity (the dot) at 98.02 % specificity of the entire cohort with 95 % confidence intervals (the line) is reported by stage and cancer types. (B) Circulating tumor DNA (ctDNA) mutation detection rate of different histological types of lung cancer. (C) Sensitivities of different histological types by the ctDNA mutation and protein model. The numbers in parentheses below the X-axis represent the sample sizes of the datasets. LUAD, lung adenocarcinoma; LSCC, lung squamous cell carcinoma; SCLC, small cell lung cancer.

CLAmp-seq captures subnucleosomal cfDNA fragments in regulatory region

CLAmp-seq is a single-strand library preparation method that can capture both double-stranded and single-stranded populations of cfDNA. To compare the cfDNA fragment profile recovered by CLAmp-seq with that recovered by a conventional double-stranded library preparation method, we sequenced two samples, one from a healthy control and the other from a patient with advanced colorectal cancer, by 30× whole-genome sequencing (WGS) with both methods. With the healthy control sample, both CLAmp-seq and dsDNA library sequencing yielded a characteristic dominant peak at ∼167 bp (Figure 4A), corresponding to the length of DNA wrapped around a single nucleosome (147 bp) plus a 20 bp linker [33]. CLAmp-seq detected a greater proportion of small fragments than the dsDNA method, in particular fragments shorter than 100 bp, which are largely missing from the dsDNA library (Figure 4A). As a whole, the control size profile captured by CLAmp-seq resembles those captured by other ssDNA methods that used a similar column-based cfDNA extraction protocol [24, 27, 34, 35]. In the cancer sample, although both the dsDNA method and CLAmp-seq revealed a shift toward smaller fragments compared with the control sample, this shift was much more pronounced with CLAmp-seq (Figure 4A). The average size difference between the control and cancer samples was 16.05 ± 0.13 bp with the double-stranded protocol and 25.84 ± 0.06 bp with the CLAmp-seq protocol. Overall, these data suggest that CLAmp-seq captures a more comprehensive cfDNA profile than dsDNA library preparation and reveals a larger size difference between cancer and healthy cfDNA fragments.
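A mean size difference of this kind can be computed from fragment length histograms. The sketch below is only an illustration: the histograms are invented toy values, not data from the study, and the function names are hypothetical.

```python
def mean_fragment_size(size_counts):
    """Mean fragment length from a {length_bp: read_count} histogram."""
    total = sum(size_counts.values())
    return sum(size * n for size, n in size_counts.items()) / total

def short_fraction(size_counts, max_size=100):
    """Fraction of fragments strictly shorter than `max_size` bp."""
    total = sum(size_counts.values())
    return sum(n for size, n in size_counts.items() if size < max_size) / total

# Toy histograms (invented for illustration only).
healthy = {167: 8, 100: 2}
cancer = {167: 5, 80: 5}

size_shift = mean_fragment_size(healthy) - mean_fragment_size(cancer)
print(round(size_shift, 2), short_fraction(cancer))
```

The same two summaries, applied to libraries prepared with different protocols, give the kind of comparison reported above (a larger shift and a larger ultra-short fraction when short fragments are retained).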

Figure 4: 
cfDNA fragmentomic features detected by CLAmp-seq. (A) Fragment size distribution of cfDNA samples prepared with CLAmp-seq and double-stranded library. Frequency is the fraction of reads for each fragment size relative to all reads for each sample. Solid line: Fragment size distribution of a representative subject with CRC prepared with single-stranded library (orange) and double-stranded library (blue). Dashed line: Fragment size distribution of a healthy subject (blue) and a subject with late-stage CRC (orange) prepared with single-stranded library. (B) Peak annotation in small fragments. Left: Genomic distribution; right: Colocalization with transcription factors (TFs) and histone modifications. (C) Median normalized coverage for each gene group based on gene expression rank in whole blood. (D) Average fragment frequency profiles over 253 healthy samples and 18, 42, 38, and 4 samples from patients with stage I–IV CRC, respectively.

Given that CLAmp-seq detected a substantial proportion of cfDNA fragments shorter than 120 bp in both the cancer and control samples, we further investigated the biological functions of these subnucleosomal fragments. We analyzed cfDNA WGS data from 153 healthy donors at depths ranging from 5 to 20×. We randomized the samples into three groups to avoid batch effects. We then aggregated samples within each group and performed peak calling for ultrashort fragments (size range 70–90 bp) using HOMER (http://homer.ucsd.edu/homer/). We selected peaks that (1) had an FDR<0.05 (see Methods); (2) were present in two or three of the groups; and (3) had a normalized coverage above the top 25 % coverage threshold (coverage cpm=1.80). A total of 287,684 peaks met these criteria. Of these, ∼66 % localized to promoter transcription start site (TSS) regions, 14 % to intronic regions, and 11 % to intergenic regions (Figure 4B). Intersecting these data with the Encyclopedia of DNA Elements (ENCODE, https://www.encodeproject.org/) and Gene Transcription Regulation Database (GTRD, https://gtrd.biouml.org/) databases showed that 50 % of peaks colocalized with both histone modification sites and transcription factor (TF) binding sites as revealed by chromatin immunoprecipitation sequencing (ChIP-seq) data, 35 % colocalized with transcription factor binding sites (TFBSs) alone, 6 % with histone modification sites alone, and 9 % with neither (Figure 4B), indicating that most of the subnucleosomal fragment peaks were associated with regulatory elements.
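The three peak selection criteria can be expressed as a simple filter. This sketch assumes peaks have already been called (for example with HOMER) and summarized into records; the field names `fdr`, `groups_present` and `cpm` are hypothetical and do not correspond to HOMER output columns.

```python
def select_peaks(peaks, fdr_max=0.05, min_groups=2, min_cpm=1.80):
    """Keep peaks that satisfy all three selection criteria.

    peaks: list of dicts with hypothetical keys
        'fdr'            - false discovery rate of the peak call
        'groups_present' - number of sample groups (out of three) with the peak
        'cpm'            - normalized coverage in counts per million
    """
    return [
        p for p in peaks
        if p["fdr"] < fdr_max                 # criterion 1
        and p["groups_present"] >= min_groups # criterion 2
        and p["cpm"] > min_cpm                # criterion 3
    ]

example_peaks = [
    {"id": "p1", "fdr": 0.01, "groups_present": 3, "cpm": 2.5},
    {"id": "p2", "fdr": 0.20, "groups_present": 3, "cpm": 2.5},  # fails FDR
    {"id": "p3", "fdr": 0.01, "groups_present": 1, "cpm": 2.5},  # one group only
    {"id": "p4", "fdr": 0.01, "groups_present": 2, "cpm": 1.2},  # low coverage
    {"id": "p5", "fdr": 0.04, "groups_present": 2, "cpm": 1.9},
]
print([p["id"] for p in select_peaks(example_peaks)])
```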

To determine whether the subnucleosomal fragment peaks correlated with gene expression in whole blood, we narrowed our focus to peaks within a 200 bp window around TSSs. We identified 19,122 genes with peaks within the TSS window and subsequently sorted them based on expression levels in blood obtained from the Genotype-Tissue Expression (GTEx) resource. We observed a strong positive correlation between mean/median subnucleosomal fragment coverage and gene expression level (Figure 4C, R=0.92, p=3.43E−07). This correlation held when the TSS windows were expanded to 10 kb (R=0.87). Collectively, these data suggest that the subnucleosomal cfDNA fragment peaks are regulatory element “footprints” that participate in regulating gene expression.
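The relationship between expression rank and fragment coverage can be sketched as a binning step followed by a correlation. The paper does not state which correlation coefficient or how many gene groups were used, so the Pearson coefficient and the `n_bins` parameter below are assumptions, and both function names are hypothetical.

```python
from math import sqrt
from statistics import median

def binned_median_coverage(genes, n_bins):
    """Sort genes by expression and return the median coverage per bin.

    genes: list of (expression_level, tss_coverage) pairs.
    Bins are equal-sized slices of the expression ranking, lowest first.
    """
    ranked = sorted(genes, key=lambda g: g[0])
    n = len(ranked)
    medians = []
    for i in range(n_bins):
        chunk = ranked[i * n // n_bins:(i + 1) * n // n_bins]
        medians.append(median(cov for _, cov in chunk))
    return medians

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)
```

In use, the bin index (expression rank group) would be correlated with the per-bin median coverage, for example `pearson_r(range(n_bins), binned_median_coverage(genes, n_bins))`.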

Fragmentomic features separate cancer and noncancer samples

The association between subnucleosomal cfDNA fragment peaks and gene expression led us to explore whether these peaks could be used to detect cancer, given that gene expression is often dysregulated in cancer. To test this hypothesis, we sequenced 253 control and 102 colorectal cancer (CRC) cfDNA samples and compared differential fragment coverage around CTCF binding sites, which are ubiquitous in the human genome and participate in gene regulation and chromatin organization. For each individual sample, we independently aggregated the relative depths of small (<80 bp) and large (>80 bp) fragments from the top 5,000 CTCF binding sites based on GTRD. At the center position of the CTCF binding sites, the short (<80 bp) cfDNA fragments displayed a sharp peak, whereas long (>80 bp) fragments displayed a valley (Figure 4D), which is consistent with CTCF protection and a lack of nucleosome occupancy. Interestingly, when we compared the healthy control samples with the CRC samples, the short fragment peaks observed at CTCF binding sites in healthy samples were absent in samples from patients with stage IV CRC (Figure 4D). Short fragment coverage for stage I, II, and III CRC samples was intermediate between that of healthy and stage IV CRC samples (Figure 4D). We also observed subtle differences in the normalized coverage of long fragments between control samples and stage IV CRC samples, but to a much lesser extent than for the short fragment peaks. A similar pattern was observed at other TFBSs, such as those of KDM5B and SRF. These data highlight the fragmentomic signals derived from subnucleosomal fragment peaks that separate cancer and control samples.
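The aggregation of short and long fragment depth around binding sites can be sketched as below. This is a simplified illustration, not the study's pipeline: it assumes a single chromosome, half-open fragment coordinates, raw rather than normalized depth, and it assigns fragments of exactly 80 bp to the long class, a case the text leaves unspecified. The function name `aggregate_depth` is hypothetical.

```python
def aggregate_depth(fragments, site_centers, flank=500, short_cutoff=80):
    """Sum per-base depth of short and long fragments around site centers.

    fragments: list of (start, end) half-open intervals on one chromosome.
    site_centers: list of binding-site center coordinates.
    Returns (short_profile, long_profile), each of length 2 * flank + 1,
    indexed from center - flank to center + flank.
    """
    width = 2 * flank + 1
    short_profile = [0] * width
    long_profile = [0] * width
    for start, end in fragments:
        # Fragments of exactly `short_cutoff` bp go to the long class (assumption).
        profile = short_profile if (end - start) < short_cutoff else long_profile
        for center in site_centers:
            lo = max(start, center - flank)
            hi = min(end, center + flank + 1)
            for pos in range(lo, hi):
                profile[pos - (center - flank)] += 1
    return short_profile, long_profile

short, long_ = aggregate_depth([(99, 102), (0, 100)], [100], flank=2)
print(short, long_)
```

A peak in the short profile at the center index, together with a dip in the long profile, is the pattern described for CTCF sites in healthy samples.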

Fragmentomics improves cancer detection

Having identified fragmentomic differences in small fragment coverage around TFBSs between cancer and noncancer samples, we sought to leverage such features as biomarkers to improve the sensitivity of cancer detection. A total of 1,902 TFBSs that showed differential coverage in subnucleosomal fragment peaks between cancer and healthy cfDNA samples in shallow WGS data were selected for panel design. We sequenced 521 control and cancer cfDNA samples using CLAmp-seq with the TFBS panel and then calculated the normalized fragment depth for multiple size ranges at each TFBS. To test the ability of these metrics to distinguish healthy and cancer samples, we first trained, parameterized, and evaluated the robustness of the model on an initial set of 263 samples (Supplemental Table S3). With this training set, we conducted 10-fold cross-validation to compute features based on normalized depth, as described previously. Features were hierarchically clustered into 20 groups, and those with the highest AUC from each group were selected as inputs for the training classifier. Subsequently, a final model was trained on all samples and tested on an independent validation set of 258 samples (Supplemental Table S3). For this set of samples, mutation and protein achieved the same performance (60 % sensitivity at 98 % specificity, Table 2) as we reported in Figure 2B (60.71 % sensitivity at 98 % specificity). Adding fragmentomics improved the overall sensitivity of the test to 66 % at 98 % specificity, with greater improvements for samples representing earlier cancer stages than for later ones (Table 2).
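The step of picking the highest-AUC feature from each cluster can be sketched as follows. The sketch assumes the hierarchical clustering has already assigned features to groups, and it computes AUC by direct pairwise comparison; the data layout and function names are hypothetical.

```python
def auc(cancer_values, healthy_values):
    """AUC as the probability that a cancer sample scores higher than a
    healthy sample, counting ties as one half."""
    wins = 0.0
    for c in cancer_values:
        for h in healthy_values:
            if c > h:
                wins += 1.0
            elif c == h:
                wins += 0.5
    return wins / (len(cancer_values) * len(healthy_values))

def best_feature_per_group(feature_groups, cancer, healthy):
    """Select the feature with the highest AUC within each cluster.

    feature_groups: dict mapping group id -> list of feature names.
    cancer, healthy: dicts mapping feature name -> list of values per sample.
    """
    return {
        group: max(names, key=lambda f: auc(cancer[f], healthy[f]))
        for group, names in feature_groups.items()
    }
```

Choosing one representative per cluster limits redundancy among correlated depth features before they are passed to the classifier.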

Table 2:

Performance of classifiers using mutation only, mutation and protein, mutation and protein plus fragmentomics signal by cancer stages.

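Sensitivity is reported at two specificity settings, 98 % and 90 %, as determined in the 188 healthy individuals.

Cohort | Stage | Subjects analyzed | 98 % specificity: Mutation | 98 % specificity: Mutation + protein | 98 % specificity: Fragmentomics + mutation + protein | 90 % specificity: Mutation | 90 % specificity: Mutation + protein | 90 % specificity: Fragmentomics + mutation + protein
Cancer patients | I | 18 | 28 % | 28 % | 33 % | 39 % | 44 % | 56 %
Cancer patients | II | 22 | 36 % | 59 % | 68 % | 50 % | 68 % | 68 %
Cancer patients | III | 25 | 56 % | 76 % | 76 % | 72 % | 84 % | 84 %
Cancer patients | IV | 5 | 100 % | 100 % | 100 % | 100 % | 100 % | 100 %
Cancer patients | All | 70 | 46 % | 60 % | 66 % | 59 % | 71 % | 79 %
Healthy individuals | - | 188 | - | - | - | - | - | -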

Discussion

Here, we report a multi-analyte cancer screening test that combines mutation, protein, and fragmentomics signals to maximize sensitivity and specificity. In this study, we leveraged a simple yet highly efficient ssDNA sequencing technology, CLAmp-seq, that is robust and sensitive for both variant detection and cfDNA fragment profiling. By capturing substantially more subnucleosomal (<100 bp) cfDNA fragments than conventional dsDNA methods, CLAmp-seq provided a comprehensive and native fragmentome for predictive modeling. Adding CLAmp-seq fragmentomics to protein and gene mutation improved the detection rate by ∼15 % in patients with early-stage (stage I–II) cancer.

High sensitivity and specificity are requirements for biomarkers for early cancer detection. While somatic mutations are highly specific to cancer cells, detecting somatic mutations in cfDNA from early-stage cancer is challenging due to the low tumor fraction and the high noise introduced by library preparation and sequencing. Cohen et al. showed enhanced mutation detection via a complicated assay workflow that included partitioning each sample into 12 reactions, multiplex PCR based on unique molecular identifiers (UMIs), and a sophisticated algorithm [15]. By contrast, CLAmp-seq features a simple workflow that does not require sample partitioning and can simultaneously remove library preparation and sequencing artifacts through concatemer-based error correction, achieving a false-positive rate as low as 10⁻⁶ [29]. The high specificity of ctDNA mutation signals allows room for the inclusion of additional blood analytes to enhance performance. Several groups have shown that multi-analyte biomarker tests detect cancer with higher sensitivity than their single-analyte counterparts [14], [15], [16], [17]. In the present study, adding a protein biomarker panel to a mutation panel improved overall detection sensitivity from 46 to 60 %, and the addition of cfDNA fragmentomics further improved the sensitivity to 66 %.

Recent applications of single-stranded library preparation to liquid biopsy have increased our understanding of ultrashort fragments in cfDNA [23, 25, 26]. For example, Hudecova et al. discovered a sizable fraction of ∼50 bp cfDNA fragments by using magnetic beads in lieu of columns when extracting DNA from plasma [27]. These ultrashort fragments were associated with G4-rich promoters and were depleted in plasma samples from cancer patients. The data presented in this study, generated using the ssDNA method CLAmp-seq, corroborate and extend these findings, showing a loss of short (<80 bp) fragment coverage not only at G4-rich TSSs but also at TFBSs across the genome in a pan-cancer cohort. Further analysis may yield new insights into the epigenomic footprints present in cfDNA fragments while improving their utility as biomarkers for diagnostic purposes.

Cost should also be considered when developing a screening test [36]. The cost of an early screening test can be minimized by simplifying workflows and testing only the most informative biomarkers from multiple analytes. CLAmp-seq features a simple amplicon workflow that can potentially generate variant and fragment data from a single reaction. Our proof-of-concept study demonstrates that cfDNA fragmentomics can enhance the sensitivity of cancer detection when combined with mutation and protein biomarkers. In addition, fragmentomics and protein markers may provide information for inferring tissue-of-origin in a multi-cancer screening setting. A large cohort for training and validation will be required to fully realize the clinical utility of such a test.


Corresponding authors: Huabang Zhou, Department of Hepatobiliary Medicine, Shanghai Eastern Hepatobiliary Surgery Hospital, No. 225, Changhai Road, Yangpu District, Shanghai 200438, P.R. China, E-mail: ; and Xiaodong Cheng, Gynecological Oncology Department, Women’s Hospital, Zhejiang University School of Medicine, No. 1 Xuebu Road, Hangzhou, Zhejiang 310006, P.R. China; Zhejiang Provincial Key Laboratory of Precision Diagnosis and Therapy for Major Gynecological Diseases, Women’s Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang, P.R. China; Zhejiang Provincial Clinical Research Center for Obstetrics and Gynecology, Hangzhou, P.R. China; and Zhejiang Provincial Key Laboratory of Traditional Chinese Medicine for Reproductive Health Research, Hangzhou, P.R. China, E-mail:

Fenfen Wang, Xinxing Li, Mengxing Li and Wendi Liu contributed equally to this work.


Funding source: the National Natural Science Foundation of China

Award Identifier / Grant number: 82273348, 81672890

Funding source: Clinical Research Incubation Program of Tongji Hospital

Award Identifier / Grant number: ITJ(ZD)2104

Funding source: Key talent introduction project of Tongji Hospital

Funding source: the Key R&D Program of Zhejiang Province

Award Identifier / Grant number: 2019C03010

Funding source: Shanghai Natural Science Foundation Project

Award Identifier / Grant number: 21ZR1458200

Funding source: the National Key R&D Program of China

Award Identifier / Grant number: 2022YFC2704200

Acknowledgments

We are grateful to all the subjects and also appreciate the members of AccuraGen (Shanghai, P.R. China) for their technical support.

  1. Research ethics: This study complied with all relevant national regulations and institutional policies, is in accordance with the tenets of the Helsinki Declaration (as revised in 2013), and was approved by the Institutional Review Boards of Women’s Hospital School of Medicine Zhejiang University (IRB-20230026-R).

  2. Informed consent: The Ethics Committee of Women’s Hospital School of Medicine Zhejiang University approved this retrospective study and waived the requirement for informed consent from subjects.

  3. Author contributions: All authors have accepted responsibility for the entire content of this manuscript and approved its submission. FFW, XXL, MXL, WDL: data acquisition, data curation, formal analysis, investigation, resources, writing – original draft, writing – review and editing. LJL, YL, XJC, SY, TL, WC, HW: data acquisition, writing – review and editing. LW: interpretation and discussion of data, writing – original draft, writing – review and editing. HYW: sequencing, formal analysis, writing – review and editing. DSL: formal analysis, software, writing – review and editing. QQY: formal analysis, visualization, writing – review and editing. YYW, JW, TW: bioinformatics framework, writing – review and editing. MF: concept and design of the study, interpretation and discussion of results, writing – review and editing. HHZ, HPH: conceptualization, interpretation of data, supervision, writing – review and editing. DM: conceptualization, interpretation and discussion of results, writing – review and editing. ZQH, HJ, XDC: conceptualization, supervision, project administration, funding acquisition, writing – review and editing.

  4. Competing interests: Hongyan Wang, Dongsheng Lu and Qianqian Yao are employees of Shanghai YunSheng Medical Laboratory Co., Ltd. Li Weng, Yingyu Wang, Johnny Wu, Tobias Wittkop and Malek Faham are employees of or consultants to AccuraGen Inc. The other authors have no conflicts of interest to declare.

  5. Research funding: This study was supported by the National Key R&D Program of China (2022YFC2704200), the National Natural Science Foundation of China (82273348, 81672890), the Key R&D Program of Zhejiang Province (2019C03010), the Shanghai Natural Science Foundation Project (21ZR1458200), the Key talent introduction project of Tongji Hospital (2021), and the Clinical Research Incubation Program of Tongji Hospital [ITJ(ZD)2104].

  6. Data availability: Due to national legislation, specifically the Administrative Regulations of the People’s Republic of China on Human Genetic Resources, access to raw sequencing data is restricted, and no additional raw sequencing data are publicly available at this time. The raw sequencing data can only be made available following approval from the Ministry of Science and Technology of the People’s Republic of China. The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

References

1. Torre, LA, Siegel, RL, Ward, EM, Jemal, A. Global cancer incidence and mortality rates and trends – an update. Cancer Epidemiol Biomarkers Prev 2016;25:16–27. https://doi.org/10.1158/1055-9965.epi-15-0578.

2. Islami, F, Ward, EM, Sung, H, Cronin, KA, Tangka, FKL, Sherman, RL, et al. Annual report to the nation on the status of cancer, Part 1: national cancer statistics. J Natl Cancer Inst 2021;113:1648–69. https://doi.org/10.1093/jnci/djab131.

3. Islami, F, Siegel, RL, Jemal, A. The changing landscape of cancer in the USA – opportunities for advancing prevention and treatment. Nat Rev Clin Oncol 2020;17:631–49. https://doi.org/10.1038/s41571-020-0378-y.

4. Loud, JT, Murphy, J. Cancer screening and early detection in the 21(st) century. Semin Oncol Nurs 2017;33:121–8. https://doi.org/10.1016/j.soncn.2017.02.002.

5. Hubbell, E, Clarke, CA, Aravanis, AM, Berg, CD. Modeled reductions in late-stage cancer with a multi-cancer early detection test. Cancer Epidemiol Biomarkers Prev 2021;30:460–8. https://doi.org/10.1158/1055-9965.epi-20-1134.

6. van der Pol, Y, Mouliere, F. Toward the early detection of cancer by decoding the epigenetic and environmental fingerprints of cell-free DNA. Cancer Cell 2019;36:350–68. https://doi.org/10.1016/j.ccell.2019.09.003.

7. Duffy, MJ, Diamandis, EP, Crown, J. Circulating tumor DNA (ctDNA) as a pan-cancer screening test: is it finally on the horizon? Clin Chem Lab Med 2021;59:1353–61. https://doi.org/10.1515/cclm-2021-0171.

8. Kilpelainen, TP, Tammela, TL, Roobol, M, Hugosson, J, Ciatto, S, Nelen, V, et al. False-positive screening results in the European randomized study of screening for prostate cancer. Eur J Cancer 2011;47:2698–705. https://doi.org/10.1016/j.ejca.2011.06.055.

9. Lanman, RB, Mortimer, SA, Zill, OA, Sebisanovic, D, Lopez, R, Blau, S, et al. Analytical and clinical validation of a digital sequencing panel for quantitative, highly accurate evaluation of cell-free circulating tumor DNA. PLoS One 2015;10:e0140712. https://doi.org/10.1371/journal.pone.0140712.

10. Woodhouse, R, Li, M, Hughes, J, Delfosse, D, Skoletsky, J, Ma, P, et al. Clinical and analytical validation of FoundationOne Liquid CDx, a novel 324-Gene cfDNA-based comprehensive genomic profiling assay for cancers of solid tumor origin. PLoS One 2020;15:e0237802. https://doi.org/10.1371/journal.pone.0237802.

11. Tie, J, Wang, Y, Tomasetti, C, Li, L, Springer, S, Kinde, I, et al. Circulating tumor DNA analysis detects minimal residual disease and predicts recurrence in patients with stage II colon cancer. Sci Transl Med 2016;8:346ra92. https://doi.org/10.1126/scitranslmed.aaf6219.

12. Abbosh, C, Birkbak, NJ, Wilson, GA, Jamal-Hanjani, M, Constantin, T, Salari, R, et al. Phylogenetic ctDNA analysis depicts early-stage lung cancer evolution. Nature 2017;545:446–51. https://doi.org/10.1038/nature22364.

13. Bettegowda, C, Sausen, M, Leary, RJ, Kinde, I, Wang, Y, Agrawal, N, et al. Detection of circulating tumor DNA in early- and late-stage human malignancies. Sci Transl Med 2014;6:224ra24. https://doi.org/10.1093/neuonc/nou206.24.

14. Imperiale, TF, Ransohoff, DF, Itzkowitz, SH, Levin, TR, Lavin, P, Lidgard, GP, et al. Multitarget stool DNA testing for colorectal-cancer screening. N Engl J Med 2014;371:187–8. https://doi.org/10.1056/nejmoa1311194.

15. Cohen, JD, Li, L, Wang, Y, Thoburn, C, Afsari, B, Danilova, L, et al. Detection and localization of surgically resectable cancers with a multi-analyte blood test. Science 2018;359:926–30. https://doi.org/10.1126/science.aar3247.

16. Cohen, JD, Javed, AA, Thoburn, C, Wong, F, Tie, J, Gibbs, P, et al. Combined circulating tumor DNA and protein biomarker-based liquid biopsy for the earlier detection of pancreatic cancers. Proc Natl Acad Sci USA 2017;114:10202–7. https://doi.org/10.1073/pnas.1704961114.

17. Yang, Z, LaRiviere, MJ, Ko, J, Till, JE, Christensen, T, Yee, SS, et al. A multianalyte panel consisting of extracellular vesicle miRNAs and mRNAs, cfDNA, and CA19-9 shows utility for diagnosis and staging of pancreatic ductal adenocarcinoma. Clin Cancer Res 2020;26:3248–58. https://doi.org/10.1158/1078-0432.ccr-19-3313.

18. Mouliere, F, Chandrananda, D, Piskorz, AM, Moore, EK, Morris, J, Ahlborn, LB, et al. Enhanced detection of circulating tumor DNA by fragment size analysis. Sci Transl Med 2018;10:eaat4921. https://doi.org/10.1126/scitranslmed.aat4921.

19. Ulz, P, Perakis, S, Zhou, Q, Moser, T, Belic, J, Lazzeri, I, et al. Inference of transcription factor binding from cell-free DNA enables tumor subtype prediction and early detection. Nat Commun 2019;10:4666. https://doi.org/10.1038/s41467-019-12714-4.

20. Cristiano, S, Leal, A, Phallen, J, Fiksel, J, Adleff, V, Bruhm, DC, et al. Genome-wide cell-free DNA fragmentation in patients with cancer. Nature 2019;570:385–9. https://doi.org/10.1038/s41586-019-1272-6.

21. Sun, K, Jiang, P, Cheng, SH, Cheng, THT, Wong, J, Wong, VWS, et al. Orientation-aware plasma cell-free DNA fragmentation analysis in open chromatin regions informs tissue of origin. Genome Res 2019;29:418–27. https://doi.org/10.1101/gr.242719.118.

22. Esfahani, MS, Hamilton, EG, Mehrmohamadi, M, Nabet, BY, Alig, SK, King, DA, et al. Inferring gene expression from cell-free DNA fragmentation profiles. Nat Biotechnol 2022;40:585–97. https://doi.org/10.1038/s41587-022-01222-4.

23. Zhu, J, Huang, J, Zhang, P, Li, Q, Kohli, M, Huang, CC, et al. Advantages of single-stranded DNA over double-stranded DNA library preparation for capturing cell-free tumor DNA in plasma. Mol Diagn Ther 2020;24:95–101. https://doi.org/10.1007/s40291-019-00429-7.

24. Hisano, O, Ito, T, Miura, F. Short single-stranded DNAs with putative non-canonical structures comprise a new class of plasma cell-free DNA. BMC Biol 2021;19:225. https://doi.org/10.1186/s12915-021-01160-8.

25. Burnham, P, Kim, MS, Agbor-Enoh, S, Luikart, H, Valantine, HA, Khush, KK, et al. Single-stranded DNA library preparation uncovers the origin and diversity of ultrashort cell-free DNA in plasma. Sci Rep 2016;6:27859. https://doi.org/10.1038/srep27859.

26. Gansauge, MT, Gerber, T, Glocke, I, Korlevic, P, Lippik, L, Nagel, S, et al. Single-stranded DNA library preparation from highly degraded DNA using T4 DNA ligase. Nucleic Acids Res 2017;45:e79. https://doi.org/10.1093/nar/gkx033.

27. Hudecova, I, Smith, CG, Hansel-Hertsch, R, Chilamakuri, CS, Morris, JA, Vijayaraghavan, A, et al. Characteristics, origin, and potential for cancer diagnostics of ultrashort plasma cell-free DNA. Genome Res 2022;32:215–27. https://doi.org/10.1101/gr.275691.121.

28. Wales, N, Caroe, C, Sandoval-Velasco, M, Gamba, C, Barnett, R, Samaniego, JA, et al. New insights on single-stranded versus double-stranded DNA library preparation for ancient DNA. Biotechniques 2015;59:368–71. https://doi.org/10.2144/000114364.

29. Wang, L, Hu, X, Guo, Q, Huang, X, Lin, CH, Chen, X, et al. CLAmp-seq: a novel amplicon-based NGS assay with concatemer error correction for improved detection of actionable mutations in plasma cfDNA from patients with NSCLC. Small Methods 2020;4:1900357. https://doi.org/10.1002/smtd.201900357.

30. Friedman, JH. Greedy function approximation: a gradient boosting machine. Ann Stat 2001;29:1189–232, 44. https://doi.org/10.1214/aos/1013203451.

31. Chen, T, Guestrin, C, editors. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2016. https://doi.org/10.1145/2939672.2939785.

32. Zhang, B, Niu, X, Zhang, Q, Wang, C, Liu, B, Yue, D, et al. Circulating tumor DNA detection is correlated to histologic types in patients with early-stage non-small-cell lung cancer. Lung Cancer 2019;134:108–16. https://doi.org/10.1016/j.lungcan.2019.05.034.

33. Alcaide, M, Cheung, M, Hillman, J, Rassekh, SR, Deyell, RJ, Batist, G, et al. Evaluating the quantity, quality and size distribution of cell-free DNA by multiplex droplet digital PCR. Sci Rep 2020;10:12564. https://doi.org/10.1038/s41598-020-69432-x.

34. Sanchez, C, Snyder, MW, Tanos, R, Shendure, J, Thierry, AR. New insights into structural features and optimal detection of circulating tumor DNA determined by single-strand DNA analysis. npj Genom Med 2018;3:31. https://doi.org/10.1038/s41525-018-0069-0.

35. Troll, CJ, Kapp, J, Rao, V, Harkins, KM, Cole, C, Naughton, C, et al. A ligation-based single-stranded library preparation method to analyze cell-free DNA and synthetic oligos. BMC Genom 2019;20:1023. https://doi.org/10.1186/s12864-019-6355-0.

36. Cheryl Herman, M. What makes a screening exam “Good”? AMA Journal of Ethics 2006;8:34–7. https://doi.org/10.1001/virtualmentor.2006.8.1.cprl1-0601.


Supplementary Material

This article contains supplementary material (https://doi.org/10.1515/cclm-2023-0541).


Received: 2023-05-22
Accepted: 2023-08-21
Published Online: 2023-09-08
Published in Print: 2024-01-26

© 2023 the author(s), published by De Gruyter, Berlin/Boston

This work is licensed under the Creative Commons Attribution 4.0 International License.

Downloaded on 10.5.2024 from https://www.degruyter.com/document/doi/10.1515/cclm-2023-0541/html