Publicly Available Published by De Gruyter September 19, 2022

Diagnosis of hepatocellular carcinoma based on salivary protein glycopatterns and machine learning algorithms

  • Zhen Tang, Fan Zhang, Yuan Wang, Chen Zhang, Xia Li, Mengqi Yin, Jian Shu, Hanjie Yu, Xiawei Liu, Yonghong Guo and Zheng Li

Abstract

Objectives

Hepatocellular carcinoma (HCC) is difficult to diagnose early and progresses rapidly, making it one of the most deadly malignancies worldwide. This study aimed to evaluate whether salivary glycopattern changes combined with machine learning algorithms could help in the accurate diagnosis of HCC.

Methods

First, we used lectin microarrays to detect alterations of salivary glycopatterns in 118 saliva samples. We then constructed diagnostic models for hepatic cirrhosis (HC) and HCC using three machine learning algorithms: Least Absolute Shrinkage and Selection Operator (LASSO), Support Vector Machine (SVM), and Random Forest (RF). Finally, the performance of the diagnostic models was assessed in an independent validation cohort of 85 saliva samples using a series of evaluation metrics, including area under the receiver operating characteristic curve (AUC), accuracy, specificity, and sensitivity.

Results

We identified alterations in the expression levels of salivary glycopatterns in patients with HC and HCC. The results revealed that the glycopatterns recognized by 22 lectins showed significant differences in the saliva of HC and HCC patients and healthy volunteers. In addition, after Boruta feature selection, the best predictive performance was obtained with the RF algorithm for the construction of models for HC and HCC. The AUCs of the RF-HC model and RF-HCC model in the validation cohort were 0.857 (95% confidence interval [CI]: 0.780–0.935) and 0.886 (95% CI: 0.814–0.957), respectively.

Conclusions

Detecting alterations in salivary protein glycopatterns with lectin microarrays combined with machine learning algorithms could be an effective strategy for diagnosing HCC in the future.

Introduction

Liver cancer is among the malignant tumors with the highest mortality worldwide, and its incidence continues to rise. Hepatocellular carcinoma (HCC) accounts for approximately 80% of all cases of liver cancer [1, 2]. Generally, the diagnosis of HCC is made by cytology or histology. Following breakthroughs in understanding HCC-specific radiological characteristics during phasic vascular perfusion of contrast in cross-sectional imaging with computed tomography (CT) and magnetic resonance imaging (MRI), the diagnosis of HCC in cirrhotic patients may now be confirmed reliably without biopsy. However, radiological diagnosis suffers from low specificity in predicting benign lesions [3]. Previous studies have evaluated the diagnostic value of several serum markers for HCC, such as alpha-fetoprotein and des-gamma-carboxy prothrombin. Regrettably, these serum biomarkers may be less sensitive and specific in the highest-risk patients [4], [5], [6]. Therefore, a new diagnostic method that distinguishes between hepatic cirrhosis (HC) and HCC is needed to improve the effectiveness of surveillance.

Glycosylation plays an important role in enabling cancer cells to bypass cell division checkpoints, avoid death signals and immune surveillance, and migrate to metastatic sites, thereby promoting tumor growth [7]. Numerous studies have found a strong link between glycosylation in serum and tumor initiation, progression, and metastasis [8]. Notably, saliva is one of the most complex and versatile body fluids and is considered an important source of biological information for the detection of human diseases [9]. The glycoforms of salivary proteins can dynamically reflect the physiological and pathological conditions associated with many human diseases. Previous studies have suggested that elderly individuals have the strongest resistance to influenza A virus, partly because their saliva presents more terminal α2-3/6-linked sialic acid residues, which bind influenza viral hemagglutinins; this finding is an important reference for studying virus-susceptible populations [10]. In our prior work, we showed by matrix-assisted laser desorption ionization time-of-flight/time-of-flight mass spectrometry that the proportion of fucosylated N-glycans was higher in the HCC group than in the other groups (healthy volunteers [HV] and HC) [11]. However, the enzymatic or chemical stripping of glycans from proteins before mass spectrometry (MS) profiling limits the reliable detection and identification of complete glycans and is time-consuming [12]. In recent years, lectin microarrays have evolved into an effective tool for researching glycosylation due to their high throughput, sensitivity, and efficiency [13], [14], [15], [16].
Furthermore, it is worth mentioning that machine learning techniques use statistical methods to infer relationships between patient attributes and outcomes in large datasets and have been successfully applied for early diagnosis and prognosis monitoring of cancer [17, 18], which inspires us for a fuller and more efficient use of the data generated by lectin microarrays.

The purpose of this study was to measure the expression levels of salivary glycopatterns in 118 saliva samples from HV, HC, and HCC subjects using lectin microarrays, to combine these data with machine learning algorithms to construct diagnostic models, and to evaluate the performance of the models with 85 independent saliva samples. A flow chart of our experimental design is shown in Figure 1.

Figure 1: 
Workflow diagram for constructing diagnostic models based on lectin microarrays and machine learning algorithms.

Materials and methods

Study approval and patient cohorts

The collection and use of all human whole saliva for the research presented here were approved by the Ethical Committee of Northwest University (Xi’an, China) and the Ethical Committee of Second Affiliated Hospital of Xi’an Jiaotong University (Xi’an, China). This study was conducted following the ethical guidelines of the Declaration of Helsinki. Exclusion criteria included current smoking or pregnancy and a history of hypertension, asthma, diabetes, or digestive system diseases other than liver disease. The discovery cohort consisted of 118 saliva samples (HV=35, HC=40, and HCC=43), and a total of 85 independent saliva samples (HV=31, HC=28, and HCC=26) in the validation cohort were used to evaluate the predictive performance of the diagnostic model. All liver tissues were histologically examined, and the diagnoses were confirmed by senior pathologists. Information regarding the clinical samples is summarized in Table 1.

Table 1:

Baseline characteristics of healthy volunteers and patients with hepatic cirrhosis and hepatocellular carcinoma.

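Characteristics | Discovery HV (n=35) | Discovery HC (n=40) | Discovery HCC (n=43) | Discovery total (n=118) | Validation HV (n=31) | Validation HC (n=28) | Validation HCC (n=26) | Validation total (n=85) | p-Value
Age, years | | | | | | | | | 0.45
<50 | 17 | 20 | 11 | 48 (23.65%) | 20 | 13 | 7 | 40 (19.70%) |
≥50 | 18 | 20 | 32 | 70 (34.48%) | 11 | 15 | 19 | 45 (22.17%) |
Sex | | | | | | | | | 1.00
Female | 16 | 11 | 6 | 33 (16.26%) | 13 | 8 | 3 | 24 (11.82%) |
Male | 19 | 29 | 37 | 85 (41.87%) | 18 | 20 | 23 | 61 (30.05%) |
AFP, ng/mL | | | | | | | | | 0.42
<20 | 35 | 37 | 20 | 92 (45.32%) | 31 | 26 | 14 | 71 (34.98%) |
≥20 | 0 | 3 | 23 | 26 (12.81%) | 0 | 2 | 12 | 14 (6.90%) |
HBsAg | | | | | | | | | 0.47
Negative | 35 | 10 | 2 | 47 (23.15%) | 31 | 7 | 1 | 39 (19.21%) |
Positive | 0 | 30 | 41 | 71 (34.98%) | 0 | 21 | 25 | 46 (22.66%) |
Tumor size, cm | | | | | | | | | –
<5 | – | – | 39 | 39 | – | – | 24 | 24 |
≥5 | – | – | 4 | 4 | – | – | 2 | 2 |
Cancer embolus | | | | | | | | | –
Absence | – | – | 0 | 0 | – | – | 2 | 2 |
Presence | – | – | 43 | 43 | – | – | 24 | 24 |
  1. Statistical significance was determined using the Chi-squared test. Percentages in the total columns are relative to all 203 samples. HV, healthy volunteers; HC, hepatic cirrhosis; HCC, hepatocellular carcinoma; AFP, alpha-fetoprotein; HBsAg, hepatitis B surface antigen; –, no statistics.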

Sample collection and preparation

The collection protocol has been described in previous literature [19]. All donors were asked to avoid eating, drinking, smoking, or using oral hygiene products for at least 1 h before sample collection. The whole saliva (about 1 mL) was collected and placed on ice and the Protease Cocktail Inhibitor (Sigma Aldrich, United States) was added to the saliva immediately after collection. The whole saliva was then centrifuged at 10,000 g for 15 min at 4 °C. The supernatant was collected and stored at −80 °C. Before incubation, the salivary proteins were labeled with Cy3 fluorescent dye (GE Healthcare, Buckinghamshire, England) and purified using Sephadex G-25 columns. Subsequently, the Cy3-labeled salivary proteins were quantified and stored at −20 °C in the dark until use.

Lectin microarrays and data pre-processing

The lectin microarrays were produced as previously reported [10]. Briefly, the lectin microarrays were produced using 37 lectins with different binding preferences covering N- and O-linked glycans, and information on all the lectins contained in the lectin microarrays was presented in Supplementary Table S1. Each lectin was spotted in triplicate per block, with quadruplicate blocks on one slide. After immobilization, the slides were blocked with a blocking buffer containing 2% bovine serum albumin (BSA) in 1 × PBS (0.01 mol/L phosphate buffer containing 0.15 mol/L NaCl, pH=7.4) for 1 h and rinsed twice with 1 × PBS. Then the blocked slide was incubated with Cy3-labeled salivary proteins diluted in 0.6 mL of incubation buffer containing 2% (w/v) BSA, 500 mM glycine, and 0.1% Tween-20 in 1 × PBS in the chamber at 37 °C for 3 h in a rotisserie oven set at 4 rpm. After incubation, the microarray was rinsed twice with 1 × PBS containing 0.2% Tween-20 (PBST) for 5 min each and finally rinsed in 1 × PBS before drying.

The microarrays were scanned using the Genepix 4000B confocal scanner (Axon Instruments, United States) set at 70% photomultiplier tube and 100% laser power. The generated images were analyzed at 532 nm for Cy3 detection by Genepix software (version 6.0, Axon Instruments Inc., Sunnyvale, CA). The original microarray data were normalized by using the median normalization method.
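The normalization step can be illustrated with a minimal sketch. The text above states only that the median normalization method was used, so the details here are assumptions: each lectin's signal is taken to be a single background-subtracted value (for example, the mean of its triplicate spots), and it is divided by the median signal of all lectins on the same block. The function name `median_normalize` and the example intensities are hypothetical and are not data from this study.

```python
from statistics import median


def median_normalize(raw_intensities):
    """Divide each lectin's signal by the median signal of the whole block.

    raw_intensities: dict mapping lectin name -> signal intensity
    (assumed here to be one background-subtracted value per lectin).
    """
    block_median = median(raw_intensities.values())
    if block_median <= 0:
        raise ValueError("block median must be positive to normalize")
    return {lectin: value / block_median
            for lectin, value in raw_intensities.items()}


# Hypothetical signals for one block (arbitrary units, illustration only).
example_block = {"HHL": 1200.0, "WFA": 800.0, "AAL": 1000.0,
                 "PNA": 400.0, "WGA": 2000.0}
nfi = median_normalize(example_block)
print(nfi)
```

Dividing by the block median makes intensities comparable across slides that differ in overall brightness, which is the usual motivation for this kind of normalization.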

Statistical comparison

Statistical comparison and visualization were conducted using R software (version 4.2.1). Significant differences were assessed by Kruskal–Wallis with Dunn’s multiple comparisons tests, and a p-value of 0.05 or less was considered statistically significant. In addition, hierarchical clustering was done and the results were plotted as a heatmap.
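For readers unfamiliar with the test, the following is a minimal sketch of a Kruskal–Wallis test followed by Dunn's pairwise comparisons with Bonferroni adjustment. It is written in plain Python rather than R, is restricted to exactly three groups (so the chi-squared p-value with two degrees of freedom has the closed form exp(−H/2)), and uses made-up values; it is not the analysis code of this study, and the function names are hypothetical.

```python
import math
from itertools import combinations


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_dunn(groups):
    """groups: dict name -> list of values. Assumes exactly three groups."""
    names = list(groups)
    pooled = [v for name in names for v in groups[name]]
    n = len(pooled)
    ranks = average_ranks(pooled)
    # Split the pooled ranks back into per-group mean ranks.
    mean_rank, size, start = {}, {}, 0
    for name in names:
        size[name] = len(groups[name])
        mean_rank[name] = sum(ranks[start:start + size[name]]) / size[name]
        start += size[name]
    # Tie term: sum of (t^3 - t) over each set of t tied values.
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    tie_sum = sum(t ** 3 - t for t in counts.values())
    h = 12 / (n * (n + 1)) * sum(size[g] * mean_rank[g] ** 2 for g in names)
    h -= 3 * (n + 1)
    h /= 1 - tie_sum / (n ** 3 - n)
    p_overall = math.exp(-h / 2)  # chi-squared survival function, df = 2
    # Dunn's z-test on mean-rank differences, Bonferroni-adjusted.
    pairs = list(combinations(names, 2))
    dunn = {}
    for a, b in pairs:
        var = (n * (n + 1) / 12 - tie_sum / (12 * (n - 1)))
        var *= 1 / size[a] + 1 / size[b]
        z = (mean_rank[a] - mean_rank[b]) / math.sqrt(var)
        p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
        dunn[(a, b)] = min(1.0, p * len(pairs))
    return h, p_overall, dunn


# Made-up NFIs for one lectin (illustration only).
example = {"HV": [1, 2, 3, 4, 5], "HC": [6, 7, 8, 9, 10],
           "HCC": [11, 12, 13, 14, 15]}
h_stat, p_kw, dunn_p = kruskal_dunn(example)
print(h_stat, p_kw, dunn_p)
```

In this toy example the overall test is significant and only the HV vs. HCC pair remains significant after adjustment, which shows why the pairwise step is reported separately from the overall test.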

Feature selection and model building

Feature selection is an important step in model construction; it reduces the number of variables used in the prediction model by evaluating how performance changes as variables are added or removed [20]. We compared three feature selection methods on the dataset: Least Absolute Shrinkage and Selection Operator (LASSO), Support Vector Machine recursive feature elimination (SVM-RFE), and the Boruta feature selection algorithm [21], [22], [23]. Subsequently, the LASSO, SVM, and Random Forest (RF) algorithms were applied to the selected variables to construct HC and HCC diagnostic models, respectively. All feature selection and model construction steps were carried out in R (version 4.2.1). We used the R package glmnet (version 4.1-4) to apply the LASSO model, the R package e1071 (version 1.7-11) to build the SVM model, and the R packages Boruta (version 7.0.0) and randomForest (version 4.7-1.1) to apply the RF model.
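The idea behind Boruta can be sketched in a few lines: each real feature competes against "shadow" features, which are shuffled copies of the real columns and therefore carry no information about the class. The sketch below is a simplification under stated assumptions, not the Boruta R package: a univariate standardized mean difference is swapped in for random forest importance, and the binomial test that Boruta applies to the hit counts is omitted. All names and data are hypothetical.

```python
import random
from statistics import mean, pstdev


def separation_score(values, labels):
    """Stand-in importance: |mean(class 1) - mean(class 0)| / overall SD."""
    sd = pstdev(values)
    if sd == 0:
        return 0.0
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    return abs(mean(pos) - mean(neg)) / sd


def shadow_feature_hits(features, labels, n_iter=20, seed=0):
    """Count how often each feature beats the best shadow feature."""
    rng = random.Random(seed)
    hits = {name: 0 for name in features}
    for _ in range(n_iter):
        # Shadow features: shuffled copies of the real columns, which
        # destroys any association with the labels.
        shadow_scores = []
        for column in features.values():
            shuffled = list(column)
            rng.shuffle(shuffled)
            shadow_scores.append(separation_score(shuffled, labels))
        threshold = max(shadow_scores)
        for name, column in features.items():
            if separation_score(column, labels) > threshold:
                hits[name] += 1
    return hits


# Toy data: 20 controls (0) and 20 cases (1); illustration only.
labels = [0] * 20 + [1] * 20
features = {
    "informative": [float(i) for i in range(40)],   # separates the classes
    "noise": [float(i % 7) for i in range(40)],     # unrelated to the class
}
hits = shadow_feature_hits(features, labels)
print(hits)
```

A feature that repeatedly outscores the best shadow feature is a candidate for confirmation, while one that rarely does is a candidate for rejection.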

Evaluation of model performance

An independent validation cohort of 85 saliva samples was used to evaluate the performance of the diagnostic model. The performance of the classification models was evaluated by a series of indicators, including accuracy, specificity, and sensitivity [24]. We used the receiver operating characteristic (ROC) curve to assess the predictive ability of the models; the area under the curve (AUC) can be interpreted as the probability that the model assigns a higher score to a randomly chosen positive sample than to a randomly chosen negative one. The confidence intervals (CI) of the AUCs were computed using the DeLong method. Generally, the accuracy of a model is classified based on AUC as low (0.50<AUC≤0.70), moderate (0.70<AUC≤0.90), or high (AUC>0.90) [25]. Principal component analysis (PCA) was performed in R using the scatterplot3d (version 0.3-41) and factoextra (version 1.0.7) packages.
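As an illustration of the AUC and of the three accuracy bands given above, the following minimal sketch computes the AUC as the fraction of positive–negative sample pairs that the model ranks correctly (ties count as half) and then labels the result. The function names and the four example scores are hypothetical; the DeLong confidence interval is not reproduced here.

```python
def roc_auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def auc_band(auc):
    """Map an AUC to the bands used in the text (low, moderate, high)."""
    if auc > 0.90:
        return "high"
    if auc > 0.70:
        return "moderate"
    if auc > 0.50:
        return "low"
    return "no better than chance"


# Hypothetical model scores for two negatives and two positives.
example_auc = roc_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
print(example_auc, auc_band(example_auc))

# Validation AUCs reported in the abstract fall in the "moderate" band.
print(auc_band(0.857), auc_band(0.886))
```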

Results

Alteration of salivary glycopatterns in the saliva from HC and HCC patients

To determine whether the expression levels of salivary protein glycopatterns were associated with HC and HCC, a total of 118 individual saliva samples (HV=35, HC=40, and HCC=43) in the discovery cohort were independently examined by lectin microarrays. The layout of the lectin microarrays and the glycopatterns of Cy3-labeled salivary proteins from HV, HC, and HCC subjects bound to the lectin microarrays are shown in Figure 2A. To build a hierarchy of clusters, the normalized fluorescent intensities (NFIs) for each lectin were arranged in a heatmap by unsupervised clustering. Overall, the salivary glycopatterns of the HC group were comparable to those of the HCC group, but the salivary glycopatterns of the HV group were considerably different from those of the HC or HCC groups (Figure 2B). The glycan specificities of the 37 lectins and the mean NFIs of each group (mean ± standard deviation, SD) are summarized in Supplementary Table S2.

Figure 2: 
Glycopatterns of salivary glycoproteins from HV, HC, and HCC using lectin microarrays, respectively.
(A) The layout of the lectin microarray. A total of 37 lectins were dissolved in the recommended buffer and spotted on lectin microarray; each lectin was spotted in triplicate per block. Detection of glycopatterns in HV, HC and HCC by lectin microarrays; yellow frames indicate significant differences between the NFIs in the HV and HC or HCC groups, and gray frames indicate significant differences between the NFIs in the HC and HCC groups. (B) Heatmap of hierarchical clustering analysis. Analysis of unsupervised hierarchical clustering using the Euclidean distance of lectin microarray data from the discovery cohort showed differences between groups; blue indicates lower and red indicates higher expression values. HV, healthy volunteers; HC, hepatic cirrhosis; HCC, hepatocellular carcinoma; NFIs, normalized fluorescent intensities.

Lectin signal patterns were divided into three categories to assess whether the glycopatterns of the salivary glycoproteins were altered between HV, HC, and HCC subjects: (1) significant increases in NFIs (fold change>1, p<0.05), (2) significant decreases in NFIs (fold change<1, p<0.05), and (3) nearly unchanged NFIs (no significant difference). The pairwise fold changes and p-values for the NFIs of each lectin with at least one significant difference among HV, HC, and HCC subjects are shown in Table 2. The box plots display only the Bonferroni-corrected p-values for significant differences between groups (Figure 3). Overall, the statistical analysis revealed that the NFIs of 22 lectins (e.g., HHL, WFA, and PHA-E) were significantly altered among the three groups. Compared with HV, 19 lectins (e.g., HHL, WFA, and LEL) revealed significant differences in the salivary protein glycopatterns of patients with HC, and 20 lectins (e.g., AAL, LTL, and DSA) revealed significant differences in those of patients with HCC. Only four lectins revealed significant differences in salivary protein glycopatterns between HC and HCC patients.
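The three categories can be written as a small rule. The sketch below is illustrative only: the function name `classify_change` is hypothetical, a comparison reported as "ns" is represented by `None`, and the example values are the HC/HV fold changes and HV vs. HC p-values listed for HHL, WFA, and AAL in Table 2. The last lines check that the three fold-change columns of Table 2 are mutually consistent, since HCC/HC should equal (HCC/HV)/(HC/HV).

```python
def classify_change(fold_change, p_value, alpha=0.05):
    """Assign a lectin comparison to one of the three categories.

    p_value is None when the comparison is reported as 'ns'.
    """
    if p_value is None or p_value >= alpha:
        return "unchanged"
    return "increased" if fold_change > 1 else "decreased"


# (lectin, HC/HV fold change, p-value for HV vs. HC) as listed in Table 2.
hc_vs_hv = [("HHL", 1.344, 0.0194), ("WFA", 0.757, 0.0165), ("AAL", 0.876, None)]
categories = {lectin: classify_change(fc, p) for lectin, fc, p in hc_vs_hv}
print(categories)
# -> {'HHL': 'increased', 'WFA': 'decreased', 'AAL': 'unchanged'}

# Consistency check on Table 2: HCC/HC = (HCC/HV) / (HC/HV).
aca_hcc_vs_hc = round(2.880 / 2.273, 3)   # ACA, listed as 1.267
vva_hcc_vs_hc = round(0.501 / 0.713, 3)   # VVA, listed as 0.703
print(aca_hcc_vs_hc, vva_hcc_vs_hc)
```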

Table 2:

Altered glycopatterns from each group upon statistical difference analysis.

Lectin | Fold change (a) HC/HV | Fold change (a) HCC/HV | Fold change (a) HCC/HC | p-value (b) HV vs. HC | p-value (b) HV vs. HCC | p-value (b) HC vs. HCC
HHL | 1.344 | 1.596 | 1.187 | 0.0194 | <0.0001 | ns
WFA | 0.757 | 0.813 | 1.074 | 0.0165 | ns | ns
PHA-E | 1.915 | 1.830 | 0.956 | <0.0001 | <0.0001 | ns
PTL-I | 1.735 | 1.873 | 1.079 | 0.0004 | <0.0001 | ns
SJA | 1.777 | 1.981 | 1.115 | <0.0001 | <0.0001 | ns
PNA | 0.550 | 0.542 | 0.984 | <0.0001 | <0.0001 | ns
AAL | 0.876 | 1.194 | 1.364 | ns | 0.0129 | 0.0009
LTL | 1.450 | 1.721 | 1.187 | ns | 0.0048 | ns
MPL | 1.454 | 1.371 | 0.943 | 0.0012 | 0.0097 | ns
LEL | 0.745 | 0.830 | 1.113 | 0.0075 | ns | ns
GSL-I | 1.584 | 1.590 | 1.004 | <0.0001 | <0.0001 | ns
DBA | 1.496 | 1.432 | 0.957 | 0.0001 | 0.0005 | ns
BS-I | 0.559 | 0.640 | 1.145 | <0.0001 | 0.0292 | ns
PTL-II | 1.883 | 1.883 | 1.000 | 0.0001 | <0.0001 | ns
DSA | 1.183 | 1.418 | 1.199 | ns | <0.0001 | ns
VVA | 0.713 | 0.501 | 0.703 | 0.0195 | <0.0001 | 0.0127
NPA | 2.247 | 1.986 | 0.884 | <0.0001 | <0.0001 | ns
ACA | 2.273 | 2.880 | 1.267 | <0.0001 | <0.0001 | 0.0317
WGA | 0.467 | 0.388 | 0.831 | <0.0001 | <0.0001 | ns
MAL-I | 2.381 | 2.033 | 0.854 | <0.0001 | <0.0001 | ns
BPL | 3.076 | 2.682 | 0.872 | <0.0001 | <0.0001 | ns
PHA-E+L | 2.249 | 1.846 | 0.821 | <0.0001 | <0.0001 | 0.0408
  1. (a) Data are represented as relative fold change (A/B); fold change>1 implies upregulation, and fold change<1 implies downregulation. (b) p-Values by Kruskal–Wallis with Dunn’s multiple comparisons post hoc test (ns, not significant). HV, healthy volunteers; HC, hepatic cirrhosis; HCC, hepatocellular carcinoma.

Figure 3: 
Comparison of salivary glycopatterns differences among healthy volunteers and patients with hepatic cirrhosis and hepatocellular carcinoma.
The p-values were calculated by the Kruskal–Wallis test with Dunn’s post hoc test. *p<0.05, **p<0.01, and ***p<0.001 were considered significant.

The results showed that the high-Man and Manα1-6Man binders HHL and NPA, the αGalNAc and GalNAcα1-3(Fucα1-2)Gal binder DBA, and the T antigen and GalNAc binder MPL exhibited significantly increased NFIs in HC and HCC compared with HV (all fold change≥1.344, p≤0.0194). The bisecting GlcNAc and bi/tri-antennary N-glycan binders PHA-E and PHA-E+L, the Galβ1-3GalNAcα-Ser/Thr (T antigen) binders PTL-I and PTL-II, and the T antigen binder ACA showed significantly increased NFIs in HC and HCC compared with HV (all fold change≥1.735, p≤0.0004). Moreover, the terminal GalNAc and Gal binder SJA, the αGalNAc and αGal binder GSL-I, the Galβ1-3/4GlcNAc binder MAL-I, and the Galα1-3GalNAc binder BPL showed increased NFIs in HC and HCC compared with HV (all fold change≥1.584, p≤0.0001); however, these four lectins showed no significant difference between HC and HCC. A significant increase in the NFIs of the Galβ1-4GlcNAc binder DSA and the Fucα1-3(Galβ1-4)GlcNAc binder LTL was observed only in HCC compared with HV (fold change=1.418 and 1.721, p≤0.0048), while no statistical difference was observed for HC vs. HV or HC vs. HCC. Of note, the core-fucosylated glycan binder AAL showed significantly increased NFIs in HCC compared with HV (fold change=1.194, p=0.0129), and the NFIs of ACA and AAL were increased in HCC compared with HC (fold change=1.267 and 1.364, p=0.0317 and 0.0009, respectively).

The NFIs of the Galβ1-3GalNAcα binder PNA, the terminal α-Gal binder BS-I, and the multivalent Sia and (GlcNAc)n binder WGA decreased significantly in HC and HCC compared with HV (all fold change≤0.640, p≤0.0292). Also, the terminal GalNAcα/β1-3/6Gal binder WFA, the (GlcNAc)n and high Man-type N-glycan binder LEL, and the terminal GalNAc binder VVA had significantly decreased NFIs in HC compared with HV (all fold change≤0.757, p≤0.0195), and the NFIs of VVA were also significantly decreased in HCC compared with HV (fold change=0.501, p<0.0001). In addition, both PHA-E+L and VVA exhibited significantly decreased NFIs in HCC compared with HC (all fold change≤0.821, p≤0.0408).

Construction of HC and HCC models based on glycopattern abundances

Given our findings of altered salivary protein glycopatterns in HC and HCC patients, we applied LASSO logistic regression to select the most useful markers from the 37 lectins and then constructed HC and HCC models (Supplementary Figures S1A, S2A and Supplementary Table S4). In the discovery cohort, the diagnostic accuracy of the LASSO-HC model was 0.839, with an AUC of 0.885 (95% CI: 0.826–0.943). The sensitivity and specificity were 0.975 and 0.769, respectively. The diagnostic accuracy of the LASSO-HCC model was 0.780, with an AUC of 0.865 (95% CI: 0.802–0.928). The sensitivity and specificity were 0.930 and 0.693, respectively (Table 3).

Table 3:

The performance of different models in terms of AUC, accuracy, specificity, and sensitivity.

Models | Discovery (n=118) AUC (95% CI) | Discovery Ac | Discovery Sp | Discovery Se | Validation (n=85) AUC (95% CI) | Validation Ac | Validation Sp | Validation Se
LASSO-HC | 0.885 (0.826–0.943) | 0.839 (99/118) | 0.769 (60/78) | 0.975 (39/40) | 0.761 (0.651–0.870) | 0.741 (63/85) | 0.684 (39/57) | 0.857 (24/28)
SVM-HC | 0.988 (0.934–0.998) | 0.907 (107/118) | 0.859 (67/78) | 1.000 (40/40) | 0.784 (0.668–0.900) | 0.741 (63/85) | 0.737 (42/57) | 0.750 (21/28)
RF-HC | 1.000 (1.000–1.000) | 1.000 (118/118) | 1.000 (78/78) | 1.000 (40/40) | 0.857 (0.780–0.935) | 0.765 (65/85) | 0.667 (38/57) | 0.964 (27/28)
LASSO-HCC | 0.865 (0.802–0.928) | 0.780 (92/118) | 0.693 (52/75) | 0.930 (40/43) | 0.774 (0.676–0.873) | 0.706 (60/85) | 0.644 (38/59) | 0.846 (22/26)
SVM-HCC | 0.907 (0.847–0.967) | 0.890 (105/118) | 0.907 (68/75) | 0.860 (37/43) | 0.857 (0.769–0.944) | 0.812 (69/85) | 0.797 (47/59) | 0.846 (22/26)
RF-HCC | 1.000 (1.000–1.000) | 1.000 (118/118) | 1.000 (75/75) | 1.000 (43/43) | 0.886 (0.814–0.957) | 0.859 (73/85) | 0.881 (52/59) | 0.808 (21/26)
  1. The optimal sensitivity and specificity cut-off point for each model was established by maximizing the Youden index. AUC, area under curve; CI, confidence interval; Ac, accuracy; Sp, specificity; Se, sensitivity; HC, hepatic cirrhosis; HCC, hepatocellular carcinoma; RF, Random Forest; SVM, Support Vector Machine; LASSO, Least Absolute Shrinkage and Selection Operator.
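The metrics in Table 3 and the Youden index mentioned in its footnote can be illustrated with a minimal sketch. The counts used below are the RF-HCC validation counts from Table 3 (Se 21/26, Sp 52/59); the function names, the "score at or above the cut-off is positive" convention, and the four example scores are assumptions made for illustration.

```python
def metrics_from_counts(tp, fn, tn, fp):
    """Sensitivity, specificity, and accuracy from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


def youden_cutoff(scores, labels):
    """Return (cut-off, J) maximizing J = sensitivity + specificity - 1.

    A sample is called positive when its score is >= the cut-off; the
    lowest cut-off is kept when several reach the same J.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    best_cut, best_j = None, None
    for cut in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cut and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < cut and y == 0)
        j = tp / n_pos + tn / n_neg - 1
        if best_j is None or j > best_j:
            best_cut, best_j = cut, j
    return best_cut, best_j


# RF-HCC in the validation cohort (Table 3): 21 of 26 HCC samples and
# 52 of 59 non-HCC samples were classified correctly.
rf_hcc = metrics_from_counts(tp=21, fn=5, tn=52, fp=7)
print({name: round(value, 3) for name, value in rf_hcc.items()})

# Hypothetical scores for two negatives and two positives.
cutoff, j_index = youden_cutoff([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
print(cutoff, j_index)
```

Recomputing the rounded values from the counts is a quick way to confirm that each Ac, Sp, and Se entry in Table 3 matches its fraction.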

Appropriate variables were selected using SVM-RFE with 10-fold cross-validation (Supplementary Figures S1B, S2B and Supplementary Table S4). After completing feature selection and parameter tuning, we constructed the SVM-HC and SVM-HCC models. In the discovery cohort, the diagnostic accuracy of the SVM-HC model was 0.907, with an AUC of 0.988 (95% CI: 0.934–0.998). The sensitivity and specificity were 1.000 and 0.859, respectively. The diagnostic accuracy of the SVM-HCC model was 0.890, with an AUC of 0.907 (95% CI: 0.847–0.967). The sensitivity and specificity were 0.860 and 0.907, respectively (Table 3).

A combination of the Boruta and RF algorithms was used to construct diagnostic models for HC and HCC. Seven lectins (AAL, BS-I, NPA, etc.) and 11 lectins (HHL, PHA-E, SJA, etc.) were selected by the Boruta algorithm for RF-HC and RF-HCC, respectively, and the variables were ranked by importance (Supplementary Figures S1D, S2D). After tuning and optimizing the models, we calculated the mean decrease in Gini for each feature in RF-HC and RF-HCC (Supplementary Figures S1C, S2C). The RF-HC and RF-HCC models achieved 100% accuracy in the discovery cohort (Table 3).

Comparison of the performance of the HC and HCC models in the validation cohort

In the HC validation cohort, the diagnostic accuracy of the LASSO-HC model was 0.741, with an AUC of 0.761 (95% CI: 0.651–0.870). The sensitivity and specificity were 0.857 and 0.684, respectively. The SVM-HC model had a diagnostic accuracy of 0.741, with an AUC of 0.784 (95% CI: 0.668–0.900) and sensitivity and specificity of 0.750 and 0.737, respectively. A comparison of the AUC values of the two models indicates that the diagnostic power of the SVM-HC model was slightly better than that of the LASSO-HC model. The diagnostic accuracy of the RF-HC model was 0.765, its AUC was 0.857 (95% CI: 0.780–0.935), and its sensitivity and specificity were 0.964 and 0.667, respectively. Overall, combining the Boruta and RF algorithms resulted in the highest accuracy and AUC values in both the discovery and validation cohorts. LASSO-HC obtained the lowest AUC value of 0.761. Moreover, the SVM-HC model had the highest specificity of the three models, but its accuracy was lower than that of the RF-HC model. Given that the RF-HC model obtained the highest AUC value (0.857) in the validation cohort, it appears to have the greatest potential to identify HC patients. In addition, we performed PCA in the discovery and validation cohorts using the features selected by RF-HC and found that most samples from HC were separated from the samples from the other groups (HV and HCC) (Supplementary Figure S3A, B).

In the HCC validation cohort, the diagnostic accuracy of the LASSO-HCC model was 0.706, with an AUC of 0.774 (95% CI: 0.676–0.873). The sensitivity and specificity were 0.846 and 0.644, respectively. The SVM-HCC model had a diagnostic accuracy of 0.812, with an AUC of 0.857 (95% CI: 0.769–0.944), and sensitivity and specificity of 0.846 and 0.797, respectively. The RF-HCC model achieved accuracy, sensitivity, and specificity of 0.859, 0.808, and 0.881, respectively, in the validation cohort, with an AUC of 0.886 (95% CI: 0.814–0.957). In all, the RF algorithm obtained the best performance in identifying HCC when features were selected using the Boruta method. Furthermore, we found that samples were largely separated by sample type (HCC or others) in the 3D-PCA plot (Supplementary Figure S3C, D). The ROC curve comparisons are shown in Figure 4, and the model assessment metrics are summarized in Table 3.

Figure 4: 
The ROC curves of HC and HCC models in the discovery cohort and validation cohort.
HC, hepatic cirrhosis; HCC, hepatocellular carcinoma; ROC, receiver operating characteristic.

Discussion

Post-translational modification of glycosylation is a non-templated but highly regulated process that changes rapidly in both physiological and pathological contexts [8]. To the best of our knowledge, there is currently no tool for classifying HC and HCC based on salivary protein glycopatterns. In this study, we first detected the alteration of salivary glycopatterns by lectin microarrays in 118 saliva samples, and then constructed HC and HCC diagnostic models based on different algorithms. Furthermore, we assessed the diagnostic performance of the models in an independent validation cohort of 85 saliva samples.

Altered branching and core fucosylation of N-glycans and sialylated N-glycans are associated with cell adhesion, infection, inflammation, and tumor metastasis [26], [27], [28], [29]. Our results indicated that the salivary glycoproteins recognized by the core-fucosylated glycans binder LTL and AAL were more highly expressed in patients with HCC [30, 31]. Hisashi Narimatsu’s team identified many glycoproteins carrying glycans with enhanced reactivity with AAL in HCC sera and hepatocellular carcinoma cell lines, in good agreement with our finding [32]. Overexpression of Tn, sTn, and T antigens occurred in many types of cancer and was involved in the metastatic process of tumor cells [33]. Significantly high expression of the DBA and ACA, which recognize Tn and T antigens, was found in the saliva of patients with HC and HCC. These results may be explained by the fact that human saliva contains large amounts of circulating exosomes released by cells or organs, and exosomes contain proteins and nucleic acids [34, 35], which in turn are covered abundantly by glycoproteins [36].

Further, we constructed diagnostic models for HC and HCC based on differences in salivary glycopatterns and found that the models constructed by combining the Boruta and RF algorithms outperformed the other two modeling strategies. Boruta feature selection is built around an RF and selects features that have significantly more discriminatory power than randomly permuted features [37, 38], while the RF algorithm is a classifier consisting of an ensemble of tree-structured classifiers that has high prediction performance and a low risk of overfitting [39, 40]. Because they integrate a large number of rules, models based on the RF algorithm tend to be more robust than those built by other algorithms, and this advantage is reflected in the AUC and accuracy values of the current results. Although some studies have shown that machine learning techniques and data-driven approaches have great potential to build and improve predictive diagnostic models, there are some barriers to their adoption in medicine, such as poor interpretability and instability [41], [42], [43]. In the future, as the quality and quantity of biochemical test data improve, machine learning technology will play an increasingly important role in medical diagnostics and provide more intelligent advice to doctors.

It should be noted that the current study has some limitations. First, our work is exploratory and preliminary; although an association between salivary glycopatterns and liver disease has been found, the sample size is still relatively limited, and more experiments are needed to corroborate the link. Moreover, for diagnostic models built on machine learning, a larger sample size generally improves the robustness of the model. For these reasons, we cannot rule out that potential confounders and selection bias arising from the retrospective nature of the study may lead to an incorrect estimate of the association between the algorithm and clinical outcome. We will endeavor to enlarge the sample size in our future research to overcome this limitation. Furthermore, the mechanism behind altered salivary glycopatterns in HCC is currently unknown, and our subsequent work will focus on exploring the underlying causes of abnormal glycosylation of salivary glycoproteins in patients with HCC. In the future, we will combine data generated by glycoproteomic and other omics analyses with mathematical modeling research to further increase our understanding of liver disease phenotypes.

Conclusions

In conclusion, our preliminary study characterized the salivary glycopatterns of HC and HCC patients and combined them with machine learning algorithms to construct diagnostic models of HC and HCC, which may contribute to understanding the complex physiological changes of HCC and provide a new screening strategy.


Corresponding authors: Yonghong Guo, The Infectious Disease Department, Gongli Hospital, Pudong New Area, Shanghai 200135, P.R. China; and Zheng Li, Laboratory for Functional Glycomics, College of Life Sciences, Northwest University, Xi’an 710069, P.R. China

Funding source: Pudong New Area special Fund for Livelihood Research Project of Science and Technology Development Fund

Award Identifier / Grant number: PKJ2021-Y12

  1. Research funding: This work was supported by Pudong New Area Special Fund for Livelihood Research Project of Science and Technology Development Fund (PKJ2021-Y12).

  2. Author contribution: All authors have accepted responsibility for the entire content of this manuscript and approved its submission.

  3. Competing interests: Authors state no conflict of interest.

  4. Informed consent: Informed consent was obtained from all individuals included in this study.

  5. Ethical approval: Research involving human subjects complied with all relevant national regulations and institutional policies, is in accordance with the tenets of the Helsinki Declaration (as revised in 2013), and has been approved by the Ethical Committee of Northwest University (Xi’an, China) and the Ethical Committee of Second Affiliated Hospital of Xi’an Jiaotong University (Xi’an, China).

References

1. Yang, JD, Hainaut, P, Gores, GJ, Amadou, A, Plymoth, A, Roberts, LR. A global view of hepatocellular carcinoma: trends, risk, prevention and management. Nat Rev Gastroenterol Hepatol 2019;16:589–604. https://doi.org/10.1038/s41575-019-0186-y.

2. Llovet, JM, Kelley, RK, Villanueva, A, Singal, AG, Pikarsky, E, Roayaie, S, et al. Hepatocellular carcinoma. Nat Rev Dis Prim 2021;7:6. https://doi.org/10.1038/s41572-020-00240-3.

3. Tang, A, Bashir, MR, Corwin, MT, Cruite, I, Dietrich, CF, Do, RKG, et al. Evidence supporting LI-RADS major features for CT- and MR imaging-based diagnosis of hepatocellular carcinoma: a systematic review. Radiology 2018;286:29–48. https://doi.org/10.1148/radiol.2017170554.

4. Spangenberg, HC, Thimme, R, Blum, HE. Serum markers of hepatocellular carcinoma. Semin Liver Dis 2006;26:385–90. https://doi.org/10.1055/s-2006-951606.

5. Volk, ML, Hernandez, JC, Su, GL, Lok, AS, Marrero, JA. Risk factors for hepatocellular carcinoma may impair the performance of biomarkers: a comparison of AFP, DCP, and AFP-L3. Cancer Biomarkers 2007;3:79–87. https://doi.org/10.3233/cbm-2007-3202.

6. Masuzaki, R, Karp, SJ, Omata, M. New serum markers of hepatocellular carcinoma. Semin Oncol 2012;39:434–9. https://doi.org/10.1053/j.seminoncol.2012.05.009.

7. Reily, C, Stewart, TJ, Renfrow, MB, Novak, J. Glycosylation in health and disease. Nat Rev Nephrol 2019;15:346–66. https://doi.org/10.1038/s41581-019-0129-4.

8. Peixoto, A, Relvas-Santos, M, Azevedo, R, Santos, LL, Ferreira, JA. Protein glycosylation and tumor microenvironment alterations driving cancer hallmarks. Front Oncol 2019;9:380. https://doi.org/10.3389/fonc.2019.00380.

9. Lima, DP, Diniz, DG, Moimaz, SAS, Sumida, DH, Okamoto, AC. Saliva: reflection of the body. Int J Infect Dis 2010;14:e184–88. https://doi.org/10.1016/j.ijid.2009.04.022.

10. Qin, Y, Zhong, Y, Zhu, M, Dang, L, Yu, H, Chen, Z, et al. Age- and sex-associated differences in the glycopatterns of human salivary glycoproteins and their roles against influenza A virus. J Proteome Res 2013;12:2742–54. https://doi.org/10.1021/pr400096w.

11. Qin, Y, Zhong, Y, Ma, T, Zhang, J, Yang, G, Guan, F, et al. A pilot study of salivary N-glycome in HBV-induced chronic hepatitis, cirrhosis, and hepatocellular carcinoma. Glycoconj J 2017;34:523–35. https://doi.org/10.1007/s10719-017-9768-5.

12. Dang, K, Zhang, W, Jiang, S, Lin, X, Qian, A. Application of lectin microarrays for biomarker discovery. ChemistryOpen 2020;9:285–300. https://doi.org/10.1002/open.201900326.

13. Du, H, Yu, H, Yang, F, Li, Z. Comprehensive analysis of glycosphingolipid glycans by lectin microarrays and MALDI-TOF mass spectrometry. Nat Protoc 2021;16:3470–91. https://doi.org/10.1038/s41596-021-00544-y.

14. Yu, H, Shu, J, Li, Z. Lectin microarrays for glycoproteomics: an overview of their use and potential. Expet Rev Proteonomics 2020;17:27–39. https://doi.org/10.1080/14789450.2020.1720512.

15. Zou, X, Yao, F, Yang, F, Zhang, F, Xu, Z, Shi, J, et al. Glycomic signatures of plasma IgG improve preoperative prediction of the invasiveness of small lung nodules. Molecules 2019;25:28. https://doi.org/10.3390/molecules25010028.

16. Bojar, D, Meche, L, Meng, G, Eng, W, Smith, DF, Cummings, RD, et al. A useful guide to lectin binding: machine-learning directed annotation of 57 unique lectin specificities. ACS Chem Biol 2022. https://doi.org/10.1021/acschembio.1c00689 [Epub ahead of print].

17. Chabon, JJ, Hamilton, EG, Kurtz, DM, Esfahani, MS, Moding, EJ, Stehr, H, et al. Integrating genomic features for non-invasive early lung cancer detection. Nature 2020;580:245–51. https://doi.org/10.1038/s41586-020-2140-0.

18. Lundberg, SM, Nair, B, Vavilala, MS, Horibe, M, Eisses, MJ, Adams, T, et al. Explainable machine-learning predictions for the prevention of hypoxaemia during surgery. Nat Biomed Eng 2018;2:749–60. https://doi.org/10.1038/s41551-018-0304-0.

19. Liu, X, Yu, H, Qiao, Y, Yang, J, Shu, J, Zhang, J, et al. Salivary glycopatterns as potential biomarkers for screening of early-stage breast cancer. EBioMedicine 2018;28:70–9. https://doi.org/10.1016/j.ebiom.2018.01.026.

20. Patel, AJ, Tan, T-M, Richter, AG, Naidu, B, Blackburn, JM, Middleton, GW. A highly predictive autoantibody-based biomarker panel for prognosis in early-stage NSCLC with potential therapeutic implications. Br J Cancer 2022;126:238–46. https://doi.org/10.1038/s41416-021-01572-x.

21. Tibshirani, R. Regression shrinkage and selection via the Lasso. J Roy Stat Soc B 1996;58:267–88. https://doi.org/10.1111/j.2517-6161.1996.tb02080.x.

22. Duan, K-B, Rajapakse, JC, Wang, H, Azuaje, F. Multiple SVM-RFE for gene selection in cancer classification with expression data. IEEE Trans NanoBioscience 2005;4:228–34. https://doi.org/10.1109/tnb.2005.853657.

23. Kursa, MB, Rudnicki, WR. Feature selection with the Boruta package. J Stat Software 2010;36:1–13. https://doi.org/10.18637/jss.v036.i11.

24. Sokolova, M, Lapalme, G. A systematic analysis of performance measures for classification tasks. Inf Process Manag 2009;45:427–37. https://doi.org/10.1016/j.ipm.2009.03.002.

25. Swets, JA. Measuring the accuracy of diagnostic systems. Science 1988;240:1285–93. https://doi.org/10.1126/science.3287615.

26. Ma, B, Simala-Grant, JL, Taylor, DE. Fucosylation in prokaryotes and eukaryotes. Glycobiology 2006;16:158R–84R. https://doi.org/10.1093/glycob/cwl040.

27. Li, J, Hsu, H-C, Mountz, JD, Allen, JG. Unmasking fucosylation: from cell adhesion to immune system regulation and diseases. Cell Chem Biol 2018;25:499–512. https://doi.org/10.1016/j.chembiol.2018.02.005.

28. Byrd-Leotis, L, Liu, R, Bradley, KC, Lasanajak, Y, Cummings, SF, Song, X, et al. Shotgun glycomics of pig lung identifies natural endogenous receptors for influenza viruses. Proc Natl Acad Sci U S A 2014;111:E2241–50. https://doi.org/10.1073/pnas.1323162111.

29. Taniguchi, N, Kizuka, Y. Glycans and cancer: role of N-glycans in cancer biomarker, progression and metastasis, and therapeutics. Adv Cancer Res 2015;126:11–51. https://doi.org/10.1016/bs.acr.2014.11.001.

30. Gao, C, Hanes, MS, Byrd-Leotis, LA, Wei, M, Jia, N, Kardish, RJ, et al. Unique binding specificities of proteins towards isomeric asparagine-linked glycans. Cell Chem Biol 2019;26:535–47. https://doi.org/10.1016/j.chembiol.2019.01.002.

31. Hashim, OH, Jayapalan, JJ, Lee, C-S. Lectins: an effective tool for screening of potential cancer biomarkers. PeerJ 2017;5:e3784. https://doi.org/10.7717/peerj.3784.

32. Kaji, H, Ocho, M, Togayachi, A, Kuno, A, Sogabe, M, Ohkura, T, et al. Glycoproteomic discovery of serological biomarker candidates for HCV/HBV infection-associated liver fibrosis and hepatocellular carcinoma. J Proteome Res 2013;12:2630–40. https://doi.org/10.1021/pr301217b.

33. Fu, C, Zhao, H, Wang, Y, Cai, H, Xiao, Y, Zeng, Y, et al. Tumor-associated antigens: Tn antigen, sTn antigen, and T antigen. HLA 2016;88:275–86. https://doi.org/10.1111/tan.12900.

34. Sun, Y, Liu, S, Qiao, Z, Shang, Z, Xia, Z, Niu, X, et al. Systematic comparison of exosomal proteomes from human saliva and serum for the detection of lung cancer. Anal Chim Acta 2017;982:84–95. https://doi.org/10.1016/j.aca.2017.06.005.

35. Sharma, S, Rasool, HI, Palanisamy, V, Mathisen, C, Schmidt, M, Wong, DT, et al. Structural-mechanical characterization of nanoparticle exosomes in human saliva, using correlative AFM, FESEM, and force spectroscopy. ACS Nano 2010;4:1921–6. https://doi.org/10.1021/nn901824n.

36. Melo, SA, Luecke, LB, Kahlert, C, Fernandez, AF, Gammon, ST, Kaye, J, et al. Glypican-1 identifies cancer exosomes and detects early pancreatic cancer. Nature 2015;523:177–82. https://doi.org/10.1038/nature14581.

37. Wu, G, Yang, P, Xie, Y, Woodruff, HC, Rao, X, Guiot, J, et al. Development of a clinical decision support system for severity risk prediction and triage of COVID-19 patients at hospital admission: an international multicentre study. Eur Respir J 2020;56:2001104. https://doi.org/10.1183/13993003.01104-2020.

38. Vitsios, D, Petrovski, S. Mantis-ml: disease-agnostic gene prioritization from high-throughput genomic screens by stochastic semi-supervised learning. Am J Hum Genet 2020;106:659–78. https://doi.org/10.1016/j.ajhg.2020.03.012.

39. Jiang, P, Wu, H, Wang, W, Ma, W, Sun, X, Lu, Z. MiPred: classification of real and pseudo microRNA precursors using random forest prediction model with combined features. Nucleic Acids Res 2007;35:W339–44. https://doi.org/10.1093/nar/gkm368.

40. Bureau, A, Dupuis, J, Falls, K, Lunetta, KL, Hayward, B, Keith, TP, et al. Identifying SNPs predictive of phenotype using random forests. Genet Epidemiol 2005;28:171–82. https://doi.org/10.1002/gepi.20041.

41. Huang, C, Murugiah, K, Mahajan, S, Li, S-X, Dhruva, SS, Haimovich, JS, et al. Enhancing the prediction of acute kidney injury risk after percutaneous coronary intervention using machine learning techniques: a retrospective cohort study. PLoS Med 2018;15:e1002703. https://doi.org/10.1371/journal.pmed.1002703.

42. Gillette, MA, Mani, DR, Uschnig, C, Pellé, KG, Madrid, L, Acácio, S, et al. Biomarkers to distinguish bacterial from viral pediatric clinical pneumonia in a malaria-endemic setting. Clin Infect Dis 2021;73:e3939–48. https://doi.org/10.1093/cid/ciaa1843.

43. Beheshti, I, Ganaie, MA, Paliwal, V, Rastogi, A, Razzak, I, Tanveer, M. Predicting brain age using machine learning algorithms: a comprehensive evaluation. IEEE J Biomed Health Inf 2022;26:1432–40. https://doi.org/10.1109/jbhi.2021.3083187.


Supplementary Material

The online version of this article offers supplementary material (https://doi.org/10.1515/cclm-2022-0715).


Received: 2022-08-03
Accepted: 2022-09-08
Published Online: 2022-09-19
Published in Print: 2022-11-25

© 2022 Walter de Gruyter GmbH, Berlin/Boston
