Abstract
Objectives
According to international standards, clinical laboratories are required to verify the performance of assays prior to their implementation in routine practice. This typically involves the assessment of the assay’s imprecision and trueness vs. appropriate targets. The analysis of these data is typically performed using frequentist statistical methods and often requires the use of closed-source, proprietary software. The motivation for this paper was therefore to develop open-source, freely available software capable of performing Bayesian analysis of verification data.
Methods
The veRification application presented here was developed with the freely available R statistical computing environment, using the Shiny application framework. The codebase is fully open-source and is available as an R package on GitHub.
Results
The developed application allows the user to analyze imprecision, trueness against external quality assurance, trueness against reference material, method comparison, and diagnostic performance data within a fully Bayesian framework (with frequentist methods also being available for some analyses).
Conclusions
Bayesian methods can have a steep learning curve and thus the work presented here aims to make Bayesian analyses of clinical laboratory data more accessible. Moreover, the development of the application seeks to encourage the dissemination of open-source software within the community and provides a framework through which Shiny applications can be developed, shared, and iterated upon.
Introduction
Verification of new assays forms a central part of clinical laboratory practice in order to formally confirm, through the provision of objective evidence, that specified requirements for an assay’s performance have been fulfilled. Such analyses form a key part of clinical laboratory accreditation and a variety of guidelines exist that provide direction as to the experimental evidence that is required [1], [2], [3]. In general, these define experiments that are designed to assess an assay’s imprecision (stochastic error) and trueness (systematic error), with the latter being testable through a variety of means (e.g. against a reference material or an external quality assurance scheme). These documents also provide recommendations for ways in which the data produced by these experiments should be analyzed, with the vast majority detailing methods within the frequentist paradigm of statistical analysis (e.g. t-tests, nested analysis of variance, and Passing–Bablok regression) [1], [2], [3]. These methods can easily provide the probability of having collected the verification data (or more extreme values), assuming a given null hypothesis to be true (i.e. the p value), but they cannot directly provide the probability of a hypothesis given the observed data.
The Bayesian statistical paradigm leverages Bayes’ theorem to incorporate prior information into an analysis and directly calculate the probability of a hypothesis given the observed data.
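In its simplest form, Bayes’ theorem expresses this posterior probability of a hypothesis H given observed data y in terms of the likelihood of the data and the prior probability of the hypothesis:

```latex
p(H \mid y) = \frac{p(y \mid H)\, p(H)}{p(y)}
```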
Materials and methods
The open-source application shown here was developed within RStudio [18] and the R statistical computing environment [19], which are available to download for free at https://posit.co/downloads/ and https://cran.r-project.org/, respectively. The application was built with the Shiny package (https://shiny.rstudio.com/) and makes use of a number of other R packages [14, 15, 20], [21], [22], [23], [24], [25], [26], [27]. The full codebase for the application can be found at https://github.com/ed-wilkes/veRification and is written in a modular way within the golem framework [28]. This means that the application can be installed as an R package and run from a local instance of RStudio or R on a machine running a Linux, Windows, or macOS-based operating system, or hosted on a web server and accessed through a web browser. It is worth noting that performance will depend on the hardware on which R and RStudio are run. In addition, the application’s modularity enables interested users to fork the repository and easily develop and add new modules as they wish.
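As a sketch of the local installation route described above, the application can be installed from GitHub and launched from an R console; `run_app()` is golem’s conventional entry point and is assumed here:

```r
# Install the application directly from GitHub and launch it locally.
# run_app() is the entry point conventionally exported by golem-based
# packages; the exact exported name is an assumption.
install.packages("remotes")
remotes::install_github("ed-wilkes/veRification")
veRification::run_app()
```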
Results
Assessing assay imprecision
The assessment of an assay’s imprecision is perhaps one of the first steps of a method’s verification. In the UK, the Association for Clinical Biochemistry and Laboratory Medicine (ACB) recommends the measurement of at least two levels of internal quality control (IQC) material five times a day (spread throughout the day), across five different days [1]. Such an experiment allows the calculation of within- and between-day variability (referred to as repeatability and intermediate imprecision, respectively), alongside the total variation across the time course and the expected value. Analysis of these data is commonly performed with a nested ANOVA (analysis of variance), but they can equivalently be modeled with a Bayesian linear varying-effects model [16] (Eqs. 4.1–4.5) in order to make direct probability statements regarding the model’s parameters and propagate our uncertainty regarding their values into our inferences. Bayesian (and frequentist) varying-effects models can be fitted within the “Imprecision” tab. The Bayesian analyses use the default, weakly informative prior distributions recommended by the rstanarm team [15], scaled to the input data (Eqs. 4.4 and 4.5).
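Outside the application, an equivalent varying-effects model for a single IQC level can be sketched with rstanarm [15]; this is an illustrative analogue, not the application’s exact model, and the data frame and column names are hypothetical:

```r
library(rstanarm)

# Hypothetical IQC data for one level: 5 replicates across 5 days
qc <- data.frame(
  day   = factor(rep(1:5, each = 5)),
  value = rnorm(25, mean = 10, sd = 0.5)
)

# Varying-intercept model: between-day variation enters as a grouping
# term, using rstanarm's default weakly informative priors
fit <- stan_lmer(value ~ 1 + (1 | day), data = qc, refresh = 0)

# Posterior summaries of the mean, between-day SD, and within-day
# (residual) SD, from which repeatability and intermediate
# imprecision can be derived
print(fit, digits = 3)
```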
First, the user is prompted to input their data in .csv or .xls(x) format. The inputted data are then displayed and the user is prompted to interactively select the columns within the data that represent the days of measurement, the IQC levels, and the measurements themselves. The user can also select whether to test the estimated total laboratory CV against a given claim (e.g. from a manufacturer’s kit insert or another data source) (Figure 1A–C). Once these settings are chosen, the data are plotted with an interactive visualization and the modeling results are presented to the user when complete (Figure 2A). A number of basic checks of the Bayesian model’s validity are performed and the results of these (pass or fail) are also shown. These checks determine whether the Markov chains (MCMC) have converged (via the rank-normalized R̂ statistic [31]).
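A minimal sketch of such a convergence check with rstanarm (the model and data here are hypothetical):

```r
library(rstanarm)

# Fit a trivial intercept-only model to hypothetical data
d <- data.frame(y = rnorm(20, mean = 5))
fit <- stan_glm(y ~ 1, data = d, refresh = 0)

# The rank-normalized R-hat should be close to 1 for every parameter
# if the Markov chains have converged [31]
rhat <- summary(fit)[, "Rhat"]
all(rhat < 1.01)
```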
Assessing trueness against external quality assurance materials
The trueness of an assay, defined as the closeness of agreement between the arithmetic mean of a large number of test results and the true or accepted reference value [32], is fundamental to its clinical utility. This is pragmatically tested through comparison of the values obtained using the assay in question to those reported through an external quality assurance (EQA) scheme. Ideally, this comparison encompasses a large range of clinically relevant concentrations of the analyte and is preferably performed in duplicate to account for within-assay variability [1]. The application allows the user to analyze these data in a number of different ways within the “Trueness (EQA)” tab using both frequentist (ordinary least-squares, Deming, or Passing–Bablok regression analyses) or Bayesian methods, depending on the user’s preference (Figure 3A and B). The Bayesian linear regression models take the form shown in Eqs. 4.6–4.10.
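As an illustrative sketch (not the application’s exact model), a Bayesian regression of assay results against EQA target values can be fitted with rstanarm’s default weakly informative priors; the data below are hypothetical:

```r
library(rstanarm)

# Hypothetical EQA data: scheme target values and this assay's results
eqa <- data.frame(
  target = c(5, 10, 20, 40, 80),
  result = c(5.2, 9.8, 20.5, 41.2, 78.9)
)

# The posterior for the intercept quantifies constant bias and the
# posterior for the slope quantifies proportional bias vs. the scheme
fit <- stan_glm(result ~ target, data = eqa, refresh = 0)
posterior_interval(fit, prob = 0.95)
```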
Assessing trueness against reference materials
In addition to the assessments vs. EQA material, the trueness of an assay can also be assessed by comparison to a reference material whose analyte concentration has been assigned through the use of a reference method (or appropriate substitute). Such analyses are typically performed through measurement of the reference material on 3–5 occasions in duplicate [1]. These data are then often analyzed to assess the significance of the difference between the measured and assigned values in a Neyman–Pearson, frequentist framework. For reasons discussed elsewhere [5, 16], Bayesian methods provide a formal and more intuitive way through which the probability of a difference between the reference and test methods can be calculated. This is especially true in this case, where a relatively strong prior distribution for the results is already known (the assigned value ± an assigned uncertainty). Within the “Trueness (reference)” tab, the application prompts the user to upload their data, enter the assigned value and associated uncertainty, and choose the relevant columns of their uploaded data (Figure 4A and B). A model is fitted to the data using the assigned value and uncertainty as the prior distribution (Eqs. 4.18–4.21).
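The use of the assigned value and its uncertainty as a prior can be sketched with rstanarm’s `prior_intercept` argument; this is an illustrative analogue of the approach, and the values below are hypothetical:

```r
library(rstanarm)

# Hypothetical replicate measurements of a reference material with an
# assigned value of 10.0 and an assigned standard uncertainty of 0.2
ref <- data.frame(value = c(10.1, 9.8, 10.3, 10.0, 9.9, 10.2))

# Intercept-only model with the assigned value and uncertainty
# encoded as an informative prior on the mean
fit <- stan_glm(value ~ 1, data = ref, refresh = 0,
                prior_intercept = normal(location = 10.0, scale = 0.2))

# Posterior probability that the measured mean exceeds the assigned value
mean(as.matrix(fit)[, "(Intercept)"] > 10.0)
```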
Assessing trueness against a reference assay
Comparisons of two methods for the measurement of a given measurand are commonly performed within clinical laboratories. The current UK ACB guidelines [1] suggest that at least 20 samples consisting of patient material should be assayed with each method in a timely manner (preferably in duplicate). The application allows the user to analyze these data in a number of different ways using both frequentist (ordinary least-squares, Deming, or Passing–Bablok regression analyses) and Bayesian methods, depending on the user’s preference, within the “Method comparison” tab (Figure 5A and B). As with the EQA data analysis, the Bayesian linear regression models here take the form shown in Eqs. 4.6–4.10.
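For the frequentist options, the mcr package [24] provides the corresponding regression routines; a minimal Passing–Bablok sketch with hypothetical paired results:

```r
library(mcr)

# Hypothetical paired results: reference method (x) vs test method (y)
x <- c(1.2, 2.4, 3.1, 4.8, 6.0, 7.5, 9.1, 10.4)
y <- c(1.3, 2.2, 3.3, 4.9, 6.2, 7.3, 9.4, 10.1)

# Passing-Bablok regression of the test method against the reference;
# "Deming" or "LinReg" may be substituted for the other options
fit <- mcreg(x, y, method.reg = "PaBa")
getCoefficients(fit)  # intercept and slope with confidence intervals
```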
Assessing the diagnostic performance of a test
The final tool within the application is designed to assess the performance of a given test to correctly predict a binary categorical outcome using a continuous variable (e.g. predicting “healthy” vs “disease” through the measurement of a biomarker). This is often achieved through the use of receiver operating characteristic (ROC) analysis, which typically involves the derivation of a hard analyte threshold based on minimizing a loss function that balances a test’s sensitivity and specificity. This type of analysis is problematic in medicine, however, for several reasons. Firstly, as clinical decision makers, we are most often analyzing scenarios in which there is significant stochasticity in a given clinical outcome due to measurement error, biological variation, and sampling variability. As such, there is often a significant overlap between the two groups being compared and thus probability estimates are most appropriate to best quantify the tendency towards one group or the other – i.e. directly inferring and interpreting the forward probability of the outcome given a measurement result.
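This forward-probability view corresponds to a Bayesian logistic regression, which can be sketched with rstanarm; the data below are hypothetical:

```r
library(rstanarm)

# Hypothetical data: analyte concentration and binary disease status
dx <- data.frame(
  analyte = c(2, 3, 4, 5, 6, 7, 8, 9, 10, 11),
  disease = c(0, 0, 0, 0, 1, 0, 1, 1, 1, 1)
)

# Logistic regression directly models p(disease | analyte) rather
# than imposing a hard analyte threshold
fit <- stan_glm(disease ~ analyte, data = dx,
                family = binomial(), refresh = 0)

# Posterior mean probability of disease at new analyte concentrations
new_x <- data.frame(analyte = c(4, 7, 10))
colMeans(posterior_epred(fit, newdata = new_x))
```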
As with the previous modules, the user is prompted to enter their data set in .csv or .xls(x) format and select the columns in their data that represent the analyte measurements, binary outcome measure, and which category of the outcome measure is considered a “positive” (Figure 6A and B). The results of the analysis are then generated within the “Plots and analysis” tab and include: (i) posterior draws of the expected value of the posterior predictive distribution vs the input data; (ii) a summary of the posterior distributions of the model’s parameters; and (iii) the results of a decision curve analysis (including 100 posterior draws of the expected value of the posterior predictive distribution in gray in order to display uncertainty) (Figure 6C and D).
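The net-benefit calculation underlying decision curve analysis [37, 38] can be sketched in a few lines of base R; the predicted probabilities and outcomes below are hypothetical:

```r
# Net benefit of a probabilistic model at threshold probability p_t:
# NB = TP/n - (FP/n) * p_t / (1 - p_t)
net_benefit <- function(pred_prob, outcome, p_t) {
  pos <- pred_prob >= p_t         # classified positive at this threshold
  tp  <- sum(pos & outcome == 1)  # true positives
  fp  <- sum(pos & outcome == 0)  # false positives
  n   <- length(outcome)
  tp / n - (fp / n) * (p_t / (1 - p_t))
}

p <- c(0.10, 0.40, 0.80, 0.90, 0.20)  # hypothetical predicted probabilities
y <- c(0, 1, 1, 1, 0)                 # hypothetical observed outcomes
net_benefit(p, y, p_t = 0.5)          # 2 true positives, 0 false: 0.4
```

Plotting net benefit across a grid of threshold probabilities, alongside the “treat all” and “treat none” strategies, yields the decision curve.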
Discussion and conclusions
The verification of assays in clinical laboratories is of the utmost importance to ensure that given performance characteristics are achieved in a given laboratory’s hands prior to clinical use. Several documents provide guidance for the experiments required to achieve this [1], [2], [3]; however, the analysis of the produced data typically requires either the use of closed-source, proprietary software, or open-source, freely available software that can have a steep learning curve and usually requires programming experience [14, 15, 24]. Moreover, the vast majority of these software packages – particularly those that are closed-source and/or proprietary – impose the frequentist paradigm of statistical analysis on the user. This is problematic due to the pitfalls associated with these methods discussed extensively elsewhere [4], [5], [6], [7], [8], [9], [10], [11], [12], [13, 16]. As such, there is a motivation to develop an open-source, free, and accessible application that allows end users to analyze verification data using the Bayesian – in addition to the frequentist, if so desired – statistical paradigm. Here, an application is presented that fulfils these requirements and can either be installed as an R package or, as with other Shiny applications, hosted on a web server if required. The full source code is available on GitHub and, due to its development within the golem framework [28], users are easily able to adapt the code to their needs by editing or adding modules as they see fit.
- Research funding: None declared.
- Author contributions: All authors have accepted responsibility for the entire content of this manuscript and approved its submission.
- Competing interests: Authors state no conflict of interest.
- Informed consent: Not applicable.
- Ethical approval: Not applicable.
References
1. Khatami, Z, Hill, R, Sturgeon, C, Kearney, E, Breadon, P, Kallner, A. Measurement verification in the clinical laboratory: a guide to assessing analytical performance during the acceptance testing of methods (quantitative examination procedures) and/or analysers. Available from: https://www.acb.org.uk/asset/34B3F3F5%2DAF91%2D4B44%2DAF184C565EDC162B/ [Accessed 19 Jan 2023].
2. Theodorsson, E. Validation and verification of measurement methods in clinical chemistry. Bioanalysis 2012;4:305–20. https://doi.org/10.4155/bio.11.311.
3. Pum, J. A practical guide to validation and verification of analytical methods in the clinical laboratory. Adv Clin Chem 2019;90:215–81. https://doi.org/10.1016/bs.acc.2019.01.006.
4. Colling, LJ, Szűcs, D. Statistical inference and the replication crisis. Rev Philos Psychol 2021;12:121–47. https://doi.org/10.1007/s13164-018-0421-4.
5. van de Schoot, R, Depaoli, S, King, R, Kramer, B, Martens, K, Tadesse, MG, et al. Bayesian statistics and modelling. Nat Rev Methods Primers 2021;1. https://doi.org/10.1038/s43586-020-00001-2.
6. Gelman, A, Hennig, C. Beyond subjective and objective in statistics. J R Stat Soc Ser A Stat Soc 2017;180:967–1033. https://doi.org/10.1111/rssa.12276.
7. Wasserstein, RL, Schirm, AL, Lazar, NA. Moving to a world beyond “p < 0.05”. Am Stat 2019;73:1–19. https://doi.org/10.1080/00031305.2019.1583913.
8. McShane, BB, Gal, D, Gelman, A, Robert, C, Tackett, JL. Abandon statistical significance. Am Stat 2019;73:235–45. https://doi.org/10.1080/00031305.2018.1527253.
9. van Zwet, EW, Cator, EA. The significance filter, the winner’s curse and the need to shrink. Stat Neerl 2021;75:1–16. https://doi.org/10.1111/stan.12241.
10. Gelman, A, Tuerlinckx, F. Type S error rates for classical and Bayesian single and multiple comparison procedures. Comput Stat 2000;15:373–90. https://doi.org/10.1007/s001800000040.
11. Gelman, A, Carlin, J. Beyond power calculations: assessing type S (sign) and type M (magnitude) errors. Perspect Psychol Sci 2014;9:641–51. https://doi.org/10.1177/1745691614551642.
12. Gelman, A. The failure of null hypothesis significance testing when studying incremental changes, and what to do about it. Pers Soc Psychol Bull 2018;44:16–23. https://doi.org/10.1177/0146167217729162.
13. Szűcs, D, Ioannidis, JPA. When null hypothesis significance testing is unsuitable for research: a reassessment. Front Hum Neurosci 2017;11:390. https://doi.org/10.3389/fnhum.2017.00390.
14. Bürkner, PC. brms: an R package for Bayesian multilevel models using Stan. J Stat Softw 2017;80:1–28. https://doi.org/10.18637/jss.v080.i01.
15. Goodrich, B, Gabry, J, Ali, I, Brilleman, S. rstanarm: Bayesian applied regression modeling via Stan; 2022. R package version 2.21.3.
16. Wilkes, EH. A practical guide to Bayesian statistics in laboratory medicine. Clin Chem 2022;68:893–905. https://doi.org/10.1093/clinchem/hvac049.
17. Chang, W, Cheng, J, Allaire, JJ, Sievert, C, Schloerke, B, Xie, Y, et al. shiny: web application framework for R; 2022. R package version 1.7.3.
18. Posit Team. RStudio: integrated development environment for R. Boston, MA: Posit Software, PBC; 2022.
19. R Core Team. R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2022.
20. Schuetzenmeister, A, Dufey, F. VCA: variance component analysis; 2022. R package version 1.4.5.
21. Chang, W, Borges Ribeiro, B. shinydashboard: create dashboards with ‘Shiny’; 2022. R package version 0.7.2.
22. Sali, A, Attali, D. shinycssloaders: add loading animations to a ‘shiny’ output while it’s recalculating; 2022. R package version 1.0.0.
23. Merlino, A, Howard, P. shinyFeedback: display user feedback in Shiny apps; 2022. R package version 0.4.0.
24. Manuilova, E, Schuetzenmeister, A. mcr: method comparison regression; 2022. R package version 1.3.0.
25. Wickham, H, Averick, M, Bryan, J, Chang, W, McGowan, L, Francois, R, et al. Welcome to the tidyverse. J Open Source Softw 2019;4:1686. https://doi.org/10.21105/joss.01686.
26. Sievert, C. Interactive web-based data visualization with R, plotly, and shiny. New York: Chapman and Hall/CRC; 2020. https://doi.org/10.1201/9780429447273.
27. Makowski, D, Ben-Shachar, MS, Lüdecke, D. bayestestR: describing effects and their uncertainty, existence and significance within the Bayesian framework. J Open Source Softw 2019;4:1541. https://doi.org/10.21105/joss.01541.
28. Fay, C, Guyader, V, Rochette, S, Girard, C. golem: a framework for robust Shiny applications; 2022. R package version 0.3.5.
29. Gelman, A, Hill, J, Vehtari, A. Regression and other stories (Analytical Methods for Social Research). Cambridge: Cambridge University Press; 2020. https://doi.org/10.1017/9781139161879.
30. McElreath, R. Statistical rethinking: a Bayesian course with examples in R and Stan. Boca Raton, FL: CRC Press; 2020. https://doi.org/10.1201/9780429029608.
31. Vehtari, A, Gelman, A, Simpson, D, Carpenter, B, Bürkner, PC. Rank-normalization, folding, and localization: an improved R̂ for assessing convergence of MCMC. Bayesian Anal 2021;16:667–718. https://doi.org/10.1214/20-BA1221.
32. International Organization for Standardization. Medical laboratories: requirements for quality and competence (ISO Standard No. 15189:2022); 2022. Available from: https://www.iso.org/standard/76677.html.
33. Altman, DG, Bland, JM. Measurement in medicine: the analysis of method comparison studies. The Statistician 1983;32:307–17. https://doi.org/10.2307/2987937.
34. Dudoit, S, Yang, YH, Callow, MJ, Speed, TP. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Stat Sin 2002;12:111–39.
35. Harrell, F. Statistical thinking – classification vs prediction. Available from: https://www.fharrell.com/post/classification/ [Accessed 19 Jan 2023].
36. Harrell, F. Statistical thinking – clinicians’ misunderstanding of probabilities makes them like backwards probabilities such as sensitivity, specificity, and type I error. Available from: https://www.fharrell.com/post/backwards-probs/ [Accessed 19 Jan 2023].
37. Vickers, AJ, Elkin, EB. Decision curve analysis: a novel method for evaluating prediction models. Med Decis Making 2006;26:565–74. https://doi.org/10.1177/0272989x06295361.
38. Vickers, AJ, van Calster, B, Steyerberg, EW. A simple, step-by-step guide to interpreting decision curve analysis. Diagn Progn Res 2019;3. https://doi.org/10.1186/s41512-019-0064-7.
© 2023 the author(s), published by De Gruyter, Berlin/Boston
This work is licensed under the Creative Commons Attribution 4.0 International License.