
Decision support or autonomous artificial intelligence? The case of wrong blood in tube errors

  • Christopher-John L. Farrell

Abstract

Objectives

Artificial intelligence (AI) models are increasingly being developed for clinical chemistry applications; however, it is not understood whether human interaction with the models, which may occur once they are implemented, improves or worsens their performance. This study examined the effect of human supervision on an artificial neural network trained to identify wrong blood in tube (WBIT) errors.

Methods

De-identified patient data for current and previous (within seven days) electrolyte, urea and creatinine (EUC) results were used in the computer simulation of WBIT errors at a rate of 50%. Laboratory staff volunteers reviewed the AI model’s predictions, and the EUC results on which they were based, before making a final decision regarding the presence or absence of a WBIT error. The performance of this approach was compared to that of the AI model operating without human supervision.

Results

Laboratory staff supervised the classification of 510 sets of EUC results. This workflow identified WBIT errors with an accuracy of 81.2%, sensitivity of 73.7% and specificity of 88.6%. However, the AI model classifying these samples autonomously was superior on all metrics (p-values<0.05), including accuracy (92.5%), sensitivity (90.6%) and specificity (94.5%).

Conclusions

Human interaction with AI models can significantly alter their performance. For computationally complex tasks such as WBIT error identification, best performance may be achieved by autonomously functioning AI models.

Introduction

There is burgeoning interest in the application of artificial intelligence (AI) to laboratory medicine. Many promising AI models have been reported [1, 2], but the effect that laboratory staff have when they interact with these models is not understood. This interaction occurs when AI models are implemented in ‘decision support’ mode, where AI predictions are presented to laboratory staff who may add human insight to the model’s predictions before final decisions are reached [3]. It is unknown whether this human interaction strengthens or weakens AI predictions. If human interaction is a weakening influence, then it would be preferable to have AI models make decisions autonomously. It is also possible that human interaction is only advantageous for a subset of cases, such as those the AI model classifies with low confidence. In such cases, ‘semiautonomous’ modes of operation may be optimal, in which AI models refer samples for human review when their confidence is below a certain threshold but act autonomously when confidence is higher.

The issue has been brought into sharper focus recently with a report of AI models surpassing human-level performance on a task usually performed by laboratory staff, the post-analytical identification of wrong blood in tube (WBIT) errors [4]. These errors occur when the blood in a specimen tube is not from the patient identified on the label. WBIT errors have a frequency of one in every 1,300–3,500 specimens [5] and are responsible for approximately one third of laboratory-related patient safety incidents [6]. Because results from samples with WBIT errors are consistent with valid patient results, once these errors have evaded pre-analytical checks, they are notoriously difficult to identify. Delta checks are generally used to screen for these errors post-analytically, with results failing delta checks referred to laboratory staff to decide on further actions based on their assessment of the likelihood of error [7, 8].

Uncertainty exists as to the optimal workflow for WBIT error detection. Should the AI models that have been developed [4, 9, 10] be implemented in decision support mode so that laboratory staff contribute to final decision making, or is better performance achieved if they operate autonomously? In the following report, this question was addressed using an artificial neural network (ANN) model previously developed to identify WBIT errors on the basis of patients’ current and previous (within seven days) electrolyte, urea and creatinine (EUC) results [4]. The performance of the model in decision support mode was compared to performance in autonomous mode. The potential value of a semiautonomous mode was also evaluated by comparing the accuracy of the two approaches on samples the AI model classified with lower confidence.

Materials and methods

Computer simulation of WBIT errors

De-identified patient sodium, potassium, chloride, bicarbonate, urea and creatinine results were extracted from the laboratory information system of a public hospital in Sydney, Australia for the period 01/01/2019 to 26/10/2020. Patient age, sex and previous EUC results were also obtained. Samples were excluded if there were no previous results within seven days or if results for all six analytes on both episodes of testing were not available. A total of 141,396 sets of EUC results remained after applying the exclusion criteria.
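
A minimal sketch of these exclusions is shown below, assuming a data frame with one row per current sample and previous-sample results joined on as `prev_`-prefixed columns; all column names are hypothetical, as the study’s code is not published:

```r
library(dplyr)

# `raw_results` is an assumed data frame; `collected`/`prev_collected` are
# assumed to be date-times and the analyte columns numeric results
euc <- raw_results %>%
  filter(!is.na(prev_collected),                          # a previous result exists...
         as.numeric(difftime(collected, prev_collected,
                             units = "days")) <= 7) %>%   # ...within seven days
  filter(if_all(c(sodium, potassium, chloride, bicarbonate, urea, creatinine,
                  prev_sodium, prev_potassium, prev_chloride,
                  prev_bicarbonate, prev_urea, prev_creatinine),
                ~ !is.na(.x)))     # all six analytes present on both episodes
```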

The simulation of WBIT errors involved randomly assigning 50% of results to have an error, with a record kept of which samples these were. WBIT errors were simulated by randomly switching patients’ current EUC results with those from another patient. The data were then randomly allocated into three subgroups: 80% (n=113,116) for AI model training, 10% (n=14,140) for AI model development and 10% (n=14,140) for performance testing.
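
A hedged sketch of this simulation and split, continuing from the `euc` data frame above (the seed and the permutation approach to switching are assumptions made for illustration):

```r
set.seed(42)                  # arbitrary seed for this sketch
n    <- nrow(euc)
wbit <- sample(n) <= n / 2    # randomly flag 50% of samples; kept as the error record

# Simulate each flagged error by swapping *current* results between flagged
# samples (a simple permutation; a self-swap check is omitted here)
cur_cols <- c("sodium", "potassium", "chloride", "bicarbonate", "urea", "creatinine")
idx <- which(wbit)
euc[idx, cur_cols] <- euc[sample(idx), cur_cols]

# Random 80/10/10 allocation: training, development and performance testing
subset_lbl <- sample(cut(seq_len(n), breaks = n * c(0, 0.8, 0.9, 1),
                         labels = c("train", "dev", "test")))
```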

AI model development and evaluation

The AI model used was an ANN developed previously for WBIT error detection [4]. It is described in detail here because the short report in which it first appeared does not provide a description. The ANN model was trained with the inputs age, sex, current and previous EUC results, as well as absolute and percentage delta values for each analyte. The final model architecture and parameters were selected to maximize accuracy on the development dataset. The model consisted of five hidden layers with 360, 360, 180, 90 and 45 units, respectively. Dropout regularization was employed for hidden layers one to three, with a dropout rate of 0.3 for each. The hidden layers used rectified linear activation and the two output units used softmax activation. Training was performed over 25 epochs with a batch size of 128. Binary cross-entropy loss was minimized using the Adam optimizer with default parameters.
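
A sketch of this architecture in the R ‘keras’ package used by the study follows; the input dimension of 26 is an assumption derived from the listed inputs (age, sex, six current and six previous results, and six absolute and six percentage deltas), and the training object names are placeholders:

```r
library(keras)

n_features <- 26  # assumed: age + sex + 6 current + 6 previous
                  # + 6 absolute deltas + 6 percentage deltas

model <- keras_model_sequential() %>%
  layer_dense(units = 360, activation = "relu", input_shape = n_features) %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 360, activation = "relu") %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 180, activation = "relu") %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 90, activation = "relu") %>%
  layer_dense(units = 45, activation = "relu") %>%
  layer_dense(units = 2, activation = "softmax")  # 'correct' vs 'WBIT error'

model %>% compile(
  loss      = "binary_crossentropy",
  optimizer = optimizer_adam(),   # default parameters, as reported
  metrics   = "accuracy"
)

# x_* are numeric matrices, y_* one-hot matrices with two columns (assumed names)
history <- model %>% fit(x_train, y_train,
                         epochs = 25, batch_size = 128,
                         validation_data = list(x_dev, y_dev))
```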

The final model, once selected, was run on the performance testing dataset and two outputs were stored for each set of results: the prediction of whether an error was present and the probability the model assigned to that prediction. The performance of the model operating autonomously was the benchmark against which decision support mode was evaluated; therefore, autonomous mode was assessed only on the randomly selected subset of the performance testing dataset that was also classified in decision support mode, as described below.
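
As a sketch, storing those two outputs per test-set case might look like the following (object names carried over from the sketches above are assumptions):

```r
probs      <- model %>% predict(x_test)   # columns: P(correct), P(WBIT error)
prediction <- ifelse(probs[, 2] >= 0.5,
                     "wrong blood in tube error", "correct results")
confidence <- apply(probs, 1, max)        # probability assigned to the chosen class
```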

Evaluation of decision support mode

The performance of the AI model in decision support mode was assessed using a purpose-built web application (app), which allowed anonymous participation of volunteers. Volunteers were invited from clinical chemistry staff working in a network of public hospital laboratories in New South Wales, Australia. For each volunteer, the app randomly selected 10 sets of results from the performance testing dataset and presented each in turn. Volunteers were provided with patient age and sex, the time and date of collection of the current and previous samples, plus the EUC results for both samples. These were displayed by the app in the format used by the network’s laboratory information system. Adjacent to the patient results, a text box with the heading ‘AI model’s prediction’ displayed a prediction (‘correct results’ or ‘wrong blood in tube error’) and the model’s ‘percentage confidence’ in the prediction. The latter metric was the probability the model had assigned to the prediction. Prior to performing the task, volunteers were informed that the AI model was approximately 92% accurate and that it was being provided to them as a decision support tool.
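
A minimal ‘shiny’ sketch of such a review screen is given below; the layout, widget labels and dummy case are assumptions, since the original app mirrored the network’s laboratory information system format:

```r
library(shiny)

# Dummy case for illustration; the real app drew 10 random cases per volunteer
# from the performance testing dataset
case <- list(
  results = data.frame(
    Analyte  = c("Sodium", "Potassium", "Chloride", "Bicarbonate", "Urea", "Creatinine"),
    Previous = c(138, 4.2, 102, 26, 5.4, 78),
    Current  = c(129, 5.1, 95, 22, 12.1, 240)
  ),
  ai_text = "Wrong blood in tube error (97% confidence)"
)

ui <- fluidPage(
  h4("Current and previous EUC results"),
  tableOutput("results"),
  wellPanel(h4("AI model's prediction"), textOutput("prediction")),
  radioButtons("decision", "Final decision:",
               choices = c("Correct results", "Wrong blood in tube error")),
  actionButton("submit", "Submit")
)

server <- function(input, output, session) {
  output$results    <- renderTable(case$results)
  output$prediction <- renderText(case$ai_text)
  observeEvent(input$submit, {
    # in the real app: record the volunteer's decision, then present the next case
  })
}

shinyApp(ui, server)
```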

The app also collected information supplied by volunteers regarding the grade at which they were employed (i.e. ‘Technical Officer’, ‘Scientific Officer’, ‘Senior Scientific Officer’, ‘Principal Scientist’, ‘Pathologist’ or ‘Other’), whether they worked full-time or part-time, the number of years they had been validating biochemistry results and the approximate proportion of their work time typically spent validating results (‘almost none’, ‘less than half’, ‘about half’, ‘more than half’ or ‘almost all’).

In addition to comparing the performance of decision support mode to autonomous mode, it was possible to compare decision support mode to laboratory staff performing the task without AI decision support, as previously reported [4].

Software and statistics

All aspects of the study were performed in the R environment using open-source packages [11]. This included the simulation of WBIT errors, the development of the web app (using the ‘shiny’ package) and the development of the AI model (‘keras’ package). Analysis of findings included the calculation of binomial confidence intervals (CIs, ‘Hmisc’ package) and construction of receiver operating characteristic (ROC) curves (‘pROC’ package).
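
As an illustration of how these packages apply here, the sketch below reproduces the confidence interval from counts reported in the Results; the ROC inputs are simulated, because the study data are not reproduced in this report:

```r
library(Hmisc)   # binconf() for binomial confidence intervals
library(pROC)    # roc() and auc() for ROC analysis

# 95% CI for decision support accuracy: 414 of 510 classifications correct
binconf(x = 414, n = 510)   # ~0.812 (0.776-0.843), matching the Results

# ROC analysis demonstrated on simulated probabilities
set.seed(1)
truth <- rbinom(500, 1, 0.5)                                    # simulated labels
prob  <- ifelse(truth == 1, rbeta(500, 4, 2), rbeta(500, 2, 4)) # simulated scores
roc_obj <- roc(response = truth, predictor = prob)
auc(roc_obj)   # the study's autonomous model achieved AUC 0.980 (0.971-0.990)
```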

The parameters of accuracy, sensitivity and specificity were calculated using the record of simulated switches as the gold standard for whether or not a WBIT error was present. Accuracy of each method was calculated as the number of samples classified correctly divided by the total number of samples reviewed. Sensitivity was calculated as the number of WBIT errors classified correctly (i.e. ‘true positives’) divided by the total number of WBIT errors (‘true positives’ + ‘false negatives’). Specificity was calculated as the number of samples without a WBIT error classified correctly (‘true negatives’) divided by the total number of samples without a WBIT error (‘true negatives’ + ‘false positives’). Each of these metrics was multiplied by 100 and expressed as a percentage.
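
These definitions translate directly into a small helper; a minimal sketch, in which the function name and logical-vector convention are choices made for illustration:

```r
# `predicted` and `truth` are logical vectors; TRUE indicates a WBIT error
classification_metrics <- function(predicted, truth) {
  tp <- sum(predicted & truth);   fn <- sum(!predicted & truth)
  tn <- sum(!predicted & !truth); fp <- sum(predicted & !truth)
  c(accuracy    = 100 * (tp + tn) / length(truth),
    sensitivity = 100 * tp / (tp + fn),
    specificity = 100 * tn / (tn + fp))
}
```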

Statistical testing for significance was performed using chi-squared tests, with the exception of assessing the correlation between the number of years of volunteer experience and decision support accuracy, for which a t-test of the linear regression slope was used. Ethics review was not required because the study met the National Health and Medical Research Council criteria for a quality improvement activity [12].
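
The headline accuracy comparison can be reproduced from counts reported in the Results; the exact form of the tests below (a 2×2 chi-squared without continuity correction, and a hypothetical volunteer data frame for the slope test) is an assumption, since the study’s code is not published:

```r
# Decision support 414/510 correct (81.2%) vs autonomous 472/510 (92.5%)
chisq.test(matrix(c(414, 96, 472, 38), nrow = 2),
           correct = FALSE)   # p ~ 8e-8, consistent with the reported p<10^-7

# Hypothetical volunteer-level data illustrating the slope t-test
volunteers <- data.frame(years    = c(1, 4, 9, 15, 22, 31),
                         accuracy = c(80, 90, 70, 80, 90, 80))
summary(lm(accuracy ~ years, data = volunteers))  # t-test on the `years` coefficient
```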

Results

Performance of decision support mode

Fifty-one laboratory staff volunteered to participate in the study: Technical Officers (n=19), Scientific Officers (n=20), Senior Scientific Officers (n=10) and ‘Other’ (n=2). Forty-eight were full-time employees and three were part-time. Volunteers had a median of 10 years of experience validating biochemistry results (range 0–40 years). The most frequent proportion of work time spent validating results was ‘less than half’, but the full range of options was represented, with the least frequent response being ‘almost all’ (n=3).

The decision support workflow classified 510 sets of results. Accuracy was 81.2% (95% CI 77.6–84.3%; Table 1). Decision support performance was not altered by the characteristics of the volunteers providing supervision: employment grade, full-time or part-time employment, number of years’ experience and the proportion of the workday typically spent validating results did not influence accuracy (p-values >0.1). The sensitivity and specificity of decision support mode are plotted in ROC space in Figure 1.

Table 1:

The performance of two modes of implementation of an artificial intelligence model for the detection of wrong blood in tube errors. In decision support mode laboratory staff reviewed the model’s predictions, and the results on which they were based, before making a final decision as to whether or not error was present. In autonomous mode there was no human contribution to decision making.

Parameter     Decision support mode   Autonomous mode   p-Value for difference
Accuracy      81.2%                   92.5%             <10⁻⁷
Sensitivity   73.7%                   90.6%             <10⁻⁶
Specificity   88.6%                   94.5%             0.02
Figure 1: Receiver operating characteristic curve of an artificial intelligence model functioning autonomously for the identification of wrong blood in tube errors. Also plotted are the point estimates for sensitivity and specificity for the model operating in decision support mode (73.7% and 88.6%, respectively) and autonomous mode (90.6% and 94.5%, respectively).

The performance of decision support mode was also compared to the previously reported performance of volunteers performing the task without AI decision support [4]. There was no difference between the accuracy of decision support mode and unassisted human accuracy on the task (77.8%; p=0.11).

Performance of autonomous mode

The accuracy of the AI model in autonomous mode was 92.5% (89.9–94.5%), significantly higher than the accuracy of decision support mode (p<10⁻⁷). Sensitivity and specificity were also superior to decision support mode (Table 1). Figure 1 plots the sensitivity and specificity of autonomous mode in ROC space, as well as the ROC curve of autonomous mode, which had an area under the curve of 0.980 (0.971–0.990).

Further analysis was performed on the 38 (out of 510) samples that were incorrectly classified by autonomous mode. Among these samples, there was no significant bias of the AI model toward over- or under-predicting error: it incorrectly predicted the presence of error on 14 of these samples and the absence of error on 24 (p=0.14). The performance of decision support mode on these 38 samples was also assessed, to determine whether human supervision could identify the model’s incorrect predictions. The accuracy of decision support mode for these samples was much worse than chance (p<0.001), with only eight (21%) classified correctly.

Assessment of semiautonomous mode

The probabilities assigned by the AI model to its predictions were divided into quintiles and the performance of decision support and autonomous modes was compared within each quintile (Figure 2). These probabilities indicated how confident the model was in each of its predictions: as the quintile of prediction probability decreased, the model was less confident. It was therefore possible to determine whether human supervision became beneficial as AI confidence declined. However, at no level of AI model confidence did the addition of laboratory staff review improve the accuracy of WBIT error identification. In fact, the accuracy of decision support mode was significantly lower than that of autonomous mode for all quintiles (p-values <0.01).
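
A sketch of this quintile analysis on simulated data follows; the real analysis used the model’s stored prediction probabilities and the per-sample correctness of each mode, which are not reproduced here:

```r
set.seed(1)
prob         <- runif(510, 0.5, 1)      # simulated prediction probabilities
correct_auto <- runif(510) < prob       # accuracy tracks confidence (illustrative)
correct_ds   <- runif(510) < (prob - 0.1)

quintile <- cut(prob, quantile(prob, seq(0, 1, 0.2)),
                include.lowest = TRUE, labels = paste0("Q", 1:5))
round(100 * tapply(correct_auto, quintile, mean), 1)  # autonomous accuracy per quintile
round(100 * tapply(correct_ds,   quintile, mean), 1)  # decision support accuracy per quintile
```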

Figure 2: Accuracy of an artificial intelligence (AI) model operating in decision support mode and autonomous mode for the identification of wrong blood in tube errors. The data have been grouped by quintiles of the model’s prediction probability as a marker of its prediction confidence. The x-axis presents decreasing prediction probability: the quintile on the right represents samples the AI model classified with least confidence. Asterisks denote statistically significant differences (p<0.05) between the two approaches for each quintile.

Discussion

These findings provide preliminary evidence to suggest that human interaction with an AI model performing a computationally complex task worsens the performance of the model. AI in decision support mode was inferior to autonomous mode for the detection of WBIT errors on all metrics. Further, human supervision in the decision support workflow did not provide an effective safety net on the occasions the AI model made an incorrect prediction. Decision support mode was significantly worse than chance for these samples, correctly classifying only 21%.

The study also provided evidence against the use of AI in a semiautonomous mode for WBIT error detection. The performance of decision support mode remained inferior to autonomous mode across the range of prediction probabilities of the AI model. It was observed that the performance of both modes of operation declined as the AI model’s confidence decreased. This suggests that the samples to which the AI model assigned lower prediction confidence were inherently more challenging to classify, and that human insight was unable to improve upon the AI prediction even for these challenging samples.

In this study, human supervision of the AI model weakened its performance to the extent that it was no better than laboratory staff working without AI assistance. Volunteers were given brief written instructions; however, no training in the use of AI as a decision support tool was provided. Training and greater familiarization of staff with the AI system may improve the performance of decision support mode, and this remains an issue for further investigation. Considered more positively, the deployment of AI in decision support mode was no worse than baseline human performance. It has been suggested that AI models initially be implemented in parallel to existing procedures to give laboratory staff the opportunity to gain confidence in them [13]. This study suggests that there would be no harm in giving staff access to the model’s predictions for a period prior to deploying it in autonomous mode.

The task investigated in the study was computationally complex. It involved considering changes in six analytes, along with the time difference between results, in the context of the age and sex of the patient. AI models generally excel on prediction tasks that require complex computation and identification of subtle patterns. In contrast, it is likely that in making predictions based on these parameters, staff will employ relatively simple heuristics or even base decisions on intuition. It may be that allowing AI models to operate autonomously is preferable for computationally complex tasks in the laboratory more generally.

The task examined is only a single step in a larger workflow for WBIT error detection. The study results, therefore, do not suggest that laboratory staff should be removed from the workflow entirely, but rather that they should focus their attention on the many functions the model cannot perform. For example, once the model has flagged a likely WBIT error, staff may review the sample labeling, discuss the circumstances of the collection with the staff involved, arrange further testing on the current and previous samples (e.g. blood grouping) and/or arrange collection of a new sample.

This study had several limitations. Firstly, WBIT errors were simulated by the random switching of patient results. This is expected to replicate these errors reasonably well [7, 9, 10], but fails to incorporate non-random features of these errors that may occur in practice. Secondly, the task evaluated did not replicate the training and experience of laboratory staff, who encounter WBIT errors at a much lower frequency and have access to more sample-related information when reviewing results. This may have put the performance of decision support mode at a disadvantage compared to autonomous mode. Thirdly, a sense of competition while performing the simulated task may have predisposed some volunteers to disagree with the model’s predictions. Finally, any strategy comparing current to previous results will be unable to identify WBIT errors for patients who have only a single episode of testing. Therefore, vigilant application of appropriate collection procedures and pre-analytical checks remains critical.

AI is in the earliest stages of its development in healthcare. It is likely that as new models continue to be developed, they will challenge established paradigms of laboratory workflows. This study highlights that laboratory staff interaction with an AI model may markedly alter the model’s effectiveness. Prior to deploying a model, therefore, consideration should be given to the best mode of implementation for the particular task. In the case of WBIT errors, laboratorians should consider developing and implementing autonomously functioning AI models into routine workflows.


Corresponding author: Christopher-John L. Farrell, Department of Biochemistry, New South Wales Health Pathology, Nepean Blue Mountains Pathology Service, Nepean Hospital, Derby St, Penrith, NSW, 2750, Australia, Phone: +61 2 4734 3667, Fax: +61 2 4734 4468, E-mail:

Acknowledgments

The author wishes to thank Dr. Adam Polkinghorne, NSW Health Pathology, for assistance with preparing this manuscript and the volunteers who participated in the study.

  1. Research funding: None declared.

  2. Author contributions: All authors have accepted responsibility for the entire content of this manuscript and approved its submission.

  3. Competing interests: Authors state no conflict of interest.

  4. Informed consent: Not applicable.

  5. Ethical approval: Not applicable.

References

1. Cabitza, F, Banfi, G. Machine learning in laboratory medicine: waiting for the flood? Clin Chem Lab Med 2018;56:516–24. https://doi.org/10.1515/cclm-2017-0287.

2. Ronzio, L, Cabitza, F, Barbaro, A, Banfi, G. Has the flood entered the basement? A systematic literature review about machine learning in laboratory medicine. Diagnostics 2021;11:372. https://doi.org/10.3390/diagnostics11020372.

3. Hassani, H, Silva, ES, Unger, S, TajMazinani, M, Mac Feely, S. Artificial intelligence (AI) or intelligence augmentation (IA): what is the future? AI 2020;1:143–55. https://doi.org/10.3390/ai1020008.

4. Farrell, CJ. Identifying mislabelled samples: machine learning models exceed human performance. Ann Clin Biochem 2021 Jul 16. https://doi.org/10.1177/00045632211032991 [Epub ahead of print].

5. Bolton-Maggs, PH, Wood, EM, Wiersum-Osselton, JC. Wrong blood in tube – potential for serious outcomes: can it be prevented? Br J Haematol 2015;168:3–13. https://doi.org/10.1111/bjh.13137.

6. Dunn, EJ, Moga, PJ. Patient misidentification in laboratory medicine: a qualitative analysis of 227 root cause analysis reports in the Veterans Health Administration. Arch Pathol Lab Med 2010;134:244–55. https://doi.org/10.5858/134.2.244.

7. Randell, EW, Yenice, S. Delta checks in the clinical laboratory. Crit Rev Clin Lab Sci 2019;56:75–97. https://doi.org/10.1080/10408363.2018.1540536.

8. Schifman, RB, Talbert, M, Souers, RJ. Delta check practices and outcomes: a Q-Probes study involving 49 health care facilities and 6541 delta check alerts. Arch Pathol Lab Med 2017;141:813–23. https://doi.org/10.5858/arpa.2016-0161-cp.

9. Rosenbaum, MW, Baron, JM. Using machine learning-based multianalyte delta checks to detect wrong blood in tube errors. Am J Clin Pathol 2018;150:555–66. https://doi.org/10.1093/ajcp/aqy085.

10. Jackson, CR, Cervinski, MA. Development and characterization of neural network-based multianalyte delta checks. J Lab Precis Med 2020;5. https://doi.org/10.21037/jlpm.2020.02.03.

11. R Core Team. R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2020.

12. National Health and Medical Research Council. Ethical considerations in quality assurance and evaluation activities. Available from: https://www.nhmrc.gov.au/about-us/resources/ethical-considerations-quality-assurance-and-evaluation-activities [Accessed 26 Jul 2021].

13. Paranjape, K, Schinkel, M, Hammer, RD, Schouten, B, Nannan Panday, RS, Elbers, PWG, et al. The value of artificial intelligence in laboratory medicine: current opinions and barriers to implementation. Am J Clin Pathol 2021;155:823–31. https://doi.org/10.1093/ajcp/aqaa170.

Received: 2021-08-04
Accepted: 2021-10-21
Published Online: 2021-11-01
Published in Print: 2022-11-25

© 2021 Walter de Gruyter GmbH, Berlin/Boston
