A scoping review on pediatric sepsis prediction technologies in healthcare
We identified 36,757 records. After removing 9,586 duplicates and screening titles and abstracts, 163 records were assessed for inclusion, resulting in 27 articles in the final review (Supplementary Fig. 1). Most articles (22 [81%]) involved single-center datasets (Table 1). Five (19%) were multi-site studies: one combined datasets from multiple continents, including low-middle-income countries43, and four (15%) used more than one healthcare site within North America or Australia44,45,46. Seventeen (63%) articles used development datasets from North America, six (22%) from Asia, three (11%) from Europe, two (7%) from Africa, and two (7%) from Oceania. Included healthcare settings were mostly emergency departments (17 [63%]) and intensive care units (15 [56%]). Six (22%) articles included inpatient units. The completed extraction template is included in Supplementary Data 1, with the categorization and organization of the model features in Supplementary Data 2.
Endpoint definitions
Most articles (20 [74%]) focused on predicting sepsis rather than severe sepsis or septic shock. International consensus definitions and guidelines were used to define sepsis endpoints in most articles (16 [59%] of 27), including Goldstein et al.47, Rhodes et al.48, Singer et al.22, and Weiss et al.14, as well as criteria from Eisenberg et al.49, Scott et al.50, and Sepanski et al.51, with modifications (Table 2)52,53. Other definitions included treatment or diagnostic codes (8 [30%] of 27). Less frequent approaches were mortality or extracorporeal membrane oxygenation use, Delphi processes, and outcomes from another screening algorithm43,54,55,56. One article led to the most recent international consensus criteria for pediatric sepsis and septic shock: the Phoenix Sepsis Score43.
We noted variability in how clinicians defined sepsis46,57,58. We also noted changes in treatment practices between data collection and model development, as well as expected changes in diagnostic definitions that would require model re-validation59,60. One article mentioned the risk of including overtreated patients through the use of intention- and treatment-based definitions61.
Model development approaches
More than half of the articles (16 [59%]) used logistic regression (Table 3)45,46,53,54,55,56,57,58,60,61,62,63,64,65,66,67,68. Others used gradient boosting (6 [22%])52,57,69,70,71,72 and random forest (4 [15%])55,59,60,62. Less common approaches were support vector machines or maximization (2 [7%])62,72, stacked or classification and regression tree modelling (2 [7%])43,62, tree-augmented or naïve Bayes (3 [11%])58,62,72, elastic net regularization (1 [4%])44, neural networks (1 [4%])55, and empirical derivations for an age- and temperature-adjusted index score (1 [4%])73.
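To make the two most common model families concrete, the following is a minimal Python sketch, not drawn from any reviewed article: it fits a logistic regression and a gradient boosting classifier to synthetic vital-sign data, with hypothetical feature names and a low outcome prevalence.

```python
# Illustrative only: synthetic data and hypothetical features, not a
# reproduction of any reviewed article's model.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([
    rng.normal(120, 20, n),    # heart rate (beats/min)
    rng.normal(37.2, 0.8, n),  # temperature (degrees Celsius)
    rng.normal(95, 15, n),     # systolic blood pressure (mmHg)
])
y = rng.binomial(1, 0.05, n)   # ~5% outcome prevalence (synthetic labels)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# The two most frequent approaches across the reviewed articles.
logit = LogisticRegression(max_iter=1000).fit(X_train, y_train)
gbm = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
print(logit.predict_proba(X_test[:1]), gbm.predict_proba(X_test[:1]))
```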
Patient demographics
For most articles (19 [70%]), we extracted the sample size as the number of patients, with a median of 1,238 (IQR 289–4,403). Some reported the number of visits (3 [11%]), encounters (5 [19%]), or admissions (2 [7%]), with a combined median of 35,330 (IQR 2,464–141,510). Many articles (16 [59%]) reported median age statistics, ranging from 13.1 months to 10 years. One article specifically tested its models on different age groups, such as one-month to one-year-olds and those greater than one year old58. Twenty-two articles (81%) reported patient sex, and 10 (37%) included race or ethnicity data. Ten articles (37%) referenced the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) guidelines43,44,45,46,54,58,59,61,64,65.
Most articles (20 [74%]) had a class imbalance in their datasets, with a low prevalence of sepsis-related outcomes (0.05% to 19.38%). Five articles (19%) used datasets with more than half of the patients classified with a sepsis-related outcome, with a prevalence ranging from 50.80% to 74.50% and sample sizes ranging from 65 to 66764,65,66,68,72. One article normalized a pediatric sepsis dataset to an adult sepsis dataset because of the lack of readily available data71.
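The articles varied in how, or whether, they addressed this imbalance. One generic remedy, sketched below for illustration and not attributed to any reviewed article, is to reweight the rare class during fitting.

```python
# Hedged sketch: synthetic data; class weighting is one common remedy for
# low outcome prevalence, not a method claimed by the reviewed articles.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))
y = rng.binomial(1, 0.05, 400)  # rare sepsis-like label (~5% prevalence)

# class_weight="balanced" scales each class inversely to its frequency,
# so the minority (sepsis) class is not swamped by negatives in the loss.
model = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
```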
Other patient risk factor characteristics were included in 16 (59%) articles. For example, one study focused on septic shock outcomes in a post-chemotherapy population63, and another included Kawasaki disease as a differentiating outcome from sepsis68. Other health and socio-demographic risk factors characterized across studies related to immunodeficiency or immunosuppression56,61, blastomas, bone marrow transplantation, aplastic anemia, organ transplantation57,67, healthcare utilization and diagnoses history44,45,69, hospital category (e.g., quaternary, dedicated pediatric, mixed) or resource setting43,46, surgical status59, complex and chronic conditions and comorbidities43,44,45,56,70, history of prematurity56,58, insurance status44,45,61,62, weight or malnutrition43,46,54, and medical technology needs43,44,45,59,61.
Predictive features
The minimum number of features used in any article was three; the maximum was 107. The average number of features used across all models was 17.3 (SD = 19.6). We extracted ranked feature information from 29 individual models across 17 (63%) of the 27 articles in this review. For models defining a sepsis endpoint, the six highest-ranking predictive features were age, platelet count, temperature, immunocompromised status, heart rate, and serum lactate (Supplementary Fig. 2). For severe sepsis endpoints, the six highest-ranking predictive features were mean heart rate, maximum diastolic blood pressure, age, maximum heart rate, minimum systolic blood pressure, and heart rate (Supplementary Fig. 3). For septic shock endpoints, the six highest-ranking predictive features were serum lactate, platelet count, maximum heart rate, maximum mean arterial pressure, Glasgow Coma Scale score, and respiratory sequential organ failure assessment score (Supplementary Fig. 4).
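Rankings like these are typically read from fitted models. As a hedged illustration, with synthetic data and hypothetical feature names echoing those above, a random forest's impurity-based importances can be sorted as follows:

```python
# Illustrative only: how ranked feature information can be extracted.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
features = ["age_months", "platelet_count", "temperature",
            "heart_rate", "serum_lactate"]          # hypothetical names
X = rng.normal(size=(500, len(features)))
y = rng.binomial(1, 0.1, 500)                        # synthetic outcome

forest = RandomForestClassifier(random_state=0).fit(X, y)
ranked = sorted(zip(features, forest.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```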
When categorized broadly (Fig. 1), blood pressure had the highest category ranking for each sepsis-related endpoint, followed by heart rate features for severe sepsis and septic shock, and thermoregulation for sepsis (e.g., temperature, fever, and hypothermia). Hematologic laboratory values (e.g., hematocrit, white blood cell count, platelet count, and basophils), immunological risk factors (e.g., up-to-date immunization, oncological comorbidity, cancer relapse, and chemotherapy), oxygen requirement (e.g., oxygen saturation and fraction of inspired oxygen), and demographics (e.g., age and home zip code) were also among the higher-ranking feature groups across the sepsis-related endpoints. Laboratory values were used in sepsis and septic shock models but not in the severe sepsis models included in this review.
Most articles without ranked features used fewer than 20 total features, except one58. One article ranked features for Kawasaki disease as a differentiating outcome from sepsis68. When examining the frequency of the remaining top 20 feature groups from all developed models, blood pressure, heart rate, and thermoregulation were the most frequently occurring for each sepsis-related endpoint (Fig. 2). Hematologic laboratory values were the most frequently occurring laboratory feature group for sepsis and septic shock, followed by lactate. Electrolyte laboratory values, perfusion (e.g., capillary refill), parent and caregiver factors, and anthropometric and nutritional factors only appeared in sepsis modelling approaches. The renal function indicator feature (i.e., urine output) only appeared in septic shock modelling approaches.
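Frequency summaries of this kind amount to tallying categorized features across models; a small sketch, with invented stand-in groupings:

```python
# Illustrative tally in the spirit of Fig. 2; the lists are invented
# stand-ins for the feature groups extracted from each model.
from collections import Counter

model_feature_groups = [
    ["blood pressure", "heart rate", "thermoregulation"],
    ["blood pressure", "hematologic labs", "lactate"],
    ["heart rate", "blood pressure", "demographics"],
]
frequency = Counter(group for model in model_feature_groups for group in model)
print(frequency.most_common(3))  # blood pressure appears most often here
```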
Articles mentioned feature limitations such as having routine or non-routine access to laboratory or diagnostic information44,45,46,53,59,69,70,74, differences in handling or excluding missing features or documentation45,57,58,62,65, and concerns about human error43,44,69,73. We also noted slightly different language for similar features (e.g., “perfusion” and “capillary refill”), feature exclusion due to reliability concerns63, including only the most concerning feature values within a time range58, using qualitative variables65, and not knowing a patient’s treatment time74.
Performance metrics
The average area under the receiver operating characteristic curve (AUROC) for sepsis endpoint logistic regression-based models developed into a score-based tool was 0.84. For the linear and tree-based sepsis models, the average AUROCs were 0.81 and 0.85, respectively. For the severe sepsis models, the average AUROCs for the linear and tree-based approaches were 0.74 and 0.84, respectively. For septic shock models, the average AUROCs for logistic regression-based models developed into a score-based tool, linear models, and tree-based models were 0.81, 0.83, and 0.89, respectively, with the latter heavily weighted by two articles with AUROCs above 0.90 at multiple prediction time points57,72.
Few articles (5 [19%]) reported the area under the precision-recall curve (AUPRC)43,61,62,69,72. Three articles (11%) reported AUPRCs less than 0.50 for sepsis, severe sepsis, and septic shock, using stochastic gradient boosting, regression-based scoring tools, or random forest methods43,62,69. Two articles (7%) reported AUPRCs higher than 0.90 for a logistic regression-based scoring tool and gradient boosting model for sepsis and septic shock61,72. Most articles reported sensitivity (20 [74%]) and specificity (19 [70%]). Some reported positive predictive value (16 [59%]), negative predictive value (12 [44%]), positive or negative likelihood ratios (4 [15%]), and F1 scores (3 [11%]).
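The reported metrics can all be computed from labels, predicted scores, and a decision threshold; a hedged sketch on placeholder values:

```python
# Illustrative only: placeholder labels and scores, not data from the review.
import numpy as np
from sklearn.metrics import (average_precision_score, confusion_matrix,
                             f1_score, roc_auc_score)

y_true = np.array([0, 0, 0, 0, 1, 0, 1, 0, 0, 1])
y_score = np.array([0.10, 0.20, 0.15, 0.55, 0.80, 0.40, 0.60, 0.05, 0.20, 0.35])
y_pred = (y_score >= 0.5).astype(int)  # example decision threshold

auroc = roc_auc_score(y_true, y_score)            # threshold-free discrimination
auprc = average_precision_score(y_true, y_score)  # AUPRC; prevalence-sensitive
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)                    # positive predictive value
npv = tn / (tn + fn)                    # negative predictive value
f1 = f1_score(y_true, y_pred)
lr_positive = sensitivity / (1 - specificity)  # positive likelihood ratio
print(auroc, auprc, sensitivity, specificity, ppv, npv, f1, lr_positive)
```

Because AUPRC depends on outcome prevalence while AUROC does not, the two can diverge sharply on the imbalanced datasets described above.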
Automation type
We categorized nine articles (33%) as “analysis automation,” which is the automation of algorithms or other calculations to inform clinicians about certain sepsis criteria being met53,55,57,62,69,70,71,73,74. We categorized 18 (67%) articles as “decision automation,” in which the article described more details about how the technology provides additional information to support sepsis suspicions, stratify risk or triage levels, identify workflow actions, or achieve greater situational awareness43,44,45,46,52,54,56,58,59,60,61,63,64,65,66,67,68,72. All articles included manual provider-entered information within an electronic health record. Two articles (7%) involved “acquisition automation” through collecting data without human interaction75, such as continuous physiological monitoring55,60.
Prediction timings
Prediction timings ranged from the onset of the sepsis-related condition52,71 and five to seven seconds after initial triage69 to between two and 12 hours before onset52,55,70,73. Other prediction timings were within 24 hours60,61,67 and up to 24 hours before sepsis-related outcome onset57. For eight (30%) articles, clinical suspicion was first required to initiate the model and provide a result44,45,46,58,59,64,65,66,72. The four (15%) remaining articles were identified as being used during a patient encounter43,62,63,68.
The prediction timings also varied with respect to other design and performance elements (Table 3). For example, one XGBoost model that could predict the onset of septic shock 24 hours in advance across over 1,200 patients with hematological malignancies in the inpatient unit, with a median age of 58 months, using 24 features, had a reported AUROC of 0.9357. One decision tree ensemble model that could predict the onset of severe sepsis four hours in advance across over 9,400 patients in the inpatient unit, with a median age of 10 years, using seven features, had a reported AUROC of 0.71852. Comparatively, the Phoenix Sepsis Criteria, a regression-model-informed, integer-based scoring tool using up to 13 variables to identify sepsis and septic shock across over 200,000 patient encounters in high- and low-income healthcare settings, with a median age of 2.6–3.7 years, had a reported AUROC of 0.71–0.96 and an AUPRC of 0.14–0.4843.
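As a generic illustration of how a regression-model-informed, integer-based scoring tool operates (this is not the Phoenix criteria; the variables, weights, scaling, and cut-off below are hypothetical), coefficients are scaled and rounded into points, and the point total is compared against a threshold:

```python
# Hypothetical example of an integer-based scoring tool; not the Phoenix
# Sepsis Score or any reviewed article's actual variables or weights.
coefficients = {            # hypothetical logistic regression log-odds weights
    "elevated_lactate": 1.2,
    "low_platelets": 0.8,
    "hypotension": 1.6,
}
points = {k: round(v / 0.4) for k, v in coefficients.items()}  # scale and round

def score(findings: set[str], threshold: int = 5) -> tuple[int, bool]:
    """Sum integer points for present findings; flag if the cut-off is met."""
    total = sum(points[f] for f in findings)
    return total, total >= threshold

print(score({"elevated_lactate", "hypotension"}))  # (7, True)
```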
Interface and interaction design
Few articles described specific details about how clinicians would interact or interface with the developed model or alerting systems. Two articles (7%) described a two-tiered notification system: an “aware” tier, with a score greater than or equal to 15, to increase situational awareness, including highlighting the score in a bright colour upon opening the patient’s electronic health record chart, and an “alert” tier, with a score greater than or equal to 30, to prompt a team-based bedside evaluation61,67. The purpose was to reduce false alerts and unnecessary and unplanned team huddles while ensuring children who met the “aware” threshold were discussed during rounds, handoffs, and planned huddles61,67. The screening scores were visualized on patient lists and on-screen banners, updated in real-time to prompt a huddle at the bedside67. The clinical variables and their scoring values contributing to the overall score were also displayed, along with provider-specific steps for calling a sepsis huddle when the “alert” tier was activated. Clinician feedback from an urban quaternary children’s hospital indicated higher compliance with the sepsis huddle process and minimal alert fatigue, but this was partly attributed to pre-implementation and pilot testing67.
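The tier logic itself reduces to threshold checks; a minimal sketch using the reported cut-offs, with the actions paraphrased from the articles and all implementation details assumed:

```python
# Sketch of the two-tiered notification logic described above (cut-offs from
# the articles; everything else is an assumption for illustration).
def notification_tier(score: int) -> str:
    if score >= 30:
        return "alert"  # prompt a team-based bedside evaluation (huddle)
    if score >= 15:
        return "aware"  # raise situational awareness in the patient's chart
    return "none"

print(notification_tier(18), notification_tier(34))  # aware alert
```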
Other articles mentioned a tier-like system design to identify risk levels or support triage of the most at-risk children54,56, support risk stratification43,46,52,56,60, identify low-risk post-chemotherapy children63, and identify children at a higher risk of hypotensive septic shock45. One article combined its data-driven sepsis decompensation risk score with a red or yellow stoplight colour to indicate patient status53, available to all emergency department staff and clinicians74. The stoplight colour specified how often a bedside huddle was needed: every 30 minutes for red and every 60 minutes for yellow74. A tracking board alerted all clinicians with a checkerboard flag if the sepsis score indicated a child at risk, suggesting the need for a bedside assessment, with the most recent risk score shown next to the patient on the tracking board for increased situational awareness74.
Clinical implementation and decision-making
Three (11%) technologies were implemented electronically53,67,74, one (4%) non-electronically46, and one (4%) in silent mode61. One-third (9 [33%]) of the articles mentioned specific human factors considerations regarding implementation44,45,46,58,59,61,67,72,74. These included descriptions of implementation within sepsis workflows46,61,67, the impact of the technology on team-based activities, including huddles and team situational awareness61,67,74, and considerations for minimizing alert fatigue61,67. Two articles found positive impacts on huddle compliance67,74, and after electronic implementation, one found a significant improvement in team-based assessments, situational awareness, patient huddles, and communication, as well as increased fluid bolus and antibiotic administration by three hours post-activation74. However, clinicians were not always blinded to the tool, which may have biased their decisions to diagnose more or fewer patients, and one study was not powered to determine the impact on patient outcomes46. Additionally, one study mentioned that early treatment or transfers may inadvertently perturb the measured outcome or lead to overusing healthcare resources67.
For decision support, two articles differentiated sepsis from other conditions, such as non-infectious systemic inflammatory response syndrome or Kawasaki disease59,68. A few articles (9 [33%]) mentioned that users would interact with their system after clinician-initiated suspicion44,45,46,58,59,64,65,66,72. Once a provider thought a child might develop sepsis or septic shock, the predictive features would be entered into a smartphone-based calculator application54,66, a web calculator59,72, or an electronic health record43,44,45. The system would then return a probability risk score or other information to support or refute their judgment.
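Calculator-style outputs of this kind typically map the entered features through a fitted logistic model to a probability; a hedged sketch with hypothetical weights and features:

```python
# Illustrative only: hypothetical weights and features, not any reviewed
# article's actual calculator.
import math

def risk_probability(features: dict[str, float],
                     weights: dict[str, float],
                     intercept: float) -> float:
    """Return the logistic-model probability for clinician-entered features."""
    logit = intercept + sum(weights[k] * v for k, v in features.items())
    return 1 / (1 + math.exp(-logit))

weights = {"heart_rate_z": 0.9, "lactate_z": 1.4}  # hypothetical coefficients
print(risk_probability({"heart_rate_z": 1.2, "lactate_z": 2.0}, weights, -3.0))
```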
While at least one of the models was not ultimately designed to be clinically implemented46, none of the articles using complex data-driven modelling explicitly described the explainability of their technologies if they were to be implemented. Notably, one article explained that “black box” models do not support direct inferences about how specific features impact outcome predictions59. Given the diversity of patient characteristics and demographics, features, prediction timings, healthcare settings, outcome definitions, and performance outcomes, we did not identify detailed considerations for providing prediction uncertainty information or evaluating clinician interpretations of data-driven predictions beyond limited measurements of huddle and treatment outcomes.
Potential biases
We noted risks of bias potentially impacting testing and validation, which may introduce uncertainties for appropriate clinical use. For example, selection bias was reported in one screening tool’s performance68, and feature data may be collected at different times across different technologies70. As some models were only applied to patients for whom clinicians suspected sepsis, this may introduce bias in performance outcomes44,46. Some patient groups, such as children with oncological diseases63, were more heavily represented in some datasets, and the models may perform better or worse if used in a different patient group52,54,55,56,60,61. One model was developed and tested on the same patient cohort65. Some models were intentionally biased toward maximum sensitivity to alert clinicians earlier than existing technologies55, and datasets may not have been specifically collected for data-driven algorithmic or AI prediction purposes52,56,57,59.