Humans (Call Dispatchers + Physicians) vs AI in Emergency Medical Dispatch, One-to-one Impact Analysis
HARMONIA is a prospective cohort study with continuous enrollment, evaluating the performance of a large language model (LLM) compared to Medical Regulation Assistants (ARM, from the French "Assistant de Régulation Médicale") and dispatch physicians at the SAS-Centre 15 of Lille University Hospital. The study uses standardized simulated call scenarios (SimSamu corpus), managed by ARM–physician pairs and simultaneously analyzed by AI. The goal is to determine whether AI can match or exceed human performance in telephone triage using the ARM triage scale.
Prospective cohort with continuous enrollment, double-blind and intra-protocol randomization
60 simulated scenarios planned
13 dispatch physicians
10 volunteer ARMs
January - March 2026
(3-month enrollment)
SAS-Centre 15
Lille University Hospital
Primary: Compare the triage performance of the LLM with that of the ARMs at the SAS-Centre 15 of CHU Lille, each measured against the gold-standard ARM triage scale.
Secondary:
Each scenario is managed by an ARM–physician pair drawn from a volunteer pool, with pair members treated as interchangeable. Matching follows a double-blind scheme with intra-protocol randomization. Each volunteer is exposed to all simulated cases, ensuring a complete crossover evaluation.
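The pairing logic can be illustrated with a minimal sketch. The text does not detail the actual allocation procedure, so everything here is an assumption: the function names, the seeded shuffle, and the choice to cycle through the smaller pool so that every volunteer sees every scenario.

```python
import random
from itertools import cycle


def pair_for_scenario(arms, physicians, rng):
    """Randomly pair ARMs and physicians for one scenario.

    Assumption: when the pools differ in size, members of the smaller
    pool are reused so that every volunteer is paired at least once.
    """
    arms = list(arms)
    physicians = list(physicians)
    rng.shuffle(arms)
    rng.shuffle(physicians)
    if len(arms) >= len(physicians):
        return list(zip(arms, cycle(physicians)))
    return list(zip(cycle(arms), physicians))


def build_schedule(scenario_ids, arms, physicians, seed=0):
    """Return {scenario_id: [(arm, physician), ...]} with a reproducible seed."""
    rng = random.Random(seed)
    return {s: pair_for_scenario(arms, physicians, rng) for s in scenario_ids}
```

Seeding the generator keeps the schedule reproducible for audit, while the pairs themselves stay unpredictable to participants.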
The gold standard is composite: strict application of the ARM triage scale to call data and expert opinion (panel of 3 blinded physicians).
The "AI" arm was operationalized using a state-of-the-art large language model (Claude family). For each scenario, the model received the gold-quality verbatim transcript of the call and, reasoning over each case independently, produced an ARM triage level (N1–N5), the identified red flags, and a clinical rationale, under a "safety-first" framing (preference for cautious over-triage in case of doubt).
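The three outputs described above (triage level, red flags, rationale) lend themselves to a small validated record. The sketch below assumes the model's answer arrives as a JSON object with keys `level`, `red_flags`, and `rationale`. This format and all names are hypothetical, since the source does not specify how predictions were stored.

```python
import json
from dataclasses import dataclass

VALID_LEVELS = ("N1", "N2", "N3", "N4", "N5")


@dataclass(frozen=True)
class TriagePrediction:
    scenario_id: str
    level: str          # one of N1..N5
    red_flags: tuple    # free-text red flags identified by the model
    rationale: str      # clinical justification


def parse_prediction(scenario_id, raw_json):
    """Parse one model answer and reject levels outside the ARM scale."""
    data = json.loads(raw_json)
    level = str(data["level"]).strip().upper()
    if level not in VALID_LEVELS:
        raise ValueError(f"unexpected triage level: {level!r}")
    flags = tuple(
        str(f).strip() for f in data.get("red_flags", []) if str(f).strip()
    )
    rationale = str(data.get("rationale", "")).strip()
    return TriagePrediction(scenario_id, level, flags, rationale)
```

Rejecting out-of-scale levels at parse time keeps malformed answers from silently entering the agreement metrics.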
The initial protocol referred to a "GPT model". The prediction set available for this first wave was produced with an LLM of the Claude family; the thesis therefore presents the results actually obtained. Evaluating a dedicated GPT pipeline on audio (rather than transcripts), with per-class probabilities, is a natural direction for future work.
The first wave covers 30 of the 60 planned scenarios. After code normalization, documented corrections, and exclusion of three ARMs with insufficient coverage (< 25/30 scenarios), 11 ARMs and 11 physicians are evaluable, totaling 328 ARM observations and 328 physician observations, matched to 30 AI predictions.
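The coverage rule (at least 25 of the 30 scenarios rated) can be sketched as a simple filter. The tuple layout `(rater_id, scenario_id, level)` and the function names are assumptions made for illustration; the study's actual cleaning code is not shown in the source.

```python
from collections import defaultdict

MIN_COVERAGE = 25  # minimum number of distinct scenarios rated, out of 30


def evaluable_raters(observations, min_coverage=MIN_COVERAGE):
    """Return the ids of raters who rated enough distinct scenarios."""
    seen = defaultdict(set)
    for rater_id, scenario_id, _level in observations:
        seen[rater_id].add(scenario_id)
    return {r for r, scenarios in seen.items() if len(scenarios) >= min_coverage}


def filter_observations(observations, min_coverage=MIN_COVERAGE):
    """Keep only observations from raters meeting the coverage threshold."""
    observations = list(observations)
    keep = evaluable_raters(observations, min_coverage)
    return [obs for obs in observations if obs[0] in keep]
```

Counting distinct scenarios rather than rows avoids inflating coverage when a rater has duplicate entries for the same scenario.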
| Metric | ARM | Physician | AI (LLM) |
|---|---|---|---|
| Exact agreement | 60.0% | 36.7% | 36.7% |
| Agreement ±1 class | 100% | 73.3% | 83.3% |
| MAE | 0.400 | 0.933 | 0.833 |
| RMSE | 0.632 | 1.265 | 1.140 |
| Bias (predicted − gold) | −0.333 | −0.733 | −0.633 |
| Quadratic weighted κ | 0.766 | 0.403 | 0.456 |
| Spearman ρ | 0.813 | 0.572 | 0.585 |
| Macro F1 | 0.550 | 0.277 | 0.262 |
Table 4 — Comparative performance of ARM / physician / AI against the gold-standard (30 matched scenarios; metrics based on the mode of human evaluations).
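The per-scenario metrics in Table 4 can be reproduced in principle with the sketch below. It assumes triage levels N1–N5 are coded as integers 1–5 and that ties in the mode are broken toward the lowest level number. Both are illustrative assumptions, as the source does not state the coding or the tie rule.

```python
from collections import Counter
from math import sqrt


def mode_rating(ratings):
    """Most frequent level; ties go to the lowest level number (assumption)."""
    counts = Counter(ratings)
    top = max(counts.values())
    return min(level for level, c in counts.items() if c == top)


def agreement_metrics(predicted, gold):
    """Exact and ±1 agreement, MAE, RMSE and bias (predicted − gold)."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("predicted and gold must be non-empty and equal length")
    diffs = [p - g for p, g in zip(predicted, gold)]
    n = len(diffs)
    return {
        "exact": sum(d == 0 for d in diffs) / n,
        "within_1": sum(abs(d) <= 1 for d in diffs) / n,
        "mae": sum(abs(d) for d in diffs) / n,
        "rmse": sqrt(sum(d * d for d in diffs) / n),
        "bias": sum(diffs) / n,
    }
```

As a consistency check on the ARM column: with 30 scenarios, 18 exact matches and 12 one-level errors give an MAE of 12/30 = 0.400 and an RMSE of √(12/30) ≈ 0.632, matching the reported values.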
| Metric (all observations) | ARM (n = 328) | Physician (n = 328) |
|---|---|---|
| Exact agreement | 50.9% | 38.4% |
| Agreement ±1 class | 95.7% | 73.8% |
| MAE | 0.534 | 0.933 |
| RMSE | 0.787 | 1.295 |
| Quadratic weighted κ | 0.694 | 0.528 |
| Spearman ρ | 0.713 | 0.588 |
| Macro F1 | 0.466 | 0.352 |
Table 5 — Agreement with the gold-standard across all human observations.
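For readers unfamiliar with the quadratic weighted κ reported in both tables, here is a minimal pure-Python sketch of the standard definition. It assumes levels coded as integers 1–5, and the function name is illustrative. It is not the study's analysis code.

```python
def quadratic_weighted_kappa(rater, gold, n_levels=5):
    """Cohen's kappa with quadratic weights w_ij = (i - j)^2 / (k - 1)^2."""
    if len(rater) != len(gold) or not gold:
        raise ValueError("rater and gold must be non-empty and equal length")
    n = len(gold)
    # Observed confusion matrix (rows: rater, columns: gold).
    observed = [[0.0] * n_levels for _ in range(n_levels)]
    for r, g in zip(rater, gold):
        observed[r - 1][g - 1] += 1
    row = [sum(observed[i]) for i in range(n_levels)]
    col = [sum(observed[i][j] for i in range(n_levels)) for j in range(n_levels)]
    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(n_levels):
        for j in range(n_levels):
            w = (i - j) ** 2 / (n_levels - 1) ** 2
            num += w * observed[i][j]
            den += w * row[i] * col[j] / n
    if den == 0:
        raise ValueError("kappa is undefined when all ratings share one level")
    return 1.0 - num / den
```

The quadratic weights penalize a two-level error four times as much as a one-level error, which is why κ separates the ARM and physician columns more sharply than exact agreement does.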
ARM triage is the most robust reference baseline (κ 0.77; ±1-class agreement 100%). The AI reaches a level of agreement comparable to physicians (κ 0.46 vs 0.40) while outperforming them on ±1-class agreement (83.3% vs 73.3%) and mean absolute error (0.833 vs 0.933), with a systematically cautious "safety-first" profile and no severe under-estimation. Notably, its largest divergences occur precisely on scenarios where both the ARM gold standard and the ARMs themselves under-triage (acute neurological deficit in a young patient, altered consciousness in a homeless person). Its error profile is thus complementary to that of the ARMs, supporting a hybrid human-AI configuration rather than substitution.
Finally, the difficulty perceived by ARMs does not predict actual error: mean perceived difficulty is identical whether the rating was correct (2.57/5) or off by at least one level (2.59/5; Mann-Whitney, p = 0.87). This argues for systematic rather than selective AI assistance.
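The comparison of perceived difficulty between correct and incorrect ratings relies on a Mann-Whitney test. The sketch below is a stdlib-only version using midranks and a tie-corrected normal approximation, without continuity correction. These choices are assumptions, and in practice a library routine such as `scipy.stats.mannwhitneyu` would normally be used.

```python
from math import erfc, sqrt


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (U statistic for sample x, p-value).
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n = n1 + n2
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        t = j - i                    # size of this group of tied values
        midrank = (i + 1 + j) / 2    # average of ranks i+1 .. j
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        tie_term += t ** 3 - t
        i = j
    u1 = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u1, 1.0  # all values identical: no evidence of a difference
    z = (u1 - mean_u) / sqrt(var_u)
    return u1, erfc(abs(z) / sqrt(2))
```

Difficulty ratings on a 1–5 scale produce many ties, so the tie correction in the variance term matters more here than it would for continuous data.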
MAE, RMSE, weighted Kappa, exact agreement and ±1 class, AUROC
F1 micro/macro, Spearman correlation, multiclass Brier score
Bland-Altman plots, confusion matrices, reliability diagrams
R® (v4.2.2–4.4.0) & Python 3.12
α = 5% threshold, Bonferroni correction
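The Bonferroni correction listed above can be sketched in a few lines. The function name and return shape are illustrative assumptions, and the family of tests to which the correction is applied is not specified in the source.

```python
def bonferroni(p_values, alpha=0.05):
    """Return (adjusted p-value, significant?) for each raw p-value.

    A test is declared significant when p <= alpha / m, where m is the
    number of tests in the family.
    """
    m = len(p_values)
    results = []
    for p in p_values:
        adjusted = min(1.0, p * m)
        results.append((adjusted, p <= alpha / m))
    return results
```

For example, with three tests at α = 5%, each raw p-value is compared against 0.05 / 3 ≈ 0.0167.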
The study does not modify patient care in any way. It relies exclusively on simulated data (SimSamu corpus, anonymized by default, AES-256 encrypted). AI models do not store or archive audio recordings, and no training is performed on these data. Participating ARMs and physicians provided signed consent. Analysis is conducted on the secured CHU Lille project space (SNDS security framework), with only aggregated results exported.
Protocol writing, scientific council validation, DPO declaration
Start of enrollment
End of enrollment
End of data extraction
First-wave analysis (30 scenarios) and results validation
Final report (60 scenarios, composite gold-standard)
Peer-reviewed journal publication