Preliminary results - First wave

HARMONIA

Humans (Call Dispatchers + Physicians) vs AI in Emergency Medical Dispatch, One-to-one Impact Analysis

Project Description

HARMONIA is a prospective cohort study with continuous enrollment, evaluating the performance of a large language model (LLM) compared to Medical Regulation Assistants (ARM) and dispatch physicians at the SAS-Centre 15 of Lille University Hospital. The study uses standardized simulated call scenarios (SimSamu corpus), managed by ARM–physician pairs and simultaneously analyzed by AI. The goal is to determine whether AI can match or exceed human performance in telephone triage using the ARM triage scale.

Design

Prospective cohort with continuous enrollment, double-blind matching, and intra-protocol randomization

Population

60 simulated scenarios planned
13 dispatch physicians
10 volunteer ARMs

Period

January - March 2026
(3-month enrollment)

Center

SAS-Centre 15
Lille University Hospital

Objectives

Primary: Compare LLM triage performance with that of the ARMs at the SAS-Centre 15 of Lille University Hospital (CHU Lille), both measured against the gold-standard ARM triage scale.

Secondary:

Concordance with medical regulation decision (final regulation decision)
Participant experience analysis (ARMs + physicians) via NASA-TLX questionnaire
Reliability and speed of transcription and semantic analysis
Reliability of resource deployment predicted by AI

Methodology

Each scenario is managed by an ARM–physician pair, formed interchangeably from a volunteer pool. Matching follows a double-blind scheme with intra-protocol randomization. Each volunteer is exposed to all simulated cases, ensuring a complete crossover evaluation.

The gold standard is composite: strict application of the ARM triage scale to call data and expert opinion (panel of 3 blinded physicians).

The "AI" arm was operationalized using a state-of-the-art large language model (Claude family). For each scenario, the model received the gold-quality verbatim transcript of the call and, reasoning over each case independently, produced an ARM triage level (N1–N5), the identified red flags, and a clinical rationale, under a "safety-first" framing (preference for cautious over-triage in case of doubt).

Transparency note

The initial protocol referred to a "GPT model". The prediction set available for this first wave was produced with an LLM of the Claude family; the thesis therefore presents the results actually obtained. Evaluating a dedicated GPT pipeline on audio (rather than on a transcript), with per-class probabilities, is a natural direction for future work.

Preliminary Results (First Wave)

The first wave covers 30 of the 60 planned scenarios. After code normalization, documented corrections, and exclusion of three ARMs with insufficient coverage (< 25/30 scenarios), 11 ARMs and 11 physicians are evaluable, totaling 328 ARM observations and 328 physician observations, matched to 30 AI predictions.
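The coverage rule described above (raters with fewer than 25 of the 30 scenarios are excluded) can be sketched in a few lines of Python. This is a minimal illustration, not the study's code: the tuple layout `(rater_id, scenario_id, level)`, the function name `evaluable_raters`, and the toy data are all assumptions.

```python
from collections import defaultdict

MIN_COVERAGE = 25  # threshold from the text: fewer than 25 of 30 scenarios excludes a rater


def evaluable_raters(observations, min_coverage=MIN_COVERAGE):
    """Return the IDs of raters who rated at least `min_coverage` distinct scenarios.

    `observations` is an iterable of (rater_id, scenario_id, level) tuples
    (a hypothetical layout chosen for this sketch).
    """
    scenarios_by_rater = defaultdict(set)
    for rater_id, scenario_id, _level in observations:
        scenarios_by_rater[rater_id].add(scenario_id)
    return {r for r, seen in scenarios_by_rater.items() if len(seen) >= min_coverage}


# Toy data: rater "A" covers all 30 scenarios, rater "B" only 24.
toy = [("A", s, 3) for s in range(30)] + [("B", s, 3) for s in range(24)]
kept = evaluable_raters(toy)
filtered = [obs for obs in toy if obs[0] in kept]
```

Counting distinct scenarios rather than rows keeps a duplicated entry from inflating a rater's coverage.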

Comparative performance vs ARM gold-standard (30 matched scenarios)

Metric	ARM	Physician	AI (LLM)
Exact agreement	60.0%	36.7%	36.7%
Agreement ±1 class	100%	73.3%	83.3%
MAE	0.400	0.933	0.833
RMSE	0.632	1.265	1.140
Bias (predicted − gold)	−0.333	−0.733	−0.633
Quadratic weighted κ	0.766	0.403	0.456
Spearman ρ	0.813	0.572	0.585
Macro F1	0.550	0.277	0.262

Table 4 — Comparative performance of ARM / physician / AI against the gold-standard (30 matched scenarios; metrics based on the mode of human evaluations).
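For readers who want to see how metrics of the kind reported in Table 4 are defined, here is a minimal Python sketch using only the standard library. It is an illustration under stated assumptions, not the study's analysis pipeline: the integer coding of N1 to N5 as 1 to 5, the tie-breaking rule for the mode, and the toy ratings are all hypothetical.

```python
from collections import Counter
from math import sqrt

LEVELS = (1, 2, 3, 4, 5)  # N1..N5 coded as integers (assumed coding)


def modal_level(ratings):
    """Most frequent level among several raters; ties go to the lower number (an assumption)."""
    counts = Counter(ratings)
    best = max(counts.values())
    return min(level for level, c in counts.items() if c == best)


def agreement_metrics(pred, gold):
    """Exact agreement, agreement within one class, MAE, RMSE and bias (predicted - gold)."""
    diffs = [p - g for p, g in zip(pred, gold)]
    n = len(diffs)
    return {
        "exact": sum(d == 0 for d in diffs) / n,
        "within_1": sum(abs(d) <= 1 for d in diffs) / n,
        "mae": sum(abs(d) for d in diffs) / n,
        "rmse": sqrt(sum(d * d for d in diffs) / n),
        "bias": sum(diffs) / n,
    }


def quadratic_weighted_kappa(pred, gold, levels=LEVELS):
    """Cohen's kappa with quadratic disagreement weights (i - j)^2.

    Undefined (division by zero) if both raters use one and the same single level.
    """
    n = len(pred)
    observed = sum((p - g) ** 2 for p, g in zip(pred, gold)) / n
    pred_freq, gold_freq = Counter(pred), Counter(gold)
    expected = sum(
        (i - j) ** 2 * pred_freq[i] * gold_freq[j] for i in levels for j in levels
    ) / (n * n)
    return 1.0 - observed / expected


# Toy example: six scenarios, hypothetical gold levels and predictions.
gold = [1, 2, 3, 4, 5, 3]
pred = [1, 2, 2, 4, 3, 3]
metrics = agreement_metrics(pred, gold)
kappa = quadratic_weighted_kappa(pred, gold)
```

The quadratic weights penalize a two-level disagreement four times as much as a one-level disagreement, which is why this kappa is commonly preferred for ordinal scales.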

Agreement with gold-standard across all human observations

Metric (all observations)	ARM (n = 328)	Physician (n = 328)
Exact agreement	50.9%	38.4%
Agreement ±1 class	95.7%	73.8%
MAE	0.534	0.933
RMSE	0.787	1.295
Quadratic weighted κ	0.694	0.528
Spearman ρ	0.713	0.588
Macro F1	0.466	0.352

Table 5 — Agreement with the gold-standard across all human observations.

Key finding

ARM triage is the most robust reference baseline (κ 0.77; ±1-class agreement 100%). The AI reaches a level of agreement comparable to physicians (κ 0.46 vs 0.40) while outperforming them on ±1-class agreement (83.3% vs 73.3%) and mean absolute error (0.83 vs 0.93), with a systematically cautious "safety-first" profile and no severe under-estimation. Notably, its largest divergences occur precisely on scenarios where the ARM gold-standard, and the ARMs themselves, under-triage (acute neurological deficit in a young patient, altered consciousness in a homeless person). Its error profile is thus complementary to that of the ARMs, supporting a hybrid human-AI configuration rather than substitution.

Finally, the difficulty perceived by ARMs does not predict actual error: mean perceived difficulty is nearly identical whether the rating was correct (2.57/5) or off by at least one level (2.59/5; Mann-Whitney, p = 0.87). This argues for systematic rather than selective AI assistance.
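The Mann-Whitney comparison of perceived difficulty between correct and incorrect ratings can be illustrated with a small standard-library sketch. It uses the normal approximation with a tie correction and a continuity correction, which is one common variant and may differ from the exact procedure used in the study; the function name and the difficulty scores below are invented for illustration.

```python
from collections import Counter
from math import erfc, sqrt


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie and continuity corrections).

    Returns (U statistic of sample x, two-sided p-value).
    """
    n1, n2 = len(x), len(y)
    pooled = sorted(list(x) + list(y))
    # Assign mid-ranks: tied values share the average of the ranks they occupy.
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    u1 = sum(rank_of[v] for v in x) - n1 * (n1 + 1) / 2
    n = n1 + n2
    tie_term = sum(t ** 3 - t for t in Counter(pooled).values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:  # every value is identical: no evidence of a difference
        return u1, 1.0
    z = max((abs(u1 - n1 * n2 / 2) - 0.5) / sqrt(variance), 0.0)
    return u1, min(erfc(z / sqrt(2)), 1.0)  # erfc(z / sqrt(2)) = 2 * (1 - Phi(z))


# Hypothetical perceived-difficulty scores (1 to 5) for correct vs incorrect ratings.
difficulty_correct = [2, 3, 3, 2, 4, 1, 3, 2]
difficulty_wrong = [3, 2, 3, 2, 4, 2, 1, 3]
u_stat, p_value = mann_whitney_u(difficulty_correct, difficulty_wrong)
```

With a 1 to 5 scale ties are frequent, so the tie correction matters; for small samples an exact test would be more appropriate than this approximation.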

Statistical Analyses

Primary Metrics

MAE, RMSE, quadratic weighted kappa, exact and ±1-class agreement, AUROC

Composite Score

F1 micro/macro, Spearman correlation, multiclass Brier score
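As a brief illustration of the multiclass Brier score listed above, the sketch below computes it from per-class probability vectors. The probability vectors, the function name, and the 1 to 5 coding of N1 to N5 are hypothetical and serve only to show the formula.

```python
def multiclass_brier(prob_rows, true_levels, levels=(1, 2, 3, 4, 5)):
    """Mean squared distance between predicted class probabilities and the one-hot truth.

    0.0 is a perfect score; the worst possible value with this definition is 2.0.
    """
    total = 0.0
    for probs, truth in zip(prob_rows, true_levels):
        total += sum(
            (p - (1.0 if level == truth else 0.0)) ** 2
            for p, level in zip(probs, levels)
        )
    return total / len(true_levels)


# Hypothetical predictions for two scenarios whose true levels are N2 and N4.
probs = [
    [0.1, 0.7, 0.2, 0.0, 0.0],
    [0.0, 0.1, 0.3, 0.5, 0.1],
]
score = multiclass_brier(probs, [2, 4])
```

Unlike exact agreement, this score rewards a model for placing probability mass near the right answer even when its top choice is wrong.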

Secondary Analyses

Bland-Altman plots, confusion matrices, reliability diagrams
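The quantities behind a Bland-Altman plot (mean difference and 95% limits of agreement) can be sketched as follows. This is a minimal standard-library example on invented ratings; the function name and data are assumptions, and plotting is left out.

```python
from statistics import mean, stdev


def bland_altman_limits(pred, gold):
    """Return (bias, lower limit, upper limit) with limits at bias +/- 1.96 sample SD."""
    diffs = [p - g for p, g in zip(pred, gold)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the paired differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical predicted and gold triage levels for six scenarios.
gold_levels = [1, 2, 3, 4, 5, 3]
pred_levels = [1, 2, 2, 4, 3, 3]
ba_bias, ba_lower, ba_upper = bland_altman_limits(pred_levels, gold_levels)
```

On a five-level ordinal scale the differences take few distinct values, so the limits are best read as a rough summary rather than a precise interval.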

Tools

R® (v4.2.2–4.4.0) & Python 3.12
α = 5% threshold, Bonferroni correction
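To make the Bonferroni correction concrete, here is a short sketch applying it to a set of p-values at α = 5%. Which tests form the corrected family is not stated here, so the three p-values below are purely illustrative, as is the function name.

```python
ALPHA = 0.05  # significance threshold stated above


def bonferroni(p_values, alpha=ALPHA):
    """Return (raw p, adjusted p, significant?) for each test in a family of m tests.

    Each raw p-value is compared with alpha / m; the adjusted p-value is min(p * m, 1).
    """
    m = len(p_values)
    threshold = alpha / m
    return [(p, min(p * m, 1.0), p <= threshold) for p in p_values]


# Hypothetical family of three tests.
results = bonferroni([0.01, 0.04, 0.87])
significant = [flag for _raw, _adj, flag in results]
```

With three tests the per-test threshold drops to about 0.0167, so a raw p-value of 0.04 no longer counts as significant.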

Ethics and Data

The study does not modify patient care in any way. It relies exclusively on simulated data (SimSamu corpus, anonymized by default, AES-256 encrypted). AI models do not store or archive audio recordings, and no training is performed on these data. Signed consent was obtained from participating ARMs and physicians. The analysis is conducted on the secured CHU Lille project space (SNDS security framework), and only aggregated results are exported.

Timeline

December 2025

Protocol writing, scientific council validation, DPO declaration

January 2026

Start of enrollment

March 2026

End of enrollment

April 2026

End of data extraction

June 2026

First-wave analysis (30 scenarios) and results validation

October 2026

Final report (60 scenarios, composite gold-standard)

2026-2027

Peer-reviewed journal publication

Project Team

André Filipe Gomes Botelho

Principal Investigator

Emergency Medicine Resident
Lille University Hospital

Dr Flavie Vanbrugge

Thesis Director

Head of SAMU Unit
Lille University Hospital

Dr Edouard Lansiaux

Scientific Committee

Junior Doctor, Emergency Medicine
AI Engineer
Lille University Hospital

Dr Jonathan Hennache

Scientific Committee

SAMU Unit
Lille University Hospital

Dr Roch Joly

Scientific Committee

Head of Emergency Department
Lille University Hospital