Hybrid Simulation Framework for Pre-Clinical Evaluation of AI-Assisted Emergency Triage
EIMLIA-TEU is a hybrid simulation framework for the pre-clinical evaluation of AI-assisted triage across four dimensions simultaneously: clinical, organizational, economic, and technical. It combines Discrete Event Simulation (DES), Multi-Agent Systems (MAS), four-dimensional digital twins (structural, behavioral, temporal, technical), process mining, and a graphical systems formalization, calibrated on 600,000 anonymized ED patient records (CHU Lille, 2018–2023).
The three AI triage architectures from the TIAEU proof-of-concept (TRIAGEMASTER: Doc2Vec + MLP; URGENTIAPARSE: FlauBERT + XGBoost; EMERGINET: JEPA + VICReg) were retrained on 340,536 patients and evaluated under a dual-regime framework designed to disentangle algorithmic approximation from clinical validity.
Hybrid DES + Multi-Agent Systems simulation, 4D digital twins
600,000 ED visits (CHU Lille, 2018–2023)
18.5M event logs (process mining)
340,536 patients (60:20:20 split; test n = 68,108)
TRIAGEMASTER (NLP)
URGENTIAPARSE (LLM)
EMERGINET (JEPA)
The central methodological contribution is an explicit separation between two evaluation regimes, designed to make visible an artefact the AI-triage literature has tended to leave implicit.
| Architecture | R1 (κw) | R2 (κw, projected) |
|---|---|---|
| URGENTIAPARSE (LLM) | 0.9956 | 0.81 [0.78; 0.84] |
| EMERGINET (JEPA) | 0.9391 | 0.74 [0.71; 0.77] |
| TRIAGEMASTER (NLP) | 0.8945 | 0.69 [0.66; 0.72] |
| Nurse baseline (literature) | 0.65 | 0.65 |
| Deployment threshold | ≥ 0.80 | ≥ 0.80 |
Dual-regime benchmarking. R1 vs. FRENCH ideal labels (algorithmic approximation, n = 68,108). R2 vs. 5-physician expert consensus (clinical validity, n = 3,000). 95% bootstrap CIs in brackets.
Under R1, all three architectures reach κw ≥ 0.89, which confirms their capacity to approximate the labelling algorithm but says nothing about clinical validity. Under R2, all three exceed the nurse baseline (κw ≈ 0.65), but only URGENTIAPARSE (κw = 0.81) clears the κw ≥ 0.80 deployment threshold. The R1→R2 gap is ≈ 0.19 to 0.20 for every architecture (0.9956 − 0.81 ≈ 0.19; 0.9391 − 0.74 ≈ 0.20; 0.8945 − 0.69 ≈ 0.20) and quantifies the artefact of training on algorithmically reconstructed labels: prior literature evaluating AI triage against such labels likely overstates clinical validity by 0.10–0.20 κw points.
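The agreement statistic used in both regimes can be illustrated with a minimal sketch of Cohen's weighted kappa and a percentile bootstrap interval. Several choices here are assumptions, not details taken from the study: the quadratic weighting (`power=2`), the percentile bootstrap, the function names, and the five-level ordinal scale used in the toy data.

```python
import random
from collections import Counter


def weighted_kappa(a, b, levels, power=2):
    """Cohen's weighted kappa between two raters on an ordinal scale.

    power=2 gives quadratic disagreement weights, power=1 linear ones.
    The weighting scheme is an assumption; the source only reports kappa_w.
    """
    if len(a) != len(b) or not a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(a)
    index = {level: i for i, level in enumerate(levels)}
    k = len(levels)
    pairs = Counter((index[x], index[y]) for x, y in zip(a, b))
    row = Counter(index[x] for x in a)
    col = Counter(index[y] for y in b)
    # Weighted observed disagreement.
    observed = sum(c * abs(i - j) ** power for (i, j), c in pairs.items()) / n
    # Weighted disagreement expected from the marginals alone.
    expected = sum(
        row[i] * col[j] * abs(i - j) ** power for i in range(k) for j in range(k)
    ) / (n * n)
    if expected == 0:
        return 1.0  # both raters used one identical level; no disagreement possible
    return 1.0 - observed / expected


def bootstrap_ci(a, b, levels, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for weighted kappa (resampling patients)."""
    rng = random.Random(seed)
    n = len(a)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(weighted_kappa([a[i] for i in idx], [b[i] for i in idx], levels))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


# Toy data on a hypothetical 5-level scale: model output vs. reference labels.
levels = [1, 2, 3, 4, 5]
reference = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5] * 10
model = [1, 2, 3, 3, 3, 4, 4, 4, 5, 4] * 10
print(weighted_kappa(model, reference, levels), bootstrap_ci(model, reference, levels))
```

Running the same function once against the algorithmic labels and once against an expert-consensus subset would give the two columns of the table; the difference between them is the R1→R2 gap discussed above.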
Hybrid DES-MAS runs (3-day fast mode), under the R2-corrected performance estimates, yield the following change in mean length of stay (∆LOS) versus the manual baseline:
| Scenario | ∆ Mean LOS | AI concordance |
|---|---|---|
| S2b — URGENTIAPARSE | −7.1% [−9.4; −4.8] | 76.5% |
| S2a — TRIAGEMASTER | −3.2% [−5.5; −1.0] | 62.3% |
| S2c — EMERGINET | +1.2% [−0.8; +3.1] | 91.0% |
| S3 — Crisis hybrid (200% load) | +8.8% | 90.5% |
Scenario comparison (3-day fast-mode runs, R2-corrected error injection). The hybrid crisis scenario preserves clinical safety (concordance 90.5%). Stress tests pass design targets (SURGE n = 220 ≥ 180; availability 99.1% ≥ 99.0%).
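How a ∆LOS figure arises from a flow simulation can be shown with a deliberately small sketch: a single-server FIFO triage queue, simulated patient by patient, run once as a baseline and once with shorter triage times on the same random workload. This is far simpler than the hybrid DES-MAS model; the arrival and service distributions, the 10% speed-up, and all function names are hypothetical.

```python
import random


def make_workload(n_patients, mean_interarrival, mean_service, seed=0):
    """Draw arrival times and baseline service times once, so that every
    scenario is compared on the same workload (common random numbers)."""
    rng = random.Random(seed)
    t = 0.0
    arrivals, services = [], []
    for _ in range(n_patients):
        t += rng.expovariate(1.0 / mean_interarrival)
        arrivals.append(t)
        services.append(rng.expovariate(1.0 / mean_service))
    return arrivals, services


def simulate_mean_los(arrivals, service_times):
    """Mean length of stay in a single-server FIFO queue."""
    server_free_at = 0.0
    total = 0.0
    for arrival, service in zip(arrivals, service_times):
        start = max(arrival, server_free_at)  # wait while the server is busy
        server_free_at = start + service      # departure event
        total += server_free_at - arrival     # LOS = departure - arrival
    return total / len(arrivals)


def delta_los_percent(baseline, scenario):
    """Relative change in mean LOS versus the baseline, in percent."""
    return 100.0 * (scenario - baseline) / baseline


arrivals, services = make_workload(500, mean_interarrival=10.0, mean_service=8.0)
baseline = simulate_mean_los(arrivals, services)
# Hypothetical effect: AI assistance shortens each triage step by 10%.
assisted = simulate_mean_los(arrivals, [0.9 * s for s in services])
print(f"delta LOS = {delta_los_percent(baseline, assisted):+.1f}%")
```

Reusing one workload across scenarios removes sampling noise from the comparison, which matters when the effect being measured is only a few percent.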
URGENTIAPARSE is the only architecture that simultaneously clears the R2 clinical-validity threshold and produces a convergent ecological flow benefit (∆LOS = −7.1%). The framework also documents the "URGENTIAPARSE simulation paradox": in earlier recorded-label cycles, the model produced an apparent ∆LOS of −13.2% while critical sensitivity collapsed to ≈ 0.002 — a safety hazard invisible to any isolated predictive metric, detectable only through multidimensional simulation.
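The paradox is easier to see with a toy calculation. The sketch below assumes that critical sensitivity is the share of truly critical patients that the model also places in a critical level, and that lower level numbers mean higher urgency; both definitions and the data are illustrative assumptions, not the study's exact metric code.

```python
def accuracy(truth, predicted):
    """Share of patients whose predicted level equals the reference level."""
    return sum(1 for t, p in zip(truth, predicted) if t == p) / len(truth)


def critical_sensitivity(truth, predicted, critical_levels):
    """Share of truly critical patients that the model also flags as critical."""
    critical = [(t, p) for t, p in zip(truth, predicted) if t in critical_levels]
    if not critical:
        raise ValueError("no critical cases in the reference labels")
    hits = sum(1 for _, p in critical if p in critical_levels)
    return hits / len(critical)


def under_triage_rate(truth, predicted):
    """Share of patients given a less urgent level than the reference.
    Assumes lower numbers mean higher urgency."""
    return sum(1 for t, p in zip(truth, predicted) if p > t) / len(truth)


# Toy cohort: 20 critical patients (level 1) among 1,000, and a degenerate
# model that assigns level 3 to everyone.
truth = [1] * 20 + [3] * 980
predicted = [3] * 1000

print("accuracy            :", accuracy(truth, predicted))
print("critical sensitivity:", critical_sensitivity(truth, predicted, {1}))
print("under-triage rate   :", under_triage_rate(truth, predicted))
```

A model of this kind looks excellent on aggregate agreement and would speed up flow by never escalating, yet it misses every critical patient, which is why the aggregate metric and the flow metric have to be read together with a class-specific safety metric.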
Probabilistic sensitivity analysis (Monte Carlo, 50,000 iterations), reported for the pessimistic, realistic, and optimistic scenarios:
| Scenario | 3-yr ROI (95% CrI) | ICER | P(dominant) |
|---|---|---|---|
| Pessimistic | 80% [−30; 320] | 12,500 €/QALY | 8% |
| Realistic | 480% [210; 1,250] | 1,840 €/QALY [dominant; 8,300] | 32% |
| Optimistic | 2,100% [890; 4,100] | dominant | 64% |
PSA Monte Carlo results (50,000 iterations) by scenario. The intervention is cost-effective at the HAS 50,000 €/QALY threshold in 99.4% of simulations (Cost-Effectiveness Acceptability Curve).
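The quantities in this table (P(dominant) and the probability of cost-effectiveness at a willingness-to-pay threshold) can be sketched as a small Monte Carlo loop. The input distributions and their parameters below are invented for illustration and are not the study's calibrated inputs; the net-monetary-benefit formulation is one common way to evaluate cost-effectiveness without dividing by small or negative effects.

```python
import random


def run_psa(n_iter, threshold, seed=0):
    """Return (P(dominant), P(cost-effective at `threshold` EUR/QALY))."""
    rng = random.Random(seed)
    dominant = 0
    cost_effective = 0
    for _ in range(n_iter):
        # Hypothetical input distributions, for illustration only.
        delta_cost = rng.gauss(50_000.0, 150_000.0)  # incremental cost (EUR)
        delta_qaly = rng.gammavariate(4.0, 10.0)     # incremental QALYs, > 0
        if delta_cost < 0 and delta_qaly > 0:
            dominant += 1  # cheaper and more effective
        # Net monetary benefit: positive means cost-effective at this threshold.
        if threshold * delta_qaly - delta_cost > 0:
            cost_effective += 1
    return dominant / n_iter, cost_effective / n_iter


def ceac(thresholds, n_iter=5_000, seed=0):
    """Cost-effectiveness acceptability curve: one probability per threshold.
    The same seed is reused so every threshold sees the same draws."""
    return [(t, run_psa(n_iter, t, seed)[1]) for t in thresholds]


p_dominant, p_ce = run_psa(5_000, threshold=50_000.0)
print(f"P(dominant) = {p_dominant:.3f}, P(cost-effective) = {p_ce:.3f}")
for t, p in ceac([0.0, 5_000.0, 20_000.0, 50_000.0]):
    print(f"{t:>10,.0f} EUR/QALY -> {p:.3f}")
```

At a threshold of zero the acceptability probability equals P(dominant), and it can only rise as the threshold increases, which is the shape a CEAC is expected to have when the health effect is positive.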
The five parameters explaining 87% of ROI variance are, in decreasing order: inappropriate hospitalization avoidance rate (38%), GHS unit cost (21%), imaging avoidance rate (12%), annual ED visit volume (10%), and AI inference latency P95 (6%). An earlier deterministic projection (ROI 10,260%, ICER 93 €/QALY) was explicitly withdrawn in favour of these probabilistic figures.
Clinical: R1/R2 weighted kappa (κw), critical sensitivity, under-triage rate
Organizational: length of stay, waiting time, overcrowding rate
Economic: ROI, ICER, CEAC via Monte Carlo PSA (CHEERS 2022)
Technical: availability, inference latency, MTTR
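For the technical dimension, the three indicators reduce to short formulas. The sketch below uses the nearest-rank definition of a percentile and defines availability as the fraction of the observation window without outage; these definitions, the function names, and the sample values are assumptions rather than the framework's own instrumentation.

```python
import math


def percentile_nearest_rank(values, q):
    """q-th percentile (0 < q <= 100) by the nearest-rank method."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def mttr(outage_durations):
    """Mean time to repair: average duration of the recorded outages."""
    return sum(outage_durations) / len(outage_durations)


def availability(total_time, outage_durations):
    """Fraction of the observation window during which the service was up."""
    return 1.0 - sum(outage_durations) / total_time


# Hypothetical measurements: latencies in milliseconds, outages in hours
# over a 720-hour window.
latencies_ms = list(range(1, 101))
outages_h = [2.0, 4.0]

print("P95 latency (ms):", percentile_nearest_rank(latencies_ms, 95))
print("MTTR (h)        :", mttr(outages_h))
print("availability    :", availability(720.0, outages_h))
```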
Single-center retrospective study (CHU Lille Adult ED, 2018–2023), approved by CESREES and conducted under the CNIL MR-004 reference methodology, with declaration N° 27797006 to the Health Data Hub. Data processing is GDPR-compliant.
EIMLIA-TEU demonstrates that AI triage models trained on algorithmic labels recover the algorithm rather than clinical judgment; that ecological flow simulation is necessary to detect safety-flow trade-offs invisible to predictive metrics; and that under realistic R2 evaluation only URGENTIAPARSE clears the stringent κw ≥ 0.80 threshold. The TRIADE prospective multicenter cluster-RCT (CHU Lille, CH Maubeuge, CH Dunkerque) is the indispensable next step.