What AUC does migraine prediction reach?

Realistically 0.65–0.70 for personalised models after a month of data. HAPRED-II 2026 hit 0.66 personalised; Empatica 2025 reached 0.68 for next-day migraine in episodic patients. Anything published with AUC > 0.85 in this field typically has methodological issues.

Why doesn't migraine prediction get to 0.95 like ECG arrhythmia detection?

ECG has a clear physiological signature. Migraine is a multifactorial neurological event with substantial randomness — there's no clean biological marker the model can latch onto. The signal exists but it's noisy.

Is high AUC the same as accuracy?

No. AUC measures ranking: how well the model separates positives from negatives across every possible threshold. Accuracy (correct classifications / total) depends on one chosen threshold. Two models can have identical AUC and very different accuracy at a specific operating point. Because AUC does not depend on the threshold, it describes the model itself rather than a single operating point.
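As a minimal sketch of that last point, the snippet below builds two hypothetical models that rank six made-up days identically but sit on different score scales. The data, the 0.5 threshold, and the helper names `auc` and `accuracy` are assumptions for illustration, not anything taken from a published model.

```python
def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy(scores, labels, threshold):
    """Fraction of days classified correctly when score >= threshold means 'migraine'."""
    correct = sum((s >= threshold) == (y == 1) for s, y in zip(scores, labels))
    return correct / len(labels)


# Hypothetical data: 1 = migraine day, 0 = no migraine.
labels = [0, 0, 0, 0, 1, 1]
model_a = [0.10, 0.20, 0.30, 0.60, 0.55, 0.90]
model_b = [s / 2 for s in model_a]  # same ordering, every score halved

auc_a, auc_b = auc(model_a, labels), auc(model_b, labels)
acc_a, acc_b = accuracy(model_a, labels, 0.5), accuracy(model_b, labels, 0.5)
print(f"model A: AUC={auc_a:.3f} accuracy={acc_a:.3f}")
print(f"model B: AUC={auc_b:.3f} accuracy={acc_b:.3f}")
```

Both models get the same AUC because halving every score leaves the ordering untouched, but at a 0.5 threshold model B never predicts a migraine, so its accuracy drops.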

AUC-ROC — what 'accuracy' really means in prediction

Q: What is AUC-ROC?

AUC is the area under the Receiver Operating Characteristic curve — a standard metric for measuring how well a binary classifier discriminates between positive and negative cases. 0.5 is no better than chance; 1.0 is perfect. Values above 0.7 are typically considered useful.

AUC-ROC is the number you’ll see quoted in every published migraine prediction paper, and it’s the number Hermly’s methodology page anchors its claims on.

Understanding what AUC actually means helps you read the claims others make about migraine apps with appropriate skepticism.

What AUC measures

For any binary classifier (migraine yes/no, fraud yes/no, disease yes/no), the model outputs a continuous probability or score. You choose a threshold; predictions above the threshold are “yes”, below are “no”. Different thresholds trade off sensitivity (catching true yeses) against specificity (rejecting true nos).
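The threshold trade-off can be sketched in a few lines. The scores and labels here are invented for illustration, and `sensitivity_specificity` is a hypothetical helper name rather than part of any library.

```python
def sensitivity_specificity(scores, labels, threshold):
    """Return (sensitivity, specificity) when score >= threshold is called 'yes'."""
    tp = fn = tn = fp = 0
    for score, label in zip(scores, labels):
        predicted_yes = score >= threshold
        if label == 1:
            tp += predicted_yes          # true yes, caught
            fn += not predicted_yes      # true yes, missed
        else:
            fp += predicted_yes          # true no, wrongly flagged
            tn += not predicted_yes      # true no, correctly rejected
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical daily risk scores; 1 = migraine day, 0 = no migraine.
scores = [0.15, 0.30, 0.35, 0.40, 0.55, 0.60, 0.70, 0.80]
labels = [0, 0, 1, 0, 0, 1, 1, 1]

results = {t: sensitivity_specificity(scores, labels, t) for t in (0.3, 0.5, 0.7)}
for threshold, (sens, spec) in results.items():
    print(f"threshold={threshold}: sensitivity={sens:.2f} specificity={spec:.2f}")
```

On this toy data, raising the threshold from 0.3 to 0.7 moves sensitivity from 1.00 down to 0.50 while specificity rises from 0.25 to 1.00.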

The ROC curve plots sensitivity (the true-positive rate) against 1 − specificity (the false-positive rate) across all possible thresholds. AUC is the area under this curve, and is conventionally read as follows:

0.5 — diagonal line, no discrimination, random guessing.
0.7 — modest, useful discrimination.
0.8 — good discrimination.
0.9+ — excellent discrimination.
1.0 — perfect; the model never mistakes a positive for a negative.

An equivalent interpretation: AUC = the probability that the model gives a randomly chosen positive case a higher score than a randomly chosen negative case. AUC 0.66 means: 66 percent of the time, a randomly chosen migraine day scores higher than a randomly chosen non-migraine day.
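To see that the two readings agree, the sketch below computes AUC both ways on a small invented data set: once as the trapezoidal area under the ROC points, and once as the fraction of positive/negative pairs ranked correctly. The function names and the data are assumptions made for illustration.

```python
def roc_points(scores, labels):
    """ROC points (false-positive rate, sensitivity), sweeping the threshold downwards."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing flagged
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Area under the piecewise-linear curve through the ROC points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def auc_pairwise(scores, labels):
    """Share of (positive, negative) pairs where the positive scores higher."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical daily risk scores; 1 = migraine day, 0 = no migraine.
scores = [0.15, 0.30, 0.35, 0.40, 0.55, 0.60, 0.70, 0.80]
labels = [0, 0, 1, 0, 0, 1, 1, 1]

area = auc_trapezoid(roc_points(scores, labels))
pairwise = auc_pairwise(scores, labels)
print(f"area under ROC = {area:.3f}, pairwise probability = {pairwise:.3f}")
```

Here 14 of the 16 positive/negative pairs are ordered correctly, so both calculations give 0.875.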

Why migraine prediction lands around 0.65–0.70

The published prospective literature converges:

Study	Model	AUC
Houle 2017 (HAPRED-I, internal LOOCV)	2-feature GLMM	0.65
HAPRED-II 2026 (external 14d)	Same model, new cohort	0.586
HAPRED-II 2026 (external personalised 30d)	Bayesian update	0.66
Stubberud 2023	Random forest + wearable	0.62
Empatica 2025 (episodic, personalised)	Wearable ANS	0.68
Holsteen 2020 (multi-trigger self-report)	Within-person	0.56

The ceiling is biological, not algorithmic. Migraine attacks emerge from interacting systems with substantial randomness; the available signals (sleep, weather, cycle, HRV) carry partial information, never the complete causal picture.

Why “95% accuracy” claims are red flags

When a migraine app or paper claims 90%+ accuracy, there are two things to check:

Was the test set genuinely held out? Many “personalised” models train and test on the same person without a proper temporal split, so the model has effectively memorised the patient.
Were any features derived from labels? A feature such as “days since last attack”, combined with a patient who has near-daily attacks, produces a deceptively high AUC.
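For the first check, a proper temporal split for a single patient might look like the sketch below. This is one simple way to do it, not the protocol of any study cited here; the diary layout (`day`, `migraine`), the synthetic attack pattern, and the `n_train_days` parameter are all hypothetical.

```python
from datetime import date, timedelta


def temporal_split(diary, n_train_days):
    """Train on the earliest n_train_days entries, test on everything after them."""
    ordered = sorted(diary, key=lambda entry: entry["day"])
    return ordered[:n_train_days], ordered[n_train_days:]


# Hypothetical 30-day diary for one patient, with an attack every 7th day.
start = date(2026, 1, 1)
diary = [
    {"day": start + timedelta(days=i), "migraine": int(i % 7 == 0)}
    for i in range(30)
]

train, test = temporal_split(diary, n_train_days=21)
last_train_day = max(entry["day"] for entry in train)
first_test_day = min(entry["day"] for entry in test)
print(f"train: {len(train)} days up to {last_train_day}")
print(f"test:  {len(test)} days from {first_test_day}")
```

The property to verify is that every test day falls strictly after every training day. A random shuffle of the same diary would mix future days into training, which is the leak described above.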

The HAPRED-II paper makes this explicit: results substantially above 0.70 in this domain usually don’t replicate when held out externally.

AUC vs other metrics

AUC is the headline metric but not the only one:

Brier score — measures probability calibration. Lower is better. A model can have good AUC and poor Brier if it ranks correctly but is over-confident.
ECE (Expected Calibration Error) — bins predictions by probability and compares to observed frequency. Important for “is 70% really 70%?”.
Sensitivity at fixed specificity — useful for clinical operating points (e.g., “at 80% specificity, what’s the sensitivity?”).
Per-user F1 — median over all users; flags whether the model works for everyone or just averages out.
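The first two metrics in this list can be sketched directly. The example compares two hypothetical models that rank eight invented days identically, one roughly calibrated and one over-confident. The equal-width binning with five bins is just one common choice for ECE, and all names and numbers are assumptions for illustration.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probability and outcome (0 or 1)."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def expected_calibration_error(probs, outcomes, n_bins=5):
    """Weighted average gap between mean predicted probability and observed rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for p, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        ece += len(members) / len(probs) * abs(mean_prob - observed)
    return ece


# Hypothetical outcomes (1 = migraine day) and two models with the same ranking.
outcomes = [0, 0, 0, 1, 0, 1, 1, 1]
calibrated = [0.2, 0.2, 0.2, 0.2, 0.8, 0.8, 0.8, 0.8]
overconfident = [0.02, 0.02, 0.02, 0.02, 0.98, 0.98, 0.98, 0.98]

brier_cal = brier_score(calibrated, outcomes)
brier_over = brier_score(overconfident, outcomes)
ece_cal = expected_calibration_error(calibrated, outcomes)
ece_over = expected_calibration_error(overconfident, outcomes)
print(f"calibrated:    Brier={brier_cal:.4f} ECE={ece_cal:.2f}")
print(f"overconfident: Brier={brier_over:.4f} ECE={ece_over:.2f}")
```

Because both models order the days the same way, their AUC is identical, yet the over-confident one scores worse on both Brier (about 0.24 against 0.19) and ECE (0.23 against 0.05).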

Hermly’s methodology tracks all of these because AUC alone can hide problems with calibration that hurt user trust.

What this means for users

When you see a migraine app claiming a specific accuracy number, the questions to ask:

Was it tested on external data (not the development cohort)?
Is there a Brier score or ECE showing that calibration is also good?
What was the operating point for any reported sensitivity?
Did the model work for chronic patients too, or were they excluded? (See chronic vs episodic)

The honest answer for the entire field in 2026 is: about 0.66 personalised after a month. Higher claims need substantiation; lower claims need explaining.

What this isn’t

Not a statistics tutorial. AUC is more nuanced than this page covers (asymmetric costs, class imbalance, sampling effects). The point here is to give consumers and clinicians a mental model that lets them read migraine-app marketing critically.