AUC-ROC (area under the Receiver Operating Characteristic curve) is the standard metric for measuring how well a binary classifier discriminates between positive and negative cases. It’s the number you’ll see quoted in nearly every published migraine prediction paper, and it’s the number Hermly’s methodology page anchors its claims on.

Understanding what AUC actually means helps you read the claims others make about migraine apps with appropriate skepticism.

What AUC measures

For any binary classifier (migraine yes/no, fraud yes/no, disease yes/no), the model outputs a continuous probability or score. You choose a threshold; predictions above the threshold are “yes”, below are “no”. Different thresholds trade off sensitivity (catching true yeses) against specificity (rejecting true nos).
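As a minimal sketch of that trade-off, the Python below counts sensitivity and specificity at a few thresholds. The scores and labels are made up for illustration, the function name is hypothetical, and a score exactly equal to the threshold is assumed to count as “yes”.

```python
def sensitivity_specificity(scores, labels, threshold):
    """Return (sensitivity, specificity) when scores >= threshold are predicted "yes"."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if y == 1 and s >= threshold)  # true yeses caught
    fn = sum(1 for s, y in pairs if y == 1 and s < threshold)   # true yeses missed
    tn = sum(1 for s, y in pairs if y == 0 and s < threshold)   # true nos rejected
    fp = sum(1 for s, y in pairs if y == 0 and s >= threshold)  # false alarms
    return tp / (tp + fn), tn / (tn + fp)


# Made-up scores for eight days; 1 = migraine day, 0 = non-migraine day.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

for threshold in (0.35, 0.5, 0.75):
    sens, spec = sensitivity_specificity(scores, labels, threshold)
    print(f"threshold={threshold:.2f}  sensitivity={sens:.2f}  specificity={spec:.2f}")
```

On this toy data, raising the threshold from 0.35 to 0.75 moves sensitivity from 1.00 down to 0.50 while specificity rises from 0.50 to 1.00.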

The ROC curve plots sensitivity vs (1 − specificity) across all possible thresholds. AUC is the area under this curve:

  • 0.5 — diagonal line, no discrimination, random guessing.
  • 0.7 — modest, useful discrimination.
  • 0.8 — good discrimination.
  • 0.9+ — excellent discrimination.
  • 1.0 — perfect; the model never mistakes a positive for a negative.
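The threshold sweep behind the curve can be sketched in a few lines of Python. This is an illustration on made-up data, not any particular library's implementation: it uses each distinct score as a threshold, records (1 − specificity, sensitivity) at each, and integrates with the trapezoid rule. The function names are hypothetical.

```python
def roc_points(scores, labels):
    """Return (1 - specificity, sensitivity) points, one per distinct threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing is predicted "yes"
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
        points.append((fp / n_neg, tp / n_pos))
    return points


def trapezoid_auc(points):
    """Area under the piecewise-linear curve through the ROC points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Made-up scores for eight days; 1 = migraine day, 0 = non-migraine day.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

auc = trapezoid_auc(roc_points(scores, labels))
print(f"AUC = {auc:.4f}")
```

The curve always starts at (0, 0) and ends at (1, 1); a model that ranks at random hugs the diagonal between them, which is where the 0.5 baseline comes from.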

An equivalent interpretation: AUC = the probability that the model gives a randomly chosen positive case a higher score than a randomly chosen negative case. AUC 0.66 means: 66 percent of the time, a randomly chosen migraine day scores higher than a randomly chosen non-migraine day.
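The pairwise reading can be checked directly by comparing every positive case against every negative case. The sketch below uses made-up scores and a hypothetical function name, and counts a tied pair as half a win, which is the usual convention.

```python
from itertools import product


def pairwise_auc(scores, labels):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for pos, neg in product(positives, negatives):
        if pos > neg:
            wins += 1.0
        elif pos == neg:
            wins += 0.5  # a tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Made-up scores for eight days; 1 = migraine day, 0 = non-migraine day.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

auc = pairwise_auc(scores, labels)
print(f"{auc:.4f}")  # 13 of the 16 pairs are ordered correctly
```

Here 13 of the 16 migraine/non-migraine pairs are ranked correctly, so the AUC is 13/16 = 0.8125.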

Why migraine prediction lands around 0.65–0.70

The published prospective literature converges on similar numbers:

Study                                      | Model                    | AUC
Houle 2017 (HAPRED-I, internal LOOCV)      | 2-feature GLMM           | 0.65
HAPRED-II 2026 (external 14d)              | Same model, new cohort   | 0.586
HAPRED-II 2026 (external personalised 30d) | Bayesian update          | 0.66
Stubberud 2023                             | Random forest + wearable | 0.62
Empatica 2025 (episodic, personalised)     | Wearable ANS             | 0.68
Holsteen 2020 (multi-trigger self-report)  | Within-person            | 0.56

The ceiling is biological, not algorithmic. Migraine attacks emerge from interacting systems with substantial randomness; the available signals (sleep, weather, cycle, HRV) carry partial information, never the complete causal picture.

Why “95% accuracy” claims are red flags

When a migraine app claims 90%+ accuracy in the published literature, there are two things to check:

  1. Was the test set genuinely held out? Many “personalised” models train and test on the same person without a proper temporal split, so the model has effectively memorised the patient.
  2. Were any features derived from labels? “Days since last attack” as a feature plus a near-daily-attack patient produces deceptively high AUC.
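For the first check, a proper temporal split means every training day precedes every test day for that patient. The sketch below is one simple way to do it, assuming a hypothetical diary of dated records. The record layout, the function name, and the `test_days` parameter are illustrative only and not taken from any of the papers above.

```python
from datetime import date, timedelta


def temporal_split(records, test_days=3):
    """Hold out the most recent `test_days` records of one patient for testing."""
    if not 0 < test_days < len(records):
        raise ValueError("test_days must be between 1 and len(records) - 1")
    ordered = sorted(records, key=lambda r: r["date"])
    return ordered[:-test_days], ordered[-test_days:]


# Hypothetical diary for one patient: ten consecutive days with made-up labels.
start = date(2026, 1, 1)
diary = [
    {"date": start + timedelta(days=i), "migraine": int(i % 4 == 0)}
    for i in range(10)
]

train, test = temporal_split(diary, test_days=3)

# Every training day precedes every test day, so the model never sees the future.
assert max(r["date"] for r in train) < min(r["date"] for r in test)
print(len(train), "training days,", len(test), "test days")
```

A random shuffle of the same diary would scatter future days into the training set, which is the memorisation problem described in the first check.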

The HAPRED-II paper makes this explicit: results substantially above 0.70 in this domain usually don’t replicate on externally held-out data.

AUC vs other metrics

AUC is the headline metric but not the only one:

  • Brier score — measures probability calibration. Lower is better. A model can have good AUC and poor Brier if it ranks correctly but is over-confident.
  • ECE (Expected Calibration Error) — bins predictions by probability and compares to observed frequency. Important for “is 70% really 70%?”.
  • Sensitivity at fixed specificity — useful for clinical operating points (e.g., “at 80% specificity, what’s the sensitivity?”).
  • Per-user F1 — median over all users; flags whether the model works for everyone or just averages out.
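To make the calibration metrics concrete, here is a small sketch of the Brier score (the mean squared difference between predicted probability and outcome) and ECE with equal-width bins. The two sets of probabilities are made up: they rank the ten days identically, so their AUC is the same, but one set is over-confident. The bin count and function names are assumptions for illustration.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def expected_calibration_error(probs, outcomes, n_bins=10):
    """Weighted average gap between mean predicted probability and observed frequency."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for p, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        ece += len(members) / len(probs) * abs(mean_prob - observed)
    return ece


# Ten made-up days: 3 of the 5 "high" days and 1 of the 5 "low" days are migraines.
outcomes = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]
calibrated = [0.6] * 5 + [0.2] * 5        # matches the observed frequencies
overconfident = [0.95] * 5 + [0.05] * 5   # same ranking, exaggerated probabilities

for name, probs in (("calibrated", calibrated), ("overconfident", overconfident)):
    print(name, round(brier_score(probs, outcomes), 4),
          round(expected_calibration_error(probs, outcomes), 4))
```

Both models order the days identically, yet the over-confident one has a worse Brier score (about 0.27 against 0.20) and a much larger ECE (0.25 against roughly 0).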

Hermly’s methodology tracks all of these because AUC alone can hide problems with calibration that hurt user trust.
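The two threshold-based metrics can be sketched the same way. The code below is an illustration on made-up data and is not a description of Hermly's pipeline. It finds the best sensitivity among thresholds that meet a minimum specificity, and takes the median of per-user F1 scores. The function names and the per-user data layout are hypothetical.

```python
from statistics import median


def sensitivity_at_specificity(scores, labels, min_specificity=0.8):
    """Best sensitivity over thresholds whose specificity is at least min_specificity."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    best = 0.0  # a threshold above every score gives specificity 1 and sensitivity 0
    for threshold in sorted(set(scores)):
        spec = sum(1 for s in negatives if s < threshold) / len(negatives)
        sens = sum(1 for s in positives if s >= threshold) / len(positives)
        if spec >= min_specificity:
            best = max(best, sens)
    return best


def f1_score(predictions, labels):
    """F1 for 0/1 predictions; defined as 0.0 when there are no positives at all."""
    tp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predictions, labels) if p == 0 and y == 1)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def median_per_user_f1(per_user):
    """Median F1 across users; per_user maps a user id to (predictions, labels)."""
    return median(f1_score(preds, labels) for preds, labels in per_user.values())


# Made-up scores for eight days; 1 = migraine day, 0 = non-migraine day.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
print(sensitivity_at_specificity(scores, labels, min_specificity=0.8))

# Three hypothetical users: the model works well for one and fails for another.
per_user = {
    "a": ([1, 1, 0, 0], [1, 1, 0, 0]),
    "b": ([1, 0, 0, 0], [1, 1, 0, 0]),
    "c": ([0, 0, 1, 1], [1, 1, 0, 0]),
}
print(round(median_per_user_f1(per_user), 3))
```

Reporting the median rather than the mean keeps one or two users with very high scores from masking users for whom the model does not work.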

What this means for users

When you see a migraine app claiming a specific accuracy number, these are the questions to ask:

  • Was it tested on external data (not the development cohort)?
  • Is there a Brier score or ECE showing that calibration is also good?
  • What was the operating point for any reported sensitivity?
  • Did the model work for chronic patients too, or were they excluded? (See chronic vs episodic)

The honest answer for the entire field in 2026 is an AUC of about 0.66 for a personalised model after a month. Higher claims need substantiation; lower claims need explaining.

What this isn’t

Not a statistics tutorial. AUC is more nuanced than this page covers (asymmetric costs, class imbalance, sampling effects). The point here is to give consumers and clinicians a mental model that lets them read migraine-app marketing critically.