Methodology

How Hermly forecasts migraine attacks.

A 24-hour probability, on your iPhone, recomputed every time you open Today. Below is exactly how — what we read, what we don't, which research it builds on, and the accuracy ceiling we honestly can't yet beat.

Reading time · 7 min
Last updated · May 2026

In one paragraph

Hermly reads 26 daily signals from HealthKit, your menstrual cycle, WeatherKit, and an optional 1-tap stress check. A cohort machine-learning model trained on a consented research cohort gives the baseline. A per-user personalisation layer trained on-device adapts the prediction to you over the first 30 days. Your raw health data never leaves the phone. Realistic accuracy is around 0.66 AUC after a month of data — in line with the best peer-reviewed results in the field. Anything higher published in this space tends not to replicate.

From sensors to a single number.

Every refresh, Hermly fetches the day's signals from four sources, derives 26 features, and runs them through a two-layer model. Everything below the dashed line happens on your iPhone, in milliseconds, with no network call.

Hermly signal pipeline: four input sources fan into a 26-feature vector, then through a cohort XGBoost model and per-user Bayesian head, ending in a 0–100 risk score with a confidence band.

Inputs

  • HealthKit: HRV · sleep · RHR · activity · wrist temp
  • Cycle: day · phase · perimenstrual flag
  • WeatherKit: pressure 24h Δ · forecast · humidity
  • Self-report (optional): 1-tap stress · Today log

Feature vector: 26 features + 4 derived deltas. XGBoost handles missing values natively, so no imputation is needed.

Model (on-device only)

  • Cohort XGBoost: trained offline on consented beta · OTA-updated
  • Per-user head: Bayesian update · adapts to you over ~30 days
  • Conformal interval wrapper: emits a prediction band, not just a point

Output: today's risk · 0–100 · with calibrated confidence band
Figure 1 · Signal sources at top, two-layer model in the middle, conformal-wrapped output at the bottom. The dashed line is the device boundary — nothing below it makes a network call.

What Hermly reads each day.

Drawn from four sources, processed into 26 raw features plus four derived "today vs. your baseline" deltas. The model handles missing values gracefully — a phone-only user still gets a useful prediction; an Apple Watch + Cycle user gets a more precise one.

From HealthKit (10)

  • HRV (24h average + 30d baseline ratio)
  • Sleep duration, efficiency, and deep-sleep fraction
  • Resting heart rate (today + baseline diff)
  • Wrist temperature anomaly (Watch Series 8+)
  • Activity / step counts vs. your baseline

From the menstrual cycle (3)

  • Cycle day (1–28+)
  • Phase (menstrual / follicular / ovulatory / luteal)
  • High-risk-window flag (perimenstrual + ovulatory days)

From WeatherKit (5)

  • Current barometric pressure
  • 24-hour pressure change
  • Pressure-drop event flag (drop > 5 hPa)
  • Local-history pressure z-score (12-month window)
  • Humidity
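The pressure features above can be sketched in a few lines. This is a minimal illustration, not Hermly's code: the function name, the dictionary keys, and the use of a population standard deviation for the z-score are assumptions. Only the 5 hPa drop threshold and the idea of a local-history z-score come from the list above.

```python
from statistics import mean, pstdev
from typing import Optional, Sequence

# From the text: a drop of more than 5 hPa counts as a pressure-drop event.
DROP_THRESHOLD_HPA = 5.0


def pressure_features(
    current_hpa: float,
    hpa_24h_ago: Optional[float],
    history_hpa: Sequence[float],
) -> dict:
    """Derive pressure features. None marks a value that cannot be computed.

    history_hpa stands in for the 12-month local pressure history
    (hypothetical input; how that history is stored is not specified).
    """
    change = None if hpa_24h_ago is None else current_hpa - hpa_24h_ago
    drop_event = None if change is None else change < -DROP_THRESHOLD_HPA

    z_score = None
    if len(history_hpa) >= 2:
        spread = pstdev(history_hpa)
        if spread > 0:
            z_score = (current_hpa - mean(history_hpa)) / spread

    return {
        "pressure_hpa": current_hpa,
        "pressure_change_24h": change,
        "pressure_drop_event": drop_event,
        "pressure_z": z_score,
    }
```

Missing inputs stay missing (None) instead of being replaced by a neutral value, which matches the no-imputation approach described for the model.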

Temporal & history (4)

  • Day of week
  • Days since last attack
  • Attacks in past 7 days
  • Attacks in past 30 days
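The four temporal features can all be derived from the dates in an attack log. The sketch below is illustrative only: the function name and keys are hypothetical, and it assumes "past 7 days" means today plus the six days before it, which the text does not specify.

```python
from datetime import date, timedelta
from typing import Sequence


def history_features(today: date, attack_days: Sequence[date]) -> dict:
    """Temporal and history features from a list of logged attack dates."""
    past = [d for d in attack_days if d <= today]
    last = max(past) if past else None

    def count_within(days: int) -> int:
        # Assumed window: the half-open range (today - days, today].
        start = today - timedelta(days=days)
        return sum(1 for d in past if d > start)

    return {
        "day_of_week": today.weekday(),  # Monday = 0 ... Sunday = 6
        # None when no attack has been logged yet; left missing, not faked.
        "days_since_last_attack": None if last is None else (today - last).days,
        "attacks_past_7d": count_within(7),
        "attacks_past_30d": count_within(30),
    }
```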

Optional self-report (4)

  • Daily perceived stress · 1 tap, 5 buttons, end-of-day reflection. Anchored on the 0–10 Likert used in the HAPRED-I research diary.
  • Recent attack flag (within 36 h) — derived from your own attack log; one of only two predictors in the published HAPRED-I model.

Stress is opt-in. The picker shows five words — Calm · Mild · Moderate · High · Severe — never a number. Skipping a day leaves the feature missing, never a fabricated "low stress" reading.
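A small sketch of how the two self-report features described above could be encoded. The 0 to 4 ordinal coding of the five words is an assumption made for illustration (the text only says the picker never shows a number), and both function names are hypothetical. The key behaviour is that a skipped day becomes NaN, not zero.

```python
import math
from datetime import datetime, timedelta
from typing import Optional, Sequence

STRESS_WORDS = ("Calm", "Mild", "Moderate", "High", "Severe")


def encode_stress(word: Optional[str]) -> float:
    """Map a picker word to an ordinal 0..4 (assumed coding).

    A skipped day returns NaN so the model sees a missing value,
    never a fabricated "low stress" reading.
    """
    if word is None:
        return math.nan
    return float(STRESS_WORDS.index(word))  # raises ValueError on unknown words


def had_headache_last_36h(now: datetime, attack_starts: Sequence[datetime]) -> bool:
    """Recent-attack flag: True if any logged attack began in the last 36 hours."""
    window_start = now - timedelta(hours=36)
    return any(window_start <= t <= now for t in attack_starts)
```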

Two layers, both on your device.

1

Cohort model

Trained on a consented beta cohort (50 participants, 90 days of HealthKit + diary data). The base learner is XGBoost (gradient-boosted decision trees), chosen because it natively handles missing values, which every multi-source health signal eventually has, and converts in a single step into a Core ML .mlpackage. The model ships with the app and updates over-the-air, but never sees your data.

Why not a transformer? Gradient boosting still tends to beat deep learning on small-N tabular data, per Shwartz-Ziv & Armon (2022). We re-validated on our pilot data; XGBoost won.
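To make "natively handles missing values" concrete: in XGBoost, each split learns a default direction during training, and a row whose feature is missing simply follows that direction. The toy tree below illustrates that routing rule in plain Python. It is a sketch only: the feature names, thresholds, and leaf values are invented, and a real model is an ensemble of many such trees.

```python
import math
from dataclasses import dataclass
from typing import Dict, Optional, Union


@dataclass
class Leaf:
    value: float  # contribution to the raw score (toy numbers)


@dataclass
class Split:
    feature: str
    threshold: float
    missing_goes_left: bool  # default direction, learned during training
    left: "Node"
    right: "Node"


Node = Union[Leaf, Split]


def predict(node: Node, row: Dict[str, Optional[float]]) -> float:
    """Walk one tree; missing or NaN features follow the split's default branch."""
    while isinstance(node, Split):
        x = row.get(node.feature)
        if x is None or math.isnan(x):
            node = node.left if node.missing_goes_left else node.right
        elif x < node.threshold:
            node = node.left
        else:
            node = node.right
    return node.value


# Hypothetical two-split tree: HRV ratio first, then sleep hours.
TOY_TREE = Split(
    "hrv_ratio", 0.9, False,
    left=Leaf(0.4),
    right=Split("sleep_hours", 6.0, True, left=Leaf(0.3), right=Leaf(-0.2)),
)
```

A phone-only row such as `{"sleep_hours": 5.4}` has no HRV value, so it takes the default branch at the first split and is still scored on sleep.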

2

Per-user personalisation

The cohort model is one size for everyone. To adapt to you, Hermly stacks a small logistic regression head on top, trained on-device from your own labelled days and updated in Bayesian fashion as new days arrive. The HAPRED-II 2026 trial (a medRxiv preprint) reported that this style of continuous Bayesian update lifts AUC from 0.59 in the first two weeks to 0.66 after a month: meaningful, even if the ceiling stays modest.

Read the trial: HAPRED-II, Houle et al., medRxiv 2026.
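One way to picture a per-user head is as a two-parameter logistic recalibration of the cohort probability, with a Gaussian prior that keeps it equal to the cohort model until your own data says otherwise. The sketch below is an illustrative stand-in under that assumption. It is not the HAPRED-II update rule and not Hermly's implementation, and every name and hyperparameter in it is hypothetical.

```python
import math
from typing import Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def logit(p: float) -> float:
    p = min(max(p, 1e-6), 1.0 - 1e-6)  # keep the log finite
    return math.log(p / (1.0 - p))


def fit_personal_head(
    cohort_probs: Sequence[float],
    labels: Sequence[int],
    prior_sd: float = 1.0,
    steps: int = 500,
    lr: float = 0.1,
) -> Tuple[float, float]:
    """MAP estimate of (a, b) in p = sigmoid(a + b * logit(p_cohort)).

    Assumed prior: a ~ N(0, prior_sd^2), b ~ N(1, prior_sd^2), so with no
    labelled days the head reproduces the cohort prediction exactly.
    """
    xs = [logit(p) for p in cohort_probs]
    a, b = 0.0, 1.0
    for _ in range(steps):
        # Gradient of the negative log posterior: prior term plus log-loss term.
        grad_a = a / prior_sd ** 2
        grad_b = (b - 1.0) / prior_sd ** 2
        for x, y in zip(xs, labels):
            err = sigmoid(a + b * x) - y
            grad_a += err
            grad_b += err * x
        a -= lr * grad_a / (len(xs) + 1)
        b -= lr * grad_b / (len(xs) + 1)
    return a, b


def personal_risk(cohort_prob: float, head: Tuple[float, float]) -> float:
    a, b = head
    return sigmoid(a + b * logit(cohort_prob))
```

With few labelled days the prior dominates and the output stays close to the cohort baseline. As days accumulate, the data term pulls the head toward your own attack rate.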

3

Conformal interval

The point estimate ("73%") is the mid-point of a wider honest interval. Hermly wraps the model output in a split conformal prediction band so the UI can communicate uncertainty when it's high (early days, sparse signals). When the band is wide, we say so; when it's tight, we trust the number.

Method: Angelopoulos & Bates, "A Gentle Introduction to Conformal Prediction" (2021).
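Split conformal prediction reduces to one quantile computation on held-out calibration scores. The sketch below shows that step and how the resulting radius becomes a band around the point estimate. What the nonconformity score is measured against for a migraine forecast is not stated in the text, so the function takes precomputed scores as input. The function names and the symmetric, clipped band are assumptions.

```python
import math
from typing import Sequence, Tuple


def conformal_radius(calibration_scores: Sequence[float], alpha: float = 0.1) -> float:
    """Split-conformal quantile of nonconformity scores.

    Returns the k-th smallest score with k = ceil((n + 1) * (1 - alpha)).
    If there are too few calibration points for the requested coverage,
    the radius is infinite, which yields the widest possible band.
    """
    n = len(calibration_scores)
    k = math.ceil((n + 1) * (1 - alpha) - 1e-12)  # small tolerance for float error
    if n == 0 or k > n:
        return math.inf
    return sorted(calibration_scores)[k - 1]


def risk_band(point: float, radius: float) -> Tuple[float, float]:
    """Symmetric band around the point estimate, clipped to the [0, 1] range."""
    return max(0.0, point - radius), min(1.0, point + radius)
```

With only three calibration days and alpha = 0.1, the radius is infinite and the band spans the full range, which is the "early days, sparse signals" case the UI is meant to flag.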

We won't quote 95%. The literature can't either.

Migraine prediction is hard because it is inherently noisy. Decades of self-report data show the realistic AUC ceiling for published models lands in the 0.60–0.70 band. Hermly's targets are anchored to those numbers, not to marketing claims.

AUC scale: 0.50 (chance) · 0.70 · 1.00 (perfect)

  • Holsteen 2020 · multi-trigger self-report (n=178) · 0.56
  • HAPRED-I external · stress + current state (n=230) · 0.59
  • HAPRED-II personalised · after 30 days (n=230) · 0.66
  • Stubberud 2023 · wearable + diary, ML hold-out (n=18) · 0.62
  • Hermly target · personalised after 30 days · ~0.65
  • Hermly stretch · personalised after 90 days · ~0.75

AUC = area under the ROC curve. 0.50 is no better than chance; 1.00 is perfect. Numbers above 0.85 in published mobile-app migraine literature usually involve sample-size or label-leakage issues — see the HAPRED-II discussion for a careful read.
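AUC has a direct interpretation: it is the probability that a randomly chosen attack day received a higher risk score than a randomly chosen attack-free day, with ties counting as half. A minimal sketch of that pairwise definition (the function name is ours, and the quadratic loop is for clarity, not speed):

```python
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison (label 1 = attack day)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one attack day and one attack-free day")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))
```

For example, scores of 0.8 and 0.4 on two attack days against 0.6 and 0.2 on two attack-free days rank three of the four pairs correctly, an AUC of 0.75.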

Privacy isn't a policy. It's the architecture.

The cohort model trained from research-cohort data ships with the app. Your phone runs both the cohort inference and the personal head locally. Your sleep, HRV, cycle, attack records, pain logs, and personalised weights stay on your iPhone. Our servers never see them — they couldn't, even if subpoenaed.

What our servers do see: subscription state (free / trial / Pro), keyed by your anonymous Apple transaction ID; and anonymous event counters (e.g., "onboarding completed today") that contain no health values. Detailed list: our privacy promise.

Example Today screen · 06:51 · Tue · May 13
Risk 62 · ELEVATED
"Higher risk window. Pressure forecast to drop in 4h. Sleep last night was 5.4h, below your baseline."
Pressure −8 hPa today · Cycle Day 26 · Sleep 5.4 h

The papers that shaped the model.

Hermly is engineering, not original research — we read what the field has published and built the most honest implementation we could of those ideas.

Houle et al., HAPRED-II: Individualised Forecasting of Headache Attack Risk
medRxiv 2026 · n=230 · 8-week prospective

External validation of a 2-feature parsimonious migraine forecaster. Cohort baseline AUC 0.59; per-user Bayesian updating lifts that to 0.66 after a month. The discussion section is a model of honest reporting.

What Hermly borrows: the Bayesian-update architecture (V1 personalisation), realistic AUC targets, base-rate-drift monitoring after launch, and the safety monitoring concept.

Houle et al., HAPRED-I: Forecasting Individual Headache Attacks Using Perceived Stress
Headache 2017 · n=95 · the original

The two-feature baseline: today's stress (Daily Stress Inventory) plus current headache state. AUC 0.65 on leave-one-out validation. Showed that adding more self-report predictors did not improve fit.

What Hermly borrows: the hadHeadacheLast36h "current state" predictor — one of only two features needed for a useful forecast — and the discipline to keep self-report scales tiny.

Lateef et al., Sleep, Mood, Energy, and Stress as Headache Predictors
Neurology 2024 · n=477 · 4×/day EMA

Decomposed each daily signal into person-mean and within-person Δ-from-mean. Showed both carried independent predictive signal. Energy had opposite-signed effects on morning vs. afternoon attacks — single-window models lose this.

What Hermly borrows: the within-person decomposition — every baseline-paired feature emits both a ratio and a delta — plus the v2 plan for separate AM/PM prediction heads.
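The within-person decomposition amounts to comparing today's value with your own running mean and emitting both a ratio and a delta. A minimal sketch, assuming a plain arithmetic mean over whatever baseline window is available (the function name and keys are hypothetical):

```python
from statistics import mean
from typing import Optional, Sequence


def baseline_pair(today: Optional[float], history: Sequence[Optional[float]]) -> dict:
    """Person-mean baseline plus today's ratio and delta against it.

    history holds earlier daily values (for example the last 30 days);
    None entries are days with no reading and are skipped, not imputed.
    """
    usable = [v for v in history if v is not None]
    if today is None or not usable:
        return {"baseline": None, "ratio": None, "delta": None}
    baseline = mean(usable)
    return {
        "baseline": baseline,
        "ratio": today / baseline if baseline != 0 else None,
        "delta": today - baseline,
    }
```

For an HRV reading of 40 ms against a 50 ms baseline, this gives a ratio of 0.8 and a delta of −10.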

Empatica/Gottesman, Smartwatch Autonomic Signals + Migraine
2025 · n=10

Best individualised AUROC 0.68 for next-day migraine. None of the five chronic-migraine participants had above-random performance — only the five episodic ones did.

What Hermly borrows: the chronic-frequency gate. When you're in chronic territory (≥15 attacks in 30 days), Hermly says so honestly instead of pretending to predict.
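The gate itself is a single threshold check on the 30-day attack count. In the sketch below, only the threshold of 15 comes from the text. The function name and the two mode labels are hypothetical.

```python
CHRONIC_THRESHOLD = 15  # from the text: 15 or more attacks in 30 days


def forecast_mode(attacks_past_30d: int) -> str:
    """Decide whether to show a forecast or the chronic-frequency notice."""
    if attacks_past_30d >= CHRONIC_THRESHOLD:
        return "chronic_notice"  # say so plainly instead of showing a risk score
    return "forecast"
```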

Three things we won't pretend to be.

Not a diagnostic tool

Hermly is a wellness app and is not FDA-cleared. It does not diagnose migraine, classify subtype, or detect comorbidities. The Doctor Report is structured data for your conversation with a clinician, never a substitute for one.

Not a treatment recommender

The app does not tell you when to take medication. Even on a high-risk forecast, you'll see facts ("Pressure dropping", "Sleep below your baseline"), not instructions. Acute and preventive medication choices belong with you and your doctor.

Not always right

At ~0.66 AUC, the model is meaningfully better than chance and meaningfully worse than perfect. Some high-risk days pass without an attack. Some quiet days bring one. The UI tries to communicate this honestly — including when the prediction shouldn't be trusted at all.

FAQ.

Does it work without an Apple Watch?

Yes. The phone-only path uses sleep, cycle, and weather to drive predictions. Adding a Watch adds heart-rate variability, resting heart rate, and wrist temperature, which improve accuracy for most users, but the app still works without one.

What if I forget to log attacks?

The personalisation layer needs your labels to learn. Unlogged attacks aren't fatal, because the cohort model still runs, but accuracy levels off rather than improving. The Apple Watch and Live Activity flows are designed so logging takes one tap, even mid-attack.

How long until predictions get useful?

Day one for the cohort baseline (drawn from research-cohort data). The per-user layer noticeably improves after about 14 days and stabilises around day 30, mirroring the HAPRED-II 2026 trajectory.

Why isn't it 95% accurate?

Because nothing in the published literature is. Migraine attacks emerge from interacting biological systems with substantial randomness. The realistic personalised AUC ceiling for a 24-hour forecast lands around 0.66–0.70 in the prospective studies cited above. We'd rather be honest about that than oversell.

Can I see what features the model is using?

Yes. Today shows the three biggest contributors below the risk number, with their direction and value. The Doctor Report exports a richer breakdown. The full feature schema will be published alongside the open-source release.
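Picking the three biggest contributors is a ranking by absolute contribution. The sketch assumes each feature already has a signed, additive contribution to today's score. How Hermly computes those contributions is not described here, so the inputs, names, and numbers below are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def top_contributors(
    contributions: Dict[str, float],
    values: Dict[str, float],
    k: int = 3,
) -> List[Tuple[str, str, Optional[float]]]:
    """Return the k features with the largest absolute contribution.

    Each entry is (feature name, direction, today's value), where direction
    is "raises" for a positive contribution and "lowers" otherwise.
    """
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return [
        (name, "raises" if c > 0 else "lowers", values.get(name))
        for name, c in ranked[:k]
    ]
```

Ranking by absolute value means a strongly protective signal can appear alongside the risk-raising ones, which matches showing each contributor's direction as well as its value.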

Is the prediction model audited or peer-reviewed?

Not yet. The cohort model is being trained on a 50-person prospective beta (recruitment open). After launch we plan external validation comparable to the HAPRED-II protocol. Findings will be published whether they support the product or not.

Predictions on your phone. Data that stays there too.

Hermly is in private beta. Leave your email for an invitation when the cohort opens further.