When a Pattern Becomes Proof

I worked in a research lab as an undergrad—my brother and I, a small team, a lot of time spent staring at spreadsheets. What I remember most is a feeling you'd get sometimes when looking at the raw data. A hunch. Something in the numbers that seemed to be saying something, even before you ran the statistics.

I'd look at the numbers and think I knew how it would turn out. I'd see a pattern, sense a relationship. Then we'd run the statistics—and often, the math didn't agree with my gut. The correlations I was sure existed weren't there. The effects I dismissed as noise turned out to be real.

That's the thing about intuition: it's generative, not conclusive. A hunch can tell you where to look. It can't tell you what you'll find.

This is the evidence layer of the same loop you see everywhere in DeltaN1: decide, test, adapt.

• • •

The promise of modern AI is essentially this: what if you had a brilliant researcher available around the clock, one who had read everything and could see patterns across vast amounts of data? That's legitimately exciting. And in many ways, large language models deliver on it. They're remarkable pattern matchers. They can synthesize information across domains, generate hypotheses, explain complex relationships in plain language.

But here's the thing. When you ask an AI health tool why your energy crashed this week, and it tells you it "might be related to the alcohol you logged Wednesday or your reduced sleep quality," that's the AI equivalent of a researcher's hunch. It sees a pattern. It's making an educated guess. It's not wrong to say it—the connection might be real. But "might be related" isn't proof. It's a hypothesis dressed up as an answer.

This isn't a flaw in how these systems are built. It's a fundamental characteristic of what they are. Language models are trained to predict plausible text, not to compute statistical significance. They can tell you that something seems related. They cannot tell you whether that relationship would hold up if you ran the numbers properly.

• • •

There's another problem, and it's one that most people using AI health tools don't know about.

Researchers at Stanford and Berkeley studied what happens when you give language models lots of information to work with—the exact scenario you'd want for a health AI that's supposed to know your whole history. They found something troubling: performance doesn't keep improving as you add more context. It can get worse, depending on where the relevant information sits. Models show a U-shaped accuracy curve: they're good at recalling information at the beginning of what you give them, and good at the end. But information in the middle? They lose track of it.

The accuracy drop isn't subtle. In some cases, models performed twenty to thirty percent worse when the relevant information was buried in the middle of the context. In the worst cases, they actually performed worse than if you'd given them no documents at all. The researchers called it "lost in the middle," and it's become one of the most cited findings in AI research over the past two years.

Think about what that means for health applications. You want the system to remember that three weeks ago you started taking magnesium, and two weeks ago your sleep improved, and last week you stopped taking it, and this week your sleep got worse again. That's exactly the kind of longitudinal pattern that would get lost in the middle of a context window. The AI might see the beginning of your health story and the end, but miss the crucial middle chapter where the actual cause and effect played out.

• • •

I think about this a lot, because it mirrors a tension I saw constantly in research: the difference between having a feeling about the data and actually proving something.

Trisha Greenhalgh, a physician and researcher, wrote about this in a paper on clinical intuition. She described intuition as "rapid, subtle, contextual"—characteristics that make it valuable for generating hypotheses but unreliable for confirming them. Her argument wasn't that intuition is bad. It was that the best practitioners "generate and follow clinical hunches as well as—not instead of—applying the deductive principles of evidence-based medicine."

As well as. Not instead of. That's the key phrase.

A brilliant doctor might have a hunch that your symptoms are related to something in your lifestyle. But they don't publish that hunch in a medical journal. They run tests. They look at the statistics. They determine whether the relationship holds up under scrutiny. The hunch gets them started. The proof is what matters.

• • •

So we built both.

Behind the conversational interface—behind the part that talks to you in plain language and feels like a knowledgeable friend—there's a layer of systems that do something fundamentally different. They don't generate plausible text. They compute actual statistics.

When the system tells you that coffee after 2pm reduces your sleep quality by 23%, that number didn't come from a language model making a reasonable-sounding guess. It came from a statistical calculation that looked at every day you logged coffee, compared your sleep on those nights versus other nights, controlled for other variables, and computed the effect size with a confidence interval. The language model's job is to translate that finding into something you can understand and act on. The math happened elsewhere.
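
To make that distinction concrete, here is a minimal sketch, in Python, of the kind of calculation that sits behind a number like that: compare sleep scores on late-coffee days against every other day, then bootstrap a confidence interval around the difference. The column names, the bootstrap approach, and the 90% interval are illustrative assumptions on my part, not ΔN1's actual implementation.

```python
import numpy as np
import pandas as pd

def late_coffee_effect(days: pd.DataFrame, n_boot: int = 10_000, seed: int = 0):
    """days: one row per day, with a boolean 'late_coffee' column and a numeric 'sleep_score'."""
    rng = np.random.default_rng(seed)
    coffee = days.loc[days["late_coffee"], "sleep_score"].to_numpy()
    other = days.loc[~days["late_coffee"], "sleep_score"].to_numpy()

    # Point estimate: relative change in mean sleep score on late-coffee days.
    effect = (coffee.mean() - other.mean()) / other.mean()

    # Bootstrap a 90% confidence interval by resampling days within each group.
    samples = []
    for _ in range(n_boot):
        c = rng.choice(coffee, size=coffee.size, replace=True)
        o = rng.choice(other, size=other.size, replace=True)
        samples.append((c.mean() - o.mean()) / o.mean())
    low, high = np.percentile(samples, [5, 95])
    return effect, (low, high)
```

An effect of -0.23 with an interval that stays below zero is what turns "coffee seems bad for my sleep" into "coffee after 2pm reduces my sleep quality by 23%."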

The same is true for the patterns it finds over time. When it notices that your energy crashes two days after a bad night of sleep—not one day, but two—that's not intuition. That's a system that tested multiple time lags, found that the two-day lag had the strongest statistical relationship, and flagged it as significant. A language model might guess at causation. This actually measures it.
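
A rough sketch of that lag search might look like the following. It assumes two daily series named sleep and energy and uses a plain Pearson correlation as the test statistic; both choices are mine, for illustration, not a description of the production system.

```python
import pandas as pd
from scipy.stats import pearsonr

def strongest_lag(sleep: pd.Series, energy: pd.Series, max_lag: int = 3):
    """Both series share a daily index. Returns (lag, r, p) for the best-supported lag."""
    results = []
    for lag in range(max_lag + 1):
        lagged_sleep = sleep.shift(lag)          # sleep quality from `lag` days earlier
        paired = pd.concat([lagged_sleep, energy], axis=1).dropna()
        if len(paired) < 10:                     # skip lags with too few paired days
            continue
        r, p = pearsonr(paired.iloc[:, 0], paired.iloc[:, 1])
        results.append((lag, r, p))
    if not results:
        return None
    # Pick the lag with the strongest absolute correlation; its p-value still has
    # to survive a correction for having tested several lags at once.
    return max(results, key=lambda item: abs(item[1]))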

There's a system that detects when your baseline has shifted—when the magnesium that used to help your sleep stops working, or when your resting heart rate drifts upward over weeks in a way that's too gradual to notice day by day. There's a system that identifies which single change in your routine has the biggest downstream effects on everything else—what we call a keystone behavior—by tracing cause-and-effect chains through your data. There's a system that knows how confident to be in each finding, and that confidence isn't a feeling. It's a calculated probability based on sample size and effect strength and how many other hypotheses were tested.
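
That last point, adjusting confidence for how many hypotheses were tested, is the easiest to show in miniature. A Benjamini-Hochberg correction is one standard way to do it; whether ΔN1 uses this particular procedure is an assumption on my part, but the principle is the same.

```python
def benjamini_hochberg(p_values: list[float], alpha: float = 0.10) -> list[bool]:
    """For each tested hypothesis, return whether it survives FDR control at `alpha`."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices sorted by p-value
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:             # BH condition
            cutoff_rank = rank                            # remember the largest passing rank
    survives = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            survives[idx] = True
    return survives
```

The more patterns a system probes, the higher the bar each one has to clear before it gets reported as a finding rather than a hunch.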

None of this is visible to you when you're chatting with the app. You just see advice that feels unusually specific. But the specificity isn't coming from an AI that's good at sounding confident. It's coming from math that was done before the conversation started.

• • •

Here's what this looks like in practice.

You wake up feeling terrible. Your energy is low, your focus is shot, you're not sure why. You ask a typical AI health tool what's going on. It looks at your recent data and says: "Your HRV dropped 15% this week. Based on your connected data, this might be related to the alcohol you logged Wednesday or your reduced sleep quality. Consider prioritizing rest tonight."

That's a reasonable response. It's not wrong. But it's hedged for a reason—the AI doesn't actually know which factor caused the drop, or whether either of them did. It's pattern matching, offering possibilities, covering its bases with words like "might" and "consider."

Now here's what it looks like when statistical systems have already done the work: "Your HRV dropped 15%. After tracking your patterns for three weeks, I can tell you: for your body, alcohol after 6pm reduces next-day HRV by 18%, plus or minus 4%, with 90% confidence. Wednesday's drinks explain Thursday's crash. Want to test different cutoff times to find what works for you?"

Same data. Same language model translating the output. But one response is a hypothesis and the other is a finding. One says "this might be the cause." The other says "this is the cause, here's the effect size, here's how confident we are, and here's what we could test next."

The difference isn't cosmetic. It's the difference between a hunch and proof.

• • •

I want to be clear about something: the AI component of this system isn't a limitation we're working around. It's genuinely valuable. Language models are remarkable at taking dry statistical findings and translating them into something that feels personal and actionable. They're good at context, at tone, at knowing when to push and when to ease off. They're good at being a coach.

What they're not good at—what they're not designed to be good at—is mathematical certainty. And in health, certainty matters. Not false certainty, not overconfident claims, but actual statistical confidence about what's working and what isn't.

Greenhalgh was right: the best approach uses both. Intuition to generate hypotheses, to make connections, to feel like a knowledgeable partner in the conversation. And rigorous analysis to separate the patterns that are real from the ones that just seemed plausible.

The AI gives you a brilliant researcher who's always available. The statistical systems give that researcher a lab.

• • •

No one would publish a medical paper based on a gut feeling, no matter how experienced the researcher. No journal accepts a submission that says "it seemed like this was probably related to that." You need the statistics. You need the confidence intervals. You need to show your work.

We think the same standard should apply to telling you what works for your health. Not because we don't trust AI—we use it constantly, it's core to what we do. But because when you're making decisions about your body and your life, you deserve more than plausible-sounding guesses.

You deserve proof.


Also read: Your Journey with ΔN1 — see what this looks like over 3 months. Or How the Plan Actually Works — the difference between advice and a plan.

Join the waitlist and be first to try it.
