On the morning of March 31st, my Apple Watch Ultra notified me that I may have sleep apnea and should speak with a doctor. It wasn't the first time. The notification is generated by watchOS's FDA-cleared sleep apnea detection feature, which uses the Watch's accelerometer to identify breathing irregularities during sleep. Apple earned that clearance. The feature works.
What the notification didn't tell me was what the preceding three nights looked like in the data the Watch had already collected. My blood oxygen averaged 91.4%, 92.0%, and 92.1% across March 29, 30, and 31, the worst three-night cluster in 95 days of continuous tracking. My respiratory rate hit 28.2 breaths per minute on March 30, the second-highest reading in that entire dataset. My heart rate variability collapsed to 22 milliseconds, roughly half what it should be for someone my age. The Watch saw all of it. It connected enough dots to file a clinical-grade alarm. And then, almost certainly, it gave me a Sleep Score that didn't reflect any of it.
I have a formal diagnosis of Central Sleep Apnea. I wear both an Apple Watch Ultra and an Oura Ring. I've spent more than two years collecting data from both, and what that data shows isn't just a gap between two consumer devices. It's a contradiction sitting inside a single piece of hardware, between two systems that have reached opposite conclusions about the same person's health on the same night.
Apple built a Sleep Score optimized to make you feel good about your sleep. That's a design choice. For most people, it's probably fine. For anyone with a diagnosed sleep disorder, it can quietly work against the clinical system Apple built right alongside it.
Two Systems, One Device
Apple's sleep apnea detection feature arrived with watchOS 11, available on Apple Watch Series 9, Series 10, Ultra 2, and Ultra 3. It uses the accelerometer to detect wrist movements associated with breathing disturbances during sleep, a method Apple validated through clinical trials and submitted to the FDA for de novo clearance as a medical device feature. When the algorithm crosses its confidence threshold over a 30-day observation window, it surfaces a notification telling you to see a doctor.
That notification is serious. Apple didn't build it to engage you with the Health app. It built it because sleep apnea, left undetected, carries real cardiovascular risk. The FDA clearance exists precisely because the stakes justify regulatory oversight.
The Sleep Score is a different creature entirely. Introduced alongside the sleep apnea feature in watchOS 11, it distills a night of sleep into a single number between 0 and 100. Apple weights total sleep duration, efficiency, time in each stage, heart rate, and respiratory rate. The goal is clarity. Sleep is complicated, and a single score is easier to act on than a wall of metrics.
The problem is what that simplification costs when the sleeper has a diagnosed breathing disorder.
What Two Years of Data Actually Shows
Over 643 nights tracked by the Oura Ring between April 2024 and May 2026, and 95 consecutive nights tracked by Apple Watch from February through May 2026, a picture emerges that no press release walkthrough of either product's features would prepare you for.
The Oura data alone contains a story worth sitting with. My average sleep score across those 643 nights was 72.2. On 33% of all nights, the score fell below 70. On 69 nights it fell below 60. The lowest single score was 28, recorded on October 8, 2025, the kind of number that in any other context would prompt a follow-up conversation with a clinician.
But the number that tells the real story isn't the sleep score. It's the Breathing Disturbance Index: the per-hour count of respiratory irregularities Oura tracks throughout each night. Oura's own documentation flags a BDI above 20 as a potential indicator of sleep-disordered breathing. My average across the full two-year dataset was 20.2. That average is itself sitting on the threshold. The distribution underneath it is what matters.
The Escalation Nobody Scored
From April 2024 through April 2025, my BDI averaged 16.5 disturbances per hour. Elevated for a healthy adult, unsurprising for someone with CSA, but relatively stable. Then something shifted.
From May through November 2025, my monthly average BDI never dropped below 28.8. For seven straight months, breathing disturbances averaged roughly 30 per hour. In September 2025, a single night hit a BDI of 68. Twenty-one nights across that stretch exceeded 40. On 67% of the nights in that seven-month window, the BDI crossed Oura's own warning threshold of 20.
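For anyone auditing their own exports, the monthly roll-up described above is a few lines of Python. The readings below are synthetic stand-ins for illustration, not the actual dataset:

```python
from statistics import mean

# Illustrative nightly BDI readings keyed by month (synthetic values,
# not the real export described in this article).
nights = {
    "2025-05": [22, 31, 35, 41, 28],
    "2025-06": [30, 29, 44, 33, 26],
}

WARNING_THRESHOLD = 20  # Oura's documented BDI warning level

for month, readings in sorted(nights.items()):
    monthly_avg = mean(readings)
    over = sum(1 for r in readings if r > WARNING_THRESHOLD)
    print(f"{month}: avg BDI {monthly_avg:.1f}, "
          f"{over}/{len(readings)} nights over {WARNING_THRESHOLD}")
```

The same loop over a real two-year export is what surfaces the escalation: the monthly averages climb, and the over-threshold count saturates.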
This was a documented clinical deterioration. Not a bad week. Seven months of worsening sleep-disordered breathing, visible in the data, accumulating night after night.
Then in December 2025, the BDI collapsed. From 31.2 in November to 13.2 in December to 8.7 in January 2026. No deliberate intervention that I can identify. Central apnea fluctuates with stress load, cardiovascular changes, sleep position, and factors that often don't announce themselves. Whatever drove the escalation apparently resolved on its own.
What Oura's sleep score did during all of this is instructive. During the escalation period, the average score was 72.6. During the recovery period, it was 74.0. A difference of 1.4 points across a clinical arc that saw BDI drop by more than 20. The score did not track the deterioration. It did not track the recovery. It produced essentially the same number throughout a two-year period in which my breathing during sleep went from manageable to severely disrupted and back again.
To be clear: Oura isn't completely blind to the problem. My worst nights during the escalation did tend to score lower. The correlation exists. It's just weak: a correlation coefficient of -0.033 between sleep score and BDI across the full dataset, meaning BDI barely moves the needle. On 30% of the 91 nights when my BDI exceeded 30, Oura still scored my sleep above 75. On 14 of those nights, above 80. Neither platform is giving a fully honest accounting. The difference is that Oura doesn't also have an FDA-cleared clinical alarm sitting in the same app, and it doesn't advertise that alarm while simultaneously smoothing over the signals that drive it.
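The weakness of that relationship is easy to check against your own export. A minimal Pearson correlation, with invented score/BDI pairs standing in for real nights:

```python
from statistics import mean, pstdev

def pearson(xs, ys):
    # Pearson correlation coefficient over paired samples.
    mx, my = mean(xs), mean(ys)
    cov = mean((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (pstdev(xs) * pstdev(ys))

# Synthetic score/BDI pairs for illustration; the real dataset's r was -0.033.
scores = [72, 80, 68, 75, 71, 77, 74, 69]
bdi    = [30, 35, 12, 28, 40, 15, 33, 22]
print(f"r = {pearson(scores, bdi):+.3f}")
```

An r near zero means a scatter plot of score against BDI is a cloud: nights with severe breathing disturbance land across nearly the full range of scores.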
The Night the Watch Finally Said Something
The March 31st notification didn't arrive in a vacuum. It arrived during one of the worst physiological stretches in 95 days of Apple Watch data. Three consecutive nights with SpO2 averages between 91.4% and 92.1%. Respiratory rate peaking at 28.2 breaths per minute, a number more consistent with moderate physical exertion than sleep. HRV floored at 22 milliseconds across all three nights. Every metric the Watch tracks pointed in the same direction for 72 consecutive hours.
The morning after the notification, my Oura ring scored that night a 47. Readiness: 56. Oura was unambiguous: something was wrong, the body hadn't recovered, the day should be adjusted. The Watch had fired its clinical alarm the morning before. And yet neither system has a mechanism to connect those events in a way visible to the user. The notification happened. The low score happened. The Sleep Score, whatever it showed, sat beside both of them, doing its own calculation.
A score of 47 from Oura isn't a yellow flag. It's a system telling you plainly that last night was bad. Apple's Sleep Score value for the same window is something I can't confirm, because Apple doesn't write Sleep Score values back to HealthKit as queryable data. The score lives inside the Sleep app and doesn't persist in a format that allows historical analysis. That's a design choice worth naming: the metric Apple puts most prominently in front of users is the one it makes hardest to audit over time.
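What you can audit is whatever Apple does export. A minimal sketch, assuming the Record layout of a standard Apple Health export.xml (worth verifying against your own export): respiratory rate and SpO2 come out as queryable records, while no Sleep Score record type exists to query at all.

```python
import xml.etree.ElementTree as ET

# Imitation of the Record elements in an Apple Health export.xml; the type
# strings follow HealthKit's public identifiers, but treat the exact export
# layout as an assumption and check it against a real export.
sample = """<HealthData>
  <Record type="HKQuantityTypeIdentifierRespiratoryRate"
          unit="count/min" value="28.2" startDate="2026-03-30 02:14:00 -0500"/>
  <Record type="HKQuantityTypeIdentifierOxygenSaturation"
          unit="%" value="92" startDate="2026-03-30 02:14:00 -0500"/>
</HealthData>"""

root = ET.fromstring(sample)
resp = [float(r.get("value"))
        for r in root.iter("Record")
        if r.get("type") == "HKQuantityTypeIdentifierRespiratoryRate"]
print(resp)  # → [28.2]  (no equivalent record type exists for Sleep Score)
```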
What the Sensor Gap Adds
Comparing Oura and Apple Watch SpO2 on the 34 nights where both devices recorded blood oxygen produces a consistent gap. Oura averaged 95.4% on those nights. Apple averaged 93.9%. A systematic 1.5-percentage-point difference, consistently in the same direction.
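The paired comparison is itself simple arithmetic. With synthetic values standing in for the 34 overlapping nights:

```python
from statistics import mean

# Illustrative paired nightly SpO2 averages (synthetic, not the real readings).
oura  = [95.1, 95.8, 95.2, 95.6, 95.3]
watch = [93.6, 94.4, 93.5, 94.2, 93.8]

diffs = [o - w for o, w in zip(oura, watch)]
print(f"mean paired gap: {mean(diffs):.2f} points")  # → mean paired gap: 1.50 points
print(all(d > 0 for d in diffs))                     # → True (same sign every night)
```

The sign consistency is the telling part: a gap that flips direction night to night would look like noise, while a gap that always points the same way looks like sensor bias.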
Finger-based optical sensors like Oura's are generally considered more accurate than wrist-based PPG for blood oxygen measurement. The wrist has lower capillary density, and the Watch's sensor has to contend with movement artifacts and skin contact variability in ways a ring does not. Apple has acknowledged wrist PPG limitations in its own device documentation.
If Apple's sensor reads 1.5 points lower than a more accurate reference device, the SpO2 data feeding into the Sleep Score algorithm is already starting from a depressed baseline. The Watch isn't scoring against what your oxygen saturation actually was. It's scoring against what its wrist sensor estimated, and that estimate runs consistently low. For a healthy sleeper, a 1.5-point gap at the 96-97% range doesn't change much. At 93-94%, where I sit chronically, it matters.
The Design Philosophy Problem
None of this means Apple's health engineering is careless. The sleep apnea detection feature is genuinely impressive work, and FDA de novo clearance for a consumer wearable is not a trivial achievement. Apple has invested seriously in turning the Watch into a clinical instrument for specific, high-stakes conditions.
The problem is the layer sitting on top of that work. The Sleep Score isn't a clinical instrument. It's a consumer engagement feature, designed with consumer psychology in mind. Scores that consistently land in the 60s cause users to disengage. Scores in the 80s keep them opening the app. Apple has every structural incentive to weight the algorithm toward the high end, and the input that would most reliably drag it down for someone with CSA, the breathing disturbance data the Watch is already collecting, is precisely the input that doesn't appear to move the needle.
That's the contradiction Apple hasn't resolved. They built a feature that says you may be seriously ill. They built it on the same hardware as a feature that says you slept great. Both outputs exist. Only one is designed to keep you engaged with the product. For someone without a prior diagnosis, that dynamic isn't neutral. They receive the notification, feel the appropriate alarm, schedule a sleep study. And then every morning while they wait for that appointment, their Watch hands them a reassuring number that quietly tells them most nights are actually fine.
The score doesn't intend to undermine the notification. It just does.
What Better Would Look Like
Across 643 nights of Oura data, the BDI moved dramatically. My condition visibly worsened for seven months, then recovered. The sleep score barely registered either event. A scoring system that genuinely incorporated breathing disturbance data would have tracked that arc. It would have been lower during the escalation. It would have recovered when the BDI recovered. It would have given me information instead of reassurance.
Apple already has the data. The Watch tracks respiratory rate every night. The sleep apnea detection feature processes movement patterns associated with breathing irregularities. The raw ingredients for a more honest score exist inside the device I already wear.
Several changes would close the gap meaningfully. Surfacing the breathing disturbance count as a visible nightly metric the way both platforms surface time in each sleep stage would be a start. Allowing Sleep Score to be queried historically through HealthKit would let users actually audit the relationship between their physiology and their scores over time. And when a user has received a sleep apnea notification within the previous 30 days, the Sleep Score algorithm should weight respiratory metrics differently. The Watch already knows the clinical context. It is choosing not to use it in the output users see most.
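That last change can be sketched concretely. Everything below is invented for illustration, not Apple's or Oura's actual scoring logic: a context-aware adjustment that penalizes a nightly score for breathing disturbances, more aggressively when a sleep apnea notification has fired within the last 30 days.

```python
def adjusted_sleep_score(base_score: float,
                         bdi: float,
                         apnea_alert_recent: bool) -> float:
    """Hypothetical adjustment: subtract a penalty proportional to breathing
    disturbances above a warning threshold, weighted up when the clinical
    apnea alert is recent. Thresholds and weights are illustrative only."""
    THRESHOLD = 20  # Oura's documented BDI warning level
    weight = 1.5 if apnea_alert_recent else 0.5
    penalty = max(0.0, bdi - THRESHOLD) * weight
    return max(0.0, base_score - penalty)

print(adjusted_sleep_score(82, 45, apnea_alert_recent=True))   # → 44.5
print(adjusted_sleep_score(82, 45, apnea_alert_recent=False))  # → 69.5
```

Under a scheme like this, a night with a BDI of 45 could no longer coexist with a score in the 80s, and the score would have tracked both the seven-month escalation and the recovery that followed it.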
The Apple Watch Ultra I wear has sensors precise enough to detect a respiratory event, run it through an FDA-cleared algorithm, and decide whether to file a medical-grade alert. That capability is real. The score that appears alongside it should reflect the same seriousness. Until it does, the Sleep Score isn't just incomplete for someone with a diagnosed sleep disorder. It's working in direct opposition to the feature Apple is most proud of building.