
TL;DR
11 months of chronic GPU TDR failures (nvlddmkm Event ID 153 + ID 14) on a brand-new Inno3D RTX 5070 Ti + i9-14900K + MSI MS-7E07 / Win 11 25H2 Build 26200. Identical signature across 17+ incidents: video signal lost on all displays, GPU temp instantly drops to 0°C in HWiNFO (sensor lost), PC stays on, no BSOD, no reboot. Recovery Count = Fatal Error Count = Lane Errors = 0 in every log. 12V rail stable (≥11.91V). The pattern persisted through every change attempted. Most recent crash: 08/05/2026 19:01, 75 min into a Path of Exile 2 session, 17 hours after a stable 2 h session. The config at the time was Intel Default Settings (PL1=125W), which had just passed Linpack.
Full technical dossier (Gist): https://gist.github.com/jrm08/0d95c01151906beeb4041e88b424fc1c
System
| Component | Spec |
|---|---|
| CPU | Intel i9-14900K (microcode 0x12B confirmed) |
| Motherboard | MSI MS-7E07 (Z790) — BIOS M.70 |
| GPU | Inno3D RTX 5070 Ti (Blackwell, 16 GB GDDR7) |
| RAM | 64 GB DDR5 — Corsair CMP64GX5M2X6600C32 (rated 6800 CL32) — running at 4788 MHz, XMP not applied |
| PSU | Corsair HX1000i 1000W ATX 3.1 (replaced from RM850x — same crash) |
| Display | LG GSM9E9D (native 3440×1440 @ 100 Hz) — running 5120×2160 @ 165 Hz HDR DSC active via DisplayPort |
| Storage | Samsung NVMe (PCIe 3.0) primary, Crucial NVMe secondary |
| OS | Win 11 Pro Build 26200.7840 (25H2 Insider Dev) — WDDM 3.2 |
| Driver | NVIDIA 591.86 (also crashed on 576.28) |
Symptom (uniform across 17+ incidents)
- Gaming session, GPU stable under load (76-83°C, 220-280W, 100% utilization)
- Sudden freeze of all HWiNFO sensors (last value held)
- GPU Temperature drops to 0°C (sensor disconnect)
- GPU Power stays frozen at last value (no electrical cut)
- Video signal lost on all displays simultaneously
- OS stays alive (sound sometimes continues, ping responsive)
- Event Viewer: `nvlddmkm` ID 153 (Resetting TDR occurred on GPUID:100) + ID 14 (CMDre 00000002 00000200 00001000 00000005 0000002d)
- No BSOD. No automatic reboot. Manual reset required.
Zero precursor in any log. No thermal spike, no voltage drop, no PCIe error (Fatal Error Count = 0, Lane Errors = 0 confirmed across all sessions including the one that crashed).
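For anyone who wants to check their own HWiNFO logs for the same signature, here is a rough Python sketch that looks for the moment the GPU temperature reads 0 right after a normal load temperature. The column name `GPU Temperature [°C]`, the `min_prev` threshold, and the sample rows are assumptions for illustration (the rows are synthetic, not taken from my logs), so check the header of your own CSV export first.

```python
import csv
import io

# Assumed column name; HWiNFO headers vary by sensor and language.
TEMP_COL = "GPU Temperature [°C]"


def find_sensor_loss(csv_text, temp_col=TEMP_COL, min_prev=40.0):
    """Return (Date, Time) of the first sample where the GPU temperature
    reads 0 directly after a plausible load temperature, or None."""
    prev = None
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            temp = float(row[temp_col])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows without a usable temperature value
        if prev is not None and prev >= min_prev and temp == 0.0:
            return row.get("Date"), row.get("Time")
        prev = temp
    return None


# Synthetic sample rows, only there to show the expected shape of the data.
sample = (
    "Date,Time,GPU Temperature [°C],GPU Power [W]\n"
    "8.5.2026,19:00:58,79.0,251.3\n"
    "8.5.2026,19:01:00,80.0,248.9\n"
    "8.5.2026,19:01:02,0.0,248.9\n"
)

print(find_sensor_loss(sample))
```

On a real log you would read the file into a string first (HWiNFO exports are often not UTF-8, so the encoding may need adjusting) and pass it to `find_sensor_loss`.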
What's been ruled out (with evidence)
| Hypothesis | Status | Evidence |
|---|---|---|
| PSU undersized (RM850x 850W) | REFUTED 14/03 | Replaced with HX1000i 1000W ATX 3.1, fresh native 12V-2×6 cable. Crash within 10 min same day. |
| MSI Afterburner / RTSS hooks | REFUTED 06/04 | Uninstalled both. Crash on Last Epoch 12m52s. |
| PCIe ASPM transitions (Recovery Count) | REFUTED 07/04 | BIOS Native ASPM = Disabled, Root Port 5 L1 = Disabled, Win Power PCIe = off. Crash 2m37s. RC delta = 0 confirmed. |
| Permanent CPU Vmin shift (silicon degradation) | REFUTED 07/05 | Applied Intel Default Settings (PL1=125W, PL2=188W, IccMax=307A). OCCT Linpack 30 min = 0 errors, vs 10 errors in 20 min at 253W on 10/03. CPU is not silicon-degraded. |
| CPU overpower → AVX errors → GPU pipeline corruption | REFUTED 08/05 | INC-004 — TDR PoE2 crash at 75 min with PL1=125W active (verified in-log). Linpack PASS ≠ gaming stability. |
What's also been applied without effect
- DDU + clean driver install (multiple times) — produces ~2 weeks of stability then relapses
- HAGS (Hardware GPU Scheduling) off since 12/04
- MPO (Multi-Plane Overlay) off via OverlayTestMode = 5 since 12/04
- TdrDelay = 10s (was 60s) since 12/04
- PCIe forced Gen3 in BIOS (was Gen5) — improves stability but doesn't fix
- Tested on two driver versions (576.28 and 591.86) — both crash
- Microcode 0x12B (Intel Vmin shift patch) confirmed in BIOS
- Hyper-V tested both enabled and disabled at some point (current state not formally documented — flagged as gap)
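To make the registry side of this list reproducible, here is a minimal Python sketch that compares the three tweaks above against what is actually in the registry. The key paths and the `HwSchMode` meaning (1 = HAGS off, 2 = on) reflect my understanding of where these values live, so treat them as assumptions and verify on your own machine. `read_hklm` and `audit` are names made up for this example.

```python
import sys

GRAPHICS = r"SYSTEM\CurrentControlSet\Control\GraphicsDrivers"
DWM = r"SOFTWARE\Microsoft\Windows\Dwm"

# Values I expect after the tweaks listed above (assumed locations).
EXPECTED = {
    (GRAPHICS, "TdrDelay"): 10,       # seconds
    (GRAPHICS, "HwSchMode"): 1,       # assumed: 1 = HAGS off, 2 = HAGS on
    (DWM, "OverlayTestMode"): 5,      # MPO disabled
}


def read_hklm(path, name):
    """Read one value under HKEY_LOCAL_MACHINE; None if it does not exist."""
    import winreg  # Windows-only standard library module

    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, path) as key:
            value, _value_type = winreg.QueryValueEx(key, name)
            return value
    except FileNotFoundError:
        return None


def audit(reader=read_hklm, expected=EXPECTED):
    """Return a list of human-readable mismatches (empty list = all as expected)."""
    mismatches = []
    for (path, name), want in expected.items():
        got = reader(path, name)
        if got != want:
            mismatches.append(f"{name}: expected {want}, found {got}")
    return mismatches


if sys.platform == "win32":
    for line in audit() or ["all three values match"]:
        print(line)
```

The `reader` parameter exists only so the comparison logic can be exercised without touching a real registry.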
Current top hypotheses (after agent-based re-analysis 08/05)
- Driver nvlddmkm + interaction with Blackwell firmware (64% confidence). Counter-argument: blaming nvlddmkm is tautological, since every NVIDIA TDR is logged there. Two major driver versions both crash, which points to a probable upstream cause.
- Win 11 25H2 / WDDM 3.2 regression (60%). But the first crash dates from May 2025, while Build 26200.7840 is from February 2026. The timeline only fits if a different OS regression caused the early crashes.
- Borderline Inno3D VBIOS (48%). Not yet tested with the power limit reduced to 70%.
Two emerging hypotheses
- DX12 pipeline workload specific to PoE2 + Blackwell driver (mesh shaders / bindless / heavy command lists)
- GPU PMU bug on Blackwell (P-state transitions under heterogeneous load)
The question I'm now asking myself (challenger insight)
What hasn't changed across all 17+ crashes in 11 months?
Variables that have stayed constant since May 2025:
- The LG monitor + DP cable
- The forced 5120×2160 @ 165 Hz HDR with DSC compression (way above DP 1.4 native bandwidth — DSC is doing heavy lifting)
- BIOS M.70 (never compared against newer MSI releases)
- DDR5 running at 4788 MHz with XMP not active (not a JEDEC-validated config for this kit)
- Hyper-V state (whatever it is, presumably hasn't oscillated 17 times)
- Intel ME firmware (never inspected)
When everything I've changed leaves the symptom unchanged, the cause is probably in the list above.
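To put a number on the DSC point in that list, here is a quick back-of-the-envelope calculation in Python. It assumes 10 bits per channel RGB (30 bits per pixel) for HDR, ignores blanking intervals (so the real requirement is somewhat higher), and assumes the link is DP 1.4 HBR3 on 4 lanes with 8b/10b encoding. None of these parameters have been verified against the actual link state.

```python
def uncompressed_gbps(width, height, hz, bits_per_pixel):
    """Active-pixel data rate in Gbit/s, ignoring blanking."""
    return width * height * hz * bits_per_pixel / 1e9


# DP 1.4 HBR3: 4 lanes x 8.1 Gbit/s raw, 8b/10b encoding leaves 80% payload.
DP14_PAYLOAD_GBPS = 4 * 8.1 * 8 / 10

forced = uncompressed_gbps(5120, 2160, 165, 30)  # the mode I am running
native = uncompressed_gbps(3440, 1440, 100, 30)  # the panel's reported native mode

print(f"DP 1.4 payload:      {DP14_PAYLOAD_GBPS:.2f} Gbit/s")   # 25.92
print(f"5120x2160 @ 165 Hz:  {forced:.1f} Gbit/s")              # 54.7
print(f"3440x1440 @ 100 Hz:  {native:.1f} Gbit/s")              # 14.9
print(f"Minimum compression: {forced / DP14_PAYLOAD_GBPS:.1f}x")  # 2.1x
```

Under these assumptions the forced mode needs at least about 2.1:1 compression to fit at all, while the native mode would fit uncompressed with room to spare. That is why the display path looks like the cheapest variable to test next.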
Specific asks:
- Has anyone running RTX 5070 Ti or 5080/5090 on Win 11 25H2 Insider Dev seen TDR persist after: DDU + driver clean + HAGS off + MPO off + Intel Default CPU + PCIe Gen3 + ASPM off + microcode 0x12B?
- Has anyone tested 5K2K @ 165 Hz HDR with DSC vs 3440×1440 @ 100 Hz native on the Blackwell driver and seen a stability difference?
- MSI MS-7E07 owners running a BIOS newer than M.70: do the changelogs mention RTX 5000 series PCIe / DSC / power-state fixes?
- DDR5 6800 Corsair kits on Z790: has anyone seen the system silently fall back to ~4788 MHz instead of activating XMP, and what was the fix?
- PoE2 + Blackwell community: are crashes more frequent on PoE2 specifically vs Last Epoch / other DX12 titles?
- WinDbg analysts: I have 4 unanalyzed minidumps from May 2025 to Jan 2026. Anyone willing to look at the `!analyze -v` output and tell me if the stack frame is constant (= upstream cause) or varies (= driver-specific bug)? (Linked in the full Gist.)
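For the minidump comparison in the last ask, this is roughly what I have in mind: save the `!analyze -v` text for each dump and check whether the failure bucket is the same every time. The sketch below is a simplification that only compares the `FAILURE_BUCKET_ID`, `MODULE_NAME` and `IMAGE_NAME` lines rather than full stacks. The report contents are synthetic placeholders (I have not analyzed the real dumps yet), and `extract_fields` / `bucket_is_constant` are names invented for this example.

```python
import re
from collections import Counter

FIELDS = ("FAILURE_BUCKET_ID", "MODULE_NAME", "IMAGE_NAME")


def extract_fields(analyze_text):
    """Pull the summary fields out of one saved `!analyze -v` output."""
    found = {}
    for name in FIELDS:
        match = re.search(rf"^{name}:[ \t]*(\S+)", analyze_text, re.MULTILINE)
        found[name] = match.group(1) if match else None
    return found


def bucket_is_constant(reports):
    """reports maps a dump name to its `!analyze -v` text.
    Returns (constant, counts): constant is True only if every dump
    has a bucket ID and all of them are identical."""
    counts = Counter(
        extract_fields(text)["FAILURE_BUCKET_ID"] for text in reports.values()
    )
    constant = len(counts) == 1 and None not in counts
    return constant, counts


# Synthetic placeholder reports, NOT real output from my dumps.
reports = {
    "dump1": "MODULE_NAME: nvlddmkm\nIMAGE_NAME:  nvlddmkm.sys\nFAILURE_BUCKET_ID:  EXAMPLE_BUCKET_A\n",
    "dump2": "MODULE_NAME: nvlddmkm\nIMAGE_NAME:  nvlddmkm.sys\nFAILURE_BUCKET_ID:  EXAMPLE_BUCKET_A\n",
}

constant, counts = bucket_is_constant(reports)
print("constant bucket" if constant else "bucket varies", dict(counts))
```

A constant bucket across all four dumps would support the upstream-cause reading. Varying buckets would point more toward a driver-specific bug. A proper answer still needs someone to read the actual stacks.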
Full technical dossier (timeline, all 17+ incidents, all logs, hardware monitoring CSV references, full agent-driven analysis) is hosted as a public GitHub Gist: https://gist.github.com/jrm08/0d95c01151906beeb4041e88b424fc1c
Thanks to anyone who reads through this. Even rejected hypotheses help, since they narrow the search. I'm especially interested in counter-cases: RTX 5070 Ti owners on a similar config who DON'T crash.