TL;DR

11 months of chronic GPU TDR failures (nvlddmkm Event ID 153 + ID 14) on a brand-new Inno3D RTX 5070 Ti + i9-14900K + MSI MS-7E07, Win 11 25H2 Build 26200. Identical signature across 17+ incidents: video signal lost on all displays, GPU temperature instantly reads 0°C in HWiNFO (sensor lost), PC stays on, no BSOD, no reboot. Recovery Count = Fatal Error Count = Lane Errors = 0 in every log. 12V rail stable (≥11.91V). The pattern persisted through every change attempted. Most recent crash: 08/05/2026 at 19:01 in Path of Exile 2, 75 min into the session and 17 hours after a 2h-stable session. The config was Intel Default Settings (PL1=125W, just confirmed stable by a Linpack pass).

Full technical dossier (Gist): https://gist.github.com/jrm08/0d95c01151906beeb4041e88b424fc1c

System

Component | Spec
CPU | Intel i9-14900K (microcode 0x12B confirmed)
Motherboard | MSI MS-7E07 (Z790), BIOS M.70
GPU | Inno3D RTX 5070 Ti (Blackwell, 16 GB GDDR7)
RAM | 64 GB DDR5, Corsair CMP64GX5M2X6600C32 (rated 6800 CL32), running at 4788 MHz, XMP not applied
PSU | Corsair HX1000i 1000W ATX 3.1 (replaced the RM850x, same crash)
Display | LG GSM9E9D (native 3440×1440 @ 100 Hz), running 5120×2160 @ 165 Hz HDR with DSC active via DisplayPort
Storage | Samsung NVMe (PCIe 3.0) primary, Crucial NVMe secondary
OS | Win 11 Pro Build 26200.7840 (25H2 Insider Dev), WDDM 3.2
Driver | NVIDIA 591.86 (also crashed on 576.28)

Symptom (uniform across 17+ incidents)

  1. Gaming session, GPU stable under load (76-83°C, 220-280W, 100% utilization)
  2. Sudden freeze of all HWiNFO sensors (last value held)
  3. GPU Temperature drops to 0°C (sensor disconnect)
  4. GPU Power stays frozen at last value (no electrical cut)
  5. Video signal lost on all displays simultaneously
  6. OS stays alive (sound sometimes continues, ping responsive)
  7. Event Viewer: nvlddmkm ID 153 (Resetting TDR occurred on GPUID:100) + ID 14 (CMDre 00000002 00000200 00001000 00000005 0000002d)
  8. No BSOD. No automatic reboot. Manual reset required.

No precursor appears in any log: no thermal spike, no voltage drop, no PCIe error (Fatal Error Count = 0 and Lane Errors = 0, confirmed across all sessions including the ones that crashed).
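
For anyone who wants to check their own HWiNFO CSV export for the same signature, here is a minimal Python sketch. It assumes the column names `GPU Temperature [°C]` and `GPU Power [W]` and a Latin-1 encoded file; both are assumptions, so check your own CSV header. The helper names are illustrative and the sample rows are synthetic, not taken from my logs. It looks for samples where the GPU temperature reads 0 right after a non-zero value, then reports the min/max of another column over the preceding samples to see whether anything moved before the dropout.

```python
import csv
import io

# Assumed column names; HWiNFO headers vary by version and sensor layout.
TEMP_COL = "GPU Temperature [\u00b0C]"
POWER_COL = "GPU Power [W]"


def to_float(value):
    """Convert a CSV cell to float; return None for blanks or non-numeric text."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def find_dropouts(rows, temp_col=TEMP_COL):
    """Indices of samples where temperature reads 0 right after a non-zero reading."""
    hits, previous = [], None
    for index, row in enumerate(rows):
        current = to_float(row.get(temp_col))
        if current == 0 and previous is not None and previous > 0:
            hits.append(index)
        if current is not None:
            previous = current
    return hits


def precursor_range(rows, index, column, window=30):
    """(min, max) of `column` over up to `window` samples before `index`."""
    cells = (to_float(row.get(column)) for row in rows[max(0, index - window):index])
    values = [v for v in cells if v is not None]
    return (min(values), max(values)) if values else None


def load_rows(path, encoding="latin-1"):
    """Read a CSV export into a list of dicts (the encoding is an assumption)."""
    with open(path, newline="", encoding=encoding) as handle:
        return list(csv.DictReader(handle))


# Synthetic sample for illustration only, not real log data.
SAMPLE = (
    f"Time,{TEMP_COL},{POWER_COL}\n"
    "19:00:57,78,251\n"
    "19:00:58,79,262\n"
    "19:00:59,79,258\n"
    "19:01:00,0,258\n"
)
sample_rows = list(csv.DictReader(io.StringIO(SAMPLE)))
for hit in find_dropouts(sample_rows):
    print(sample_rows[hit]["Time"], precursor_range(sample_rows, hit, POWER_COL))
```

On a real log, replace `sample_rows` with `load_rows("your_log.csv")` and repeat `precursor_range` for the voltage and PCIe error columns you care about.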

What's been ruled out (with evidence)

Hypothesis | Status | Evidence
PSU undersized (RM850x 850W) | REFUTED 14/03 | Replaced with HX1000i 1000W ATX 3.1, fresh native 12V-2×6 cable. Crash within 10 min same day.
MSI Afterburner / RTSS hooks | REFUTED 06/04 | Uninstalled both. Crash on Last Epoch at 12m52s.
PCIe ASPM transitions (Recovery Count) | REFUTED 07/04 | BIOS Native ASPM = Disabled, Root Port 5 L1 = Disabled, Win Power PCIe = off. Crash at 2m37s. RC delta = 0 confirmed.
CPU Vmin shift permanent (silicon degradation) | REFUTED 07/05 | Applied Intel Default Settings (PL1=125W, PL2=188W, IccMax=307A). OCCT Linpack 30 min = 0 errors vs 10 errors / 20 min at 253W on 10/03. CPU is not silicon-degraded.
CPU overpower → AVX errors → GPU pipeline corruption | REFUTED 08/05 | INC-004: TDR PoE2 crash at 75 min with PL1=125W active (verified in-log). Linpack PASS ≠ gaming stability.

What's also been applied without effect

  • DDU + clean driver install (multiple times) — produces ~2 weeks of stability then relapses
  • HAGS (Hardware-accelerated GPU Scheduling) off since 12/04
  • MPO (Multi-Plane Overlay) off via OverlayTestMode = 5 since 12/04
  • TdrDelay = 10s (was 60s) since 12/04
  • PCIe forced to Gen3 in BIOS (was Gen5): improves stability but doesn't fix the crash
  • Tested on two driver versions (576.28 and 591.86) — both crash
  • Microcode 0x12B (Intel Vmin shift patch) confirmed in BIOS
  • Hyper-V tested both enabled and disabled at some point (current state not formally documented — flagged as gap)
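
To confirm the registry-level tweaks are still in place after a driver reinstall, a small Python sketch like the following could be used. The registry locations are my assumption of the commonly documented ones (`TdrDelay` and `HwSchMode` under `GraphicsDrivers`, `OverlayTestMode` under `Dwm`), with `HwSchMode = 1` assumed to mean HAGS off; verify them on your own system before trusting the result. The function names are illustrative. On non-Windows machines the reader simply returns `None`.

```python
import sys

# Values reported in this post; key paths are assumptions, not taken from the post.
GRAPHICS_KEY = r"SYSTEM\CurrentControlSet\Control\GraphicsDrivers"
DWM_KEY = r"SOFTWARE\Microsoft\Windows\Dwm"
EXPECTED = {
    (GRAPHICS_KEY, "TdrDelay"): 10,       # seconds
    (GRAPHICS_KEY, "HwSchMode"): 1,       # assumed: 1 = HAGS off, 2 = HAGS on
    (DWM_KEY, "OverlayTestMode"): 5,      # 5 = MPO off
}


def read_hklm(key_path, value_name):
    """Return a value from HKEY_LOCAL_MACHINE, or None if absent or not on Windows."""
    if sys.platform != "win32":
        return None
    import winreg  # Windows-only standard library module
    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, key_path) as key:
            value, _value_type = winreg.QueryValueEx(key, value_name)
            return value
    except OSError:
        return None


def snapshot():
    """Read the current value of every setting listed in EXPECTED."""
    return {setting: read_hklm(*setting) for setting in EXPECTED}


def diff_settings(actual, expected=EXPECTED):
    """Return {(key, name): (expected, actual)} for every mismatch."""
    return {
        setting: (wanted, actual.get(setting))
        for setting, wanted in expected.items()
        if actual.get(setting) != wanted
    }


if __name__ == "__main__":
    for (key_path, name), (wanted, found) in diff_settings(snapshot()).items():
        print(f"{name}: expected {wanted}, found {found} ({key_path})")
```

Running it before and after each DDU cycle would show whether a reinstall silently reverts any of these values, which matters given the "2 weeks stable, then relapse" pattern.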

Current top hypotheses (after agent-based re-analysis 08/05)

  1. nvlddmkm driver interacting with Blackwell firmware: 64% confidence. Counter-argument: blaming nvlddmkm is tautological, since every NVIDIA TDR is logged there. Two major driver versions both crash, which points to a probable upstream cause.
  2. Win 11 25H2 / WDDM 3.2 regression: 60%. However, the first crash dates from May 2025 and Build 26200.7840 from February 2026, so the timeline is incompatible unless a different OS regression caused the early crashes.
  3. Borderline Inno3D VBIOS: 48%. Untested so far (a run with the power limit at 70% has not been done).

Two emerging hypotheses

  • DX12 pipeline workload specific to PoE2 + Blackwell driver (mesh shaders / bindless / heavy command lists)
  • GPU PMU bug on Blackwell (P-state transitions under heterogeneous load)

The question I'm now asking myself (challenger insight)

What hasn't changed across all 17+ crashes in 11 months?

Variables that have stayed constant since May 2025:

  • The LG monitor + DP cable
  • The forced 5120×2160 @ 165 Hz HDR with DSC compression (way above DP 1.4 native bandwidth — DSC is doing heavy lifting)
  • BIOS M.70 (never compared against newer MSI releases)
  • DDR5 running at 4788 MHz with XMP not active (not a JEDEC-validated config for this kit)
  • Hyper-V state (whatever it is, presumably hasn't oscillated 17 times)
  • Intel ME firmware (never inspected)

When everything I've changed leaves the symptom unchanged, the cause is probably in the list above.
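
To put a number on how hard DSC is working in the forced mode, here is a rough back-of-the-envelope calculation in Python. It assumes a DP 1.4 HBR3 link (4 lanes, 32.4 Gbit/s raw, 25.92 Gbit/s after 8b/10b encoding) and 30 bits per pixel for 10-bit HDR RGB; the bit depth is my assumption. Blanking intervals are ignored, so the real requirement is somewhat higher than shown.

```python
def pixel_rate_gbps(width, height, refresh_hz, bits_per_pixel):
    """Active-pixel data rate in Gbit/s (blanking intervals ignored)."""
    return width * height * refresh_hz * bits_per_pixel / 1e9


# DP 1.4 HBR3: 4 lanes x 8.1 Gbit/s = 32.4 Gbit/s raw, 8b/10b leaves 80% for payload.
DP14_PAYLOAD_GBPS = 32.4 * 8 / 10
BPP_HDR10 = 30  # assumption: 10 bits per channel, RGB, no chroma subsampling

forced = pixel_rate_gbps(5120, 2160, 165, BPP_HDR10)   # mode I am running
native = pixel_rate_gbps(3440, 1440, 100, BPP_HDR10)   # mode listed as native

print(f"DP 1.4 payload : {DP14_PAYLOAD_GBPS:.2f} Gbit/s")
print(f"5120x2160@165  : {forced:.2f} Gbit/s, ratio {forced / DP14_PAYLOAD_GBPS:.2f}x")
print(f"3440x1440@100  : {native:.2f} Gbit/s, ratio {native / DP14_PAYLOAD_GBPS:.2f}x")
```

Under these assumptions the forced mode needs about 54.7 Gbit/s of pixel data, more than twice the DP 1.4 payload, so it cannot run without DSC, while 3440×1440 @ 100 Hz needs about 14.9 Gbit/s and fits uncompressed. That is why an A/B test between the two modes would isolate the DSC path.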

Specific asks:

  1. Has anyone running an RTX 5070 Ti or 5080/5090 on Win 11 25H2 Insider Dev seen TDRs persist after DDU + clean driver install + HAGS off + MPO off + Intel Default CPU settings + PCIe Gen3 + ASPM off + microcode 0x12B?
  2. Has anyone tested 5K2K @ 165 Hz HDR with DSC vs 3440×1440 @ 100 Hz native on a Blackwell driver and seen a stability difference?
  3. MSI MS-7E07 owners running a BIOS newer than M.70: do the changelogs mention RTX 5000 series PCIe / DSC / power-state fixes?
  4. DDR5 6800 Corsair kits on Z790: has anyone seen the system silently fall back to ~4788 MHz instead of activating XMP, and what was the fix?
  5. PoE2 + Blackwell community: are crashes more frequent on PoE2 specifically than on Last Epoch or other DX12 titles?
  6. WinDbg analysts: I have 4 unanalyzed minidumps from May 2025 to Jan 2026. Is anyone willing to look at the !analyze -v output and tell me whether the stack frame is constant (= upstream cause) or varies (= driver-specific bug)? (Linked in the full Gist.)
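
For the minidump question, a first-pass comparison can be automated once each dump's `!analyze -v` output is saved to a text file. The Python sketch below is only a triage aid under stated assumptions: it relies on the `FAILURE_BUCKET_ID`, `MODULE_NAME`, `IMAGE_NAME` and `SYMBOL_NAME` lines that `!analyze -v` normally prints, the function names are illustrative, and it does not replace reading the actual stacks.

```python
import re
from collections import Counter

FIELDS = ("FAILURE_BUCKET_ID", "MODULE_NAME", "IMAGE_NAME", "SYMBOL_NAME")


def extract_signature(analyze_text):
    """Pull the triage fields out of one saved `!analyze -v` output."""
    signature = {}
    for field in FIELDS:
        match = re.search(rf"^{field}:[ \t]*(\S[^\r\n]*)", analyze_text, re.MULTILINE)
        signature[field] = match.group(1).strip() if match else None
    return signature


def compare_signatures(outputs):
    """outputs maps dump name -> analyze text; returns a Counter of values per field."""
    tally = {field: Counter() for field in FIELDS}
    for text in outputs.values():
        for field, value in extract_signature(text).items():
            tally[field][value] += 1
    return tally


def is_constant(tally, field):
    """True when every dump reported the same, non-missing value for `field`."""
    return len(tally[field]) == 1 and None not in tally[field]


def report(outputs):
    """Print, per field, whether it is constant across dumps and which values occur."""
    tally = compare_signatures(outputs)
    for field in FIELDS:
        status = "constant" if is_constant(tally, field) else "varies"
        print(f"{field}: {status} {dict(tally[field])}")
```

Usage would be something like `report({name: open(name).read() for name in saved_files})`, where `saved_files` is a hypothetical list of the saved output files. A field that stays constant across all four dumps supports the "upstream cause" reading; fields that vary lean toward a driver-specific bug.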

The full technical dossier (timeline, all 17+ incidents, all logs, hardware monitoring CSV references, full agent-driven analysis) is hosted as a public GitHub Gist: https://gist.github.com/jrm08/0d95c01151906beeb4041e88b424fc1c

Thanks to anyone who reads through this. Even rejected hypotheses help, since they narrow the search. I'm especially interested in counter-cases: RTX 5070 Ti owners on a similar config who DON'T crash.