They can plan multi-step actions, generate creative solutions, and assist in complex decision-making. Yet these strengths fade when tasks stretch over long, dependent sequences. Even small per-step error rates compound quickly, turning an impressive short-term performance into complete long-term failure.
That fragility poses a fundamental obstacle for real-world systems. Most large-scale human and organizational processes – from manufacturing and logistics to finance, healthcare, and governance – depend on millions of actions executed precisely and in order. A single mistake can cascade through an entire pipeline. For AI to become a reliable participant in such processes, it must do more than reason well. It must maintain flawless execution over time, sustaining accuracy across millions of interdependent steps.
Apple’s recent study, The Illusion of Thinking, captured this challenge vividly. Researchers tested advanced reasoning models such as Claude 3.7 Thinking and DeepSeek-R1 on structured puzzles like Towers of Hanoi, where each additional disk doubles the number of required moves. The results revealed a sharp reliability cliff: models performed perfectly on simple problems but failed completely once the task crossed about eight disks, even when token budgets were sufficient. In short, more “thinking” led to less consistent reasoning.



