
Please, For the Love of Compute, Stop Telling AI to "Think Step-by-Step"

Here at Factal, we have spent the better part of the last few years whispering sweet nothings into the ears of Large Language Models. We, like you, learned the magic incantations. We learned that if you want GPT-4 to solve a math problem or analyze a messy dataset without hallucinating a conspiracy theory, you have to use the Golden Prompt: "Let's think step by step."


It was the WD-40 of prompt engineering. It fixed everything. It turned "System 1" rapid-fire guessers into "System 2" deliberators.


But we have some bad news. If you are using the new class of "Reasoning Models" (like OpenAI’s o1 series or DeepSeek-R1) and you are still using that prompt… you are actively making your AI dumber.


We’re not just saying this to be contrarian. We’ve been diving deep into the research on Reasoning Interference, and it turns out that asking a Thinking Model to "show its work" is the computational equivalent of asking a master chess player to explain every synaptic firing in their brain while they are playing against Magnus Carlsen. They’re going to lose, and their explanation is going to be nonsense.  


The Paradox of the "Monitorability Tax"

To understand why our favorite prompt is now poison, we have to understand how these new models work. Old-school LLMs (let’s call them System 1) were purely instinctual. They didn't have a "brain" to pause and reflect; they just predicted the next word based on vibes and probability. When we asked them to "think step by step," we were forcing them to print their intermediate steps into the output, where that visible text served as a makeshift working memory.
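For nostalgia's sake, here is roughly what that trick looked like in code. This is a minimal sketch using the OpenAI Python SDK; the model name and question are illustrative, and the only point is that the appended suffix makes the model dump its intermediate steps into the visible output.

```python
from openai import OpenAI

client = OpenAI()

question = (
    "A bat and a ball cost $1.10 in total. The bat costs $1.00 more "
    "than the ball. How much does the ball cost?"
)

# The classic System 1 recipe: bolt the Golden Prompt onto the end so the
# model writes its intermediate steps into the visible output and uses that
# text as a makeshift working memory.
response = client.chat.completions.create(
    model="gpt-4o",  # illustrative; any non-reasoning chat model
    messages=[{"role": "user", "content": question + "\n\nLet's think step by step."}],
)

print(response.choices[0].message.content)
```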

  

The new Reasoning Models (System 2) are different. They have an internal, hidden "scratchpad" where they generate thousands of "thinking tokens" before they ever spit out a single word to you. They explore, they backtrack, they hit dead ends, and they self-correct - all in the dark. This is where the Monitorability Tax comes in.
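Before we get to the tax: you can actually see that hidden scratchpad in the token accounting. Below is a minimal sketch, assuming the OpenAI Python SDK's usage fields for reasoning models (completion_tokens_details.reasoning_tokens; exact field names can vary by SDK version). The interesting part is that the bulk of the work shows up as hidden reasoning tokens you never get to read.

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="o1-mini",  # illustrative reasoning model
    messages=[{"role": "user", "content": "How many primes are there between 100 and 150?"}],
)

usage = response.usage
# The scratchpad itself never appears in the output, but it shows up on the
# bill: reasoning tokens are counted separately from the tokens you can read.
print("visible completion tokens:", usage.completion_tokens)
print("hidden reasoning tokens:  ", usage.completion_tokens_details.reasoning_tokens)
```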


We humans have trust issues. We want to see the reasoning. We want the model to output a nice, clean, numbered list of steps so we can verify it. But research shows that forcing a Reasoning Model to translate its high-speed, complex internal thoughts into slow, clunky English while it is trying to solve the problem actually drains its cognitive budget.  

It’s a tax. You can have a model that solves the problem perfectly (but stays silent about how it did it), or you can have a model that gives you a mediocre explanation and gets the answer wrong. You cannot have both the right answer and the full play-by-play.



In fact, studies show that when you force these models to externalize their thinking, they suffer from "Calibration Breakage". By writing out a reason, the model accidentally convinces itself that the reason is true, even if it isn't, leading to confident hallucinations. It’s like talking yourself into a bad idea just because you said it out loud.


Noam Brown and the Secret Language of AI

This brings us to a fascinating point raised by OpenAI’s Noam Brown and echoed in the research on "Latent Reasoning".  

The fundamental issue is that English (or any human language) is a terrible medium for complex computation. It is "low bandwidth". It is ambiguous. It is full of fluff.  

When a model is "thinking" internally, it isn't necessarily thinking in words. It is operating in a Latent Space—a high-dimensional vector space where it can hold multiple conflicting ideas in "superposition" at the same time. It can manipulate abstract concepts that don't even have a direct English translation.  

When we command the model to "explain your reasoning step-by-step," we are forcing it to collapse that beautiful, complex 12,000-dimensional thought into a flat, clumsy English sentence. The error introduced by that rounding is Quantization Noise. The model has to "round off" its brilliance to fit into our limited vocabulary, and in doing so, it loses the nuance required to solve the problem.

As Noam Brown suggests, if we really want to unlock the next level of intelligence, we need to let the models develop their own internal language—a "Neuralese"—that is optimized for reasoning, not for chatting with us.  

The "Double Penalty" Disaster

There is a scenario where this gets even worse. We call it the Double Penalty.  

This happens when you are trying to save money (classic startup move, we get it) so you set the model’s "thinking budget" to low, but you still ask for a "step-by-step" explanation.

This is the worst possible configuration. You have starved the model of the internal tokens it needs to actually solve the problem, and simultaneously burdened it with the task of writing a detailed essay about the solution it hasn't found yet. The result? Confabulation. The model just starts making things up that sound logical but are completely factually bankrupt. It’s the AI equivalent of a student who didn't study for the exam trying to fill three blue books with fluff to get partial credit.  

An overview of Gemini 2.5 Flash's performance with its own (implicit) reasoning, with Chain-of-Thought (CoT) prompting, and with CoT prompting under a limited token budget
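If you want to reproduce the Double Penalty yourself, the sketch below shows the shape of the mistake. It assumes the google-genai Python SDK and its thinking_budget setting on ThinkingConfig (names may differ across SDK versions, and the question is made up). The anti-pattern is the combination: a starved thinking budget plus a prompt demanding a detailed step-by-step essay.

```python
from google import genai
from google.genai import types

client = genai.Client()

question = "Our churn rose 4% after the pricing change. Diagnose the likely causes."

# The Double Penalty: starve the internal scratchpad AND demand a step-by-step essay.
double_penalty = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=question + "\n\nExplain your reasoning step by step, in detail.",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_budget=128),  # starved budget
    ),
)

# The saner setup: give it room to think privately and just ask for the answer.
room_to_think = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=question,
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_budget=8192),
    ),
)

print(double_penalty.text)
print(room_to_think.text)
```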

Let It Think in the Dark

So, what is the takeaway for us builders and users?

We need to learn to trust the black box a little more. The era of "Prompting for Reasoning" is dying; the era of "Allocating for Reasoning" is here.  

If you are using a model like o1 or DeepSeek-R1, delete the "step-by-step" line from your prompt. Stop micromanaging the AI. Let it run its internal reinforcement learning loops. Let it think in high-dimensional vectors that we can't comprehend.
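In practice, "Allocating for Reasoning" looks something like the sketch below. It assumes the OpenAI SDK's reasoning_effort parameter for o-series models (other providers expose similar controls, like the thinking budget above): you delete the step-by-step instruction and turn the compute dial instead.

```python
from openai import OpenAI

client = OpenAI()

question = "Design a zero-downtime migration plan for our Postgres cluster."

# Old habit (don't do this with a reasoning model):
# prompt = question + "\n\nLet's think step by step and explain every step."

# New habit: say what you want, then allocate compute instead of prose.
response = client.chat.completions.create(
    model="o3-mini",          # illustrative reasoning model
    reasoning_effort="high",  # allocate thinking instead of prompting for it
    messages=[{"role": "user", "content": question}],
)

print(response.choices[0].message.content)
```

Same question, no micromanagement; the budget knob now does the job the Golden Prompt used to do.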

At Factal, we’re redesigning our backend to respect this shift. We’re stripping away the "show your work" constraints for the hard stuff and judging the models solely by their results.

It feels weird. It feels like giving up control. But if we want the right answers, we have to let the machine think in the dark.

Now, we know exactly what you’re thinking. "If the AI is developing its own secret language that I can’t read, isn’t that… kind of terrifying?"

Yes. Yes, it is. The idea that we are effectively trading "monitorability" for "capability" is a massive shift, and frankly, the fact that these models might soon be solving our problems using a "Neuralese" that we are physically incapable of understanding is a topic that deserves its own panic attack. We are going to dive deep into the safety implications (and existential dread) of this Black Box future in an upcoming blog post. But for today? Just delete the "step-by-step" prompt and enjoy the accuracy.

 
 
 
