
Fifth Day - 4/8
Looking into Chain of Thought Prompting
This section will primarily look at the 2022 paper Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, its ramifications for the development of AI into what it is today, and (looking at modern capabilities) how useful and effective the technique remains in the current landscape.
Chain of thought prompting encourages the LLM to produce a series of intermediate reasoning steps, breaking a complex problem down into a sequence of small steps (a ‘chain of thought’) before arriving at the final solution. To encourage the LLM to engage in chain of thought (CoT), it is given a few demonstrations of chain-of-thought reasoning as examples.
In 2022 this allowed an LLM with otherwise poor reasoning skills to match or beat the best system at the time (a finetuned GPT-3) with only 8 example prompts of CoT.
This use of chain of thought combined two different ideas. The first was using natural language rationales to help with mathematical reasoning, either by training a model from scratch or by finetuning an already trained one. The second was a then relatively new concept: in-context learning via prompting, rather than finetuning, which was the main method of getting an LLM to perform well on a task beyond a general level. Both ideas had limitations, and combining them fixed each one's shortfalls: finetuning or training from scratch is incredibly costly, and creating the dataset needed to train the LLM to an appropriate level of skill is complicated, while in-context prompting was poor at tasks requiring reasoning abilities and did not get better with the scale of the LLM. The paper's approach was to provide a prompt consisting of 3 things:
- Input
- Chain of Thought
- Output
Here is an example:
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.
Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
This would provide the answer of:
A: The cafeteria had 23 apples originally. They used 20 to make lunch. So they had 23 - 20 = 3. They bought 6 more apples, so they have 3 + 6 = 9. The answer is 9.
Previously, without chain of thought prompting, the model would have answered 27. So even this small change in prompting style lets models work out word problems they were previously unable to solve.
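To make the structure concrete, here is a minimal sketch of assembling a few-shot CoT prompt from (question, chain of thought + answer) exemplars and running it through a local model. The model choice and code are illustrative, not the paper's actual setup (it evaluated models like LaMDA, GPT-3, and PaLM):

```python
# Minimal sketch: build a few-shot CoT prompt from worked exemplars.
from transformers import pipeline

EXEMPLARS = [
    ("Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
     "Each can has 3 tennis balls. How many tennis balls does he have now?",
     "Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 "
     "tennis balls. 5 + 6 = 11. The answer is 11."),
]

def build_cot_prompt(question, exemplars=EXEMPLARS):
    """Concatenate worked Q/A exemplars, then append the new question."""
    parts = [f"Q: {q}\nA: {a}" for q, a in exemplars]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

# Placeholder model: per the paper, CoT only really emerges in much
# larger models (~100B parameters), so gpt2 is just to show the wiring.
generator = pipeline("text-generation", model="gpt2")
prompt = build_cot_prompt(
    "The cafeteria had 23 apples. If they used 20 to make lunch "
    "and bought 6 more, how many apples do they have?"
)
print(generator(prompt, max_new_tokens=60)[0]["generated_text"])
```

The whole technique lives in `build_cot_prompt`: no finetuning, no changed weights, just exemplars prepended to the question.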
Why is CoT useful?
It has many useful properties. Beyond allowing an LLM to mimic how a human being would approach a multistep problem, breaking it down into small decomposable chunks that can be solved one at a time until the overall problem is solved, it also provides:
- A window of insight into the behaviour of the model, allowing us (if the answer is wrong) to debug the path the LLM took to the wrong answer
- A way for LLMs to solve essentially any task that humans can solve via language
- A technique that can be quickly elicited in any sufficiently large LLM with just a few examples of it being used.
Both figures presented in the paper show that the addition of chain of thought prompting allowed larger models to perform better (~100B parameters were needed to show significant improvement). However, it at times worsened smaller models: when they attempted to produce chain-of-thought reasoning they generated illogical lines of thought, and so produced more wrong answers on average.
The idea of chain of thought prompting revolutionised the capabilities of language models (particularly large ones), ushering in ‘Prompt Engineering’, which remains an incredibly useful tool in 2025 for getting an LLM to work effectively and efficiently on any task presented to it. While it is no longer the only tool, given the emergence of MCPs and tool/context engineering (as I wrote about on 29/7), prompt engineering is still a key principle for making AI work.
Since this paper, more work has gone into this method of AI reasoning, even making it possible to move beyond prompt engineering: by changing only the decoding process, we can surface the reasoning inherent in an LLM. This also lets us measure differences (particularly in confidence) between chain-of-thought and non-chain-of-thought decoding paths. This is highlighted in the 2024 paper, appropriately named, Chain-of-Thought Reasoning Without Prompting.
Currently, most LLMs pick their response via ‘greedy decoding’, a strategy that selects the most probable token at each step; it is computationally efficient and the simplest way of getting generally accurate results. However, it can get stuck in a kind of ‘local optimum’ based on the context it is provided, never considering the other potential sequences or the quality of the sequence it chooses. The paper suggests a more nuanced ‘top-k’ approach: branch on the top k first tokens, then pick the highest-quality (most confident) continuation. CoT paths are frequently found among those top-k branches, just not always at the very top, due to the heavy use of greedy decoding in how LLMs are typically trained. This allows the reasoning process underlying an LLM to be utilised instead of obscured by the bias of the training it received. This method of decoding is, naturally, called Chain-of-Thought Decoding.
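As a rough illustration of the idea, here is a minimal sketch of CoT decoding, assuming a Hugging Face causal LM. The model name is a placeholder, and the confidence score is a simplification: the paper averages the top-1 vs top-2 probability margin over the answer tokens only, whereas this sketch averages it over all generated tokens.

```python
# Minimal sketch of CoT decoding: branch on the top-k first tokens
# instead of committing to the single most probable one.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; the paper uses much larger models
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

@torch.no_grad()
def greedy_continue(ids, max_new_tokens=80):
    """Greedily decode a continuation; track the top-1 vs top-2
    probability margin at each step as a crude confidence signal."""
    margins = []
    for _ in range(max_new_tokens):
        probs = torch.softmax(model(ids).logits[0, -1], dim=-1)
        top2 = torch.topk(probs, 2)
        margins.append((top2.values[0] - top2.values[1]).item())
        next_id = top2.indices[0].view(1, 1)
        ids = torch.cat([ids, next_id], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break
    return ids, sum(margins) / len(margins)

@torch.no_grad()
def cot_decode(prompt, k=10):
    """Branch on the top-k first tokens, decode each branch greedily,
    and rank the resulting paths by average confidence."""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    first_probs = torch.softmax(model(input_ids).logits[0, -1], dim=-1)
    paths = []
    for tok in torch.topk(first_probs, k).indices:
        branch = torch.cat([input_ids, tok.view(1, 1)], dim=-1)
        ids, conf = greedy_continue(branch)
        paths.append((conf, tokenizer.decode(ids[0, input_ids.shape[-1]:])))
    return sorted(paths, reverse=True)  # most confident path first

# The k branches often contain a CoT path that pure greedy decoding misses.
question = "Q: I have 3 apples and buy 2 more. How many apples do I have?\nA:"
for conf, text in cot_decode(question, k=5)[:3]:
    print(f"{conf:.3f}  {text!r}")
```

Note that k is the only real knob here, which connects to the next point: how much raising k helps depends on whether the model is purely pretrained or instruction-tuned.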
Chain-of-Thought Decoding is shown to be even more useful than prompting alone, as it improves both pretrained and finetuned models. The right choice of k differs between them, however: a pretrained model keeps improving as k increases, whereas an instruction-finetuned model already has CoT paths quite high up in its decoding paths, so increasing k adds no significant benefit.
This method also improves on chain of thought prompting by being more generalised: it draws on the LLM's underlying reasoning skills rather than forcing the model to imitate a desired chain of thought, making it less task-specific, as the authors note:
Despite achieving high performance, few-shot prompting techniques are often task-specific, requiring prompt designs tailored to each task. This limits their generalizability across tasks. Advanced prompting techniques often require manually intensive prompt engineering, and their effectiveness varies depending on the choice of prompts, resulting in inconsistent performance outcomes.
- Xuezhi Wang/Denny Zhou
However, as a result of these papers, we have since moved on massively from this innovation. The Decreasing Value of Chain of Thought in Prompting is a key paper covering what I will be explaining and expanding on, with research into whether it truly matters if modern-day LLMs are given this prompt engineering in order to:
- Engage in Chain of Thought
- Correctly answer math and commonsense reasoning questions
- Consistently answer correctly.
Because most modern LLMs now engage in some form of CoT reasoning even when not asked to, the overall improvement and impact of explicit CoT prompting has shrunk.
The effect of course depends heavily on the model:
- Non-reasoning models improve by a small amount, especially where the model does not otherwise engage in step-by-step processing. But CoT prompting can also introduce more variability, and sometimes less confidence, in the answers, triggering occasional errors on questions the model would previously have gotten right.
- Models that already perform a form of CoT reasoning by default see a negligible improvement in performance, while needing more tokens than the direct answers they would already provide, which increases the cost and time of using the LLM.
- Models with explicit reasoning capabilities again show only marginal improvements in answer accuracy, while the time and tokens needed to generate answers increase significantly.
This leads me to believe that chain of thought prompting is still a useful tool for boosting the performance of older, more primitive models, but for modern-day LLMs its benefit is potentially outweighed by the increased time and cost of utilising it.