Improvements to Chain of Thoughts
After looking into Chain of Thought prompting and reasoning yesterday, I found further developments that let it work better and more efficiently on more general, vaguer tasks that require exploration and strategy to approach correctly. The Tree of Thoughts is a good way to change the framework so that these more general/complex tasks can be completed.
What is the Tree Of Thoughts?
Instead of following a single linear chain of reasoning, Tree of Thoughts allows the model to explore multiple reasoning paths simultaneously, much like how humans consider various approaches when solving difficult problems. The “tree” structure represents different branches of thought that can be explored, evaluated, and either pursued further or abandoned. This overcomes the normal ‘left to right/token level’ pattern that models originally used to make decisions during inference.
These multiple reasoning paths (called ‘thoughts’) are the LLM’s way of mimicking human reasoning patterns (multiple ways to go about solving complex problems), while also giving it the ability to look ahead at what each thought produces and to backtrack if a thought turns out to be a poor choice.
This shift to multiple strands of thinking also solidifies how we mimic human cognition with LLMs, allowing them to engage in more complex tasks using a different style of thinking from the token-based, left-to-right generation, which resembles System 1 thinking. Tree of Thoughts enables the more deliberate style known as System 2 thinking in human cognition research.
Despite being a step onwards from the first working language models, and an improvement on the 2022 changes that enabled more complex reasoning tasks, the idea was first considered in the 1950s by Newell, Shaw and Simon, who characterized problem solving as searching through a combinatorial problem space represented as a tree.
Each thought is specifically a ‘coherent language sequence that serves as an intermediate step towards problem solving’. As LMs are inherently based in language, the steps can be self-evaluated for how much progress they make towards a solution, so the model can reason about which one would provide the ‘best’ next intermediate step until the problem is solved.
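As a concrete illustration, here is a minimal sketch of how that kind of search could be wired up, assuming a hypothetical `llm()` helper that wraps whatever model API you use. It follows the general breadth-first ‘propose, score, prune’ shape described above, not the paper’s actual implementation.

```python
# Minimal breadth-first Tree of Thoughts sketch (illustrative, not the paper's code).
# `llm(prompt)` is a hypothetical helper returning the model's text completion;
# swap in whichever client and model you actually use.

def llm(prompt: str) -> str:
    raise NotImplementedError("wrap your model API call here")

def propose_thoughts(state: str, k: int = 3) -> list[str]:
    """Ask the model for k candidate next reasoning steps ('thoughts')."""
    out = llm(f"Problem so far:\n{state}\n\nPropose {k} distinct next steps, one per line.")
    return [line.strip() for line in out.splitlines() if line.strip()][:k]

def score_thought(state: str, thought: str) -> float:
    """Have the model self-evaluate how promising a candidate thought is (0-10)."""
    out = llm(f"Problem so far:\n{state}\n\nProposed step: {thought}\n"
              "Rate how promising this step is from 0 to 10. Reply with a number only.")
    try:
        return float(out.strip().split()[0])
    except (ValueError, IndexError):
        return 0.0

def tree_of_thoughts(problem: str, depth: int = 3, beam_width: int = 2) -> list[str]:
    """Breadth-first search over partial reasoning paths, pruning weak branches."""
    frontier = [[problem]]  # each entry is a partial path: problem statement + thoughts so far
    for _ in range(depth):
        candidates = []
        for path in frontier:
            state = "\n".join(path)
            for thought in propose_thoughts(state):
                candidates.append((score_thought(state, thought), path + [thought]))
        # keep only the most promising partial paths; everything else is abandoned (backtracking)
        candidates.sort(key=lambda c: c[0], reverse=True)
        frontier = [path for _, path in candidates[:beam_width]]
    return frontier[0]  # the highest-scoring full reasoning path found
```

The look-ahead comes from scoring each candidate before committing to it, and the backtracking comes from the pruning step: a branch that scores poorly simply never makes it into the next frontier.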
This is entirely different from how LMs behave out of the box, but as shown yesterday, this form of reasoning is very much within their capability: their training gives them an inherent ability to reason through problems when given the right tools and scaffolding.
A genuine problem-solving process involves the repeated use of available information to initiate exploration, which discloses, in turn, more information until a way to attain the solution is finally discovered. —— Newell et al.
This simple change, using a breadth of thoughts instead of a single thought or one chain of them, increased its ability to solve maths problems (in their case the Game of 24), lifting the solve rate from 4% with a normal Chain of Thought prompt to 74%, significantly higher than even the best Chain of Thought result with repeated sampling (self-consistency).
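To make that benchmark concrete: in the Game of 24 the model is given four numbers and has to reach exactly 24, using each number once with basic arithmetic, and each ‘thought’ is one intermediate operation. The instance below is just an illustrative example, not one of the paper’s reported cases.

```python
# One Game of 24 instance (numbers 4, 9, 10, 13) solved as a short chain of thoughts.
step1 = 13 - 9          # thought 1: 13 - 9 = 4   (numbers left: 4, 10, 4)
step2 = 10 - 4          # thought 2: 10 - 4 = 6   (numbers left: 4, 6)
answer = step1 * step2  # thought 3: 4 * 6 = 24
assert answer == 24
```

A Chain of Thought run commits to one such sequence of operations; Tree of Thoughts proposes several candidate operations at each step, scores them, and drops the dead ends.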
It even improved some of its creative writing, a task both methods were already good at: Tree of Thoughts was judged best 41% of the time and on par with the previous Chain of Thought process another 38% of the time.
It was also shown that an older model using Tree of Thoughts can beat a newer model using weaker frameworks (GPT-3.5 vs GPT-4), so it seems to inherently improve weaker models.
Improving Chain Of Thought Reasoning in LLMs
This method, despite being incredibly useful and powerful, is exhaustive in its exploration for the best path. That inevitably makes inference more complex and expensive.
By combining Chain of Thought decoding (as mentioned yesterday) with Tree of Thoughts, we can now fine-tune the LLM to leverage the Tree of Thoughts search tree and achieve the same level of performance while reducing the amount of complex inference needed. This fine-tuning is called Chain of Preference Optimisation (CPO).
The LLM is fine-tuned to align each step of its CoT reasoning paths with those of ToT, using the preference information inherent in the tree-search process. This combination allows the LLM to perform better.
The method is inspired by the preference-alignment techniques behind reinforcement learning from human feedback, in particular Direct Preference Optimisation (DPO), and CPO demonstrates how to effectively transfer the deliberate reasoning capabilities of expensive tree-search methods into efficient single-path inference, making advanced reasoning techniques more practical for real-world deployment while maintaining their quality benefits.
The model is fine-tuned with the DPO algorithm. The paired preferences are constructed by marking each candidate step according to whether it lies on the final path chosen by Tree of Thoughts: steps on that path are ‘preferred’, and the siblings that were explored but pruned are ‘dispreferred’. This essentially teaches the LLM to generate the path Tree of Thoughts would prefer while only paying for Chain of Thought decoding at inference time.
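A rough sketch of how those preference pairs might be assembled from a finished search tree is below. The `Node` and `PreferencePair` structures are my own assumptions about the bookkeeping rather than the paper’s code, and the trainer call at the end is only a placeholder for whatever DPO implementation you use.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One thought in a completed ToT search tree (assumed structure)."""
    text: str
    parent: "Node | None" = None
    on_final_path: bool = False                            # did the ToT search keep this step?
    siblings: list["Node"] = field(default_factory=list)   # alternatives proposed at the same step

@dataclass
class PreferencePair:
    prefix: str    # reasoning so far (shared context for both completions)
    chosen: str    # step that lies on the ToT-selected path
    rejected: str  # sibling step that was explored but pruned

def build_preference_pairs(final_path: list[Node]) -> list[PreferencePair]:
    """Pair each step on the winning path against the siblings ToT pruned at that point."""
    pairs = []
    for node in final_path:
        # reconstruct the shared reasoning prefix by walking up to the root
        chain, p = [], node.parent
        while p is not None:
            chain.append(p.text)
            p = p.parent
        prefix = "\n".join(reversed(chain))
        for sib in node.siblings:
            if not sib.on_final_path:
                pairs.append(PreferencePair(prefix, node.text, sib.text))
    return pairs

# The resulting pairs are then fed to a standard DPO trainer (placeholder, not real config):
# trainer = DPOTrainer(model=model, ref_model=ref_model, train_dataset=to_dataset(pairs), ...)
# trainer.train()
```

Because the ‘chosen’ and ‘rejected’ steps always share the same prefix, the fine-tuned model learns a step-level preference at every branching point, not just a preference over whole answers.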
This change allowed a number of key improvements, such as:
- 4.3% increase in accuracy over the normal base LLMs
- Up to 57.5x faster inference than Tree of Thoughts, with similar or better performance
- Much better results than the previous supervised fine-tuned models that trained only on the selected paths
These improvements have also shaped the way we spend tokens during reasoning.
Chain of Draft
There have, of course, been further improvements beyond Chain of Preference Optimisation. While nothing truly groundbreaking has happened since 2024, there have been a number of improvements to token usage through some small prompt engineering. Known as Chain of Draft, it uses its own form of prompting to reduce the number of tokens used in the intermediate reasoning steps without reducing the effectiveness of the Chain of Thought method. It cuts down the words the LLM spends on those steps, reducing word equations to actual equations and answering almost immediately after a few terse steps.
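For a rough sense of what that looks like at the prompt level (paraphrased from memory, not quoted verbatim from the paper):

```python
# Illustrative CoT vs CoD instructions and a GSM8K-style example (paraphrased, not verbatim).

COT_INSTRUCTION = (
    "Think step by step to answer the question. "
    "Explain each step in full, then give the final answer."
)

COD_INSTRUCTION = (
    "Think step by step, but keep only a minimal draft of each step, "
    "a few words at most. Return the final answer after '####'."
)

# Question: "Jason had 20 lollipops. He gave Denny some. Now he has 12. How many did he give away?"
# A CoT answer spells out the reasoning in full sentences; a CoD answer looks more like:
#   20 - x = 12; x = 20 - 12 = 8
#   #### 8
```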
This simple change allowed the average tokens and latency to drop dramatically:
| Task | Model | Tokens (CoT -> CoD) | Latency (CoT -> CoD) |
| --- | --- | --- | --- |
| Arithmetic reasoning | GPT-4o | 205.1 -> 43.9 | 4.2s -> 1.0s |
| Arithmetic reasoning | Claude 3.5 Sonnet | 190.0 -> 39.8 | 3.1s -> 1.6s |
| Commonsense reasoning (Task 1) | GPT-4o | 75.7 -> 30.2 | 1.7s -> 1.3s |
| Commonsense reasoning (Task 1) | Claude 3.5 Sonnet | 172.5 -> 31.3 | 3.2s -> 1.4s |
| Commonsense reasoning (Task 2) | GPT-4o | 28.7 -> 15.0 | 0.9s -> 0.7s |
| Commonsense reasoning (Task 2) | Claude 3.5 Sonnet | 189.4 -> 14.3 | 3.6s -> 1.0s |
| Symbolic reasoning | GPT-4o | 52.4 -> 16.8 | 1.4s -> 0.8s |
| Symbolic reasoning | Claude 3.5 Sonnet | 135.3 -> 18.9 | 3.1s -> 1.6s |
This massive improvement over Chain of Thought methods does come with some failures. Without few-shot examples it is quite inconsistent, and accuracy drops far more than in the standard few-shot Chain of Draft setup, which itself only trades a slight drop in accuracy for the token and time savings. It is also less effective on small and older models, most likely because their training data contains little reasoning written in the terse CoD style that would let it work well.
Conclusion
Overall, while many improvements have been made, there are still massive gains to be had in the field: getting LLMs to work both effectively and efficiently without compromising the quality of the response, which is effectively what Chain of Draft does. The same goes for Chain of Preference Optimisation which, while a computational improvement at inference time, is still computationally expensive, and in longer chains it may overfit or prune whole branches where such a big step back was not necessary, introducing noise that can affect the training data as well. Producing the ‘preferred routes’ is also costly in itself, since the full tree search still has to be run to generate them.
Small Aside Today
In the buildup to GPT-5, there have been reports from Anthropic suggesting that Claude Code was being used by OpenAI engineers, in some cases 24/7 on Max subscriptions, costing Anthropic $10,000-$90,000!
As this broke their terms and conditions, since Claude Code cannot be used to build a competing product (which it looks like they were doing to make GPT-5 more viable), they have been barred from using it in future.
“Claude Code has become the go-to choice for coders everywhere, and so it was no surprise to learn OpenAI’s own technical staff were also using our coding tools ahead of the launch of GPT-5,” -Christopher Nulty, Spokesperson for Anthropic.
Thought it was quite funny!