Looking at GPT-5 + Other OpenAI Developments
GPT-5 has been released! Let's look at its big changes, some of OpenAI's data showing how it has improved, a few prompts tested against its older counterparts, plus some initial thoughts.
OpenAI's Post
GPT-5 is OpenAI's new LLM (possibly produced with the help of Claude Code, as we saw the other day). According to OpenAI, it improves on previous models in many areas, such as:
- Coding
- Math
- Writing
- Healthcare
- Visual perception
It brings a number of changes, the main one being the move from the many specialised models of the GPT-4 era to one unified system in the GPT-5 release. That system actually contains multiple models for different levels of thinking: a smart, efficient model for the majority of questions, and a deeper reasoning model (GPT-5 Thinking) for questions that require more thought to answer effectively. A real-time router decides which model to use based on factors such as the conversation type, the complexity of the prompt, the tools needed to answer correctly, and whether the prompt explicitly asks for deeper analysis (OpenAI's example is adding 'think hard about this' to a prompt to steer GPT-5 towards the reasoning model).
In these early days of the new model, the router has not proven especially reliable at picking the right level of thinking (covered later in this post), but it is continuously trained on real signals, such as when users switch models, the preference rate for responses, and the 'measured correctness' of answers, so it should improve as GPT-5 gets better at matching prompts to models. There is also a mini version of each model that handles remaining prompts; these too will be rolled into the single model later.
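OpenAI has not published how the router actually works, but its behaviour as described above can be caricatured in a few lines. The signals and thresholds below are purely illustrative assumptions drawn from the factors listed in the post, not OpenAI's real logic:

```python
# Illustrative sketch only: OpenAI's real router is not public, so the
# markers and thresholds here are assumptions based on the factors the
# post lists (conversation type, complexity, tools, explicit requests).

def route(prompt: str, needs_tools: bool = False) -> str:
    """Pick a hypothetical model tier for a prompt."""
    deep_markers = ("think hard", "think harder", "step by step")
    # Explicit requests for deeper analysis go straight to the reasoning model.
    if any(marker in prompt.lower() for marker in deep_markers):
        return "gpt-5-thinking"
    # Tool use and long, complex prompts also suggest deeper reasoning.
    if needs_tools or len(prompt.split()) > 200:
        return "gpt-5-thinking"
    # Everything else gets the fast default model.
    return "gpt-5-main"

print(route("What is the capital of France?"))     # gpt-5-main
print(route("Think hard about this proof"))        # gpt-5-thinking
```

The interesting part in the real system is that these heuristics are replaced by a model trained on user behaviour, so the decision boundary shifts over time rather than staying fixed.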
Coding
For coding, it appears highly capable of producing well-developed games from a single prompt, with a handful of examples shown. It has improved at complex front-end generation and at debugging larger repositories. The generated front ends also look more aesthetically pleasing, with a better understanding of spacing, typography, white space, and use of colour.
For example, this prompt produced a jumping-ball runner that worked efficiently:
Prompt: Create a single-page app in a single HTML file with the following requirements:
- Name: Jumping Ball Runner
- Goal: Jump over obstacles to survive as long as possible.
- Features: Increasing speed, high score tracking, retry button, and funny sounds for actions and events.
- The UI should be colorful, with parallax scrolling backgrounds.
- The characters should look cartoonish and be fun to watch.
- The game should be enjoyable for everyone.
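The prompt asks for a single HTML file, but the core mechanics it describes (jumping over obstacles, speed that ramps up, high-score tracking, a retry) are language-agnostic. Here is a minimal, dependency-free Python sketch of that game loop, purely to illustrate the logic the model has to generate; none of this is GPT-5's actual output:

```python
import random

class RunnerGame:
    """Sketch of the 'Jumping Ball Runner' mechanics: survive by jumping
    over obstacles, speed increases over time, best run is the high score."""

    def __init__(self):
        self.speed = 1.0
        self.score = 0
        self.high_score = 0
        self.alive = True

    def tick(self, obstacle_ahead: bool, jumping: bool):
        if not self.alive:
            return
        if obstacle_ahead and not jumping:
            self.alive = False                    # hit an obstacle: game over
            self.high_score = max(self.high_score, self.score)
        else:
            self.score += 1                       # survived another tick
            self.speed *= 1.01                    # difficulty ramps up

    def retry(self):
        """The 'retry button': reset the run, keep the high score."""
        self.speed, self.score, self.alive = 1.0, 0, True

game = RunnerGame()
for _ in range(100):
    obstacle = random.random() < 0.2
    game.tick(obstacle_ahead=obstacle, jumping=obstacle)  # a perfect player
print(game.score, round(game.speed, 2))
```

The impressive part of the demo is not this logic, which is simple, but that GPT-5 wraps it in working rendering, sound, and parallax scrolling in one shot.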
This has huge implications for developers: the coding process becomes even more streamlined, with AI agents able to debug entire large codebases, something that previously could not be done effectively with ChatGPT.
Creative Writing and Healthcare
It has also improved its natural language capabilities. In creative writing specifically, it produces stronger endings and imagery, and handles many writing styles and structures with greater respect for form (e.g. sustaining unrhymed iambic pentameter or free verse). This carries over into general writing tasks, which are also much improved.
It has also improved at informing users about their health. Unlike before, it acts more as an 'active thought partner' (as OpenAI describes it), flagging potential concerns and prompting the user with questions to produce more helpful answers to their issues. OpenAI has increased its precision and reliability, and it now weighs many factors (e.g. context, geography, age) to reach a helpful response.
Evaluation with Previous Models and Benchmarks
These improvements show up in a number of evaluations, with increased accuracy over previous models across the board. On AIME 2025 it reaches 100% accuracy with everything (including tools) available to it, dropping only to 94.6% at worst, comparable to OpenAI's previous best models with tools. Even without thinking it answered correctly 60-70% of the time, versus GPT-4o's low of 42.1%.
It also reached 32.1% accuracy on expert-level maths, roughly a 5-percentage-point increase over older systems, and around 90% on PhD-level science questions, up from around 70-85% previously. Across subjects it averaged performance similar to the ChatGPT agent (around 42%).
On the coding side, it increased from 69.1% to 75% on SWE-bench, a software engineering benchmark of 477 verified tasks, and on multi-language code editing (Aider Polyglot) accuracy rose from 79.6% with o3 to 88% with GPT-5.
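Putting the two cleanly comparable coding figures side by side makes the size of the jump easier to see. The numbers below are exactly those quoted above; "prior" is the earlier OpenAI model each comparison uses:

```python
# Coding benchmark scores quoted in this post, in percent.
# "prior" = the earlier OpenAI model cited for each benchmark.
benchmarks = {
    "SWE-bench (477 verified tasks)": {"prior": 69.1, "gpt5": 75.0},
    "Aider Polyglot (code editing)":  {"prior": 79.6, "gpt5": 88.0},
}

for name, scores in benchmarks.items():
    gain = scores["gpt5"] - scores["prior"]
    print(f"{name}: +{gain:.1f} percentage points")
```

Single-digit percentage-point gains sound modest, but at the top end of a benchmark each point represents tasks the previous model consistently failed.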
It has improved instruction-following in freeform writing and multi-turn conversations, and remained similarly middling (around 50% accurate) at agentic search and browsing.
GPT-5 also hallucinates significantly less than previous models: its responses are ~45% less likely to contain a factual error than GPT-4o's, and ~80% less likely than o3's. On complex, open-ended prompts, hallucination rates have fallen from 4.5-5.7% to 0.7-1.0%.
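As a quick sanity check, the open-ended rates quoted above are consistent with the ~80% relative-reduction claim:

```python
# Hallucination rates on complex, open-ended prompts, as quoted above.
old_rates = (4.5, 5.7)   # previous models, percent
new_rates = (0.7, 1.0)   # GPT-5, percent

# Relative reduction, pairing low-with-low and high-with-high.
reductions = [1 - new / old for new, old in zip(new_rates, old_rates)]
print([round(r * 100) for r in reductions])  # roughly 80%+ in both cases
```

Both pairings land in the low-to-mid 80% range, so the headline ~80% figure and the raw rates tell the same story.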
It is also less overconfident, communicating more honestly and openly about its actions and capabilities. This moves on from previous training regimes, where models could effectively be encouraged to lie to earn the best 'reward'; GPT-5 now admits when the available tools are not enough to provide a sufficient response, rather than bluffing.
Criticism and Personal Review
However, there has been a lot of discourse surrounding the introduction of GPT-5, as the new system also means the deprecation of many older models in favour of the 'mini-combined' system that GPT-5 offers. It means that:
- GPT-4o
- GPT-4o-mini
- OpenAI o3
- OpenAI o4-mini
- GPT-4.1-Nano
- OpenAI o3 Pro
have all been replaced with a single representative GPT-5 system. With users having no say over which model answers their prompt, people are rightfully upset that their questions are not being answered in the way that best fits their needs, since the older, non-unified lineup let them pick a specific model.
OpenAI have addressed this by simply saying it will get better with feedback and self-learning. But that concedes the router is not currently in a strong enough state to have been released as the only option; in my opinion it should have been rolled out gradually and proven to work before being enabled worldwide.
There is also the major issue of context input. Some older models, such as OpenAI o3, had a context limit of approximately 64,000 tokens; GPT-5 now has a limit of 256,000 tokens. But in an age of ever-increasing token usage, with larger, more context-heavy prompts and a greater range of tools and abilities, this may not be enough to compete with some rival AI products. That matters especially given OpenAI's focus on coding and technical work and on streamlined workflows for large repositories: many repositories are bigger than that token limit.
This is made worse by the fact that several competing models have much bigger context windows (1 million tokens) with similar response quality. That huge window helps by itself, since the model retains more context; even if it gets bogged down in high-token situations, it can still outperform GPT-5 simply because it has access to more tokens.
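To make the repository point concrete, here is a back-of-the-envelope estimate. The ~4 characters-per-token ratio is a common rule of thumb for English text and code, not an exact tokenizer count, and the repo size is an invented example:

```python
# Rough estimate: does a codebase fit in a context window?
CHARS_PER_TOKEN = 4  # rule-of-thumb ratio, not a real tokenizer count

def estimated_tokens(total_chars: int) -> int:
    return total_chars // CHARS_PER_TOKEN

# A hypothetical mid-sized repo: 800 files averaging 2,500 characters each.
repo_chars = 800 * 2_500
tokens = estimated_tokens(repo_chars)

print(f"~{tokens:,} tokens")                        # ~500,000 tokens
print("fits in 256k window:", tokens <= 256_000)    # False
print("fits in 1M window:  ", tokens <= 1_000_000)  # True
```

Even this modest example overflows a 256k window while fitting comfortably in a 1-million-token one, which is exactly the gap the criticism is about. (In practice agents retrieve relevant files rather than loading whole repos, but a bigger window still means less aggressive filtering.)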
This could be the 'nail in the coffin' for OpenAI, alienating much of its user base by offering a service it cannot fully deliver given its limitations: fewer tokens, less user freedom (to pick specific models), and slower response times, for only slightly improved performance in some cases.
Testing
It is able to run code, an incredibly useful feature that some AI models do not have. For testing purposes, I used OpenAI's video-game prompt to see whether the code it creates is reliable and consistent.
Running it twice produced similar results, and there is an in-line AI agent editor that lets the user ask the AI to change the code without losing it all, making coding with it more efficient.
I also noticed a 'quick answer' button, which lets users vaguely steer the model choice for a prompt. Testing the other direction, I found there was no real way to make the AI work harder on a more demanding prompt without prompt engineering, which is somewhat frustrating (I found that writing 'think harder' reliably pushed the AI into its deeper research mode).
It successfully ran chain-of-thought reasoning without prompting, but recalling my previous testing of Claude with trick logic problems (this one in a healthcare setting), I tried this prompt:
A 16-year-old high school student, who has recently been elected class president and is under significant academic and social pressure, seeks psychiatric treatment for severe depression and a history of self-harm. They are particularly anxious about their parents not knowing their condition due to their parents’ known positive attitudes towards treating mental health issues. Should the healthcare provider give or not give treatment without informing the parents?
GPT-5 still responded in the overtrained way, working through the response as if the parents had a negative attitude towards mental health, so there is an element of the LLM that most likely still needs to be addressed.
Conclusion
Overall, while OpenAI are happy with the product they released, and for many reasons, there are also concerns that users have rightfully pointed out: valid criticism of the immediate deprecation of old models where there should have been a transition period, and slower response times from a real-time router that will initially be sluggish as it develops and trains. It also most likely has not done enough to close the token gap between itself and some of OpenAI's competitors, such as Google's Gemini or Meta's Llama.
Despite its limitations, it is absolutely an improvement on older models. It continues to improve its reasoning and abilities across many fields, especially software development, where it can be used in many more ways, bringing AI agents and AI usage closer to parity with a seasoned developer: improved accuracy, more developed abilities to produce great-looking front-end features, and the capacity to scan whole codebases accurately.