Can we change the Model to save cost?


I want to check whether the AI API we currently use is actually the right choice. There are a number of models on the market, each with differing benefits and drawbacks, so I have decided to look at a range of models and their lighter-weight counterparts, both to better understand the options available and to improve the MOD Navigator, or at the very least confirm that the model we have in place is the right one.

When initially creating the MOD Navigator, I chose the Anthropic API, specifically Claude Opus 4.1, as it is Anthropic's strongest model and I assumed it would produce the best responses. After conversations with Lucien, we moved to another Anthropic model, Claude Haiku 3.5, to keep within a smaller budget, and I saw no real change in the quality of the responses to the queries we tried. To make sure that this is not just luck, however, I want to test it in a more rigorous setting.

Model Research

Below is the set of models I will be looking at, comparing cost, speed and overall ability to meet the evaluation metrics. The table lists their respective usage prices.

| Model Name | Price (USD per million tokens) | Notes |
| --- | --- | --- |
| GPT-O3 | 0.50 Input, 2.00 Output | Old gold standard |
| GPT-5 | 1.25 Input, 10.00 Output | Standard OpenAI API |
| GPT-5 Nano | 0.05 Input, 0.40 Output | Low-level OpenAI API |
| Claude Haiku 3.5 | 0.80 Input, 4.00 Output | Low-level Anthropic API |
| Claude Sonnet 4 | 3.00 Input, 15.00 Output | Medium-level Anthropic API |
| Claude Opus 4.1 | 15.00 Input, 75.00 Output | High-level Anthropic API |
| Gemini 2.5 Pro | 1.25-2.50 Input, 10.00-15.00 Output | Google's API |
| Gemini 2.5 Flash | 0.30 Input, 2.50 Output | Lightweight Google API |
| Deepseek | 0.56 Input, 1.68 Output | Chinese API |

From this initial assessment we can see that there are a number of cheaper models that could replace our current model (Claude Haiku 3.5), provided they hold up to the same standard or better.
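
As a quick sanity check on these prices, the cost of a single query can be estimated from typical token counts. The sketch below is illustrative only: the ~900 input / ~150 output token figures are assumptions (they happen to be roughly in line with the usage measured later), not part of the test harness.

# Rough per-query cost estimate from the published per-million-token prices.
# The token counts (900 in, 150 out) are illustrative assumptions only.
PRICES_USD_PER_MILLION = {
    "Claude Haiku 3.5": (0.80, 4.00),
    "Gemini 2.5 Flash": (0.30, 2.50),
    "Deepseek": (0.56, 1.68),
    "Claude Opus 4.1": (15.00, 75.00),
}

def estimated_cost_per_query(tokens_in, tokens_out, price_in, price_out):
    return (tokens_in * price_in + tokens_out * price_out) / 1_000_000

for model, (price_in, price_out) in PRICES_USD_PER_MILLION.items():
    print(f"{model}: ~${estimated_cost_per_query(900, 150, price_in, price_out):.6f} per query")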

Testing Code

All code can be found on the 'testing_models' branch of the DocReader repository.


"""
Comprehensive Model Test - Generate Full Markdown Report

This script runs all available models with the complete set of test queries
and generates a detailed markdown report with all responses.
"""

import os
import json
from datetime import datetime
from model_comparison import ModelTester, MODEL_CONFIGS, generate_markdown_report

def main():
    print("๐Ÿš€ Starting Comprehensive Model Comparison")
    print("=" * 60)
    
    # Load test queries
    try:
        with open('test_queries.json', 'r') as f:
            test_queries = json.load(f)
        print(f"๐Ÿ“ Loaded {len(test_queries)} test queries")
    except FileNotFoundError:
        # Fallback to default queries
        test_queries = [
            "What is the recommended occupancy for junior ranks according to Table 2D.1?",
            "What are Key User Requirements (KURs) and how do they differ from other priorities?",
            "What should the Theatre Infrastructure Development Directive (TIDD) address regarding infrastructure support?",
            "What is an Infrastructure Development Plan and what key elements should it include?",
            "What approvals might be required before occupation of a military facility?",
            "What is the purpose and content of a Statement of Need (SON) in infrastructure planning?",
            "How should surge capacity be considered when sizing military bases?",
            "What guidance should a TIDD provide about the planning horizon for infrastructure investments?",
            "What is a Variation Order (VO) and when is it typically used in military infrastructure projects?",
            "What factors should be considered when planning for base population and contractor presence?"
        ]
        print(f"๐Ÿ“ Using {len(test_queries)} default test queries")
    
    # Check available models
    available_models = []
    for config in MODEL_CONFIGS:
        if os.getenv(config.api_key_env):
            available_models.append(config)
    
    print(f"๐Ÿ”‘ Found API keys for {len(available_models)} models:")
    for config in available_models:
        print(f"   โœ… {config.name}")
    
    print(f"\n๐Ÿงช Starting comprehensive test...")
    print(f"   Models: {len(available_models)}")
    print(f"   Queries: {len(test_queries)}")
    print(f"   Total API calls: {len(available_models) * len(test_queries)}")
    print(f"   Estimated time: {len(available_models) * len(test_queries) * 3:.0f}s")
    
    # Initialize tester
    try:
        tester = ModelTester()
        print("โœ… RAG system loaded successfully")
    except Exception as e:
        print(f"โŒ Failed to load RAG system: {e}")
        return
    
    # Run comprehensive tests
    all_results = []
    total_tests = len(available_models)
    
    for i, config in enumerate(available_models, 1):
        print(f"\n[{i}/{total_tests}] Testing {config.name}...")
        
        try:
            result = tester.test_model(config, test_queries)
            all_results.append(result)
            
            if result.error_count == 0:
                print(f"   โœ… Success: {result.average_response_time:.2f}s avg, ${result.total_cost:.6f} total")
            else:
                print(f"   โš ๏ธ  {result.error_count} errors out of {len(test_queries)} queries")
                print(f"   ๐Ÿ“Š Success rate: {result.success_rate:.1%}")
                
        except Exception as e:
            print(f"   โŒ Critical error: {e}")
            # Create error result for reporting
            from model_comparison import ModelResults, QueryResult
            error_results = [
                QueryResult(q, "", 0, 0, 0, 0, str(e)) for q in test_queries
            ]
            all_results.append(ModelResults(
                config.name, error_results, 0, 0, 0, 0, len(test_queries), config.notes
            ))
    
    # Generate comprehensive report
    print(f"\n๐Ÿ“Š Generating comprehensive markdown report...")
    
    report = generate_markdown_report(all_results, test_queries)
    
    # Save report with timestamp
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"model_comparison_comprehensive_{timestamp}.md"
    
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(report)
    
    print(f"โœ… Report saved: {filename}")
    
    # Print summary
    print(f"\n๐Ÿ“ˆ FINAL SUMMARY")
    print("=" * 60)
    successful_models = [r for r in all_results if r.success_rate > 0]
    
    if successful_models:
        # Sort by success rate, then by cost
        successful_models.sort(key=lambda x: (x.success_rate, -x.total_cost), reverse=True)
        
        print(f"๐Ÿ† Best performing models:")
        for i, result in enumerate(successful_models[:3], 1):
            print(f"   {i}. {result.model_name}: {result.success_rate:.0%} success, {result.average_response_time:.2f}s avg, ${result.total_cost:.6f}")
        
        # Cost analysis
        costs = [r.total_cost for r in successful_models if r.total_cost > 0]
        if costs:
            cheapest = min(successful_models, key=lambda x: x.total_cost)
            fastest = min(successful_models, key=lambda x: x.average_response_time)
            
            print(f"\n๐Ÿ’ฐ Most cost-effective: {cheapest.model_name} (${cheapest.total_cost:.6f})")
            print(f"โšก Fastest: {fastest.model_name} ({fastest.average_response_time:.2f}s avg)")
    
    print(f"\n๐Ÿ“„ Full report with all responses: {filename}")
    print("=" * 60)

if __name__ == "__main__":
    main()

This code not only produces responses to the 10 queries, it also records the speed of each response (and the average), the cost per query and the number of tokens used, and scores each model against the evals we have set out for the MOD Navigator.
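
The timing, token and cost bookkeeping lives inside ModelTester in model_comparison.py. As a rough illustration of the idea only, a per-query record could be built like this; `ask` is an assumed stand-in for whatever wraps the RAG pipeline and returns an answer plus token counts, not the actual implementation.

import time

def timed_query(ask, query, price_in_per_million, price_out_per_million):
    # Time one query and derive its cost from the token usage it reports.
    start = time.perf_counter()
    answer, tokens_in, tokens_out = ask(query)  # assumed signature: (answer, input tokens, output tokens)
    elapsed = time.perf_counter() - start
    cost = (tokens_in * price_in_per_million + tokens_out * price_out_per_million) / 1_000_000
    return {
        "query": query,
        "answer": answer,
        "seconds": round(elapsed, 2),
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "cost_usd": round(cost, 6),
    }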

Results

The full report can be found here. For brevity we will only look at a few key examples and the overall results of the test (the code is in the previous section, and also under comprehensive_test.py in the repository).

Table results:

| Model | Success Rate | Avg Time (s) | Total Cost | Cites All | Not Para | Cites Sections | Overall Quality |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Claude Haiku 3.5 | 100.0% | 3.15 | $0.013560 | 0.90 | 0.70 | 0.90 | 0.83 |
| Deepseek | 100.0% | 11.04 | $0.007277 | 0.90 | 0.90 | 0.60 | 0.80 |
| GPT-O3 | 100.0% | 11.54 | $0.021297 | 0.90 | 0.90 | 0.60 | 0.80 |
| Claude Sonnet 4 | 100.0% | 2.73 | $0.048360 | 0.90 | 0.90 | 0.60 | 0.80 |
| Gemini 2.5 Flash | 100.0% | 1.42 | $0.005431 | 0.90 | 0.80 | 0.60 | 0.77 |
| Claude Opus 4.1 | 100.0% | 4.31 | $0.252825 | 0.90 | 0.60 | 0.80 | 0.77 |
| GPT-5 Nano | 100.0% | 12.70 | $0.009882 | 0.80 | 0.80 | 0.50 | 0.70 |
| Gemini 2.5 Pro | 100.0% | 2.34 | $0.030156 | 0.70 | 0.80 | 0.40 | 0.63 |
| GPT-5 | 60.0% | 23.68 | $0.083575 | 0.50 | 0.50 | 0.30 | 0.43 |

This table highlights a number of interesting findings. First, we can see that every model except GPT-5 successfully answered all of the queries. On inspection of the four GPT-5 queries that failed, all encountered the same error:

HTTPSConnectionPool(host='api.openai.com', port=443): Read timed out. (read timeout=60)

This is a timeout error: complex questions push GPT-5 past the 60-second read limit and the request fails, which very quickly highlights that this model is not suitable for the task.
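
If GPT-5 were to stay in contention, the obvious first mitigation would be a longer read timeout and a simple retry, since this is a client-side timeout rather than an API error. Below is a minimal sketch, assuming the call goes through requests (as the HTTPSConnectionPool error suggests); the endpoint, model name and the 180 s / retry figures are assumptions for illustration, not the tester's actual code.

import os
import time
import requests

def ask_gpt5(prompt, read_timeout=180.0, retries=2):
    # Chat completion call with a longer read timeout and a basic retry on timeout.
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
    payload = {"model": "gpt-5", "messages": [{"role": "user", "content": prompt}]}
    for attempt in range(retries + 1):
        try:
            resp = requests.post(
                "https://api.openai.com/v1/chat/completions",
                headers=headers,
                json=payload,
                timeout=(10, read_timeout),  # (connect, read) timeouts in seconds
            )
            resp.raise_for_status()
            return resp.json()["choices"][0]["message"]["content"]
        except requests.exceptions.Timeout:
            if attempt == retries:
                raise
            time.sleep(2 ** attempt)  # brief backoff before retrying

Even with that change, the 20-plus-second average response times would remain a poor experience for users, so this only makes the failures recoverable rather than making the model viable.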

We also see all of the GPT models suffering in the speed department, with individual queries up to 22 seconds slower and, on average, around 8 seconds slower per query. I believe this problem will only get worse as the markdown files grow in size, so it is a point of concern.

| Model | Avg Input Tokens | Avg Output Tokens | Input Range | Output Range | Token Efficiency |
| --- | --- | --- | --- | --- | --- |
| Gemini 2.5 Flash | 873 | 108 | 751-1017 | 17-290 | Very High |
| Deepseek | 841 | 156 | 740-969 | 5-290 | High |
| GPT-5 Nano | 876 | 2061 | 809-973 | 1092-4779 | Very Low |
| Claude Haiku 3.5 | 924 | 109 | 825-1082 | 32-293 | Very High |
| GPT-O3 | 848 | 762 | 747-966 | 277-1461 | Medium |
| Gemini 2.5 Pro | 867 | 113 | 813-971 | 7-290 | Very High |
| Claude Sonnet 4 | 936 | 135 | 825-962 | 30-293 | High |
| Claude Opus 4.1 | 907 | 175 | 825-962 | 103-343 | High |
| GPT-5 | 851 | 1087 | 747-973 | 699-1449 | Low |

The average output token usage is incredibly high for each GPT model, which also contributes to their slower response times.

All three Anthropic models are comparatively expensive for the number of queries used, so cost is a considerable factor when scaling up.
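
To put the scaling concern into numbers, the measured 10-query totals from the results table can be projected forward. The daily query volumes below are assumptions purely for illustration, not usage data.

# Project monthly spend from the total cost of the 10-query test run.
# Daily query volumes are illustrative assumptions only.
MEASURED_COST_PER_10_QUERIES = {
    "Gemini 2.5 Flash": 0.005431,
    "Deepseek": 0.007277,
    "Claude Haiku 3.5": 0.013560,
    "Claude Sonnet 4": 0.048360,
    "Claude Opus 4.1": 0.252825,
}

for queries_per_day in (100, 1000):
    print(f"\nAt {queries_per_day} queries per day:")
    for model, cost_per_10 in MEASURED_COST_PER_10_QUERIES.items():
        monthly = (cost_per_10 / 10) * queries_per_day * 30
        print(f"  {model}: ~${monthly:.2f} per month")

At 1,000 queries a day, for example, Opus 4.1 works out at roughly $760 a month against around $41 for Haiku 3.5 and $16 for Gemini 2.5 Flash, which is the gap the per-query costs imply.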

We can see, however, that the model we currently use, Claude Haiku 3.5, hit the evals most often on average. Its only real low point was paraphrasing, but it was more accurate at providing all of the relevant information and citing it correctly, so this is absolutely a good sign that we are on the right track.

Another thing that initially baffled me is the lack of quality from the higher-grade models. I am unsure of the specifics of why their responses scored poorly, but my educated guess is that their stronger natural-language abilities make them want to add unnecessary information to each response, which in turn lowers the evaluation scores. This is an important point to consider in future: better responses could be produced by these models, but they would need better tuning to work at a more effective rate.
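
Part of that tuning could sit in the prompt rather than the model. The snippet below is only an example of the kind of constraint that might rein in the extra detail; the wording is an assumption, not the prompt the MOD Navigator currently uses.

# Illustrative system prompt only; the wording is an assumption, not the
# MOD Navigator's current prompt.
SYSTEM_PROMPT = (
    "Answer using only the provided document extracts. "
    "Quote the relevant wording rather than paraphrasing it, cite the section "
    "or table number for every claim, and do not add background information "
    "that the user did not ask for."
)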

Conclusion

Overall, I think the best balance for the current system is still the Claude Haiku 3.5 model. It has the best overall quality according to the evaluation metrics we previously set out, with the fourth-fastest response time (and with caching the effective response time may come down further, since higher-quality responses are less likely to need regenerating before the user finds them helpful). The only downside is that it is slightly more costly than the next best option in my opinion, Gemini 2.5 Flash.
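
On the caching point, a response cache keyed on the (normalised) query is enough to avoid paying for repeated questions at all. Here is a minimal sketch, assuming a simple on-disk JSON file; the MOD Navigator's own caching may well differ.

import hashlib
import json
from pathlib import Path

CACHE_FILE = Path("response_cache.json")  # assumed location, for illustration only

def cached_answer(query, ask_model):
    # Return a stored answer if this (normalised) query has been seen before.
    cache = json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}
    key = hashlib.sha256(query.strip().lower().encode()).hexdigest()
    if key not in cache:
        cache[key] = ask_model(query)  # only genuinely new queries hit the model
        CACHE_FILE.write_text(json.dumps(cache, indent=2))
    return cache[key]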

Gemini 2.5 Flash had the fastest times of all the models (a blisteringly fast 1.42 s per query), and its only real failure was missing or incorrect citations, so with some tweaking this model could possibly work even better than the current model in place, at around 40% of the total cost of Haiku 3.5 in this test.

As a budget option, Deepseek also had a number of good qualities. It was among the cheapest models tested (only Gemini 2.5 Flash had a lower total cost) and was tied for the second-best quality according to our evaluation metrics. It did, however, have issues with speed: at around 11 s per response, queries against new, uncached documents would initially feel very slow.

In conclusion, there are three options to consider: Deepseek, for high-quality responses at a low cost but slow speed; Gemini 2.5 Flash, for a good combination of speed and quality, though it needs further tweaking before it can be considered better than the third option; and our current model, Claude Haiku 3.5, which had the best overall quality, the fourth-fastest responses and a reasonable cost.