Can we change the Model to save cost?


I want to check whether the AI API we currently use is actually the right choice. There are a number of models on the market, each with differing benefits and drawbacks, so I have decided to look at a range of models and their lighter-weight counterparts, both to better understand the options available and to improve the MOD Navigator, or at the very least confirm that the model we have in place is the right one.

When initially creating the MOD Navigator, I chose the Anthropic API, specifically Claude Opus 4.1, as it is Anthropic's strongest model and I assumed it would produce the best responses. After conversations with Lucien, we moved to another Anthropic model, Claude Haiku 3.5, to keep within a smaller budget, and I saw no real change in the quality of the responses to the queries we tried. To make sure that this is not just luck, however, I want to test it in a more rigorous setting.

Model Research

Below is the set of models I will be looking at, comparing cost, speed and overall ability to meet the evaluation metrics. The table lists their respective usage prices.

| Model Name | Price (USD per million tokens) | Notes |
| --- | --- | --- |
| GPT-O3 | 0.50 Input, 2.00 Output | Old gold standard |
| GPT-5 | 1.25 Input, 10.00 Output | Standard OpenAI API |
| GPT-5 Nano | 0.05 Input, 0.40 Output | Low-level OpenAI API |
| Claude Haiku 3.5 | 0.80 Input, 4.00 Output | Low-level Anthropic API |
| Claude Sonnet 4 | 3.00 Input, 15.00 Output | Medium-level Anthropic API |
| Claude Opus 4.1 | 15.00 Input, 75.00 Output | High-level Anthropic API |
| Gemini 2.5 Pro | 1.25-2.50 Input, 10.00-15.00 Output | Google's API |
| Gemini 2.5 Flash | 0.30 Input, 2.50 Output | Lightweight Google API |
| Deepseek | 0.56 Input, 1.68 Output | Chinese API |

From this initial assessment we can see that there are a number of cheaper models that could replace our current model (Claude Haiku 3.5), provided they hold up to the same standard or better.
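
As a quick sanity check on these prices, the cost of a single query can be estimated from typical token counts. The sketch below is illustrative only: the ~900 input / ~150 output token figures are assumptions (they happen to be roughly in line with the usage measured later), not part of the test harness.

# Rough per-query cost estimate from the published per-million-token prices.
# The token counts (900 in, 150 out) are illustrative assumptions only.
PRICES_USD_PER_MILLION = {
    "Claude Haiku 3.5": (0.80, 4.00),
    "Gemini 2.5 Flash": (0.30, 2.50),
    "Deepseek": (0.56, 1.68),
    "Claude Opus 4.1": (15.00, 75.00),
}

def estimated_cost_per_query(tokens_in, tokens_out, price_in, price_out):
    return (tokens_in * price_in + tokens_out * price_out) / 1_000_000

for model, (price_in, price_out) in PRICES_USD_PER_MILLION.items():
    print(f"{model}: ~${estimated_cost_per_query(900, 150, price_in, price_out):.6f} per query")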

Testing Code

All code can be found on the 'testing_models' branch of the DocReader repository.


"""
Comprehensive Model Test - Generate Full Markdown Report

This script runs all available models with the complete set of test queries
and generates a detailed markdown report with all responses.
"""

import os
import json
from datetime import datetime
from model_comparison import ModelTester, MODEL_CONFIGS, generate_markdown_report

def main():
    print("๐Ÿš€ Starting Comprehensive Model Comparison")
    print("=" * 60)
    
    # Load test queries
    try:
        with open('test_queries.json', 'r') as f:
            test_queries = json.load(f)
        print(f"๐Ÿ“ Loaded {len(test_queries)} test queries")
    except FileNotFoundError:
        # Fallback to default queries
        test_queries = [
            "What is the recommended occupancy for junior ranks according to Table 2D.1?",
            "What are Key User Requirements (KURs) and how do they differ from other priorities?",
            "What should the Theatre Infrastructure Development Directive (TIDD) address regarding infrastructure support?",
            "What is an Infrastructure Development Plan and what key elements should it include?",
            "What approvals might be required before occupation of a military facility?",
            "What is the purpose and content of a Statement of Need (SON) in infrastructure planning?",
            "How should surge capacity be considered when sizing military bases?",
            "What guidance should a TIDD provide about the planning horizon for infrastructure investments?",
            "What is a Variation Order (VO) and when is it typically used in military infrastructure projects?",
            "What factors should be considered when planning for base population and contractor presence?"
        ]
        print(f"๐Ÿ“ Using {len(test_queries)} default test queries")
    
    # Check available models
    available_models = []
    for config in MODEL_CONFIGS:
        if os.getenv(config.api_key_env):
            available_models.append(config)
    
    print(f"๐Ÿ”‘ Found API keys for {len(available_models)} models:")
    for config in available_models:
        print(f"   โœ… {config.name}")
    
    print(f"\n๐Ÿงช Starting comprehensive test...")
    print(f"   Models: {len(available_models)}")
    print(f"   Queries: {len(test_queries)}")
    print(f"   Total API calls: {len(available_models) * len(test_queries)}")
    print(f"   Estimated time: {len(available_models) * len(test_queries) * 3:.0f}s")
    
    # Initialize tester
    try:
        tester = ModelTester()
        print("โœ… RAG system loaded successfully")
    except Exception as e:
        print(f"โŒ Failed to load RAG system: {e}")
        return
    
    # Run comprehensive tests
    all_results = []
    total_tests = len(available_models)
    
    for i, config in enumerate(available_models, 1):
        print(f"\n[{i}/{total_tests}] Testing {config.name}...")
        
        try:
            result = tester.test_model(config, test_queries)
            all_results.append(result)
            
            if result.error_count == 0:
                print(f"   โœ… Success: {result.average_response_time:.2f}s avg, ${result.total_cost:.6f} total")
            else:
                print(f"   โš ๏ธ  {result.error_count} errors out of {len(test_queries)} queries")
                print(f"   ๐Ÿ“Š Success rate: {result.success_rate:.1%}")
                
        except Exception as e:
            print(f"   โŒ Critical error: {e}")
            # Create error result for reporting
            from model_comparison import ModelResults, QueryResult
            error_results = [
                QueryResult(q, "", 0, 0, 0, 0, str(e)) for q in test_queries
            ]
            all_results.append(ModelResults(
                config.name, error_results, 0, 0, 0, 0, len(test_queries), config.notes
            ))
    
    # Generate comprehensive report
    print(f"\n๐Ÿ“Š Generating comprehensive markdown report...")
    
    report = generate_markdown_report(all_results, test_queries)
    
    # Save report with timestamp
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"model_comparison_comprehensive_{timestamp}.md"
    
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(report)
    
    print(f"โœ… Report saved: {filename}")
    
    # Print summary
    print(f"\n๐Ÿ“ˆ FINAL SUMMARY")
    print("=" * 60)
    successful_models = [r for r in all_results if r.success_rate > 0]
    
    if successful_models:
        # Sort by success rate, then by cost
        successful_models.sort(key=lambda x: (x.success_rate, -x.total_cost), reverse=True)
        
        print(f"๐Ÿ† Best performing models:")
        for i, result in enumerate(successful_models[:3], 1):
            print(f"   {i}. {result.model_name}: {result.success_rate:.0%} success, {result.average_response_time:.2f}s avg, ${result.total_cost:.6f}")
        
        # Cost analysis
        costs = [r.total_cost for r in successful_models if r.total_cost > 0]
        if costs:
            cheapest = min(successful_models, key=lambda x: x.total_cost)
            fastest = min(successful_models, key=lambda x: x.average_response_time)
            
            print(f"\n๐Ÿ’ฐ Most cost-effective: {cheapest.model_name} (${cheapest.total_cost:.6f})")
            print(f"โšก Fastest: {fastest.model_name} ({fastest.average_response_time:.2f}s avg)")
    
    print(f"\n๐Ÿ“„ Full report with all responses: {filename}")
    print("=" * 60)

if __name__ == "__main__":
    main()

This code not only produces responses to the 10 queries, it also records the speed of each response (and the average), the cost per query and the number of tokens used, and scores each model against the evals we have set out for the MOD Navigator.
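
The timing, token and cost bookkeeping lives inside ModelTester in model_comparison.py. As a rough illustration of the idea only, a per-query record could be built like this; `ask` is an assumed stand-in for whatever wraps the RAG pipeline and returns an answer plus token counts, not the actual implementation.

import time

def timed_query(ask, query, price_in_per_million, price_out_per_million):
    # Time one query and derive its cost from the token usage it reports.
    start = time.perf_counter()
    answer, tokens_in, tokens_out = ask(query)  # assumed signature: (answer, input tokens, output tokens)
    elapsed = time.perf_counter() - start
    cost = (tokens_in * price_in_per_million + tokens_out * price_out_per_million) / 1_000_000
    return {
        "query": query,
        "answer": answer,
        "seconds": round(elapsed, 2),
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "cost_usd": round(cost, 6),
    }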

Results

The full report can be found here. For brevity we will only look at a few key examples and the overall results of the test (the code is in the previous section, and also under comprehensive_test.py in the repository).

Table results:

| Model | Success Rate | Avg Time (s) | Total Cost | Cites All | Not Para | Cites Sections | Overall Quality |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Claude Haiku 3.5 | 100.0% | 3.15 | $0.013560 | 0.90 | 0.70 | 0.90 | 0.83 |
| Deepseek | 100.0% | 11.04 | $0.007277 | 0.90 | 0.90 | 0.60 | 0.80 |
| GPT-O3 | 100.0% | 11.54 | $0.021297 | 0.90 | 0.90 | 0.60 | 0.80 |
| Claude Sonnet 4 | 100.0% | 2.73 | $0.048360 | 0.90 | 0.90 | 0.60 | 0.80 |
| Gemini 2.5 Flash | 100.0% | 1.42 | $0.005431 | 0.90 | 0.80 | 0.60 | 0.77 |
| Claude Opus 4.1 | 100.0% | 4.31 | $0.252825 | 0.90 | 0.60 | 0.80 | 0.77 |
| GPT-5 Nano | 100.0% | 12.70 | $0.009882 | 0.80 | 0.80 | 0.50 | 0.70 |
| Gemini 2.5 Pro | 100.0% | 2.34 | $0.030156 | 0.70 | 0.80 | 0.40 | 0.63 |
| GPT-5 | 60.0% | 23.68 | $0.083575 | 0.50 | 0.50 | 0.30 | 0.43 |

This table highlights a number of interesting findings. First, we can see that every model except GPT-5 successfully answered all of the queries. On inspection of the four GPT-5 queries that failed, all encountered the same error:

HTTPSConnectionPool(host='api.openai.com', port=443): Read timed out. (read timeout=60)

This is a timeout error: complex questions push GPT-5 past the 60-second read limit and the request fails, which very quickly highlights that this model is not suitable for the task.
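
If GPT-5 were to stay in contention, the obvious first mitigation would be a longer read timeout and a simple retry, since this is a client-side timeout rather than an API error. Below is a minimal sketch, assuming the call goes through requests (as the HTTPSConnectionPool error suggests); the endpoint, model name and the 180 s / retry figures are assumptions for illustration, not the tester's actual code.

import os
import time
import requests

def ask_gpt5(prompt, read_timeout=180.0, retries=2):
    # Chat completion call with a longer read timeout and a basic retry on timeout.
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
    payload = {"model": "gpt-5", "messages": [{"role": "user", "content": prompt}]}
    for attempt in range(retries + 1):
        try:
            resp = requests.post(
                "https://api.openai.com/v1/chat/completions",
                headers=headers,
                json=payload,
                timeout=(10, read_timeout),  # (connect, read) timeouts in seconds
            )
            resp.raise_for_status()
            return resp.json()["choices"][0]["message"]["content"]
        except requests.exceptions.Timeout:
            if attempt == retries:
                raise
            time.sleep(2 ** attempt)  # brief backoff before retrying

Even with that change, the 20-plus-second average response times would remain a poor experience for users, so this only makes the failures recoverable rather than making the model viable.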

We also see all of the GPT models suffering in the speed department, with individual queries up to 22 seconds slower and, on average, around 8 seconds slower per query. I believe this problem will only get worse as the markdown files grow in size, so it is a point of concern.

| Model | Avg Input Tokens | Avg Output Tokens | Input Range | Output Range | Token Efficiency |
| --- | --- | --- | --- | --- | --- |
| Gemini 2.5 Flash | 873 | 108 | 751-1017 | 17-290 | Very High |
| Deepseek | 841 | 156 | 740-969 | 5-290 | High |
| GPT-5 Nano | 876 | 2061 | 809-973 | 1092-4779 | Very Low |
| Claude Haiku 3.5 | 924 | 109 | 825-1082 | 32-293 | Very High |
| GPT-O3 | 848 | 762 | 747-966 | 277-1461 | Medium |
| Gemini 2.5 Pro | 867 | 113 | 813-971 | 7-290 | Very High |
| Claude Sonnet 4 | 936 | 135 | 825-962 | 30-293 | High |
| Claude Opus 4.1 | 907 | 175 | 825-962 | 103-343 | High |
| GPT-5 | 851 | 1087 | 747-973 | 699-1449 | Low |

The average output token usage is incredibly high for each GPT model, which also contributes to their slower response times.

All three Anthropic models are comparatively expensive for the number of queries used, so cost is a considerable factor when scaling up.
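
To put the scaling concern into numbers, the measured 10-query totals from the results table can be projected forward. The daily query volumes below are assumptions purely for illustration, not usage data.

# Project monthly spend from the total cost of the 10-query test run.
# Daily query volumes are illustrative assumptions only.
MEASURED_COST_PER_10_QUERIES = {
    "Gemini 2.5 Flash": 0.005431,
    "Deepseek": 0.007277,
    "Claude Haiku 3.5": 0.013560,
    "Claude Sonnet 4": 0.048360,
    "Claude Opus 4.1": 0.252825,
}

for queries_per_day in (100, 1000):
    print(f"\nAt {queries_per_day} queries per day:")
    for model, cost_per_10 in MEASURED_COST_PER_10_QUERIES.items():
        monthly = (cost_per_10 / 10) * queries_per_day * 30
        print(f"  {model}: ~${monthly:.2f} per month")

At 1,000 queries a day, for example, Opus 4.1 works out at roughly $760 a month against around $41 for Haiku 3.5 and $16 for Gemini 2.5 Flash, which is the gap the per-query costs imply.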

We can see, however, that the model we currently use, Claude Haiku 3.5, hit the evals most often on average. Its only real low point was paraphrasing, but it was more accurate at providing all of the relevant information and citing it correctly, so this is absolutely a good sign that we are on the right track.

Another thing that initially baffled me is the lack of quality from the higher-grade models. I am unsure of the specifics of why their responses scored poorly, but my educated guess is that their stronger natural-language abilities make them want to add unnecessary information to each response, which in turn lowers the evaluation scores. This is an important point to consider in future: better responses could be produced by these models, but they would need better tuning to work at a more effective rate.
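
Part of that tuning could sit in the prompt rather than the model. The snippet below is only an example of the kind of constraint that might rein in the extra detail; the wording is an assumption, not the prompt the MOD Navigator currently uses.

# Illustrative system prompt only; the wording is an assumption, not the
# MOD Navigator's current prompt.
SYSTEM_PROMPT = (
    "Answer using only the provided document extracts. "
    "Quote the relevant wording rather than paraphrasing it, cite the section "
    "or table number for every claim, and do not add background information "
    "that the user did not ask for."
)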

Conclusion

Overall, I think the best balance for the current system is still the Claude Haiku 3.5 model. It has the best overall quality according to the evaluation metrics we previously set out, with the fourth-fastest response time (and with caching the effective response time may come down further, since higher-quality responses are less likely to need regenerating before the user finds them helpful). The only downside is that it is slightly more costly than the next best option in my opinion, Gemini 2.5 Flash.
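
On the caching point, a response cache keyed on the (normalised) query is enough to avoid paying for repeated questions at all. Here is a minimal sketch, assuming a simple on-disk JSON file; the MOD Navigator's own caching may well differ.

import hashlib
import json
from pathlib import Path

CACHE_FILE = Path("response_cache.json")  # assumed location, for illustration only

def cached_answer(query, ask_model):
    # Return a stored answer if this (normalised) query has been seen before.
    cache = json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}
    key = hashlib.sha256(query.strip().lower().encode()).hexdigest()
    if key not in cache:
        cache[key] = ask_model(query)  # only genuinely new queries hit the model
        CACHE_FILE.write_text(json.dumps(cache, indent=2))
    return cache[key]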

Gemini 2.5 Flash had the fastest times of all the models (a blisteringly fast 1.42 s per query), and its only real failure was missing or incorrect citations, so with some tweaking this model could possibly work even better than the current model in place, at around 40% of the total cost of Haiku 3.5 in this test.

As a budget option, Deepseek also had a number of good qualities. It was among the cheapest models tested (only Gemini 2.5 Flash had a lower total cost) and was tied for the second-best quality according to our evaluation metrics. It did, however, have issues with speed: at around 11 s per response, queries against new, uncached documents would initially feel very slow.

In conclusion, there are three options to consider: Deepseek, for high-quality responses at a low cost but slow speed; Gemini 2.5 Flash, for a good combination of speed and quality, though it needs further tweaking before it can be considered better than the third option; and our current model, Claude Haiku 3.5, which had the best overall quality, the fourth-fastest responses and a reasonable cost.