Looking into other Search Methods


The RAG pipeline failed. The evals I was using did not pick up on the fundamental flaws of the pipeline built on FAISS indexing. When asked general questions about the nature of the document (JTTP 4-05), it could not read or understand large parts of it because the caching was failing, so it never received the context it needed.

Attempts to continue to use the RAG pipeline

First I tried changing the chunk settings for the cache, running many different configurations of chunk size and overlap to try to get useful data into every section of the system (a sketch of this sweep follows the results below). This brought some success: around half of the chunks became usable, with coverage rising from the original 18.2% to 42.4%. That is still not acceptable for our use case, where every detail of the document needs to be queryable. It also removed any benefit the caching system offered, since it required an enormous number of chunks (7,640), and the fundamental problem of not being able to retrieve all parts of the document remained:

  • Only 3.0% of sections could be reproduced verbatim
  • 36.4% of sections completely failed to retrieve
  • Zero high-similarity matches (≥80%)
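
(For reference, the sweep mentioned above was nothing sophisticated; it had roughly the shape below, where build_chunks and measure_section_coverage stand in for the real cache-building and coverage-eval code. The names and signatures are illustrative, not the actual implementation.)

from itertools import product

def sweep_chunk_settings(sections, chunk_sizes, overlaps):
    """Try each (size, overlap) pair and report how much of the document survives chunking."""
    results = []
    for size, overlap in product(chunk_sizes, overlaps):
        chunks = build_chunks(sections, size=size, overlap=overlap)    # hypothetical helper
        coverage = measure_section_coverage(sections, chunks)          # hypothetical helper
        results.append({"size": size, "overlap": overlap,
                        "num_chunks": len(chunks), "coverage": coverage})
    # Best coverage first; the 42.4% configuration above ended up needing 7,640 chunks
    return sorted(results, key=lambda r: r["coverage"], reverse=True)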

This catastrophic failure led to trying a different approach.

New system

The next approach I used was a simpler one: a strict hierarchical approach, prioritising direct section lookup first, then a keyword search, then finally a full-text search:

def _build_context(self, question: str, max_tokens: int) -> LLMContext:
    """Build structured context for the query - OLD HIERARCHICAL APPROACH"""
    
    # 1. Try direct section lookup first (HIGHEST PRIORITY)
    section_number = self._detect_section_request(question)
    if section_number and section_number in self.section_index:
        self._log(f"🎯 Direct section match: {section_number}")
        direct_match = create_direct_match(section_number)  # build the match from the indexed section
        # IMMEDIATE RETURN - other methods never tried
        return LLMContext(
            query=question,
            relevant_sections=[direct_match],
            search_strategy="direct_section"
        )
    
    # 2. Only if direct search fails, try keyword search
    matches = self._find_keyword_matches(question)
    if matches:
        self._log(f"🔍 Found {len(matches)} keyword matches")
        # IMMEDIATE RETURN - full-text search never tried
        return LLMContext(
            query=question,
            relevant_sections=matches[:3],
            search_strategy="keyword_search"
        )
    
    # 3. Only as last resort, try full-text search
    text_matches = self._full_text_search(question)
    if text_matches:
        return LLMContext(
            query=question,
            relevant_sections=text_matches[:3],
            search_strategy="full_text_search"
        )
    
    # 4. Give up if nothing found
    return LLMContext(search_strategy="no_matches")

This works to a good standard, as all sections of the document are present and accounted for, and it will answer the majority of questions with very little error. However, it has a number of issues.

  1. Information loss - If a question concerns a clause and its content, the system may snap to the direct section lookup and return immediately, losing valuable insight that may also exist elsewhere in the document.

  2. No Cross-Validation - There is no real way for us to definitively evaluate answers in the context of answering for a user. The feedback system was built around the old methods, so for now there is no real feedback on answers at all (it will be reintroduced later), which makes this an important capability to restore.

  3. Binary success/failure - Related to the previous issue: with no way to validate whether an answer is good, a nuanced query will either stick with one method (i.e. direct search) even when it is a poor fit, or fall through to the next method and lose the previous method's benefits.

Further Improvements

After reading Simon Willison's blog, I decided to use a Reciprocal Rank Fusion (RRF) approach, a hybrid that combines all three search methods:

def _build_context_with_rrf(self, question: str, max_tokens: int) -> LLMContext:
    """Build context using RRF fusion - NEW PARALLEL APPROACH"""
    
    # 1. Run ALL search methods in parallel
    search_results = {}
    
    # Direct section search (always runs)
    section_number = self._detect_section_request(question)
    if section_number and section_number in self.section_index:
        search_results['direct'] = [create_direct_match(section_number)]
    else:
        search_results['direct'] = []
    
    # Keyword search (always runs)
    search_results['keyword'] = self._find_keyword_matches(question)[:10]
    
    # Full-text search (always runs)  
    search_results['fulltext'] = self._full_text_search(question)[:10]
    
    # 2. Apply RRF to intelligently combine all results
    fused_results = self._reciprocal_rank_fusion(search_results)
    
    # 3. Return best combined results
    return LLMContext(
        query=question,
        relevant_sections=fused_results[:3],
        search_strategy="rrf_fusion"
    )

def _reciprocal_rank_fusion(self, search_results: Dict[str, List[QueryMatch]], k: int = 60):
    """
    Apply Reciprocal Rank Fusion to combine multiple search result lists
    
    RRF Formula: score(d) = Σ(1 / (k + rank_i(d))) for all systems i
    """
    segment_scores = defaultdict(float)
    segment_objects = {}  # best match object seen for each segment
    
    # Strategic weighting based on method strengths
    method_weights = {
        'direct': 3.0,      # Highest - exact section matches are usually perfect
        'keyword': 1.5,     # Medium - good conceptual matches
        'fulltext': 1.0     # Lower - catches edge cases
    }
    
    for method, results in search_results.items():
        weight = method_weights.get(method, 1.0)
        
        for rank, match in enumerate(results, 1):
            segment_id = match.segment.id
            
            # RRF score calculation with method weighting
            rrf_score = weight * (1.0 / (k + rank))
            segment_scores[segment_id] += rrf_score
            
            # Store best match object for each segment
            if segment_id not in segment_objects:
                segment_objects[segment_id] = match
    
    # Return segments sorted by combined RRF score
    return sorted(
        segment_objects.values(),
        key=lambda match: segment_scores[match.segment.id],
        reverse=True
    )

The new system runs all three searches at once and fuses their rankings, weighting each method and boosting sections that appear in more than one result list. This prevents information loss, because every section is still considered even when one method finds a direct match. This cross-validation is a key feature I want in the MOD navigator, as it feeds greater nuance into the LLM and produces reliable, consistent results.
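
As a quick illustration of the fusion behaviour (toy ranks and made-up section labels, not real output), a section that two methods agree on can outrank a section that only one method ranks first:

k = 60
weights = {'keyword': 1.5, 'fulltext': 1.0}

# Toy example: section A is rank 2 in the keyword results and rank 1 in full-text;
# section B is rank 1 in the keyword results only.
score_a = weights['keyword'] * (1 / (k + 2)) + weights['fulltext'] * (1 / (k + 1))  # ≈ 0.0406
score_b = weights['keyword'] * (1 / (k + 1))                                        # ≈ 0.0246

# Section A wins despite never topping a single list, because two methods agree on it.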

Evals and Testing

As I still think the previous evaluation metrics were appropriate for our use case, the plan is instead to change the nature of the testing: we need to test each section of the document more comprehensively to ensure it is being used correctly.
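
In practice this just means a small harness that walks a per-section query list, records which search strategy fired, and scores each response. A minimal sketch, assuming a navigator.answer entry point and placeholder metric helpers (none of these names are the real implementation):

def run_section_evals(navigator, queries):
    """Run every section-level query and collect strategy and quality metrics (sketch only)."""
    report = []
    for question in queries:
        response = navigator.answer(question)                        # hypothetical entry point
        report.append({
            "question": question,
            "strategy": response.search_strategy,                    # e.g. "rrf_fusion"
            "sources": len(response.relevant_sections),
            "citation_coverage": citation_coverage(response),        # placeholder metric
            "verbatim_quoting": verbatim_quoting(response),          # placeholder metric
            "section_references": section_reference_rate(response),  # placeholder metric
        })
    return report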

Queries used:

  1. What are the roles and responsibilities of PJHQ and CJO in governing the operational estate according to Section 1?
  2. How does the concept of a Military Works Area (MWA) balance operational imperatives against health and safety standards, and what risks does this create?
  3. In multinational NATO operations, how does the principle of Collective Responsibility affect UK infrastructure decisions and resource allocation?
  4. What are the implications when commanders must choose between compliance with UK Health and Safety standards and operational effectiveness?
  5. What are the key responsibilities of theatre Infrastructure Staff and how do they differ from PJHQ Infrastructure Staff according to Section 2?
  6. How does the separation of duties principle affect the relationship between requirement definition and infrastructure delivery, and when might this separation be inappropriate?
  7. What challenges arise from the need for Infrastructure Staff continuity when personnel rotate every six months, and how is this addressed?
  8. How do air infrastructure requirements differ from land-based infrastructure, particularly regarding command relationships and specialist staff embedding?
  9. What are the three main planning documents mentioned in Section 3 and how do they relate to each other?
  10. How does the difficulty in predicting operation duration create ‘decision making paralysis’ in infrastructure investment, and what are the consequences?
  11. What is the relationship between operational infrastructure as a ‘line of development’ and other military capabilities, particularly in theatre capability integration?
  12. How should the transition from equipment infrastructure to constructed infrastructure be managed during the progression from early entry to enduring operations?
  13. What are the differences between strategic reconnaissance and operational reconnaissance according to Section 4?
  14. What critical information gaps exist before reconnaissance, and how do these affect the quality of infrastructure planning decisions?
  15. How do host nation capabilities and coalition arrangements influence UK infrastructure requirements during reconnaissance assessment?
  16. What role do RE technical specialists play in operational reconnaissance, and how does their input affect force structure decisions?
  17. What factors should be considered when making planning estimates for construction forces according to Section 5?
  18. How does the lack of current experience affect the accuracy of infrastructure estimates, and what are the consequences of poor estimates?
  19. What is the relationship between construction force estimates and overall personnel planning, particularly regarding J3/J5 appreciation?
  20. How should planners balance optimistic versus pessimistic estimates when dealing with uncertain operational timelines and resource availability?
  21. What are the key principles for writing infrastructure requirements according to Section 6, and why is ‘freezing’ the requirement important?
  22. How does the tension between ‘good enough’ versus ‘optimal’ solutions affect requirement definition and project success measurement?
  23. What risks arise when Subject Matter Experts change during project execution, and how can UK-based SME endorsement mitigate these risks?
  24. How should requirement changes be managed once construction has begun, and what are the trade-offs between immediate changes versus post-completion adjustments?
  25. What are the main sections of the Infrastructure Request Proforma outlined in Annex 2A and what approval stages does it require?
  26. How does the proforma ensure financial probity through its approval workflow, particularly regarding the separation of requirement and commercial functions?
  27. What role does compliance checking play in the proforma process, and how do different authorities (Fire, ATO, EHT) contribute to project approval?
  28. How does the proforma system balance operational urgency with proper governance, particularly in the initial approval and peer review stages?
  29. What is the purpose of the Theatre Infrastructure Development Directive (TIDD) according to Annex 2C and how does it relate to Infrastructure Development Plans?
  30. How do TIDDs balance strategic campaign considerations with detailed tactical requirements, and what challenges arise in this integration?
  31. What is the iterative relationship between TIDDs, Infrastructure Development Plans, and the Sustainability Statement in operational planning?
  32. How do Infrastructure Development Plans provide continuity for theatre staff while remaining flexible enough for rapid operational changes?
  33. What accommodation standards and occupancy rates are specified in Annex 2D for different categories of personnel?
  34. How should military judgement be applied when interpreting JSP 315 Scale 5 provisions versus NATO QSTAG 1176 guidance?
  35. What considerations affect occupancy efficiency, particularly regarding senior officer accommodations and gender segregation requirements?
  36. How do the various NATO standards (STANAG 2136, 2280, 2882, etc.) create potential conflicts or gaps in infrastructure provision requirements?

Results

Summary
  • Total Queries: 36
  • Successful Queries: 36
  • RRF Activation Rate: 100% (36/36)
  • Success Rate: 100%
  • Average Confidence: 0.8
  • Average Sources per Response: 4.2 sections referenced
Content Quality Assessment
  • Citations Coverage: 0.0% average
  • Verbatim Quoting: 0.0% average
  • Section References: 16.7% average
  • Overall Quality Score: 7.3% average

These results seem concerning. Let us have a look at the specific query responses:

Query 1 Response:

Based on the context provided, here are the roles and responsibilities of PJHQ and CJO in governing the operational estate:

1. Overall Responsibility
- According to Section 201, the Chief of the Joint Staff (CJO) is ultimately responsible for the operational estate.

2. Control and Management
- PJHQ exercises overall control of the operational estate through:
  - J1/J4 branch for operational management
  - J8 branch for funding arrangements (Section 201)

3. Strategic Planning and Directive
- PJHQ produces and updates the Theatre Infrastructure Development Directive (TIDD) (Insight 2-2)
- The TIDD:
  - Establishes planning horizons
  - Sets infrastructure provision standards
  - Addresses strategic campaign themes
  - Estimates overall financial development costs
  - Sets priorities, standards, constraints, and planning goals
  - Is agreed at 1-star level within PJHQ and by Command Sec (ACOS J8)
  - Becomes part of the Theatre Reference Document (TRD)

4. Decision-Making Support
- PJHQ plays a crucial role in the decision-making space for infrastructure development (Section 220)
- Decisions are typically made through investment appraisals and business cases
- Effective decision-making depends on staff capacity and capability

5. Control Model Flexibility
- PJHQ determines the appropriate number of Infrastructure Staff and their headquarters subordination
- Ensures no competition over real estate control and resource access
- Considers close collaboration with in-theatre financiers (Section 207)

6. Continuity Management
- PJHQ helps maintain continuity in infrastructure planning, especially given frequent personnel rotations (Section 223)

These responsibilities ensure a comprehensive, strategic approach to managing the operational estate across different operational contexts.

This answer seems quite good, both at an initial look and on consulting the source document. That tells me the evals are a bit crude, but equally the responses are somewhat too dense, so I need to shorten them by reducing the number of sources used.
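
The simplest lever for that is probably just capping how many fused sections are handed to the LLM; a minimal sketch of the idea (the helper is hypothetical, and in practice it could simply be a tighter slice inside _build_context_with_rrf):

def trim_context(context: LLMContext, max_sections: int = 2) -> LLMContext:
    """Keep only the top-ranked fused sections so answers draw on fewer sources."""
    context.relevant_sections = context.relevant_sections[:max_sections]
    return context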

Query 6 Response:

Based on the context provided, here's a comprehensive answer:

The separation of duties principle is a fundamental governance approach for infrastructure management, with specific implications for requirement definition and delivery:

1. Principle of Separation
According to Section 205 and Section 235, there is a standard practice to separate requirement definition from infrastructure delivery for non-battlefield infrastructure. This separation is optimized for:
- Efficiency
- Financial prudence
- Clear organizational roles

2. Key Characteristics
As outlined in Section 210(b), the separation involves:
- Distinguishing between long-term planning and daily task execution
- Separating those stating and funding requirements from those with commercial powers and payment clearance

3. When Separation Might Be Inappropriate
Section 205 specifically notes two scenarios where strict separation may not be suitable:
- Early stages of an operation
- Operations requiring minimal construction

In these cases, organizations must retain flexibility to select the most appropriate model, potentially combining requirement and delivery functions.

4. Practical Implementation
The Infrastructure Staff (Section 211):
- Identifies and prioritizes requirements
- Does not command delivery assets
- Facilitates delivery without directly managing infrastructure tasks

5. Guiding Principle
The overarching goal is to ensure clear responsibilities while maintaining operational adaptability, with PJHQ retaining the ability to adjust the organizational approach as needed.

Overall I feel these two responses do meet the requirements set for them: they cite the necessary sources, paraphrase where needed for clarity and brevity, and quote verbatim when required. The system may overuse sources to build a more robust answer, but that is not necessarily a ‘bad’ thing here, as we want the Agent to be 100% accurate in its answers, and to be seen as such by the user.

Overall I have seen a dramatic improvement: the system not only produces responses covering all areas of the document, it also answers effectively and robustly, with no hallucinations observed.