Aug 28, 2025

Continuous Calibration/Continuous Development framework - Where Do Evals Play a Part?

I have been looking into the Continuous Calibration/ Continuous Development (CC/CD)framework these past couple of days in regards to my most current project, a document reader that reads bureaucratic/ government documents, and using a RAG pipeline provides answers to queries you have in the document by quoting directly from the text. It uses a FAISS index for the data that you load into the document (It is persistent on disk, and gets loaded into memory when the code is currently ran). It then calls Claude API to find and write the relevant section(s) of the document, and in future will show and use direct links to other government documents (for now we are only testing on a random chapter 2 from JTTP 4-05). The link to the repo is here.

Evals

As part of the CC/CD framework, it has 7 sections (6 key parts, 1 transition part):

Scope Capability and Curate Data
Set up Application
Design Evals Transition - Deploy here
Run Evals
Analyse behaviour and spot error patterns
Apply fixes Deploy and back to step 1.

This blog post will be about the steps 3 and 4 - The evaluation metrics (evals).

First, we need to consider if my current evals are currently being used in the way that the CC/CD framework wants them to be used.

From the wonderful blog (Lenny’s Newsletter) about this framework, this is how they describe and differentiate these evals to normal tests:

Evals are scoring mechanisms that help you assess whether your AI system is working, where it’s falling short, and what needs improvement. Think of them as the equivalent of tests in traditional software. But unlike typical unit or integration tests, where inputs and outputs are fixed and correctness is binary, AI evals deal with ambiguity.

So our evals need to be utilising the application that has been set up in a non deterministic fashion, So we need to consider a number of factors.

How do big productions (probably) use Evals?

Looking at [Google], they have a different method to evaluate metrics. Instead of using a general set of rating rubics (like most LLM-as-a-judge systems), they use an adaptive rubric, which generates a unique set of pass/fail rubics for each prompt in a dataset, so each evaluation is relevant.

So Google essentially uses a 2 step system: Rubric generation, then Rubric validation. This is however for a more general prompting system (ie open ended), so this method will not be useful for us, as we can define a standard rubric as every prompt should achieve the same goal.

Claude creates Evals in a similar but different way structurally. They consider task specific evals, but also highlight the important of edge cases (giving some examples):

Irrelevant or nonexistent input data

Overly long input data or user input

[Chat use cases] Poor, harmful, or irrelevant user input

Ambiguous test cases where even humans would find it hard to reach an assessment consensus

All of these most likely need to be considered when writing test evals, so to have an extra eval for ‘is the data not possible to provide?’ could be a very useful thing to implement in my code.

They also suggest to automate when possible, and to prioritise volume over quality here, so to use LLMs to judge, and have many judged to also help.

They also have a perfect example for us to look at (will be covered later on its use) for the consistency of a FAQ (Frequently Asked Questions) bot, using a few techniques to find similarity.

General LLMs has overall studies on the Evaluation process, and describes in detail the types of processes and systems that we can base evaluation metrics on.

Interaction Mode

This distinguishes between two primary evaluation approaches:

Static & Offline Evaluation: Uses pre-generated datasets and static test cases. While cheaper and simpler to run, these evaluations lack nuance and are more prone to error propagation, making them less accurate representations of real system performance.

Dynamic & Online Evaluation: Leverages simulations or real user interactions after deployment. This approach captures domain-specific context and identifies issues not discovered in static testing. Examples include web simulators (MiniWoB, WebShop, WebArena) that can verify correct agent behavior sequences. This section also introduces Evaluation-driven Development (EDD), which proposes continuous evaluation throughout the agent development cycle, such as the AgentOps component to monitor production performance.

Evaluation Data

Structured benchmarks: AAAR-1.0, ScienceAgentBench, TaskBench for research reasoning and scientific workflows Tool-use focused: FlowBench, ToolBench, API-Bank for function-calling across large API repositories Interactive environments: AssistantBench, AppWorld, WebArena for dynamic decision-making and long-horizon planning

Safety and robustness: AgentHarm and AgentDojo for harmful behavior and prompt injection resistance Standardized leaderboards: Berkeley Function-Calling Leaderboard (BFCL) and Holistic Agent Leaderboard providing automated metrics and ranking systems

Metrics Computation Methods

Three primary approaches for computing evaluation metrics: Code-based methods: Most deterministic and objective, using explicit rules and test cases. Highly reliable for structured tasks but inflexible for open-ended responses. LLM-as-a-Judge: Uses LLMs to evaluate responses based on qualitative criteria. Highly scalable and adaptable to complex tasks. Extensions include Agent-as-a-Judge using multiple AI agents for refined assessments. Human-in-the-loop: Remains the gold standard for subjective aspects and safety-critical judgments through user studies, expert audits, or crowdworker annotations. Most reliable but expensive and difficult to scale.

Evaluation Tooling

This section highlights the emergence of software frameworks supporting automated, scalable evaluation workflows:

Open-source frameworks: OpenAI Evals, DeepEval, InspectAI, Phoenix Commercial platforms: GALILEO providing analytics and debugging capabilities Development platforms: Azure AI Foundry, Google Vertex AI, LangGraph, Amazon Bedrock incorporating evaluation features AgentOps architecture: Proposed for continuous monitoring of deployed agents with real-time feedback

Evaluation Contexts

Describes the environments where evaluations are performed, balancing realism versus control:

Controlled environments: Simpler and more secure but less representative Realistic environments: More costly but better reflect final performance Progressive contexts: Evolution from mocked APIs to live deployment as agent performance improves Web simulators: MiniWoB, WebShop, WebArena for less contained systems

This further emphasizes that evaluation context should align with the system’s intended use, with a natural progression from controlled testing environments to real-world deployment as agents mature and demonstrate reliability.

What do I have currently?

What I currently have in the evals.py could be considered to be a Reference Dataset. This is a reference for what expected behaviour looks like and how the AI should respond in my system. So I should treat it in that way, similar to few shot context systems and how examples are used in Chain of Thought prompting. As I was making this previously, I tried to consider this to be the actual designed evals, so this is an incorrect style of thinking in this framework.

I will now style these in accordance to using it as an encoded part of the data provided to the Claude API, to allow for better answering and being part of a feedback loop with the logs of queries to build an overall bank of good examples.

In regards to this dataset, I believe a good response should have these things:

Source/Paragraph reference
Nothing summarised (which can lose data)
All relevant sections written (from most to possibly important).

For now I am unsure what to do with tables.

What Factors should I consider in making a Eval?

I need to consider a number of factors while creating the evaluation metrics. There is also another piece on writing evals on Lenny’s Newsletter. We need to consider:

Hallucination: Does it make up data not present in the Chapter 2 markdown file?
Correctness: Does it actually get the data the user needs from their query?
Does it retrieve all relevant data throughout the Chapter 2 file?

So to evaluate our RAG program performance, we most likely need to use our reference dataset to test it (as highlighted in both the huggingface RAG evaluation and the previous CC/CD framework blog linked before) and an evaluator to compute accuracy. We can actually use other LLMs to judge these factors from the responses we provide. I will use another anthropic model to check against the original response. (Sonnet to check, Haiku to write)

Examples and Testing

import os
from collections import Counter

# Load environment variables from .env file
try:
    import dotenv
    dotenv.load_dotenv()
except ImportError:
    print("Warning: python-dotenv not installed. Environment variables from .env file won't be loaded.")

# Note: The following imports are for the LLM-as-a-Judge section.
# The script will handle cases where they are not installed.
try:
    from ragas import evaluate
    from ragas.metrics import faithfulness, answer_relevancy, context_recall, context_precision
    from datasets import Dataset
    RAGAS_INSTALLED = True
except ImportError:
    RAGAS_INSTALLED = False

# --- 1. Mock RAG Pipeline and Data Generation ---

def get_mock_rag_output(query: str) -> dict:
    """
    Simulates the output of a RAG pipeline for a given query.
    In a real application, this function would interact with your actual
    vector database and language model.
    """
    # A simple rule-based mock for demonstration
    if "capital of france" in query.lower():
        return {
            "query": query,
            "retrieved_contexts": [
                "France is a country in Western Europe. It is known for its wines and sophisticated cuisine.",
                "Paris is the capital and most populous city of France. It is a major global center for art, fashion, gastronomy and culture.",
                "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France."
            ],
            "generated_answer": "Paris is the capital of France, known for its art and culture."
        }
    elif "tallest mountain" in query.lower():
        return {
             "query": query,
            "retrieved_contexts": [
                "Mount Everest is Earth's highest mountain above sea level, located in the Mahalangur Himal sub-range of the Himalayas.",
                "The elevation of its summit is 8,848.86 m (29,031.7 ft).",
                "K2 is the second-highest mountain on Earth."
            ],
            "generated_answer": "The tallest mountain is Mount Everest, with an elevation of 8,848.86 meters."
        }
    else:
        # Simulate a case where the RAG system fails to find relevant info
        return {
            "query": query,
            "retrieved_contexts": ["The sky is blue and the grass is green."],
            "generated_answer": "I am not sure about that."
        }

def setup_evaluation_data() -> list:
    """
    Creates the ground-truth dataset and runs the mock RAG pipeline on it.
    This produces the data that will be used by all evaluation methods.
    """
    print("--- 1. Generating RAG Outputs for Evaluation ---")
    # In a real-world scenario, you would have a more extensive dataset.
    evaluation_dataset = [
        {
            "query": "What is the capital of France?",
            "ground_truth": "The capital of France is Paris."
        },
        {
            "query": "What is the tallest mountain?",
            "ground_truth": "Mount Everest is the tallest mountain in the world."
        }
    ]

    # Run the mock RAG pipeline on our dataset to get the outputs to evaluate
    rag_outputs = []
    for item in evaluation_dataset:
        rag_output = get_mock_rag_output(item["query"])
        rag_output.update(item) # Add ground_truth to the dict
        rag_outputs.append(rag_output)

    print("Successfully generated RAG outputs.")
    print("-" * 50)
    return rag_outputs

# --- 2. Code-Based Evaluation ---

def calculate_f1_score(generated_answer: str, ground_truth: str) -> float:
    """
    Calculates the F1 score, a metric for word overlap, between two strings.
    """
    gen_tokens = generated_answer.lower().split()
    gt_tokens = ground_truth.lower().split()

    if not gen_tokens or not gt_tokens:
        return 0.0

    common = Counter(gen_tokens) & Counter(gt_tokens)
    num_same = sum(common.values())

    if num_same == 0:
        return 0.0

    precision = 1.0 * num_same / len(gen_tokens)
    recall = 1.0 * num_same / len(gt_tokens)
    f1 = (2 * precision * recall) / (precision + recall)
    return f1

def run_code_based_evaluation(rag_outputs: list):
    """
    Runs the deterministic F1-score evaluation on the RAG outputs.
    """
    print("\n--- 2. Running Code-Based Evaluation (F1 Score) ---")
    print("Pros: Objective, fast, cheap, reproducible.")
    print("Cons: Inflexible, misses semantic meaning, requires exact ground truth.\n")

    for output in rag_outputs:
        f1 = calculate_f1_score(output["generated_answer"], output["ground_truth"])
        print(f"Query: {output['query']}")
        print(f"  -> F1 Score: {f1:.4f}")
    print("-" * 50)

# --- 3. LLM-as-a-Judge Evaluation ---

def run_llm_as_judge_evaluation(rag_outputs: list):
    """
    Runs the LLM-as-a-Judge evaluation using the 'ragas' library.
    """
    print("\n--- 3. Running LLM-as-a-Judge Evaluation (RAGAs) ---")
    print("Pros: Scalable, handles subjective nuance, evaluates semantics.")
    print("Cons: Model bias, API costs, depends on judge model's quality.\n")

    if not RAGAS_INSTALLED:
        print("RAGAs library not found. Skipping this section.")
        print("To run, please install with: uv add ragas pandas datasets langchain-anthropic")
        print("-" * 50)
        return

    if not os.getenv("ANTHROPIC_API_KEY"):
        print("ANTHROPIC_API_KEY environment variable not set. Skipping this section.")
        print("-" * 50)
        return

    # Configure Claude as the LLM for RAGAs evaluation
    try:
        from langchain_anthropic import ChatAnthropic
        from ragas.llms import LangchainLLMWrapper
        from ragas.embeddings import LangchainEmbeddingsWrapper
        from langchain_huggingface import HuggingFaceEmbeddings
        
        # Initialize Claude model
        claude_llm = ChatAnthropic(
            model="claude-3-5-haiku-latest",
            anthropic_api_key=os.getenv("ANTHROPIC_API_KEY"),
            temperature=0.0
        )
        
        # Initialize embeddings (using HuggingFace as fallback since we don't have OpenAI)
        embeddings = HuggingFaceEmbeddings(
            model_name="sentence-transformers/all-MiniLM-L6-v2"
        )
        
        # Wrap both for RAGAs
        wrapped_llm = LangchainLLMWrapper(claude_llm)
        wrapped_embeddings = LangchainEmbeddingsWrapper(embeddings)
        
        # Set the LLM and embeddings for all metrics
        for metric in [faithfulness, answer_relevancy, context_precision, context_recall]:
            metric.llm = wrapped_llm
            if hasattr(metric, 'embeddings'):
                metric.embeddings = wrapped_embeddings
            
        print("Using Claude (claude-3-5-haiku-latest) for LLM-as-a-Judge evaluation...")
        print("Using HuggingFace embeddings (all-MiniLM-L6-v2) for similarity calculations...")
        
    except ImportError as e:
        print(f"Missing dependencies for Claude integration: {e}")
        print("Please install with: uv add langchain-anthropic langchain-huggingface")
        print("-" * 50)
        return
    except Exception as e:
        print(f"Error configuring Claude: {e}")
        print("-" * 50)
        return

    # Convert the list of dictionaries to a Hugging Face Dataset object
    rag_dataset_dict = {
        "question": [output["query"] for output in rag_outputs],
        "contexts": [output["retrieved_contexts"] for output in rag_outputs],
        "answer": [output["generated_answer"] for output in rag_outputs],
        "ground_truth": [output["ground_truth"] for output in rag_outputs]
    }
    rag_dataset = Dataset.from_dict(rag_dataset_dict)

    print("Evaluating with RAGAs metrics (faithfulness, answer_relevancy)...")
    try:
        results = evaluate(
            dataset=rag_dataset,
            metrics=[
                faithfulness,       # How factually consistent is the answer with the context?
                answer_relevancy,   # How relevant is the answer to the question?
                context_precision,  # Are the retrieved contexts relevant?
                context_recall,     # Were all relevant contexts retrieved?
            ],
        )
        print("RAGAs Evaluation Results:")
        print(results)
    except Exception as e:
        print(f"An error occurred during RAGAs evaluation: {e}")

    print("-" * 50)

# --- 4. Human-in-the-Loop Evaluation ---

def get_human_evaluation(rag_output: dict) -> dict:
    """
    Simulates a human evaluation workflow by prompting the user for input.
    """
    print("\n--- Human Evaluation Required ---")
    print(f"Query: {rag_output['query']}")
    print(f"Retrieved Contexts: {rag_output['retrieved_contexts']}")
    print(f"Generated Answer: {rag_output['generated_answer']}")
    print("-" * 20)

    score = None
    while score is None:
        try:
            raw_score = input("Rate the answer quality on a scale of 1 (bad) to 5 (excellent): ")
            score = int(raw_score)
            if not 1 <= score <= 5:
                print("Invalid score. Please enter a number between 1 and 5.")
                score = None
        except (ValueError, TypeError):
            print("Invalid input. Please enter a number.")

    feedback = input("Provide any additional qualitative feedback (optional): ")
    return {"human_score": score, "feedback": feedback or "N/A"}

def run_human_in_the_loop_evaluation(rag_outputs: list):
    """
    Iterates through RAG outputs to collect human feedback.
    """
    print("\n--- 4. Running Human-in-the-Loop Evaluation ---")
    print("Pros: Gold standard for quality, captures user experience, handles nuance.")
    print("Cons: Expensive, slow, not scalable, subject to human bias.\n")

    human_feedback_results = []
    for output in rag_outputs:
        feedback = get_human_evaluation(output)
        human_feedback_results.append(feedback)
        print(f"  -> Human Feedback Received: {feedback}\n")
    print("-" * 50)


# --- Main Execution Block ---
if __name__ == "__main__":
    # 1. Generate the data to be evaluated
    rag_outputs_to_evaluate = setup_evaluation_data()

    # 2. Run code-based evaluation
    run_code_based_evaluation(rag_outputs_to_evaluate)

    # 3. Run LLM-as-a-Judge evaluation
    run_llm_as_judge_evaluation(rag_outputs_to_evaluate)

    # 4. Run human-in-the-loop evaluation
    run_human_in_the_loop_evaluation(rag_outputs_to_evaluate)

    print("\nEvaluation complete. This script demonstrates how each method provides a")
    print("different perspective on your RAG pipeline's performance.")

Outputs of a small test:

--- 1. Generating RAG Outputs for Evaluation ---
Successfully generated RAG outputs.
--------------------------------------------------

--- 2. Running Code-Based Evaluation (F1 Score) ---
Pros: Objective, fast, cheap, reproducible.
Cons: Inflexible, misses semantic meaning, requires exact ground truth.

Query: What is the capital of France?
  -> F1 Score: 0.4444
Query: What is the tallest mountain?
  -> F1 Score: 0.4762
--------------------------------------------------

--- 3. Running LLM-as-a-Judge Evaluation (RAGAs) ---
Pros: Scalable, handles subjective nuance, evaluates semantics.
Cons: Model bias, API costs, depends on judge model's quality.

Using Claude (claude-3-5-haiku-latest) for LLM-as-a-Judge evaluation...
Using HuggingFace embeddings (all-MiniLM-L6-v2) for similarity calculations...
Evaluating with RAGAs metrics (faithfulness, answer_relevancy)...
Evaluating: 100%|██████████| 8/8 [00:06<00:00,  1.20it/s]
RAGAs Evaluation Results:
{'faithfulness': 1.0000, 'answer_relevancy': 0.9404, 'context_precision': 0.7917, 'context_recall': 1.0000}
--------------------------------------------------

--- 4. Running Human-in-the-Loop Evaluation ---
Pros: Gold standard for quality, captures user experience, handles nuance.
Cons: Expensive, slow, not scalable, subject to human bias.


--- Human Evaluation Required ---
Query: What is the capital of France?
Retrieved Contexts: ['France is a country in Western Europe. It is known for its wines and sophisticated cuisine.', 'Paris is the capital and most populous city of France. It is a major global center for art, fashion, gastronomy and culture.', 'The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France.']
Generated Answer: Paris is the capital of France, known for its art and culture.
--------------------
Rate the answer quality on a scale of 1 (bad) to 5 (excellent): Provide any additional qualitative feedback (optional):   -> Human Feedback Received: {'human_score': 5, 'feedback': 'Great answer'}


--- Human Evaluation Required ---
Query: What is the tallest mountain?
Retrieved Contexts: ["Mount Everest is Earth's highest mountain above sea level, located in the Mahalangur Himal sub-range of the Himalayas.", 'The elevation of its summit is 8,848.86 m (29,031.7 ft).', 'K2 is the second-highest mountain on Earth.']
Generated Answer: The tallest mountain is Mount Everest, with an elevation of 8,848.86 meters.
--------------------
Rate the answer quality on a scale of 1 (bad) to 5 (excellent): Provide any additional qualitative feedback (optional):   -> Human Feedback Received: {'human_score': 5, 'feedback': 'Excellent response'}

--------------------------------------------------

So what are the trade offs?

Code-Based Evaluation (F1 Score) As shown in the output, this method gave the answers F1 scores of 0.4444 and 0.4762. These relatively low scores demonstrate the primary weakness of this approach, even though a human would rate the answers as nearly perfect. Pros: Objective and Reproducible: The calculation will always produce the same score for the same inputs, making it excellent for tracking regressions in your pipeline. Fast and Cheap: This method runs almost instantly and requires no external API calls, making it ideal for running frequently during development (e.g., in a CI/CD pipeline). No External Bias: The score is based on a strict mathematical formula, free from the biases of human evaluators or LLM judges. Cons: Inflexible and Lacks Semantic Understanding: This is the most significant drawback. The generated answers were excellent, but because they didn’t match the exact wording of the ground_truth strings, the scores were low. It penalizes perfectly valid, semantically identical answers. Requires a “Golden” Answer: This method is only useful when there is a single, pre-defined correct answer, making it unsuitable for questions that are open-ended or have multiple valid responses. Can Be Misleading: Relying solely on this metric would lead you to believe your RAG pipeline is performing poorly, when in reality it is generating high-quality responses.
LLM-as-a-Judge (RAGAs) The RAGAs evaluation provided a much more insightful and positive assessment, with high scores across the board (faithfulness: 1.0, answer_relevancy: 0.94, context_recall: 1.0). This aligns far better with the intuitive quality of the answers. Pros: Handles Nuance and Semantics: Unlike the F1 score, the LLM judge understood that the generated answers were contextually correct and relevant to the questions. It can assess abstract qualities like “faithfulness” to the source documents. Highly Scalable: You can evaluate thousands of query-response pairs automatically without human intervention, making it much more scalable than manual evaluation. No Need for a Strict Ground Truth: Metrics like faithfulness and context_precision don’t require a pre-written “correct” answer. They evaluate the quality of the generation based on the retrieved documents, which is more representative of a real-world RAG task. Cons: Introduces Model Bias: The evaluation quality is entirely dependent on the capabilities and potential biases of the “judge” LLM (in this case, Claude Haiku). A different judge model might produce different scores. API Costs and Latency: Running these evaluations involves API calls to a powerful LLM, which costs money and takes more time than simple code-based metrics. Potential for “Gaming”: The RAG model could inadvertently be optimized to please the specific preferences of the judge LLM, which may not always align with human user preferences.
Human-in-the-Loop Evaluation This method confirmed that the generated answers were perfect, with both receiving a human score of 5/5. This serves as the “ground truth” against which the other methods can be compared. Pros: Gold Standard for Quality: Human judgment is the ultimate measure of whether an answer is truly useful, natural, and trustworthy. It captures subtle aspects of user experience that automated systems might miss. Handles Complex and Subjective Tasks: Humans are uniquely capable of assessing creativity, tone, safety, and ethical considerations that are difficult to quantify with automated metrics. Provides Qualitative Feedback: The optional feedback (‘Great answer’, ‘Excellent response’) provides invaluable insights into why an answer is good or bad, which can guide development efforts more directly. Cons: Extremely Expensive and Slow: This is the biggest drawback. Manually evaluating hundreds or thousands of responses is prohibitively costly and time-consuming. Does Not Scale: It is not a feasible method for continuous, large-scale testing during iterative development. Subject to Human Bias and Inconsistency: Different human evaluators might score the same answer differently. Their judgment can be influenced by fatigue, mood, or unclear guidelines, leading to inconsistent results

Results:

I tweaked how the evals.py file worked. I used a LLM-judge method to look at the response that the rag.py file created, then compared it to 3 Boolean items:

cites_all_relevant: Does it site all relevant information?

not_paraphrasing: Is it writing the exact section as it is written?

cites_sections_paragraphs: Does it cite all information that it gives from the document?

It then collates and gives it a score from 0 (worst) to 1 (best). It then also gives notes on the response, then writes an improvement section to allow the user to provide better feedback. To get the queries, it gets a text file of your choosing and runs each query 1 by 1 to allow each query to have separate feedback overall. This will be useful for when I create and use logs to also tweak and improve the system. It is placed into a testing JSON file to allow for easier reading.

Example test command line:

uv run python evals.py \
  --queries-file test_query.txt \
  --run-real \
  --judge-model "claude-3-5-haiku-latest" \
  --output test_results.json

I also added some few shot examples to the system, to overall steer the system into the style of response that I wanted it to mimic. Again this can be further improved as the system is scaled up to also read the logs of previous inputs to further improve the system.

EXAMPLE_QA_PAIRS = [
    {
        "question": "What is the recommended occupancy for junior ranks?",
        "answer": "According to Table 2D.1 in section 2D3, the occupancy for junior ranks is 4 personnel."
    },
    {
        "question": "What are Key User Requirements (KURs)?", 
        "answer": "a. Priority. A priority should be assigned to each element of requirement.13 (1) Key User Requirements (KURs), are the Critical Success Factors without which the facility/service will not possibly be fit for purpose i.e. not be ‘good enough’. During the construction or delivery of support, these may form the basis of Key Performance Indicators (KPIs). (2) Any other priority is coded P1, P2 (and so on) to reflect lower priorities"
    },
    {
        "question": "What should the Theatre Infrastructure Development Directive (TIDD) address?",
        "answer": "**2C2. Contents of a Theatre Infrastructure Development Directive.** The size and contents of the TIDD will vary over time. As a guide it should address: a. **Infrastructure support to the operation.** The TIDD should explain what infrastructure is required to support the operation, for example, an 'RSOI camp for 5,000', an 'APOD for Sp Ac and 24 Typhoon aircraft', or 'up to 20 company-sized patrol bases'. b. **Planning horizon.** The TIDD should give the planning horizon for infrastructure investments throughout theatre. Ideally, some estimate should be given of the likely tenure of each site and major facility; are they likely to be needed for a short term (less than say six months) or longer term (say up to two years)? c. **Infrastructure programme.** It should state who is responsible for producing the Infrastructure Programme. h. **Health and Safety.** The TIDD may need to give some clarification of the health and safety procedures to be adopted. It may also remind theatre to confirm its infrastructure safety Standard Operating Procedures (SOPs) with PJHQ. Further details are in Part 3. i. **Military works area.** The TIDD should clarify whether a military works area exists and, if so, its boundaries. j. **Lands and environmental issues.** Guidance may be given on the selection of base locations, condition surveys, leasing and environmental assessments, if appropriate.",
    }
]

def select_relevant_examples(query: str, examples: list, max_examples: int = 2):
    """Select the most relevant examples based on token overlap with the query."""
    import re
    
    def token_overlap(text1: str, text2: str) -> int:
        tokens1 = set(re.findall(r'\w+', text1.lower()))
        tokens2 = set(re.findall(r'\w+', text2.lower()))
        return len(tokens1 & tokens2)
    
    # Score examples by token overlap with query
    scored_examples = []
    for example in examples:
        score = token_overlap(query, example["question"])
        scored_examples.append((score, example))
    
    # Sort by score and return top examples
    scored_examples.sort(key=lambda x: x[0], reverse=True)
    return [example for _, example in scored_examples[:max_examples]]

I would like to also add another layer of human feedback in the next iteration, a simple thumbs up/thumbs down feedback after each response, so there is metrics to also steer the AI to what the overall userbase would like to see. This obviously could not be the only tool for evaluation, as it could be skewed by bias, but it is a handy extra tool at our hands.