Improving the MOD Navigator
After deploying the most recent version of the MOD navigator (found here), I fell back on the CC/CD framework (mentioned a few blog posts ago) to plan the next round of capability additions.
As the project has matured, I have given the AI agent working in the background more agency over what counts as a ‘good’ response, and with caching and human-feedback elements its responses and response times have naturally improved. It now automatically produces links to the source material, deciding both what to highlight and how many links to add, and it will skip results it previously judged bad, generating a new response instead.
Scoping capability and curating data
The first thing I would like to do is expand the scope of the project. For now it still uses a small reference dataset (Chapter 2 of the Joint Tactics, Techniques and Procedures 4-05); at the very least I would like to expand this to a much wider corpus (such as the entire JTTP pdf and the surrounding documents I can use), or possibly add the ability to supply a root file of my choosing (instead of the fixed one I have now).
Either direction requires a better system for converting the pdf file into markdown. As it stands, the converter (pdf2md.py) is not up to scratch for a number of reasons:
- It cannot link references correctly
- It cannot handle images
- It cannot exclude figures where appropriate, or handle references to them
All of these issues had to be fixed either manually or with generative AI, and the latter approach brings problems of its own:
- It can skip over key sections because of the context-window limits present in most LLMs
- It may paraphrase sections instead of using the exact original wording
- It becomes costly if applied properly to larger files
- It can hallucinate data that does not exist in the pdf at all
So I would like to improve pdf2md.py to address all of its current issues, which will make either future direction easier to implement.
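As an illustration of the reference-linkage problem, one possible fix is a post-processing pass that turns plain mentions of figures into markdown links. This is only a sketch of the idea, not something pdf2md.py does today; the function name, the `Figure N-N` numbering pattern, and the anchor scheme are all my assumptions:

```python
import re

def link_figure_references(md_text: str) -> str:
    """Turn plain mentions like 'Figure 2-1' into markdown links pointing
    at anchors emitted wherever each figure appears in the converted file.
    Hypothetical helper -- the real converter may need a different scheme."""
    pattern = re.compile(r'\bFigure (\d+-\d+)\b')
    return pattern.sub(lambda m: f"[Figure {m.group(1)}](#figure-{m.group(1)})", md_text)

print(link_figure_references("See Figure 2-1 for the layout."))
# See [Figure 2-1](#figure-2-1) for the layout.
```

The same pattern would extend to cross-references between sections, as long as the converter emits stable anchors while it walks the pdf.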
Setting up the Improvements
First I wanted to add images to the markdown, as they were largely ignored in the first iteration of this project. This is a relatively simple fix, using the base64 module to encode the images extracted alongside the text:
import base64
from pathlib import Path

def embed_image_as_base64(img_path: Path) -> str:
    """Convert image to base64 data URL for embedding in markdown"""
    with open(img_path, 'rb') as f:
        img_data = f.read()
    # Determine MIME type based on file extension
    ext = img_path.suffix.lower()
    mime_type = 'image/jpeg' if ext in ['.jpg', '.jpeg'] else 'image/png'
    # Create base64 data URL
    b64_data = base64.b64encode(img_data).decode('utf-8')
    return f"data:{mime_type};base64,{b64_data}"
We also had to add a new argument to the parser, appropriately named `--embed-images`, which lets us put the images directly into the markdown. The script still has to extract the images to disk; in this case they were saved in the test_images folder (named after the markdown file, which I called test.md here).
if embed_images:
    # Embed image as base64 data URL
    img_path = assets_dir / img
    data_url = embed_image_as_base64(img_path)
    out.write(f"![{img}]({data_url})\n\n")
else:
    # Reference external image file
    out.write(f"![{img}]({assets_dir.name}/{img})\n\n")
I tested it on this document, which has text in a few different sizes and fonts and, most importantly, an image that needs to appear in the resulting markdown document.
Results:
This is what it looks like: Here