Combining and improving the OCR system with an interface.


Over the next few days I will be continuing to build the OCR system I have, in order to combine it with a user interface and improve its overall performance.

Current System

The training I attempted for custom OCR text recognition did not work, so for the text-recognition step I will use ready-made systems (Tesseract, via pytesseract).

This is the current OCR I am using:

import cv2
import pytesseract


def ocr_core(img):
    # Run Tesseract on the (preprocessed) image and return the raw text.
    text = pytesseract.image_to_string(img)
    return text

def get_greyscale(image):
    return cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

def noise_removal(image):
    return cv2.medianBlur(image, 5)

def thresholding(image):
    # Otsu's method chooses the binarisation threshold automatically.
    return cv2.threshold(image, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

# Load the fixed input image and run the preprocessing pipeline.
img = cv2.imread('img.png')
img = get_greyscale(img)
img = thresholding(img)
img = noise_removal(img)

text = ocr_core(img)
print(text)

This script worked perfectly for processing saved images, but it had obvious limitations:

Fixed input: Only worked with a specific image file
Command-line only: No user-friendly interface
No camera integration: Couldn’t capture live images
Manual process: Required running the script each time

Webpage

This is the webpage I created. It enables the webcam so that real-world scenes can be photographed in real time, which makes the tool much more accessible.

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>AI Camera Tool</title>
    <style>
        * {
            margin: 0;
            padding: 0;
            box-sizing: border-box;
        }
        
        body {
            font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
            background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
            min-height: 100vh;
            display: flex;
            align-items: center;
            justify-content: center;
            padding: 20px;
        }
        
        .container {
            background: rgba(255, 255, 255, 0.95);
            backdrop-filter: blur(10px);
            border-radius: 20px;
            padding: 30px;
            box-shadow: 0 20px 40px rgba(0, 0, 0, 0.1);
            max-width: 800px;
            width: 100%;
        }
        
        .header {
            text-align: center;
            margin-bottom: 30px;
        }
        
        h1 {
            color: #333;
            margin-bottom: 10px;
            font-size: 2.5em;
            background: linear-gradient(135deg, #667eea, #764ba2);
            -webkit-background-clip: text;
            background-clip: text;
            -webkit-text-fill-color: transparent;
        }
        
        .subtitle {
            color: #666;
            font-size: 1.1em;
        }
        
        .camera-section {
            display: flex;
            flex-direction: column;
            align-items: center;
            gap: 20px;
            margin-bottom: 30px;
        }
        
        .video-container {
            position: relative;
            border-radius: 15px;
            overflow: hidden;
            box-shadow: 0 10px 30px rgba(0, 0, 0, 0.2);
        }
        
        video {
            max-width: 100%;
            width: 400px;
            height: 300px;
            object-fit: cover;
        }
        
        canvas {
            display: none;
        }
        
        .controls {
            display: flex;
            gap: 15px;
            flex-wrap: wrap;
            justify-content: center;
        }
        
        button {
            background: linear-gradient(135deg, #667eea, #764ba2);
            color: white;
            border: none;
            padding: 12px 24px;
            border-radius: 25px;
            cursor: pointer;
            font-size: 16px;
            font-weight: 600;
            transition: all 0.3s ease;
            box-shadow: 0 4px 15px rgba(102, 126, 234, 0.4);
        }
        
        button:hover {
            transform: translateY(-2px);
            box-shadow: 0 6px 20px rgba(102, 126, 234, 0.6);
        }
        
        button:disabled {
            opacity: 0.6;
            cursor: not-allowed;
            transform: none;
        }
        
        .analysis-section {
            margin-top: 30px;
        }
        
        .analysis-types {
            display: grid;
            grid-template-columns: repeat(auto-fit, minmax(200px, 1fr));
            gap: 15px;
            margin-bottom: 20px;
        }
        
        .analysis-type {
            background: #f8f9ff;
            border: 2px solid #e1e8ff;
            border-radius: 10px;
            padding: 15px;
            cursor: pointer;
            transition: all 0.3s ease;
            text-align: center;
        }
        
        .analysis-type:hover, .analysis-type.selected {
            border-color: #667eea;
            background: #f0f4ff;
            transform: scale(1.02);
        }
        
        .analysis-type h3 {
            color: #333;
            margin-bottom: 5px;
        }
        
        .analysis-type p {
            color: #666;
            font-size: 0.9em;
        }
        
        .results {
            background: #f8f9ff;
            border-radius: 15px;
            padding: 20px;
            margin-top: 20px;
            min-height: 100px;
            border-left: 4px solid #667eea;
        }
        
        .loading {
            display: flex;
            align-items: center;
            justify-content: center;
            gap: 10px;
            color: #667eea;
        }
        
        .spinner {
            width: 20px;
            height: 20px;
            border: 2px solid #e1e8ff;
            border-top: 2px solid #667eea;
            border-radius: 50%;
            animation: spin 1s linear infinite;
        }
        
        @keyframes spin {
            0% { transform: rotate(0deg); }
            100% { transform: rotate(360deg); }
        }
        
        .error {
            color: #e74c3c;
            text-align: center;
            padding: 20px;
        }
        
        .success {
            color: #27ae60;
            text-align: center;
            margin-bottom: 20px;
        }
        
        @media (max-width: 600px) {
            .container {
                padding: 20px;
            }
            
            h1 {
                font-size: 2em;
            }
            
            video {
                width: 100%;
                height: 200px;
            }
            
            .controls {
                flex-direction: column;
                align-items: center;
            }
            
            button {
                width: 200px;
            }
        }
    </style>
</head>
<body>
    <div class="container">
        <div class="header">
            <h1>🤖 AI Camera Tool</h1>
            <p class="subtitle">Real-time image analysis powered by AI</p>
        </div>
        
        <div class="camera-section">
            <div class="video-container">
                <video id="video" autoplay muted playsinline></video>
                <canvas id="canvas"></canvas>
            </div>
            
            <div class="controls">
                <button id="startCamera">📹 Start Camera</button>
                <button id="capturePhoto" disabled>📸 Capture & Analyze</button>
                <button id="stopCamera" disabled>⏹ Stop Camera</button>
            </div>
        </div>
        
        <div class="analysis-section">
            <h3 style="color: #333; margin-bottom: 15px; text-align: center;">Choose Analysis Type:</h3>
            
            <div class="analysis-types">
                <div class="analysis-type selected" data-type="general">
                    <h3>🔍 General Description</h3>
                    <p>Overall scene analysis and object identification</p>
                </div>
                
                <div class="analysis-type" data-type="accessibility">
                    <h3>♿ Accessibility</h3>
                    <p>Description for visually impaired users</p>
                </div>
                
                <div class="analysis-type" data-type="text">
                    <h3>📝 Text Recognition</h3>
                    <p>Extract and read text from the image</p>
                </div>
                
                <div class="analysis-type" data-type="safety">
                    <h3>⚠️ Safety Analysis</h3>
                    <p>Identify potential hazards or safety concerns</p>
                </div>
            </div>
            
            <div id="results" class="results">
                <p style="color: #666; text-align: center;">Start your camera and capture a photo to see AI analysis results here.</p>
            </div>
        </div>
    </div>

    <script>
        class AICameraTool {
            constructor() {
                this.video = document.getElementById('video');
                this.canvas = document.getElementById('canvas');
                this.ctx = this.canvas.getContext('2d');
                this.results = document.getElementById('results');
                this.currentAnalysisType = 'general';
                this.stream = null;
                
                this.initializeEventListeners();
            }
            
            initializeEventListeners() {
                document.getElementById('startCamera').addEventListener('click', () => this.startCamera());
                document.getElementById('capturePhoto').addEventListener('click', () => this.captureAndAnalyze());
                document.getElementById('stopCamera').addEventListener('click', () => this.stopCamera());
                
                document.querySelectorAll('.analysis-type').forEach(type => {
                    type.addEventListener('click', (e) => this.selectAnalysisType(e.target.closest('.analysis-type')));
                });
            }
            
            async startCamera() {
                try {
                    this.stream = await navigator.mediaDevices.getUserMedia({
                        video: { 
                            width: { ideal: 640 },
                            height: { ideal: 480 },
                            facingMode: 'environment'
                        }
                    });
                    
                    this.video.srcObject = this.stream;
                    
                    document.getElementById('startCamera').disabled = true;
                    document.getElementById('capturePhoto').disabled = false;
                    document.getElementById('stopCamera').disabled = false;
                    
                    this.showMessage('📹 Camera started successfully!', 'success');
                    
                } catch (error) {
                    console.error('Error accessing camera:', error);
                    this.showMessage('❌ Could not access camera. Please check permissions.', 'error');
                }
            }
            
            stopCamera() {
                if (this.stream) {
                    this.stream.getTracks().forEach(track => track.stop());
                    this.video.srcObject = null;
                    this.stream = null;
                }
                
                document.getElementById('startCamera').disabled = false;
                document.getElementById('capturePhoto').disabled = true;
                document.getElementById('stopCamera').disabled = true;
                
                this.showMessage('Camera stopped.', 'success');
            }
            
            selectAnalysisType(element) {
                document.querySelectorAll('.analysis-type').forEach(type => {
                    type.classList.remove('selected');
                });
                element.classList.add('selected');
                this.currentAnalysisType = element.dataset.type;
            }
            
            captureAndAnalyze() {
                // Set canvas dimensions to match video
                this.canvas.width = this.video.videoWidth;
                this.canvas.height = this.video.videoHeight;
                
                // Draw current video frame to canvas
                this.ctx.drawImage(this.video, 0, 0);
                
                // Convert to base64
                const imageData = this.canvas.toDataURL('image/jpeg', 0.8);
                
                this.analyzeImage(imageData);
            }
            
            async analyzeImage(imageData) {
                this.showLoading();
                
                try {
                    // Simulate AI analysis with different responses based on type
                    await this.simulateAIAnalysis(imageData);
                } catch (error) {
                    console.error('Analysis error:', error);
                    this.showMessage('❌ Analysis failed. Please try again.', 'error');
                }
            }
            
            async simulateAIAnalysis(imageData) {
                // Simulate API call delay
                await new Promise(resolve => setTimeout(resolve, 2000));
                
                const analysisPrompts = {
                    general: "I can see this is a captured image from your camera. In a real implementation, this would connect to AI services like OpenAI Vision, Google Cloud Vision, or AWS Rekognition to analyze what's in the image - identifying objects, people, text, colors, and providing detailed descriptions of the scene.",
                    
                    accessibility: "For accessibility: This tool would provide detailed spatial descriptions, identify potential obstacles or navigation aids, describe people's positions and actions, read any visible text aloud, and highlight important visual information that would help visually impaired users understand their environment.",
                    
                    text: "Text Recognition: In a full implementation, this would use OCR (Optical Character Recognition) to extract any text visible in the image, including signs, documents, labels, or handwritten notes. The text would then be read aloud or displayed in large, high-contrast format.",
                    
                    safety: "Safety Analysis: This would identify potential hazards like stairs, obstacles, wet floors, construction zones, traffic, or other safety concerns. It would provide alerts and recommendations for safe navigation through the detected environment."
                };
                
                const mockResults = {
                    confidence: Math.floor(Math.random() * 20) + 80,
                    timestamp: new Date().toLocaleTimeString(),
                    analysis: analysisPrompts[this.currentAnalysisType]
                };
                
                this.displayResults(mockResults);
            }
            
            displayResults(results) {
                this.results.innerHTML = `
                    <div style="border-bottom: 1px solid #e1e8ff; padding-bottom: 15px; margin-bottom: 15px;">
                        <div style="display: flex; justify-content: space-between; align-items: center; margin-bottom: 10px;">
                            <strong style="color: #333;">Analysis Results</strong>
                            <span style="background: #667eea; color: white; padding: 4px 12px; border-radius: 20px; font-size: 0.9em;">
                                ${results.confidence}% confidence
                            </span>
                        </div>
                        <div style="color: #666; font-size: 0.9em;">
                            Captured at ${results.timestamp}
                        </div>
                    </div>
                    
                    <div style="line-height: 1.6; color: #444;">
                        ${results.analysis}
                    </div>
                    
                    <div style="margin-top: 20px; padding: 15px; background: #f0f7ff; border-radius: 10px; border-left: 4px solid #2196F3;">
                        <strong>💡 Implementation Note:</strong> To make this fully functional, you would integrate with:
                        <ul style="margin: 10px 0 0 20px; color: #555;">
                            <li>OpenAI Vision API for detailed scene analysis</li>
                            <li>Google Cloud Vision for object and text detection</li>
                            <li>Custom ML models for specialized accessibility features</li>
                            <li>Text-to-speech for audio feedback</li>
                        </ul>
                    </div>
                `;
            }
            
            showLoading() {
                this.results.innerHTML = `
                    <div class="loading">
                        <div class="spinner"></div>
                        <span>Analyzing image with AI...</span>
                    </div>
                `;
            }
            
            showMessage(message, type) {
                const className = type === 'error' ? 'error' : 'success';
                this.results.innerHTML = `<div class="${className}">${message}</div>`;
            }
        }
        
        // Initialize the camera tool when page loads
        document.addEventListener('DOMContentLoaded', () => {
            new AICameraTool();
        });
    </script>
</body>
</html>

This interface solves the issues of the OCR script in a number of ways. It can:

Access device cameras in real time
Capture photos on demand
Process images instantly with OCR
Provide a beautiful, modern interface
Offer practical features like copy-to-clipboard and text-to-speech

Architecture: Building the Bridge

Connecting the two

The solution required bridging two worlds: the robust Python OCR backend and a responsive web frontend.

The Three-Layer Architecture

Part 1: Enhancing the OCR Engine

From Single-Shot to Multi-Configuration OCR

The original script used Tesseract with default settings. I enhanced it to try multiple OCR configurations and return the best result:
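The enhanced engine itself isn't reproduced here, but the idea is to run Tesseract under several page segmentation modes (PSMs) and keep whichever result scores the highest mean word confidence. A minimal sketch of that approach, with the function name and the particular PSM choices being my assumptions:

import pytesseract
from pytesseract import Output

def ocr_best(img):
    # Try several page segmentation modes: 3 = fully automatic,
    # 6 = single uniform block of text, 11 = sparse text.
    configs = ['--psm 3', '--psm 6', '--psm 11']
    best_text, best_conf = '', -1.0
    for config in configs:
        data = pytesseract.image_to_data(img, config=config, output_type=Output.DICT)
        # Tesseract reports -1 confidence for non-text boxes; skip those.
        confs = [float(c) for c in data['conf'] if float(c) >= 0]
        if not confs:
            continue
        mean_conf = sum(confs) / len(confs)
        if mean_conf > best_conf:
            best_conf = mean_conf
            best_text = ' '.join(w for w in data['text'] if w.strip())
    return best_text, best_conf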

This approach dramatically improved OCR accuracy by leveraging Tesseract’s different page segmentation modes.

Adding Web API Capabilities

The biggest transformation was wrapping the OCR logic in a Flask web server:
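The server code isn't shown in the post; a minimal sketch of such a wrapper, assuming an /ocr endpoint on port 5000 (both my invention), could look like this:

from flask import Flask, request, jsonify
from flask_cors import CORS
import pytesseract

app = Flask(__name__)
CORS(app)  # allow the browser frontend, served from another origin, to call us

@app.route('/ocr', methods=['POST'])
def run_ocr():
    # data_url_to_image is sketched in the next section.
    img = data_url_to_image(request.json['image'])
    text = pytesseract.image_to_string(img)
    return jsonify({'text': text})

if __name__ == '__main__':
    app.run(port=5000)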

Handling Web Images: The Base64 Challenge

Web browsers capture images as base64-encoded data URLs, so I needed a conversion pipeline:
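A conversion helper for that pipeline might look like the following sketch (the post's own version isn't shown):

import base64
import cv2
import numpy as np

def data_url_to_image(data_url):
    # A data URL looks like "data:image/jpeg;base64,<payload>": drop the
    # header, base64-decode the payload, then let OpenCV decode the
    # JPEG bytes into a BGR image array.
    b64_payload = data_url.split(',', 1)[1]
    raw_bytes = base64.b64decode(b64_payload)
    buffer = np.frombuffer(raw_bytes, dtype=np.uint8)
    return cv2.imdecode(buffer, cv2.IMREAD_COLOR)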

Part 2: Building the Frontend Experience

Camera Integration: Accessing Device Hardware

Modern browsers provide powerful APIs for camera access. I built a JavaScript class, the AICameraTool class shown in the page above, to handle the complexity.

Real-Time Communication: Frontend ↔ Backend

The magic happens when the frontend sends captured images to the OCR server:
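In the page above the analysis is still simulated; the real version replaces simulateAIAnalysis with a request to the Flask server. The same request/response contract can be exercised from Python, which doubles as a quick end-to-end test. The URL and JSON field names below match my earlier Flask sketch, so they are assumptions rather than the project's confirmed API:

import base64
import requests

# Build a data URL from a file on disk, matching what the browser
# would produce from canvas.toDataURL().
with open('img.png', 'rb') as f:
    payload = base64.b64encode(f.read()).decode('ascii')

response = requests.post(
    'http://localhost:5000/ocr',
    json={'image': 'data:image/png;base64,' + payload},
)
print(response.json()['text'])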

User Experience: Making OCR Practical

I added features that make the tool genuinely useful:

📋 Copy to Clipboard: One-click text copying
🔊 Text-to-Speech: Accessibility for visually impaired users
🎯 Confidence Indicators: Visual feedback on OCR quality
📱 Responsive Design: Works on desktop and mobile

Part 3: The Development Experience

Dependency Management Made Easy

I created a comprehensive setup system:
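The setup files themselves aren't reproduced in the post. For the stack described here, a plausible requirements.txt (the exact package list and pins are my assumption) would be:

flask
flask-cors
opencv-python
pytesseract
numpy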

And an automated setup script:
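A sketch of what that script could look like (hypothetical; the post's actual script isn't shown). It installs the Python dependencies and warns if the Tesseract binary itself is missing, since that has to be installed through the OS package manager:

import shutil
import subprocess
import sys

def main():
    # Install the Python-level dependencies.
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-r', 'requirements.txt'])
    # pytesseract is only a wrapper; the tesseract binary must exist too.
    if shutil.which('tesseract') is None:
        print('Warning: tesseract binary not found; install it with your OS package manager.')
    else:
        print('Setup complete.')

if __name__ == '__main__':
    main()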

Testing and Debugging

I built a connection test page to help debug issues; the Python test client sketched in the communication section serves a similar purpose from the command line.

Issues

The Technical Challenges (And Solutions)

Challenge 1: CORS Issues
Problem: Browsers block cross-origin requests by default, so the frontend page could not call the Flask server directly.
Solution: Flask-CORS with proper configuration, sketched below.
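With Flask-CORS the fix is only a couple of lines. The sketch below scopes CORS to the OCR endpoint; the exact configuration used in the project is my assumption:

from flask import Flask
from flask_cors import CORS

app = Flask(__name__)
# Allow any origin to call /ocr; in production, restrict "origins"
# to the URL the frontend is actually served from.
CORS(app, resources={r"/ocr": {"origins": "*"}})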

Challenge 2: Image Quality
Problem: Web camera images needed preprocessing before Tesseract could read them reliably.
Solution: An enhanced preprocessing pipeline with multiple steps, sketched below.
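One way such a pipeline might extend the original script's steps, adding upscaling and adaptive thresholding (the exact steps used are my assumption):

import cv2

def preprocess(image):
    # Upscale: webcam frames are often too low-resolution for Tesseract.
    image = cv2.resize(image, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)
    grey = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Light denoising before binarisation.
    grey = cv2.medianBlur(grey, 3)
    # Adaptive thresholding copes with the uneven lighting typical of
    # webcam shots better than a single global Otsu threshold.
    return cv2.adaptiveThreshold(grey, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                 cv2.THRESH_BINARY, 31, 2)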

Challenge 3: OCR Accuracy
Problem: A single OCR configuration missed text.
Solution: The multi-configuration approach sketched in Part 1, returning the best-scoring result.

Lessons Learned

1. Start Simple, Iterate Smartly
The original OCR script was perfect for learning. Building on solid foundations made the web integration smoother.

2. User Experience Matters
Technical capability means nothing without a great interface. The camera integration and instant feedback transformed the tool’s usefulness.

3. Error Handling is Critical
Web applications face countless failure modes. Comprehensive error handling makes the difference between frustration and delight.

Conclusion

Transforming a simple OCR script into a full-featured web application demonstrates the power of thoughtful architecture and incremental enhancement. By preserving the core OCR logic while adding web capabilities, camera integration, and user-friendly features, we created something genuinely useful.

The journey from command-line script to web application shows how powerful tools can emerge from simple beginnings. Whether you’re building OCR tools, image processors, or any data science application, the principles remain the same: start with solid foundations, enhance incrementally, and always prioritize the user experience.

Things to Improve Upon:

This webpage was going to have a few more features, which I will now scrap, so the overall layout will have to change from its current design. The OCR processor is also inconsistent with a few types of text, so I will fine-tune it to handle handwritten text as well, which should increase both its confidence scores and its overall capability. I also still need to add the voice software so the extracted text can be read back aloud.