Image Recognition using Artificial Intelligence
Over the next few days I will be looking to build an AI system (either from scratch or through adapting and fine tuning another AI) to be able to recognise text and replicate it for users with disabilities.
Before starting that I need to consider if building an AI from scratch is possible.
Looking into text recognition from images, I can see this is already a well established task that AI models have completed, so I know this is a task that is possible.
There are a few ways to create this system. The first, and by far the easiest, is to use pytesseract, a Python wrapper for Google's Tesseract OCR engine. It lets Python take an image, pass it to Tesseract to 'read' the text, and return that text to the user. This is the code I produced in order to get it to work:
import cv2
import pytesseract

def ocr_core(img):
    # Run Tesseract over the image and return the recognised text
    text = pytesseract.image_to_string(img)
    return text

def get_greyscale(image):
    return cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

def noise_removal(image):
    return cv2.medianBlur(image, 5)

def thresholding(image):
    return cv2.threshold(image, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

# Load the image, clean it up, then pass it to Tesseract
img = cv2.imread('img.png')
img = get_greyscale(img)
img = thresholding(img)
img = noise_removal(img)
text = ocr_core(img)
print(text)
This is a good initial tool to use, and if developing a neural network does not work out I will build on this instead, as it is a ready-made OCR tool with all the capabilities needed.
However, I wanted to use this time to build a working project that can use machine learning to complete a task, so instead I will be looking into developing a (fairly simple) Neural Network to allow this task to be completed.
What is a Neural Network?
A neural network is a computing system modeled after the human brain’s structure, designed to recognize patterns and perform tasks like image recognition and natural language processing. It consists of interconnected nodes (artificial neurons) arranged in layers, where each connection has a weight that is adjusted during training to improve performance. These networks learn from data, making them powerful tools for solving complex problems.
In our case, the network I would need to build is a combined CNN-RNN, with the CNN (Convolutional Neural Network) extracting visual features from the image, and the RNN (Recurrent Neural Network) modelling the sequential relationships between the characters in the text.
But before that, we need to consider incredibly basic neural networks, as I have never dealt with them before:
First we must consider a typical structure of a neural network:
First we have the inputs: the nodes through which data enters the network. Then we have synapses, the weighted edges of the network that determine how important each input node is (the more extreme the weight, the more influence that input has). We then have a hidden layer (or layers, depending on complexity), leading to an output that is compared against the training data. The error between the network's output and the training output is calculated and used to adjust the synapse weights, and this is repeated iteratively until the error is small. This is shown in my first set of Python code:
import numpy as np

# Activation function: squashes values into the range (0, 1)
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Gradient of the sigmoid, given a value that has already passed through it
def sigmoid_derivative(x):
    return x * (1 - x)

training_inputs = np.array([[0, 0, 1],
                            [1, 1, 1],
                            [1, 0, 1],
                            [0, 1, 1]])
training_outputs = np.array([[0, 1, 1, 0]]).T

np.random.seed(1)
synaptic_weights = 2 * np.random.random((3, 1)) - 1
print("Initial random weights:", synaptic_weights)

for iteration in range(500000):
    input_layer = training_inputs
    outputs = sigmoid(np.dot(input_layer, synaptic_weights))
    error = training_outputs - outputs
    adjustments = error * sigmoid_derivative(outputs)
    synaptic_weights += np.dot(input_layer.T, adjustments)

print("Weights after training:", synaptic_weights)
print("Outputs after training:", outputs)
This is then adapted and improved using object-oriented programming, and allows the user to input their own data to produce an output. This set of training data is constructed so that the first input is the most useful one: the output essentially follows the first input.
import numpy as np

class NeuralNetwork():
    def __init__(self):
        # Seed the random number generator so results are reproducible
        np.random.seed(1)
        self.synaptic_weights = 2 * np.random.random((3, 1)) - 1

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

    def sigmoid_derivative(self, x):
        return x * (1 - x)

    def train(self, training_inputs, training_outputs, iterations=500000):
        for iteration in range(iterations):
            outputs = self.think(training_inputs)
            error = training_outputs - outputs
            adjustments = np.dot(training_inputs.T, error * self.sigmoid_derivative(outputs))
            self.synaptic_weights += adjustments

    def think(self, inputs):
        # Pass inputs through the single-layer network
        inputs = inputs.astype(float)
        output = self.sigmoid(np.dot(inputs, self.synaptic_weights))
        return output

if __name__ == "__main__":
    neural_network = NeuralNetwork()
    print("Initial random weights:", neural_network.synaptic_weights)

    training_inputs = np.array([[0, 0, 1],
                                [1, 1, 1],
                                [1, 0, 1],
                                [0, 1, 1]])
    training_outputs = np.array([[0, 1, 1, 0]]).T

    neural_network.train(training_inputs, training_outputs, iterations=500000)
    print("Weights after training:", neural_network.synaptic_weights)

    # Let the user try their own three inputs (0 or 1 each)
    A = input('Input 1: ')
    B = input('Input 2: ')
    C = input('Input 3: ')
    print("New situation: input data =", A, B, C)
    print("Output data:", neural_network.think(np.array([[A, B, C]])))
This is a step in the right direction. Next we have to produce harder, more complex models: an RNN and a CNN separately first, before combining them.
RNN Neural Network
import os
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

print("TensorFlow version:", tf.__version__)

# Load the MNIST handwritten-digit dataset and scale pixel values to [0, 1]
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
print("Data loaded and normalized")

# Each 28x28 image is treated as a sequence of 28 rows of 28 pixels
model = keras.models.Sequential()
model.add(keras.Input(shape=(28, 28)))
model.add(layers.SimpleRNN(128, return_sequences=True, activation='relu'))
model.add(layers.SimpleRNN(128, return_sequences=False, activation='relu'))
model.add(layers.Dense(10))
print(model.summary())

loss = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optim = keras.optimizers.Adam(learning_rate=0.001)
metrics = ["accuracy"]
model.compile(loss=loss, optimizer=optim, metrics=metrics)

batch_size = 64
epochs = 10
model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, verbose=2)
model.evaluate(x_test, y_test, batch_size=batch_size, verbose=2)
This Recurrent Neural Network works in a simple way. It uses a greyscale image set of handwritten digits (MNIST), treating each 28x28 image as a sequence of 28 rows and comparing its predictions against the image labels. It prepares the images by normalising the pixel values, then runs batches of 64 images through the whole dataset (60,000 training images, with 10,000 test images) for 10 epochs.
On testing it had an accuracy of 100%, so it worked incredibly well.
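As a quick way to see what the trained model actually predicts, a sanity check along these lines could be run straight after the script above. This snippet is my own illustration, assuming model, x_test, and y_test are still in scope; since the final Dense layer outputs raw logits, taking the argmax gives the predicted digit.

import numpy as np

# Hypothetical sanity check: predict a handful of unseen test digits
sample = x_test[:5]                # five 28x28 test images
logits = model.predict(sample)     # shape (5, 10), raw logits
predictions = np.argmax(logits, axis=1)
print("Predicted digits:", predictions)
print("Actual digits:   ", y_test[:5])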
CNN Neural Network
I wanted to learn how to build a CNN. As an initial test, similar to what I would be doing later with image and text recognition, I created a neural network that could identify whether or not there was text in an image. This involved finding around 200 images of text and 200 stock images, giving me two sets: one of non-text images and one of text images. To complete the training needed for the network to work, I had to preprocess the images, resizing them so they had the right dimensions for the network's input layer (with a flatten step before the dense layers), so the CNN could identify whether there was text or not.
This also included filtering through the downloaded data and ensuring only supported image file types were processed, so as not to break the pipeline. The training:validation:testing split was 70%:20%:10%, in line with most modern CNN systems.
import tensorflow as tf
import os
import cv2
from PIL import Image
from matplotlib import pyplot as plt
import numpy as np
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from tensorflow.keras.metrics import BinaryAccuracy, Precision, Recall
gpus = tf.config.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
data_dir = 'data'
image_exts = ['jpeg', 'jpg', 'png', 'bmp']
# Files to skip
skip_files = {'.DS_Store', '._.DS_Store', 'Thumbs.db', '.gitkeep'}
def get_image_format(image_path):
    """Get image format using Pillow"""
    try:
        with Image.open(image_path) as img:
            return img.format.lower()
    except Exception:
        return None

def is_valid_image_file(filename):
    """Check if file should be processed"""
    # Skip system files
    if filename in skip_files or filename.startswith('.'):
        return False
    # Check file extension
    ext = filename.lower().split('.')[-1] if '.' in filename else ''
    return ext in image_exts
print(os.listdir(data_dir))
for image_class in os.listdir(data_dir):
    class_path = os.path.join(data_dir, image_class)
    # Skip files, only process directories
    if not os.path.isdir(class_path):
        continue
    print(f"Processing class: {image_class}")
    for image in os.listdir(class_path):
        # Skip non-image files
        if not is_valid_image_file(image):
            print(f"Skipping system file: {image}")
            continue
        image_path = os.path.join(class_path, image)
        try:
            img = cv2.imread(image_path)
            if img is None:
                print(f"Cannot read image: {image_path}")
                os.remove(image_path)
                continue
            format_type = get_image_format(image_path)
            if format_type not in image_exts:
                print(f"Skipping {image_path} - unsupported format: {format_type}")
                os.remove(image_path)
        except Exception as e:
            print(f"Error processing {image_path}: {e}")
            if os.path.exists(image_path):
                os.remove(image_path)
data = tf.keras.utils.image_dataset_from_directory('data')
data_iterator = data.as_numpy_iterator()
batch = data_iterator.next()
fig, ax = plt.subplots(ncols=4, figsize=(20, 20))
for idx, img in enumerate(batch[0][:4]):
    ax[idx].imshow(img.astype(int))
    ax[idx].set_title(batch[1][idx])

# Scale pixel values to the range [0, 1]
data = data.map(lambda x, y: (x / 255, y))
scaled_max = data.as_numpy_iterator().next()[0].max()  # sanity check: should now be <= 1.0
print(len(data))
train_size = int(len(data) * 0.7)
val_size = int(len(data) * 0.2)+1
test_size = int(len(data)*0.1)+1
print(f"Train size: {train_size}, Validation size: {val_size}, Test size: {test_size}")
train = data.take(train_size)
val = data.skip(train_size).take(val_size)
test = data.skip(train_size + val_size).take(test_size)
model = Sequential()
#Input layer
model.add(Conv2D(16, (3, 3), 1, activation='relu', input_shape=(256, 256, 3)))
model.add(MaxPooling2D())
#Hidden layers
model.add(Conv2D(32, (3, 3), 1, activation='relu'))
model.add(MaxPooling2D())
model.add(Conv2D(16, (3, 3), 1, activation='relu'))
model.add(MaxPooling2D())
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dense(1,activation='sigmoid'))
model.compile('adam',loss=tf.losses.BinaryCrossentropy(), metrics=['accuracy'])
model.summary()
logdir='logs'
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=logdir)
history = model.fit(train, epochs=20, validation_data=val, callbacks=[tensorboard_callback])
precision = Precision()
recall = Recall()
accuracy = BinaryAccuracy()
for batch in test.as_numpy_iterator():
    X, y = batch
    yhat = model.predict(X)
    precision.update_state(y, yhat)
    recall.update_state(y, yhat)
    accuracy.update_state(y, yhat)

# Report the test-set metrics, then save the trained model
print(f"Precision: {precision.result().numpy()}, Recall: {recall.result().numpy()}, Accuracy: {accuracy.result().numpy()}")
model.save(os.path.join('models', 'text_detection_model.h5'))
I then tested it with a non text image and an image with text present. This is the code I used to do this:
import tensorflow as tf
import os
import cv2
from PIL import Image
from matplotlib import pyplot as plt
import numpy as np
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from tensorflow.keras.metrics import BinaryAccuracy, Precision, Recall
# Load a test image and resize it to the model's expected input size
img = cv2.imread('tt2.jpg')
resized_img = tf.image.resize(img, (256, 256))
plt.imshow(resized_img.numpy().astype(int))
plt.show()

# Load the saved model and run a single prediction
model = load_model(os.path.join('models', 'text_detection_model.h5'))
yhat = model.predict(np.expand_dims(resized_img / 255, axis=0))
print(f"Prediction: {yhat[0][0]}")

if yhat[0][0] > 0.5:
    print("Text detected with new model")
else:
    print("No text detected with new model")
I tested it twice, with two images of each type that it had not seen before, and it was correct every time.
Combining both a CNN and LSTM into a custom OCR
As of right now it is still training, but initial tests indicate that this will work well. I am somewhat limited by my current hardware: because the dataset is huge, it is taking hours to produce one epoch of results. I do, however, have a few models which may work.
So far, training has moved the loss from 15,572 down to 14,250.
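The full training code is described file by file below. To make the architecture concrete first, here is a minimal sketch (my own illustration, not the exact contents of OCR_model.py) of the kind of CRNN layer stack involved: convolutional layers extract visual features, the feature map is reshaped into a sequence, and recurrent layers model the character order, with one extra output class reserved for the CTC 'blank' symbol. The input shape, layer sizes, and vocabulary size here are assumptions for illustration.

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

def build_crnn(img_height=32, img_width=128, vocab_size=80):
    # Hypothetical input shape: greyscale word images, 32 pixels high
    inputs = keras.Input(shape=(img_height, img_width, 1))

    # CNN feature extractor
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
    x = layers.MaxPooling2D(pool_size=2)(x)   # -> 16 x 64
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(pool_size=2)(x)   # -> 8 x 32

    # Turn the feature map into a sequence: one timestep per remaining width step
    x = layers.Permute((2, 1, 3))(x)                                 # (width, height, channels)
    x = layers.Reshape((img_width // 4, (img_height // 4) * 64))(x)  # (32 timesteps, 512 features)

    # RNN sequence modelling
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
    x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)

    # vocab_size characters plus one extra class for the CTC blank symbol
    outputs = layers.Dense(vocab_size + 1, activation="softmax")(x)
    return keras.Model(inputs, outputs)

model = build_crnn()
model.summary()
# The CTC loss itself is attached when the model is compiled for training
# (in my scripts this comes from the mltu utilities).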
- OCR_train.py (Main Training Script)
  - Purpose: the main entry point for training the OCR model using a CRNN (CNN+RNN) architecture with CTC loss, leveraging the mltu (Machine Learning Training Utilities) library for scalable data handling and training.
  - Key sections:
    - Imports & environment setup: imports TensorFlow, Keras callbacks, and mltu utilities for preprocessing, annotation, transformation, loss, metrics, and callbacks. Also sets up GPU memory growth for efficient resource usage.
    - Configuration: loads model and training configuration from OCR_configs.py via ModelConfigs.
    - Data paths: points to the dataset directory (90kDICT32px) and the annotation files for training and validation.
    - Annotation parsing: defines read_annotation_file to parse annotation files, extracting image paths and labels, building a dataset and vocabulary, and tracking the maximum label length.
    - Dataset preparation: loads and limits the training and validation datasets to a manageable size for fast prototyping (first 1,000 train, first 200 validation).
    - Config updates: updates and saves the vocabulary and maximum text length in the config object.
    - Data providers: uses mltu's DataProvider to handle batching, preprocessing, and transformation of images and labels for training and validation.
    - Model compilation: instantiates the model via train_model and compiles it with the Adam optimizer, CTC loss, and a custom character error rate metric.
    - Callbacks: sets up callbacks for early stopping, checkpointing, logging, TensorBoard, learning rate reduction, and ONNX export.
    - Training: trains the model using the data providers and callbacks.
    - Dataset export: saves the train and validation datasets as CSV files for reproducibility and analysis.
- OCR_model.py (Model Definition)
  - Purpose: contains the definition of the CRNN model architecture, including CNN layers for feature extraction, RNN layers for sequence modelling, and a final dense layer for character prediction.
  - Abilities: customisable architecture (e.g. residual blocks, layer sizes); designed to work with variable-length text via CTC loss; can be adapted for different input sizes and vocabularies.
- OCR_configs.py (Configuration Management)
  - Purpose: stores and manages all configuration parameters for the OCR pipeline, such as image dimensions, batch size, vocabulary, maximum text length, learning rate, model save path, and number of epochs.
  - Abilities: centralised config for easy experimentation and reproducibility; can be saved and loaded to persist settings across runs.
- OCR_infmodel.py (Inference Model)
  - Purpose: contains the logic for loading the trained model and running inference on new images.
  - Abilities: preprocesses input images; decodes model outputs to readable text using the trained vocabulary and CTC decoding (a minimal decoding sketch is shown after this list).
- annotation_train.txt, annotation_val.txt, annotation_test.txt (Annotation Files)
  - Purpose: list image file paths and their corresponding text labels for training, validation, and testing.
  - Abilities: used by the data loader to map images to ground-truth text; can be subsetted for fast prototyping or full-scale training.
- 90kDICT32px/ (Dataset Directory)
  - Purpose: contains all image data and annotation files for OCR training and evaluation.
  - Abilities: supports large-scale training with tens of thousands of samples; can be subsetted for quick experiments.
- mltu Library Integration
  - Purpose: provides robust utilities for image reading, preprocessing, label encoding, padding, batching, and metrics.
  - Abilities: handles complex data pipelines with minimal code; supports scalable training and evaluation.

Key features of the pipeline:
- Flexible data loading: can parse and use any annotation file in the expected format.
- Configurable preprocessing: image size, batch size, and label handling are easily changed via configs.
- Fast prototyping: dataset size limiting allows for quick experiments.
- Robust training pipeline: includes all standard callbacks for monitoring, saving, and optimising training.
- Metrics and losses: uses CTC loss for variable-length sequence prediction and CER (character error rate) for OCR accuracy.
- Export and logging: supports TensorBoard, CSV export, and ONNX conversion for deployment.
- Inference support: can run predictions on new images using the trained model.
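For the decoding step mentioned under OCR_infmodel.py, the core idea of greedy CTC decoding is simple: take the most likely class at each timestep, collapse repeated characters, and drop the blank symbol. Here is a small illustrative sketch in plain NumPy (my own example, not the actual OCR_infmodel.py code; the vocabulary and the position of the blank class are assumptions):

import numpy as np

def greedy_ctc_decode(probs, vocab, blank_index=None):
    """Greedy CTC decoding for one image.

    probs: array of shape (timesteps, len(vocab) + 1), the model's softmax output.
    vocab: the character set learned during training.
    """
    if blank_index is None:
        blank_index = probs.shape[-1] - 1    # assume the blank is the last class
    best_path = np.argmax(probs, axis=-1)    # most likely class at each timestep
    chars = []
    prev = blank_index
    for idx in best_path:
        # Keep a character only if it is not blank and not a repeat of the previous step
        if idx != blank_index and idx != prev:
            chars.append(vocab[idx])
        prev = idx
    return "".join(chars)

# Example: a best path of h,h,e,e,l,blank,l,l,blank,o,o collapses to "hello"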
Summary:
The OCR pipeline is modular, scalable, and designed for both rapid experimentation and full-scale training. It leverages best practices for deep learning OCR, including a CRNN architecture, CTC loss, and robust data handling via mltu. All major steps (data loading, preprocessing, training, evaluation, and inference) are covered and easily configurable.
Conclusion
I now have to make sure the model keeps improving over time, or perhaps shift to a more reasonable set of parameters and a smaller amount of data, as it takes days to truly train. I will continue training it in the off-time with a smaller dataset that should still work to a sufficient level. Then I can hook it up to an HTML page to get it working to a usable standard for normal users. All neural networks were built from scratch using tutorials available online.
Post Conclusion
Through repeated attempts at training, I have realised that when I switched to the smaller dataset, the amount of training data was simply not enough. Training in hours rather than days was useful, but with so little data the model fell into a local minimum, and ended up either producing no text at all when I tested it, or finding one character and writing that character repeatedly. Having enough training data is a key part of any neural network, and I cannot take it for granted in future like I did when I initially wrote this blog post.