Experimenting with Optical Character Recognition

Short summary

The purpose of this repo is to collect details about various efforts across the Congruence Engine to experiment with Optical Character Recognition technology. It includes google colab notebooks and pipelines used by the project to harness OCR tools, mostly with the intention of extracting raw text for subsequent analysis, as well as post-processing.

Research question

What tools are available for extracting text from historic materials, and how are these changing in the wake of Large Language Models?

People

Max Long: Investigation, Data curation, Formal analysis, Methodology, Writing

Natasha Kitcher: Investigation, Data curation, Formal analysis, Methodology, Writing

Daniel Belteki: Investigation, Data curation, Formal analysis, Methodology, Writing

Alex Butterworth: Investigation, Methodology

Nayomi Kasthuri Arachchi: Software

Felix Needham-Simpson: Software

Data sources (used or developed)

Worralls trade directory and other trade directories
Weekly Wool Chart
1871 River Pollution report

Investigation methods/ tools/ code/ software (used or developed)

ABBYY, Surya, Tesseract, LLM vision models

Outputs

Google colab notebooks

Licence

This work is licensed under a Creative Commons Attribution 4.0 License - CC BY 4.0.