Congruence Engine investigations

Data, code and documentation

Experimenting with Optical Character Recognition

Short summary

This repository collects details of the experiments with Optical Character Recognition (OCR) technology carried out across the Congruence Engine. It includes the Google Colab notebooks and pipelines the project used to run OCR tools, mostly to extract raw text for subsequent analysis, and to post-process the results.

Research question

People

Max Long: Investigation, Data curation, Formal analysis, Methodology, Writing

Natasha Kitcher: Investigation, Data curation, Formal analysis, Methodology, Writing

Daniel Belteki: Investigation, Data curation, Formal analysis, Methodology, Writing

Alex Butterworth: Investigation, Methodology

Nayomi Kasthuri Arachchi: Software

Felix Needham-Simpson: Software

Data sources (used or developed)

Investigation methods/ tools/ code/ software (used or developed)

ABBYY, Surya, Tesseract, LLM vision models
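
Raw text from any of these tools usually needs some light clean-up before analysis. The following is a minimal sketch of what such a post-processing step might look like, not the project's actual pipeline: the function name `clean_ocr_text` and the clean-up rules it applies are assumptions chosen for illustration, and it uses only the Python standard library.

```python
import re


def clean_ocr_text(raw: str) -> str:
    """Apply light, generic clean-up to raw OCR output.

    Hypothetical helper for illustration only; the rules below are
    assumptions, not the Congruence Engine's own post-processing.
    """
    # Normalise Windows and old Mac line endings to "\n".
    text = raw.replace("\r\n", "\n").replace("\r", "\n")

    # Rejoin words split by a hyphen at a line break: "docu-\nment" -> "document".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)

    # Treat blank lines as paragraph breaks, then unwrap the remaining
    # single line breaks and collapse runs of whitespace inside each paragraph.
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = []
    for paragraph in paragraphs:
        paragraph = re.sub(r"\s+", " ", paragraph).strip()
        if paragraph:
            cleaned.append(paragraph)

    return "\n\n".join(cleaned)


if __name__ == "__main__":
    sample = "The engi-\nneering works   were\nopened in 1851.\n\n\nA second para-\ngraph."
    print(clean_ocr_text(sample))
```

Note that the hyphen rule is deliberately naive: it would also merge genuine compounds that happen to break across lines (for example "well-" followed by "known"), so a real pipeline would probably check rejoined words against a dictionary first.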

Outputs

Google Colab notebooks

Licence

This work is licensed under a Creative Commons Attribution 4.0 License (CC BY 4.0).