The Old Bailey, U.S. Reports, and OCR: Benchmarking AWS, Azure, and GCP on 360,000 Page Images

Ughetta, William

Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp01q811kn743

Title:	The Old Bailey, U.S. Reports, and OCR: Benchmarking AWS, Azure, and GCP on 360,000 Page Images
Authors:	Ughetta, William
Advisors:	Kernighan, Brian
Department:	Computer Science
Class Year:	2021
Abstract:	Court records spanning the entire eighteenth and nineteenth centuries present a compelling benchmark for leading Optical Character Recognition (OCR) cloud providers on historical, English-language documents. The Proceedings of the Old Bailey is a corpus of over 180,000 pages of court records and last words published in England from 1674 to 1913. The U.S. Reports, selected volumes 1 through 241, is a collection of over 180,000 pages from the United States Supreme Court and predecessor courts ranging from 1754 to 1915. The Old Bailey is uniquely suited for benchmarking OCR, since all 180,000 images have been transcribed by humans. The U.S. Reports will be useful as a relative measure of similarity between the providers, instead of an absolute comparison to human performance. Although these two datasets largely span the same period, there are significant differences in their layout, printing, preservation, scanning, and even digital formats. The goal of this thesis is to benchmark three leading cloud OCR services on the 360,000 historical documents from the Old Bailey and U.S. Reports datasets and to automate an exploratory visualization of the results. The three OCR services are: Amazon Web Services’ Textract (AWS); Microsoft Azure’s Cognitive Services OCR (Azure); and Google Cloud Platform’s Vision (GCP). This is the second time the Old Bailey has been benchmarked on AWS, Azure, and GCP, approximately nine months after the first, and the first time for the U.S. Reports, volumes 1 through 241. We found that AWS had the lowest median Character Error Rate (CER) across both the Old Bailey and the U.S. Reports and that GCP had the lowest median round-trip time, under one second, for both datasets. Towards the end, we added three more providers to our benchmark in order to evaluate Microsoft’s latest OCR service in addition to two free solutions. We also automated the process of visualizing the results with the tigerocr tool. Specifically, the tigerocr explore command takes a PDF, converts it to images, executes OCR on each image, and arranges the resulting metrics in a website Explorer so that the results, potentially of massive scale, can be viewed on a single dashboard.
URI:	http://arks.princeton.edu/ark:/88435/dsp01q811kn743
Type of Material:	Princeton University Senior Theses
Language:	en
Appears in Collections:	Computer Science, 1987-2024
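
The abstract reports accuracy as a median Character Error Rate (CER) but does not define the metric. The sketch below assumes the common definition, Levenshtein edit distance between a human reference transcription and the OCR output divided by the reference length. The thesis may normalize whitespace, case, or punctuation differently. The function names and the sample strings are illustrative and are not taken from the thesis or from tigerocr.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # distance is symmetric; keep the stored row short
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER under the assumed definition: edit distance / reference length."""
    if not reference:
        raise ValueError("reference transcription must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Illustrative strings only (a long-s misread and a dropped letter):
# 2 edits over 26 reference characters.
cer = character_error_rate("the prisoner was acquitted",
                           "the prifoner was acquited")
print(f"CER = {cer:.4f}")
```

Under this definition CER can exceed 1.0 when the OCR output is much longer than the reference, which is one reason a median across pages is a more robust summary than a mean.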

Files in This Item:

File	Description	Size	Format
UGHETTA-WILLIAM-THESIS.pdf		8.83 MB	Adobe PDF
