# StellaAthena/ocr-comparison: Compare OCR engines (Tesseract vs EasyOCR) with visualization and accuracy metrics
## Key Points
- The `ocr-comparison` system is a Python library for comparing Optical Character Recognition (OCR) engines, currently supporting Tesseract and EasyOCR, with robust visualization and accuracy metrics.
- It offers multi-engine comparison, several visualization modes (overlay, side-by-side, diff, and a split+flip viewer), and evaluates performance using metrics such as Character Error Rate (CER) and Word Error Rate (WER).
- The library is designed with an extensible adapter architecture, enabling users to integrate and compare new OCR engines by implementing a `BaseOCRAdapter` interface.
The repository provides the OCR Comparison System, a Python library designed for the comprehensive comparison of Optical Character Recognition (OCR) engines. It provides tools for multi-engine evaluation, visualization, and accuracy metric calculation, supporting Tesseract and EasyOCR natively with an extensible architecture for additional engines.
The core methodology is built upon a standardized adapter pattern and a set of well-defined data models.
## Core Methodology and Technical Details
- Adapter Architecture: The system employs an Adapter pattern to integrate diverse OCR engines. Each engine is encapsulated within a class inheriting from `BaseOCRAdapter`. This adapter is responsible for:
  - Lazy Initialization (`_initialize_engine()`): OCR engine instances are initialized only on their first use, optimizing resource consumption. This method must handle `ImportError` for missing dependencies, guiding users on installation.
  - Standardized Processing: This is the crucial step where the adapter interacts with its specific OCR library (e.g., Tesseract or EasyOCR) to process an input image (provided as an RGB NumPy array). It then maps the raw, engine-specific OCR output (text, bounding boxes, confidence scores) into the library's canonical data models: `BoundingBox`, `OCRWord`, and `OCRResult`. The `processing_time` and `image_size` are also captured.
  - Standardized Output: Regardless of the underlying OCR engine, the `process()` method (a public API provided by the base class) guarantees an `OCRResult` object, ensuring consistency for subsequent processing, visualization, and evaluation steps.
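The adapter contract above can be sketched as follows. The names `BaseOCRAdapter`, `_initialize_engine()`, and `process()` come from the text; the `_run()` hook, the `DummyAdapter` class, and the returned dict shape are illustrative assumptions, not the library's actual API:

```python
# Hypothetical sketch of the adapter pattern described above;
# the real BaseOCRAdapter may differ in signatures and fields.
import time
from abc import ABC, abstractmethod


class BaseOCRAdapter(ABC):
    """Base class every engine adapter inherits from (sketch)."""

    def __init__(self):
        self._engine = None  # created lazily on first use

    @abstractmethod
    def _initialize_engine(self):
        """Create the underlying engine; should raise a helpful
        error (e.g. on ImportError) if the dependency is missing."""

    @abstractmethod
    def _run(self, image):
        """Engine-specific recognition; returns (text, confidence) pairs."""

    def process(self, image):
        """Public API: lazily initialize, run, and return a standardized result."""
        if self._engine is None:
            self._engine = self._initialize_engine()
        start = time.perf_counter()
        words = self._run(image)
        return {
            "engine_name": type(self).__name__,
            "words": words,
            "processing_time": time.perf_counter() - start,
        }


class DummyAdapter(BaseOCRAdapter):
    """Stand-in engine used here only to demonstrate the pattern."""

    def _initialize_engine(self):
        return object()  # a real adapter would import and construct the engine

    def _run(self, image):
        return [("hello", 0.99)]
```

The key design point is that subclasses only implement the two private hooks; the base class owns lazy initialization, timing, and the standardized return shape.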
- Data Models:
  - `BoundingBox`: Represents a detected text region, defined by its top-left coordinates `(x, y)`, `width`, `height`, and an optional `angle` for rotated text. It provides properties like `x2`, `y2`, `center`, `area`, and a method for `iou` (Intersection over Union).
  - `OCRWord`: Encapsulates a single detected word, linking its `text` string to its `bbox` (`BoundingBox`) and a `confidence` score (normalized 0.0-1.0).
  - `OCRResult`: Aggregates the complete OCR output from a single engine for a given image. It contains a `List[OCRWord]`, the `engine_name`, `processing_time`, and `image_size`. It provides convenience properties like `full_text`, `word_count`, `average_confidence`, and a method `filter_by_confidence()`.
  - `AccuracyMetrics`: Stores quantitative evaluation scores, including `cer`, `wer`, `precision`, `recall`, and `f1`.
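A minimal dataclass sketch of these models, assuming axis-aligned boxes and the field names given above; the library's actual definitions may differ:

```python
# Hypothetical sketch of the data models described above.
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class BoundingBox:
    x: float
    y: float
    width: float
    height: float
    angle: Optional[float] = None  # optional, for rotated text

    @property
    def x2(self) -> float:
        return self.x + self.width

    @property
    def y2(self) -> float:
        return self.y + self.height

    @property
    def area(self) -> float:
        return self.width * self.height

    def iou(self, other: "BoundingBox") -> float:
        """Intersection over Union of two axis-aligned boxes."""
        ix = max(0.0, min(self.x2, other.x2) - max(self.x, other.x))
        iy = max(0.0, min(self.y2, other.y2) - max(self.y, other.y))
        inter = ix * iy
        union = self.area + other.area - inter
        return inter / union if union > 0 else 0.0


@dataclass
class OCRWord:
    text: str
    bbox: BoundingBox
    confidence: float  # normalized 0.0-1.0


@dataclass
class OCRResult:
    words: List[OCRWord]
    engine_name: str
    processing_time: float
    image_size: Tuple[int, int]

    @property
    def full_text(self) -> str:
        return " ".join(w.text for w in self.words)

    @property
    def word_count(self) -> int:
        return len(self.words)

    @property
    def average_confidence(self) -> float:
        return sum(w.confidence for w in self.words) / len(self.words) if self.words else 0.0

    def filter_by_confidence(self, threshold: float) -> List[OCRWord]:
        return [w for w in self.words if w.confidence >= threshold]
```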
- Evaluation (`evaluator.py`): When ground truth text is provided, the library calculates standard accuracy metrics:
  - Character Error Rate (CER): Measures the minimum number of character-level edits (substitutions, insertions, deletions) required to change the OCR-extracted text into the ground truth text, normalized by the length of the ground truth. It is typically calculated using the Levenshtein distance algorithm.
  - Word Error Rate (WER): The same measure computed at the word level: the number of word-level edits (substitutions, insertions, deletions) needed to transform the OCR word sequence into the ground truth word sequence, normalized by the number of ground truth words.
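CER and WER as defined above can be sketched with a plain Levenshtein implementation; the repository's `evaluator.py` may instead use an optimized library, so treat this as an illustration of the definitions, not the actual code:

```python
# Sketch of CER/WER via dynamic-programming Levenshtein distance.
def levenshtein(ref, hyp):
    """Minimum substitutions/insertions/deletions to turn hyp into ref.

    Works on strings (character level) or lists of words (word level).
    """
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,              # deletion
                curr[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),   # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character edits / ground truth length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word edits / number of ground truth words."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    return levenshtein(ref_words, hyp_words) / max(len(ref_words), 1)
```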
  - Precision, Recall, F1 Score: These metrics assess the accuracy of *detection* (i.e., whether bounding boxes correctly identify words) by comparing detected `OCRWord` bounding boxes against ground truth bounding boxes using Intersection over Union (IoU) thresholds.
    - Intersection over Union (IoU): For two bounding boxes A and B, IoU(A, B) = area(A ∩ B) / area(A ∪ B). A detected box is counted as a True Positive (TP) if its IoU with a ground truth box exceeds a predefined threshold; unmatched detections are False Positives (FP) and unmatched ground truth boxes are False Negatives (FN).
    - Precision: TP / (TP + FP), the fraction of detected boxes that match a ground truth box.
    - Recall: TP / (TP + FN), the fraction of ground truth boxes that are detected.
    - F1 Score: The harmonic mean of precision and recall, 2 · Precision · Recall / (Precision + Recall).
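The IoU-thresholded scoring can be sketched as follows. The greedy one-to-one matching strategy used here is an assumption, since the text does not specify how detections are paired with ground truth boxes:

```python
# Hedged sketch of IoU-based detection scoring; boxes are
# (x, y, width, height) tuples, greedy one-to-one matching assumed.
def detection_scores(detected, ground_truth, iou_threshold=0.5):
    """Return (precision, recall, f1) for a set of detected boxes."""

    def iou(a, b):
        ax2, ay2 = a[0] + a[2], a[1] + a[3]
        bx2, by2 = b[0] + b[2], b[1] + b[3]
        ix = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
        iy = max(0.0, min(ay2, by2) - max(a[1], b[1]))
        inter = ix * iy
        union = a[2] * a[3] + b[2] * b[3] - inter
        return inter / union if union > 0 else 0.0

    unmatched = list(ground_truth)
    tp = 0
    for det in detected:
        best = max(unmatched, key=lambda gt: iou(det, gt), default=None)
        if best is not None and iou(det, best) >= iou_threshold:
            tp += 1
            unmatched.remove(best)  # each ground truth box matches at most once
    fp = len(detected) - tp
    fn = len(ground_truth) - tp
    precision = tp / (tp + fp) if detected else 0.0
    recall = tp / (tp + fn) if ground_truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```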
- Visualization (`visualizer.py`): The library offers multiple visualization modes to qualitatively compare OCR outputs:
  - Split + Flip Viewer: Generates separate images for each engine's results, allowing interactive flipping between them while keeping the underlying image perfectly aligned, which is ideal for spotting subtle differences.
  - Side-by-Side: Displays all engine results horizontally or in a grid.
  - Overlay: Draws bounding boxes from multiple engines onto a single image using distinct colors (e.g., Tesseract in blue, EasyOCR in green).
  - Diff View: Highlights disagreements between engines, showing matched detections in gray and engine-unique detections in engine-specific colors.
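The overlay mode can be sketched with a NumPy-only box drawer. The `draw_box` and `overlay` helpers and the color mapping below are illustrative assumptions; the actual `visualizer.py` likely uses a drawing library such as OpenCV or Pillow:

```python
# Minimal NumPy-only sketch of the overlay visualization mode:
# draw each engine's boxes in a distinct color on one RGB image.
import numpy as np


def draw_box(image: np.ndarray, box, color):
    """Draw a 1-pixel rectangle outline; box is (x, y, width, height)."""
    x, y, w, h = box
    image[y, x:x + w] = color          # top edge
    image[y + h - 1, x:x + w] = color  # bottom edge
    image[y:y + h, x] = color          # left edge
    image[y:y + h, x + w - 1] = color  # right edge


def overlay(image: np.ndarray, results: dict) -> np.ndarray:
    """Overlay boxes from several engines, one color per engine."""
    colors = {"tesseract": (0, 0, 255), "easyocr": (0, 255, 0)}  # blue / green
    out = image.copy()  # leave the input image untouched
    for engine, boxes in results.items():
        for box in boxes:
            draw_box(out, box, colors.get(engine, (255, 0, 0)))
    return out
```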
The `OCRComparator` class orchestrates this process, managing registered adapters, processing images, generating visualizations, and calculating accuracy metrics based on these standardized data flows.