GitHub - StellaAthena/ocr-comparison: Compare OCR engines (Tesseract vs EasyOCR) with visualization and accuracy metrics

StellaAthena
2026.01.28
#OCR #Python #Tesseract #EasyOCR #Comparison

Key Points

  • The `ocr-comparison` project is a Python library for comparing Optical Character Recognition (OCR) engines, currently supporting Tesseract and EasyOCR, with visualization and accuracy metrics.
  • It offers multi-engine comparison, several visualization modes (overlay, side-by-side, diff, and a split+flip viewer), and evaluates performance using metrics such as Character Error Rate (CER) and Word Error Rate (WER).
  • The library is built on an extensible adapter architecture, so users can integrate and compare new OCR engines by implementing a `BaseOCRAdapter` interface.

The repository presents the "OCR Comparison System," a Python library designed for the comprehensive comparison of Optical Character Recognition (OCR) engines. It provides tools for multi-engine evaluation, visualization, and accuracy metric calculation, supporting Tesseract and EasyOCR natively with an extensible architecture for additional engines.

The core methodology is built upon a standardized adapter pattern and a set of well-defined data models.

Core Methodology and Technical Details:

  1. Adapter Architecture: The system employs an Adapter pattern to integrate diverse OCR engines. Each engine is encapsulated within a class inheriting from BaseOCRAdapter. This adapter is responsible for:
    • Lazy Initialization (_initialize_engine()): OCR engine instances are initialized only on their first use, optimizing resource consumption. This method must handle ImportError for missing dependencies, guiding users on installation.
    • Standardized Processing (`_process(image: np.ndarray) -> OCRResult`): This is the crucial method where the adapter interacts with its specific OCR library (e.g., Tesseract or EasyOCR) to process an input image (provided as an RGB NumPy array). It then maps the raw, engine-specific OCR output (text, bounding boxes, confidence scores) into the library's canonical data models: BoundingBox, OCRWord, and OCRResult. The processing_time and image_size are also captured.
    • Standardized Output: Regardless of the underlying OCR engine, the process() method (a public API provided by the base class) guarantees an OCRResult object, ensuring consistency for subsequent processing, visualization, and evaluation steps.
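The adapter pattern described above can be sketched as follows. This is an illustrative reconstruction, not the library's actual source: the class and method names (`BaseOCRAdapter`, `_initialize_engine`, `_process`, `process`) follow this summary, and the `TesseractAdapter` import handling is an assumed example of the ImportError guidance.

```python
import numpy as np


class BaseOCRAdapter:
    """Sketch of the base adapter: lazy init plus a standardized public process()."""

    def __init__(self):
        self._engine = None  # not created until first use

    def _initialize_engine(self):
        raise NotImplementedError

    def _process(self, image: np.ndarray) -> "OCRResult":
        raise NotImplementedError

    def process(self, image: np.ndarray) -> "OCRResult":
        # Lazy initialization: the engine is constructed only on first call.
        if self._engine is None:
            self._initialize_engine()
        return self._process(image)


class TesseractAdapter(BaseOCRAdapter):
    def _initialize_engine(self):
        try:
            import pytesseract  # optional dependency
        except ImportError as e:
            # Guide the user toward installation, as the text describes.
            raise ImportError("pytesseract is required: pip install pytesseract") from e
        self._engine = pytesseract
```

Because `process()` lives on the base class, every engine shares the same lazy-initialization and standardized-output behavior; subclasses only supply `_initialize_engine()` and `_process()`.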
  2. Data Models:
    • BoundingBox: Represents a detected text region, defined by its top-left coordinates (x, y), width, height, and an optional angle for rotated text. It provides properties like x2, y2, center, area, and a method for iou (Intersection over Union).
    • OCRWord: Encapsulates a single detected word, linking its text string to its bbox (BoundingBox) and a confidence score (normalized 0.0-1.0).
    • OCRResult: Aggregates the complete OCR output from a single engine for a given image. It contains a List[OCRWord], the engine_name, processing_time, and image_size. It provides convenience properties like full_text, word_count, average_confidence, and a method filter_by_confidence().
    • AccuracyMetrics: Stores quantitative evaluation scores, including cer, wer, precision, recall, and f1.
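A minimal sketch of the BoundingBox model, assuming axis-aligned boxes and the properties named above; the field layout and the `iou()` implementation are reconstructions for illustration, not the library's actual code (the optional `angle` is carried but ignored in the IoU computation for simplicity).

```python
from dataclasses import dataclass


@dataclass
class BoundingBox:
    x: float       # top-left x
    y: float       # top-left y
    width: float
    height: float
    angle: float = 0.0  # optional rotation; ignored below for simplicity

    @property
    def x2(self) -> float:
        return self.x + self.width

    @property
    def y2(self) -> float:
        return self.y + self.height

    @property
    def area(self) -> float:
        return self.width * self.height

    @property
    def center(self) -> tuple:
        return (self.x + self.width / 2, self.y + self.height / 2)

    def iou(self, other: "BoundingBox") -> float:
        # Axis-aligned intersection rectangle, clamped to zero when disjoint.
        iw = max(0.0, min(self.x2, other.x2) - max(self.x, other.x))
        ih = max(0.0, min(self.y2, other.y2) - max(self.y, other.y))
        inter = iw * ih
        union = self.area + other.area - inter
        return inter / union if union > 0 else 0.0
```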
  3. Evaluation (evaluator.py): When ground truth text is provided, the library calculates standard accuracy metrics:
    • Character Error Rate (CER): Measures the minimum number of edits (substitutions, insertions, deletions) required to change the OCR-extracted text into the ground truth text, normalized by the length of the ground truth. It is typically calculated using the Levenshtein distance algorithm.
$$\text{CER} = \frac{\text{LevenshteinDistance}(\text{OCR Text}, \text{Ground Truth Text})}{\text{Length of Ground Truth Text}}$$
  • Word Error Rate (WER): Similar to CER but calculated at the word level. It quantifies the number of word-level edits (substitutions, insertions, deletions) needed to transform the OCR word sequence into the ground truth word sequence, normalized by the number of ground truth words.
$$\text{WER} = \frac{\text{LevenshteinDistance}(\text{OCR Word Sequence}, \text{Ground Truth Word Sequence})}{\text{Number of Ground Truth Words}}$$
  • Precision, Recall, F1 Score: These metrics assess the accuracy of *detection* (i.e., whether bounding boxes correctly identify words) by comparing detected OCRWord bounding boxes against ground truth bounding boxes using Intersection over Union (IoU) thresholds.
    • Intersection over Union (IoU): For two bounding boxes $A$ and $B$, $\text{IoU}(A, B) = \frac{\text{Area}(A \cap B)}{\text{Area}(A \cup B)}$. A detected box is considered a True Positive if its IoU with a ground truth box exceeds a predefined threshold.
    • Precision: $\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$
    • Recall: $\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$
    • F1 Score: The harmonic mean of precision and recall: $\text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$
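The CER and WER formulas above can be implemented with a plain dynamic-programming Levenshtein distance. This is a hedged sketch: the actual evaluator.py may delegate to a library such as python-Levenshtein or jiwer, and the function names here (`levenshtein`, `cer`, `wer`) are illustrative.

```python
def levenshtein(a, b) -> int:
    """Edit distance; works on strings (characters) or lists (words)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def cer(ocr_text: str, truth: str) -> float:
    # Character-level edits, normalized by ground-truth length.
    return levenshtein(ocr_text, truth) / max(len(truth), 1)


def wer(ocr_text: str, truth: str) -> float:
    # Word-level edits, normalized by the number of ground-truth words.
    truth_words = truth.split()
    return levenshtein(ocr_text.split(), truth_words) / max(len(truth_words), 1)
```

For example, `cer("hello", "hallo")` is 0.2 (one substitution over five ground-truth characters), while `wer("the kat sat", "the cat sat")` is 1/3 (one wrong word out of three).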
  4. Visualization (visualizer.py): The library offers multiple visualization modes to qualitatively compare OCR outputs:
    • Split + Flip Viewer: Generates separate images for each engine's results, allowing interactive flipping between them while keeping the underlying image perfectly aligned, ideal for discerning subtle differences.
    • Side-by-Side: Displays all engine results horizontally or in a grid.
    • Overlay: Overlays bounding boxes from multiple engines onto a single image using distinct colors (e.g., Tesseract in blue, EasyOCR in green).
    • Diff View: Highlights disagreements between engines, showing matched detections in gray and unique detections in engine-specific colors.
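The overlay mode can be sketched with plain NumPy slicing; the real visualizer.py likely draws with OpenCV or PIL, and the `(x, y, w, h)` box tuples, engine-name keys, and color palette here are assumptions chosen to match the colors mentioned above (Tesseract blue, EasyOCR green, in RGB).

```python
import numpy as np


def overlay_boxes(image: np.ndarray, results: dict, thickness: int = 2) -> np.ndarray:
    """Draw each engine's boxes on a copy of the RGB image in a distinct color.

    results maps engine name -> list of (x, y, w, h) integer box tuples.
    """
    out = image.copy()
    palette = {"tesseract": (0, 0, 255), "easyocr": (0, 255, 0)}  # blue, green (RGB)
    for engine, boxes in results.items():
        color = palette.get(engine, (255, 0, 0))  # fallback: red
        for (x, y, w, h) in boxes:
            t = thickness
            out[y:y + t, x:x + w] = color          # top edge
            out[y + h - t:y + h, x:x + w] = color  # bottom edge
            out[y:y + h, x:x + t] = color          # left edge
            out[y:y + h, x + w - t:x + w] = color  # right edge
    return out
```

Drawing all engines onto one frame is what makes disagreements visible at a glance; the diff view described above is the same idea with match/mismatch-dependent colors instead of per-engine ones.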

The OCRComparator class orchestrates this process, managing registered adapters, processing images, generating visualizations, and calculating accuracy metrics based on these standardized data flows.
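A hypothetical sketch of that orchestration flow: the class name `OCRComparator` comes from the text, but the method names (`register`, `compare`) and the stub adapter are assumptions for illustration, not the repository's verified API.

```python
import numpy as np


class OCRComparator:
    """Sketch: hold registered adapters and run each one on the same image."""

    def __init__(self):
        self._adapters = {}

    def register(self, name: str, adapter) -> None:
        self._adapters[name] = adapter

    def compare(self, image: np.ndarray) -> dict:
        # Collect each engine's standardized result under its name.
        return {name: a.process(image) for name, a in self._adapters.items()}


class StubAdapter:
    """Stand-in for a real BaseOCRAdapter subclass, for demonstration only."""

    def __init__(self, text: str):
        self._text = text

    def process(self, image: np.ndarray) -> str:
        return self._text
```

In the real library the per-engine results would be OCRResult objects, which then feed the visualization and evaluation steps described above.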