feat: integrate Tesseract.js with improved language availability and font handling

- Refactored OCR page recognition to utilize a configured Tesseract worker.
- Added functions to manage font URLs and asset filenames based on language.
- Implemented language availability checks and error handling for unsupported languages.
- Enhanced PDF workflow to display available OCR languages and handle user selections.
- Introduced utility functions for resolving Tesseract asset configurations.
- Added tests for OCR functionality, font loading, and Tesseract runtime behavior.
- Updated global types to include environment variables for Tesseract and font configurations.
2026-03-14 15:50:30 +05:30
parent 58c78b09d2
commit 77da6d7a7d
23 changed files with 1906 additions and 564 deletions
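
The `.env.example` change in this commit asks that the three Tesseract asset URLs be set together or left empty. As a rough illustration of that rule, the sketch below resolves the variables into the `workerPath`, `corePath` and `langPath` options that Tesseract.js accepts. The function names `resolveTesseractAssets` and `parseAvailableLanguages`, the `OcrEnv` interface, and the comma-separated format of `VITE_TESSERACT_AVAILABLE_LANGUAGES` are assumptions made for this example; they are not taken from the commit and may differ from the actual utilities.

```typescript
// Hypothetical view of the OCR-related env vars. Only the variable names
// come from .env.example; the rest of this sketch is an assumption.
interface OcrEnv {
  VITE_TESSERACT_WORKER_URL?: string;
  VITE_TESSERACT_CORE_URL?: string;
  VITE_TESSERACT_LANG_URL?: string;
  VITE_TESSERACT_AVAILABLE_LANGUAGES?: string;
}

// Field names mirror the Tesseract.js worker options of the same names.
interface TesseractAssetConfig {
  workerPath: string;
  corePath: string;
  langPath: string;
}

// Treat undefined and whitespace-only values as "not set".
function clean(value: string | undefined): string {
  return (value ?? "").trim();
}

// Returns undefined when no URL is set (use Tesseract.js runtime defaults),
// a full config when all three are set, and throws on a partial setup.
function resolveTesseractAssets(env: OcrEnv): TesseractAssetConfig | undefined {
  const workerPath = clean(env.VITE_TESSERACT_WORKER_URL);
  const corePath = clean(env.VITE_TESSERACT_CORE_URL);
  const langPath = clean(env.VITE_TESSERACT_LANG_URL);

  const setCount = [workerPath, corePath, langPath].filter((v) => v !== "").length;
  if (setCount === 0) {
    return undefined;
  }
  if (setCount !== 3) {
    throw new Error(
      "Set VITE_TESSERACT_WORKER_URL, VITE_TESSERACT_CORE_URL and " +
        "VITE_TESSERACT_LANG_URL together, or leave all three empty."
    );
  }
  return { workerPath, corePath, langPath };
}

// Assumed format: comma-separated language codes, e.g. "eng,deu".
// Returns undefined when the variable is empty (no restriction configured).
function parseAvailableLanguages(raw: string | undefined): string[] | undefined {
  const codes = clean(raw)
    .split(",")
    .map((code) => code.trim())
    .filter((code) => code !== "");
  return codes.length > 0 ? codes : undefined;
}
```

In a Vite app, a caller would typically pass `import.meta.env` as the argument and spread a defined result into the options object given to `createWorker`. Failing fast on a partial configuration is one possible design choice; silently falling back to the defaults would be another.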
--- a/.env.example
+++ b/.env.example
@@ -12,6 +12,15 @@ VITE_WASM_PYMUPDF_URL=https://cdn.jsdelivr.net/npm/@bentopdf/pymupdf-wasm@0.11.1
 VITE_WASM_GS_URL=https://cdn.jsdelivr.net/npm/@bentopdf/gs-wasm/assets/
 VITE_WASM_CPDF_URL=https://cdn.jsdelivr.net/npm/coherentpdf/dist/

+# OCR assets (optional)
+# Set all three together for self-hosted or air-gapped OCR.
+# Leave empty to use Tesseract.js runtime defaults.
+VITE_TESSERACT_WORKER_URL=
+VITE_TESSERACT_CORE_URL=
+VITE_TESSERACT_LANG_URL=
+VITE_TESSERACT_AVAILABLE_LANGUAGES=
+VITE_OCR_FONT_BASE_URL=
+
 # Default UI language (build-time)
 # Supported: en, ar, be, fr, de, es, zh, zh-TW, vi, tr, id, it, pt, nl, da
 VITE_DEFAULT_LANGUAGE=