feat: integrate Tesseract.js with improved language availability and font handling
- Refactored OCR page recognition to utilize a configured Tesseract worker.
- Added functions to manage font URLs and asset filenames based on language.
- Implemented language availability checks and error handling for unsupported languages.
- Enhanced PDF workflow to display available OCR languages and handle user selections.
- Introduced utility functions for resolving Tesseract asset configurations.
- Added tests for OCR functionality, font loading, and Tesseract runtime behavior.
- Updated global types to include environment variables for Tesseract and font configurations.
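The language availability check described in the commit message could look roughly like the sketch below. This is not the code from this commit: the function and class names are hypothetical, and it assumes `VITE_TESSERACT_AVAILABLE_LANGUAGES` holds a comma-separated list of Tesseract language codes (for example `eng,deu,fra`), with an unset or empty value meaning "no restriction".

```typescript
// Hypothetical helper: parse a comma-separated language list.
// Returns null when the value is unset or empty (assumed to mean "unrestricted").
function parseAvailableLanguages(raw: string | undefined): string[] | null {
  const codes = (raw || "")
    .split(",")
    .map((code) => code.trim())
    .filter((code) => code.length > 0);
  return codes.length > 0 ? codes : null;
}

// Hypothetical error type for languages that are not bundled or hosted.
class UnsupportedOcrLanguageError extends Error {
  missing: string[];

  constructor(missing: string[]) {
    super("OCR language(s) not available: " + missing.join(", "));
    this.name = "UnsupportedOcrLanguageError";
    this.missing = missing;
  }
}

// Throws if any requested language is outside the configured list.
function assertLanguagesAvailable(
  requested: string[],
  available: string[] | null
): void {
  if (available === null) return; // no restriction configured
  const missing = requested.filter((code) => available.indexOf(code) === -1);
  if (missing.length > 0) {
    throw new UnsupportedOcrLanguageError(missing);
  }
}
```

Failing before the worker is created lets the UI report the missing language directly, rather than surfacing a failed download of a `.traineddata` file.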
12
Dockerfile
@@ -35,6 +35,18 @@ ENV VITE_WASM_PYMUPDF_URL=$VITE_WASM_PYMUPDF_URL
ENV VITE_WASM_GS_URL=$VITE_WASM_GS_URL
ENV VITE_WASM_CPDF_URL=$VITE_WASM_CPDF_URL

# OCR asset URLs (optional, used for self-hosted or air-gapped OCR)
ARG VITE_TESSERACT_WORKER_URL
ARG VITE_TESSERACT_CORE_URL
ARG VITE_TESSERACT_LANG_URL
ARG VITE_TESSERACT_AVAILABLE_LANGUAGES
ARG VITE_OCR_FONT_BASE_URL
ENV VITE_TESSERACT_WORKER_URL=$VITE_TESSERACT_WORKER_URL
ENV VITE_TESSERACT_CORE_URL=$VITE_TESSERACT_CORE_URL
ENV VITE_TESSERACT_LANG_URL=$VITE_TESSERACT_LANG_URL
ENV VITE_TESSERACT_AVAILABLE_LANGUAGES=$VITE_TESSERACT_AVAILABLE_LANGUAGES
ENV VITE_OCR_FONT_BASE_URL=$VITE_OCR_FONT_BASE_URL

# Default UI language (e.g. en, fr, de, es, zh, ar)
ARG VITE_DEFAULT_LANGUAGE
ENV VITE_DEFAULT_LANGUAGE=$VITE_DEFAULT_LANGUAGE
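For context, here is a minimal sketch of how the application side might turn the OCR build arguments above into Tesseract.js worker options. The function name and the `Env` type are hypothetical and not taken from this commit. The `workerPath`, `corePath`, and `langPath` fields are the option names Tesseract.js accepts in `createWorker`. Because an `ARG` without a default yields an empty `ENV` value, the sketch treats empty strings as unset so the library defaults still apply.

```typescript
// Subset of Tesseract.js worker options that these variables would map to.
interface TesseractAssetConfig {
  workerPath?: string;
  corePath?: string;
  langPath?: string;
}

// Hypothetical env shape; in a Vite app this would likely be import.meta.env.
type Env = { [key: string]: string | undefined };

// Treat unset, empty, and whitespace-only values as "not configured".
function nonEmpty(value: string | undefined): string | undefined {
  const trimmed = (value || "").trim();
  return trimmed.length > 0 ? trimmed : undefined;
}

// Hypothetical resolver: only configured URLs are included, so anything
// left out falls back to the Tesseract.js defaults.
function resolveTesseractAssetConfig(env: Env): TesseractAssetConfig {
  const config: TesseractAssetConfig = {};
  const workerPath = nonEmpty(env.VITE_TESSERACT_WORKER_URL);
  const corePath = nonEmpty(env.VITE_TESSERACT_CORE_URL);
  const langPath = nonEmpty(env.VITE_TESSERACT_LANG_URL);
  if (workerPath) config.workerPath = workerPath;
  if (corePath) config.corePath = corePath;
  if (langPath) config.langPath = langPath;
  return config;
}
```

The resulting object could then be passed as the options argument when creating the worker. For a self-hosted or air-gapped deployment, all three URLs would typically be supplied at image build time with `docker build --build-arg`.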