feat: integrate Tesseract.js with improved language availability and font handling

- Refactored OCR page recognition to utilize a configured Tesseract worker.
- Added functions to manage font URLs and asset filenames based on language.
- Implemented language availability checks and error handling for unsupported languages.
- Enhanced PDF workflow to display available OCR languages and handle user selections.
- Introduced utility functions for resolving Tesseract asset configurations.
- Added tests for OCR functionality, font loading, and Tesseract runtime behavior.
- Updated global types to include environment variables for Tesseract and font configurations.
This commit is contained in:
alam00000
2026-03-14 15:50:30 +05:30
parent 58c78b09d2
commit 77da6d7a7d
23 changed files with 1906 additions and 564 deletions

View File

@@ -465,6 +465,11 @@ The default URLs are set in `.env.production`:
VITE_WASM_PYMUPDF_URL=https://cdn.jsdelivr.net/npm/@bentopdf/pymupdf-wasm@0.11.16/
VITE_WASM_GS_URL=https://cdn.jsdelivr.net/npm/@bentopdf/gs-wasm/assets/
VITE_WASM_CPDF_URL=https://cdn.jsdelivr.net/npm/coherentpdf/dist/
VITE_TESSERACT_WORKER_URL=
VITE_TESSERACT_CORE_URL=
VITE_TESSERACT_LANG_URL=
VITE_TESSERACT_AVAILABLE_LANGUAGES=
VITE_OCR_FONT_BASE_URL=
```
To override via Docker build args:
@@ -474,11 +479,18 @@ docker build \
--build-arg VITE_WASM_PYMUPDF_URL=https://your-server.com/pymupdf/ \
--build-arg VITE_WASM_GS_URL=https://your-server.com/gs/ \
--build-arg VITE_WASM_CPDF_URL=https://your-server.com/cpdf/ \
--build-arg VITE_TESSERACT_WORKER_URL=https://your-server.com/ocr/worker.min.js \
--build-arg VITE_TESSERACT_CORE_URL=https://your-server.com/ocr/core \
--build-arg VITE_TESSERACT_LANG_URL=https://your-server.com/ocr/lang-data \
--build-arg VITE_TESSERACT_AVAILABLE_LANGUAGES=eng,deu \
--build-arg VITE_OCR_FONT_BASE_URL=https://your-server.com/ocr/fonts \
-t bentopdf .
```
To disable a module (require manual user config via Advanced Settings), set its variable to an empty string.
For OCR, either leave all `VITE_TESSERACT_*` variables empty and use the default online assets, or set the worker/core/lang URLs together for self-hosted/offline OCR. If your self-hosted bundle only includes a subset such as `eng,deu`, also set `VITE_TESSERACT_AVAILABLE_LANGUAGES=eng,deu` so the UI only shows bundled languages and OCR fails with a descriptive message for unsupported ones. For fully offline searchable-PDF output, also set `VITE_OCR_FONT_BASE_URL` to the internal directory that serves the bundled OCR text-layer fonts.
Users can also override these defaults per-browser via **Advanced Settings** in the UI — user overrides take priority over the environment defaults.
> [!IMPORTANT]
@@ -496,6 +508,12 @@ The included `prepare-airgap.sh` script automates the entire process — downloa
git clone https://github.com/alam00000/bentopdf.git
cd bentopdf
# Show supported OCR language codes (for --ocr-languages)
bash scripts/prepare-airgap.sh --list-ocr-languages
# Search OCR language codes by name or abbreviation
bash scripts/prepare-airgap.sh --search-ocr-language german
# Interactive mode — prompts for all options
bash scripts/prepare-airgap.sh
@@ -508,7 +526,9 @@ This produces a bundle directory containing:
```
bentopdf-airgap-bundle/
bentopdf.tar # Docker image
*.tgz # WASM packages (PyMuPDF, Ghostscript, CoherentPDF)
*.tgz # WASM packages (PyMuPDF, Ghostscript, CoherentPDF, Tesseract)
tesseract-langdata/ # OCR traineddata files
ocr-fonts/ # OCR text-layer font files
setup.sh # Setup script for the air-gapped side
README.md # Instructions
```
@@ -525,23 +545,28 @@ The setup script loads the Docker image, extracts WASM files, and optionally sta
<details>
<summary><strong>Script options</strong></summary>
| Flag | Description | Default |
| ----------------------- | ------------------------------------------------ | --------------------------------- |
| `--wasm-base-url <url>` | Where WASMs will be hosted internally | _(required, prompted if missing)_ |
| `--image-name <name>` | Docker image tag | `bentopdf` |
| `--output-dir <path>` | Output bundle directory | `./bentopdf-airgap-bundle` |
| `--simple-mode` | Enable Simple Mode | off |
| `--base-url <path>` | Subdirectory base URL (e.g. `/pdf/`) | `/` |
| `--language <code>` | Default UI language (e.g. `fr`, `de`) | _(none)_ |
| `--brand-name <name>` | Custom brand name | _(none)_ |
| `--brand-logo <path>` | Logo path relative to `public/` | _(none)_ |
| `--footer-text <text>` | Custom footer text | _(none)_ |
| `--dockerfile <path>` | Dockerfile to use | `Dockerfile` |
| `--skip-docker` | Skip Docker build and export | off |
| `--skip-wasm` | Skip WASM download (reuse existing `.tgz` files) | off |
| Flag | Description | Default |
| ------------------------------ | ------------------------------------------------ | --------------------------------- |
| `--wasm-base-url <url>` | Where WASMs will be hosted internally | _(required, prompted if missing)_ |
| `--image-name <name>` | Docker image tag | `bentopdf` |
| `--output-dir <path>` | Output bundle directory | `./bentopdf-airgap-bundle` |
| `--simple-mode` | Enable Simple Mode | off |
| `--base-url <path>` | Subdirectory base URL (e.g. `/pdf/`) | `/` |
| `--language <code>` | Default UI language (e.g. `fr`, `de`) | _(none)_ |
| `--brand-name <name>` | Custom brand name | _(none)_ |
| `--brand-logo <path>` | Logo path relative to `public/` | _(none)_ |
| `--footer-text <text>` | Custom footer text | _(none)_ |
| `--ocr-languages <list>` | Comma-separated OCR languages to bundle | `eng` |
| `--list-ocr-languages` | Print supported OCR codes and names, then exit | off |
| `--search-ocr-language <term>` | Search OCR codes by name or abbreviation | off |
| `--dockerfile <path>` | Dockerfile to use | `Dockerfile` |
| `--skip-docker` | Skip Docker build and export | off |
| `--skip-wasm` | Skip WASM download (reuse existing `.tgz` files) | off |
</details>
The interactive prompt also accepts `list` to print the full supported Tesseract code list and `search <term>` to find matches such as `search german` or `search chi`.
> [!IMPORTANT]
> WASM files must be served from the **same origin** as the BentoPDF app. Web Workers use `importScripts()` which cannot load scripts cross-origin. For example, if BentoPDF runs at `https://internal.example.com`, the WASM base URL should also be `https://internal.example.com/wasm`.
@@ -550,12 +575,18 @@ The setup script loads the Docker image, extracts WASM files, and optionally sta
<details>
<summary>If you prefer to do it manually without the script</summary>
**Step 1: Download the WASM packages** (on a machine with internet)
**Step 1: Download the WASM and OCR packages** (on a machine with internet)
```bash
npm pack @bentopdf/pymupdf-wasm@0.11.16
npm pack @bentopdf/gs-wasm
npm pack coherentpdf
npm pack tesseract.js@7.0.0
npm pack tesseract.js-core@7.0.0
mkdir -p tesseract-langdata
curl -fsSL https://cdn.jsdelivr.net/npm/@tesseract.js-data/eng/4.0.0_best_int/eng.traineddata.gz -o tesseract-langdata/eng.traineddata.gz
mkdir -p ocr-fonts
curl -fsSL https://raw.githack.com/googlefonts/noto-fonts/main/hinted/ttf/NotoSans/NotoSans-Regular.ttf -o ocr-fonts/NotoSans-Regular.ttf
```
**Step 2: Build the Docker image with internal URLs**
@@ -568,6 +599,10 @@ docker build \
--build-arg VITE_WASM_PYMUPDF_URL=https://internal-server.example.com/wasm/pymupdf/ \
--build-arg VITE_WASM_GS_URL=https://internal-server.example.com/wasm/gs/ \
--build-arg VITE_WASM_CPDF_URL=https://internal-server.example.com/wasm/cpdf/ \
--build-arg VITE_TESSERACT_WORKER_URL=https://internal-server.example.com/wasm/ocr/worker.min.js \
--build-arg VITE_TESSERACT_CORE_URL=https://internal-server.example.com/wasm/ocr/core \
--build-arg VITE_TESSERACT_LANG_URL=https://internal-server.example.com/wasm/ocr/lang-data \
--build-arg VITE_OCR_FONT_BASE_URL=https://internal-server.example.com/wasm/ocr/fonts \
-t bentopdf .
```
@@ -585,6 +620,10 @@ Copy these files via USB drive, internal artifact repository, or approved transf
- `bentopdf-pymupdf-wasm-0.11.14.tgz` — PyMuPDF WASM package
- `bentopdf-gs-wasm-*.tgz` — Ghostscript WASM package
- `coherentpdf-*.tgz` — CoherentPDF WASM package
- `tesseract.js-7.0.0.tgz` — Tesseract worker package
- `tesseract.js-core-7.0.0.tgz` — Tesseract core runtime package
- `tesseract-langdata/` — OCR traineddata files
- `ocr-fonts/` — OCR text-layer font files
**Step 5: Set up inside the air-gapped network**
@@ -593,16 +632,23 @@ Copy these files via USB drive, internal artifact repository, or approved transf
docker load -i bentopdf.tar
# Extract the WASM packages
mkdir -p ./wasm/pymupdf ./wasm/gs ./wasm/cpdf
mkdir -p ./wasm/pymupdf ./wasm/gs ./wasm/cpdf ./wasm/ocr/core ./wasm/ocr/lang-data ./wasm/ocr/fonts
tar xzf bentopdf-pymupdf-wasm-0.11.14.tgz -C ./wasm/pymupdf --strip-components=1
tar xzf bentopdf-gs-wasm-*.tgz -C ./wasm/gs --strip-components=1
tar xzf coherentpdf-*.tgz -C ./wasm/cpdf --strip-components=1
TEMP_TESS=$(mktemp -d)
tar xzf tesseract.js-7.0.0.tgz -C "$TEMP_TESS"
cp "$TEMP_TESS/package/dist/worker.min.js" ./wasm/ocr/worker.min.js
rm -rf "$TEMP_TESS"
tar xzf tesseract.js-core-7.0.0.tgz -C ./wasm/ocr/core --strip-components=1
cp ./tesseract-langdata/*.traineddata.gz ./wasm/ocr/lang-data/
cp ./ocr-fonts/* ./wasm/ocr/fonts/
# Run BentoPDF
docker run -d -p 3000:8080 --restart unless-stopped bentopdf
```
Make sure the WASM files are accessible at the URLs you configured in Step 2.
Make sure the files are accessible at the URLs you configured in Step 2, including `.../ocr/worker.min.js`, `.../ocr/core`, `.../ocr/lang-data`, and `.../ocr/fonts`.
</details>
@@ -613,6 +659,10 @@ Make sure the WASM files are accessible at the URLs you configured in Step 2.
> VITE_WASM_PYMUPDF_URL=https://internal-server.example.com/wasm/pymupdf/
> VITE_WASM_GS_URL=https://internal-server.example.com/wasm/gs/
> VITE_WASM_CPDF_URL=https://internal-server.example.com/wasm/cpdf/
> VITE_TESSERACT_WORKER_URL=https://internal-server.example.com/wasm/ocr/worker.min.js
> VITE_TESSERACT_CORE_URL=https://internal-server.example.com/wasm/ocr/core
> VITE_TESSERACT_LANG_URL=https://internal-server.example.com/wasm/ocr/lang-data
> VITE_OCR_FONT_BASE_URL=https://internal-server.example.com/wasm/ocr/fonts
> ```
**Subdirectory Hosting:**