feat: integrate Tesseract.js with improved language availability and font handling
- Refactored OCR page recognition to utilize a configured Tesseract worker.
- Added functions to manage font URLs and asset filenames based on language.
- Implemented language availability checks and error handling for unsupported languages.
- Enhanced PDF workflow to display available OCR languages and handle user selections.
- Introduced utility functions for resolving Tesseract asset configurations.
- Added tests for OCR functionality, font loading, and Tesseract runtime behavior.
- Updated global types to include environment variables for Tesseract and font configurations.
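The asset-configuration and language-availability behavior listed above can be illustrated with a minimal sketch. It is not the code from this commit: the function and interface names (`resolveTesseractAssetConfig`, `parseAvailableLanguages`, `assertLanguagesAvailable`, `OcrEnv`, `TesseractAssetConfig`) are hypothetical, and the all-or-nothing rule for the worker/core/lang URLs is an assumption drawn from the README text in the diff. The returned keys are named after the `workerPath`, `corePath`, and `langPath` options that Tesseract.js accepts when creating a worker.

```typescript
// Hypothetical shapes and names; the real helpers in this commit may differ.
interface OcrEnv {
  VITE_TESSERACT_WORKER_URL?: string;
  VITE_TESSERACT_CORE_URL?: string;
  VITE_TESSERACT_LANG_URL?: string;
  VITE_TESSERACT_AVAILABLE_LANGUAGES?: string;
}

interface TesseractAssetConfig {
  workerPath: string;
  corePath: string;
  langPath: string;
}

// Returns undefined when no URL is set (fall back to the default online assets),
// the three paths when all are set, and throws on a partial configuration.
function resolveTesseractAssetConfig(env: OcrEnv): TesseractAssetConfig | undefined {
  const workerPath = (env.VITE_TESSERACT_WORKER_URL ?? "").trim();
  const corePath = (env.VITE_TESSERACT_CORE_URL ?? "").trim();
  const langPath = (env.VITE_TESSERACT_LANG_URL ?? "").trim();
  const provided = [workerPath, corePath, langPath].filter((v) => v !== "");
  if (provided.length === 0) return undefined;
  if (provided.length < 3) {
    throw new Error(
      "Set VITE_TESSERACT_WORKER_URL, VITE_TESSERACT_CORE_URL and VITE_TESSERACT_LANG_URL together."
    );
  }
  return { workerPath, corePath, langPath };
}

// Parses "eng, deu" into ["eng", "deu"]; null means no restriction is configured.
function parseAvailableLanguages(raw: string | undefined): string[] | null {
  const codes = (raw ?? "")
    .split(",")
    .map((code) => code.trim())
    .filter((code) => code !== "");
  return codes.length > 0 ? codes : null;
}

// Throws a descriptive error if any requested code is outside the bundled set.
function assertLanguagesAvailable(requested: string[], available: string[] | null): void {
  if (available === null) return;
  const missing = requested.filter((code) => !available.includes(code));
  if (missing.length > 0) {
    throw new Error(
      `OCR language(s) not bundled: ${missing.join(", ")}. Available: ${available.join(", ")}.`
    );
  }
}
```

Under these assumptions, a self-hosted build with `VITE_TESSERACT_AVAILABLE_LANGUAGES=eng,deu` would accept a request for `["eng"]` and reject `["fra"]` with a message naming the missing code, while a build that leaves the variable empty would skip the check.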
86
README.md
@@ -465,6 +465,11 @@ The default URLs are set in `.env.production`:
 VITE_WASM_PYMUPDF_URL=https://cdn.jsdelivr.net/npm/@bentopdf/pymupdf-wasm@0.11.16/
 VITE_WASM_GS_URL=https://cdn.jsdelivr.net/npm/@bentopdf/gs-wasm/assets/
 VITE_WASM_CPDF_URL=https://cdn.jsdelivr.net/npm/coherentpdf/dist/
+VITE_TESSERACT_WORKER_URL=
+VITE_TESSERACT_CORE_URL=
+VITE_TESSERACT_LANG_URL=
+VITE_TESSERACT_AVAILABLE_LANGUAGES=
+VITE_OCR_FONT_BASE_URL=
 ```

 To override via Docker build args:
@@ -474,11 +479,18 @@ docker build \
 --build-arg VITE_WASM_PYMUPDF_URL=https://your-server.com/pymupdf/ \
 --build-arg VITE_WASM_GS_URL=https://your-server.com/gs/ \
 --build-arg VITE_WASM_CPDF_URL=https://your-server.com/cpdf/ \
+--build-arg VITE_TESSERACT_WORKER_URL=https://your-server.com/ocr/worker.min.js \
+--build-arg VITE_TESSERACT_CORE_URL=https://your-server.com/ocr/core \
+--build-arg VITE_TESSERACT_LANG_URL=https://your-server.com/ocr/lang-data \
+--build-arg VITE_TESSERACT_AVAILABLE_LANGUAGES=eng,deu \
+--build-arg VITE_OCR_FONT_BASE_URL=https://your-server.com/ocr/fonts \
 -t bentopdf .
 ```

 To disable a module (require manual user config via Advanced Settings), set its variable to an empty string.

+For OCR, either leave all `VITE_TESSERACT_*` variables empty and use the default online assets, or set the worker/core/lang URLs together for self-hosted/offline OCR. If your self-hosted bundle only includes a subset such as `eng,deu`, also set `VITE_TESSERACT_AVAILABLE_LANGUAGES=eng,deu` so the UI only shows bundled languages and OCR fails with a descriptive message for unsupported ones. For fully offline searchable-PDF output, also set `VITE_OCR_FONT_BASE_URL` to the internal directory that serves the bundled OCR text-layer fonts.
+
 Users can also override these defaults per-browser via **Advanced Settings** in the UI — user overrides take priority over the environment defaults.

 > [!IMPORTANT]
@@ -496,6 +508,12 @@ The included `prepare-airgap.sh` script automates the entire process — downloa
 git clone https://github.com/alam00000/bentopdf.git
 cd bentopdf

+# Show supported OCR language codes (for --ocr-languages)
+bash scripts/prepare-airgap.sh --list-ocr-languages
+
+# Search OCR language codes by name or abbreviation
+bash scripts/prepare-airgap.sh --search-ocr-language german
+
 # Interactive mode — prompts for all options
 bash scripts/prepare-airgap.sh

@@ -508,7 +526,9 @@ This produces a bundle directory containing:
 ```
 bentopdf-airgap-bundle/
 bentopdf.tar # Docker image
-*.tgz # WASM packages (PyMuPDF, Ghostscript, CoherentPDF)
+*.tgz # WASM packages (PyMuPDF, Ghostscript, CoherentPDF, Tesseract)
+tesseract-langdata/ # OCR traineddata files
+ocr-fonts/ # OCR text-layer font files
 setup.sh # Setup script for the air-gapped side
 README.md # Instructions
 ```
@@ -525,23 +545,28 @@ The setup script loads the Docker image, extracts WASM files, and optionally sta
 <details>
 <summary><strong>Script options</strong></summary>

-| Flag | Description | Default |
-| ----------------------- | ------------------------------------------------ | --------------------------------- |
-| `--wasm-base-url <url>` | Where WASMs will be hosted internally | _(required, prompted if missing)_ |
-| `--image-name <name>` | Docker image tag | `bentopdf` |
-| `--output-dir <path>` | Output bundle directory | `./bentopdf-airgap-bundle` |
-| `--simple-mode` | Enable Simple Mode | off |
-| `--base-url <path>` | Subdirectory base URL (e.g. `/pdf/`) | `/` |
-| `--language <code>` | Default UI language (e.g. `fr`, `de`) | _(none)_ |
-| `--brand-name <name>` | Custom brand name | _(none)_ |
-| `--brand-logo <path>` | Logo path relative to `public/` | _(none)_ |
-| `--footer-text <text>` | Custom footer text | _(none)_ |
-| `--dockerfile <path>` | Dockerfile to use | `Dockerfile` |
-| `--skip-docker` | Skip Docker build and export | off |
-| `--skip-wasm` | Skip WASM download (reuse existing `.tgz` files) | off |
+| Flag | Description | Default |
+| ------------------------------ | ------------------------------------------------ | --------------------------------- |
+| `--wasm-base-url <url>` | Where WASMs will be hosted internally | _(required, prompted if missing)_ |
+| `--image-name <name>` | Docker image tag | `bentopdf` |
+| `--output-dir <path>` | Output bundle directory | `./bentopdf-airgap-bundle` |
+| `--simple-mode` | Enable Simple Mode | off |
+| `--base-url <path>` | Subdirectory base URL (e.g. `/pdf/`) | `/` |
+| `--language <code>` | Default UI language (e.g. `fr`, `de`) | _(none)_ |
+| `--brand-name <name>` | Custom brand name | _(none)_ |
+| `--brand-logo <path>` | Logo path relative to `public/` | _(none)_ |
+| `--footer-text <text>` | Custom footer text | _(none)_ |
+| `--ocr-languages <list>` | Comma-separated OCR languages to bundle | `eng` |
+| `--list-ocr-languages` | Print supported OCR codes and names, then exit | off |
+| `--search-ocr-language <term>` | Search OCR codes by name or abbreviation | off |
+| `--dockerfile <path>` | Dockerfile to use | `Dockerfile` |
+| `--skip-docker` | Skip Docker build and export | off |
+| `--skip-wasm` | Skip WASM download (reuse existing `.tgz` files) | off |

 </details>

+The interactive prompt also accepts `list` to print the full supported Tesseract code list and `search <term>` to find matches such as `search german` or `search chi`.
+
 > [!IMPORTANT]
 > WASM files must be served from the **same origin** as the BentoPDF app. Web Workers use `importScripts()` which cannot load scripts cross-origin. For example, if BentoPDF runs at `https://internal.example.com`, the WASM base URL should also be `https://internal.example.com/wasm`.

@@ -550,12 +575,18 @@ The setup script loads the Docker image, extracts WASM files, and optionally sta
 <details>
 <summary>If you prefer to do it manually without the script</summary>

-**Step 1: Download the WASM packages** (on a machine with internet)
+**Step 1: Download the WASM and OCR packages** (on a machine with internet)

 ```bash
 npm pack @bentopdf/pymupdf-wasm@0.11.16
 npm pack @bentopdf/gs-wasm
 npm pack coherentpdf
+npm pack tesseract.js@7.0.0
+npm pack tesseract.js-core@7.0.0
+mkdir -p tesseract-langdata
+curl -fsSL https://cdn.jsdelivr.net/npm/@tesseract.js-data/eng/4.0.0_best_int/eng.traineddata.gz -o tesseract-langdata/eng.traineddata.gz
+mkdir -p ocr-fonts
+curl -fsSL https://raw.githack.com/googlefonts/noto-fonts/main/hinted/ttf/NotoSans/NotoSans-Regular.ttf -o ocr-fonts/NotoSans-Regular.ttf
 ```

 **Step 2: Build the Docker image with internal URLs**
@@ -568,6 +599,10 @@ docker build \
 --build-arg VITE_WASM_PYMUPDF_URL=https://internal-server.example.com/wasm/pymupdf/ \
 --build-arg VITE_WASM_GS_URL=https://internal-server.example.com/wasm/gs/ \
 --build-arg VITE_WASM_CPDF_URL=https://internal-server.example.com/wasm/cpdf/ \
+--build-arg VITE_TESSERACT_WORKER_URL=https://internal-server.example.com/wasm/ocr/worker.min.js \
+--build-arg VITE_TESSERACT_CORE_URL=https://internal-server.example.com/wasm/ocr/core \
+--build-arg VITE_TESSERACT_LANG_URL=https://internal-server.example.com/wasm/ocr/lang-data \
+--build-arg VITE_OCR_FONT_BASE_URL=https://internal-server.example.com/wasm/ocr/fonts \
 -t bentopdf .
 ```

@@ -585,6 +620,10 @@ Copy these files via USB drive, internal artifact repository, or approved transf
 - `bentopdf-pymupdf-wasm-0.11.14.tgz` — PyMuPDF WASM package
 - `bentopdf-gs-wasm-*.tgz` — Ghostscript WASM package
 - `coherentpdf-*.tgz` — CoherentPDF WASM package
+- `tesseract.js-7.0.0.tgz` — Tesseract worker package
+- `tesseract.js-core-7.0.0.tgz` — Tesseract core runtime package
+- `tesseract-langdata/` — OCR traineddata files
+- `ocr-fonts/` — OCR text-layer font files

 **Step 5: Set up inside the air-gapped network**

@@ -593,16 +632,23 @@ Copy these files via USB drive, internal artifact repository, or approved transf
 docker load -i bentopdf.tar

 # Extract the WASM packages
-mkdir -p ./wasm/pymupdf ./wasm/gs ./wasm/cpdf
+mkdir -p ./wasm/pymupdf ./wasm/gs ./wasm/cpdf ./wasm/ocr/core ./wasm/ocr/lang-data ./wasm/ocr/fonts
 tar xzf bentopdf-pymupdf-wasm-0.11.14.tgz -C ./wasm/pymupdf --strip-components=1
 tar xzf bentopdf-gs-wasm-*.tgz -C ./wasm/gs --strip-components=1
 tar xzf coherentpdf-*.tgz -C ./wasm/cpdf --strip-components=1
+TEMP_TESS=$(mktemp -d)
+tar xzf tesseract.js-7.0.0.tgz -C "$TEMP_TESS"
+cp "$TEMP_TESS/package/dist/worker.min.js" ./wasm/ocr/worker.min.js
+rm -rf "$TEMP_TESS"
+tar xzf tesseract.js-core-7.0.0.tgz -C ./wasm/ocr/core --strip-components=1
+cp ./tesseract-langdata/*.traineddata.gz ./wasm/ocr/lang-data/
+cp ./ocr-fonts/* ./wasm/ocr/fonts/

 # Run BentoPDF
 docker run -d -p 3000:8080 --restart unless-stopped bentopdf
 ```

-Make sure the WASM files are accessible at the URLs you configured in Step 2.
+Make sure the files are accessible at the URLs you configured in Step 2, including `.../ocr/worker.min.js`, `.../ocr/core`, `.../ocr/lang-data`, and `.../ocr/fonts`.

 </details>

@@ -613,6 +659,10 @@ Make sure the WASM files are accessible at the URLs you configured in Step 2.
 > VITE_WASM_PYMUPDF_URL=https://internal-server.example.com/wasm/pymupdf/
 > VITE_WASM_GS_URL=https://internal-server.example.com/wasm/gs/
 > VITE_WASM_CPDF_URL=https://internal-server.example.com/wasm/cpdf/
+> VITE_TESSERACT_WORKER_URL=https://internal-server.example.com/wasm/ocr/worker.min.js
+> VITE_TESSERACT_CORE_URL=https://internal-server.example.com/wasm/ocr/core
+> VITE_TESSERACT_LANG_URL=https://internal-server.example.com/wasm/ocr/lang-data
+> VITE_OCR_FONT_BASE_URL=https://internal-server.example.com/wasm/ocr/fonts
 > ```

 **Subdirectory Hosting:**