How to OCR to searchable PDF in Linux

There are multiple OCR (optical character recognition) engines for Linux, but most have a major drawback. They can only export plain text of the OCR'ed image and do not support embedding text into the PDF in order to make a searchable PDF.

By searchable PDF, we refer to a scanned PDF document that contains invisible OCR'ed text over the scanned image. The text should have the right size in order to be placed over the text portions from image. Every word from the text layer should overlay exactly on the portion of the image that contains that word.

Here are two software solutions that are able to create searchable PDFs. One is a native Linux OCR engine and the other is a free PDF reader with OCR capabilities running in Wine.


1. Tesseract & PDFsandwich

Tesseract is the first and currently the only OCR engine for Linux that supports direct searchable PDF output (starting with version 3.03). Its one limitation is that it accepts only image input, so you can't feed it a PDF document. You can install it on APT-based distributions (like Ubuntu) using the following command:
sudo apt-get install tesseract-ocr tesseract-ocr-all
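Before relying on PDF output, it may be worth confirming that the installed version is at least 3.03. The following is a minimal sketch, not part of the original instructions: the helper name supports_pdf_output is made up for illustration, and it assumes GNU sort -V is available and that tesseract --version prints the version as the second word of its first line.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the given version is 3.03 or newer,
# the first release with searchable PDF output according to the text above.
supports_pdf_output() {
    local min="3.03" ver="$1"
    # sort -V orders version strings; if min sorts first, then ver >= min
    [ "$(printf '%s\n%s\n' "$min" "$ver" | sort -V | head -n1)" = "$min" ]
}

if command -v tesseract >/dev/null 2>&1; then
    # Older releases print the version on stderr, hence 2>&1
    ver=$(tesseract --version 2>&1 | head -n1 | awk '{print $2}')
    if supports_pdf_output "$ver"; then
        echo "tesseract $ver can write searchable PDFs"
    else
        echo "tesseract $ver is too old for PDF output"
    fi
    tesseract --list-langs   # shows the installed language codes
else
    echo "tesseract is not installed"
fi
```

The language codes printed by the last command are the values that can be passed to the -l option.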
If you have a batch of images produced by a scanner, a simple script can OCR each image into a single-page searchable PDF and then join the pages into a single PDF document:
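#!/bin/bash
# Don't name this variable LANG: that would override the shell locale.
OCR_LANG=eng   # replace with your language code

shopt -s nullglob

for f in *.tif; do
    echo "Running OCR on $f"
    # Writes $f.pdf; newer Tesseract releases (4.x) expect --psm instead of -psm
    tesseract -psm 1 -l "$OCR_LANG" "$f" "$f" pdf
done

echo "Joining files into single PDF..."
pdftk *.pdf cat output ../outdocument.pdf
# Caution: this deletes every PDF in the current directory
rm -f *.pdf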
This script takes all .tif files from the directory where it is run and processes them with tesseract. It also needs pdftk to be installed. Copy the above snippet into a new file, make it executable (chmod +x), then place it in the folder with the scanned images and run it.
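One detail the script leaves to the shell is page order: *.pdf expands alphabetically, so page10.tif.pdf would be joined before page2.tif.pdf unless the scanner zero-pads its file names. A possible workaround, sketched here under the assumption that GNU find and sort -V are available and that file names contain no newlines, is to list the pages in natural order first. The function name list_pages_in_order and the page*.tif.pdf names are illustrative only.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the PDF file names found directly in a
# directory, one per line, in natural (version) order.
list_pages_in_order() {
    find "$1" -maxdepth 1 -type f -name '*.pdf' -printf '%f\n' | sort -V
}

# Small demonstration with empty placeholder files in a scratch directory
demo_dir=$(mktemp -d)
touch "$demo_dir/page1.tif.pdf" "$demo_dir/page2.tif.pdf" "$demo_dir/page10.tif.pdf"
list_pages_in_order "$demo_dir"   # page1, page2, then page10
rm -rf "$demo_dir"
```

The resulting list could then be handed to pdftk in place of the *.pdf glob.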

Things get complicated if you already have a PDF document that you want to make searchable. To use tesseract, its pages must first be exported to images, and for that you need to know the resolution of the scanned image. This can be a problem if you didn't scan the document yourself and have no idea what resolution was used.
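If the resolution is unknown, it can often be estimated from the pixel width of the embedded scan and the width of the page, since a PDF page is measured in points (72 per inch). The sketch below is an illustration under stated assumptions rather than a step from the original workflow: estimate_dpi is a made-up helper, and the two numbers have to be read off by hand, for instance from pdfimages -list and pdfinfo (both part of poppler-utils). It also assumes the scan fills the whole page.

```shell
#!/usr/bin/env bash
# Hypothetical helper: estimate scan resolution in DPI.
#   $1 = image width in pixels
#   $2 = page width in PDF points (1 point = 1/72 inch)
estimate_dpi() {
    local pixels=$1 points=$2
    # dpi = pixels / (points / 72), rounded to the nearest integer
    echo $(( (pixels * 72 + points / 2) / points ))
}

# A 2480 pixel wide scan on an A4 page (595 points wide)
estimate_dpi 2480 595   # prints 300
```

With the resolution known, the pages could then be exported with a tool such as pdftoppm, for example pdftoppm -r 300 -tiff input.pdf page.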

In this situation, you can use the pdfsandwich script by Tobias Elze. Not only does it extract all pages from the PDF as images, it also pre-processes them for OCR using multiple threads. You can download the DEB package from the website and install it with GDebi. It's easy to use, but there are some command line arguments that need attention:
  • -nopreproc is useful when the PDF already contains processed images and you don't want any other processing. Note that by default, this script will convert your document to black and white! Using this option you avoid any kind of conversion.
  • -resolution has a default value of 300 DPI. This is used when converting PDF pages to images and 300 is a good value. But if your document contains small text and you know/believe it may have been scanned at a higher DPI, specify it.
  • -lang must always be specified if you need to OCR in a language other than English. This parameter is passed to tesseract. The availability of languages depends on the installed tesseract-ocr-<lang_code> packages.
A simple pdfsandwich command will be:
pdfsandwich -lang eng input_document.pdf
The result will be input_document_ocr.pdf in the same folder as the initial document.
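For a folder full of scanned PDFs, the options above can be combined in a small loop. This is only a sketch: the function names ocr_output_name and ocr_all_pdfs are hypothetical, the defaults (eng, 300 DPI) mirror the values mentioned above, and -nopreproc is included on the assumption that the scans are already clean. Remove it to get the default pre-processing.

```shell
#!/usr/bin/env bash
# Name pdfsandwich gives its result, per the text above: foo.pdf -> foo_ocr.pdf
ocr_output_name() {
    echo "${1%.pdf}_ocr.pdf"
}

# Hypothetical wrapper: OCR every PDF in the current directory that has no
# *_ocr.pdf counterpart yet. $1 is the language code, $2 the resolution in DPI.
ocr_all_pdfs() {
    local lang=${1:-eng} dpi=${2:-300} f
    for f in *.pdf; do
        [ -e "$f" ] || continue                    # glob matched nothing
        case "$f" in *_ocr.pdf) continue ;; esac   # skip earlier results
        [ -e "$(ocr_output_name "$f")" ] && continue
        pdfsandwich -nopreproc -resolution "$dpi" -lang "$lang" "$f"
    done
}
```

Calling ocr_all_pdfs deu 400 would process German documents at 400 DPI, provided the matching tesseract language package is installed.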

2. PDF X-Change Viewer

This is a free PDF reader with a lot of other functions, provided by Tracker Software. It is a Windows-only application that runs in Wine. I tested the viewer in Wine 1.6, 1.7 and 1.8 and it worked great in all these versions. However, the OCR engine only worked with Wine 1.8, which is available from a PPA.

To install it in Linux, you must have Wine 1.8 installed (wine1.8:i386 package) and download the following files from Tracker Software:

  • Portable PDF Viewer archive: Portable version (ZIP) | 8 MB
  • Portable PDF Viewer OCR engine: Portable Version (OCR Lang Files) | 8 MB
  • Additional OCR languages: choose a package that contains the language(s) you are interested in.
Extract the ZIP file by right clicking it and choosing Extract Here. You should get a folder PDFX_Vwr_Port. Extract the OCR Lang files archive and you will get an ocrdats folder. Put this folder in the PDFX_Vwr_Port folder. You can now start PDFXCview.exe with wine and you can OCR English, German, French and Spanish documents.

If you want additional languages, download the Additional language packs archive. It is an exe file. Don't launch it, because it will not install. Instead, install the innoextract package and use it to unpack the file. Here is what I did with the EU language pack:
innoextract OCRAdditionalLangsEU.exe
You will get two folders (code:SetAppFolder|inst and code:SetEditorFolder|inst) with identical content. A language pack contains two files: <lang>.lng and <lang>_pxvocr.dat. You need to copy both files to the ocrdats folder. For example, to run OCR in Romanian, I copied rom.lng and ron_pxvocr.dat from one of those two folders.
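The copy step can be scripted so that a half-copied language pack is noticed early. The helper below is a hypothetical sketch, not something shipped with the viewer: install_lang_pack is an invented name, and it assumes both files of a pack share the same <lang> prefix, as the naming pattern above suggests. If the two files use different prefixes, copy them by hand instead.

```shell
#!/usr/bin/env bash
# Hypothetical helper: copy one language pack (<lang>.lng and
# <lang>_pxvocr.dat) from an innoextract output folder into ocrdats.
#   $1 = source folder, $2 = destination ocrdats folder, $3 = language prefix
install_lang_pack() {
    local src=$1 dest=$2 lang=$3 f
    # Refuse to copy anything unless both files are present
    for f in "$lang.lng" "${lang}_pxvocr.dat"; do
        if [ ! -f "$src/$f" ]; then
            echo "missing $src/$f" >&2
            return 1
        fi
    done
    mkdir -p "$dest"
    cp "$src/$lang.lng" "$src/${lang}_pxvocr.dat" "$dest/"
}
```

Note that the extracted folder names contain a | character, so they need quoting on the command line, for example install_lang_pack 'code:SetAppFolder|inst' PDFX_Vwr_Port/ocrdats followed by the language prefix.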

OCR in PDF X-Change Viewer
To launch OCR, load a document in the viewer and press the OCR button (1). Select page range (2), choose a language (3) and start (4).

Notes: in Wine 1.6, PDF X-Change Viewer crashed when launching OCR (on clicking the OK button). In Wine 1.7 it crashed after reaching 99% OCR progress. In Wine 1.8 it works without issues.


As you can see, you can OCR documents and produce searchable PDFs on Linux, and do it with free software. Of the two applications presented here, PDF X-Change Viewer is faster than Tesseract. Processing time is tied to accuracy, though, and Tesseract is known for being highly accurate.
