How to create high-fidelity PDFs with copyable text from scans?

Some companies provide software for Windows with their scanners* that can create PDFs from scanned pages which look exactly like the scanned material (as if it were just full-page images) but the text is recognized and copyable.

How can I create PDFs like this on Ubuntu?

Note that I don't want to convert the scanned pages into plain text. I want the resulting PDF to look exactly like the original pages, with a recognized text layer added on top for ease of use.

I have a working high-resolution scanner which I use with XSane currently. It scans the pages fine and creates beautiful, high-DPI images.

* namely, Canon with LiDE 220

1 Answer

Preamble

You are looking for a PDF sandwich, i.e. a scanned PDF with an invisible layer of text (or a layer of text which is simply placed behind the picture of each page).

There are several ways to create one. I will use the paper "Term Weighting Approaches in Automatic Text Retrieval" as an example of a document that needs OCR.

The pdfsandwich command

First of all, install this tool from the repositories:

sudo apt install pdfsandwich

Then you can just run it on your PDF file and wait:

pdfsandwich document.pdf
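If the defaults are not enough, a couple of options are worth knowing. The sketch below wraps the call in a small shell function: `-lang` picks the Tesseract language and `-o` sets the output file. The function names `ocr_name` and `make_sandwich`, the file name and the `eng` default are placeholders of mine, and the matching Tesseract language pack is assumed to be installed. As far as I know, without `-o` the result is written next to the input as document_ocr.pdf, which is the naming that `ocr_name` mimics.

```shell
#!/bin/sh
# Sketch only: function names, file names and the "eng" default are assumptions.

# Derive the output name the same way pdfsandwich does by default:
# document.pdf -> document_ocr.pdf
ocr_name() {
    printf '%s_ocr.pdf\n' "${1%.pdf}"
}

# OCR one PDF.
#   $1 = input PDF
#   $2 = Tesseract language code (optional, defaults to eng)
make_sandwich() {
    in=$1
    lang=${2:-eng}
    pdfsandwich -lang "$lang" -o "$(ocr_name "$in")" "$in"
}

# Example call (uncomment to run on a real file):
# make_sandwich document.pdf eng
```

Languages can reportedly be combined with a plus sign (for instance `eng+deu`) if a document mixes them.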

Screenshot of Evince showing a PDF sandwich

In the past, this method was not very precise, especially with regard to text positioning. It seems to have improved considerably since then. Example from the PDF:

Abstract–The experimental evidence accumulated over the past 20 years indicates that

If you highlight the text in Evince, black boxes are shown.
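To check from the terminal that a text layer really exists, without opening a viewer, something like the following should work. It is a minimal sketch that assumes `pdftotext` from the poppler-utils package is installed; the function name `has_text_layer` and the file name are made up for illustration.

```shell
#!/bin/sh
# Sketch only: assumes pdftotext (poppler-utils) is installed.

# Succeeds if any non-whitespace text can be extracted from page 1 of the PDF.
#   $1 = PDF file to check
has_text_layer() {
    # -f 1 -l 1 limits extraction to the first page; "-" writes to stdout.
    pdftotext -f 1 -l 1 "$1" - 2>/dev/null | grep -q '[^[:space:]]'
}

# Example call (uncomment to run on a real file):
# has_text_layer document_ocr.pdf && echo "text layer found"
```

A plain image-only scan should fail this check, while the output of a successful OCR run should pass it.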

PDF-XChange Viewer

This is a freeware, Windows-only program that works perfectly under Wine, provided you use the 32-bit version in a 32-bit Wine prefix. I suggest setting it up with PlayOnLinux, which makes it very easy to select the latest Wine version and to request a 32-bit prefix.
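If you would rather not use PlayOnLinux, a 32-bit prefix can also be created by hand. This is only a sketch of that alternative: it assumes Wine with 32-bit support is already installed, and the prefix path, the installer file name and the `WINE_BIN` override variable are placeholders of mine, not anything Wine or the program defines.

```shell
#!/bin/sh
# Sketch only: prefix path, installer name and WINE_BIN are assumptions.

PREFIX="$HOME/.wine-pdfxchange"

# Create a fresh 32-bit prefix. WINEARCH only matters when the prefix
# is first created; it cannot convert an existing 64-bit prefix.
setup_prefix() {
    WINEPREFIX="$1" WINEARCH=win32 wineboot --init
}

# Run a Windows program inside the given prefix.
#   $1 = prefix directory, remaining arguments = program and its arguments
# WINE_BIN is a hypothetical override for the wine binary (handy for dry runs).
run_in_prefix() {
    prefix=$1
    shift
    WINEPREFIX="$prefix" WINEARCH=win32 "${WINE_BIN:-wine}" "$@"
}

# Example calls (uncomment to run; the installer name is a placeholder):
# setup_prefix "$PREFIX"
# run_in_prefix "$PREFIX" PDFXVwer-installer.exe
```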

Once installed, you can run it and select the OCR icon on the toolbar:

Screenshot of PDF-XChange Viewer under Wine

The output is usually very good and the text placement is precise. Example from the PDF:

Abstract--The experimental evidence accumulated over the past 20 years indicates that

If you highlight the text in Evince, the text is shown in a sans-serif font.

OCR.space

This is actually a web service. Go to ocr.space and select your file and language, then check the "Create searchable PDF with invisible text layer" option. Push the button and wait until the document is uploaded and converted.
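The same conversion can probably be scripted, since OCR.space also offers an HTTP API. The sketch below reflects my understanding of that API and should be checked against the current documentation: the endpoint, the `isCreateSearchablePdf` and `isSearchablePdfHideTextLayer` parameters and the `SearchablePDFURL` response field are as I remember them, while `OCR_SPACE_API_KEY`, the function names and the file name are placeholders. You need your own API key.

```shell
#!/bin/sh
# Sketch only: parameter and field names should be verified against the
# OCR.space API docs; OCR_SPACE_API_KEY and the file name are assumptions.

# Upload a file and ask for a searchable PDF with a hidden text layer.
#   $1 = file to upload
#   $2 = language code (optional, defaults to eng)
# Prints the raw JSON response.
ocr_space_request() {
    file=$1
    lang=${2:-eng}
    curl -sS \
        -H "apikey: $OCR_SPACE_API_KEY" \
        -F "file=@$file" \
        -F "language=$lang" \
        -F "isCreateSearchablePdf=true" \
        -F "isSearchablePdfHideTextLayer=true" \
        https://api.ocr.space/parse/image
}

# Crude extraction of the download link from the JSON on stdin.
# A real JSON parser such as jq would be more robust.
pdf_url() {
    sed -n 's/.*"SearchablePDFURL": *"\([^"]*\)".*/\1/p'
}

# Example (uncomment to run with a real key and file):
# ocr_space_request document.pdf eng | pdf_url
```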

Unfortunately, there is a bug with horizontal (landscape) pages: they are not rendered correctly in the output. I have reported this to the authors and they have acknowledged the problem.
