Some companies provide software for Windows with their scanners* that can create PDFs from scanned pages which look exactly like the scanned material (as if it were just full-page images) but the text is recognized and copyable.
How can I create PDFs like this on Ubuntu?
Note that I don't want to convert scanned text into regular text. I would like the resulting PDF to look exactly like the original pages, but with a recognized text layer added over it for ease of use.
I have a working high-resolution scanner which I use with XSane currently. It scans the pages fine and creates beautiful, high-DPI images.
* namely, Canon with LiDE 220
Answer
Preamble
You are looking for a PDF sandwich, i.e. a scanned PDF with an invisible layer of text (or a layer of text which is simply placed behind the picture of each page).
There are several ways to create one. I will use the paper "Term Weighting Approaches in Automatic Text Retrieval" as an example of a document that needs OCR.
The pdfsandwich command
First of all, install this tool from the repositories:
sudo apt install pdfsandwich
Then you can just run it on your PDF file and wait:
pdfsandwich document.pdf
In the past, this method was not very precise, especially with respect to text positioning. It seems that things have improved a lot since then. Example from the PDF:
Abstract–The experimental evidence accumulated over the past 20 years indicates that
If you highlight the text in Evince, black boxes are shown.
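As a minimal sketch, the pdfsandwich step can be wrapped in a small shell function that also checks whether text can be extracted from the result. This assumes the -lang and -o options of pdfsandwich and the pdftotext tool from poppler-utils; the function name make_sandwich and the file names are only illustrative.

```shell
#!/bin/sh
# Hypothetical helper: OCR a scanned PDF and check that text can be extracted.
# Assumes pdfsandwich (options -lang and -o) and pdftotext (poppler-utils).
make_sandwich() {
    input=$1
    lang=${2:-eng}                      # Tesseract language code, English by default
    output="${input%.pdf}_ocr.pdf"      # same naming pdfsandwich uses by default

    [ -f "$input" ] || { echo "no such file: $input" >&2; return 1; }
    command -v pdfsandwich >/dev/null 2>&1 || {
        echo "pdfsandwich is not installed" >&2
        return 1
    }

    # Run the OCR; the page images stay as they are, the text layer is added.
    pdfsandwich -lang "$lang" -o "$output" "$input" || return 1

    # If OCR worked, the first page should yield some text.
    if command -v pdftotext >/dev/null 2>&1; then
        pdftotext -l 1 "$output" - | head -n 5
    fi
}
```

For example, make_sandwich document.pdf eng should leave the result in document_ocr.pdf and print the first few recognized lines. For other languages the matching Tesseract language pack has to be installed as well.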
PDF-XChange Viewer
This is a freeware, Windows-only program that works perfectly under Wine if you use the 32-bit version in a 32-bit Wine prefix. For this, I suggest using PlayOnLinux because it makes it very easy to select the latest Wine version and to request a 32-bit prefix.
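If you would rather set up the prefix by hand than through PlayOnLinux, something along these lines may work. It is only a sketch: WINEARCH and WINEPREFIX are standard Wine environment variables, but the prefix path, the function name run_in_prefix32 and the installer path are placeholders, and it assumes a Wine build with 32-bit support is installed.

```shell
#!/bin/sh
# Hypothetical helper: run a Windows program inside a dedicated 32-bit Wine prefix.
# The prefix location is a placeholder and can be overridden via PDFX_PREFIX.
PDFX_PREFIX="${PDFX_PREFIX:-$HOME/.wine-pdfxchange}"

run_in_prefix32() {
    command -v wine >/dev/null 2>&1 || {
        echo "wine is not installed" >&2
        return 1
    }
    # WINEARCH only takes effect when the prefix is first created;
    # afterwards the prefix keeps its architecture.
    WINEARCH=win32 WINEPREFIX="$PDFX_PREFIX" wine "$@"
}
```

The first call, for example run_in_prefix32 /path/to/installer.exe, creates the prefix; later calls with the path of the installed program reuse it.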
Once installed, you can run it and select the OCR icon on the toolbar.
The output is usually very good and the text placement is precise. Example from the PDF:
Abstract--The experimental evidence accumulated over the past 20 years indicates that
If you highlight the text in Evince, the text is shown in a sans-serif font.
OCR.space
This is actually a web service. Go to ocr.space and select your file and language, then check the "Create searchable PDF with invisible text layer" option. Push the button and wait until the document is uploaded and converted.
Unfortunately, there is a bug with landscape (horizontal) pages: they are not rendered correctly in the output. I have notified the authors of this and they have acknowledged the problem.