# Building a Simple OCR Application with Tesseract

Welcome to this exciting post on building a simple application for optical character recognition (OCR) on the Linux terminal!

As with any other project, we need to identify the problem we have, propose a solution (that is, how the application ought to work to address the problem) and show how to put that into practice.

We will, of course, be writing a Bash script. Even if you use another shell interactively, you can still run shell scripts with the Bash that comes installed on most Linux systems.

The issue we are to deal with is that of text-processing: how do we extract text from images or scanned PDF documents?

So let’s start!

## The Problem

You just came across that great speech made by a 19th-century scientist on the future of electricity. The problem is that it is in a scanned document saved as a PNG or PDF file. How are you going to access it?

## The Solution: Use the Tesseract Engine

Fortunately, you can work with digitised content when you use the Tesseract Engine to convert your image files to plain text.

Tesseract falls into the class of programs known as optical character recognition tools. Applications such as Open Book, Abbyy Fine Reader and Kurzweil belong to the same class.

The only minor setback with Tesseract is that it does not directly work with PDF files, which means you have to use the pdftoppm command to do the intermediary job.

### Getting Tesseract

On Debian/Ubuntu platforms, you can get the Tesseract application from your repository. Simply do:

    sudo apt-get install -y tesseract-ocr


On Fedora, you can use dnf and type:

    sudo dnf install tesseract
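
Before going further, it can help to confirm that the tools used in this post are actually on your PATH. The following is a minimal sketch: the function name `missing_tools` is made up for illustration, and it relies only on the shell's `command -v` builtin.

```shell
#!/bin/bash
# Hypothetical helper: print the name of every listed command that is
# not found on the PATH, and return non-zero if any is missing.
missing_tools() {
    local tool rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            rc=1
        fi
    done
    return "$rc"
}

# Usage (commented out so the block has no side effects):
# missing_tools tesseract pdftoppm || echo "install the tools listed above"
```

If the function prints nothing, both tools were found.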


### Using Tesseract

The syntax for using Tesseract is to call it, passing it the name of the image you want to convert and the name of the final destination file in plain text.

For example, to convert faraday-paper.png, we can do it like this:

    tesseract faraday-paper.png faraday-paper


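Note that the second argument to Tesseract has no file extension. Tesseract appends .txt to it automatically, so the text ends up in faraday-paper.txt. If you pass faraday-paper.txt as the second argument, the output is named faraday-paper.txt.txt.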
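
Because Tesseract adds the .txt extension itself, a small wrapper can derive the output name from the image name. This is only a sketch: the function names `output_base` and `ocr_image` are made up for illustration, and it assumes the image name has an extension.

```shell
#!/bin/bash
# Hypothetical helper: strip the last extension from an image name,
# e.g. faraday-paper.png -> faraday-paper.
# Assumes the file name actually contains an extension.
output_base() {
    printf '%s\n' "${1%.*}"
}

# Hypothetical wrapper: OCR one image into a text file next to it.
# Tesseract appends ".txt" to the output base on its own.
ocr_image() {
    tesseract "$1" "$(output_base "$1")"
}

# Usage: ocr_image faraday-paper.png   (writes faraday-paper.txt)
```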

### Dealing with PDF files

However, most of the time, the scanned documents you come across are already saved as PDF files. Most PDF files produced these days are accessible, but those containing digitised material may not be.

Tesseract, unfortunately, does not handle PDF files. The first time I tried to run it on a PDF file, I got this message:

    ishe@brainpower:~/conversions$ tesseract trial-of-dedan-kimathi.pdf trial
    Tesseract Open Source OCR Engine v4.1.1 with Leptonica
    Error in pixReadStream: Pdf reading is not supported
    Error in pixRead: pix not read
    Error during processing.
    ishe@brainpower:~/conversions$


So to get around that limitation, we have to employ another tool: pdftoppm.
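
If a script has to decide whether a file needs the extra conversion step, one option is to look at the file's first bytes, since PDF files begin with the signature `%PDF-`. The sketch below assumes that check is good enough for our purposes; the function name `is_pdf` is hypothetical.

```shell
#!/bin/bash
# Hypothetical check: succeed if the file starts with the PDF signature "%PDF-".
is_pdf() {
    [ "$(head -c 5 -- "$1" 2>/dev/null)" = "%PDF-" ]
}

# Usage:
# if is_pdf trial-of-dedan-kimathi.pdf; then echo "convert with pdftoppm first"; fi
```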

### How pdftoppm Works

pdftoppm is already installed on most modern Linux platforms, so you should find it on your computer. Its purpose is to extract PDF pages and turn them into images in formats such as PNG.

If a document has 30 pages, for example, pdftoppm converts each page to a separate PNG file. This means that you often end up with a folder full of the pages of the PDF file.

What I normally do when I want to convert a document is to move it into a dedicated directory in my home and call pdftoppm like this:

    pdftoppm -png trial-of-dedan-kimathi.pdf trial


You can see that the syntax is similar to that of Tesseract:

• You call it with at least two arguments;
• The option -png tells pdftoppm that we want PNG output;
• The first argument is the name of the PDF file you want to convert;
• The second argument is the name of the destination;
• As pdftoppm outputs many files, the second argument only serves as the prefix for each output file, for example trial-01.png, trial-02.png, etc.
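
The call above can be wrapped in a small function. The sketch below also passes -r to set the resolution in DPI, since pdftoppm defaults to 150 DPI and OCR tends to benefit from more detail. The function name `pdf_to_pngs` and the 300 DPI default are assumptions for illustration, not part of the workflow described here.

```shell
#!/bin/bash
# Hypothetical wrapper around pdftoppm.
#   $1 = PDF file
#   $2 = output prefix
#   $3 = resolution in DPI (assumed default: 300)
pdf_to_pngs() {
    local pdf="$1" prefix="$2" dpi="${3:-300}"
    pdftoppm -png -r "$dpi" "$pdf" "$prefix"
}

# Usage: pdf_to_pngs trial-of-dedan-kimathi.pdf trial
```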

### Final conversion with Tesseract from PDF

Having converted the document to multiple PNG files, we could convert each file to plain text by hand. However, this is time-consuming and, at best, boring.

So like most boring and repetitive tasks, we have to write a script to handle that dirty part for us. The script is simple:

• It receives as input the base prefix of PNG files;

• It then goes through the list of all matching PNG files in the directory and converts them individually to plain text with tesseract;

• Finally, it combines all the text files into a single output file we passed to it as the second argument.

Let’s call this Bash script ocr-convert and call it like this:

    ocr-convert base finaltext


Where:

• base is the prefix that our script should look for. It is simple to obtain, because it is the second argument we passed to pdftoppm above. As pdftoppm converted the PDF pages, it used that argument as the prefix for its output files: each file name begins with the prefix, followed by a dash, then the page number from the original PDF file.

• finaltext means the final document name that our converted document should have from our script.
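
One detail worth a sketch is the order in which the pages are combined. A plain glob sorts names as text, so page-10 would come before page-2 if the numbers were not zero-padded. The file names shown earlier are zero-padded, so this does not bite here, but if the files come from elsewhere, GNU sort's -V option keeps them in numeric order. The function name `combine_pages` is hypothetical, and the sketch assumes GNU sort and file names without newlines.

```shell
#!/bin/bash
# Hypothetical helper: concatenate per-page text files in numeric page order.
#   $1 = prefix of the per-page files (matches <prefix>-*.txt)
#   $2 = combined output file (must not itself match <prefix>-*.txt)
# Assumes GNU sort (for -V) and file names that contain no newlines.
combine_pages() {
    local prefix="$1" output="$2" page
    printf '%s\n' "$prefix"-*.txt | sort -V | while IFS= read -r page; do
        cat -- "$page"
    done > "$output"
}

# Usage: combine_pages page finaltext
```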

To do its work, the script uses one other command, the cat command for reading in text files. So let’s go to work!

#### Writing the Script

So let us fire up our favourite text editor such as vim or emacs and type in the following:

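    #!/bin/bash
    # Usage: ocr-convert.sh base finaltext
    input=$1
    output=$2
    # OCR each page image; ?? matches two-digit page numbers such as trial-01.png
    for i in "${input}"-??.png; do
        tesseract "$i" "page-$i"
    done
    # Join the per-page text files into the final document
    cat page-* > "$output"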


Save this as ocr-convert.sh.

Next, grant appropriate permissions to tell our system that this is an executable file with:

    chmod +x ocr-convert.sh


To run it, either:

1. Place it in a directory on your PATH (dropping the .sh extension if you like), so that you can call it like any other command, e.g. ocr-convert.

2. Or type the full path to it. For example, if it is in the current working directory, type ./ocr-convert.sh followed by the arguments you want.
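
For the first option, the copy and the permission change can be done in one step with the `install` command. This is a sketch under assumptions: `~/.local/bin` is only a guess at a directory on your PATH (check with `echo $PATH`), and the function name `install_script` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: copy a script into a bin directory under a new name
# and mark it executable in one step.
#   $1 = script file
#   $2 = target directory (assumed default: ~/.local/bin)
#   $3 = command name     (assumed default: ocr-convert)
install_script() {
    local src="$1" dir="${2:-$HOME/.local/bin}" name="${3:-ocr-convert}"
    mkdir -p "$dir" && install -m 755 "$src" "$dir/$name"
}

# Usage: install_script ocr-convert.sh
```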

That’s it for the script!

However, note that this script is basic and does not do any error-checking to see whether the files passed to it as the first argument indeed exist. In a production program, this is the first thing you must do before deploying it.
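
As a rough idea of what such a check could look like, here is a minimal sketch. The function name `check_input` and the exact messages are made up; it only verifies that both arguments were given and that at least one PNG file with the given prefix exists.

```shell
#!/bin/bash
# Hypothetical validation step for a script called as:
#   ocr-convert.sh base finaltext
# Prints a message to stderr and returns non-zero when something is wrong.
check_input() {
    local base="$1" output="$2"
    if [ -z "$base" ] || [ -z "$output" ]; then
        echo "usage: ocr-convert.sh base finaltext" >&2
        return 2
    fi
    # Reuse the positional parameters to expand the glob. If nothing
    # matches, the pattern stays literal and the -e test below fails.
    set -- "$base"-*.png
    if [ ! -e "$1" ]; then
        echo "no PNG files found with prefix '${base}-'" >&2
        return 1
    fi
}

# Usage inside the script: check_input "$1" "$2" || exit
```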

#### Is there a Simpler way?

Certainly, there is. In fact, the reason why many programs exist is to shield you from this stuff.

Instead of calling pdftoppm ourselves, I have put together a script that I use daily. While it is rough, it does the job.

• It is called ocr-convert;[1]
• You call it like this:
    ocr-convert input.pdf output.txt


Thus, you simply pass it as its first argument the PDF file you want to convert, and it processes it to give you back the plain text document with the name you supply it as its second argument.

You can download the script here. Make sure to grant it execute permission before using it: just type chmod +x ocr-convert.

If you want to make sure that this script is safe, you can open it in your text editor and read its code. It is basic and heavily commented:

1. It takes the name of your pdf file.

2. It calls pdftoppm using that file.

3. It then calls tesseract using the output of the pdftoppm.

4. Finally, it combines all the pages into a single document and then cleans up.

The end result is that you are left with only the PDF document you started with and the plain-text output you wanted. Thus, you do not even need to create a dedicated directory for carrying out your conversions.
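
To make the four steps concrete, here is a rough sketch of how they could fit together. It is not the ocr-convert script itself: the function name `pdf_to_text`, the `page` prefix and the explicit scratch-directory argument are all assumptions made for illustration.

```shell
#!/bin/bash
# Sketch of the four steps listed above; NOT the actual ocr-convert script.
#   $1 = input PDF
#   $2 = output text file
#   $3 = existing scratch directory for the intermediate page files
pdf_to_text() {
    local pdf="$1" output="$2" work="$3" page
    # Steps 1-2: render every page of the PDF as <work>/page-N.png
    pdftoppm -png "$pdf" "$work/page" || return 1
    # Step 3: OCR each page; tesseract writes <work>/page-N.txt
    for page in "$work"/page-*.png; do
        tesseract "$page" "${page%.png}" || return 1
    done
    # Step 4: join the pages in glob order, then remove the intermediate files
    cat "$work"/page-*.txt > "$output" || return 1
    rm -f "$work"/page-*.png "$work"/page-*.txt
}

# Usage: mkdir scratch && pdf_to_text input.pdf output.txt scratch
```

The glob order is only the page order as long as the page numbers are zero-padded, as in the file names shown earlier.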

##### Caveats

There are two caveats to using this script you have to be aware of:

• First, whichever directory you run this script in, it must not contain any other PNG files whose names start with “temp”, because they will be deleted as soon as the conversion is done.

• Second, the name of your output file must not begin with the string “Page-temp”, because the file would then be deleted during the cleanup stage.

I think I will revise the script someday to address these problems. But as I might not see the urgency in that, this may be a bug you can address yourself: just edit the script to suit your working behaviour. This is the beauty of open source.[2]
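
One possible way to address both caveats is to keep the intermediate files out of the working directory altogether. The sketch below is not taken from the ocr-convert script: the function name `with_scratch_dir` is hypothetical, and it simply combines `mktemp -d` with an EXIT trap inside a subshell.

```shell
#!/bin/bash
# Sketch: run a command with a private scratch directory that is removed
# afterwards, so no file in the current directory is deleted by mistake.
#   "$@" = command to run; the scratch directory is appended as its last argument
with_scratch_dir() {
    (
        scratch=$(mktemp -d) || exit 1
        # The trap fires when this subshell exits, whether or not the command failed.
        trap 'rm -rf "$scratch"' EXIT
        "$@" "$scratch"
    )
}

# Usage: with_scratch_dir my_conversion input.pdf output.txt
# where my_conversion is any function or script of yours that takes the
# scratch directory as its last argument.
```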

## Conclusion

Being able to work with both computer-produced text and digitised material is valuable in an age where you are expected to be on top of your game. If you are a researcher, a student or an instructor, it pays to master the skill of converting text from one form to another. I understand that programming may not seem the most attractive option when you think of conversion but, as with most things in life, what you invest in skills that enhance your career pays off in your development and productivity.

Thanks for going through this article. Until next time, have a pleasant conversion experience!

1. You should have guessed it already!

2. But honestly, I think it just makes sense to name your documents appropriately rather than ‘temp’ or ‘base-temp’ that are used during processing.

##### Ishe Chinyoka
###### Access Technology Instructor

My research interests include operating systems, access technology, programming, and science fiction.