Welcome to 2021

Welcome to 2021, and a new outlook for this site!

There is no doubt that the past year has been a bleak one, in which we:

  • lost loved ones to a new coronavirus disease, now known as Covid-19;

  • watched the world transform into a masked one, as it became the norm to put on a mask in public;

  • saw business and all educational activities grind to a virtual halt as a consequence of humankind’s failure to come to terms with what was taking place.

    Whether this is interpreted as apocalyptic, a historic tragedy or a passing phase, its end result was a disaster for us.

    Anyway, the picture looks promising with talk of an imminent global vaccination programme. Countries like the United Kingdom have already activated theirs, and we now look to our own Africa to join before the end of this first quarter.

    While this blog is personal and reflects my musings from time to time, I am also changing tack.

    • No longer is this site dedicated to general musings: instead, I will be focusing on Linux accessibility.

    • Old posts on various subjects will remain for historical reasons, but won’t be updated to reflect any change in fact or opinion.

    • In doing so, I hope to help not only my close friends, but also colleagues, whether here in Zimbabwe or abroad, who wish to understand more about Linux accessibility.

Thanks once again for patronising this site. I hope to keep you hooked with every post, tutorial and tip on how a blind or visually impaired professional can use Linux successfully.

OCR from the Terminal with Tesseract

:EXPORT_FILE_NAME: ocr-from-the-terminal

:EXPORT_DESCRIPTION: While plain text is nowadays cheap to access, being the primary form of encoding printed material, digitised material still poses a challenge. This post looks at one option for working with digitised content on Linux.

The Problem

You have just come across that great speech made by a 19th-century scientist on the future of electricity. The problem is that it is in a scanned document saved as a PNG or PDF file. How are you going to access it?

The Solution: Use the Tesseract Engine

Fortunately, you can work with digitised content when you use the Tesseract Engine to convert your image files to plain text.

Tesseract falls into the class of programs known as optical character recognition (OCR) tools, the same class to which applications like Open Book, ABBYY FineReader and Kurzweil belong.

The only minor setback with Tesseract is that it does not work directly with PDF files, which means you have to use the pdftoppm command as an intermediary.

Getting Tesseract

On Debian/Ubuntu platforms, you can get the Tesseract application from your repository. Simply do:

sudo apt-get install -y tesseract-ocr

On Fedora, you can use dnf and type:

sudo dnf install tesseract
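Once installed, you can confirm that the program is on your PATH. A quick sanity check (the guard keeps it harmless on systems where Tesseract is absent):

```shell
# Check whether tesseract is available, and print its version if so.
if command -v tesseract >/dev/null 2>&1; then
    tesseract --version 2>&1 | head -n 1
else
    echo "tesseract not found; install it first"
fi
```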

Using Tesseract

The syntax for using Tesseract is to call it with the name of the image you want to convert, followed by the name of the destination plain-text file.

For example, to convert faraday-paper.png, we can do it like this:

tesseract faraday-paper.png faraday-paper

Note that the second argument to Tesseract has no file extension: Tesseract appends .txt itself, producing faraday-paper.txt. If you supplied faraday-paper.txt as the second argument, you would end up with faraday-paper.txt.txt.

Dealing with PDF files

However, most of the scanned documents you come across may already be saved as PDF files. Many PDFs produced these days are accessible, but those containing digitised material may not be.

Tesseract, unfortunately, does not handle PDF files. The first time I tried to run it on a PDF file, I got this message:

ishe@brainpower:~/conversions$ tesseract trial-of-dedan-kimathi.pdf trial
Tesseract Open Source OCR Engine v4.1.1 with Leptonica
Error in pixReadStream: Pdf reading is not supported
Error in pixRead: pix not read
Error during processing.
ishe@brainpower:~/conversions$

So to get around that limitation, we have to employ another tool: pdftoppm.

How pdftoppm Works

pdftoppm is already installed on most modern Linux platforms, so you should find it on your computer. Its purpose is to extract PDF pages and turn them into images in formats such as PNG.

If a document has 30 pages, for example, pdftoppm converts each page to a separate PNG file. This means that you often end up with a folder full of the pages of the PDF file.

What I normally do when I want to convert a document is to move it into a dedicated directory in my home and call pdftoppm like this:

pdftoppm -png trial-of-dedan-kimathi.pdf trial

You can see that the syntax for conversion is similar to that of Tesseract:

  • You call it with at least two arguments;
  • You pass the option -png to tell pdftoppm that you want PNG output;
  • The first argument is the name of the PDF file you want to convert;
  • The second argument is the name of the destination file;
  • As pdftoppm outputs many files, the second argument serves only as the prefix for each output file: for example, trial-01.png, trial-02.png, etc.
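To see how this prefix-plus-page-number naming behaves with shell globs, here is a small simulation. The page images are faked with touch, so no real PDF or pdftoppm run is needed; the two-digit padding follows the article's example (pdftoppm pads page numbers to a fixed width):

```shell
# Simulate pdftoppm's output naming in a scratch directory.
mkdir -p /tmp/ppm-demo && cd /tmp/ppm-demo
touch trial-01.png trial-02.png trial-03.png

# The two-digit page numbers make ?? a convenient glob,
# and the files sort in page order automatically.
ls trial-??.png
```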

Final conversion with Tesseract from PDF

Having converted the document to multiple PNG files, we could convert each file to plain text by hand. However, this is time-consuming and, at best, boring.

So, as with most boring and repetitive tasks, we write a script to handle the dirty part for us. The script is simple:

  • It receives as input the base prefix of PNG files;

  • It then goes through all the matching PNG files in the directory and converts them individually to plain text with tesseract;

  • Finally, it combines all the text files into a single output file we passed to it as the second argument.

    Let’s name this Bash script ocr-convert and call it like this:

    ocr-convert base finaltext
    

    Where:

    • base is the prefix that our script should look for. It is easy to obtain: it is the second argument we passed to pdftoppm above, which used it to form the base prefix of its output files. Each file begins with that name, followed by a dash and a number representing the page number in the original PDF file.

    • finaltext is the name that the final converted document should have.

      To do its work, the script uses one other command, cat, for concatenating text files. So let’s go to work!

Writing the Script

So let us fire up our favourite text editor such as vim or emacs and type in the following:

#!/bin/bash
# Usage: ocr-convert.sh <png-prefix> <output-file>
input=$1
output=$2

# OCR each page image (e.g. trial-01.png) into page-trial-01.png.txt.
for i in "${input}"-??.png; do
    tesseract "$i" "page-$i"
done

# Combine the per-page text files, in page order, into the final document.
cat page-* > "$output"

And save this as ocr-convert.sh

Next, grant appropriate permissions to tell our system that this is an executable file with:

chmod +x ocr-convert.sh

And to run it, either:

  1. Place it in a directory on your PATH, where you can call it like any other command, e.g. ocr-convert.

  2. Or, just type the full path to where it is. For example, if it is in the current working directory, just type ./ocr-convert.sh and pass the arguments you want.

    That’s it for the script!

    However, note that this script is basic and does no error-checking to see whether files matching its first argument actually exist. In a production program, this is the first thing you must add before deploying it.
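As a sketch of what that error-checking could look like (the file name ocr-check.sh is hypothetical, and the tesseract calls are omitted so that only the argument checks are shown):

```shell
# Write a hypothetical checking wrapper to /tmp for demonstration.
cat > /tmp/ocr-check.sh <<'EOF'
#!/bin/bash
input=$1
output=$2

# Both arguments must be supplied.
if [ -z "$input" ] || [ -z "$output" ]; then
    echo "usage: $0 <png-prefix> <output-file>" >&2
    exit 1
fi

# At least one matching page image must exist.
shopt -s nullglob
pages=( "${input}"-??.png )
if [ "${#pages[@]}" -eq 0 ]; then
    echo "error: no files matching ${input}-??.png" >&2
    exit 1
fi
echo "found ${#pages[@]} page(s)"
EOF
chmod +x /tmp/ocr-check.sh

# Calling it with no arguments now fails cleanly instead of OCR-ing nothing.
/tmp/ocr-check.sh 2>&1 || true
```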

Is there a Simpler way?

Certainly, there is. In fact, the reason why many programs exist is to shield you from this stuff.

Instead of going through the process of calling pdftoppm ourselves, I have put up a script that I use daily. While it is rough, it serves me well.

  • It is called ocr-convert;
  • You call it like this:
ocr-convert input.pdf output.txt

Thus, you simply pass it the PDF file you want to convert as its first argument, and it gives you back a plain-text document with the name you supply as its second argument.

You can download the script here, and make sure to chmod it to grant the necessary permissions: just type chmod +x ocr-convert before using it.

If you want to make sure that this script is safe, you can open it in your text editor and inspect its code. It is basic and heavily commented:

  1. It takes the name of your pdf file.

  2. It calls pdftoppm using that file.

  3. It then calls tesseract using the output of the pdftoppm.

  4. Finally, it combines all the pages into a single document and then cleans up.

    The end result is that you will only have the document you started with (in PDF form) and the plain-text output you wanted. Thus, you do not even need to create a dedicated directory for your conversions.
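The four steps above can be sketched as a script. This is a hypothetical reconstruction for illustration, not the actual downloadable script; the Page-temp prefix and the file names are assumptions. It is written to a file so it can be syntax-checked without pdftoppm or tesseract installed:

```shell
# A sketch of the PDF-to-text pipeline; saved to a file so we can
# syntax-check it without running the external tools.
cat > /tmp/pdf2text-sketch.sh <<'EOF'
#!/bin/bash
# Usage: pdf2text-sketch.sh <input.pdf> <output.txt>
pdf=$1
out=$2

# 1. Split the PDF into page images with a temporary prefix.
pdftoppm -png "$pdf" Page-temp

# 2. OCR each page image; tesseract appends .txt to each output base.
for i in Page-temp-*.png; do
    tesseract "$i" "$i"
done

# 3. Concatenate the per-page text files into the final document.
cat Page-temp-*.txt > "$out"

# 4. Clean up the intermediate files.
rm -f Page-temp-*.png Page-temp-*.txt
EOF

# Verify the sketch is syntactically valid shell.
bash -n /tmp/pdf2text-sketch.sh && echo "syntax OK"
```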

Caveats

There are two caveats to be aware of when using this script:

    • First, the directory in which you run the script must not contain any other PNG files, because they will be deleted soon after the conversion is done.

    • Second, your output file name must not begin with the string “Page-temp”, because such a file would also be deleted during the cleanup stage.

      I think I will revise the script someday to address these problems. But as I may not see the urgency in that, this is a bug you can fix yourself: just edit the script to suit your workflow. That is the beauty of open source.

Conclusion

Being able to work with both computer-produced text and digitised material is great in this age where you are expected to be on top of your game. If you are a researcher, a student or an instructor, it is imperative that you master the skills of converting text from one form to the other. I understand that programming may not seem the attractive option when you think of conversion, but like most things in life, investing in whatever can enhance your career may be the best path for your development and productivity.

Thanks for going through this article. Until next time, have a pleasant conversion experience!

Ishe Chinyoka
Access Technology Instructor

My research interests include operating systems, access technology, programming, and science fiction.