Tooling Tuesday: pdftotext

Need to read a PDF file into your project? You got this!

So what is it?

pdftotext, by Jason Alan Palmer, is a Python library that extracts text from PDF files. It works with most PDF files, including password-protected ones.

So how do I install it?

Debian / Ubuntu

First ensure that you have all the dependencies.

sudo apt-get install build-essential libpoppler-cpp-dev pkg-config python3-dev

Then install pdftotext.

sudo pip3 install pdftotext

So how do I use it?

Open your favourite Python editor and let's get started.

import pdftotext

Import the pdftotext library.

with open("/path/to/PDF/file.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

Open the PDF file in binary mode and read it into an object called pdf.
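
Since the library also handles password-protected files, here is a minimal sketch of a loader that covers both cases. The function name load_pdf, the paths and the password are my own placeholders, not part of the library. The password is simply passed as the second argument to pdftotext.PDF.

```python
def load_pdf(path, password=None):
    """Open the PDF at `path` and return a pdftotext.PDF object.

    `load_pdf` is a hypothetical helper name, not part of the library.
    """
    import pdftotext  # imported here so the helper can be defined anywhere

    with open(path, "rb") as f:
        if password:
            # Password-protected file: pass the password as the second argument.
            return pdftotext.PDF(f, password)
        return pdftotext.PDF(f)


# Example usage (placeholder paths and password):
# pdf = load_pdf("/path/to/PDF/file.pdf")
# secure = load_pdf("/path/to/PDF/secure.pdf", "secret")
```

The object that comes back can be used exactly like pdf in the rest of this post.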

for page in pdf:
    print(page)

For every page in the PDF document, read it and print it to the Python shell / REPL.

But I want to read just a specific page!

No worries!

print(pdf[0])

This will print the first page of text. Pages are counted from zero, so change the number to get the page that you want.
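
Counting from zero catches everyone out sooner or later, so here is a small sketch of a helper that takes a "human" page number instead. The name get_page is hypothetical, and I am using a plain list of strings to stand in for the pdf object, which supports len() and indexing in the same way.

```python
def get_page(pages, number):
    """Return page `number`, counting from 1, from a sequence of page strings."""
    if not 1 <= number <= len(pages):
        raise IndexError(f"page {number} is out of range (1 to {len(pages)})")
    return pages[number - 1]  # convert to the zero-based index


# A plain list stands in for a pdftotext.PDF object here.
sample = ["first page", "second page", "third page"]
print(get_page(sample, 2))  # prints "second page"
```

With a real document you would call get_page(pdf, 2) to get the second page.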

Can you use it for something a little more useful?

Perhaps we can build a quick project to read a PDF and then store the document as an audio file for playback in a media player.

For this I am using Google Text to Speech, which I covered in an earlier Tooling Tuesday :)

We start by importing three libraries.

import pdftotext
from gtts import gTTS
from sys import argv

pdftotext to get the text from the PDF file.
gTTS to convert text to speech.
argv to read arguments, the extra information passed to a command.

with open(argv[1], "rb") as f:
    pdf = pdftotext.PDF(f)

To open a file which I specify in the terminal as an argument, I tell Python to look for argv[1], which will be the filename. The file is opened as f, and the contents of the PDF are read into the object pdf.
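
If I forget to give a filename, argv[1] does not exist and Python stops with an IndexError. A rough sketch of a friendlier check is below. The function name pdf_argument is my own invention, and it takes the argument list as a parameter so that it can be tried out without a terminal.

```python
import sys


def pdf_argument(args):
    """Return the PDF filename from an argv-style list, or exit with a usage message."""
    if len(args) != 2:
        # sys.exit with a string prints it to stderr and exits with status 1.
        sys.exit(f"Usage: {args[0]} file.pdf")
    return args[1]


print(pdf_argument(["pdf-test.py", "test.pdf"]))  # prints "test.pdf"
```

In the script itself this would be called as pdf_argument(sys.argv) before opening the file.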

    document = "\n\n".join(pdf)
    tts = gTTS(document)

Next I create a string called document, and in there I store the entire PDF as one big string, with a blank line between each page. This is then passed to gTTS, ready to be converted into audio.
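
Some pages, such as scanned images with no text layer, can come back empty, and as far as I know gTTS will not accept an empty string. Here is a hedged sketch of a slightly more careful join that trims each page and skips the blank ones. The name pages_to_text is hypothetical, and a list of strings stands in for the pdf object.

```python
def pages_to_text(pages):
    """Join page strings into one string, skipping pages with no text."""
    cleaned = [page.strip() for page in pages]
    return "\n\n".join(page for page in cleaned if page)


sample = ["Page one.\n", "   \n", "Page two.\n"]
print(repr(pages_to_text(sample)))  # prints 'Page one.\n\nPage two.'
```

If the result is an empty string, there is nothing to read aloud and the script could stop there.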

    print("Saving Audio file")
    tts.save(argv[1]+".mp3")

The audio is then saved to an MP3 file, which takes the filename of the original PDF document with .mp3 added to the end, so test.pdf becomes test.pdf.mp3. I also added a quick print to say that the file is being saved.
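
The script as written adds .mp3 to the end of the full filename, so test.pdf gives test.pdf.mp3. If you would rather end up with test.mp3, one option is to swap the extension using pathlib from the standard library. This is only a sketch, and audio_path is a name I made up.

```python
from pathlib import Path


def audio_path(pdf_path):
    """Return the MP3 path for a PDF: same name, extension swapped to .mp3."""
    return Path(pdf_path).with_suffix(".mp3")


print(audio_path("test.pdf"))  # prints "test.mp3"
# tts.save(str(audio_path(argv[1])))  # how it would slot into the script
```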

Complete Code Listing

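import pdftotext
from gtts import gTTS
from sys import argv

# Open the PDF named on the command line and extract its text.
with open(argv[1], "rb") as f:
    pdf = pdftotext.PDF(f)
    # Join all of the pages into one big string.
    document = "\n\n".join(pdf)
    # Convert the text to speech.
    tts = gTTS(document)
    print("Saving Audio file")
    # Save the audio next to the PDF, e.g. test.pdf.mp3
    tts.save(argv[1]+".mp3")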

Testing the code

I saved the code and then opened a terminal in the directory where the code and a test PDF were stored.

I ran the code.

python3 pdf-test.py test.pdf

After a few moments I found an audio file in the directory, which I could play in a media player.

Taking it a little further

I wanted to make a system-wide command out of this project, so I added a line at the very top of the Python code to tell the operating system where to find the Python 3 interpreter.

#! /usr/bin/env python3

Then in the terminal I changed the file permissions so that the file can be executed as an application.

chmod +x pdf-test.py

The last thing was to copy the script to the /usr/bin directory and change the name to pdftoaudio.

sudo cp pdf-test.py /usr/bin/pdftoaudio

Now typing the command pdftoaudio followed by the file name of a PDF will create an MP3 of that PDF, saved alongside the original file.