
Tooling Tuesday: pdftotext

Need to read a PDF file into your project? You got this!

So what is it?

pdftotext, from Jason Alan Palmer, is a Python library that extracts text from PDF files. It works with most PDF files, including password-protected ones.

So how do I install it?

Debian / Ubuntu
First ensure that you have all the dependencies.

sudo apt-get install build-essential libpoppler-cpp-dev pkg-config python3-dev

Then install pdftotext.

sudo pip3 install pdftotext

So how do I use it?

Open your favourite Python editor and let's get started.

import pdftotext

Import the pdftotext library.

with open("/path/to/PDF/file.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

Load the PDF file into Python and read it into an object.
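
For a password-protected file, the pdftotext README shows the password being passed as a second argument to pdftotext.PDF. Here is a minimal sketch of that idea. The helper open_pdf and its parameter names are my own invention for illustration, not part of the library, and I am assuming an empty password is what the library expects for unprotected files.

```python
def open_pdf(path, password=""):
    """Load a PDF and return the pdftotext.PDF object (hypothetical helper)."""
    # Imported inside the function so the helper can be defined
    # even on a machine where pdftotext is not installed yet.
    import pdftotext

    with open(path, "rb") as f:
        # Assumption: an empty password is fine for unprotected files.
        return pdftotext.PDF(f, password)
```

It would be called as pdf = open_pdf("/path/to/PDF/file.pdf", "secret"), where "secret" stands in for the real password.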

for page in pdf:
    print(page)

For every page in the PDF document, read it and print it to the Python shell / REPL.
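
If you want to see where each page starts, a small variation is to number the pages as you loop. This is only a sketch: number_pages is a hypothetical helper, and sample_pages is a plain list standing in for the pdf object, which also gives you one string per page.

```python
def number_pages(pages):
    """Return each page's text prefixed with a 1-based page header."""
    return [f"--- Page {n} ---\n{text}" for n, text in enumerate(pages, start=1)]


# Stand-in for a pdftotext.PDF object (one string per page).
sample_pages = ["First page text", "Second page text"]
for block in number_pages(sample_pages):
    print(block)
```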

But I want to read just a specific page!

No worries!

print(pdf[0])

This will print the first page of text. Pages are counted from 0, so change the number to get the page that you want.
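
Counting from 0 is easy to get wrong, so here is a sketch of a helper that takes the page number as a reader would count it, starting from 1, and complains if the page does not exist. get_page is a hypothetical name of mine. It relies only on len() and indexing, both of which the pdf object supports.

```python
def get_page(pages, number):
    """Return the text of page `number`, counting from 1 as a reader would."""
    if not 1 <= number <= len(pages):
        raise IndexError(f"page {number} is out of range (1 to {len(pages)})")
    # Convert the human page number to Python's 0-based index.
    return pages[number - 1]
```

With this, print(get_page(pdf, 1)) does the same job as print(pdf[0]).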

Can you use it for something a little more useful?

Perhaps we can build a quick project to read a PDF and then store the document as an audio file for playback in a media player.

For this I am using Google Text to Speech, which I covered in an earlier Tooling Tuesday :)

We start by importing three libraries.

import pdftotext
from gtts import gTTS
from sys import argv

  • pdftotext, to get the data from the PDF file.
  • gTTS, to convert text to speech.
  • argv, which is used to take arguments (extra information for a command).

with open(argv[1], "rb") as f:
    pdf = pdftotext.PDF(f)

To open a file that I specify in the terminal as an argument, I tell Python to look for argv[1], which will be the filename. The file is opened as the object f, and the contents of the PDF are read into the object pdf.
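
If the command is run without a filename, argv[1] does not exist and Python stops with an IndexError. A friendlier option, sketched here with a hypothetical helper called pdf_path_from_args, is to check the arguments first and print a usage hint.

```python
def pdf_path_from_args(args):
    """Return the PDF filename from an argv-style list, or exit with a usage hint."""
    if len(args) != 2:
        prog = args[0] if args else "script"
        # SystemExit prints the message and ends the program.
        raise SystemExit(f"Usage: {prog} file.pdf")
    return args[1]
```

In the project this would be called as pdf_path_from_args(argv), and the result used in place of argv[1].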

    document = "\n\n".join(pdf)
    tts = gTTS(document)

Next I create a string called document and in there I store the entire PDF as one big string, with a blank line between each page! This is then passed to gTTS, and the result is stored in tts ready to be converted into audio.
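
Some PDFs, scanned documents for example, contain pages with no extractable text at all. As far as I am aware gTTS will not accept empty text, so it may be worth checking before handing the string over. The sketch below uses a hypothetical helper, join_pages, which skips blank pages and raises an error if nothing is left.

```python
def join_pages(pages):
    """Join page texts into one string, skipping pages with no text."""
    non_empty = [text.strip() for text in pages if text.strip()]
    if not non_empty:
        raise ValueError("no text could be extracted from the PDF")
    return "\n\n".join(non_empty)
```

Here document = join_pages(pdf) would take the place of the "\n\n".join(pdf) line.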

    print("Saving Audio file")
    tts.save(argv[1]+".mp3")

The audio is then saved to an MP3 file, named after the original PDF document with .mp3 added to the end, so test.pdf becomes test.pdf.mp3. I also added a quick print to say that the file is being saved.
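
Because argv[1]+".mp3" simply adds to the end of the name, the output looks like test.pdf.mp3. If you would rather swap the extension, pathlib from the standard library can do it. This is a sketch, and mp3_name is a hypothetical helper.

```python
from pathlib import Path


def mp3_name(pdf_path):
    """Return the output filename with the last extension swapped for .mp3."""
    return str(Path(pdf_path).with_suffix(".mp3"))


print(mp3_name("test.pdf"))  # test.mp3
```

The save line would then become tts.save(mp3_name(argv[1])).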

Complete Code Listing

import pdftotext
from gtts import gTTS
from sys import argv
with open(argv[1], "rb") as f:
    pdf = pdftotext.PDF(f)
    document = "\n\n".join(pdf)
    tts = gTTS(document)
    print("Saving Audio file")
    tts.save(argv[1]+".mp3")

Testing the code

I saved the code and then opened a terminal in the directory where the code and a test PDF were stored.
Then I ran the code.

python3 pdf-test.py test.pdf

And after a few moments I found an audio file in the directory which I can play in a media player.

Taking it a little further

I wanted to make a system-wide command out of this project, so I added a line at the top of the Python code to tell the operating system where to find the Python 3 interpreter.

#! /usr/bin/env python3

Then in the terminal I changed the file permissions so that the file can be executed as an application.

chmod +x pdf-test.py

The last thing was to copy the script to the /usr/bin directory and change the name to pdftoaudio.

sudo cp pdf-test.py /usr/bin/pdftoaudio

Now typing the command pdftoaudio and passing the file name of the PDF will create an MP3 of the PDF in the current directory.

Happy Hacking