Need to read a PDF file into your project? You got this!
So what is it?
pdftotext, by Jason Alan Palmer, is a Python library that extracts text from PDF files. It works with most PDF files, including password-protected ones.
So how do I install it?
Debian / Ubuntu
First ensure that you have all the dependencies.
sudo apt-get install build-essential libpoppler-cpp-dev pkg-config python3-dev
Then install pdftotext.
sudo pip3 install pdftotext
So how do I use it?
Open your favourite Python editor and let's get started.
import pdftotext
Import the pdftotext library.
with open("/path/to/PDF/file.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)
Load the PDF file into Python and read it into an object.
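Password-protected files need one extra piece of information. Here is a minimal sketch, assuming the password is passed as the second argument to pdftotext.PDF, as the library's README describes. The helper name load_pdf and its parameters are made up for this example.

```python
def load_pdf(path, password=None):
    """Return the pages of the PDF at `path` as a list of strings."""
    # Imported here so the helper can be defined before pdftotext is installed.
    import pdftotext

    with open(path, "rb") as f:
        if password is None:
            pdf = pdftotext.PDF(f)
        else:
            # Assumption: the password is the second positional argument.
            pdf = pdftotext.PDF(f, password)
        # Copy the pages out while the file is still open.
        return list(pdf)
```

Calling load_pdf("secure.pdf", "secret") should then give a plain list of page strings, where both the filename and the password are placeholders.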
for page in pdf:
    print(page)
For every page in the PDF document, read it and print it to the Python shell / REPL.
But I want to read just a specific page!
print(pdf[0])
This will print the first page of text. Pages are counted from 0, so just change the number to get the page you want.
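Counting pages from 0 is easy to get wrong, so here is a rough sketch of a helper that takes the page number as a person would say it (starting from 1) and complains if the page does not exist. It assumes the loaded PDF supports len() and indexing like a list. The name page_text is made up for this example, and a plain list stands in for the PDF so the snippet runs on its own.

```python
def page_text(pages, number):
    """Return page `number` (counting from 1) from a sequence of page strings."""
    if not 1 <= number <= len(pages):
        raise ValueError(f"page {number} is out of range 1..{len(pages)}")
    # Sequences are indexed from 0, so page 1 lives at index 0.
    return pages[number - 1]


# Stand-in for a loaded PDF: any sequence of strings behaves the same way.
sample = ["first page", "second page", "third page"]
print(page_text(sample, 2))
```

With a real document, pass the pdf object in place of sample.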
Can you use it for something a little more useful?
Perhaps we can build a quick project to read a PDF and then store the document as an audio file for playback in a media player.
We start by importing three libraries.
import pdftotext
from gtts import gTTS
from sys import argv
- pdftotext to get the text from the PDF file.
- gTTS to convert text to speech. If it is not installed yet, install it with sudo pip3 install gTTS.
- argv, which holds the arguments (extra information) passed to a command.
with open(argv[1], "rb") as f:
    pdf = pdftotext.PDF(f)
To open a file which I specify in the terminal as an argument, I tell Python to look at argv[1], the first argument after the script name, which will be the filename. The contents of the PDF are read into the object pdf.
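If the command is run without a filename, there is nothing for Python to open. A small sketch of a check that could sit in front of the open() call. The name pdf_path_from_args is made up for this example, and a hand-written list stands in for the real argv.

```python
def pdf_path_from_args(args):
    """Return the filename from an argv-style list, or exit with a usage hint."""
    # args[0] is the script name, so exactly one extra item is expected.
    if len(args) != 2:
        raise SystemExit(f"Usage: {args[0]} FILE.pdf")
    return args[1]


# Stand-in for sys.argv as it would look after: python3 pdf-test.py test.pdf
print(pdf_path_from_args(["pdf-test.py", "test.pdf"]))
```

In the real script, the list to pass in would be argv itself.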
document = "\n\n".join(pdf)
tts = gTTS(document)
Next I create an object called document and in there I store the entire PDF as one big string! This is then sent to tts to be converted into audio.
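A scanned PDF without a text layer can come back as empty pages, which leaves nothing to turn into speech. Here is a rough sketch of joining the pages with a guard for that case. The name join_pages is made up for this example, and a list of strings stands in for the pdf object.

```python
def join_pages(pages):
    """Join page strings into one document, refusing to return an empty one."""
    document = "\n\n".join(pages)
    # Whitespace alone does not count as text worth reading aloud.
    if not document.strip():
        raise ValueError("no text found in the PDF")
    return document


# Stand-in for a loaded PDF with two pages of text.
print(join_pages(["Page one.", "Page two."]))
```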
print("Saving Audio file")
tts.save(argv[1] + ".mp3")
The audio is then saved to an MP3 file, which takes the filename of the original PDF document with .mp3 added to the end (so test.pdf becomes test.pdf.mp3). I also added a quick print to say that the file is being saved.
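Adding .mp3 to the end of the PDF's filename keeps the .pdf part in the name. If a cleaner name is preferred, one option is to swap the extension with pathlib from the standard library. A minimal sketch, where mp3_name is a made-up helper for this example:

```python
from pathlib import Path


def mp3_name(pdf_name):
    """Swap the final extension of `pdf_name` for .mp3."""
    return str(Path(pdf_name).with_suffix(".mp3"))


print(mp3_name("test.pdf"))  # prints test.mp3
```

The result could then be handed to tts.save() in place of the joined string.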
Complete Code Listing
import pdftotext
from gtts import gTTS
from sys import argv

with open(argv[1], "rb") as f:
    pdf = pdftotext.PDF(f)

document = "\n\n".join(pdf)
tts = gTTS(document)
print("Saving Audio file")
tts.save(argv[1] + ".mp3")
Testing the code
I saved the code as pdf-test.py, then opened a terminal in the directory holding both the script and a test PDF.
I ran the code.
python3 pdf-test.py test.pdf
And after a few moments I found an audio file in the directory which I can play in a media player.
Taking it a little further
I wanted to make a system-wide command out of this project, so I added a line at the top of the Python code to tell the system to run the script with the Python 3 interpreter.
#! /usr/bin/env python3
Then in the terminal I changed the file permissions so that the file can be executed as an application.
chmod +x pdf-test.py
The last thing was to copy the script to the /usr/bin directory and change its name to pdftoaudio.
sudo cp pdf-test.py /usr/bin/pdftoaudio
Now typing the command pdftoaudio and passing the file name of the PDF will create an MP3 of the PDF in the current directory.