
Python Program to Convert PDF to Text Easily

Use the PyPDF2 library in Python to convert a PDF to text by opening the file, reading pages with PdfReader, and extracting text using page.extract_text().

📋

Examples

Input: A PDF file with one page containing 'Hello World!'

Output: Hello World!

Input: A PDF file with two pages: first page 'Page One', second page 'Page Two'

Output: Page One Page Two

Input: An empty PDF file with no text

Output: (empty string)

🧠

How to Think About It

To convert a PDF to text, first open the PDF file in binary mode. Then use a PDF reading library to access each page. Extract the text content from each page and combine it into one string. Finally, output or save the combined text.

📐

Algorithm

Open the PDF file in binary read mode.

Create a PDF reader object to read the file.

Initialize an empty string to hold all text.

Loop through each page in the PDF.

Extract text from the current page and add it to the string.

After all pages are processed, print or return the combined text.

💻

Code

python

from PyPDF2 import PdfReader

# Open the PDF in binary read mode ('rb').
with open('sample.pdf', 'rb') as file:
    reader = PdfReader(file)
    text = ''
    for page in reader.pages:
        # Fall back to '' if a page yields no text.
        text += page.extract_text() or ''

print(text)

Output

Hello World!
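The script above adds no separator between pages, so text from adjacent pages can run together unless the extracted text already ends in whitespace. The following is a minimal sketch of a reusable variant that joins pages with a newline. The names join_page_text and pdf_to_text, and the choice of a newline separator, are assumptions made for illustration and are not part of PyPDF2.

```python
def join_page_text(page_texts, separator="\n"):
    """Join per-page text, skipping pages that yielded no text (None or '')."""
    parts = []
    for page_text in page_texts:
        if page_text:
            parts.append(page_text)
    return separator.join(parts)


def pdf_to_text(path, separator="\n"):
    """Return the text of the PDF at `path`, with pages joined by `separator`."""
    # Imported here so join_page_text can be used and tested without PyPDF2.
    from PyPDF2 import PdfReader

    with open(path, "rb") as file:
        reader = PdfReader(file)
        return join_page_text(
            (page.extract_text() for page in reader.pages), separator
        )
```

With this sketch, pdf_to_text('sample.pdf') would return the whole document as one string, which can then be printed or saved.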

🔍

Dry Run

Let's trace a PDF file with one page containing 'Hello World!' through the code.

Open PDF file

Open 'sample.pdf' in binary mode as file.

Create PDF reader

Create PdfReader object from file.

Initialize text

Set text = '' (empty string).

Loop through pages

For the single page, extract text 'Hello World!' and add to text.

Print text

Print the combined text: 'Hello World!'.

Step	Page Text Extracted	Combined Text
1	N/A	(not yet defined)
2	N/A	(not yet defined)
3	N/A	''
4	'Hello World!'	'Hello World!'
5	N/A	'Hello World!'
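The same trace can be reproduced without a real PDF. The sketch below runs the extraction loop over stand-in page objects and records the combined text after each page. FakePage and trace_extraction are hypothetical helpers written for this illustration; the second page returns None to show what the `or ''` fallback does.

```python
class FakePage:
    """Hypothetical stand-in for a PDF page; returns canned text."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


def trace_extraction(pages):
    """Run the extraction loop, recording (page number, extracted, combined)."""
    rows = []
    text = ""
    for number, page in enumerate(pages, start=1):
        extracted = page.extract_text() or ""  # None becomes ''
        text += extracted
        rows.append((number, extracted, text))
    return rows


if __name__ == "__main__":
    pages = [FakePage("Hello World!"), FakePage(None)]
    for number, extracted, combined in trace_extraction(pages):
        print(f"page {number}: extracted={extracted!r} combined={combined!r}")
```

The page with no text contributes an empty string, so the combined text is unchanged after it.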

💡

Why This Works

Step 1: Open PDF file

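We open the PDF file in binary mode because a PDF contains raw bytes, not plain text. In text mode, Python would try to decode those bytes as characters before the reader ever sees them.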

Step 2: Read PDF pages

Using PdfReader, we access each page's content inside the PDF.

Step 3: Extract and combine text

We extract text from each page with extract_text() and add it to a string to get all text together.

🔄

Alternative Approaches

pdfplumber

python

import pdfplumber

with pdfplumber.open('sample.pdf') as pdf:
    text = ''
    for page in pdf.pages:
        text += page.extract_text() or ''
print(text)

pdfplumber, which is built on pdfminer.six, often handles complex layouts and tables better than PyPDF2. It is a separate package, installed with pip install pdfplumber.

textract

python

import textract

text = textract.process('sample.pdf').decode('utf-8')
print(text)

textract supports many file types through a single process() call by delegating to external tools. This can make it slower, and those tools must be installed separately.

⚡

Complexity: O(n) time, O(n) space

Time Complexity

The program reads each page once, so for pages of similar size the running time grows linearly with the number of pages n.

Space Complexity

All extracted text is stored in memory, so space grows with the total text size.
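If the combined text is too large to hold comfortably in memory, one option is to write each page's text out as soon as it is extracted. The following is a minimal sketch under that assumption. The names write_page_texts and pdf_to_text_file and the newline separator are illustrative choices, and the PDF library may still keep its own data in memory.

```python
import io


def write_page_texts(page_texts, out_stream, separator="\n"):
    """Write each page's text to out_stream as it arrives; return characters written."""
    written = 0
    for page_text in page_texts:
        # Treat None (no extractable text) as an empty page.
        written += out_stream.write((page_text or "") + separator)
    return written


def pdf_to_text_file(pdf_path, txt_path):
    """Stream the text of the PDF at pdf_path into the file at txt_path."""
    from PyPDF2 import PdfReader  # imported lazily; only needed for real PDFs

    with open(pdf_path, "rb") as pdf_file, \
            open(txt_path, "w", encoding="utf-8") as out:
        reader = PdfReader(pdf_file)
        # The generator extracts one page at a time instead of building one string.
        return write_page_texts(
            (page.extract_text() for page in reader.pages), out
        )


if __name__ == "__main__":
    # Demonstration with canned page texts instead of a real PDF.
    buffer = io.StringIO()
    count = write_page_texts(["Page One", None, "Page Two"], buffer)
    print(count, repr(buffer.getvalue()))
```

Here the loop itself holds only one page's text at a time, so its memory use depends on the largest page and not on the whole document.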

Which Approach is Fastest?

PyPDF2 is fast for simple text extraction; pdfplumber is slower but better for complex layouts; textract is slower due to external dependencies.

Approach	Time	Space	Best For
PyPDF2	O(n)	O(n)	Simple PDFs with mostly text
pdfplumber	O(n)	O(n)	Complex PDFs with layout
textract	O(n)	O(n)	Multiple file types, OCR support

💡

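Install PyPDF2 with pip install PyPDF2 before running the code. PyPDF2 is no longer developed under that name; the project continues as pypdf, which provides the same PdfReader class.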

⚠️

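Beginners often forget to open the PDF file in binary mode ('rb'). In text mode, Python tries to decode the file's bytes as characters, which typically fails or hands the reader data it cannot parse.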
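To see the difference between the two modes without any PDF library, the sketch below writes a few PDF-like bytes to a temporary file and reads them back both ways. SAMPLE_BYTES is illustrative only (a header line plus a comment line of non-ASCII bytes, similar to what many PDF writers emit) and is not a complete PDF. read_both_ways is a hypothetical helper, and UTF-8 is assumed as the text encoding.

```python
import os
import tempfile

# Illustrative bytes only: a PDF-style header and a line of non-ASCII bytes.
SAMPLE_BYTES = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"


def read_both_ways(data):
    """Write data to a temp file, then read it in binary and in text mode.

    Returns (bytes read in 'rb' mode, text read in 'r' mode or the
    UnicodeDecodeError raised while decoding).
    """
    with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
        tmp.write(data)
        path = tmp.name
    try:
        with open(path, "rb") as f:
            binary_result = f.read()  # raw bytes, unchanged
        try:
            with open(path, "r", encoding="utf-8") as f:
                text_result = f.read()  # must decode bytes to str
        except UnicodeDecodeError as exc:
            text_result = exc
        return binary_result, text_result
    finally:
        os.remove(path)


if __name__ == "__main__":
    binary_result, text_result = read_both_ways(SAMPLE_BYTES)
    print(type(binary_result).__name__, type(text_result).__name__)
```

Binary mode returns the bytes exactly as stored, while text mode fails on the non-ASCII bytes because they are not valid UTF-8.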