Python Program to Convert PDF to Text Easily
Use the PyPDF2 library in Python to convert a PDF to text: open the file in binary mode, read its pages with PdfReader, and extract the text of each page with page.extract_text().
Examples
How to Think About It
Algorithm
Code
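
    from PyPDF2 import PdfReader

    # Open the PDF in binary mode and keep it open while pages are read.
    with open('sample.pdf', 'rb') as file:
        reader = PdfReader(file)
        text = ''
        for page in reader.pages:
            # Fall back to '' when a page yields no text.
            text += page.extract_text() or ''

    print(text)
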
Dry Run
Let's trace a PDF file with one page containing 'Hello World!' through the code.
Open PDF file
Open 'sample.pdf' in binary mode as file.
Create PDF reader
Create PdfReader object from file.
Initialize text
Set text = '' (empty string).
Loop through pages
For the single page, extract text 'Hello World!' and add to text.
Print text
Print the combined text: 'Hello World!'.
| Step | Page Text Extracted | Combined Text |
|---|---|---|
| 1 | N/A | '' |
| 2 | N/A | '' |
| 3 | N/A | '' |
| 4 | 'Hello World!' | 'Hello World!' |
| 5 | N/A | 'Hello World!' |
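The trace above can be replayed without a real PDF. The sketch below is only an illustration: FakePage and combine_text are hypothetical names, not part of PyPDF2, and FakePage merely mimics the one method the loop relies on. It also shows why the loop uses `or ''`: a page that yields no text does not break the concatenation.

```python
class FakePage:
    """Hypothetical stand-in for a PDF page (not a PyPDF2 class)."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        # Mimics the only method the extraction loop calls.
        return self._text


def combine_text(pages):
    """Same loop as the main program, applied to any page-like objects."""
    text = ''
    for page in pages:
        text += page.extract_text() or ''
    return text


# One page containing 'Hello World!', as in the dry run.
print(repr(combine_text([FakePage('Hello World!')])))
# prints 'Hello World!'

# A second page with no extractable text leaves the result unchanged.
print(repr(combine_text([FakePage('Hello World!'), FakePage(None)])))
# prints 'Hello World!'
```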
Why This Works
Step 1: Open PDF file
We open the PDF file in binary mode to read its raw data correctly.
Step 2: Read PDF pages
Using PdfReader, we access each page's content inside the PDF.
Step 3: Extract and combine text
We extract text from each page with extract_text() and add it to a string to get all text together.
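The three steps can also be wrapped in reusable functions that save the result to a .txt file. This is a minimal sketch, not the only way to structure it: pdf_to_text and save_text are hypothetical helper names, and joining pages with a newline is a choice made here (the program above concatenates pages with no separator).

```python
from pathlib import Path


def pdf_to_text(pdf_path):
    """Return the text of every page in the PDF, one page per line group."""
    # Imported inside the function so save_text() stays usable
    # even where PyPDF2 is not installed.
    from PyPDF2 import PdfReader

    with open(pdf_path, 'rb') as file:          # Step 1: open in binary mode
        reader = PdfReader(file)                # Step 2: read the pages
        # Step 3: extract each page's text and combine it
        return '\n'.join(page.extract_text() or '' for page in reader.pages)


def save_text(text, txt_path):
    """Write the text as UTF-8 and return the number of characters written."""
    return Path(txt_path).write_text(text, encoding='utf-8')
```

Under these assumptions, save_text(pdf_to_text('sample.pdf'), 'sample.txt') would write the extracted text next to the source file.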
Alternative Approaches

    # Alternative 1: pdfplumber
    import pdfplumber

    with pdfplumber.open('sample.pdf') as pdf:
        text = ''
        for page in pdf.pages:
            text += page.extract_text() or ''

    print(text)

    # Alternative 2: textract
    import textract

    text = textract.process('sample.pdf').decode('utf-8')
    print(text)

Complexity: O(n) time in the number of pages, O(n) space in the size of the extracted text
Time Complexity
The program reads each page once, so time grows linearly with the number of pages.
Space Complexity
All extracted text is stored in memory, so space grows with the total text size.
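If holding all the text in memory is a concern, one option is to write each page's text out as soon as it is extracted, so only one page's text is held at a time. The sketch below assumes this trade-off is wanted; write_pages is a hypothetical helper, and the SimpleNamespace objects are stand-ins for real pages so the example runs without a PDF.

```python
import io
from types import SimpleNamespace


def write_pages(pages, out):
    """Write each page's text to `out` immediately; return the page count."""
    count = 0
    for page in pages:
        out.write(page.extract_text() or '')
        out.write('\n')  # separator between pages (a choice made here)
        count += 1
    return count


# Stand-in pages; with PyPDF2 these would be the items of reader.pages.
pages = [
    SimpleNamespace(extract_text=lambda: 'Page one'),
    SimpleNamespace(extract_text=lambda: None),
    SimpleNamespace(extract_text=lambda: 'Page three'),
]

buffer = io.StringIO()  # a file opened with open(..., 'w') works the same way
written = write_pages(pages, buffer)
print(written)                   # prints 3
print(repr(buffer.getvalue()))   # prints 'Page one\n\nPage three\n'
```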
Which Approach is Fastest?
PyPDF2 is generally fast for simple text extraction. pdfplumber tends to be slower but handles complex layouts better. textract is usually the slowest of the three because it hands the work to external tools.
| Approach | Time | Space | Best For |
|---|---|---|---|
| PyPDF2 | O(n) | O(n) | Simple PDFs with mostly text |
| pdfplumber | O(n) | O(n) | Complex PDFs with layout |
| textract | O(n) | O(n) | Multiple file types, OCR support |
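Since all three approaches share the same asymptotic cost, the practical difference is best checked by measuring on your own files. The following is a rough timing sketch, not a rigorous benchmark: time_extractor is a hypothetical helper, and `extract` is assumed to be any function that takes a file path and returns text.

```python
import time


def time_extractor(extract, path, repeats=3):
    """Return the best wall-clock time, in seconds, over `repeats` runs."""
    best = float('inf')
    for _ in range(repeats):
        start = time.perf_counter()
        extract(path)
        best = min(best, time.perf_counter() - start)
    return best


# Demonstration with a dummy extractor so the sketch runs without a PDF.
elapsed = time_extractor(lambda path: path.upper(), 'sample.pdf')
print(f'best of 3 runs: {elapsed:.6f} s')
```

To compare the libraries, wrap each approach in a function that accepts a path and pass it in place of the dummy lambda.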
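Install the library with pip install PyPDF2 before running the code. PyPDF2 is no longer developed; the project continues as pypdf, which provides the same PdfReader and extract_text() calls.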