How to Extract Images from pdf in Python

How to Extract Images from pdf in Python

In this tutorial, we will learn about how to extract images from pdf in python with different python libraries.

Here first, we will learn about how to read pdf files in python, then extract them, and at last, we will save them.

Read Pdf file in Python

We cannot read pdf files directly using python. Instead, we need to install the necessary libraries using pip package installation.

To read pdf files, we will use the PyMuPDF python package that can access files like PDF, OpenXPS, XPS, EPUB, and many other extensions. And to install PyMuPDF, we can follow the below step.

pip install PyMuPDF

We will use fitz() function, which is used to read or process pdf or other files with PyMuPDF.

Then we will use a fantastic python package called Pillow, which is used for image processing and image manipulation.

To install Pillow, we will use the below pip command.

pip install Pillow

We have to install the necessary libraries now. After that, we can follow the below steps to extract images from pdf files.

Extract Images from pdf

Step 1: First, we will import the required packages.

import fitz # PyMuPDF
import io
from PIL import Image

Step 2: Now, we will read and process the pdf file into python.

# file path you want to extract images from
file = "DemoFile.pdf"
# open the file
pdf_file = fitz.open(file)

Step 3: In the final step, we will do the main code of the program by iterating a pdf file using for loop to process pdf pages one by one.

# iterate over PDF pages
for page_index in range(len(pdf_file)):
    # get the page itself
    page = pdf_file[page_index]
    image_list = page.getImageList()
    # printing number of images found in this page
    if image_list:
        print(f"[+] Found a total of {len(image_list)} images in page {page_index}")
    else:
        print("[!] No images found on page", page_index)
    for image_index, img in enumerate(page.getImageList(), start=1):
        # get the XREF of the image
        xref = img[0]
        # extract the image bytes
        base_image = pdf_file.extractImage(xref)
        image_bytes = base_image["image"]
        # get the image extension
        image_ext = base_image["ext"]
        # load it to PIL
        image = Image.open(io.BytesIO(image_bytes))
        # save it to local disk
        image.save(open(f"image{page_index+1}_{image_index}.{image_ext}", "wb"))

Here we will first check the number of pages inside the pdf file, and one by one, it will process the pages on the pdf file and detect the images inside the page, and once it finds it and saves it in the desired locations.

Inside the iterator, we are making a list of all the images available inside the page using the getImageList(), and after that, we use the extractImage() function. 

Also, if you are interested to learn Mouse and Keyboard automation using Python, you must check this out.

The whole program will look as follow.

import fitz # PyMuPDF
import io
from PIL import Image
# file path you want to extract images from
file = "DemoFile.pdf"
# open the file
pdf_file = fitz.open(file)
# iterate over PDF pages
for page_index in range(len(pdf_file)):
    # get the page itself
    page = pdf_file[page_index]
    image_list = page.getImageList()
    # printing number of images found in this page
    if image_list:
        print(f"[+] Found a total of {len(image_list)} images in page {page_index}")
    else:
        print("[!] No images found on page", page_index)
    for image_index, img in enumerate(page.getImageList(), start=1):
        # get the XREF of the image
        xref = img[0]
        # extract the image bytes
        base_image = pdf_file.extractImage(xref)
        image_bytes = base_image["image"]
        # get the image extension
        image_ext = base_image["ext"]
        # load it to PIL
        image = Image.open(io.BytesIO(image_bytes))
        # save it to local disk
        image.save(open(f"image{page_index+1}_{image_index}.{image_ext}", "wb"))

Extract text from pdf using PyPDF2 

In this method, we will use the PyPDF2 package to extract the text, and in the method, we don’t require other packages like the above method. We can directly extract text from pdf.

To install the PyPDF2 package, we will follow the below command on your respected operating systems.

pip install PyPDF2

You can also use the PyPDF or PyPDF3 version, but all three versions will work.

Once the PyPDF2 package is installed, we will start to wring the program to read the pdf file, convert all the pages into text, and print it on the given destination terminal or IDE.

Follow the below steps to extract text from the pdf file.

Step 1: The first step will be to import the PyPDF2 package.

#import the PyPDF2 module
import PyPDF2

Step 2: Now, we will read the pdf file and process it will the PyPDF2 using PdfFileReader() function.

#open the PDF file
PDFfile = open('DemoFile.pdf', 'rb')
PDFfilereader = PyPDF2.PdfFileReader(PDFfile)

Step 3: Here, we will find the number of pages in our pdf files. This will print the total number of pages with an index starting from zero.

#print the number of pages
print(PDFfilereader.numPages)

Step 4: Now, we will specify the page we want to extract and print the text content of the given page.

#provide the page number
pages = PDFfilereader.getPage(8)

#extracting the text in PDF file
print(pages.extractText())

The extractText() function will extract all the text from the page specify in getPage() function.

Step 5: We will close a pdf file as our text has been extracted.

#close the PDF file
PDFfile.close()

The Whole program will look like this.

#import the PyPDF2 module
import PyPDF2

#open the PDF file
PDFfile = open('DemoFile.pdf', 'rb')
PDFfilereader = PyPDF2.PdfFileReader(PDFfile)

#print the number of pages
print(PDFfilereader.numPages)

#provide the page number
pages = PDFfilereader.getPage(8)

#extracting the text in PDF file
print(pages.extractText())

#close the PDF file
PDFfile.close()

Final Words

In this article, we have learned how to extract images from text from the pdf file, and reading pdf files in python code is not easy; it needs separate libraries to process and read it. But with our easy tutorial, we can very quickly extract the images and text from the pdf file. Also, please let us know via email if you have a suggestion for our blogs.

FAQs

How do I extract images from a PDF?

PyMuPDF package is used to extract images from a pdf file in python.PyMuPDF extract images from PDF detecting all the images from the pdf file and please note that it will not convert pdf pages into images inside it will just extract image if there is one.

How do I convert a PDF to an image in Python?

There are many ways to convert a pdf to an image in python we can use the pdf2image library which is the most popular in converting  pages into images.

How do I read a scanned PDF in Python?

Yes, you can read a scanned pdf in python using the PyMuPDF library.

How do I convert a PDF to a DataFrame in Python?

You can convert PDF to a DataFrame in python using pandas and tabula-py library but pdf must contain tabular data inside otherwise one data will be converted into a  dataframe. It will Extract specific data from PDF using Python if it will get the tabular or table data.

How do I convert PDF to PNG in Python?

Yes you can convert pdf pages into png format in python using pdf2image python package, which very easy to code and it will convert all the pdf pages into images in what format you want. We can also Convert PDF to image python opencv but it will be very hand and take long to convert it.

Can I resize the images when I extract them from a PDF file using PyPDF2?

Yes, you can resize the images when you extract them from a PDF file using PyPDF2. Once you have accessed the image using the XObject attribute of the page object, you can use the resize method from the Image module of PIL to resize the image.

Are there any other libraries I can use to extract images from a PDF in Python?

Yes, there are other libraries you can use to extract images from a PDF in Python, such as PyMuPDF, pdfminer, and pdftotext. Each library has its own strengths and weaknesses, so you may want to try different libraries to see which one works best for your specific use case.