Extract Text From a Multi-Column Document Using PyMuPDF in Python

By Harald Lieder - Wednesday, June 07, 2023

PyMuPDF is a Python binding for MuPDF, a lightweight PDF, XPS, and E-book viewer, renderer, and toolkit. It offers many features for manipulating PDF and other documents, such as extracting text and images, creating and modifying pages, adding annotations and form fields, encrypting and decrypting files, and more.

One of the advanced features of PyMuPDF is the ability to detect multi-column pages in supported document types. This can be useful for processing documents that have complex layouts, such as reports, newspapers, magazines, or academic papers. By identifying the text belonging to different columns on the page, you can extract it more accurately and preserve its logical structure.

In this blog post, we will show you how to use a PyMuPDF utility for detecting multiple columns in pages and extracting text along these columns. The utility supports a variable number of columns on the page. Text written on top of images can optionally be excluded, and footer lines can be skipped by specifying an appropriate bottom margin.

The Utility

The utility is a Python script named multi_column.py, which can be used as a command-line tool or imported as a module. The script contains a function named column_boxes, which takes a PyMuPDF page object as an input and returns a list of text boundary boxes that correspond to the columns on the page.

The function uses MuPDF’s text block detection capability to identify text blocks and uses their bounding boxes as the primary structuring principle. It also supports ignoring footers via a footer margin parameter and optionally ignoring text written on top of images.
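To make the first step more concrete, here is a minimal sketch of collecting text block boxes while skipping a footer stripe. It is an illustration of the idea, not the utility's actual code. It assumes blocks in the tuple layout returned by PyMuPDF's `page.get_text("blocks")`, that is `(x0, y0, x1, y1, text, block_no, block_type)` with `block_type` 0 for text. The helper name `text_block_boxes` is hypothetical.

```python
import math


def text_block_boxes(blocks, page_height, footer_margin=0):
    """Return integer bboxes of text blocks that lie above the footer stripe."""
    # In PyMuPDF page coordinates y grows downward, so the footer stripe
    # starts at page_height - footer_margin.
    limit = page_height - footer_margin
    boxes = []
    for x0, y0, x1, y1, _text, _block_no, block_type in blocks:
        if block_type != 0:  # not a text block (for example an image)
            continue
        if y1 > limit:  # block reaches into the footer stripe
            continue
        # Round outward so the integer box still covers the original block.
        boxes.append((math.floor(x0), math.floor(y0), math.ceil(x1), math.ceil(y1)))
    return boxes
```

In a real script, the `blocks` argument could come from `page.get_text("blocks")` and `page_height` from `page.rect.height`.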

The column_boxes function has the following signature:


def column_boxes(
    page: pymupdf.Page,
    footer_margin: int = 0,
    no_image_text: bool = True) -> list:
    """Return list of column bboxes for page."""

The parameters are:

page: a PyMuPDF page object. 

footer_margin: an integer that specifies the height of the bottom stripe to ignore on each page. Default is 0, as in the signature above.

no_image_text: a boolean that indicates whether to ignore text written on top of images. Default is True.

The return value is a list of pymupdf.IRect objects that represent the column boundary boxes. The boxes are sorted in ascending order of their top-left coordinates.
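The effect of the no_image_text option described above can be pictured with a small sketch. It assumes that "written on top of an image" means the text box lies entirely inside the image's bounding box. That criterion is an assumption made for illustration, and the helper names are hypothetical. Rectangles are plain `(x0, y0, x1, y1)` tuples here.

```python
def contains(outer, inner):
    """True if rectangle `inner` lies completely inside rectangle `outer`."""
    return (
        outer[0] <= inner[0]
        and outer[1] <= inner[1]
        and inner[2] <= outer[2]
        and inner[3] <= outer[3]
    )


def drop_text_on_images(text_boxes, image_boxes):
    """Keep only the text boxes that are not fully covered by any image box."""
    return [
        box
        for box in text_boxes
        if not any(contains(image, box) for image in image_boxes)
    ]
```

Text boxes that only partly overlap an image are kept by this sketch; whether that is desirable depends on the document.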

How to Use It

There are two ways to use the utility: as a command-line tool or as a module. In either case, PyMuPDF must be installed. There are no other dependencies.

As a Command-Line Tool

To use the utility as a command-line tool, run the following command:

python multi_column.py input.pdf footer_margin

Here, input.pdf is the PDF file you want to process and footer_margin is the height of the bottom stripe to ignore. The script is currently intended for demonstration purposes: on every page of input.pdf, the identified column boundary boxes are drawn with a red border. Near the top-left corner of each rectangle, its sequence number is written so that the order of text extraction can be followed easily.

Modify the script as needed, for instance to extract the text of each rectangle and write it to a file.
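One possible shape for such a modification is sketched below. The function names `write_columns` and `export_pdf` are hypothetical and not part of multi_column.py. The sketch only assumes that `column_boxes` behaves as described above and that `page.get_text(clip=rect, sort=True)` returns the plain text inside a rectangle.

```python
def write_columns(pages, find_columns, out):
    """Write the text of every detected column to `out`; return the column count."""
    count = 0
    for page_no, page in enumerate(pages, start=1):
        for col_no, rect in enumerate(find_columns(page), start=1):
            out.write(f"--- page {page_no}, column {col_no} ---\n")
            # Restrict extraction to the column box via the clip parameter.
            out.write(page.get_text(clip=rect, sort=True).rstrip() + "\n")
            count += 1
    return count


def export_pdf(pdf_path, txt_path, footer_margin=50):
    """Extract all columns of `pdf_path` into the text file `txt_path`."""
    # Imported here so write_columns can be tried without PyMuPDF installed.
    import pymupdf
    from multi_column import column_boxes

    doc = pymupdf.open(pdf_path)
    with open(txt_path, "w", encoding="utf-8") as out:
        return write_columns(
            doc, lambda page: column_boxes(page, footer_margin=footer_margin), out
        )
```

Calling `export_pdf("input.pdf", "columns.txt")` would then write one labeled section per column. Passing the column finder as a parameter keeps the writing logic independent of the detection step, which makes it easy to swap in a different margin or detector.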

As a Module

To use the utility as a module, you need to import it in your Python script and call the column_boxes function with a PyMuPDF page object as an argument.

For example, if you want to extract text from each column on each page of a PDF file named sample.pdf, you can write something like this:


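import pymupdf
from multi_column import column_boxes

doc = pymupdf.open("sample.pdf")
for page in doc:
    # Detect the column rectangles, ignoring a footer stripe of height 50.
    bboxes = column_boxes(page, footer_margin=50, no_image_text=True)
    for rect in bboxes:
        # Extract only the text inside this column rectangle.
        print(page.get_text(clip=rect, sort=True))
    print("-" * 80)  # separator line after each page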

This prints the text of each column in turn, followed by a line of dashes after each page.

Features of multi_column.py

  • Identifies text belonging to (a variable number of) columns on the page.
  • Text with a different background color is handled separately, allowing for easier treatment of side remarks, comment boxes, etc.
  • Uses text block detection capability to identify text blocks and uses the block boundary boxes as the primary structuring principle.
  • Supports ignoring footers via a footer margin parameter.
  • Supports ignoring text written on top of images to avoid its influence on layout detection.
  • Returns re-computed text boundary boxes (integer coordinates) sorted such that their text should reflect the intended reading sequence.

Notes

  • Only horizontal text written left-to-right is currently supported.
  • Standard page layout is supported, including changes of column counts. The utility works best for text-oriented pages.
  • Different text background colors and horizontal or vertical lines are used to help find column borders and other layout characteristics.
  • The utility relies on reasonably well-designed page layouts. Pages with overlaps between boundary boxes are likely to cause errors in finding the correct layout.
  • Currently, there is no support for detecting caption text for images and other objects. This text will be treated like normal text and may thus have an unwanted influence on layout detection.
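Because overlapping boundary boxes are a known source of trouble, it can help to check for them before extracting text. The following is a minimal sketch with hypothetical helper names. It treats rectangles as `(x0, y0, x1, y1)` sequences and counts boxes that merely touch along an edge as non-overlapping.

```python
from itertools import combinations


def intersects(a, b):
    """True if rectangles a and b (x0, y0, x1, y1) share some interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def overlapping_pairs(boxes):
    """Return index pairs (i, j) with i < j for boxes that overlap each other."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(boxes), 2)
        if intersects(a, b)
    ]
```

PyMuPDF rectangles are sequence-like, so the list returned by `column_boxes(page)` should be usable here directly. A non-empty result marks pages that may need special handling.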

Examples

Here are some examples of successful column detection:

  • Simple Layout: Header followed by two columns:
Figure: Simple multi-column detection.

  • Intermediate Layout: Header followed by three columns, followed by a comment:
Figure: Intermediate multi-column detection.

  • Complex Layout: Header, two 3-column areas separated by an intermediate header and a comment:
Figure: Complex multi-column detection.

Here are some examples of problem cases:

  • Overlapping text boundary boxes (see the blue circle). When extracting text, extra code is required to omit text belonging to subsequent boxes:
Figure: Overlapping text boundary boxes.

  • Unrecognized caption text inside a text column interrupts the normal text flow. Extra code is needed to detect caption-specific text properties (in this case bold text, or a different indentation) so that these text pieces can be skipped:
Figure: Unrecognized caption text.
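For the caption case, one possible heuristic is to drop lines that consist entirely of bold text. The sketch below works on the dictionary structure returned by `page.get_text("dict", clip=rect)`, where each span carries a `flags` value and bit 4 (value 16) marks bold text. The function name is hypothetical, and the rule is only a heuristic: it would also remove bold headings inside a column.

```python
BOLD = 16  # bit 4 of a span's "flags" value marks bold text in PyMuPDF


def text_without_bold_lines(page_dict):
    """Join the text of all lines, skipping lines whose spans are all bold."""
    kept = []
    for block in page_dict.get("blocks", []):
        for line in block.get("lines", []):  # image blocks have no "lines"
            spans = line["spans"]
            if spans and all(span["flags"] & BOLD for span in spans):
                continue  # treated as caption text (heuristic)
            kept.append("".join(span["text"] for span in spans))
    return "\n".join(kept)
```

Applied per column rectangle, this returns the column text with the all-bold lines left out.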

Conclusion

In this blog post, we learned how to use a PyMuPDF utility for detecting multi-column pages in supported documents. The utility can separate text with different background colors, ignore footers and text written upon images, and supports a variable number of columns on the page.

The PyMuPDF library offers many other features to work with PDF documents, such as extracting images, annotations, and much more. Be sure to explore the official PyMuPDF documentation to discover more of its capabilities.

Another knowledge source is the utilities repository. Whatever you plan to do with PDFs, you will probably find an example script there to get you started.

If you have questions about PyMuPDF, you can reach the devs on the #pymupdf Discord channel.