top of page
Post Cover.png

From PDFs to Excel: How I Saved Time with Tabula-py

This is placeholder text. To change this content, double-click on the element and click Change Content.

In today’s digital world, PDFs are virtually ubiquitous. From lengthy reports and research papers to product manuals and legal contracts, PDFs have become the go-to format for sharing and preserving documents across industries and platforms. As such, PDFs hold a wealth of information. But there’s a catch.

As essential as PDFs have become, extracting meaningful data from them can be a challenge. This is particularly true when it comes to financial documents filed with the SEC. These filings contain a wealth of information that is integral to the work that investors, analysts, and researchers do. Traditional methods often require manual copying and pasting, which leaves too much room for error, and realistically, still too much work for me to do. So, inspired by this nuisance, I decided to put my lazy attitude and CS degree to work.

This post will both serve as a project walkthrough, but can also serve as a walkthrough for the main libraries that we use, namely PyPDF2, camelot-py, andtabula-py.

Setup & Installation

Now, equipped with PyPDF2, camelot-py, andtabula-py, we’ll be able to tackle some of the most common tasks and problems that we PDFs present – finding the pages that specific text appears, identifying where a table is on a page, and, finally, taking a table, cleaning the data, and exporting it to Excel.

So, let’s go ahead and install them

pip install PyPDF2 camelot-py tabula-py

Note that I had some issues when I installed camelot-py, so please refer to the footnotes where I outlined how I overcame that.

I also downloaded Apple’s 10-K to test my code on, but any PDF with tables will work.

There are other libraries that we’ll need, but we’ll just install them using pip when we run into them.

Using PyPDF Locate Pages with Specific Text

Oftentimes, when we’re looking at financial documents like the 10-K, we’re looking for the three financial statements or other keywords. Yes, we can just do a simple ctrl+f to find the pages, but coding this up will help us down the road.

We will need to utilize PyPDF's PdfReader class to process the 10-K. With that, the logic is pretty simple. For each page, we want to extract the text and see if there’s any text in there that matches what we’re looking for, if so add the index of that page to a list that accumulates the pages where there are matches.

import re
from PyPDF2 import PdfReader

def search_pdf(file_path, search_text):
    with open(file_path, "rb") as file:
        reader = PdfReader(file)
        total_pages = len(reader.pages)
        matched_pages = []

        for page_num in range(total_pages):
            page = reader.pages[page_num]
            page_text = page.extract_text()
            if re.search(search_text.replace(' ','').lower(), 
                         page_text.replace(' ','').lower(), 
                         flags=re.IGNORECASE):
                # Add 1 as pages are 0-indexed
                matched_pages.append(page_num + 1)  

    return matched_pages

So, we’re looking for the financial statement, which Apple’s accountants call “Consolidated Statements of Operations”, then we just do the following

test = search_pdf('aapl10-K.pdf', 'Consolidated Statements of Operations')
test

We’d get

Power in Numbers

2023-03-16

Programs

50

Locations

200

Volunteers

bottom of page