Enabling Data Extraction from PDFs using Tabula-Py

Summary

In a continuation of the first blog that we published in September of this year, in which we detailed the excel add-ins that the Emerging Markets team developed to retrieve and streamline legacy processes, the team’s automation efforts have been boosted by similar measures to automate data retrieval from tables in PDF formats.

The Portable Document Format (PDF) is one of the most widely used formats for publishing documents, reports, and tables, especially from public and paid sources. While this data can be extracted manually, transcribing the data to another format is time-consuming and repetitive, and it often leads to errors. The extraction of tabulated data can be automated using the Tabula library, which is available as Tabula-Py in its Python version.

Tabula-Py comes in very handy when it comes to automating and extracting tabulated data from a single PDF or a batch of PDF files that are periodically updated.

The Challenge

As an investment banking research team, we realize that a large portion of public information (like data from central banks or treasuries) is published in PDF format. In dealing with such data, many manual tasks were being performed, including copying the data from the document to a more friendly format (like an Excel or csv file). This process had two major, inherent problems:

Inefficiency: It takes a lot of time to manually export data from PDFs to Excel files, and when the team is dealing with years of historical data, the task becomes monumental and takes hours if not days.
Human error: It is almost inevitable for an analyst to skip a number or insert a wrong decimal while transcribing. This leads to errors that are sometimes very difficult to spot.

Our Solution

After a thorough investigation, the team decided to use the Tabula-Py Python library to automate reading, modification, and saving of data from tables in PDF formats. Tabula-Py is a library available in Python that allows easy interpretation of tables in PDF format and helps the user convert PDFs into CSV, TSV, and JSON file formats or directly into a Pandas DataFrame, which can be saved as excel files. This library also allows the user to either extract the data from a single file or from a batch.

Tabula-Py allows for a relatively greater extent of customization, and while the default configuration works in most cases, the user may occasionally be required to rely on different parameters while dealing with certain tables. Tabula can also read PDF files in batches, making the extraction of larger datasets extremely convenient.

Business Impact

The team used to take 2-3 hours per week to complete these tasks, as they required multiple reviews due to the nature of the work. Using this advanced solution, the same process can now be executed in less than five minutes with zero possibility of human error.

Tabula-Py could be a game changer in financial research, as other teams could leverage our success by implementing this solution in their daily tasks.

Synergy with FinTex: FinTex is also an in-house solution that Evalueserve has developed, and it is used internally to upload any PDF and download tables from existing PDFs with about 90% accuracy (tested on SEC filings). Just like the EM PDF tool, the tables in FinTex are also downloaded as tabular data to be exported into Excel and other platforms. The FinTex is a web-based tool. As such, it is limited to the Evalueserve environment only, whereas the EM PDF extractor is used to streamline processes in Python, and it can be easily integrated into client environments. The PDF extractor is one clear example of how existing synergies are being used at Evalueserve to achieve company-wide goals and how similar solutions are being implemented to serve client needs.

Limitations

Apart from technological limitations (e.g., requiring Python and a specific library to run), the use of this tool is limited to tabulated data. While the tool works well with tables, it fails to identify other datasets that PDFs can contain. Hence, this solution is recommended in use cases where the user is dealing with tables and columns alone.

As mentioned earlier, this tool works well if the PDF file is periodically updated and if the format and structure of the file does not change. Any reporting or structural change in the PDF leads to failure of the script. Therefore, the tool cannot be applied to sources that are more dynamic in nature.

Authors

Talk to One of Our Experts

Get in touch today to ﬁnd out about how Evalueserve can help you improve your processes, making you better, faster and more efﬁcient.

Talk To An Expert

Data Analytics

Customer Analytics

Pricing

Business Intelligence

Supply Chain

Data Engineering & Cloud

Insights & Intelligence

Competitive & Market Intelligence

Supply Chain & Procurement

Decarbonization & ESG

Strategy & Planning

Investment Banking Advisory

Business Information Services

CRM & Business Management Support

Deal Execution

Desktop & Publishing Services

Product Support

Transaction Advisory

Valuation

Investment Management & Research

Asset Management

Publishing & Distribution

Core Research

Research Support

Data Solutions

Lending Services

Facility Origination & Sales Support

Independent Review

Risk Management

Loan Servicing

Credit Underwriting

Portfolio Monitoring

Private Markets

Private Equity Advisory

Private Markets Technology

Investment Operations Support

Private Credit

Index & QIS

Risk Transformation

Model Risk Management

KYC Compliance

Intellectual Property & R&D

IP and R&D Market Solutions

IP Consulting

IP Process Redesign

Patent Search

R&D and Innovation

Toxicology Consulting

Technology

Artificial Intelligence

Architecture

Agentic AI

Accelerators

Products

Insightsfirst

Spreadsmart

Publishwise

Partners

Technology Partners

Google Cloud Partnership

Partner Advisory Council

Industry Specific Expert-Driven Solutions.

Elevate Magazine

Resource Library

Client Stories

eBooks & Whitepapers

Ask an Expert

Blogs

Events & Webinars

Executive Exchange Series

Industry Insights

Podcasts