Enabling Data Extraction from PDFs using Tabula-Py

Summary

In a continuation of the first blog that we published in September of this year, in which we detailed the excel add-ins that the Emerging Markets team developed to retrieve and streamline legacy processes, the team’s automation efforts have been boosted by similar measures to automate data retrieval from tables in PDF formats.

The Portable Document Format (PDF) is one of the most widely used formats for publishing documents, reports, and tables, especially from public and paid sources. While this data can be extracted manually, transcribing the data to another format is time-consuming and repetitive, and it often leads to errors. The extraction of tabulated data can be automated using the Tabula library, which is available as Tabula-Py in its Python version.

Tabula-Py comes in very handy when it comes to automating and extracting tabulated data from a single PDF or a batch of PDF files that are periodically updated.

The Challenge

As an investment banking research team, we realize that a large portion of public information (like data from central banks or treasuries) is published in PDF format. In dealing with such data, many manual tasks were being performed, including copying the data from the document to a more friendly format (like an Excel or csv file). This process had two major, inherent problems:

  • Inefficiency: It takes a lot of time to manually export data from PDFs to Excel files, and when the team is dealing with years of historical data, the task becomes monumental and takes hours if not days.
  • Human error: It is almost inevitable for an analyst to skip a number or insert a wrong decimal while transcribing. This leads to errors that are sometimes very difficult to spot.

Our Solution

After a thorough investigation, the team decided to use the Tabula-Py Python library to automate reading, modification, and saving of data from tables in PDF formats. Tabula-Py is a library available in Python that allows easy interpretation of tables in PDF format and helps the user convert PDFs into CSV, TSV, and JSON file formats or directly into a Pandas DataFrame, which can be saved as excel files. This library also allows the user to either extract the data from a single file or from a batch.

Tabula-Py allows for a relatively greater extent of customization, and while the default configuration works in most cases, the user may occasionally be required to rely on different parameters while dealing with certain tables. Tabula can also read PDF files in batches, making the extraction of larger datasets extremely convenient.

Business Impact

The team used to take 2-3 hours per week to complete these tasks, as they required multiple reviews due to the nature of the work. Using this advanced solution, the same process can now be executed in less than five minutes with zero possibility of human error.

Tabula-Py could be a game changer in financial research, as other teams could leverage our success by implementing this solution in their daily tasks.

Synergy with FinTex: FinTex is also an in-house solution that Evalueserve has developed, and it is used internally to upload any PDF and download tables from existing PDFs with about 90% accuracy (tested on SEC filings). Just like the EM PDF tool, the tables in FinTex are also downloaded as tabular data to be exported into Excel and other platforms. The FinTex is a web-based tool. As such, it is limited to the Evalueserve environment only, whereas the EM PDF extractor is used to streamline processes in Python, and it can be easily integrated into client environments. The PDF extractor is one clear example of how existing synergies are being used at Evalueserve to achieve company-wide goals and how similar solutions are being implemented to serve client needs.

Limitations

Apart from technological limitations (e.g., requiring Python and a specific library to run), the use of this tool is limited to tabulated data. While the tool works well with tables, it fails to identify other datasets that PDFs can contain. Hence, this solution is recommended in use cases where the user is dealing with tables and columns alone.

As mentioned earlier, this tool works well if the PDF file is periodically updated and if the format and structure of the file does not change. Any reporting or structural change in the PDF leads to failure of the script. Therefore, the tool cannot be applied to sources that are more dynamic in nature.

Authors

Talk to One of Our Experts

Get in touch today to find out about how Evalueserve can help you improve your processes, making you better, faster and more efficient.  

Share: