It’s a common misconception that PDFs are the graveyard for data. While PDFs make a great archive format for documents, that doesn’t mean that the data inside them is trapped forever.
OCR (Optical Character Recognition) is a technology that has been used for many years to extract data from printed documents. It is the electronic or mechanical conversion of images of typed, handwritten or printed text into machine-encoded text. OCR is often paired with technology that looks at specific zones of the document where data can be found. This works well for extracting data from paper forms that have a standardized layout. It can also involve expensive scanning and processing equipment with complex software to manage it.
But what about getting data from dynamic tables or reports, or even just reading and comparing the content of a PDF? The structure of a PDF makes it very hard to process by machine. What looks like a paragraph or a table to the human eye is actually a collection of (somewhat disconnected) PDF objects, so programmatic processing of the content is difficult, if not impossible. However, there is a solution!
Adobe now offers a set of cloud services to create, consume and convert PDF files, called the Adobe PDF Services API. These APIs (Application Programming Interfaces) include the PDF Extract Service. Adobe offers an SDK (Software Development Kit) for Java, Node.js, .NET, and Python, as well as a REST API, to invoke the cloud service. If you are not a programmer, low-code/no-code platforms like Microsoft’s Power Automate make using the cloud service a drag-and-drop affair.
Figure 1 Extract PDF Services in Power Automate
The service converts the PDF to JSON, a structured, machine-readable text format. Never heard of JSON? Well, you really should get to know it. In the process of converting to this format, Adobe uses AI to detect the structure of the PDF that humans see so easily.
Once the service returns the JSON file to your application, the now well-organized and structured text can easily be read by any programming language or low-code/no-code tool to be:
- Used in process automation
- Stored in systems of record
- Analyzed with natural language processing
- Republished as new content
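To give a feel for how little code it takes to read the result, here is a minimal Python sketch. It does not call the Adobe service. It parses a small, hand-written sample that stands in for the returned JSON. The sample content and the field names (`elements`, `Path`, `Text`, `Page`) are assumptions made for illustration, as are the helper functions `text_elements` and `headings`; check the actual output of the service before relying on any of them.

```python
import json

# Hypothetical, heavily trimmed stand-in for the JSON returned by the service.
# The field names ("elements", "Path", "Text", "Page") are assumptions.
SAMPLE = """
{
  "elements": [
    {"Path": "//Document/H1", "Text": "Quarterly Report", "Page": 0},
    {"Path": "//Document/P", "Text": "Revenue grew in the third quarter.", "Page": 0},
    {"Path": "//Document/Table/TR/TD/P", "Text": "Q3", "Page": 1},
    {"Path": "//Document/Figure", "Page": 1}
  ]
}
"""


def text_elements(document):
    """Yield (path, text) pairs for every element that carries text."""
    for element in document.get("elements", []):
        text = element.get("Text")
        if text:
            yield element.get("Path", ""), text.strip()


def headings(document):
    """Return the text of elements whose last path segment starts with 'H'."""
    return [
        text
        for path, text in text_elements(document)
        if path.rsplit("/", 1)[-1].startswith("H")
    ]


document = json.loads(SAMPLE)
pairs = list(text_elements(document))

for path, text in pairs:
    print(f"{path}: {text}")
```

Because each piece of text arrives labeled with its place in the document structure, picking out just the headings, just the table cells, or just the body paragraphs becomes a simple filter on that label, which is what makes the downstream uses listed above practical.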