A Fortune 500 financial services company laid out a challenge for Aftia: help us find a needle in a haystack. The haystack in this case is 900+ PDF forms and documents. The needles are common text and form components called “fragments”.
Our customer has seen the benefits of using the form fragments capability in Adobe Experience Manager (AEM) Forms. They’ve implemented about 150 forms using about 200 fragments and saved over 7,000 hours of development time.
They have about 750 forms left to bring onto the platform. These forms can be up to 40 pages in length and are packed with text outlining terms and conditions, disclosures and obligations. The challenge is to identify the common content in these forms that can be converted into time-saving fragments. Manually searching these forms is laborious and costly, not to mention error-prone.
To meet this challenge we examined print stream comparison tools, PDF Java libraries and image comparison libraries. None of them were up to the task. Then we found the PDF Extract API from Adobe while it was still in beta. This AI/ML cloud-based service converts PDFs to JSON, which makes them easy to process by machine. PDFs may look beautiful and well structured, but they are notoriously difficult to parse. Adobe’s service makes that problem disappear.
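To give a feel for why JSON is easier to work with, here is a minimal Python sketch that reads text elements out of an Extract-style result. The sample data is made up, and the field names (`elements`, `Text`, `Bounds`, `Font`, `TextSize`, `Page`) are our assumption about the shape of the output; check them against the current Adobe documentation before relying on them.

```python
import json

# Hypothetical sample shaped like the Extract output (assumed field names).
SAMPLE = """
{"elements": [
  {"Path": "//Document/H1", "Page": 0, "Text": "Terms and Conditions ",
   "TextSize": 14.0, "Font": {"name": "Arial-BoldMT"},
   "Bounds": [72.0, 700.0, 300.0, 716.0]},
  {"Path": "//Document/Figure", "Page": 0,
   "Bounds": [72.0, 500.0, 200.0, 600.0]},
  {"Path": "//Document/P", "Page": 0,
   "Text": "The account holder agrees to the fees listed below. ",
   "TextSize": 10.0, "Font": {"name": "ArialMT"},
   "Bounds": [72.0, 660.0, 540.0, 684.0]}
]}
"""


def load_text_elements(raw_json):
    """Return one flat record per element that carries text."""
    doc = json.loads(raw_json)
    records = []
    for el in doc.get("elements", []):
        text = (el.get("Text") or "").strip()
        if not text:
            continue  # skip figures and other elements without text
        # Bounds is assumed to be [left, bottom, right, top] in points.
        x1, y1, x2, y2 = el.get("Bounds", [0, 0, 0, 0])
        records.append({
            "text": text,
            "page": el.get("Page"),
            "font": (el.get("Font") or {}).get("name", ""),
            "size": el.get("TextSize"),
            "width": round(x2 - x1, 2),
            "height": round(y2 - y1, 2),
        })
    return records


records = load_text_elements(SAMPLE)
for rec in records:
    print(rec)
```

Each record keeps only what a comparison needs: the text, the font, and the size of the box the text sits in.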
Aftia’s Pat Legault and Adia Lane developed an algorithm that compares the JSON elements (text, bounding box size and font properties) and uses fuzzy logic to quickly identify potential matches. This algorithm has become the basis of an application we are building for our customer’s form experts. The comparison tells the form managers which elements matched and the confidence level of each match. Using the PDF Viewer API we can highlight the fragment candidate right in the form.
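The general idea of scoring a match can be sketched in a few lines of Python. This is an illustration only, not the algorithm described above: the weights, the threshold, the record layout and the use of the standard library's `difflib.SequenceMatcher` for text similarity are all assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def text_similarity(a, b):
    """Similarity in [0, 1] after lowercasing and collapsing whitespace."""
    def norm(s):
        return " ".join(s.lower().split())
    return SequenceMatcher(None, norm(a), norm(b)).ratio()


def size_similarity(a, b):
    """Average of the smaller/larger ratios for box width and height."""
    def ratio(p, q):
        if p == q:
            return 1.0
        larger = max(p, q)
        return min(p, q) / larger if larger > 0 else 0.0
    return (ratio(a["width"], b["width"]) + ratio(a["height"], b["height"])) / 2


def match_confidence(a, b, weights=(0.7, 0.15, 0.15)):
    """Weighted blend of text, box-size and font scores (weights are assumed)."""
    w_text, w_size, w_font = weights
    same_font = a["font"] == b["font"] and a["size"] == b["size"]
    return (w_text * text_similarity(a["text"], b["text"])
            + w_size * size_similarity(a, b)
            + w_font * (1.0 if same_font else 0.0))


def find_candidates(form_a, form_b, threshold=0.85):
    """Return (index_a, index_b, confidence) for pairs at or above the threshold."""
    matches = []
    for i, a in enumerate(form_a):
        for j, b in enumerate(form_b):
            score = match_confidence(a, b)
            if score >= threshold:
                matches.append((i, j, round(score, 3)))
    return sorted(matches, key=lambda m: m[2], reverse=True)


# Made-up elements from two hypothetical forms.
clause_a = {"text": "The account holder agrees to the fees listed below.",
            "font": "ArialMT", "size": 10.0, "width": 468.0, "height": 24.0}
clause_b = {"text": "The Account Holder agrees to the  fees listed below",
            "font": "ArialMT", "size": 10.0, "width": 468.0, "height": 24.0}
signature = {"text": "Signature of applicant",
             "font": "Arial-BoldMT", "size": 12.0, "width": 200.0, "height": 12.0}

candidates = find_candidates([clause_a], [clause_b, signature])
print(candidates)
```

Comparing every element against every other element is fine for a pair of forms. Across hundreds of long forms a real tool would need to narrow the pairs it compares first, for example by grouping on font or text length.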
As we neared a minimum viable product (MVP) for this solution, our customer found many other use cases. The solution will help them crawl content and rationalize their forms collection by consolidating common forms. There are all kinds of applications for finding common content across organizations large and small. What needle can we help you find?