Why is pulling financial data out of documents so hard?
Extracting financial information should, in theory, be an easy task for AI. After all, financial data is structured, number-based, and often clearly labelled for the human reader. In practice, however, OCR tools struggle with the work that finance users actually need done.
Surprisingly, the earliest forms of optical character recognition (OCR) were designed for banks. Business, military, or scientific use provided the financial justification for many of AI’s early applications, and OCR was no exception: the rising volume of bank checks after WWII led Bank of America to collaborate with the Stanford Research Institute on ERMA, a system to automate bookkeeping and proofing. ERMA allowed banks to manage the flood of paper checks amid staff shortages, and ultimately paved the way for the electronic banking we have today. (For OCR history buffs, also notable are Ray Kurzweil’s “omni-font” system, which he sold to Xerox, Oliver Selfridge’s Pandemonium, and the idea of “pooling layers” put forward by Yann LeCun.)
Financial documents come in a dizzying array of formats, numbering conventions, and styles. Invoices, payroll slips, receipts, and tax documents all contain financial data, but differ greatly in how that data is organized. These differences compound across large datasets – which makes OCR hard to rely on.
Shortcomings of OCR
Despite its early success in banking, OCR still comes with well-known challenges, familiar to anyone who works with financial documents and other unstructured PDFs. A smattering of errors or hallucinated data is to be expected: OCR can invent a street name like “Lamefront” instead of “Lanefront” if it misreads the curve of a poorly scanned “n” as an “m.”
Irregular results may be tolerable in a single document, but they become a serious problem across large financial datasets. To guarantee accuracy with OCR, you might end up reviewing the entire dataset by hand – eliminating the time-saving benefits entirely.
Documents like PDFs are unstructured by design: they present information in a way a human reader finds easy to follow, often as images or tables. These elements carry no metadata – a bar chart is easy for a human to interpret, but far less intuitive for AI. OCR tools struggle to reconstruct the structure of these images and tables in spreadsheets or other formats, and may misalign columns or misinterpret row breaks, as the sketch below illustrates.
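To make the misalignment problem concrete, here is a minimal sketch of what goes wrong when OCR output is rebuilt by naively splitting on whitespace. The invoice lines and column names are hypothetical; the point is that a single blank cell silently shifts every value after it into the wrong column.

```python
# Hypothetical OCR output for a two-row invoice table.
ocr_lines = [
    "Widget      2      10.00      20.00",
    "Shipping           5.00        5.00",  # blank Qty cell
]

columns = ["item", "qty", "unit_price", "amount"]

for line in ocr_lines:
    # Naive reconstruction: split on whitespace. The blank cell
    # simply disappears, so later values shift left one column.
    cells = line.split()
    print(dict(zip(columns, cells)))

# {'item': 'Widget', 'qty': '2', 'unit_price': '10.00', 'amount': '20.00'}
# {'item': 'Shipping', 'qty': '5.00', 'unit_price': '5.00'}
#  -> "Shipping" now appears to have a quantity of 5.00 and no amount.
```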
OCR also does not understand the meaning of the data it pulls. In many financial documents, for example, an accountant will recognize that the double-underlined number at the bottom of a column is that column’s sum total. Train an OCR model on that convention, and it will assume it holds even in cases where it doesn’t.
Many OCR errors come down to probability: OCR systems predict labels and relationships statistically rather than by “understanding” the data. Without additional training or contextual rules, the model cannot recognize that “tax” or “subtotal” is contextually different from “sum total.”
So when the system sees a number at the bottom of a column without a clear label (or with a label it hasn’t seen before), it falls back on statistical likelihood. If 90% of columns end in a double underline, and the model has been taught that a double underline means sum total, it will get the label right (hopefully) around 90% of the time. But hand the model a new document from a real estate agent who uses double underlines for signatures and dates, and it will now treat each date in the signature column as a sum total. The signatory, Bobby, writes in all caps and a shaky hand, so the OCR gives him a new name: 8088Y.
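Here is a toy sketch of that failure mode. The rule, documents, and values are all hypothetical; the point is that a statistically learned association gets applied regardless of what the text actually means.

```python
# Toy sketch: a "model" that has learned one statistical rule from
# invoices -- double-underlined values are sum totals.

def label_value(text: str, double_underlined: bool) -> str:
    """Assign a label from the learned association, not from meaning."""
    if double_underlined:
        return "sum_total"  # right ~90% of the time on invoices
    return "line_item"

# Invoice column: the rule works as intended.
print(label_value("1,240.00", double_underlined=True))    # sum_total

# Real-estate contract: signatures and dates are also double-underlined.
print(label_value("04/12/2024", double_underlined=True))  # sum_total (wrong)
print(label_value("8088Y", double_underlined=True))       # sum_total (wrong)
# Poor Bobby: his shaky all-caps signature is now a "sum total" too.
```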
Moving Beyond OCR
While OCR has its place in digitizing text, extracting financial data clearly requires more than static character recognition. Modern solutions, like the ones underlying Discrepancy, combine machine learning, natural language processing (NLP), and advanced algorithms to address these challenges – so users can finally access their unstructured data in real time.