Breach Review with Analytics and Automation
Published on Jun 19, 2025
Faced with a complex data breach review involving thousands of impacted individuals, our client initially anticipated a manual effort totaling nearly 700 review hours. Knowing traditional workflows weren’t going to meet timeline or budget goals, Innovative Driven’s Project Management and Analytics Consulting team developed a targeted extraction and linking strategy for over 18,000 names and PII elements buried in unstructured data types (PDFs). By combining data deduplication, pattern recognition, and automation, our teams reduced the workload from an estimated 696 hours to just 20 – and delivered the structured PII log one week ahead of schedule. The result: more than $17,000 in cost savings and a significantly faster path to breach notification.
Thank you to Jai Chai, Teresa Cole, and Andrew Erland for their contributions to this project. Their coordination across project management, analytics, and review ensured a smooth execution and a successful outcome.
Challenge
We were tasked with preparing a data breach PII log to support notification of the affected parties. The data included several very large lists of parties and their PII (around 18,000 names with associated PII) embedded in unstructured PDFs. Using a traditional manual approach, extracting these would have taken an estimated 696 reviewer hours, or $20,880.
Solution
Our plan was to analyze the data, deduplicate it at the item level, extract the PII in semi-structured form, split it into structured PII values, expand the results back to the item-level duplicates, and then bulk-load these values into the PII log with links to the originating documents.
Execution
We first deduplicated the data at the item level, as is appropriate in a data breach review, to reduce the population. We then identified the recurring patterns within the PDFs that isolated the PII and drafted regular expressions to capture them. Some of the PII was disjointed: names were tied to an office code, while the state was linked to the office code elsewhere, so the two had to be captured separately. Once extracted, we split the excerpts into structured fields: full name, first name, last name, employee ID, office code, and state. We then normalized the office codes and used them to link each state to the corresponding names. Once the structured data file was built, we linked the entries to their origin documents and expanded them to the item-level duplicates. Finally, we overlaid the structured data directly into the review platform's PII log alongside the manual review entries, ready for reporting and notification.
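The capture, split, and link steps above can be sketched in Python. This is a minimal illustration only: the sample text, the row layouts, the ID and office-code formats, and both regular expressions are hypothetical, since the actual report patterns are not described here.

```python
import re

# Hypothetical report text; the real PDF layout is not given in this write-up.
SAMPLE_TEXT = """\
Office Code: ab-101   State: Ohio
Office Code: CD-202   State: Texas
DOE, JANE    E12345   AB-101
Smith, John  E67890   cd-202
"""

# Assumed person-row pattern: "Last, First  EmployeeID  OfficeCode".
PERSON_RE = re.compile(
    r"^(?P<last>[A-Za-z'-]+),[ \t]+(?P<first>[A-Za-z'-]+)[ \t]+"
    r"(?P<employee_id>E\d{5})[ \t]+(?P<office_code>[A-Za-z]{2}-\d{3})[ \t]*$",
    re.MULTILINE,
)

# Assumed pattern for the separate office-code-to-state rows.
OFFICE_RE = re.compile(
    r"^Office Code:[ \t]*(?P<office_code>[A-Za-z]{2}-\d{3})[ \t]+"
    r"State:[ \t]*(?P<state>[A-Za-z][A-Za-z ]*?)[ \t]*$",
    re.MULTILINE,
)


def normalize_code(code: str) -> str:
    """Normalize office codes so both captures join on the same key."""
    return code.strip().upper()


def extract_records(text: str) -> list[dict]:
    """Capture names and office/state pairs separately, then link them."""
    office_to_state = {
        normalize_code(m["office_code"]): m["state"]
        for m in OFFICE_RE.finditer(text)
    }
    records = []
    for m in PERSON_RE.finditer(text):
        code = normalize_code(m["office_code"])
        records.append({
            "name": f'{m["first"]} {m["last"]}',
            "first_name": m["first"],
            "last_name": m["last"],
            "employee_id": m["employee_id"],
            "office_code": code,
            # None flags an office code with no matching state row.
            "state": office_to_state.get(code),
        })
    return records


records = extract_records(SAMPLE_TEXT)
```

Normalizing the office code before the lookup is what lets `ab-101` and `AB-101` resolve to the same state. Rows whose code has no match come back with `state` set to `None`, so they can be routed to manual review rather than silently dropped.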
Workflow for Mass Extraction:
There were 139 documents tagged as containing 20 or more log lines. Our PM team and Analytics Consultant (AC) assessed these documents further and determined that the same monthly reports appeared in multiple documents: of the 139, only 36 were unique monthly reports. Of those 36, 13 were searchable PDFs, from which our AC mass extracted the PII. Once the 36 unique documents were completed, we mass linked those individuals to the remaining 103 documents containing the same monthly reports.
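The deduplicate-then-expand idea behind this workflow can be sketched as follows. The document IDs and sample text are hypothetical, and hashing whitespace-normalized text is an assumed stand-in for however the review platform actually identifies duplicate reports; the sketch also simplifies by treating each document as a single report.

```python
import hashlib
from collections import defaultdict


def group_duplicates(documents: dict[str, str]) -> dict[str, list[str]]:
    """Group document IDs by a hash of their whitespace-normalized text."""
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        normalized = " ".join(text.split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        groups[digest].append(doc_id)
    return dict(groups)


def expand_to_duplicates(
    groups: dict[str, list[str]],
    extracted_by_doc: dict[str, list[dict]],
) -> list[dict]:
    """Copy rows extracted from one document to every duplicate in its group."""
    log_rows = []
    for doc_ids in groups.values():
        # Use whichever member of the group was actually extracted.
        source = next((d for d in doc_ids if d in extracted_by_doc), None)
        if source is None:
            continue  # nothing extracted for this group yet
        for doc_id in doc_ids:
            for row in extracted_by_doc[source]:
                log_rows.append({**row, "document_id": doc_id})
    return log_rows


# Hypothetical example: DOC-001 and DOC-002 carry the same report.
documents = {
    "DOC-001": "Report Jan\nDoe, Jane",
    "DOC-002": "Report Jan\nDoe,  Jane",
    "DOC-003": "Report Feb\nSmith, John",
}
extracted = {
    "DOC-001": [{"name": "Jane Doe"}],
    "DOC-003": [{"name": "John Smith"}],
}
groups = group_duplicates(documents)
log_rows = expand_to_duplicates(groups, extracted)
```

Extraction runs once per unique report, and every duplicate document still receives its own linked log rows.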
Breakdown of volume for the mass extraction and bulk linking:
13 documents mass extracted from searchable PDFs
Resulted in ~2,130 names and associated data that did not need to be manually entered or linked to existing individuals
103 documents mass linked from the 36 unique monthly reports
Resulted in ~15,600 names and associated data that did not need to be manually entered or linked to existing individuals
Results
Our Review Project Management and Analytics Consulting teams spent roughly 20 combined hours on the mass extraction and linking described above, at a cost of approximately $3,000. Compared with the $20,880 manual estimate, that is a total savings of more than $17,000, and the work was completed about a week ahead of schedule.
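The figures reported in this case study can be cross-checked with simple arithmetic. The variable names are illustrative, and the hourly rate is only implied by the 696-hour, $20,880 estimate rather than stated directly.

```python
# Figures as reported in this case study.
manual_hours = 696
manual_cost = 20_880   # dollars, estimated manual review
actual_hours = 20      # approximate combined hours
actual_cost = 3_000    # dollars, approximate

# Implied manual review rate (an inference, not a stated figure).
implied_rate = manual_cost / manual_hours

# Savings in dollars and the percentage reduction in hours.
savings = manual_cost - actual_cost
hours_reduction_pct = round(
    100 * (manual_hours - actual_hours) / manual_hours, 1
)

print(f"Implied rate: ${implied_rate:.2f}/hour")
print(f"Savings: ${savings:,}")
print(f"Hours reduced by {hours_reduction_pct}%")
```

With these inputs the savings come to $17,880, consistent with the "more than $17,000" figure, and the hours drop by roughly 97%.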