Data Disorder to Organized Efficiency: How Data Hygiene Enhances AI Outcomes

  • Published on Jun 28, 2024

Data cleanup is a critical and challenging endeavor, from handling unstructured data to ensuring compliance with ever-evolving privacy regulations. The volume and complexity of information that organizations handle can be overwhelming, especially with the proliferation of AI-enabled tools generating mountains of data. To help us understand, we spoke with two experts in our Information Governance (IG) group: David Starkweather, Director of Professional Services, and Jeff Dunning, Director of Corporate Cloud Services. The conversation explores the expertise needed and provides actionable insights for organizations to manage data effectively.  

The following has been edited for length and clarity. 

Give us an overview of what data cleanup is and what AI has to do with it.

Jeff: The first thing to jump into is that there’s so much data in organizations. What we’re focused on right now is unstructured data. What is unstructured data? It’s all that data that you own, create, and consume. It’s your Word files and presentations. It’s your spreadsheets. It’s human-generated data. And now, we have AI-generated data, exponentially adding to the pace of data creation. I recently attended an Association of Records Managers and Administrators (ARMA) conference. A statement was made that over 2 billion documents are created in Microsoft 365 a day. That growth is just tremendous. And that’s really what data cleanup is all about. We want to take advantage of our data, but at that pace of growth, we have to do something to control it and understand the value we’re getting from data.

And now, AI is doing the work for us in generating all this content. It’s that unstructured data that’s growing at such a rapid pace. And it’s also the data we don’t have enough information about. We know our structured data. We understand that this column has first name, this has last name, or we know if we’re capturing the last four of someone’s social security number. It’s called structured for that reason. So we know what we’re dealing with. It’s this mammoth volume of unstructured data where we really don’t know what’s happening.

What are the challenges in data cleanup?  

David: There’s no doubt data cleanup is not for the faint of heart. There are some real challenges you have to overcome. But if you’ve got the right people on board, you’ve got proven processes and the right technology to help you reach those goals – it’s a very doable project with many upsides. Most companies look at how much data they have and don’t get out of the gate just because they’re overwhelmed by the data volumes. Companies have hundreds of terabytes, if not petabytes, of unstructured data out on their file shares to deal with. So, data volume is the first challenge to overcome.

You’ve also got legal holds. Some of your data may be subject to a legal hold for pending or active litigation, and you will have to work with your legal team to ensure you’re keeping everything that needs to be retained for that reason.  

Everybody has (or should have) a policy regarding where corporate records need to reside. In reality, many of those documents exist on the file share, out in the wild, if you will. So, we need to ensure that we’re not deleting valuable corporate records when we’re cleaning up our data.  

Orphan data is another problem. You’re going to run into orphan data, meaning you don’t know who owns it, and this is especially true if you’ve grown by any acquisitions or mergers. You often don’t know who owns that data and you spend a lot of time tracking down ownership.  

Lastly, there are data hoarders. You’ll run into people who say, don’t touch my stuff; I might need it someday. You have to overcome pushback and resistance from your end users. There are many challenges, but this is a doable project with many upsides if you have the right people, processes, and technology in place. 

Some companies feel like they have their data under control. From a broad perspective, what’s in it for them to care more about data cleanup as an organization?

David: It’s interesting. We go into organizations, and we’ll do a scan of their data. Without exception, they’re all shocked at what we find. They think their data is under control, but a lot of sensitive data exists out on the file shares, probably more than most of us would like to believe. And based on our experience, a lot of it has open access, meaning anyone in the organization can see it. We commonly go into an organization and find the CEO’s social security number, credit cards, and personal information in a spreadsheet with open access. It’s obviously a massive security threat.

As Jeff alluded, we don’t know what sits in our unstructured data. A lot of it is dark data. Most organizations are lucky if they can trace that data to a department or ownership. But many organizations we work with aren’t even able to do that. It really is trying to determine what you have and then mitigating that risk from there. 
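The kind of scan David describes can be illustrated with a short sketch. This is a hypothetical, deliberately simplified example (the function name, paths, and regex patterns are ours, not a real IG product’s; production scanners add validation such as Luhn checks and keyword context):

```python
import re
from pathlib import Path

# Illustrative, intentionally naive patterns -- real scanning tools
# validate matches (e.g., Luhn checks for card numbers) and use
# surrounding keywords to cut false positives.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){15}\d\b"),
}

def scan_share(root: str) -> dict[str, list[str]]:
    """Walk a file share and report which files match each sensitive pattern."""
    hits: dict[str, list[str]] = {name: [] for name in PATTERNS}
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        try:
            text = path.read_text(errors="ignore")
        except OSError:
            continue  # unreadable file -- worth logging as a permissions issue
        for name, pattern in PATTERNS.items():
            if pattern.search(text):
                hits[name].append(str(path))
    return hits
```

Even a toy scan like this, pointed at an open file share, tends to surface the sort of exposed spreadsheets David mentions.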

One of the terms associated with AI data is cannibalism, a concept where data is taken from the past and applied to the future. Based on what you guys are saying, that isn’t the right way to do it.  

Jeff: No. Here’s the thing. You don’t get two petabytes of data over the course of two months. It’s over multiple years. Some of that data was valuable five years ago. Given the rapid change of organizations and enterprise today versus five years ago, do we really want to use the data from the past to define the future?  

That’s what this cannibalism term is about. It’s the Copilots and Geminis of the world. They go out through your unstructured data environment, look at everything, and respond to your prompts based on the data they find. So cannibalism is all about that. It’s creating new things with bad data, with old, rescinded five-year-old data. That creation of new content gets back into the data pools to be reused by others. And now that bad data is being used in a customer pitch deck or a sales forecast, for example, and no one has the context of where it came from because it’s on a document that’s five years old. That’s the concept behind cannibalism.  

All that impacts how we approach data cleanup in the context of AI. Data cleanup ensures we’re giving AI the ability to do what it needs to do with the right data. I explain it using the knowledge versus wisdom conversation. AI has the knowledge to create your work product. It understands what you’re doing when you prompt it. We want to give AI the wisdom not to use old, sensitive, confidential data when creating those work products. Part of it is getting rid of the old stuff. Other concepts, like data classification, help us put guardrails in place to protect our data. If we want to get the value from AI, we have to give it the wisdom not to use the stuff we don’t want it to use.

You describe AI in terms of knowledge versus wisdom. What’s an example in Copilot?

Jeff: When you start the Copilot process, it will go out and look at your data estate. It’s going to look at everything you have. If we position Copilot in such a way that we say, here’s the data estate we want it to use – when I prompt it to use its “knowledge” to create a PowerPoint presentation or a spreadsheet, it now knows how to use that particular data we’ve cleaned up. We’ve put our guardrails in place, giving it the wisdom to create a sales presentation on our new product. That’s the data I want to use because that’s the most current. That’s the information that’s available to me. That’s what we want to share with the public. The knowledge is, how do I create the spreadsheet? The wisdom is, let’s ensure we’re giving it the data space we want.  

I think AI is probably the best development in the whole concept of IG, data cleanup, and records management. It’s no longer all about risk. It’s about productivity. The organizations that went after these programs two years ago and cleaned up their data are taking advantage of Copilot in a way that most organizations are not, and that’s a big deal. 

This is a good segue. You can’t stop progress from happening. What company programs are needed to allow progress while maintaining a good data cleanup policy? 

David: There are many tools out there that can do classification and AI data cleanup. We’ve worked with many of those, and we learned early on that it’s not enough just to give people a tool and expect them to achieve successful governance or disposition outcomes from a data cleanup program.

It’s essential to have a methodology for this type of work. As we talked about earlier, there will be roadblocks. At the end of the day, data cleanup is primarily a change management initiative because we’re asking people to change long-held beliefs and behaviors about how we manage data. Having good change management built into your program is really important as well. It’s not enough just to have a smart tool. You actually have to have smart people to implement it.

Walk us through a change management example for a successful data cleanup program.  

David: A few thoughts on that. It’s easy to kick the can down the road and put off this project until later. That’s easy to do if you have yet to experience the pain of over-retention, whether through litigation, a data breach, or some similar event. But if an organization doesn’t take on this initiative now, it’s just delaying the inevitable. Someone will need to do it, and their job will be much harder if we don’t make progress today.

And that’s what we like to do. We bring the people, the process, and the technology to get this done now. We work with companies that have hundreds of terabytes, if not over a petabyte, of information. We’ve worked with multinational companies, and we’ve done projects that start out as pilots and move into production. We start scanning, quarantining, and eventually deleting data in a process that may take six months to a year. We can do things to speed that up, but much of it has to do with change management. How engaged will we get our user community to help facilitate that process?  
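The scan-quarantine-delete sequence David outlines can be sketched in a few lines. This is a hypothetical illustration, not their tooling: stale files are moved to a quarantine area first, giving owners a window to reclaim anything valuable before a separate purge job deletes the quarantine after a hold period. (The function name, hold period, and directory layout are assumptions, and the sketch ignores filename collisions.)

```python
import shutil
import time
from pathlib import Path

QUARANTINE_DAYS = 90  # illustrative hold period before final deletion

def quarantine_stale(root: str, quarantine: str, max_age_days: int) -> list[str]:
    """Move files not modified within max_age_days into a quarantine area.

    Nothing is deleted here; owners get QUARANTINE_DAYS to reclaim files
    before a separate job purges the quarantine directory.
    """
    cutoff = time.time() - max_age_days * 86400
    qdir = Path(quarantine)
    qdir.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in Path(root).rglob("*"):
        if path.is_file() and path.stat().st_mtime < cutoff:
            dest = qdir / path.name  # note: real tools handle name collisions
            shutil.move(str(path), dest)
            moved.append(str(dest))
    return moved
```

The staged approach is what makes the change management tractable: because quarantine is reversible, end users are far more willing to let the process run.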

Jeff: Very early on, especially on projects that run for months and months, much of our process when we work with organizations is finding the small wins in the areas where they’re going to see the most benefit. We show results, such as a reduction of data, within a short period in one of the very critical areas. We focus on it, we show some results, and people start buying in. It’s interesting to watch what happens when naysayers who aren’t buying into your change management see other departments succeed and the benefits they’re getting from it. The buy-in to the change management initiative starts to pick up steam, and people want to get involved.

Sounds like the product adoption curve. You find those early adopters who see the value. Then those laggards come aboard later. It seems like the same process.  

David: Absolutely. Get quick wins early. Develop champions for the program, and once they start seeing the benefits, word spreads. You get these champions to be vocal, then comes executive support, and then you get a lot of good involvement from the end users. Ultimately, you end up cleaning a lot more data, and you get less resistance when you get that kind of buy-in. 

Are there specific industries that need to be more on top of data cleaning than others? 

Jeff: The highly regulated industries were early adopters. They started a couple of years ago for many reasons, like regulations. Cleaning up data at a retail enterprise was easier than at a healthcare enterprise, but the same processes applied. Now, we’re at a point where the pendulum is swinging. AI and the fact that we’re looking for productivity gains within our workforce make data cleansing beneficial and necessary for everyone.

David: One of the surprising areas we’re seeing interest in is with local city governments. They have Freedom of Information Act (FOIA) or public records requests they have to fulfill. With the data volumes, finding everything that’s responsive to those requests can be daunting. They are very motivated to manage and retain information correctly and dispose of it at the end of their retention period. They also want to search for information without sorting through all the garbage. Those types of FOIA requests hinge on unstructured data.  

City governments and some industries that previously never had to worry much because their data was structured are now utilizing Copilot, ChatGPT, and AI tools to be more productive. This makes data cleanup an everybody problem.  

If I’m a company that just realized this all pertains to me, what do I need to know to get going? 

Jeff: My response typically is to draw a line in the sand tomorrow. Manage as much going forward so that 700 terabytes don’t become 800 terabytes in three months. Put yourself in a better place first, and then slowly go back and take care of the data behind you. But if we don’t have anything in place to start doing a better job going forward, our problem continues to get bigger. Let’s not even worry about the 700 terabytes you have. Draw a line in the sand and slow down all the new data you create. Have processes, security, and some structure to manage growth going forward. Then, go back and clean up the other data over time.  
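The "line in the sand" Jeff describes starts with knowing how much of the estate predates the cutoff. A minimal aging report like this sketch (our own hypothetical example, standard library only) groups data volume by last-modified year, which shows both the backlog behind the line and whether new growth is actually slowing:

```python
from collections import Counter
from datetime import datetime
from pathlib import Path

def aging_report(root: str) -> Counter:
    """Total bytes of data on a share, grouped by last-modified year."""
    by_year: Counter = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            st = path.stat()
            year = datetime.fromtimestamp(st.st_mtime).year
            by_year[year] += st.st_size  # accumulate bytes per year
    return by_year
```

Rerunning the report monthly makes the going-forward discipline measurable: the pre-cutoff years should shrink as cleanup proceeds, and the current year should grow more slowly.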

David: Understandably, 700 terabytes might seem insurmountable to organizations, so it’s about more than just drawing a line in the sand. We have to give people the confidence that they can do this. We’ve seen companies throw up their hands and say the problem’s too big; we won’t be able to solve it. Obviously, we have to solve it. If you help them understand the roadblocks and the challenges while proactively addressing them with change management and a proven methodology, you’ll get the right people on board. Then, you bring in technology that can facilitate the work and get it done relatively efficiently. In a short time, you’ll get some quick wins to show the initiative is possible and build confidence.

Do you have a favorite project that changed somebody’s mind about data hygiene or had an unexpected outcome?  

David: I’ll speak to one project we did for a large multinational healthcare provider with over 65,000 employees. Our scope included over 500 terabytes of data. The entire project lasted about 15 months after we completed the initial pilot. Change management went well on this project. We had a dedicated executive who communicated why this was important. We had a team meeting with department heads and data owners, explaining the ask, what the timeframe was, and addressing any questions or concerns they had. It gave users that white glove treatment. 

As a result, 399 out of 400 data owners completed their data reviews. They reduced the number of active file shares they had by 76%. We also looked at sensitive data on this project and identified that 91% of the file shares we examined had sensitive data. We ended up removing open access on 63% of those file shares.

Not only did this company benefit greatly from data cleanup and risk reduction, but we also really set them up for the future to use AI because we removed a lot of that garbage content, that ROT content, if you will.  

Jeff: We worked with a healthcare payer once, and it was very hands-off. I think it was a project that was being pushed from the top a little bit. We were working with the general counsel at the time, and it felt like there wasn’t a lot of buy-in across the board. At first, we struggled to get people on board, but we found a couple of champions. We showed some early wins, and before you knew it, people were on board. People wanted to learn about the technology and get involved in the process.

They went from really hands-off to hands-on in the process. As we finished what we were initially looking at, they pushed it out to other areas within the organization because they had so much success. To see the ability of our team to go in there and fight through that resistance and to watch an organization change and become active in the process was kind of cool. That, for me, is the most rewarding.  

Written by: Innovative Driven