What is dark data? And how to deal with it
It’s easier than ever to collect data without a specific purpose, under the assumption that it may be useful later. Often, though, that data ends up unused and even forgotten for a few simple reasons: the fact that it is being collected isn’t communicated effectively to potential users within the organization, the repositories that hold it aren’t widely known, or there simply isn’t enough analysis capacity within the company to process it. Data that is collected but not used is often termed 'dark data'.
Dark data presents an organization with tremendous opportunities, as well as liabilities. If it is harnessed effectively, it can be used to produce insights that wouldn’t otherwise be available. With that in mind, it’s important to make this dark data accessible so it can power those innovative use cases.
On the other hand, lack of visibility into all the data being collected within an organization can make it difficult to accurately manage costs, and easy to accidentally run afoul of retention policies. It can also hamper efforts to ensure compliance with regulations like the EU’s General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA).
So what can be done to maximize the benefits of dark data and avoid these problems?
Some best practices
When dealing with dark data, the foremost best practice is to shine a spotlight on it by communicating to potential users within the organization what data is being collected.
Secondly, organizations need to evaluate whether and for how long it makes sense to retain the data. This is crucial to avoid incurring potentially substantial costs collecting and storing data that isn’t being used and won’t be used in the future, and even more importantly to ensure that the data is being handled and secured properly.
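As a rough illustration of that retention review, here is a minimal Python sketch. It assumes the organization keeps a simple inventory recording, for each dataset, an owner, the date it was last accessed, and a retention window in days. The DatasetRecord type, the flag_for_review function, and the sample entries are all hypothetical and are not taken from any particular tool.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class DatasetRecord:
    """One entry in a hypothetical data inventory."""
    name: str
    owner: str
    last_accessed: date
    retention_days: int  # assumed policy: maximum days to keep data unused


def flag_for_review(inventory, today):
    """Return names of datasets unused for longer than their retention window."""
    flagged = []
    for record in inventory:
        idle = today - record.last_accessed
        if idle > timedelta(days=record.retention_days):
            flagged.append(record.name)
    return flagged


# Illustrative entries only; real inventories would come from a catalog or audit.
inventory = [
    DatasetRecord("web_clickstream", "marketing", date(2024, 1, 10), 90),
    DatasetRecord("sensor_logs", "operations", date(2024, 5, 20), 365),
]

stale = flag_for_review(inventory, today=date(2024, 6, 1))
print(stale)
```

A list like this is only a starting point for a conversation with the data owners; whether a flagged dataset should be deleted, archived, or put to use remains a judgment call.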
Perhaps the biggest challenge when working with dark data is simply getting access to it, as it’s often stored in siloed repositories close to where the data is being collected. Additionally, it may be stored in systems and formats that are difficult to query or have limited analytics capabilities.
So the next step is to ensure that the data that is collected can actually be used effectively. The two main approaches are: (1) investing in tooling that can query the data where it is currently stored, and (2) moving the data into centralized data storage platforms.
I recommend combining these two approaches. First, adopt tools that provide the ability to discover, analyze, and visualize data from multiple platforms and locations via a single interface, which will increase data visibility and reduce the tendency to store the same data multiple times. Second, leverage storage platforms that can efficiently aggregate and store data that would otherwise be inaccessible, in order to reduce the number of data stores that must be tracked and managed.
Considering the potential power and pitfalls that come with having dark data in your organization, it’s definitely worth the effort to bring it out of the dark.
Author: Dan Cech