Strengthening Analytics with Data Documentation
Data documentation is a new term used to describe the capture and use of information about your data. It is used mainly in the context of data transformation, whereby data engineers and analysts can better describe the data models created in data transformation workflows.
Data documentation is critical to your analytics processes. It helps all personas involved in data modeling and transformation share what they know, assist one another, and participate in the data and analytics engineering process.
Let’s take a deeper dive into data documentation, explore what makes for good data documentation, and see how a deep set of data documentation helps add greater value to your analytics processes.
What is Data Documentation?
At the simplest level, data documentation is information about your data. This information ranges from raw schema information to system-generated information to user-supplied information.
While many people associate information about data with data catalogs, a data catalog is a more general-purpose solution that spans all of your data and tends to be in the domain of IT. If an organization uses an enterprise data catalog, data documentation should further enhance the information held in that catalog.
Data documentation refers to information captured about your data in the data modeling and transformation process. Data documentation is highly specific to the data engineering and analytics processes and is in the domain of data engineering and analytics teams.
How is Data Documentation Used?
Data documentation is used throughout your analytics processes, including data engineering, analytics generation, and analytics consumption by the business. Each persona in the process will contribute and use data documentation based on their knowledge about the data and how they participate in the process:
- Data engineers – This persona tends to know more about the data itself – where it resides, how it is structured and formatted, and how to get it – and less about how the business uses the data. They will document the core information about the data and how it was transformed. They will also use this information when vetting and troubleshooting models and datasets.
- Data analysts and scientists – These personas tend to know less about the core data itself but understand how the data is incorporated into analytics and how the business would use it. They will document the data with this type of information: what the data is good for, how it is used, whether it is good and trusted, and what analytics are generated from it.
- Business analysts and teams – These teams will interpret the analytics from the analytics teams to make decisions and take the resulting actions. The business side needs to understand where the data came from and how it was brought together to best interpret the analytics results. They will consume information captured by the data engineering and analytics teams but will also add information about how they use the data and the business results from the data.
What Should You Expect for Data Documentation?
The data documentation in many data transformation tools focuses on the data engineering side of the analytics process to ensure that data workflows are defined and executed properly. This basic form of data documentation is one way these tools help facilitate software development best practices within data engineering.
Only basic information about the data is captured in these data transformation tools, such as schema information. Any additional information is placed by data engineers within their data modeling and transformation code – SQL – as comments and is used to describe how the data was manipulated for other data engineers to use when determining how to best reuse data models.
The basic information capture and use in most data transformation tools limit the spread of information, knowledge capture, and knowledge sharing across the broader data, analytics, and business teams. This hinders the overall analytics process, makes analytics teams hesitant to trust data, and could lead to analytics and business teams misinterpreting data.
As you evaluate data transformation tools, you should look for much broader and deeper data documentation facilities that your extended data, analytics, and business teams can both use and contribute to. Information that can be captured, supplied, and used should include what is described below.
Auto-generated documentation and information
- The technical schema information about the data,
- The transformations performed both within each model and across the entire data workflow,
- Deep data profiles at each stage in the data workflow as well as in the end data model delivered to analytics teams,
- System-defined properties such as owner, create date, created by, last modified date, last modified by, and more,
- The end-to-end data lineage for any data workflow from raw data to the final consumed data model, and
- Auditing and status information such as when data workflows were run and when data models were generated.
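As a rough illustration of two of the auto-generated items above (technical schema information and data profiles), the sketch below derives both from a small in-memory table. The sample rows and the `infer_schema` and `profile_column` helpers are hypothetical and are not taken from any particular data transformation tool.

```python
from collections import Counter

# Hypothetical sample rows standing in for the output of a data model.
rows = [
    {"order_id": 1, "region": "EMEA", "amount": 120.0},
    {"order_id": 2, "region": "APAC", "amount": None},
    {"order_id": 3, "region": "EMEA", "amount": 80.0},
]

def infer_schema(rows):
    """Map each column to the sorted Python type names observed, ignoring nulls."""
    schema = {}
    for row in rows:
        for name, value in row.items():
            types = schema.setdefault(name, set())
            if value is not None:
                types.add(type(value).__name__)
    return {name: sorted(types) for name, types in schema.items()}

def profile_column(rows, name):
    """Compute a very small profile: row count, nulls, distinct values, top value."""
    values = [row.get(name) for row in rows]
    non_null = [v for v in values if v is not None]
    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "top": Counter(non_null).most_common(1)[0][0] if non_null else None,
    }

schema = infer_schema(rows)
profiles = {name: profile_column(rows, name) for name in schema}
```

A real tool would profile at each stage of the workflow and store the results alongside lineage and audit information; this sketch only shows the kind of information that can be generated without any user input.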
User-supplied information
- Descriptions that can be applied at the field level, data model level, and entire data workflow level,
- Tags that provide a standardized way to label datasets, covering everything from what the data contains to how it is used,
- Custom properties that allow analytics and business users to add business-level properties to the data,
- Status and certification fields whose specific purpose is to signal how far the data can be trusted, such as a status (live or in-dev) or a certified flag,
- Business metadata that allows analytics and business teams to describe data in their terms, and
- Comments that allow the entire team to add ad-hoc information about the data and communicate effectively in the data engineering process.
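One way to picture the user-supplied items above is as a single record attached to each dataset. The sketch below is a minimal model under that assumption; the `DatasetDoc` class, its field names, and the example values are hypothetical, and the two allowed statuses are simply the ones named in the list.

```python
from dataclasses import dataclass, field

# The two example statuses named in the list above.
ALLOWED_STATUSES = {"in-dev", "live"}

@dataclass
class DatasetDoc:
    """User-supplied documentation for one dataset (hypothetical shape)."""
    name: str
    description: str = ""
    field_descriptions: dict = field(default_factory=dict)  # field name -> description
    tags: set = field(default_factory=set)
    custom_properties: dict = field(default_factory=dict)
    business_metadata: dict = field(default_factory=dict)
    status: str = "in-dev"
    certified: bool = False
    comments: list = field(default_factory=list)  # (author, text) pairs

    def set_status(self, status):
        """Reject statuses outside the agreed vocabulary."""
        if status not in ALLOWED_STATUSES:
            raise ValueError(f"unknown status: {status!r}")
        self.status = status

    def add_comment(self, author, text):
        self.comments.append((author, text))

doc = DatasetDoc(name="orders_clean", description="Deduplicated orders, one row per order.")
doc.field_descriptions["amount"] = "Order total before tax."
doc.tags.update({"sales", "orders"})
doc.custom_properties["refresh_cadence"] = "daily"
doc.business_metadata["business_term"] = "Net order"
doc.set_status("live")
doc.certified = True
doc.add_comment("analyst_a", "Used for the weekly revenue dashboard.")
```

Keeping status values in a fixed vocabulary, rather than free text, is what lets these fields act as trust signals that other team members can filter on.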
Let’s explore how this broader and deeper set of data documentation positively impacts your analytics processes.
Collaboration and Knowledge-sharing
The broader and deeper data documentation described above helps the extended team involved in the analytics process to better collaborate and share the knowledge each has with the rest of the team. This level of collaboration allows the broader, diverse team to:
- Efficiently hand off models or components between members at various phases,
- Contribute and use their skills in the most effective manner,
- Provide and share knowledge for more effective reuse of models and promote proper use of the data in analytics, and
- Crowdsource tasks such as testing, auditing, and governance.
Beyond making the process efficient and increasing team productivity, a collaborative data transformation workflow reduces manual handoffs and the misinterpretation of requirements. This adds one more valuable benefit: fewer errors in the data transformations and a better chance that models are done right the first time.
Discovery
When specific analytics team members are waiting for data engineering to complete a project and deliver analytics-ready datasets, they are typically involved in the process and receive a handoff of the datasets. But what about the rest of the analytics team? Perhaps they can use these new datasets as well.
Your data modeling and transformation tool should have a rich, Google-like faceted search capability that allows any team member to search for datasets across ALL the information in the broad and deep data documentation. This allows:
- Analysts to easily discover what datasets are out there, how they can use these datasets, and quickly determine if datasets apply to the analytics problem they are currently trying to solve,
- Data engineers to easily find data workflows and data models created by other data engineers to determine if they may already solve the problem they are tasked with or if they can reuse them in their current project, and
- Business teams to discover the datasets used in the analytics they are consuming for complete transparency and to best interpret the results.
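To make the idea of faceted search over data documentation concrete, here is a minimal sketch: free-text matching against names, descriptions, and tags, combined with exact-match filters on documentation fields. The catalog entries and the `search` and `facet_counts` functions are hypothetical; a production tool would typically rely on a real search index rather than a linear scan.

```python
from collections import Counter

# Hypothetical documentation entries, one per dataset.
catalog = [
    {"name": "orders_clean", "description": "Deduplicated orders by region",
     "tags": {"sales", "orders"}, "status": "live", "owner": "data-eng"},
    {"name": "orders_raw", "description": "Raw orders export",
     "tags": {"orders"}, "status": "in-dev", "owner": "data-eng"},
    {"name": "churn_features", "description": "Customer features for churn models",
     "tags": {"ml", "customers"}, "status": "live", "owner": "analytics"},
]

def search(catalog, text="", **facets):
    """Return entries whose name, description, or tags contain `text`
    and whose fields equal every requested facet value."""
    needle = text.lower()
    hits = []
    for entry in catalog:
        haystack = " ".join(
            [entry["name"], entry["description"], *sorted(entry["tags"])]
        ).lower()
        if needle in haystack and all(entry.get(k) == v for k, v in facets.items()):
            hits.append(entry)
    return hits

def facet_counts(entries, facet):
    """Count how many entries fall under each value of a facet."""
    return Counter(entry[facet] for entry in entries)

hits = search(catalog, "orders")
live_hits = search(catalog, "orders", status="live")
status_facets = facet_counts(hits, "status")
```

The facet counts are what let a user narrow a broad query, for example from every dataset mentioning "orders" down to only those marked live.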
Facilitating Data Literacy and Strong Analytics
The broader and deeper data documentation we have described here can be used as a lynchpin for facilitating greater data literacy. This happens across all four personas:
- Data engineers – the data documentation information provided by the downstream consumers of the data workflows allows data engineering teams to have greater knowledge of how data is used and helps them get greater context into their future projects,
- Analysts – the information provided by data engineers, other analysts, and business teams allows analysts to gain a better understanding of how to use data and produce faster and more meaningful analytics,
- Data scientists – they can use the information provided about the data to understand the best form and fit for their AI and ML projects, leading to faster execution of projects and more accurate models, and
- Business teams – these teams can use the information to better understand the datasets behind the analytics, which increases their trust in the results and helps them take fast, decisive action.
Wrap Up
Your data documentation should be better than basic schema information and comments left by data engineers in their SQL code. Everyone involved in the analytics process – data engineers, analytics producers, and analytics consumers – has knowledge and information about the data that should be captured and shared across the entire team for everyone's benefit.
Using a data transformation tool that provides the richer data documentation we’ve described here delivers a faster analytics process, fosters collaboration and easy discoverability, and promotes greater data literacy. This leads to greater and better use of your data, strong and accurate analytics and data science, highly trusted results, and more decisive actions by the business.
Author: John Morrell
Source: Datameer