3 items tagged "data warehouse"

  • Data Lakes and the Need for Data Version Control  

    In the ever-evolving world of big data, managing vast amounts of information efficiently has become a critical challenge for businesses across the globe. As data lakes gain prominence as a preferred solution for storing and processing enormous datasets, the need for effective data version control mechanisms becomes increasingly evident. 

    In this article, we will delve into the concept of data lakes, explore their differences from data warehouses and relational databases, and discuss the significance of data version control in the context of large-scale data management.

    Understanding Data Lakes

    A data lake is a centralized repository that stores structured, semi-structured, and unstructured data in its raw format. Unlike traditional data warehouses or relational databases, data lakes accept data from a variety of sources, without the need for prior data transformation or schema definition. As a result, data lakes can accommodate vast volumes of data from different sources, providing a cost-effective and scalable solution for handling big data.

    Before we address the questions, ‘What is data version control?’ and ‘Why is it important for data lakes?’, we will discuss the key characteristics of data lakes.

    Schema-on-Read vs. Schema-on-Write

    Data lakes follow the ‘Schema-on-Read’ approach, which means data is stored in its raw form, and schemas are applied at the time of data consumption. In contrast, data warehouses and relational databases adhere to the ‘Schema-on-Write’ model, where data must be structured and conform to predefined schemas before being loaded into the database.
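    To make the contrast concrete, here is a minimal Python sketch (the file path, column names, and the SQLite stand-in for a warehouse are all hypothetical): the lake side accepts raw JSON lines with no schema enforced until read time, while the warehouse side requires a table definition to be declared before any rows are loaded.

```python
# Minimal sketch contrasting schema-on-read and schema-on-write
# (illustrative only; file path, columns, and SQLite warehouse are hypothetical).
import json
import sqlite3
import pandas as pd

# --- Schema-on-read: raw events land in the lake as-is; structure is
# --- imposed only when the data is consumed.
with open("landing/events.jsonl") as f:
    raw = [json.loads(line) for line in f]        # no schema enforced on write
events = pd.json_normalize(raw)                   # schema derived at read time

# --- Schema-on-write: the warehouse table requires a schema up front,
# --- and rows must conform to it before they can be loaded.
con = sqlite3.connect("warehouse.db")
con.execute("""
    CREATE TABLE IF NOT EXISTS events (
        event_id   TEXT PRIMARY KEY,
        user_id    TEXT NOT NULL,
        event_type TEXT NOT NULL,
        created_at TEXT NOT NULL
    )
""")
events[["event_id", "user_id", "event_type", "created_at"]].to_sql(
    "events", con, if_exists="append", index=False
)
```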

    Flexibility and Agility

    Data lakes provide flexibility, enabling organizations to store diverse data types without worrying about immediate data modeling. This allows data scientists, analysts, and other stakeholders to perform exploratory analyses and derive insights without prior knowledge of the data structure.

    Cost-Efficiency

    By leveraging cost-effective storage solutions like the Hadoop Distributed File System (HDFS) or cloud-based storage, data lakes can handle large-scale data without incurring prohibitive costs. This is particularly advantageous when dealing with exponentially growing data volumes.

    Data Lakes vs. Data Warehouses and Relational Databases

    It is essential to distinguish data lakes from data warehouses and relational databases, as each serves different purposes and has distinct characteristics.

    Data Warehouses

    Some key characteristics of data warehouses are as follows:

    • Data Type: Data warehouses primarily store structured data that has undergone ETL (Extract, Transform, Load) processing to conform to a specific schema.
    • Schema Enforcement: Data warehouses use a “schema-on-write” approach. Data must be transformed and structured before loading, ensuring data consistency and quality.
    • Processing: Data warehouses employ massively parallel processing (MPP) for quick query performance. They are optimized for complex analytical queries and reporting.
    • Storage Optimization: Data warehouses use columnar storage formats and indexing to enhance query performance and data compression (see the brief sketch after this list).
    • Use Cases: Data warehouses are tailored for business analysts, decision-makers, and executives who require fast, reliable access to structured data for reporting, business intelligence, and strategic decision-making.
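    As a rough illustration of the columnar-storage point above, the following Python sketch (assuming the pandas and pyarrow packages; the table and file name are invented) writes a table to a columnar Parquet file and then reads back only the columns an analytical query actually needs, which is the main mechanism behind the scan-speed and compression benefits.

```python
# Illustrative sketch of the columnar-storage idea (assumes pandas + pyarrow
# are installed; file name, columns, and data are hypothetical).
import pandas as pd

sales = pd.DataFrame({
    "order_id": range(400_000),
    "region":   ["EMEA", "AMER", "APAC", "AMER"] * 100_000,
    "amount":   [19.99, 5.50, 102.00, 7.25] * 100_000,
    "comments": ["" for _ in range(400_000)],
})

# Columnar write: values of each column are stored together, which compresses well.
sales.to_parquet("sales.parquet", index=False)

# An analytical query usually touches only a few columns; a columnar reader can
# skip the rest entirely instead of scanning whole rows.
revenue_by_region = (
    pd.read_parquet("sales.parquet", columns=["region", "amount"])
      .groupby("region")["amount"]
      .sum()
)
print(revenue_by_region)
```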

    In summary, data lakes prioritize data variety and exploration, making them suitable for scenarios where the data landscape is evolving rapidly and the initial data structure may not be well defined. Data lakes are better suited to storing diverse, raw data for exploratory analysis, while data warehouses focus on structured data, ensuring data quality and enabling efficient querying for business-critical operations such as business intelligence and reporting.

    Relational Databases

    Some key characteristics of relational databases are as follows:

    • Data Structure: Relational databases store structured data in rows and columns, where data types and relationships are defined by a schema before data is inserted.
    • Schema Enforcement: Relational databases use a “schema-on-write” approach, where data must adhere to a predefined schema before it can be inserted. This ensures data consistency and integrity.
    • Processing: Relational databases are optimized for transactional processing and structured queries using SQL. They excel at managing structured data and supporting ACID (Atomicity, Consistency, Isolation, Durability) transactions.
    • Scalability: Relational databases can scale vertically by upgrading hardware, but horizontal scaling can be more challenging due to the need to maintain data integrity and relationships.
    • Use Cases: Relational databases are commonly used for applications requiring structured data management, such as customer relationship management (CRM), enterprise resource planning (ERP), and online transaction processing (OLTP) systems.

    Data lakes are designed for storing and processing diverse and raw data, making them suitable for exploratory analysis and big data processing. Relational databases are optimized for structured data with well-defined schemas, making them suitable for transactional applications and structured querying.
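    To make the ACID point above concrete, here is a minimal Python sketch using the standard-library sqlite3 module (the table and values are hypothetical): the two updates either both commit or both roll back, so the database never sees a half-applied transfer.

```python
# Minimal sketch of an ACID transaction (Python standard-library sqlite3;
# table name and rows are hypothetical).
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL NOT NULL)")
con.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100.0), ("bob", 25.0)])
con.commit()

try:
    with con:  # opens a transaction; commits on success, rolls back on error
        con.execute("UPDATE accounts SET balance = balance - 40 WHERE name = 'alice'")
        con.execute("UPDATE accounts SET balance = balance + 40 WHERE name = 'bobb'")  # typo: no such row
        # Enforce an invariant; raising inside the block rolls back both updates.
        if con.execute("SELECT SUM(balance) FROM accounts").fetchone()[0] != 125.0:
            raise ValueError("transfer did not balance")
except ValueError:
    pass

print(con.execute("SELECT * FROM accounts ORDER BY name").fetchall())
# -> [('alice', 100.0), ('bob', 25.0)]  (atomic: neither update persisted)
```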

    The Importance of Data Version Control in Data Lakes

    As data lakes become the backbone of modern data infrastructures, the management of data changes and version control becomes a critical challenge. Data version control refers to the ability to track, manage, and audit changes made to datasets over time. This is particularly vital in data lakes for the following reasons.

    Data Volume and Diversity

    Data lakes often contain vast and diverse datasets from various sources, with updates and additions occurring continuously. Managing these changes efficiently is crucial for maintaining data consistency and accuracy.

    Collaborative Data Exploration

    In data lakes, multiple teams and stakeholders collaboratively explore data to derive insights. Without proper version control, different users may inadvertently overwrite or modify data, leading to potential data integrity issues and confusion.

    Auditing and Compliance

    In regulated industries or environments with strict data governance requirements, data version control is essential for tracking changes, understanding data lineage, and ensuring compliance with regulations.

    Handling Changes at Scale with Data Version Control

    To effectively handle changes at scale in data lakes, robust data version control mechanisms must be implemented. Here are some essential strategies:

    • Time-Stamped Snapshots: Maintaining time-stamped snapshots of the data allows for a historical view of changes made over time. These snapshots can be used to roll back to a previous state or track data lineage. A minimal sketch of this idea follows the list.
    • Metadata Management: Tracking metadata, such as data schema, data sources, and data transformation processes, aids in understanding the evolution of datasets and the context of changes.
    • Access Controls and Permissions: Implementing fine-grained access controls and permissions ensures that only authorized users can make changes to specific datasets, reducing the risk of unauthorized modifications.
    • Change Tracking and Notifications: Setting up change tracking mechanisms and notifications alerts stakeholders about data modifications, ensuring transparency and awareness.
    • Automated Testing and Validation: Automated testing and validation procedures help detect and rectify any anomalies or inconsistencies resulting from data changes.
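    As a rough sketch of the time-stamped-snapshot idea referenced above (not any particular product's API; the paths, dataset names, and manifest format are invented), the Python below writes each version of a dataset to an immutable, timestamped location and records it in a small manifest, which is enough to roll back to a previous state or trace lineage. In practice, purpose-built tools such as Delta Lake or lakeFS provide this kind of versioning natively; the sketch only illustrates the underlying idea.

```python
# Minimal sketch of time-stamped snapshots for a lake dataset
# (illustrative only; paths, dataset names, and manifest format are hypothetical).
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path

LAKE = Path("lake")
MANIFEST = LAKE / "manifest.json"

def commit_snapshot(dataset: str, source_file: Path, note: str) -> str:
    """Copy the current file into an immutable, timestamped path and log it."""
    version = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
    target = LAKE / dataset / version / source_file.name
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source_file, target)

    manifest = json.loads(MANIFEST.read_text()) if MANIFEST.exists() else []
    manifest.append({"dataset": dataset, "version": version,
                     "path": str(target), "note": note})
    MANIFEST.write_text(json.dumps(manifest, indent=2))
    return version

def latest_version(dataset: str) -> dict:
    """Return the most recent snapshot entry for a dataset (lineage lookup)."""
    manifest = json.loads(MANIFEST.read_text())
    return [e for e in manifest if e["dataset"] == dataset][-1]

# Usage: each load lands as a new snapshot; rolling back means reading an older path.
# commit_snapshot("orders", Path("staging/orders.csv"), note="daily load")
```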

    Conclusion

    Data lakes have revolutionized the way organizations manage and analyze large-scale data. Their ability to store diverse data types without predefined schemas makes them highly flexible and cost-efficient. However, managing changes in data lakes requires careful attention to ensure data consistency, accuracy, and compliance. 

    Data version control plays a crucial role in addressing these challenges, enabling organizations to handle changes at scale and derive valuable insights from their data lakes with confidence and reliability. By implementing robust version control mechanisms and following best practices, businesses can leverage data lakes to their full potential, driving innovation and informed decision-making.

    Date: September 21, 2023

    Author: Kruti Chapaneri

    Source: ODSC


  • Data Warehousing Lessons for A Data Lake World


    Over the past 2 decades, we have spent considerable time and effort trying to perfect the world of data warehousing. We took the technology that we were given and the data that would fit into that technology, and tried to provide our business constituents with the reports and dashboards necessary to run the businesses.

    It was a lot of hard work, and we had to do many “unnatural” acts to get these OLTP (Online Transaction Processing)-centric technologies to work: aggregated tables, a plethora of indices, user-defined functions (UDFs) in PL/SQL, and materialized views, just to name a few. Kudos to us!!
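    For anyone who never had to perform those “unnatural” acts, the hedged Python/SQLite sketch below illustrates the first two tricks (a pre-aggregated table plus a supporting index); the schema and data are invented, and SQLite merely stands in for the OLTP engine of your choice.

```python
# Illustrative sketch of two classic warehouse workarounds on an OLTP engine:
# a pre-aggregated summary table and a supporting index (SQLite as a stand-in;
# schema and data are hypothetical).
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sales (
        sale_id    INTEGER PRIMARY KEY,
        sale_date  TEXT,
        product_id INTEGER,
        amount     REAL
    );
    INSERT INTO sales VALUES
        (1, '2016-01-03', 10, 19.99),
        (2, '2016-01-03', 11,  5.50),
        (3, '2016-02-07', 10, 12.00);

    -- "Aggregated table": precompute the rollup so reports don't scan raw rows.
    CREATE TABLE sales_monthly AS
    SELECT substr(sale_date, 1, 7) AS month,
           product_id,
           SUM(amount)             AS revenue
    FROM sales
    GROUP BY month, product_id;

    -- One of the "plethora of indices" that kept dashboard queries tolerable.
    CREATE INDEX idx_sales_monthly ON sales_monthly (month, product_id);
""")

print(con.execute("SELECT * FROM sales_monthly ORDER BY month").fetchall())
```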

    Now as we get ready for the full onslaught of the data lake, what lessons can we take away from our data warehousing experiences? I don’t have all the insights, but I offer this blog in hopes that others will comment and contribute. In the end, we want to learn from our data warehousing mistakes, but we don’t want to throw out those valuable learnings.

    Special thanks to Joe DosSantos (@JoeDosSantos) for his help on this blog.

    Why Did Data Warehousing Fail?

    Below is the list of areas where data warehousing struggled or outright failed. Again, this list is not comprehensive, and I encourage your contributions.

    • Adding New Data Takes Too Long. It took too long to load new data into the data warehouse. The general rule to add new data to a data warehouse was 3 months and $1 million. Because of the need to pre-build a schema before loading data into the data warehouse, the addition of new data sources to the data warehouse was a major effort. We had to conduct weeks of interviews with every potential user to capture every question they might ever want to ask in order to build a schema that handled all of their query and reporting requirements. This greatly hindered our ability to quickly explore new data sources, so organizations resorted to other options, which leads to…
    • Data Silos. Because it took so long to add new data sources to the data warehouse, organizations found it more expedient to build their own data marts, spreadmarts[1], or Access databases. Very quickly there was a widespread proliferation of these purpose-built data stores across the organization. The result: no single version of the truth and lots of executive meetings wasting time debating whose version of the data was most accurate, which leads to…
    • Lack of Business Confidence. Because there was this proliferation of data across the organization and the resulting executive debates around whose data was most accurate, business leaders’ confidence in the data (and the data warehouse) quickly faded. This became especially true when the data being used to run a business unit was redefined for corporate use in such a way that it was not useful to the business. Take, for instance, a sales manager looking to assign a quota to the rep who manages the GE account and wanting a report of historical sales. For him, sales might be Gross and GE might include Synchrony, whereas the corporate division might look at sales as Net or Adjusted and GE as its legal entities. It’s not so much a question of right and wrong as it is the enterprise introducing definitions that undermine confidence, which leads to…
    • Underinvestment In Metadata. No business leader had the time to verify the accuracy of the data, and no IT person knew the business well enough to make those data accuracy decisions. Plus, spending the money to hire consultants to do our job for us was always a hard internal sell, which leads to the metadata management denial cycle:
      • IT: “You business users need to own the data.”
      • Business: “We don’t have time to do that.”
      • IT: “Okay, let’s hire consultants.”
      • Business: “Shouldn’t we know our data better than consultants?”
      • IT: “Okay, you business users need to own the data.”
      • And so forth…
    • Inability to Easily Share Data. The data warehouse lacked the ability to quickly ingest and consequently easily share data across different business functions and use cases. The data warehouse failed to become that single repository for the storage of the organization’s data assets because of the complexity, difficulty and slowness to add new data to the data warehouse, which leads to…
    • Shadow IT Spend. Nothing confirms the failure of the data warehouse more than shadow IT spend. Business users did not have confidence in how the data warehouse could help them address urgent business needs. Consequently, many line of business leaders pursued their own one-off IT initiatives (call center operations, sales force automation, campaign marketing, logistics planning, financial planning, etc.), which also further contributed to the unmanageable proliferation of data across the organizational data silos.
    • Inability to Handle Unstructured Data. Data warehouses cannot handle unstructured data. Unfortunately, the bulk of the world’s data is now found in semi-structured data (log files, sensors, beacons, routers, MAC addresses) and unstructured data (text files, social media postings, audio files, photos, video files). Organizations that wanted a holistic view of the business had to make do with only 10 to 20% of the available organizational data. It is hard to provide a holistic view with an 80% to 90% hole in it.
    • No Predictive Analytic Capabilities. Business Intelligence solutions provide the summarized data necessary to support the organization’s operational and management reporting needs (descriptive analytics). However, most data warehouses lacked the detailed data across a wide variety of structured and unstructured data sources to support the organization’s predictive and prescriptive analytic needs.
    • Too Damned Expensive. Data science is about creating behavioral analytics at the individual levels (e.g., customers, employees, jet engine, train engine, truck, wind turbine, etc.). To uncover these behavioral analytics at the individual level, data scientists need the complete history of detailed transactional, operational and engagement data. The data scientists don’t want 13 months of aggregated data; they want 17 years of detailed transactions, even if that data is now located on mag tape. Trying to gather all of the voluminous data on a data warehouse is a recipe for organizational bankruptcy.
    • Inadequate Processing Power. Let’s face it; data warehouses lacked the economical processing power necessary to analyze petabytes of customer and machine data to uncover behavioral patterns and propensities. Data lakes, built on modern scale-out big data environments using open-source software on commodity servers, are game changers, allowing organizations to store and analyze data volumes magnitudes bigger than could ever be economically fit into a data warehouse.

    What Did Data Warehousing Get Right?

    Okay, I was pretty harsh on the data warehouse world in which I grew up. But again, it was amazing what we were able to do with technology designed to deal with single records (insert, update, delete). I have never constructed analytics that uses only a single record. Analytics requires a massive number of records in order to uncover individual behaviors, propensities, tendencies, patterns, etc.

    So what did we get right, and what should we preserve as we move into the modern data lake world?

    • Data Governance. Data governance, into which I also group things like data accuracy, data lineage and data traceability, is as important now as it was in the data warehouse world. Having a process that allows the data science team to quickly ingest and explore the data unencumbered by data governance is a good practice. However you will need data governance rules, policies and procedures once you have determined that there is value in that data to support key decisions. If the business users do not have confidence in the data, then all is lost.
    • Metadata Management. The importance of metadata only becomes clearer as we begin to integrate data and analytics into the organization’s key business processes. The more metadata that we have about the data, the easier it is to get value from that data. Investing in the associated metadata carries the same economic value as investing in the data itself, IMHO. We want to enrich the data as much as possible, and a solid metadata management strategy is key for making that happen.
    • Conformed Dimensions. Having a single master file – or conformed dimension – for key business entities (e.g., products, customers, employees, physicians, teachers, stores, jet engines, locomotives, delivery trucks, etc.) is critical. It is these conformed dimensions that allow the data science team to tie together the wide variety of data sources to create the detailed analytic and behavioral profiles. Maintaining these conformed dimensions is hard work, but without them, there is no way to turn all this valuable data (and metadata) into actionable insights (a small sketch follows this list).
    • Single Version of The Truth. While I have always hated the term “single version of the truth,” operationally it is important to have all the data about your key business entities in a single (physical or logical) location. Also, in the Big Data world, the notion of data that is fit for purpose becomes critical. There may not be one truth, but there should be clarity as to how numbers are produced to provide transparency and trust.
    • Analytics Self-service. The idea of creating a self-service environment around analytics is very powerful. How do I pull IT out of the middle of the analytics request and provisioning process? If I truly want to create an environment where the analysts can quickly spin up an analytics sandbox and populate with data, I can’t have heavy manual processes in the middle of that process.
    • Reports Starting Point. The many reports and dashboards that have been built upon your data warehouse are a great starting point for your data lake journey. Business users have requested those reports for a reason. Instead of focusing time and effort on creating yet more reports, first try to understand what questions and decisions the business users hoped to address with those reports, and what additional predictive and prescriptive insights they need beyond those reports.
    • SQL Access. Yeah, SQL is still the query language of choice, and we need to embrace how we help SQL-trained analysts use that tool on the data lake. Open-source tools like Hive, HBase, and HAWQ are all designed to give that army of SQL-trained business users and analysts access to the wealth of data in the data lake.
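    As a small illustration of the conformed-dimension point above (all names, keys, and values are invented), the Python sketch below ties two different source systems to one shared customer dimension so that their measures can be combined into a single profile.

```python
# Minimal sketch of a conformed customer dimension tying two sources together
# (pandas; all names, keys, and values are hypothetical).
import pandas as pd

# One conformed dimension: the single agreed-upon list of customers.
dim_customer = pd.DataFrame({
    "customer_key": [1, 2],
    "customer_name": ["GE", "Acme Corp"],
})

# Two fact sources, each mapped to the same conformed key during ingest.
crm_orders = pd.DataFrame({"customer_key": [1, 1, 2], "order_amount": [100.0, 250.0, 80.0]})
support_tickets = pd.DataFrame({"customer_key": [1, 2, 2], "ticket_id": [901, 902, 903]})

# Because both facts share the dimension, they can be combined per customer.
orders_by_cust = (
    crm_orders.groupby("customer_key", as_index=False)["order_amount"].sum()
              .rename(columns={"order_amount": "total_orders"})
)
tickets_by_cust = (
    support_tickets.groupby("customer_key", as_index=False)["ticket_id"].count()
                   .rename(columns={"ticket_id": "ticket_count"})
)
profile = (
    dim_customer.merge(orders_by_cust, on="customer_key", how="left")
                .merge(tickets_by_cust, on="customer_key", how="left")
)
print(profile)
```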

    Summary

    There is much that can be learned from our data warehousing experiences. The key is to understand what to keep and what to throw out. That means a single data lake (not data lakes). That means data governance. That means metadata management, and much more of what we learned from our data warehousing experiences. We must learn from our experiences; otherwise…

    “Those who do not learn history are doomed to repeat it.”

    [1] Spreadmart (short for “spreadsheet data mart”) is a business intelligence term that refers to the propensity of some organizations or departments within organizations to use individual, desktop-based databases like spreadsheets as a primary means of data organization, storage, and dissemination.


  • How modern ETL processes lead to better decision making

    We live in a world of data: there’s more of it than ever before, in a ceaselessly expanding array of forms and locations. Dealing with data is your window into the ways data teams are tackling the challenges of this new world to help their companies and their customers thrive.

    In recent years we’ve seen data become vastly more available to businesses. This is mostly due to the rise of data warehouses, decrease in cost for data storage, and modern ETL tools that make storing and accessing data more approachable than ever before. This has allowed companies to become more and more data driven in all areas of their business. In fact, being data driven has become ubiquitous and imperative to survival in today’s climate. This article will discuss at a high level how modern businesses are leveraging new technology to ingest a wider variety of data sources. 

    As of 2019, according to Blissfully’s 2020 SaaS trends report, companies in all phases of business growth make use of many SaaS applications. Smaller companies (0-50 employees) use an average of 102 SaaS applications, while mid-sized ones (101-250 employees) use about 137. Enterprises, meanwhile, can have as many as 288 SaaS applications in use across the entire organization!

    These SaaS products range from CRMs to marketing automation apps to billing software to help desk tools, etc. These applications can log thousands to millions of rows of data made available via API, much of which can hold valuable insight for businesses. Add to this the company’s own internally-gathered product data and it’s clear that even a small or medium-sized business can do a lot with the data at its disposal. In this article, we’ll cover what the ETL process is, why it’s important to powering data-driven businesses, and how the rise of modern ETL tools has changed the ETL process forever.

    The ETL process: foundation of data-driven businesses

    Every company wants every team within their business to make smarter, data-driven decisions. Customer support teams look at trends in support tickets or do text analysis on conversations to understand where they can provide better onboarding and documentation. Marketing teams want better visibility into their ad performance across different platforms and the ROI on their spend. Product and engineering teams dig into productivity metrics or bug reports to help them better prioritize their resources. 

    The ETL process empowers all these different teams to get the information they need to understand and perform their jobs better. Businesses ingest data from a wide array of sources through the ETL process, which stands for Extract, Transform, Load. The prepared data is then available for analysis and use by the various teams who need it, as well as for advanced analytics, embedding into applications, and use for other data monetization efforts. Whatever you want to do with data, you have to ETL it first. 

    Explaining the traditional ETL process

    Traditionally, the extraction step of the process involves batch-pulling the data from the source via files, application database, or an API. In the transformation step, the data is cleaned and modeled so that it’s fit for analysis and ingestible by the data warehouse. Finally, the cleaned tables are loaded into the data warehouse for analysis. 
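    The hand-rolled scripts described above typically look something like the following Python sketch (the API endpoint, field names, and warehouse connection string are all hypothetical, and the requests, pandas, and SQLAlchemy packages are assumed); it is exactly this kind of script that breaks when a source schema or API changes.

```python
# Hand-rolled batch ETL sketch: extract from an API, transform in memory,
# load into a warehouse table. Endpoint, fields, and DSN are hypothetical;
# assumes the requests, pandas, and SQLAlchemy packages.
import pandas as pd
import requests
from sqlalchemy import create_engine

def extract() -> list[dict]:
    resp = requests.get("https://api.example-crm.com/v1/contacts", timeout=30)
    resp.raise_for_status()
    return resp.json()["results"]

def transform(records: list[dict]) -> pd.DataFrame:
    df = pd.DataFrame(records)
    df = df.rename(columns={"createdAt": "created_at"})           # conform names
    df["created_at"] = pd.to_datetime(df["created_at"], utc=True)
    df["email"] = df["email"].str.lower().str.strip()             # basic cleaning
    return df[["contact_id", "email", "created_at"]]              # model for analysis

def load(df: pd.DataFrame) -> None:
    engine = create_engine("postgresql+psycopg2://user:pass@warehouse:5432/analytics")
    df.to_sql("crm_contacts", engine, schema="raw", if_exists="append", index=False)

if __name__ == "__main__":
    load(transform(extract()))
```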

    This traditional ETL process is very difficult to scale. It often requires full-time data engineers to develop and maintain the scripts that keep the data flowing. This is because the data providers often make changes to their schemas or APIs, which then break the scripts that power the ETL process. Every time there’s a change, the data engineers scramble to update their scripts to accommodate them, resulting in downtime. With businesses now needing to ingest data from so many disparate (often fast-moving) data sources, maintaining ETL scripts for each one is not scalable. 

    Better living through ETL tools: the modern ETL process

    The modern ETL process follows a slightly different order of operations, dubbed ELT. This new process arose as a result of the introduction of tools to update the ETL process, as well as the rise of modern data warehouses with relatively low storage costs.

    Today, ETL tools do the heavy lifting for you. They have integrations for many of the major SaaS applications, and have teams of engineers who maintain those integrations, taking the pressure off of your in-house data team. These ETL tools are built to connect to most major data warehouses, allowing businesses to plug in their apps on one end and their warehouse on the other, while the ETL tools do the rest.

    Users can usually control orchestration via simple drop-down selections within the apps, alleviating the need to stand up your own servers or an EC2 box or to build DAGs to run on platforms like Airflow. ETL tools can also typically offer more robust options for appending new data incrementally, or only updating new and modified rows, which allows for more frequent loads and closer-to-real-time data for the business. With this simplified process for making data available for analysis, data teams can focus on finding new applications for data to generate value for the business.
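    The “only new and modified rows” behavior mentioned above is commonly implemented with a high-watermark pattern. Here is a hedged Python sketch of that idea (the source table, timestamp column, and state file are invented, and this is not any specific vendor’s implementation). Because changed rows are appended, downstream models usually also deduplicate on the primary key, which is typically handled in the warehouse transform step.

```python
# Incremental-load sketch using an `updated_at` high watermark
# (illustrative only; source table, column names, and state file are hypothetical).
import json
from pathlib import Path

import pandas as pd
from sqlalchemy import text

STATE = Path("state/orders_watermark.json")

def read_watermark() -> str:
    return json.loads(STATE.read_text())["updated_at"] if STATE.exists() else "1970-01-01T00:00:00Z"

def write_watermark(value: str) -> None:
    STATE.parent.mkdir(parents=True, exist_ok=True)
    STATE.write_text(json.dumps({"updated_at": value}))

def sync_orders(source_engine, warehouse_engine) -> None:
    since = read_watermark()
    # Pull only rows created or modified after the last successful sync.
    changed = pd.read_sql(
        text("SELECT * FROM orders WHERE updated_at > :since ORDER BY updated_at"),
        source_engine, params={"since": since},
    )
    if changed.empty:
        return
    changed.to_sql("orders", warehouse_engine, schema="raw", if_exists="append", index=False)
    write_watermark(str(changed["updated_at"].max()))
```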

    The ETL process and data warehouses

    Data warehouses are the present and future of data and analytics. Storage costs on data warehouses have drastically reduced in recent years, which allows businesses to load as many raw data sources as possible without the same concerns they might have had before.

    Today, data teams can ingest raw data before transforming it, allowing them to do the transformations in the warehouse instead of in a separate staging area. With the increased availability of data and a common language, SQL, to access it, the business gains more flexibility in leveraging its data to make the right decisions.
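    A minimal sketch of that in-warehouse transformation step might look like the following (Python, with SQLite standing in for a cloud warehouse; the raw table and derived model are invented): raw rows are loaded untouched, and the analysis-ready table is derived from them with SQL inside the warehouse itself. In practice this transform layer is often managed with a tool such as dbt, but the principle is the same.

```python
# ELT sketch: load raw data as-is, then transform inside the warehouse with SQL
# (SQLite stands in for a cloud warehouse; tables and columns are hypothetical).
import sqlite3

warehouse = sqlite3.connect(":memory:")

# L: raw rows land untransformed, exactly as extracted from the source.
warehouse.execute("CREATE TABLE raw_signups (payload_email TEXT, payload_plan TEXT, loaded_at TEXT)")
warehouse.executemany(
    "INSERT INTO raw_signups VALUES (?, ?, ?)",
    [(" Ada@Example.com ", "PRO", "2021-05-01"),
     ("grace@example.com", "free", "2021-05-01")],
)

# T: the cleaned, analysis-ready model is built with SQL in the warehouse itself.
warehouse.executescript("""
    CREATE TABLE stg_signups AS
    SELECT lower(trim(payload_email)) AS email,
           lower(payload_plan)        AS plan,
           loaded_at
    FROM raw_signups;
""")

print(warehouse.execute("SELECT * FROM stg_signups").fetchall())
```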

    The modern ETL process: delivering better results, faster

    Under the traditional ETL process, as data and processing requirements grew, the chance that on-premise data warehouses would fail grew as well. When this happened, IT would have to swoop in to fix the issue, which usually meant adding more hardware.

    The modern ETL process in today’s data warehouses sidesteps this issue by offloading compute resource management to the cloud data warehouse. Many cloud data warehouses offer compute scaling that allows for dynamic scaling when needs spike. This lets data teams maintain scalable performance while supporting a growing number of computationally expensive data models and ingesting more large data sources. The decreased cost of compute power, along with compute scaling in cloud data warehouses, allows data teams to efficiently scale resources up or down to suit their needs and better ensure no downtime. The bottom line is that, instead of having your in-house data and/or IT team fretting over your data storage and computing issues, you can offload that pretty much completely to the data warehouse provider.

    Data teams can then build tests on top of their cloud data warehouse to monitor their data sources for quality, freshness, and so on, giving them quicker, more proactive visibility into any problems with their data pipelines.
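    A freshness check of the kind described above can be as simple as the following hedged Python sketch (the warehouse engine, table name, and 24-hour threshold are invented; alerting is reduced to raising an exception): it looks up the newest loaded timestamp and complains if the pipeline has fallen behind.

```python
# Minimal data-freshness test sketch (warehouse engine, table name, and the
# 24-hour threshold are hypothetical; alerting is reduced to an exception).
from datetime import datetime, timedelta, timezone

import pandas as pd

def check_freshness(warehouse_engine, table: str = "raw.orders",
                    max_lag: timedelta = timedelta(hours=24)) -> None:
    latest = pd.read_sql(
        f"SELECT MAX(loaded_at) AS latest FROM {table}", warehouse_engine
    )["latest"].iloc[0]
    lag = datetime.now(timezone.utc) - pd.to_datetime(latest, utc=True)
    if lag > max_lag:
        # In practice this would page the data team or post to a chat channel.
        raise RuntimeError(f"{table} is stale: last load {lag} ago (threshold {max_lag})")

# Schedule this after each load; similar checks can assert row counts, null rates, etc.
```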

    From ETL to ELT and beyond

    Data, analytics, and BI have radically evolved since their inception. We are leaps and bounds beyond Excel tables and on-prem-centric data sources. Cloud-native data warehouses, cloud-native architecture in analytics and BI platforms, and embedded analytics powered by these systems have redefined what it means to be truly data-driven in our modern age.

    The ETL process has been updated and can now deliver insights from a wide array of datasets, which helps companies and teams of all kinds make smarter decisions, faster. It also opens the doors for advanced analytics, next-level data monetization, and much more. Whatever you’re building with your data, a modern ELT setup will help you get more from your data, easier.

    While we have seen a large migration to data warehouses, looking to the future we may see yet another evolution, in which data lakes are more widely used as the initial destination for all raw data sources, offering even further cost benefits. Additionally, new tools are taking shape that allow for transformation within these data lakes, continuing the evolution of tools and processes within the data pipeline along the path from ETL to ELT.

    Author: Adam Luba

    Source: Sisense
