3 items tagged "structured data"

  • Building Your Data Structure: FAIR Data

    Obtaining access to the right data is a first, essential step in any Data Science endeavour. But what makes the data “right”?

    The difference in datasets

    Every dataset can be different, not only in terms of content, but in how the data is collected, structured and displayed. For example, how national image archives store and annotate their data is not necessarily how meteorologists store their weather data, nor how forensic experts store information on potential suspects. The problem occurs when researchers from one field need to use a dataset from a different field. The disparity in datasets is not conducive to the re-use of (multiple) datasets in new contexts.

    The FAIR data principles provide guidelines to improve the Findability, Accessibility, Interoperability, and Reuse of digital assets. The emphasis is placed on the ability of computational systems to find, access, interoperate, and reuse data with no or minimal human intervention. Launched at a Lorentz workshop in Leiden in 2014, the principles quickly became endorsed and adopted by a broad range of stakeholders (e.g. European Commission, G7, G20) and have been cited widely since their publication in 2016 [1]. The FAIR principles are agnostic of any specific technological implementation, which has contributed to their broad adoption and endorsement.
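
    In practice, machine-actionability starts with rich, standardized metadata attached to each dataset. The sketch below is a minimal, hypothetical illustration of such a metadata record expressed programmatically; the field names, identifiers, and URLs are invented for illustration, since FAIR prescribes no specific schema.

        # A minimal, illustrative metadata record for a dataset.
        # Field names and values are hypothetical: FAIR does not mandate a
        # particular schema, only that metadata be rich, standardized, and
        # machine-readable.
        dataset_metadata = {
            "identifier": "https://doi.org/10.1234/example-dataset",     # persistent identifier (Findable)
            "title": "Daily weather observations, 2010-2020",
            "keywords": ["meteorology", "temperature", "precipitation"],
            "access_url": "https://repository.example.org/datasets/42",  # retrievable via a standard protocol (Accessible)
            "conforms_to": "http://vocab.example.org/weather-schema",    # shared vocabulary (Interoperable)
            "license": "CC-BY-4.0",                                      # clear conditions for reuse (Reusable)
        }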

    Why do we need datasets that can be used in new contexts?

    Ensuring that data sources can be (re)used in many different contexts can lead to unexpected results. For example, combining depression-related health data with weather data can reveal correlations between mental states and weather conditions. The original data resources were not created with this reuse in mind; however, applying the FAIR principles to these datasets makes such an analysis possible.

    FAIRness in the current crisis

    A pressing example of the importance of FAIR data is the current COVID-19 pandemic. Many patients worldwide have been admitted to hospitals and intensive care units. While global efforts are moving towards effective treatments and a COVID-19 vaccine, there is still an urgent need to combine all the available data. This includes information from distributed multimodal patient datasets that are stored at local hospitals in many different, and often unstructured, formats.

    Learning about the disease and its stages, and which drugs may or may not be effective, requires combining many data resources, including SARS-CoV-2 genomics data, relevant scientific literature, imaging data, and various biomedical and molecular data repositories.

    One of the issues that needs to be addressed is combining privacy-sensitive patient information with open viral data at the patient level, where these datasets typically reside in very different repositories (often hospital bound) without easily mappable identifiers. This underscores the need for federated and local data solutions, which lie at the heart of the FAIR principles.

    Examples of concerted efforts to build an infrastructure of FAIR data to combat COVID-19 and future virus outbreaks include the VODAN initiative [2] and the COVID-19 data portal organised by the European Bioinformatics Institute and the ELIXIR network [3].

    FAIR data in Amsterdam

    Many scientific and commercial applications require combining multiple sources of data for analysis. While a digital infrastructure and (financial) incentives are required for data owners to share their data, we will only unlock the full potential of existing data archives when we are also able to find the datasets we need and use the data within them.

    The FAIR data principles allow us to describe individual datasets better and make them easier to re-use in many diverse applications beyond the sciences for which they were originally developed. Amsterdam provides fertile ground for finding partners with the appropriate expertise for developing both digital and hardware infrastructures.

    Author: Jaap Heringa

    Source: Amsterdam Data Science

  • Conquering the 4 Key Data Integration Challenges

    Integrating data successfully into a single platform can be a challenge. Well-integrated data makes it easy for the appropriate staff to access and work with it; poorly integrated data creates problems. Data integration can be described as the process of collecting data from a variety of sources and transforming it into a format compatible with the data storage system, typically a database or a data warehouse. Using integrated data to make business decisions has become common practice for many organizations. Unfortunately, the data integration process can be troublesome, making it difficult to use the data when it is needed.

    Successful data integration allows researchers to develop meaningful insights and useful business intelligence.

    Integrated data creates a layer of informational connectivity that lays a base for research and analytics. Data integration maximizes the value of a business’s data, but the integration process requires the right tools and strategies. It allows a business to increase its returns, optimize its resources, and improve customer satisfaction. Data integration promotes high-quality data and useful business intelligence. 

    With the amount of data consistently growing in volume, and the variety of data formats, data integration tools (such as data pipelines) become a necessity. 

    By sharing this high-quality data across departments, organizations can streamline their processes and improve customer satisfaction. Other benefits of integrated data include:

    • Improved communication and collaboration
    • Increased data value 
    • Faster, better decisions based on accurate data
    • Increased sales and profits

    For data to be useful, it must be available for analysis, which means it must be in a readable format. 
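
    As a small illustration of that “collect and transform” step, the sketch below merges records from two hypothetical sources with different field names and units into one common format (the source layouts and field names are invented for this example):

        # Minimal illustration of data integration: two sources with different
        # layouts are transformed into one shared record format.
        crm_records = [
            {"customer_id": 17, "full_name": "A. Jansen", "spend_eur": 120.50},
        ]
        webshop_records = [
            {"id": "17", "name": "A. Jansen", "total_cents": 4300},
        ]

        def from_crm(record):
            return {"customer_id": int(record["customer_id"]),
                    "name": record["full_name"],
                    "spend_eur": float(record["spend_eur"])}

        def from_webshop(record):
            return {"customer_id": int(record["id"]),
                    "name": record["name"],
                    "spend_eur": record["total_cents"] / 100}

        # The unified records can now be loaded into a single database or warehouse.
        unified = [from_crm(r) for r in crm_records] + [from_webshop(r) for r in webshop_records]
        print(unified)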

    A Variety of Sources

    Data can be gathered from internal sources as well as a variety of external sources. Data taken from internal sources is referred to as “primary data,” while “secondary data” is usually, but not always, collected from outside sources. The sources selected can vary depending on the needs of the research, and each data storage system is different.

    Secondary data is not limited to that from a different organization. It can also come from within an organization itself. Additionally, there are open data sources. 

    With the growing volume of data, the large number of data sources, and their varying formats, data integration has become a necessity for doing useful research. It has become an integral part of developing business intelligence. Some examples of data sources are listed below.

    Primary Data

    • Sensors: Recorded data from a sensor, such as a camera or thermometer
    • Survey: Answers to business and quality of service questions
    • User Input: Often used to record customer behavior (clicks, time spent)
    • Geographical Data: The location of an entity (a person or machine), recorded by positioning equipment at a point in time
    • Transactions: Business transactions (typically online)
    • Event Data: Recording of the data is triggered by an event (email arriving, sensor detecting motion)

    Secondary Data

    • World Bank Open Data
    • Data.gov (studies by the U.S. government)
    • NYU Libraries Research Guides (Science)

    Internal Secondary Data

    • Quickbooks (for expense management)
    • Salesforce (for customer information/sales data)
    • Quarterly sales figures
    • Emails 
    • Metadata
    • Website cookies

    Purchased, third-party data can also be a source, though it requires care. Two fairly safe sources of third-party data are the Data Supermarket and Databroker. This type of data is purchased by businesses that have no direct relationship with the consumers it describes.

    Top Data Integration Challenges

    Data integration is an ongoing process that will evolve as the organization grows. Integrating data effectively is essential to improve the customer experience, or to gain a better understanding of the areas in the business that need improving. There are a number of prominent data integration problems that businesses commonly encounter:

    1. Data is not where it should be: This common problem occurs when data is not stored in a central location but is instead spread across the organization’s various departments. This situation increases the risk of missing crucial information during research.

    A simple solution is to store all data in a single location (or perhaps two: the primary database and a data warehouse). Apart from personal information that is protected by law, departments should share their information, and data silos should be eliminated.

    2. Data collection delays: Often, data must be processed in real time to provide accurate and meaningful insights. However, if data technicians must be involved to manually complete the data integration process, real-time processing is not possible. This, in turn, leads to delays in customer processing and analytics. 

    The solution to this problem is automated data integration tools, which have been developed specifically to process data in real time, promoting efficiency and customer satisfaction.
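
    A minimal sketch of what such automation can look like is shown below: records are validated and loaded as they arrive, without a technician re-running the integration by hand. The event source, validation rule, and target are stand-ins invented for illustration.

        import time

        def incoming_events():
            """Stand-in for a real event source (message queue, change stream, API)."""
            for i in range(3):
                yield {"order_id": i + 1, "amount": 10.0 * (i + 1), "ts": time.time()}

        def validate(event):
            # Reject obviously malformed records before they reach the target system.
            return event.get("order_id") is not None and event.get("amount", 0) > 0

        warehouse = []  # stand-in for the target database or warehouse table

        for event in incoming_events():
            if validate(event):
                warehouse.append(event)  # in practice: an INSERT or a bulk load

        print(f"Loaded {len(warehouse)} records without manual intervention")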

    3. Unstructured data formatting issues: A common challenge for data integration is the use of unstructured data (photos, video, audio, social media). A continuously growing amount of unstructured data is being generated and collected by businesses. Unstructured data often contains useful information that can impact business decisions. Unfortunately, unstructured data is difficult for computers to read and analyze. 

    There are new software tools that can assist in translating unstructured data (e.g., MonkeyLearn, which uses machine learning to find patterns, and Cogito, which uses natural language processing).
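
    The snippet below gives a generic, simplified idea of what such tools do; it uses plain regular expressions rather than the named products’ actual APIs, and the messages and patterns are invented. Free-form text is scanned for recognizable patterns and turned into structured fields:

        import re

        # Free-form customer feedback (unstructured data).
        messages = [
            "Order #1042 arrived late, contact me at jane@example.com",
            "Great service! Order #1077 was perfect.",
        ]

        structured = []
        for text in messages:
            order = re.search(r"Order #(\d+)", text)
            email = re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", text)
            structured.append({
                "order_id": int(order.group(1)) if order else None,
                "contact_email": email.group(0) if email else None,
                "raw_text": text,
            })

        print(structured)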

    4. Poor-quality data: Poor-quality data has a negative impact on research and can promote poor decision-making. In some cases, there is an abundance of data, but huge amounts reflect “old” information that is no longer relevant or that directly conflicts with current information. In other cases, duplicated or partially duplicated data can provide an inaccurate representation of customer behavior. Inputting large amounts of data manually can also lead to mistakes.

    The quality of data determines how valuable an organization’s business intelligence will be. If an organization has an abundance of poor-quality data, it must be assumed there is no Data Governance program in place, or the Data Governance program is poorly designed. The solution to poor data quality is the implementation of a well-designed Data Governance program. (A first step in developing a Data Governance program is cleaning up the data. This can be done in-house with the help of data quality tools or with the more expensive solution of hiring outside help.)
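
    As a small illustration of that clean-up step, the sketch below uses pandas (with invented column names and records) to remove exact duplicates and drop records older than a chosen cut-off before the data is analyzed:

        import pandas as pd

        # Hypothetical customer records containing a duplicate and a stale entry.
        df = pd.DataFrame([
            {"customer_id": 1, "email": "a@example.com", "updated": "2024-03-01"},
            {"customer_id": 1, "email": "a@example.com", "updated": "2024-03-01"},  # exact duplicate
            {"customer_id": 2, "email": "b@example.com", "updated": "2015-06-10"},  # stale record
            {"customer_id": 3, "email": "c@example.com", "updated": "2024-02-15"},
        ])
        df["updated"] = pd.to_datetime(df["updated"])

        cleaned = (
            df.drop_duplicates()                              # remove exact duplicates
              .loc[lambda d: d["updated"] >= "2020-01-01"]    # drop records too old to be relevant
        )
        print(cleaned)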

    The Future of Data Integration

    Data integration methods are shifting from ETL (extract-transform-load) to automated ELT (extract-load-transform) and cloud-based data integration. Machine learning (ML) and artificial intelligence (AI) for data integration are still in the early stages of development.

    An ELT system loads raw data directly to a data warehouse (or a data lake), shifting the transformation process to the end of the pipeline. This allows the data to be examined before being transformed and possibly altered. This process is very efficient when processing significant amounts of data for analytics and business intelligence.
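
    A minimal ELT-style sketch, using SQLite as a stand-in for a warehouse (the table and column names are invented), loads the raw records first and only then transforms them with SQL inside the warehouse:

        import sqlite3

        conn = sqlite3.connect(":memory:")  # stand-in for a data warehouse

        # Extract + Load: raw data goes in first, untransformed.
        conn.execute("CREATE TABLE raw_orders (order_id INTEGER, amount_cents INTEGER, country TEXT)")
        conn.executemany(
            "INSERT INTO raw_orders VALUES (?, ?, ?)",
            [(1, 1250, "NL"), (2, 4300, "DE"), (3, 990, "NL")],
        )

        # Transform: performed at the end of the pipeline, inside the warehouse.
        conn.execute("""
            CREATE TABLE orders_by_country AS
            SELECT country, SUM(amount_cents) / 100.0 AS revenue_eur
            FROM raw_orders
            GROUP BY country
        """)
        print(conn.execute("SELECT * FROM orders_by_country ORDER BY country").fetchall())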

    A cloud-based data integration system helps businesses merge data from various sources, typically sending it to a cloud-based data warehouse. This integration system improves operational efficiency and supports real-time data processing. As more businesses use Software-as-a-Service, experts predict more than 90% of data-driven businesses will eventually shift to cloud-based data integration. From the cloud, integrated data can be accessed with a variety of devices.

    Using machine learning and artificial intelligence to integrate data is a recent development, and still evolving. AI- and ML-powered data integration requires less human intervention and handles semi-structured or unstructured data formats with relative ease. AI can automate the data transformation mapping process with machine learning algorithms.
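
    As a deliberately simplified illustration of automated mapping (not a production AI system), the sketch below uses string similarity from the Python standard library to suggest which source columns correspond to which target columns; the column names are invented:

        from difflib import SequenceMatcher

        source_columns = ["cust_name", "e_mail_address", "purchase_amt"]
        target_columns = ["customer_name", "email", "purchase_amount"]

        def best_match(column, candidates):
            # Pick the target column whose name is most similar to the source column.
            return max(candidates, key=lambda c: SequenceMatcher(None, column, c).ratio())

        mapping = {col: best_match(col, target_columns) for col in source_columns}
        print(mapping)
        # Real ML-based mapping would also look at data values, types, and learned
        # models, not just column names.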

    Author: Keith D. Foote

    Source: Dataversity

  • The difference between structured and unstructured data

    Structured data and unstructured data are both forms of data, but the first uses a single standardized format for storage, and the second does not. Structured data must be appropriately formatted (or reformatted) to provide a standardized data format before being stored, which is not a necessary step when storing unstructured data.

    The relational database provides an excellent example of how structured data is used and stored. The data is normally formatted into specific fields (for example, credit card numbers or addresses), allowing the data to be easily found using SQL.

    Non-relational databases, also called NoSQL, provide a way to work with unstructured data.
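
    To make the contrast concrete, the short sketch below stores a structured record in a relational table (using SQLite) and keeps an unstructured item as a free-form document of the kind a NoSQL store would hold; the table layout and example records are invented:

        import json
        import sqlite3

        # Structured data: a fixed schema with typed fields, queryable with SQL.
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, city TEXT)")
        conn.execute("INSERT INTO customers VALUES (1, 'A. Jansen', 'Amsterdam')")
        print(conn.execute("SELECT name FROM customers WHERE city = 'Amsterdam'").fetchall())

        # Unstructured (or semi-structured) data: no fixed schema is imposed up front.
        # Document-oriented NoSQL databases keep records like this one as-is.
        support_ticket = {
            "text": "My order arrived damaged, see the attached photo.",
            "attachments": ["photo_123.jpg"],
            "metadata": {"channel": "email", "received": "2024-03-01T10:15:00"},
        }
        print(json.dumps(support_ticket, indent=2))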

    Edgar F. Codd proposed the relational model in 1970, and relational database management systems (RDBMSs) became popular during the 1980s. Relational databases allow users to access and query data using SQL (Structured Query Language). RDBMSs and SQL gave organizations the ability to analyze stored data on demand, providing a significant advantage over the competition of those times.

    Relational databases are user-friendly and highly efficient at maintaining accurate records. Regrettably, they are also quite rigid and cannot work with other languages or data formats.

    Unfortunately for relational databases, the internet gained significantly in popularity during the mid-1990s, and their rigidity could not handle the variety of languages and formats that became accessible. This made research difficult, and NoSQL was developed as a solution between 2007 and 2009.

    A NoSQL database translates data written in different languages and formats efficiently and quickly and avoids the rigidity of SQL. Structured data is often stored in relational databases and data warehouses, while unstructured data is often stored in NoSQL databases and data lakes.

    For broad research, NoSQL databases working with unstructured data are often a better choice than relational databases because of their speed and flexibility.

    The Expanded Use of the Internet and Unstructured Data

    During the late 1980s, the low prices of hard disks, combined with the development of data warehouses, resulted in remarkably inexpensive data storage. This, in turn, resulted in organizations and individuals embracing the habit of storing all data gathered from customers, and all the data collected from the internet for research purposes. A data warehouse allows analysts to access research data more quickly and efficiently.

    Unlike a relational database, which is used for a variety of purposes, a data warehouse is specifically designed for a quick response to queries.

    Data warehouses can be cloud-based or part of a business’s in-house mainframe server. They are compatible with SQL systems because, by design, they rely on structured datasets. Generally speaking, data warehouses are not compatible with unstructured data or NoSQL databases. Before the 2000s, businesses focused only on extracting and analyzing information from structured data.

    The internet began to offer unique data analysis opportunities and data collections in the early 2000s. With the growth of web research and online shopping, businesses such as Amazon, Yahoo, and eBay began analyzing their customers’ behavior using such things as search logs, click rates, and IP-specific location data. This abruptly opened up a whole new world of research possibilities. The profits resulting from this research prompted other organizations to begin their own expanded business intelligence research.

    Data lakes came about in roughly 2015 as a way to deal with unstructured data. Currently, data lakes can be set up both in-house and in the cloud (the cloud version eliminates in-house installation difficulties and costs). The advantages of moving a data lake from an in-house location to the cloud for analyzing unstructured data include:

    • More efficient cloud-based tools: The tools available in the cloud can build data pipelines much more efficiently than in-house tools. Often, the data pipeline is pre-integrated, offering a working solution while saving hundreds of hours of in-house setup costs.
    • Scaling as needed: A cloud provider can provide and manage scaling for stored data, whereas an in-house system would require adding machines or managing clusters.
    • A flexible infrastructure: Cloud services provide a flexible, on-demand infrastructure that is charged for based on time used. Additional services can also be accessed. (However, confusion and inexperience can result in wasted time and money.)
    • Backup copies: Cloud providers strive to prevent service interruptions, so they store redundant copies of the data on physically different servers in case data is lost.

    Data lakes, sadly, have not become the perfect solution for working with unstructured data. The data lake industry is about seven years old and is not yet mature – unlike structured/SQL data systems. 

    Cloud-based data lakes may be easy to deploy but can be difficult to manage, resulting in unexpected costs. Data reliability issues can develop when batch and streaming data are combined, or when data becomes corrupted. A lack of experienced data lake professionals is also a significant problem.

    Data lakehouses, which are still in the development stage, have the goal of storing and accessing unstructured data, while providing the benefits of structured data/SQL systems. 

    The Benefits of Using Structured Data

    The primary benefit of structured data is its ease of use. This benefit is expressed in three ways:

    • A great selection of tools: Because this popular way of organizing data has been around for a while, a significant number of tools have been developed for structured/SQL databases.
    • Machine learning algorithms: Structured data works remarkably well for training machine learning algorithms. The clearly defined nature of structured data provides input that machine learning can understand and work with (a minimal sketch follows this list).
    • Business transactions: Structured data can be used for business purposes by the average person because it is easy to use; no deep understanding of different types of data is needed.
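
    The machine learning point can be made concrete with a tiny sketch: because the columns of a structured table have fixed meanings, they can be fed to a learning algorithm directly. Scikit-learn is used here as one common choice; the feature columns and labels are invented:

        from sklearn.tree import DecisionTreeClassifier

        # Structured, tabular data: each column has a fixed, well-defined meaning.
        # Columns: [orders_last_year, avg_order_value_eur]
        X = [[12, 40.0], [1, 15.0], [8, 55.0], [0, 10.0]]
        y = [1, 0, 1, 0]  # hypothetical label: 1 = likely repeat customer

        model = DecisionTreeClassifier().fit(X, y)
        print(model.predict([[5, 30.0]]))  # prediction for a new, unseen customer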

    The Benefits of Using Unstructured Data 

    Examples of unstructured data include such things as social media posts, chats, email, presentations, photographs, music, and IoT sensor data. The primary strength of NoSQL databases and data lakes is their flexibility in handling a variety of data formats. The benefits of working with NoSQL databases or data lakes are:

    • Faster accumulation rates: Because there is no need to transform different types of data into a standardized format, it can be gathered quickly and efficiently.
    • More efficient research: A broader base of data taken from a variety of sources typically provides more accurate predictions of human behavior.

    The Future of Structured and Unstructured Data

    Over the next decade, unstructured data will become much easier to work with and much more commonplace, and it will be combined with structured data without difficulty. Tools for structured data will continue to be developed, and structured data will continue to be used for business purposes.

    Although very much in the early stages of development, artificial intelligence algorithms have been developed that help find meaning automatically when searching unstructured data.

    Currently, Microsoft’s Azure AI is using a combination of optical character recognition, voice recognition, text analysis, and machine vision to scan and understand unstructured collections of data that may be made up of text or images. 

    Google offers a wide range of tools using AI algorithms that are ideal for working with unstructured data. For example, Vision AI can decode text, analyze images, and even recognize the emotions of people in photos.

    In the next decade, we can predict that AI will play a significant role in processing unstructured data. There will be an urgent need for “recognition algorithms.” (We currently seem to be limited to image recognition, pattern recognition, and facial recognition.) As artificial intelligence evolves, it will be used to make working with unstructured data much easier.

    Author: Keith D. Foote

    Source: Dataversity
