The 4 steps of the big data life cycle
Simply put, from the perspective of the big data life cycle, there are four main stages:
- Big data collection
- Big data preprocessing
- Big data storage
- Big data analysis
Together, these four stages make up the core technologies of the big data life cycle.
Big data collection
Big data collection is the gathering of massive structured and unstructured data from a wide variety of sources.
Database collection: Sqoop and ETL tools are popular here, and traditional relational databases such as MySQL and Oracle still serve as the primary data stores for many enterprises. Open-source tools such as Kettle and Talend also build in big data integration features, enabling data synchronization and integration between HDFS, HBase, and mainstream NoSQL databases.
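As a rough illustration of what such a sync job does, the following Python sketch pulls one table out of MySQL and lands it as a columnar file. The connection string, table, and file names are hypothetical, and a real Sqoop or Kettle job would add parallelism, scheduling, and error handling.

```python
# Minimal sketch of a database-to-data-lake sync, roughly what Sqoop or a
# Kettle/Talend job automates. Connection string and table name are
# hypothetical placeholders.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://user:password@db-host:3306/sales")

# Pull the relational table into memory (Sqoop would split this into
# parallel map tasks instead of a single query).
orders = pd.read_sql("SELECT * FROM orders WHERE updated_at >= '2023-01-01'", engine)

# Land it as columnar Parquet, which HDFS-based tools consume directly.
orders.to_parquet("orders_2023.parquet", index=False)
```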
Network data collection: A collection method that uses web crawlers or public website APIs to obtain unstructured or semi-structured data from web pages and unify it into structured local records.
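A minimal crawler sketch in Python, using requests and BeautifulSoup, might look like the following. The URL, CSS selectors, and output file are hypothetical, and a production crawler would add politeness delays, retries, and robots.txt handling.

```python
# Minimal crawler sketch: fetch a page, extract semi-structured fields, and
# store them as uniform local records.
import csv
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com/products", timeout=10)
soup = BeautifulSoup(resp.text, "html.parser")

rows = []
for item in soup.select("div.product"):          # assumed page structure
    rows.append({
        "name": item.select_one("h2").get_text(strip=True),
        "price": item.select_one("span.price").get_text(strip=True),
    })

with open("products.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(rows)
```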
File collection: Includes real-time file collection and processing with Flume, log collection based on the ELK stack, incremental collection, and so on.
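The incremental part can be illustrated with a small Python sketch that remembers a byte offset between runs, which is roughly the pattern agents such as Flume or Filebeat implement; the paths here are hypothetical.

```python
# Minimal sketch of incremental log collection: remember how far the file
# was read and only ship the new lines on the next pass.
import os

LOG_PATH = "/var/log/app/access.log"      # hypothetical log file
OFFSET_PATH = "access.log.offset"         # where the last read position is kept

def collect_new_lines():
    offset = 0
    if os.path.exists(OFFSET_PATH):
        with open(OFFSET_PATH) as f:
            offset = int(f.read().strip() or 0)

    with open(LOG_PATH) as f:
        f.seek(offset)
        new_lines = f.readlines()         # lines appended since the last run
        new_offset = f.tell()

    with open(OFFSET_PATH, "w") as f:
        f.write(str(new_offset))

    return new_lines                      # hand these to the downstream sink
```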
Big data preprocessing
Big data preprocessing refers to operations such as cleaning, filling, smoothing, merging, normalization, and consistency checking performed on the collected raw data before analysis, in order to improve data quality and lay the foundation for later analysis work. Data preprocessing mainly includes four parts:
- Data cleaning
- Data integration
- Data transformation
- Data reduction
Data cleaning refers to the use of cleaning tools, such as ETL software, to deal with missing data (records lacking attributes of interest), noisy data (errors, or values that deviate from what is expected), and inconsistent data.
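As a simple illustration, the following pandas sketch handles each of those three problem types on a hypothetical orders file; the column names and rules are assumptions.

```python
# Minimal cleaning sketch with pandas, covering missing, noisy, and
# inconsistent data on an invented orders table.
import pandas as pd

df = pd.read_csv("raw_orders.csv")

# Missing data: fill absent quantities with the column median.
df["quantity"] = df["quantity"].fillna(df["quantity"].median())

# Noisy data: clip prices that deviate wildly from expected values.
low, high = df["price"].quantile([0.01, 0.99])
df["price"] = df["price"].clip(lower=low, upper=high)

# Inconsistent data: normalize country codes and drop exact duplicates.
df["country"] = df["country"].str.strip().str.upper()
df = df.drop_duplicates()

df.to_csv("clean_orders.csv", index=False)
```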
Data integration refers to consolidating data from different sources and storing it in a unified database. It focuses on solving three problems: schema matching, data redundancy, and the detection and resolution of data value conflicts.
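A minimal integration sketch, assuming two hypothetical customer extracts with mismatched schemas and a "newest record wins" conflict rule, could look like this:

```python
# Minimal integration sketch: two sources describe the same customers with
# different schemas and conflicting values. Column names and the conflict
# rule are assumptions for illustration.
import pandas as pd

crm = pd.read_csv("crm_customers.csv")      # columns: cust_id, email, updated_at
shop = pd.read_csv("shop_customers.csv")    # columns: customer_id, mail, updated_at

# Schema matching: map both sources onto one canonical schema.
shop = shop.rename(columns={"customer_id": "cust_id", "mail": "email"})

# Redundancy and value conflicts: keep the most recently updated record per key.
merged = pd.concat([crm, shop], ignore_index=True)
merged = (merged.sort_values("updated_at")
                .drop_duplicates(subset="cust_id", keep="last"))

merged.to_csv("customers_unified.csv", index=False)
```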
Data transformation refers to handling inconsistencies in the extracted data. It also involves cleaning work, that is, correcting or removing abnormal data according to business rules to ensure the accuracy of subsequent analysis results.
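For illustration, the following pandas sketch unifies a few common inconsistencies (mixed date formats, divergent status labels, mixed units); the columns and mappings are assumptions.

```python
# Minimal transformation sketch: bring inconsistent formats from different
# extracts into one convention on an invented events table.
import pandas as pd

df = pd.read_csv("extracted_events.csv")

# Unify date strings of mixed formats into a single datetime type.
df["event_date"] = pd.to_datetime(df["event_date"], errors="coerce")

# Map inconsistent status labels onto canonical codes.
status_map = {"ok": "SUCCESS", "success": "SUCCESS", "fail": "FAILED", "error": "FAILED"}
df["status"] = df["status"].str.lower().map(status_map)

# Convert amounts recorded in cents to the standard currency unit.
df.loc[df["amount_unit"] == "cents", "amount"] /= 100
```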
Data reduction refers to minimizing the volume of data to obtain a smaller data set while preserving the character of the original data as much as possible. Techniques include data cube aggregation, dimensionality reduction, data compression, numerosity reduction, concept hierarchy generation, and so on.
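The sketch below illustrates three of these techniques (aggregation, numerosity reduction via sampling, and dimensionality reduction with PCA) on a hypothetical clickstream table; pandas and scikit-learn are assumed, and the column names are invented.

```python
# Minimal reduction sketch: shrink a large event table while preserving
# its overall shape.
import pandas as pd
from sklearn.decomposition import PCA

events = pd.read_parquet("clickstream.parquet")   # assumes "ts" is a datetime column

# Data aggregation: collapse raw clicks to daily counts per user.
daily = (events.groupby(["user_id", events["ts"].dt.date])
               .size()
               .reset_index(name="clicks"))

# Numerosity reduction: keep a 10 % random sample instead of every row.
sample = events.sample(frac=0.10, random_state=42)

# Dimensionality reduction: project the numeric features onto 2 components.
numeric = events.select_dtypes("number")
compact = PCA(n_components=2).fit_transform(numeric.fillna(0))
```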
Big data storage
Big data storage refers to the process of persisting the collected data in database form on storage systems, and it follows three typical routes:
New database clusters based on MPP architecture: These use a shared-nothing architecture combined with the efficient distributed computing model of MPP (massively parallel processing), and apply big data techniques such as columnar storage and coarse-grained indexing. They focus on storage for industry big data and, with their low cost, high performance, and high scalability, are widely used in enterprise analytics applications.
Compared with traditional databases, MPP products have a significant advantage in PB-scale data analysis, which is why MPP databases have become the preferred choice for the new generation of enterprise data warehouses.
Technology extension and encapsulation based on Hadoop: This route targets data and scenarios that are difficult to handle with traditional relational databases, such as the storage and computation of unstructured data. It leverages the open-source advantages of Hadoop and its strengths in handling unstructured and semi-structured data, complex ETL processes, and complex data mining and computation models to derive the related big data technologies.
As the technology advances, its application scenarios continue to expand. The most typical scenario today is supporting Internet-scale big data storage and analysis by extending and encapsulating Hadoop, which involves dozens of NoSQL technologies.
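As a rough sketch of this route, the following PySpark snippet reads semi-structured JSON logs from HDFS and runs a distributed aggregation. The paths and log layout are hypothetical, and PySpark is only one representative tool from the Hadoop ecosystem.

```python
# Minimal PySpark sketch of the "extend and encapsulate Hadoop" pattern:
# read semi-structured logs from HDFS and run a distributed aggregation.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("log-summary").getOrCreate()

logs = spark.read.json("hdfs:///data/raw/app_logs/")   # semi-structured JSON lines

summary = (logs.groupBy("level")
               .agg(F.count("*").alias("events"))
               .orderBy(F.desc("events")))

summary.write.mode("overwrite").parquet("hdfs:///data/marts/log_levels/")
```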
Big data all-in-one machines: These are combinations of hardware and software designed specifically for big data analysis and processing. An all-in-one machine consists of a set of integrated servers, storage devices, operating systems, and database management systems, plus pre-installed and pre-optimized software for data query, processing, and analysis. It offers good stability and vertical scalability.
Big data analysis and mining
This is the process of extracting, refining, and analyzing the otherwise chaotic data through visual analysis, data mining algorithms, predictive analysis, semantic engines, data quality management, and other techniques.
Visual analysis: An analysis method that conveys and communicates information clearly and effectively through graphical means. It is mainly used for association analysis over massive data: with the help of a visual analysis platform, dispersed, heterogeneous data is correlated and presented as complete analysis charts. It is simple, clear, intuitive, and easy to understand.
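A minimal example of the idea, using matplotlib on a hypothetical metrics file, is shown below; real visual analysis platforms build interactive dashboards rather than single static charts.

```python
# Minimal visual-analysis sketch: plot two related measures from a joined
# dataset to make their relationship visible. File and columns are invented.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("customers_unified_metrics.csv")

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(df["visits"], df["revenue"], alpha=0.4)
ax.set_xlabel("Monthly visits")
ax.set_ylabel("Monthly revenue")
ax.set_title("Revenue vs. engagement")
fig.tight_layout()
fig.savefig("revenue_vs_visits.png", dpi=150)
```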
Data mining algorithms: Data mining algorithms are analysis methods that test and compute on data by building mining models; they are the theoretical core of big data analysis.
There are many data mining algorithms, and different algorithms suit different data types and formats and reveal different characteristics of the data. Generally speaking, though, the model-building process is similar: first analyze the data provided by the user, then search for specific types of patterns and trends, use those results to define the best parameters for the mining model, and finally apply these parameters to the entire data set to extract actionable patterns and detailed statistics.
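The loop described above can be sketched with scikit-learn: explore candidate parameters on the data, pick the best one, then fit on the full dataset. The feature file and the choice of K-means are assumptions for illustration.

```python
# Minimal sketch of the model-building loop: explore the data, pick a
# reasonable parameter, then fit on the entire dataset.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = pd.read_csv("customer_features.csv").fillna(0)   # hypothetical feature table

# Search for a good number of clusters on the data.
best_k, best_score = 2, -1.0
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)
    if score > best_score:
        best_k, best_score = k, score

# Apply the chosen parameter to the whole dataset to extract segments.
model = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit(X)
segments = model.labels_
```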
Data quality management: The identification, measurement, monitoring, and early warning of the data quality problems that can arise at each stage of the data life cycle (planning, acquisition, storage, sharing, maintenance, application, retirement, and so on); it is the series of management activities carried out to improve data quality.
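A toy Python sketch of the measuring-and-warning part might look like this; the indicators and the 1 % tolerance are assumptions.

```python
# Minimal data-quality check sketch: measure a few quality indicators and
# raise a warning when they cross a threshold.
import warnings
import pandas as pd

df = pd.read_csv("clean_orders.csv")       # hypothetical cleaned table

checks = {
    "missing_email_rate": df["email"].isna().mean(),
    "duplicate_id_rate": df["order_id"].duplicated().mean(),
    "negative_amount_rate": (df["amount"] < 0).mean(),
}

for name, value in checks.items():
    if value > 0.01:                       # 1 % tolerance, an assumed threshold
        warnings.warn(f"Data quality check failed: {name} = {value:.2%}")
```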
Predictive analysis: Predictive analysis is one of the most important application areas of big data analysis. It combines a variety of advanced analysis functions (specialized statistical analysis, predictive modeling, data mining, text analytics, entity analytics, optimization, real-time scoring, machine learning, etc.) to predict uncertain events.
It helps users analyze trends, patterns, and relationships in structured and unstructured data, and uses these indicators to predict future events and provide a basis for action.
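As a minimal illustration, the following scikit-learn sketch trains a classifier on historical outcomes and scores current records; the dataset, features, target column, and choice of model are all hypothetical.

```python
# Minimal predictive-analysis sketch: fit a model on historical outcomes
# and score new records.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

df = pd.read_csv("customer_history.csv")              # hypothetical history table
X = df.drop(columns=["churned"]).select_dtypes("number").fillna(0)
y = df["churned"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = GradientBoostingClassifier().fit(X_train, y_train)
print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

# Score current customers to flag those likely to churn next.
df["churn_risk"] = model.predict_proba(X)[:, 1]
```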
Semantic engine: A semantic engine adds semantics to existing data in order to improve users’ Internet search experience.
Author: Sajjad Hussain
Source: Medium