7 items tagged "SQL"

  • 6 Basic Security Concerns for SQL Databases

    Consider these scenarios: A low-level IT systems engineer spills soda, which takes down a bank of servers; a warehouse fire burns all of the patient records of a well-regarded medical firm; a government division’s entire website vanishes without a trace. Data breaches and failures are not isolated incidents. According to the 2014 Verizon Data Breach Investigations Report, databases are one of the most critical vulnerability points in corporate data assets. Databases are targeted because their information is so valuable, and many organizations are not taking the proper steps to ensure data protection.

    • Only 5 percent of billions of dollars allocated to security products is used for security in data centers, according to a report from International Data Corporation (IDC).
    • In a July 2011 survey of employees at organizations with multiple computers connected to the Internet, almost half said they had lost or deleted data by accident.
    • According to Fortune magazine, corporate CEOs are not making data security a priority, seemingly deciding that they will handle a data problem if it actually happens.

    You might think CEOs would be more concerned, even if it is just for their own survival. A 2013 data breach at Target was widely considered to be an important contributing factor in the ouster of Gregg Steinhafel, then company president, CEO and chairman of the board. The Target breach affected more than 40 million debit and credit card accounts at the retailing giant. Stolen data included customers’ names, their associated card numbers, security codes and expiration dates.
    Although the threats to corporate database security have never been more sophisticated and organized, taking necessary steps and implementing accepted best practices will decrease the chances of a data breach, or other database security crisis, taking place at your organization.

    6 Basic Security Concerns

    If you are new to database administration, you may not be familiar with the basic steps you can take to improve database security. Here are the first moves you should make:

    1. The physical environment. One of the most-often overlooked steps in increasing database security is locking down the physical environment. While most security threats are, in fact, at the network level, the physical environment presents opportunities for bad actors to compromise physical devices. Unhappy employees can abscond with company records, health information or credit data. To protect the physical environment, start by implementing and maintaining strict security measures that are detailed and updated on a regular basis. Severely limit access to physical devices to only a short list of employees who must have access as part of their job. Strive to educate employees and systems technicians about maintaining good security habits while operating company laptops, hard drives, and desktop computers. Lackadaisical security habits by employees can make them an easy target.


    2. Network security. Database administrators should assess weak points in their network and how company databases connect to it. Updated antivirus software running on the network is an essential item. Also, ensure that secure firewalls are implemented on every server. Consider changing TCP/IP ports from the defaults, as the standard ports are well-known access points for hackers and Trojan horses.


    3. Server environment. Information in a database can appear in other areas, such as log files, depending on the nature of the operating system and database application. Because the data can appear in different areas in the server environment, you should check that every folder and file on the system is protected. Limit access as much as possible, only allowing the people who absolutely need permission to get that information. This applies to the physical machine as well. Do not provide users with elevated access when they only need lower-level permissions.


    4. Avoid over-deployment of features. Modern databases and related software have some services designed to make the database faster, more efficient and secure. At the same time, software application companies are in a very competitive field, essentially a mini arms race to provide better functionality every year. The result is that you may have deployed more services and features than you will realistically use. Review each feature that you have in place, and turn off any service that is not really needed. Doing so cuts down the number of areas or “fronts” where hackers can attack your database.


    5. Patch the system. Just like a personal computer operating system, databases must be updated on a continuing basis. Vendors constantly release patches, service packs and security updates. These are only good if you implement them right away. Here is a cautionary tale: In 2003, a computer worm called SQL Slammer was able to penetrate tens of thousands of computer servers within minutes of its release. The worm exploited a buffer overflow vulnerability in Microsoft's SQL Server and Desktop Engine. A patch that fixed the vulnerability had been released the previous summer, but many companies that became infected had never patched their servers.


    6. Encrypt sensitive data. Although back-end databases might seem to be more secure than components that interface with end users, the data must still be accessed through the network, which increases its risk. Encryption cannot stop malicious hackers from attempting to access data. However, it does provide another layer of security for sensitive information such as credit card numbers.
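
    As a rough sketch of what column-level encryption can look like in practice, the example below uses PostgreSQL with the pgcrypto extension; the customers table, column names, and key handling are hypothetical and shown for illustration only.

      -- Store card numbers encrypted rather than in plain text (PostgreSQL + pgcrypto)
      CREATE EXTENSION IF NOT EXISTS pgcrypto;

      ALTER TABLE customers ADD COLUMN card_number_enc bytea;

      UPDATE customers
      SET card_number_enc = pgp_sym_encrypt(card_number, 'replace-with-a-managed-key');

      -- Decrypt only for authorized callers that actually need the value
      SELECT pgp_sym_decrypt(card_number_enc, 'replace-with-a-managed-key') AS card_number
      FROM customers
      WHERE customer_id = 42;

    In a real deployment the key would live in a key management system, not in the SQL text itself.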

    Famous Data Breaches

    Is all this overblown? Maybe stories of catastrophic database breaches are ghost stories, conjured up by senior IT managers to force implementation of inconvenient security procedures. Sadly, data breaches happen on a regular basis to small and large organizations alike. Here are some examples:

    • TJX Companies. In December 2006, TJX Companies, Inc., failed to protect its IT systems with a proper firewall. A group led by high-profile hacker Albert Gonzalez gained access to more than 90 million credit cards. He was convicted of the crime and sentenced to spend over 40 years in prison. Eleven other people were arrested in relation to the breach.
    • Department of Veterans Affairs. A database containing names, dates of birth, types of disability and Social Security numbers of more than 26 million veterans was stolen from an unencrypted database at the Department of Veterans Affairs. Leaders in the organization estimated that it would cost between $100 million and $500 million to cover damages resulting from the theft. This is an excellent example of human error being the softest point in the security profile. An external hard drive and laptop were stolen from the home of an analyst who worked at the department. Although the theft was reported to local police promptly, the head of the department was not notified until two weeks later. He informed federal authorities right away, but the department did not make any public statement until several days had gone by. Incredibly, an unidentified person returned the stolen data in late June 2006.
    • Sony PlayStation Network. In April 2011, more than 75 million PlayStation Network accounts were compromised. The popular service was down for weeks, and industry experts estimate the company lost millions of dollars. It is still considered by many as the worst breach of a multiplayer gaming network in history. To this day, the company says it has not determined who the attackers were. The hackers were able to get gamers' names, email addresses, passwords, buying histories, home addresses and credit card numbers. Because Sony is a technology company, the breach was even more surprising and concerning. Consumers began to wonder: If it could happen to Sony, was their data safe at other big companies?
    • Gawker Media. Hackers breached Gawker Media, parent company of the popular gossip site Gawker.com, in December 2010. The passwords and email addresses of more than one million users of Gawker Media properties like Gawker, Gizmodo, and Lifehacker, were compromised. The company made basic security mistakes, including storing passwords in a format hackers could easily crack.

    Take These Steps

    In summary, basic database security is not especially difficult but requires constant vigilance and consistent effort. Here is a snapshot review:

    • Secure the physical environment.
    • Strengthen network security.
    • Limit access to the server.
    • Cut back or eliminate unneeded features.
    • Apply patches and updates immediately.
    • Encrypt sensitive data such as credit cards, bank statements, and passwords.
    • Document baseline configurations, and ensure all database administrators follow the policies.
    • Encrypt all communications between the database and applications, especially Web-based programs.
    • Match internal patch cycles to vendor release patterns.
    • Make consistent backups of critical data, and protect the backup files with database encryption.
    • Create an action plan to implement if data is lost or stolen. In the current computing environment, it is better to think in terms of when this could happen, not if it will happen.

    Basic database security seems logical and obvious. However, the repeated occurrences of major and minor data breaches in organizations of all sizes indicate that company leadership, IT personnel, and database administrators are not doing all they can to implement consistent database security principles.
    The cost to do otherwise is too great. Increasingly, corporate America is turning to cloud-based enterprise software. Many of today’s popular applications like Facebook, Google and Amazon rely on advanced databases and high-level computer languages to handle millions of customers accessing their information at the same time. In our next article, we take a closer look at advanced database security methods that these companies and other forward-thinking organizations use to protect their data and prevent hackers, crackers, and thieves from making off with millions of dollars worth of information.

    Source: Sys-con Media

  • Becoming a better data scientist by improving your SQL skills

    Becoming a better data scientist by improving your SQL skills

    Learning advanced SQL skills can help data scientists effectively query their databases and unlock new insights into data relationships, resulting in more useful information.

    The skills people most often associate with data scientists are usually those "hard" technical and math skills, including statistics, probability, linear algebra, algorithm knowledge and data visualization.  They need to understand how to work with structured and unstructured data stores and use machine learning and analytics programs to extract valuable information from these stores.

    Data scientists also need to possess "soft" skills such as business and domain process knowledge, problem solving, communication and collaboration.

    These skills, combined with advanced SQL abilities, enable data scientists to extract value, information and insight from data.

    In order to unlock the full value from data, data scientists need to have a collection of tools for dealing with structured information. Many organizations still operate and rely heavily on structured enterprise data stores, data warehouses and databases. Having advanced skills to extract, manipulate and transform this data can really set data scientists apart from the pack.

    Advanced vs. beginner SQL skills for data scientists

    The common tool and language for interacting with structured data stores is the Structured Query Language (SQL), a standard, widely adopted syntax for data stores that contain schemas that define the structure of their information. SQL allows the user to query, manipulate, edit, update and retrieve data from data sources, including the relational database, an omnipresent feature of modern enterprises.

    Relational databases that utilize SQL are popular within organizations, so data scientists should have SQL knowledge at both the basic and advanced levels.

    Basic SQL skills include knowing how to extract information from data tables as well as how to insert and update those records.

    Because relational databases are often large with many columns and millions of rows, data scientists won't want to pull the entire database for most queries but rather extract only the information needed from a table. As a result, data scientists will need to know at a fundamental level how to apply conditional filters to filter and extract only the data they need.
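
    For illustration, a minimal sketch of these basics against a hypothetical orders table:

      -- Extract only the rows and columns you need
      SELECT order_id, customer_id, total
      FROM orders
      WHERE order_date >= '2021-01-01' AND status = 'shipped';

      -- Insert and update individual records
      INSERT INTO orders (order_id, customer_id, total, status)
      VALUES (1001, 42, 99.50, 'pending');

      UPDATE orders SET status = 'shipped' WHERE order_id = 1001;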

    For most cases, the data that analysts need to work with will not live in just one database, and certainly not in a single table in that database.

    It's not uncommon for organizations to have hundreds or thousands of tables spread across hundreds or thousands of databases that were created by different groups and at different periods. Data scientists need to know how to join these multiple tables and databases together, making it easier to analyze different data sets.

    So, data scientists need to have deep knowledge of JOIN and SELECT operations in SQL as well as their impact on overall query performance.
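
    A minimal sketch of such a join, using hypothetical customers and orders tables and selecting only the columns needed:

      -- Combine customer attributes with their orders
      SELECT c.customer_id, c.region, o.order_id, o.total
      FROM customers AS c
      INNER JOIN orders AS o
        ON o.customer_id = c.customer_id
      WHERE o.order_date >= '2021-01-01';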

    However, to address more complex data analytics needs, data scientists need to move beyond these basic skills and gain advanced SQL skills to enable a wider range of analytic abilities. These advanced skills enable data scientists to work more quickly and efficiently with structured databases without having to rely on data engineering team members or groups.

    Understanding advanced SQL skills can help data scientists stand out to potential employers or shine internally.

    Types of advanced SQL skills data scientists need to know

    Advanced SQL skills often mean working with information distributed across multiple stores, and efficiently querying and combining that data for specific analytic purposes.

    Some of these skills include the following:

    Advanced and nested subqueries. Subqueries and nested queries are important for combining and linking data between different sources. Combined with advanced JOIN operations, subqueries can be faster and more efficient than basic JOINs or separate queries because they eliminate extra steps in data extraction.
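
    As an illustration (hypothetical table names), a subquery that filters one result set by an aggregate computed in another:

      -- Customers whose lifetime spend is above the overall average
      SELECT customer_id, lifetime_spend
      FROM customer_totals
      WHERE lifetime_spend > (SELECT AVG(lifetime_spend) FROM customer_totals);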

    Common table expressions. Common table expressions allow you to create a temporary table that enables temporary storage while working on large query operations. Multiple subqueries can complicate things, so table expressions help you break down your code into smaller chunks, making it easier to make sense of everything. 
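
    A small sketch of a common table expression (PostgreSQL-style DATE_TRUNC; the orders table is hypothetical):

      -- Break a larger query into named steps
      WITH monthly_sales AS (
        SELECT customer_id, DATE_TRUNC('month', order_date) AS month, SUM(total) AS sales
        FROM orders
        GROUP BY customer_id, DATE_TRUNC('month', order_date)
      )
      SELECT month, AVG(sales) AS avg_sales_per_customer
      FROM monthly_sales
      GROUP BY month
      ORDER BY month;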

    Efficient use of indexes. Indexes keep relational databases functioning effectively by preparing the system for expected queries and optimizing for them. Efficient use of indexes can greatly speed up performance, making data easier and faster to find. Conversely, poor use of indexing can lead to high query times and slow performance, resulting in systems whose performance degrades badly when queried at scale.
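
    For example, a sketch of indexes supporting frequent lookups on a hypothetical orders table:

      -- Support joins and filters on customer_id
      CREATE INDEX idx_orders_customer_id ON orders (customer_id);

      -- A composite index for queries that filter on customer and date together
      CREATE INDEX idx_orders_customer_date ON orders (customer_id, order_date);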

    Advanced use of date and time operations. Knowing how to manipulate date and time can come in handy, especially when working with time-series data. Advanced date operations might require knowledge of date parsing, time formats, date and time ranges, time grouping, time sorting and other activities that involve the use of timestamps and date formatting.
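
    A brief sketch (PostgreSQL-style functions, hypothetical events table) of grouping time-stamped rows into daily buckets:

      -- Daily event counts within a date range
      SELECT DATE_TRUNC('day', created_at) AS day,
             COUNT(*) AS events
      FROM events
      WHERE created_at >= '2021-01-01' AND created_at < '2022-01-01'
      GROUP BY DATE_TRUNC('day', created_at)
      ORDER BY day;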

    Delta values. For many reasons, you may want to compare values from different periods. For example, you might want to evaluate sales from this month versus last month or sales from December this year versus December last year. You can find the difference between these numbers by running delta queries to uncover insights or trends you may not have seen otherwise.
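
    One way to compute such deltas is with a window function; a sketch against a hypothetical monthly_sales table:

      -- Month-over-month sales difference
      SELECT month,
             sales,
             sales - LAG(sales) OVER (ORDER BY month) AS delta_vs_previous_month
      FROM monthly_sales
      ORDER BY month;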

    Ranking and sorting methods. Being able to rank and sort rows or values is necessary to help uncover key insights from data. Data analytics requirements might include ranking data by number of products or units sold, top items viewed, or top sources of purchases. Knowing advanced methods for ranking and sorting can optimize overall query time and provide accurate results.
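
    A short sketch of ranking with a window function (hypothetical product_sales table):

      -- Rank products by units sold within each category
      SELECT category,
             product_id,
             units_sold,
             RANK() OVER (PARTITION BY category ORDER BY units_sold DESC) AS sales_rank
      FROM product_sales;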

    Query optimization. Effective data analysts spend time not only formulating queries but optimizing them for performance. This skill is incredibly important once databases grow past a certain size or are distributed across multiple sources. Knowing how to deal with complex queries and generate valuable results promptly with optimal performance is a key skill for effective data scientists.
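
    A typical starting point is inspecting the query plan, for example with PostgreSQL's EXPLAIN ANALYZE (table names hypothetical):

      -- Compare plans before and after adding an index or rewriting a join
      EXPLAIN ANALYZE
      SELECT c.region, SUM(o.total)
      FROM customers AS c
      JOIN orders AS o ON o.customer_id = c.customer_id
      GROUP BY c.region;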

    The value of advanced SQL skills

    The main purpose of data science is to help organizations derive value by finding information needles in data haystacks. Data scientists need to be masters at filtering, sorting and summarizing data to provide this value. Advanced SQL skills are core to providing this ability.

    Organizations are always looking to find data science unicorns who have all the skills they want and more. Knowing different ways to shape data for targeted analysis is incredibly desirable.

    For many decades, companies have stored valuable information in relational databases, including transactional data and customer data. Feeling comfortable finding, manipulating, extracting, joining or adding data to these databases will give data scientists a leg up on creating value from this data.

    As with any skill, learning advanced SQL skills will take time and practice to master. However, enterprises provide many opportunities for data scientists and data analysts to master those skills and provide more value to the organization with real-life data and business problems to solve.

    Author: Kathleen Walch

    Source: TechTarget

  • Data wrangling in SQL: 5 recommended methods

    Data wrangling in SQL: 5 recommended methods

    Data wrangling is an essential job function for data engineering, data science, or machine learning roles. As knowledgeable coders, many of these professionals can rely on their programming skills and help from libraries like Pandas to wrangle data. However, it can often be optimal to manipulate data directly at the source with SQL scripts. Here are some data wrangling techniques every data expert should know.

    Note: for the purpose of this blog post I will not go into the details of the many versions of SQL. Much of the syntax mentioned conforms to the SQL-92 standard, but where there are platform-specific clauses and functions, I will try to point them out (PostgreSQL, MySQL, T-SQL, PL/SQL, etc.), though not extensively. Remember, however, that it's common to implement a missing function directly in your favorite flavor of SQL. For example, the correlation coefficient function, CORR, is present in PostgreSQL and PL/SQL but not MySQL, and can be implemented directly or imported from an external library. To keep this blog brief, I don't specify the implementation of these techniques, but they are widely available and will also be covered in my upcoming virtual training session, where I'll go into further detail on data wrangling techniques and how to wrangle data.

    1. Data Profiling with SQL

    Data profiling is often the first step in any data science or machine learning project and an important preliminary step to explore the data prior to data preparation or transformation. The goal is straightforward: determine whether the data is accurate, reliable, and representative enough to build well-performing algorithms, since the resulting models are only as good as the data they are built on.

    Core SQL Skills for Data Profiling

    • Descriptive statistics: MIN, MAX, AVG, MEDIAN, STDDEV (PostgreSQL), STDEV/STDEVP (Transact-SQL)
    • Correlation functions: CORR (PostgreSQL)
    • Aggregation functions: COUNT, SUM, GROUP BY

    Data profiling starts with some basic descriptive statistic functions such as MIN, MAX, and STDDEV to help you understand the distribution and range of your data. Even basic aggregation with COUNT and GROUP BY is useful. From there, standard deviation functions such as STDEV/STDEVP can tell you a lot about your data (highs, lows) and its distribution. For a quick look at correlations across features and the target feature, the CORR correlation coefficient function can give some quick insight.
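
    As a quick sketch of what such a profile can look like (PostgreSQL syntax; the sales table and its columns are hypothetical, and CORR is not available in MySQL):

      -- Basic profile of a numeric column
      SELECT COUNT(*)                AS n_rows,
             MIN(amount)             AS min_amount,
             MAX(amount)             AS max_amount,
             AVG(amount)             AS mean_amount,
             STDDEV(amount)          AS sd_amount,
             CORR(amount, discount)  AS corr_amount_discount
      FROM sales;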

    2. Detecting Outliers with SQL

    An essential part of data profiling is identifying outliers, and SQL can be a quick and easy way to find extreme values. You can get quite far with the below simple techniques and employing basic sorting techniques.

    Core SQL Skills for Detecting Outliers 

    • Sorting & filtering: ORDER BY, CASE, WHERE, HAVING
    • Descriptive statistics: AVG, MEDIAN
    • Winsorization: PERCENTILE_CONT, PERCENT_RANK (T-SQL), subqueries, VAR (T-SQL)
    • Fixing outliers: UPDATE, NULL, AVG

    Simple sorting and ordering using ORDER BY will produce quick results, and you can classify further using CASE statements, grouping, and summation. More sophisticated techniques involve various forms of Winsorization and variance functions such as VAR (T-SQL). To identify outliers you can use percentile calculations, and VAR (T-SQL) can be used to measure the statistical variance.

    There are various techniques for handling outliers, from ignoring them to removing (trimming) them. Replacement techniques such as Winsorization essentially change extreme values in the dataset to less extreme values close to the remaining observations. This typically requires us to calculate percentiles. We can use UPDATE with PERCENTILE_CONT (present in PostgreSQL, T-SQL, and PL/SQL) to, for example, set all observations greater than the 95th percentile equal to the value at the 95th percentile and all observations less than the 5th percentile equal to the value at the 5th percentile.
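
    A sketch of that approach in PostgreSQL syntax, using a hypothetical sales table and its amount column:

      -- Winsorize: cap values below the 5th and above the 95th percentile
      WITH bounds AS (
        SELECT PERCENTILE_CONT(0.05) WITHIN GROUP (ORDER BY amount) AS p05,
               PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY amount) AS p95
        FROM sales
      )
      UPDATE sales
      SET amount = LEAST(GREATEST(amount, bounds.p05), bounds.p95)
      FROM bounds
      WHERE amount < bounds.p05 OR amount > bounds.p95;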

    3. Time Series, ReIndexing, and Pivoting with SQL

    Time-series databases such as Prometheus, TDengine, and InfluxDB are quite popular, but more often than not you may be dealing with a standard relational database, where pivoting, reindexing, or transforming data into time series can be a challenge.

    Core SQL Skills for Time Series, Reindexing, and Pivoting with SQL

    • Window functions: LAG, OVER (PostgreSQL, T-SQL, MySQL)
    • Time-series generation and pivoting: PIVOT (T-SQL), CROSSTAB (PostgreSQL), CASE
    • Data trends: COALESCE

    For time series and data realignment, window functions perform calculations on a set of related rows and are thus perfect for this type of data transformation. For example, if we have a set of time-stamped data and we want the difference at each step, we can use the LAG function. The LAG() function provides access to the row that comes before the current row. This works because window operations like LAG() do not collapse groups of query rows into a single output row; instead, they produce a result for each row. The OVER clause specifies how query rows are partitioned into groups for processing by the window function:
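
    A minimal sketch of LAG() with OVER, against a hypothetical sensor_readings table (works in PostgreSQL, T-SQL, and MySQL 8+):

      -- Difference between each reading and the previous one
      SELECT reading_time,
             reading_value,
             reading_value - LAG(reading_value) OVER (ORDER BY reading_time) AS step_change
      FROM sensor_readings;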

    Reindexing (not to be confused with database indexing), or aligning datasets, is another key technique. For example, suppose you have sales from different time periods you'd like to compare. Using window functions and subqueries you can reset the data to a common starting point of zero, or align it on the same period units in time.

    Pivoting is a widely used spreadsheet technique among data wrangling techniques. Often when looking at periodic or timestamped data you get a single column of data you'd like to pivot into multiple columns. For example, pivoting a single column of monthly sales listed by rows into multiple columns of sales by month. PIVOT and CROSSTAB can be used, but for other SQL flavors that don't support those functions, you can implement the same idea with CASE statements.
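
    A portable sketch of pivoting with CASE (hypothetical monthly_sales table; PIVOT or CROSSTAB could express the same idea where supported):

      -- Turn monthly sales rows into columns
      SELECT product_id,
             SUM(CASE WHEN sales_month = 1 THEN amount ELSE 0 END) AS jan_sales,
             SUM(CASE WHEN sales_month = 2 THEN amount ELSE 0 END) AS feb_sales,
             SUM(CASE WHEN sales_month = 3 THEN amount ELSE 0 END) AS mar_sales
      FROM monthly_sales
      GROUP BY product_id;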

    4. Fake Data Generation with SQL

    There are a number of reasons to generate fake data, including:

    • Anonymizing production and sensitive data before porting it to other data stores
    • Generating more data that has similar characteristics to existing data for load testing, benchmarking, robustness testing, or scaling. 
    • Varying the distribution of your data to understand how your data science or ML model performs for bias, loss function, or other metrics

    Core SQL skills:

    • RANDOM, RAND
    • CREATE, UPDATE, SELECT, INSERT, and subqueries
    • JOIN, CROSS JOIN, MERGE, CAST
    • String manipulation functions: CONCAT, SUBSTR, TRIM, etc.

    SQL can be useful for generating fake data for load testing, benchmarking, or model training and testing. For numerical data, RANDOM, RAND(), and their variants are the classic random generators. Generate data for new tables (CREATE) or update existing ones (UPDATE); data selection and subqueries are generally a prerequisite. CROSS JOIN is a favorite because it lets you combine data (a cartesian join) to make new data sets, and MERGE can be used similarly. CAST is useful for converting variables into different data types. When generating text, string manipulation functions such as CONCAT, SUBSTR, and TRIM can be used to create new text variables.
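
    Two small sketches in PostgreSQL syntax (the tables are hypothetical): generating random numeric rows, and using CROSS JOIN to multiply small lookup tables into a larger set.

      -- Generate 100,000 fake order rows for load testing
      INSERT INTO load_test_orders (order_id, customer_id, total)
      SELECT gs.id,
             (random() * 1000)::int,
             round((random() * 500)::numeric, 2)
      FROM generate_series(1, 100000) AS gs(id);

      -- Every combination of customer and product (cartesian join)
      SELECT c.customer_id, p.product_id
      FROM customers AS c
      CROSS JOIN products AS p;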

    5. Loading Flat Files with SQL 

    This may look like a bonus-round data wrangling technique, but I've worked with many data experts who struggle to load data files directly into a database (the L in ETL). Why do this? For many research projects, data is not readily available in data lakes, APIs, and other data stores. Generally, it has to be collected in the wild, and more often than not in a flat-file format or multiple flat files.

    Core Skills required

    • SQL clients: understanding delimiters, text quoting, and special characters
    • Table creation: CREATE TABLE, TEMPORARY TABLE
    • Importing data: IMPORT, INSERT, SELECT INTO
    • Normalizing data with SQL clauses including SELECT, UPDATE, WHERE, JOIN, and subqueries
    • Transaction control: BEGIN, COMMIT, and ROLLBACK

    Importing CSV files is the most common case of ETL (Extraction, Transformation, Loading). You can use SQL for the all-important verification step. Simple verification includes checking for nulls, understanding the range and distribution of values imported, finding extreme values, and even simple count checks.

    However, loading flat-file data into a database is not as trivial as it seems, thanks to the need for specific delimiters. Most SQL clients allow you to specify custom delimiters. For text-heavy files, simple quote delimiters are not enough; double quotes and other escaping techniques must be employed. Also be careful to check for whitespace characters, including newline characters.

    Importing a single flat file like a CSV file is somewhat straightforward once the above steps are done. However, by nature flat files are not normalized, so you can use SQL to create a temporary table to load the data and then use filter and UPDATE clauses to port that data to the final destination tables. A common technique is to import the data into temporary tables in “raw” format and then normalize it into other tables. As always, employ transaction controls when inserting or updating data.
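
    A sketch of that flow in PostgreSQL (COPY reads a server-side file; many clients offer an equivalent client-side import). The file path and table layout are hypothetical.

      -- 1. Stage the raw CSV in a temporary table
      CREATE TEMPORARY TABLE raw_customers (
        customer_name text,
        email         text,
        signup_date   text
      );

      COPY raw_customers FROM '/tmp/customers.csv' WITH (FORMAT csv, HEADER true);

      -- 2. Normalize into the destination table inside a transaction
      BEGIN;
      INSERT INTO customers (customer_name, email, signup_date)
      SELECT TRIM(customer_name), LOWER(email), CAST(signup_date AS date)
      FROM raw_customers
      WHERE email IS NOT NULL;
      COMMIT;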

    Author: Seamus McGovern

    Source: Open Data Science

  • Hadoop engine benchmark: How Spark, Impala, Hive, and Presto compare

    AtScale recently performed benchmark tests on the Hadoop engines Spark, Impala, Hive, and Presto. Find out the results, and discover which option might be best for your enterprise.

    The global Hadoop market is expected to expand at an average compound annual growth rate (CAGR) of 26.3% between now and 2023, a testimony to how aggressively companies have been adopting this big data software framework for storing and processing the gargantuan files that characterize big data. But to turbo-charge this processing so that it performs faster, additional engine software is used in concert with Hadoop.

    AtScale, a business intelligence (BI) Hadoop solutions provider, periodically performs BI-on-Hadoop benchmarks that compare the performances of various Hadoop engines to determine which engine is best for which Hadoop processing scenario. The benchmark results assist systems professionals charged with managing big data operations as they make their engine choices for different types of Hadoop processing deployments.

    Recently, AtScale published a new survey that I discussed with Josh Klahr, AtScale's vice president of product management.

    "In this benchmark, we tested four different Hadoop engines," said Klahr. "The engines were Spark, Impala, Hive, and a newer entrant, Presto. We used the same cluster size for the benchmark that we had used in previous benchmarking."

    What AtScale found is that there was no clear engine winner in every case, but that some engines outperformed others depending on what the big data processing task involved. In one case, the benchmark looked at which Hadoop engine performed best when it came to processing large SQL data queries that involved big data joins.

    "There are companies out there that have six billion row tables that they have to join for a single SQL query," said Klahr. "The data architecture that these companies use include runtime filtering and pre-filtering of data based upon certain data specifications or parameters that end users input, and which also contribute to the processing load. In these cases, Spark and Impala performed very well. However, if it was a case of many concurrent users requiring access to the data, Presto processed more data."

    The AtScale benchmark also looked at which Hadoop engine had attained the greatest improvement in processing speed over the past six months.

    "The most noticeable gain that we saw was with Hive, especially in the process of performing SQL queries," said Klahr. "In the past six months, Hive has moved from release 1.4 to 2.1—and on an average, is now processing data 3.4 times faster."
     
    Other Hadoop engines also experienced processing performance gains over the past six months. Spark was processing data 2.4 times faster than it was six months ago, and Impala had improved its processing by 2.8 times. In all cases, better processing speeds were being delivered to users.

    "What we found is that all four of these engines are well suited to the Hadoop environment and deliver excellent performance to end users, but that some engines perform in certain processing contexts better than others," said Klahr. "For instance, if your organization must support many concurrent users of your data, Presto and Impala perform best. However, if you are looking for the greatest amount of stability in your Hadoop processing engine, Hive is the best choice. And if you are faced with billions of rows of data that you must combine in complicated data joins for SQL queries in your big data environment, Spark is the best performer."

    Klahr said that many sites seem to be relatively savvy about Hadoop performance and engine options, but that a majority really hadn't done much benchmarking when it came to using SQL.

    "The best news for users is that all of these engines perform capably with Hadoop," sad Klahr. "Now that we also have benchmark information on SQL performance, this further enables sites to make the engine choices that best suit their Hadoop processing scenarios."

    Source: techrepublic.com, October 29, 2016

  • Modern Information Management: Understanding Big Data at Rest and in Motion

    Big data is the buzzword of the century, it seems. But, why is everyone so obsessed with it? Here’s what it’s all about, how companies are gathering it, and how it’s stored and used.


    What is it?

    Big data is simply large data sets that need to be analyzed computationally in order to reveal patterns, associations, or trends. This data is usually collected by governments and businesses on citizens and customers, respectively.

    The IT industry has had to shift its focus to big data over the last few years because of the sheer amount of interest being generated by big business. By collecting massive amounts of data, companies, like Amazon.com, Google, Walmart, Target, and others, are able to track buying behaviors of specific customers.

    Once enough data is collected, these companies then use the data to help shape advertising initiatives. For example, Target has used its big data collection initiative to help target (no pun intended) its customers with products it thought would be most beneficial given their past purchases.

    How Companies Store and Use It

    There are two ways that companies can use big data. The first way is to use the data at rest. The second way is to use it in motion.

    At Rest Data – Data at rest refers to information that’s collected and analyzed after the fact. It tells businesses what’s already happened. The analysis is done separately and distinctly from any actions that are taken upon conclusion of said analysis.

    For example, if a retailer wanted to analyze the previous month's sales data, it would use data at rest to look over the previous month's sales totals. Then, it would take those sales totals and make strategic decisions about how to move forward given what's already happened.

    In essence, the company is using past data to guide future business activities. The data might drive the retailer to create new marketing initiatives, customize coupons, increase or decrease inventory, or to otherwise adjust merchandise pricing.

    Some companies might use this data to determine just how much of a discount is needed on promotions to spur sales growth.

    Some companies may use it to figure out how much they are able to discount in the spring and summer without creating a revenue problem later on in the year. Or, a company may use it to predict large sales events, like Black Friday or Cyber Monday.

    This type of data is batch processed since there’s no need to have the data instantly accessible or “streaming live.” There is a need, however, for storage of large amounts of data and for processing unstructured data. Companies often use a public cloud infrastructure due to the costs involved in storage and retrieval.

    Data In Motion – Data in motion refers to data that’s analyzed in real-time. Like data at rest, data may be captured at the point of sale, or at a contact point with a customer along the sales cycle. The difference between data in motion and data at rest is how the data is analyzed.

    Instead of batch processing and analysis after the fact, data in motion uses a bare metal cloud environment, because this type of infrastructure uses dedicated servers offering cloud-like features without virtualization.

    This allows for real-time processing of large amounts of data. Latency is also a concern for large companies because they need to be able to manage and use the data quickly. This is why many companies send their IT professionals to Simplilearn Hadoop admin training and then subsequently load them up on cloud-based training and other database training like NoSQL.


    Big Data For The Future

    Some awesome, and potentially frightening, uses for big data are on the horizon. For example, in February 2014, the Chicago Police Department sent uniformed officers to make notification visits to targeted individuals they had identified as potential criminals. They used a computer-generated list which gathered data about those individuals’ backgrounds.

    Another possible use for big data is development of hiring algorithms. More and more companies are trying to figure out ways to hire candidates without trusting slick resume writing skills. New algorithms may eliminate job prospects based on statistics, rather than skillsets, however. For example, some algorithms find that people with shorter commutes are more likely to stay in a job longer.

    So, people who have long commutes are filtered out of the hiring process quickly.

    Finally, some insurance companies might use big data to analyze your driving habits and adjust your insurance premium accordingly. That might sound nice if you’re a good driver, but insurers know that driving late at night increases the risk for getting into an accident. Problem is, poorer people tend to work late shifts and overnights or second jobs just to make ends meet. The people who are least able to afford insurance hikes may be the ones that have to pay them.

    Source: Mobilemag

  • The Bright Future of SQL

    The Bright Future of SQL

    Machine learning, big data analytics or AI may steal the headlines, but if you want to hone a smart, strategic skill that can elevate your career, look no further than SQL.

    Data engineering and data science are fast-moving, competitive fields. Technologies come and go, so keeping your skillset updated is something all ambitious data pros can agree on. Where data engineers and scientists disagree is exactly what skills will be most valuable in the future.

    Regardless of the bewildering array of tools and services available to data scientists, it is still humble SQL that forms the bedrock of a data scientist's stack. While SQL is usually seen as a baseline skill, it is, in fact, much more than that. Despite being almost 50 years old, SQL is becoming more, not less, relevant.

    Machine learning, big data analytics or AI may steal the headlines, but if you want to hone a smart, strategic skill that can elevate your career, look no further than SQL. Here’s why. 

    SQL Dominates Databases

    First off, “SQL is really the language of data” in the words of Benjamn Rogojan (aka Seattle Data Guy). This is down to the fact that the majority of databases are built on one or other of the SQL-based technologies. All but two of the top ten most popular databases today are based on SQL, the exceptions (MongoDB and Redis) are ranked fifth and sixth respectively and even they can be used with SQL. It’s easy to see why anyone who needs to query, update, change or in anyway engage with data in relational databases is going to be well-served with a solid working knowledge of SQL no matter what specialism they end up pursuing. 

    Demand for SQL Skills is High, and Growing

    Despite its age, SQL is far from a legacy skill. As data engineering has advanced into the cloud, so SQL has followed, and, according to Dataquest, SQL was the most in-demand skill among all jobs in data in 2021, especially at the more junior end of the spectrum. However, even more experienced data scientist job postings still list SQL in almost 60 percent of vacancies. What's more, doubtless due to surging demand for data-related expertise, demand for SQL skills appears to be growing, despite a brief dip in 2020. Pandemic aside, the SQL server transformation market, which helps businesses address the need for data transformation, is predicted to grow steadily at more than 10 percent CAGR until the end of the decade.

    Should Savvy Data Scientists Prioritize SQL?

    The future of SQL looks safe, but it does not necessarily follow that budding data scientists who already have a working knowledge of it will prioritize deepening their SQL skills to further their career progress. 

    They should. 

    With so many tools and nascent technologies to help them at the ELT / ETL stage, for BI and for both predictive and historic analytics, data scientists need to be savvy about where to plough their energy. The continually shrinking half-life of high-tech skills means that the tools and skills data scientists learn can be career-defining – or career-limiting. 

    How SQL is Taking Center Stage

    No one wants to spend six months figuring out a tool that only delivers half of what was expected of it, let alone recommending it to the wider team only to find that it underwhelms. So, when data scientists look at the services and techniques out there that will help them query their data more effectively, they are probably going to look at the best BI tools and ML extensions that will enable them to prep the data, create the model and then train it. But all these different stages take time and demand high levels of expertise. We've been conditioned into accepting that ML modelling requires the data to be extracted from the database, usually using a BI tool, transformed and loaded into the BI system, exported (again) to the ML tool, where the magic happens, and then transported back to the BI tool for visualization.

    What if I told you there is a way of taking ML models to the data, enabling you to query predictions from inside the database using, you guessed it, SQL? There is. It's part of a small but rapidly growing movement that brings intelligence into the data layer, rather than painstakingly taking the data to the ML tool.

    In-database Innovation

    In-database ML is a much simpler way to use existing data to predict future events…and it uses standard SQL commands. In-database ML is a bit like giving your database a brain. It means data scientists – and data engineers, and indeed anyone with SQL skills – can work within the database, running ML models to answer almost any business question. Predicting customer churn, credit scoring, customer lifecycle optimization, fraud detection, inventory management, price modelling, and predicting patient health outcomes are just a few of the many use cases in-database modelling has enabled. With this approach, all the ML models can be created, queried and maintained as if they were database tables, using the SQL language and bringing powerful predictive capabilities to a much wider range of data pros.  
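
    As a purely illustrative sketch, loosely modeled on the syntax of in-database ML tools such as MindsDB (exact keywords, model names, and tables vary by product and version, and are hypothetical here):

      -- Train a model on historical data, then query predictions like a table
      CREATE MODEL churn_predictor
      FROM my_warehouse (SELECT * FROM customer_history)
      PREDICT churned;

      SELECT customer_id, churned
      FROM churn_predictor
      WHERE plan = 'basic' AND tenure_months < 6;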

    In-database ML is a relatively new field, but it is one part of a wider, fast-growing movement to simplify and democratize data engineering and data science, breaking down the technical barriers that currently exist for those working with data. Take, for example, dbt Labs, a company that's taken the data world by storm, having recently secured $222 million of funding and been valued at $4.2 billion. Its data transformation product enables data engineers to build production-grade data pipelines from within the data warehouse using SQL commands, radically simplifying, speeding up, and scaling the process of prepping data.

    SQL – Not Old, But Evergreen

    We’re fortunate to be living in a golden age of digital innovation. However, against a business backdrop that prizes the insights data can offer, data scientists are under pressure like never before to produce miracles out of data. A dizzying range of tools and services has grown out of the need to speed up and scale data analytics. These tools often demand an investment in time and skill development to fully realize their benefits. However, one skill that has often been overlooked is humble SQL, the data scientist's best friend. SQL is not going anywhere; as the growing movement to innovate closer to the data shows, it is becoming the data scientist's strategic secret weapon.

    Author: Jorge Torres

    Source: KDnuggets

  • The difference between structured and unstructured data

    The difference between structured and unstructured data

    Structured data and unstructured data are both forms of data, but the first uses a single standardized format for storage, and the second does not. Structured data must be appropriately formatted (or reformatted) to provide a standardized data format before being stored, which is not a necessary step when storing unstructured data.

    The relational database provides an excellent example of how structured data is used and stored. The data is normally formatted into specific fields (for example, credit card numbers or addresses), allowing the data to be easily found using SQL.
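
    For instance, a minimal sketch against a hypothetical customers table with fixed fields:

      -- Fixed, typed fields make lookups straightforward in SQL
      SELECT customer_name, address
      FROM customers
      WHERE postal_code = '90210';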

    Non-relational databases, also called NoSQL, provide a way to work with unstructured data.

    Edgar F. Codd invented the relational database (RDBMS) in 1970, and it became popular during the 1980s. Relational databases allow users to access data and write queries in SQL (Structured Query Language). RDBMSs and SQL gave organizations the ability to analyze stored data on demand, providing a significant advantage over the competition of those times.

    Relational databases are user-friendly, and very, very efficient at maintaining accurate records. Regrettably, they are also quite rigid and cannot work with other languages or data formats.

    Unfortunately for relational databases, during the mid-1990s, the internet gained significantly in popularity, and the rigidity of relational databases could not handle the variety of languages and formats that became accessible. This made research difficult, and NoSQL was developed as a solution between 2007 and 2009. 

    A NoSQL database translates data written in different languages and formats efficiently and quickly and avoids the rigidity of SQL. Structured data is often stored in relational databases and data warehouses, while unstructured data is often stored in NoSQL databases and data lakes.

    For broad research, NoSQL databases working with unstructured data are often a better choice than relational databases because of their speed and flexibility.

    The Expanded Use of the Internet and Unstructured Data

    During the late 1980s, the low prices of hard disks, combined with the development of data warehouses, resulted in remarkably inexpensive data storage. This, in turn, resulted in organizations and individuals embracing the habit of storing all data gathered from customers, and all the data collected from the internet for research purposes. A data warehouse allows analysts to access research data more quickly and efficiently.

    Unlike a relational database, which is used for a variety of purposes, a data warehouse is specifically designed for a quick response to queries.

    Data warehouses can be cloud-based, or part of a business’s in-house mainframe server. They are compatible with SQL systems because by design, they rely on structured datasets. Generally speaking, data warehouses are not compatible with unstructured, or NoSQL, databases. Before the 2000s, businesses focused only on extracting and analyzing information from structured data. 

    The internet began to offer unique data analysis opportunities and data collections in the early 2000s. With the growth of web research and online shopping, businesses such as Amazon, Yahoo, and eBay began analyzing their customers' behavior, including such things as search logs, click rates, and IP-specific location data. This abruptly opened up a whole new world of research possibilities. The profits resulting from their research prompted other organizations to begin their own expanded business intelligence research.

    Data lakes came about as a way to deal with unstructured data in roughly 2015. Currently, data lakes can be set up both in-house and in the cloud (the cloud version eliminates in-house installation difficulties and costs). The advantages of moving a data lake from an in-house location to the cloud for analyzing unstructured data can include:

    • Cloud-based tools that are more efficient: The tools available on the cloud can build data pipelines much more efficiently than in-house tools. Often, the data pipeline is pre-integrated, offering a working solution while saving hundreds of hours of in-house set up costs.
    • Scaling as needed: A cloud provider can provide and manage scaling for stored data, as opposed to an in-house system, which would require adding machines or managing clusters.
    • A flexible infrastructure: Cloud services provide a flexible, on-demand infrastructure that is charged for based on time used. Additional services can also be accessed. (However, confusion and inexperience will result in wasted time and money.) 
    • Backup copies: Cloud providers strive to prevent service interruptions, so they store redundant copies of the data, using physically different servers, just in case your data gets lost.

    Data lakes, sadly, have not become the perfect solution for working with unstructured data. The data lake industry is about seven years old and is not yet mature – unlike structured/SQL data systems. 

    Cloud-based data lakes may be easy to deploy but can be difficult to manage, resulting in unexpected costs. Data reliability issues can develop when combining batch and streaming data and corrupted data. A lack of experienced data lake professionals is also a significant problem.

    Data lakehouses, which are still in the development stage, have the goal of storing and accessing unstructured data, while providing the benefits of structured data/SQL systems. 

    The Benefits of Using Structured Data

    Basically, the primary benefit of structured data is its ease of use. This benefit is expressed in three ways:

    • A great selection of tools: Because this popular way of organizing data has been around for a while, a significant number of tools have been developed for structured/SQL databases.
    • Machine learning algorithms: Structured data works remarkably well for training machine learning algorithms. The clearly defined nature of structured data provides a language machine learning can understand and work with.
    • Business transactions: Structured data can be used for business purposes by the average person because it’s easy to use. There is no need for an understanding of different types of data.

    The Benefits of Using Unstructured Data 

    Examples of unstructured data include such things as social media posts, chats, email, presentations, photographs, music, and IoT sensor data. The primary strength of NoSQL databases and data lakes is their flexibility in handling a wide variety of data formats. The benefits of working with NoSQL databases or data lakes are:

    • Faster accumulation rates: Because there is no need to transform different types of data into a standardized format, it can be gathered quickly and efficiently.
    • More efficient research: A broader base of data taken from a variety of sources typically provides more accurate predictions of human behavior.

    The Future of Structured and Unstructured Data

    Over the next decade, unstructured data will become much easier to work with, and much more commonplace, and it will coexist without problems alongside structured data. Tools for structured data will continue to be developed, and structured data will continue to be used for business purposes.

    Although very much in the early stages of development, artificial intelligence algorithms have been developed that help find meaning automatically when searching unstructured data.

    Currently, Microsoft’s Azure AI is using a combination of optical character recognition, voice recognition, text analysis, and machine vision to scan and understand unstructured collections of data that may be made up of text or images. 

    Google offers a wide range of tools using AI algorithms that are ideal for working with unstructured data. For example, Vision AI can decode text, analyze images, and even recognize the emotions of people in photos.

    In the next decade, we can predict that AI will play a significant role in processing unstructured data. There will be an urgent need for “recognition algorithms.” (We currently seem to be limited to image recognition, pattern recognition, and facial recognition.) As artificial intelligence evolves, it will be used to make working with unstructured data much easier.

    Author: Keith D. Foote

    Source: Dataversity
