13 items tagged "Data quality"

  • 9 Data issues to deal with in order to optimize AI projects

    9 Data issues to deal with in order to optimize AI projects

    The quality of your data affects how well your AI and machine learning models will operate. Getting ahead of these nine data issues will position organizations for AI success.

    At the core of modern AI projects are machine-learning-based systems, which depend on data to derive their predictive power. Because of this, every artificial intelligence project depends on high-quality data.

    However, obtaining and maintaining high quality data is not always easy. There are numerous data quality issues that threaten to derail your AI and machine learning projects. In particular, these nine data quality issues need to be considered and prevented before issues arise.

    1. Inaccurate, incomplete and improperly labeled data

    Inaccurate, incomplete or improperly labeled data is a leading cause of AI project failure. These data issues can range from bad data at the source to data that has not been cleaned or prepared properly. Data might sit in the wrong fields or have the wrong labels applied.

    Data cleanliness is such an issue that an entire industry of data preparation has emerged to address it. While it might seem an easy task to clean gigabytes of data, imagine having petabytes or zettabytes of data to clean. Traditional approaches simply don't scale, which has resulted in new AI-powered tools to help spot and clean data issues.

    2. Having too much data

    Since data is central to AI projects, it's commonly assumed that the more data you have, the better. However, throwing more data at a machine learning model doesn't always help. A counterintuitive data quality issue, therefore, is having too much data.

    While it might seem like too much data can never be a bad thing, more often than not a good portion of it is unusable or irrelevant. Sifting through a large data set to separate the useful data wastes organizational resources. In addition, all that extra data might result in data "noise" that causes machine learning systems to learn from the nuances and variances in the data rather than the more significant overall trend.

    3. Having too little data

    On the flip side, having too little data presents its own problems. While training a model on a small data set may produce acceptable results in a test environment, bringing that model from proof of concept or pilot stage into production typically requires more data. In general, small data sets produce results that have low complexity, are biased or overfitted, and will not be accurate when working with new data.

    4. Biased data

    In addition to incorrect data, another issue is that the data might be biased. The data might be selected from larger data sets in ways that don't appropriately represent the wider data set. In other cases, the data might be derived from older information that was shaped by human bias. Or there may be issues with the way the data is collected or generated that produce a biased outcome.

    5. Unbalanced data

    While everyone wants to try to minimize or eliminate bias from their data sets, this is much easier said than done. There are several factors that can come into play when addressing biased data. One factor can be unbalanced data. Unbalanced data sets can significantly hinder the performance of machine learning models. Unbalanced data has an overrepresentation of data from one community or group while unnecessarily reducing the representation of another group.

    An example of an unbalanced data set can be found in some approaches to fraud detection. In general, most transactions are not fraudulent, which means that only a very small portion of your data set will be fraudulent transactions. Since a model trained on this data receives significantly more examples from one class than the other, its results will be biased toward the majority class. That's why it's essential to conduct thorough exploratory data analysis to discover such issues early and consider solutions that can help balance data sets.
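    To make the imbalance concrete, one common remedy is to weight classes by inverse frequency so the rare class counts as much as the common one during training. The sketch below is illustrative only; the labels and the `class_weights` function are invented, not from the article:

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency weights: rarer classes get larger weights,
    so a model pays equal total attention to each class."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: total / (len(counts) * n) for cls, n in counts.items()}

# A toy fraud data set: 1 fraudulent transaction per 99 legitimate ones.
labels = ["legit"] * 99 + ["fraud"] * 1
weights = class_weights(labels)
# The rare "fraud" class receives a far larger weight than "legit".
```

    This is the same heuristic scikit-learn applies when a model is given `class_weight="balanced"`.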

    6. Data silos

    Related to the issue of unbalanced data is the issue of data silos. A data silo is where only a certain group or limited number of individuals at an organization have access to a data set. Data silos can result from several factors, including technical challenges or restrictions in integrating data sets as well as issues with proprietary or security access control of data.

    They are also the product of structural breakdowns at organizations where only certain groups have access to certain data, as well as cultural issues where a lack of collaboration between departments prevents data sharing. Regardless of the reason, data silos can limit the ability of those working on artificial intelligence projects to access comprehensive data sets, possibly lowering the quality of results.

    7. Inconsistent data

    Not all data is created equal. Just because you're collecting information doesn't mean it can or should always be used. Related to the collection of too much data is the challenge of collecting irrelevant data for training. Training a model on clean but irrelevant data results in the same issues as training on poor quality data.

    Closely related to data irrelevancy is inconsistent data. In many circumstances, the same records exist multiple times in different data sets but with different values, resulting in inconsistencies. Duplicate data is one of the biggest problems for data-driven businesses, and when dealing with multiple data sources, inconsistency is a strong indicator of a data quality problem.
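    One simple way to surface such inconsistencies is to group records by a shared key and flag keys whose copies disagree. This is a minimal sketch with invented record fields, not a production deduplication tool:

```python
def find_conflicts(records, key):
    """Return records that share a key but disagree on other fields."""
    seen = {}       # key -> first record observed
    conflicts = {}  # key -> list of disagreeing records
    for rec in records:
        k = rec[key]
        if k not in seen:
            seen[k] = rec
        elif rec != seen[k]:
            conflicts.setdefault(k, [seen[k]]).append(rec)
    return conflicts

customers = [
    {"id": 1, "email": "ann@example.com"},
    {"id": 2, "email": "bob@example.com"},
    {"id": 1, "email": "ann@exmaple.com"},  # same id, conflicting value
]
bad = find_conflicts(customers, "id")  # flags id 1 with both versions
```

    In practice the flagged groups still need a human or a business rule to decide which value is correct.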

    8. Data sparsity

    Another issue is data sparsity. Data sparsity is when there is missing data or an insufficient quantity of specific expected values in a data set. Data sparsity can degrade the performance of machine learning algorithms and their ability to calculate accurate predictions. If it is not identified, it can result in models being trained on noisy or insufficient data, reducing the effectiveness or accuracy of results.
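    A quick per-column report of missing values is often enough to spot sparsity before training. A stdlib-only sketch, with invented column names:

```python
def sparsity_report(rows, columns):
    """Fraction of missing (None or empty-string) values per column."""
    return {
        col: sum(1 for r in rows if r.get(col) in (None, "")) / len(rows)
        for col in columns
    }

rows = [
    {"age": 34, "income": 55000},
    {"age": None, "income": 61000},
    {"age": 29, "income": None},
    {"age": 41, "income": ""},
]
report = sparsity_report(rows, ["age", "income"])
# "income" is missing in half the rows; "age" in a quarter of them.
```

    Columns with a high missing fraction can then be imputed, dropped, or sourced from elsewhere before modeling begins.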

    9. Data labeling issues

    Supervised machine learning models, one of the fundamental types of machine learning, require data to be labeled with correct metadata for machines to derive insights. Data labeling is hard work, often requiring people to apply metadata to a wide range of data types, which can be both complex and expensive. One of the biggest data quality issues currently challenging in-house AI projects is the lack of proper labeling of machine learning training data. Accurately labeled data ensures that machine learning systems establish reliable models for pattern recognition, forming the foundation of every AI project. Good quality labeled data is paramount to accurately training an AI system on the data it is being fed.

    Organizations looking to implement successful AI projects need to pay attention to the quality of their data. While the reasons for data quality issues are many, the common theme is that proper management is key to keeping data in the best condition possible. It's important to keep a watchful eye on the data being collected, run regular checks on it, keep it as accurate as possible, and get it into the right format before machine learning models learn from it. If companies stay on top of their data, quality issues are far less likely to arise.

    Author: Kathleen Walch

    Source: TechTarget

  • An overview of Morgan Stanley's surge toward data quality

    An overview of Morgan Stanley's surge toward data quality

    Jeff McMillan, chief analytics and data officer at Morgan Stanley, has long worried about the risks of relying solely on data. If the data put into an institution's system is inaccurate or out of date, it will give customers the wrong advice. At a firm like Morgan Stanley, that just isn't an option.

    As a result, Morgan Stanley has been overhauling its approach to data. Chief among its goals is improving data quality in core business processing.

    “The acceleration of data volume and the opportunity this data presents for efficiency and product innovation is expanding dramatically,” said Gerard Hester, head of the bank’s data center of excellence. “We want to be sure we are ahead of the game.”

    The data center of excellence was established in 2018. Hester describes it as a hub with spokes out to all parts of the organization, including equities, fixed income, research, banking, investment management, wealth management, legal, compliance, risk, finance and operations. Each division has its own data requirements.

    “Being able to pull all this data together across the firm we think will help Morgan Stanley’s franchise internally as well as the product we can offer to our clients,” Hester said.

    The firm hopes that improved data quality will let the bank build higher quality artificial intelligence and machine learning tools to deliver insights and guide business decisions. One product expected to benefit from this is the 'next best action' system the bank developed for its financial advisers.

    This next best action uses machine learning and predictive analytics to analyze research reports and market data, identify investment possibilities, and match them to individual clients’ preferences. Financial advisers can choose to use the next best action’s suggestions or not.

    Another tool that could benefit from better data is an internal virtual assistant called 'ask research'. Ask research provides quick answers to routine questions like, “What’s Google’s earnings per share?” or “Send me your latest model for Google.” This technology is currently being tested in several departments, including wealth management.

    New data strategy

    Better data quality is just one of the goals of the revamp. Another is to have tighter control and oversight over where and how data is being used, and to ensure the right data is being used to deliver new products to clients.

    To make this happen, the bank recently created a new data strategy with three pillars. The first is working with each business area to understand their data issues and begin to address those issues.

    “We have made significant progress in the last nine months working with a number of our businesses, specifically our equities business,” Hester said.

    The second pillar is tools and innovation that improve data access and security. The third pillar is an identity framework.

    At the end of February, the bank hired Liezel McCord to oversee data policy within the new strategy. Until recently, McCord was an external consultant helping Morgan Stanley with its Brexit strategy. One of McCord’s responsibilities will be to improve data ownership, to hold data owners accountable when the data they create is wrong and to give them credit when it’s right.

    “It’s incredibly important that we have clear ownership of the data,” Hester said. “Imagine you’re joining lots of pieces of data. If the quality isn’t high for one of those sources of data, that could undermine the work you’re trying to do.”

    Data owners will be held accountable for the accuracy, security and quality of the data they contribute and make sure that any issues are addressed.

    Trend of data quality projects

    Arindam Choudhury, the banking and capital markets leader at Capgemini, said many banks are refocusing on data as it gets distributed in new applications.

    Some are driven by regulatory concerns, he said. For example, the Basel Committee on Banking Supervision's standard number 239 (principles for effective risk data aggregation and risk reporting) is pushing some institutions to make data management changes.

    “In the first go-round, people complied with it, but as point-to-point interfaces and applications, which was not very cost effective,” Choudhury said. “So now people are looking at moving to the cloud or a data lake, they’re looking at a more rationalized way and a more cost-effective way of implementing those principles.”

    Another trend pushing banks to get their data house in order is competition from fintechs.

    “One challenge that almost every financial services organization has today is they’re being disintermediated by a lot of the fintechs, so they’re looking at assets that can be used to either partner with these fintechs or protect or even grow their business,” Choudhury said. “So they’re taking a closer look at the data access they have. Organizations are starting to look at data as a strategic asset and try to find ways to monetize it.”

    A third driver is the desire for better analytics and reports.

    "There’s a strong trend toward centralizing and figuring out, where does this data come from, what is the provenance of this data, who touched it, what kinds of rules did we apply to it?” Choudhury said. That, he said, could lead to explainable, valid and trustworthy AI.

    Author: Penny Crosman

    Source: Information-management

  • Business Intelligence Trends for 2017

    Analyst and consulting firm Business Application Research Centre (BARC) has come out with the top BI trends, based on a survey of 2,800 BI professionals. Compared to last year, there were no significant changes in the ranking of the importance of BI trends, indicating that no major market shifts or disruptions are expected to impact this sector.
    With the growing advancement and disruption in IT, the eight meta trends that influence the strategies, investments and operations of enterprises worldwide are Digitalization, Consumerization, Agility, Security, Analytics, Cloud, Mobile and Artificial Intelligence. All of these meta trends are major drivers of the growing demand for data management, business intelligence and analytics (BI), and their growth also sets the direction for the industry. The top three of the 21 trends for 2017 were:
    • Data discovery and visualization,
    • Self-service BI and
    • Data quality and master data management
    Data labs and data science, cloud BI and data as a product were the least important trends for 2017.
    Data discovery and visualization, along with predictive analytics, are some of the most desired BI functions that users want in a self-service mode. But the report suggested that organizations should also have an underlying tool and data governance framework to ensure control over data.
    In 2016, BI was mostly used in the finance department, followed by management and sales, with only slight variation in usage rates over the last three years. However, there was a surge in BI usage in production and operations departments, which grew from 20% in 2008 to 53% in 2016.
    "While BI has always been strong in sales and finance, production and operations departments have traditionally been more cautious about adopting it,” says Carsten Bange, CEO of BARC. “But with the general trend for using data to support decision-making, this has all changed. Technology for areas such as event processing and real-time data integration and visualization has become more widely available in recent years. Also, the wave of big data from the Internet of Things and the Industrial Internet has increased awareness and demand for analytics, and will likely continue to drive further BI usage in production and operations."
    Customer analysis was the #1 investment area for new BI projects, with 40% of respondents investing their BI budgets in customer behavior analysis and 32% in developing a unified view of customers.
    • “With areas such as accounting and finance more or less under control, companies are moving to other areas of the enterprise, in particular to gain a better understanding of customer, market and competitive dynamics,” said Carsten Bange.
    • Many BI trends of the past have become critical BI components in the present.
    • Many organizations were also considering trends like collaboration and sensor data analysis as critical BI components. About 20% respondents were already using BI trends like collaboration and spatial/location analysis.
    • About 12% were using cloud BI and more were planning to employ it in the future. IBM's Watson and Salesforce's Einstein are gearing to meet this growth.
    • Only 10% of the respondents used social media analysis.
    • Sensor data analysis is also growing, driven by the huge volumes of data generated by the millions of IoT devices used in the telecom, utilities and transportation industries. According to the survey, the transport and telecoms industries would lead the leveraging of sensor data in 2017.
    The biggest new investments in BI are planned in the manufacturing and utilities industries in 2017.
    Source: readitquick.com, November 14, 2016
  • Changing voluntarily and the role of data quality

    Changing voluntarily and the role of data quality

    In the modern world nothing stays the same for long. We live in a state of constant change, with new technologies, new trends and new risks. Yet it's a commonly held belief that people don't like change. Which led me to wonder: why do we persist in calling change management initiatives 'change management' if people don't like change?

    In my experience, I have not found this maxim to be true. Actually, nobody minds change; we evolve and adapt naturally. What we do not like is being forced to change. As such, when we make a choice to change, it is often easy, fast and permanent.

    To put that into context, change is an external force imposed upon you. For example, if I tell you I want you to change your attitude, you are expected to adapt your patterns of behaviour to comply with my idea of your ‘new and improved attitude’. This is difficult to maintain and conflicts with your innate human need to exercise your own free-will. However, if I ask you to choose your attitude, this places you in control of your own patterns of behaviour. You can assess the situation and decide the appropriate attitude you will adopt. This makes it far more likely that you will maintain the changes and, as a result, will reap the rewards.

    Perhaps you’re wondering what this has to do with the data quality and data quality management of your organisation?

    Quite simply, the need for choice applies to every aspect of life: making positive choices for our health and wellbeing, choosing changes that improve our environmental impact, and making changes that positively affect the financial, reputational and commercial wellbeing of your business, one of which is data quality management. The ultimate success of these initiatives stems from one thing: the conscious choice to change.

    It’s a simple case of cause and effect.

    So back to my original point: choice management, not change management.

    An organisational choice, owned and performed by everyone, to improve your data quality and data cleansing, driven by a thorough understanding of the beneficial outcomes, will reap untold business rewards. After all, over 2,000 years ago Aristotle gave us a clue by saying: “We are what we repeatedly do, therefore excellence is not an act, but a habit.”
    When you choose to improve and maintain the quality of the baseline data that is relied upon for business decisions:

    • Your business outcomes will improve because you will have a better understanding of your customers’ needs;
    • You will reduce wasted effort by communicating directly to a relevant and engaged audience;
    • Profits will increase as a result of data cleansing and reduced duplication of effort, coupled with increased trust in your brand; and
    • Customer, employee and shareholder confidence and satisfaction will rise.

    Bringing your team with you on a journey of change and helping them to make the choices to effectively implement those changes, will require you to travel the ‘Change Curve’ together. As a business leader, you will be at the forefront leading the way and coaching your staff to join you on the journey.

    We can all find ourselves at the start of the change curve at times, in denial of the needs or issues we know must be tackled. You, and your team, may feel angry or overwhelmed by the scale of the change you need to achieve. However, the key is choosing to accept the need to change, adapt and evolve. That way, you will move in your new direction much faster, taking the action to make your goals a reality.

    It’s easy to feel overwhelmed when you have a mountain to climb, and it can be tempting to make decisions based on where you are now. However, choosing to make business decisions about your data quality, and your need for data quality tools, based on where you want to be is where the true power lies; that is how you will unleash your winning formula.

    Author: Martin Doyle

    Source: DQ Global

  • Data accuracy - What is it and why is it important?

    Data accuracy - What is it and why is it important?

    The world has come to rely on data. Data-driven analytics fuel marketing strategies, supply chain operations, and more, and often to impressive results. However, without careful attention to data accuracy, these analytics can steer businesses in the wrong direction.

    Data analysis that is misapplied or poorly executed can likewise lead to unintended consequences. This is especially true when it comes to understanding accuracy in data.


    What is data accuracy?

    Data accuracy is, as it sounds, whether or not given values are correct and consistent. The two most important characteristics of this are form and content, and a data set must be correct in both respects to be accurate.

    For example, imagine a database containing information on employees’ birthdays, and one worker’s birthday is January 5th, 1996. U.S. formats would record that as 1/5/1996, but if this employee is European, they may record it as 5/1/1996. This difference could cause the database to incorrectly state that the worker’s birthday is May 1, 1996.

    In this example, while the data’s content was correct, its form wasn’t, so it wasn’t accurate in the end. If information is of any use to a company, it must be accurate in both form and content.
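    The ambiguity disappears when the expected format is stated explicitly at parse time rather than guessed. A short Python sketch of the article's birthday example:

```python
from datetime import datetime, date

# The same string yields two different dates under the two conventions:
us = datetime.strptime("1/5/1996", "%m/%d/%Y").date()  # January 5, 1996
eu = datetime.strptime("1/5/1996", "%d/%m/%Y").date()  # May 1, 1996

# Agreeing on one unambiguous format, such as ISO 8601, avoids the problem:
iso = date.fromisoformat("1996-01-05")
```

    Storing dates in a single unambiguous format keeps form and content consistent no matter which region the data comes from.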


    Why data accuracy matters

    While the birthday example may not have significant ramifications, data accuracy can have widespread ripple effects. Consider how some hospitals use AI to predict the best treatment course for cancer patients. If the data this AI analyzes isn't accurate, it won't produce reliable predictions, potentially leading to minimally effective or even harmful treatments.

    Studies have shown that bad data costs businesses 30% or more of their revenue on average. If companies are making course-changing decisions based on data analytics, their databases must be accurate. As the world comes to rely more heavily on data, this becomes a more pressing concern.


    How to improve data accuracy

    Before using data to train an algorithm or fuel business decisions, data scientists must ensure accuracy. Thankfully, organizations can take several steps to improve their data accuracy. Here are five of the most important actions.


    1. Gather data from quality sources

    One of the best ways to improve data accuracy is to start with higher-quality information. Companies should review their internal and external data sources to ensure what they’re gathering is true to life. That includes making sure sensors are working correctly, collecting large enough datasets, and vetting third-party sources.

    Some third-party data sources track and publish reported errors, which serves as a useful vetting tool. When getting data from these external sources, businesses should always check these reports to gauge their reliability. Similarly, internal error reports can reveal if one data-gathering process may need adjustment.


    2. Ease the data entry workload

    Some data is accurate from the source but becomes inaccurate in the data entry process. Errors in entry and organization can taint good information, so organizations must work to eliminate these mistakes. One of the most significant fixes to this issue is easing the manual data entry workload.

    If data entry workers have too much on their plate, they can become stressed or tired, leading to mistakes. Delegating the workload more evenly across teams, extending deadlines, or automating some processes can help prevent this stress. Mistakes will drop as a result.


    3. Regulate access to databases

    Another common cause of data inaccuracy is inconsistencies between departments. If people across multiple teams have access to the same datasets, there will likely be discrepancies in their inputs. Differences in formats and standards between departments could result in duplication or inconsistencies.

    Organizations can prevent these errors by regulating who has access to databases. Minimizing database accessibility makes it easier to standardize data entry methods and reduces the likelihood of duplication. This will also make it easier to trace mistakes to their source and improve security.


    4. Cleanse data before use

    After compiling information into a database, teams must cleanse the data before using it in any analytics process. This will remove any errors that earlier steps didn’t prevent. Generally speaking, the data cleansing workflow should follow four basic steps: inspection, cleaning, verifying, and reporting.

    In short, that means looking for errors, fixing or removing them (including standardizing formats), double-checking to verify the accuracy, and recording any changes made. That final step is easy to overlook but crucial, as it can reveal any error trends that emerge between data sets.
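    The four-step workflow can be sketched as a single pass over a record set. The field names and cleaning rules below are invented for illustration:

```python
def cleanse(records):
    """Inspection -> cleaning -> verification -> reporting, in miniature."""
    report = {"inspected": 0, "standardized": 0, "removed": 0}
    cleaned = []
    for rec in records:
        report["inspected"] += 1                # inspection: look at every record
        name = rec.get("name") or ""
        if not name.strip():                    # cleaning: drop unusable records
            report["removed"] += 1
            continue
        fixed = " ".join(name.split()).title()  # cleaning: standardize the format
        if fixed != name:
            report["standardized"] += 1
        cleaned.append({**rec, "name": fixed})
    assert all(r["name"].strip() for r in cleaned)  # verifying: double-check
    return cleaned, report                      # reporting: record what changed

rows = [{"name": "ada  lovelace"}, {"name": ""}, {"name": "Alan Turing"}]
cleaned, report = cleanse(rows)
```

    The returned report is the often-overlooked fourth step: counting what was fixed or dropped makes error trends between data sets visible.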


    5. Improve accuracy gradually

    While applying these fixes across an entire organization simultaneously may be tempting, that’s not feasible. Instead, teams should work on the accuracy of one database or operation at a time, starting with the most mission-critical data.

    As teams slowly refine their databases, they’ll learn which fixes have the most significant impact and how to implement them efficiently. This gradual approach will maximize these improvements’ efficacy and minimize disruptions.


    Poor-quality data will lead to unreliable and possibly harmful outcomes. Data teams must pay attention to data accuracy if they hope to produce any meaningful results for their company.

    These five steps provide an outline for improving any data operation’s accuracy. With these fixes, teams can ensure they’re working with the highest-quality data, leading to the most effective analytics.

    Author: Devin Partida

    Source: Dataconomy

  • Data integration applied to BI: making data useful for decision making

    Data integration applied to BI: making data useful for decision making

    In this technology-driven world, the influx of data can seem overwhelming, if not properly utilized. With data coming in from so many different sources, the only way to extract real insights from these raw inputs is through integration.

    Properly integrated data has a trickle-down effect on all business processes, such as sales, vendor acquisition, customer management, business intelligence, etc. Implementing this level of integration enables businesses to make continuous improvements to their products and services.

    Business intelligence (BI) is one of the most significant data integration use cases. An effective BI process incorporates everything from predictive analytics to reporting and operations management. But this sort of comprehensive analytics framework requires integrated enterprise data to identify process inefficiencies, missed opportunities, and other improvement areas.

    What complicates BI integration?

    Given that enterprise information comes from different sources in varying formats and often contains inconsistencies, duplicates, and errors, users must ensure that quality issues identified during the data extraction process do not propagate to their end results. Unchecked, these issues impact the integrity and accuracy of reporting, which in turn negatively influences decision making, leading to further inefficiencies across business processes.

    Creating well-defined integration processes that not only consolidate data but standardize it for consistency and quality can make high-quality data readily available for decision making.

    Streamlining BI integration: best practices

    Raw data becomes valuable when transformed into analytics-ready, actionable information. By bringing disparate formats together into a unified data repository, an integrated BI system offers better visibility and efficiency into the enterprise assets.

    Therefore, successful BI initiatives are a combination of an effective integration and analytics strategy. The best practices stated below can help you make the best of it:

    Document a BI strategy

    Every business has a reporting process in place. Before implementing a new BI strategy, it’s important to evaluate existing systems to identify the areas that need improvement. Based on that information, you can design a new strategy, which can include several components depending on your specific business structure. However, the major ones that cannot be ignored include the following:

    • Narrow down the data source channels essential for your reporting process. These may include stakeholder or departmental information from databases, files, or web sources.
    • BI tools exist to track business KPIs with supporting data, so identifying the custom KPIs for your organization is imperative to presenting a broad picture of your business growth and losses.
    • Set a format for reporting: visual or textual. Based on your preferences and the input sources, you can select a vendor for the BI system.

    Set up data integration tools

    The integration stage of the entire process will be time-consuming. You can go about it in two ways:

    • Opt for the manual approach, where you rely on your developers and IT team to develop a BI architecture for your custom requirements.
    • The simpler and faster approach is to buy an enterprise-ready integration solution from the market. These solutions extract data from different sources using built-in connectors, transform it into the required format, and load it into the destination system connected to your BI tools. Several data integration solutions offer out-of-the-box connectivity to BI tools, so purchasing one serves the dual purpose of integration and reporting.
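    The extract-transform-load flow such a solution automates looks roughly like this. A hand-rolled sketch with invented source data, not any specific vendor's API:

```python
# Extract: raw records arrive from two hypothetical sources in different shapes.
source_a = [{"Customer": "  Ann ", "Spend": "120.50"}]   # dicts with string fields
source_b = [("bob", 80.0)]                               # bare tuples

def transform(raw):
    """Normalize both source shapes into one schema for the BI layer."""
    unified = []
    for rec in raw:
        if isinstance(rec, dict):
            unified.append({"customer": rec["Customer"].strip().lower(),
                            "spend": float(rec["Spend"])})
        else:
            name, spend = rec
            unified.append({"customer": name.lower(), "spend": float(spend)})
    return unified

# Load: append the standardized records into the destination the BI tool reads.
warehouse = []
warehouse.extend(transform(source_a + source_b))
```

    Commercial integration tools replace the hand-written `transform` with configurable connectors and mappings, but the underlying extract-transform-load shape is the same.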

    Factor in data security

    Setting up security measures before implementing BI is imperative in protecting your information assets against data breaches. By configuring authorization or authentication protocols and outlining procedures to carry out secure data processes, you can control access to data sets.

    BI is no longer a privilege for enterprises; it's a necessity that enables organizations to stay ahead of the competition and optimize decision-making.

    Identifying the challenges in their reporting journey and implementing the best practices mentioned above will help organizations leverage the BI capabilities and become data-focused.

    Author: Ibrahim Surani

    Source: Dataversity

  • Machine learning, AI, and the increasing attention for data quality

    Machine learning, AI, and the increasing attention for data quality

    Data quality has been going through a renaissance recently.

    As a growing number of organizations increase efforts to transition computing infrastructure to the cloud and invest in cutting-edge machine learning and AI initiatives, they are finding that the main barrier to success is the quality of their data.

    The old saying “garbage in, garbage out” has never been more relevant. With the speed and scale of today’s analytics workloads and the businesses that they support, the costs associated with poor data quality are also higher than ever.

    This is reflected in a massive uptick in media coverage on the topic. Over the past few months, data quality has been the focus of feature articles in The Wall Street Journal, Forbes, Harvard Business Review, MIT Sloan Management Review and others. The common theme is that the success of machine learning and AI is completely dependent on data quality. A quote that summarizes this dependency very well is this one by Thomas Redman: “If your data is bad, your machine learning tools are useless.”

    The development of new approaches towards data quality

    The need to accelerate data quality assessment, remediation and monitoring has never been more critical for organizations and they are finding that the traditional approaches to data quality don’t provide the speed, scale and agility required by today’s businesses.

    For this reason, highly rated data preparation business Trifacta recently announced an expansion into data quality and unveiled two major new platform capabilities: active profiling and smart cleaning. This is the first time Trifacta has expanded its focus beyond data preparation. By adding new data quality functionality, the business aims to handle a wider set of data management tasks as part of a modern DataOps platform.

    Legacy approaches to data quality involve many manual, disparate activities as part of a broader process. Dedicated data quality teams, often disconnected from the business context of the data they are working with, manage the process of profiling, fixing and continually monitoring data quality in operational workflows. Each step must be managed in a completely separate interface. It’s hard to iteratively move back-and-forth between steps such as profiling and remediation. Worst of all, the individuals doing the work of managing data quality often don’t have the appropriate context for the data to make informed decisions when business rules change or new situations arise.

    Trifacta uses interactive visualizations and machine intelligence to guide users, highlighting data quality issues and providing intelligent suggestions on how to address them. Profiling, user interaction, intelligent suggestions, and guided decision-making are all interconnected, each driving the others. Users can seamlessly transition back and forth between steps to ensure their work is correct. This guided approach lowers the barrier to entry and helps democratize the work beyond siloed data quality teams, allowing those with the business context to own and deliver quality outputs more efficiently to downstream analytics initiatives.

    New data platform capabilities like this are only a first (albeit significant) step into data quality. Keep your eyes open and expect more developments towards data quality in the near future!

    Author: Will Davis

    Source: Trifacta

  • Managing data at your organization? Take a holistic approach

    Managing data at your organization? Take a holistic approach

    Taking a holistic approach to data requires considering the entire data lifecycle – from gathering, integrating, and organizing data to analyzing and maintaining it. Companies must create a standard for their data that fits their business needs and processes. To determine what those are, start by asking your internal stakeholders questions such as, “Who needs access to the data?” and “What do each of these departments, teams, or leaders need to know? And why?” This helps establish what data is necessary, what can be purged from the system, and how the remaining data should be organized and presented.

    This holistic approach helps yield higher-quality data that’s more usable and more actionable. Here are three reasons to take a holistic approach at your organization:

    1. Remote workforce needs simpler systems

    We saw a massive shift to work-from-home in 2020, and that trend continues to pick up speed. Companies like Twitter, Shopify, Siemens, and the State Bank of India are telling employees they can continue working remotely indefinitely. And according to the World Economic Forum, the number of people working remotely worldwide is expected to double in 2021.

    This makes it vital that we simplify how people interact with their business systems, including CRMs. After all, we still need answers to everyday questions like, “Who’s handling the XYZ account now?” and “How did customer service solve ABC’s problem?” But instead of being able to ask the person in the next office or cubicle, we’re forced to rely on a CRM to keep us up to date and make sure we’re moving in the right direction.

    This means team members must input data in a timely manner, and others must be able to access that data easily and make sense of it, whether it’s to view the sales pipeline, analyze a marketing campaign’s performance, or spot changes in customer buying behavior.

    Unfortunately, the CRMs used by many companies make data entry and analytics challenging. At best, this is an efficiency issue. At worst, it means people aren’t inputting the data that’s needed, and any analysis of spotty data will be flawed. That’s why we suggest companies focus on improving their CRM’s user interface, if it isn’t already user-friendly.

    2. A greater need for data accuracy

    The increased reliance on CRM data also means companies need to ramp up their Data Quality efforts. People need access to clean, accurate information they can act on quickly.

    It’s a profound waste of time when the sales team needs to verify contact information for every lead before they reach out, or when data scientists have to spend hours each week cleaning up data before they analyze it.

    Yet, according to online learning company O’Reilly’s The State of Data Quality 2020 report, 40% or more of companies suffer from these and other major Data Quality issues:

    • Poor quality controls when data enters the system
    • Too many data sources and inconsistent data
    • Poorly labeled data
    • Disorganized data
    • Too few resources to address Data Quality issues

    These are serious systemic issues that must be addressed in order to deliver accurate data on an ongoing basis.

    3. A greater need for automation

    Data Quality Management is an ongoing process throughout the entire data lifecycle. We can’t just clean up data once and call it done.

    Unfortunately, many companies are being forced to work with smaller budgets and leaner teams these days, yet the same amount of data cleanup and maintenance work needs to get done. Automation can help with many of the repetitive tasks involved in data cleanup and maintenance. This includes:

    • Standardizing data
    • Removing duplicates
    • Preventing new duplicates
    • Managing imports
    • Importing/exporting data
    • Converting leads
    • Verifying data
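    Two of these tasks, standardizing records and removing duplicates, can be sketched in a few lines. The contact fields and normalization rules below are illustrative assumptions, not a prescription for any particular tool.

    ```python
    import re

    # Hypothetical contact records with formatting drift and a duplicate
    contacts = [
        {"email": "Ana@Example.com ", "phone": "(555) 010-2000"},
        {"email": "ana@example.com", "phone": "555.010.2000"},
        {"email": "bob@example.com", "phone": "555-010-3000"},
    ]

    def standardize(record):
        # Lower-case and trim emails; keep digits only for phone numbers
        return {
            "email": record["email"].strip().lower(),
            "phone": re.sub(r"\D", "", record["phone"]),
        }

    def deduplicate(records, key="email"):
        # Keep the first record seen for each key value
        seen, unique = set(), []
        for r in records:
            if r[key] not in seen:
                seen.add(r[key])
                unique.append(r)
        return unique

    clean = deduplicate([standardize(c) for c in contacts])
    print(len(clean))  # 2
    ```

    The point of automating this is that the same rules run on every import, not just during a one-off cleanup, which is what prevents new duplicates from creeping back in.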

    A solid business case

    By taking a holistic approach to Data Management – including simplifying business systems, improving data accuracy, and automating whenever possible – companies can improve the efficiency and effectiveness of teams throughout their organization. These efforts will help organizations come through the pandemic stronger, with a “new normal” for data that’s far better than what came before.

    Author: Olivia Hinkle

    Source: Dataversity

  • Migros: an example of seizing the opportunities BI offers

    Migros: an example of seizing the opportunities BI offers

    Migros is the largest retailer in Turkey, with more than 2500 outlets selling fresh produce and groceries to millions of people. To maintain high-quality operations, the company depends on fresh, accurate data. And to ensure high data quality, Migros depends on Talend.

    The sheer volume of data managed by Migros is astonishing. The company’s data warehouse currently holds more than 200 terabytes, and Migros is running more than 7,000 ETL (extract, transform, load) jobs every day. Recently, the quality of that data became the focal point for the BI (business intelligence) team at Migros.

    “We have 4,000 BI users in this company,” said Ahmet Gozmen, Senior Manager of IT Data Quality and Governance at Migros. “We produce 5-6 million mobile reports every year that our BI analysts see on their personal dashboards. If they can’t trust the timeliness or the accuracy of the reports, they can’t provide trustworthy guidance on key business decisions.”

    In 2019, Mr. Gozmen and his team decided they needed a more reliable foundation on which to build data quality. “We were having a few issues with our data at that time,” he said. “There would be occasional problematic or unexpected values in reports—a store’s stock would indicate an abnormal level, for example—and the issue was in the data, not the inventory. We had to address these problems, and more than that we wanted to take our data analysis and BI capabilities to a higher level.''

    From Community to Commercial

    Initially, Mr. Gozmen’s team used the non-commercial version of Talend Data Quality. “It was an open-source solution that we could download and set up in one day,” he said. “At first, we just wanted to see whether we could do something with this tool or not. We explored its capabilities, and we asked the Talend Community if we had questions or needed advice.”

    Mr. Gozmen discovered that Talend had far more potential than he expected. “We found that the data quality tool was very powerful, and we started exploring what else we could do with Talend,” he said. “So we also downloaded the data integration package, then the big data package. Talend could handle the huge volumes of data we were dealing with. And very soon we started thinking about the licensed, commercial versions of these solutions, because we saw a great deal of potential not only for immediate needs but for future plans.”

    By upgrading to the commercial versions, Migros also elevated the level of service and support that was available. “The Community served our purposes well in the early stages,” said Mr. Gozmen, “but with the commercial license we now have more personalized support and access to specialists who can help us immediately with any aspect of our implementation.”

    From Better Data Quality to Big Data Dreams

    With Talend Data Quality, Migros has improved the accuracy and reliability of its BI reporting, according to Mr. Gozmen. “We are a small department in a very big company,” he said, “but with help from Talend we can instill confidence in our reporting, and we can start to support other departments and have a larger impact on improving processes and even help generate more income.”

    The higher level of data quality Migros has achieved with Talend has also led Mr. Gozmen to consider using Talend for future data initiatives. “We have big dreams, and we are testing the waters on several fronts,” he said. “We are exploring the possibilities for predictive analytics, and we feel Talend’s big data capabilities are a good match.”

    The Migros team is also considering using Talend in moving from its current batch processing mode to real-time data analysis, according to Mr. Gozmen. “We are currently using date-minus-one or date-minus-two batch processing, but we want to move to real-time big data predictive analytics and other advanced capabilities as soon as possible,” he said. “We are currently testing new models that can help us achieve these goals.”

    While the business advantages of using Talend are manifesting themselves in past, present, and future use cases, Mr. Gozmen sums them up this way: “With Talend we can trust our data, so business leaders can trust our reports, so we can start to use big data in new ways to improve processes, supply chain performance, and business results.”

    Author: Laura Ventura

    Source: Talend

  • The Growing Influence of Ethical AI in Data Science

    The Growing Influence of Ethical AI in Data Science

    Industries such as insurance that handle personal information are paying more attention to customers’ desire for responsible, transparent AI.

    AI (artificial intelligence) is a tremendous asset to companies that use predictive modeling and have automated tasks. However, AI is still facing problems with data bias. After all, AI gets its marching orders from human-generated data -- which by its nature is prone to bias, no matter how evolved we humans like to think we are.

    With the wide adoption of AI, many industries are starting to pay attention to a new form of governance called responsible or ethical AI. These are governance practices associated with regulated data. For most organizations, this involves removing any unintentional bias or discrimination from their customer data and cross-checking any unexpected algorithmic activity once the data moves into production mode.

    This is an especially important transformation for the insurance industry because consumers today are becoming far more attuned to their personal end-to-end experience in any industry that relies on the use of personal data. By advancing responsible, ethical AI, insurers can confidently map to the way consumers want to search for insurance and find insurance policies, and they can align with the values and ethics that govern this kind of personal search.

    What Does Inherent Bias Look Like in AI Algorithms Today?

    One of the more noticeable examples of human-learned, albeit unintentional, data bias today is around gender. This happens when the AI system does not behave the same way for a man versus a woman, even when the data provided to the system is identical except for the gender information. One example outcome is that individuals who should be in the same insurance risk category are offered unequal policy advice.
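    One way to surface this kind of gender bias is a counterfactual test: flip only the gender field and check whether the model's output changes. The sketch below is purely illustrative, with a made-up score function standing in for a real risk model.

    ```python
    # Hypothetical risk model: the score should depend on claims history only
    def score(applicant):
        return 0.8 if applicant["claims"] > 2 else 0.2

    def gender_consistent(model, applicant):
        # Copy the record, changing nothing but the gender field
        flipped = dict(applicant, gender="F" if applicant["gender"] == "M" else "M")
        return model(applicant) == model(flipped)

    print(gender_consistent(score, {"gender": "M", "claims": 1}))  # True
    ```

    A system that fails this check is placing identical applicants in different risk categories on the basis of gender alone, which is exactly the unequal-policy-advice outcome described above.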

    Another example is something called the survivor bias, which is optimizing an AI model using only available, visible data -- i.e., “surviving” data. This approach inadvertently overlooks information due to the lack of visibility, and the results are skewed to one vantage point. To move past this weakness, for example in the insurance industry, AI must be trained not to favor the known customer data over prospective customer data that is not yet known.

    More enterprises are becoming aware of how these data determinants can expose them to unnecessary risk. A case in point: in their State of AI in 2021 report, McKinsey reviewed industry regulatory compliance through the filter of a company’s allegiance to equity and fairness data practices, and reported that two of companies’ top three global concerns are the ability to establish ethical AI and to explain their practices well to customers.

    How Can Companies Proactively Eliminate Data Bias Company-wide?

    Most companies should already have a diversity, equity, and inclusion (DEI) program to set a strong foundation before exploring practices in technology, processes, and people. At a minimum, companies can set a goal to remove ingrained data biases. Fortunately, there are a host of best-practice options to do this.

    • Adopt an open source strategy. First, enterprises need to know that biases are not necessarily where they imagine them to be. There can be a bias in the sales training data or in the data at the later inference or prediction time, or both. At Zelros, for example, we recommend that companies use an open source strategy to be more open and transparent in their AI initiatives. This is becoming an essential baseline anti-bias step that is being practiced at companies of all sizes. 

    • Utilize vendor partnerships. Companies that want to put a bigger stake in the ground when it comes to regulatory compliance and ethical AI standards can collaborate with organizations such as isahit, dedicated to helping organizations across industries become competent in their use and implementation of ethical AI. As a best practice, we recommend that companies work toward adopting responsible AI at every level, not just with their technical R&D or research teams, then communicate this governance proliferation to their customers and partners. 

    • Initiate bias bounties. Another method for eliminating data bias was identified by Forrester as a significant trend in their North American “Predictions 2022” guide. It is an initiative called bias bounties. Forrester stated that, “At least five large companies will introduce bias bounties in 2022.”
      Bias bounties are like bug bounties, but instead of rewarding users based on the issues they detect in software, users are rewarded for identifying bias in AI systems. The bias happens because of incomplete data or existing data that can lead to discriminatory outcomes from AI systems. According to Forrester, in 2022, major tech companies such as Google and Microsoft will implement bias bounties, and so will non-technology organizations such as banks and healthcare companies. With trust high on stakeholders’ agenda, basing decisions on accountability and integrity is more critical than ever.

    • Get certified. Finally, another method for establishing an ethical AI approach -- one that is gaining momentum -- is getting AI system certification. Being able to provide proof of the built-in governance through an external audit goes a long way. In Europe, the AI Act is a resource for institutions to assess their AI systems from a process or operational standpoint. In the U.S., the NAIC is a reference organization providing guiding principles for insurers to follow. Another option is for companies to align to a third-party organization for best practices.

    Can an AI System Be Self-criticizing and Self-sustaining?

    Creating an AI system that is both self-criticizing and self-sustaining is the goal. Through the design itself, the AI must adapt and learn, with the support of human common sense, which the machine cannot emulate.

    Companies that want to have a fair prediction outcome may analyze different metrics at various subgroup levels within a specific model feature (for example gender) because that can help identify and prevent biases before they go to market with consumer-facing capabilities. With any AI, making sure that it doesn’t fall into a trap called a Simpson’s Paradox is key. Simpson's Paradox, which also goes by several other names, is a phenomenon in probability and statistics where a trend appears in several groups of data but disappears or reverses when the groups are combined. Successfully preventing this from happening ensures that personal data does not penalize the client or consumer who it is supposed to benefit.
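    Simpson's Paradox is easiest to see with numbers. In this illustrative sketch (counts adapted from the classic kidney-stone example, here framed as two models scored across two subgroups), model A beats model B within each subgroup, yet B looks better once the subgroups are pooled:

    ```python
    # Hypothetical (successes, total) counts per subgroup and model
    data = {
        "group_1": {"A": (81, 87), "B": (234, 270)},
        "group_2": {"A": (192, 263), "B": (55, 80)},
    }

    def rate(successes, total):
        return successes / total

    # Within each subgroup, A outperforms B...
    for group, results in data.items():
        assert rate(*results["A"]) > rate(*results["B"])

    # ...yet pooled across subgroups, B looks better: Simpson's Paradox
    a_total = [sum(x) for x in zip(*(data[g]["A"] for g in data))]
    b_total = [sum(x) for x in zip(*(data[g]["B"] for g in data))]
    print(rate(*a_total) < rate(*b_total))  # True
    ```

    The reversal happens because the groups differ in size and base rate, which is why checking metrics at the subgroup level, and not just in aggregate, matters before going to market.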

    Responsible Use of AI Can Be a Powerful Advantage

    Companies are starting to pay attention to how responsible AI has the power to nurture a virtuous, profitable circle of customer retention through more reliable and robust data collection. There will be challenges in the ongoing refinement of ethical AI for many applications, but the strategic advantages and opportunities are clear. In insurance, the ability to monitor, control, and balance human bias can keep policy recommendations meant for certain races and genders fairly focused on the needs of those intended audiences. Responsible AI leads to stronger customer attraction and retention, and ultimately increased profitability.


    Companies globally are revving up their focus on data equity and fairness as a relevant risk to mitigate. Fortunately, they have options to choose from to protect themselves. AI offers an opportunity to accelerate more diverse, equitable interactions between humans and machines. Solutions can help large enterprises globally provide hyper-personalized, unbiased recommendations across channels. Respected trend analysts have called out data bias as a top business concern of 2022. Simultaneously, they identify responsible, ethical AI as a forward-thinking solution companies can deploy to increase customer and partner trust and boost profitability.

    How are you moving toward an ethical use of AI today?

    Author: Damien Philippon

    Source: TDWI

  • The key challenges in translating high quality data to value

    The key challenges in translating high quality data to value

    Most organizations consider their data quality to be either 'good' or 'very good', but there’s a disconnect around understanding and trust in the data and how it informs business decisions, according to new research from software company Syncsort.

    The company surveyed 175 data management professionals earlier this year, and found that 38% rated their data quality as good while 27% said it was very good.

    A majority of the respondents (69%) said their leadership trusts data insights enough to inform business decisions. Yet they also said only 14% of stakeholders had a very good understanding of the data. Of the 27% who reported sub-optimal data quality, 72% said it negatively affected business decisions.

    The top three challenges companies face when ensuring high quality data are multiple sources of data (70%), applying data governance processes (50%) and volume of data (48%).

    More than three quarters (78%) have challenges profiling or applying data quality to large data sets, and 29% said they have a partial understanding of the data that exists across their organization. About half (48%) said they have a good understanding.

    Fewer than 50% of the respondents said they take advantage of data profiling tools or data catalogs. Instead, they rely on other methods to gain an understanding of data. More than half use SQL queries and about 40% use business intelligence tools.
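    The kind of SQL profiling query those respondents rely on is easy to sketch: row count, null count, and distinct count per column. The table, columns, and SQLite back end below are hypothetical illustrations.

    ```python
    import sqlite3

    # Hypothetical customer table to profile
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (name TEXT, country TEXT)")
    conn.executemany("INSERT INTO customers VALUES (?, ?)",
                     [("Ana", "DE"), ("Bob", None), ("Cleo", "DE")])

    # Row count, null count, and distinct non-null count for one column
    profile = conn.execute("""
        SELECT COUNT(*),
               SUM(CASE WHEN country IS NULL THEN 1 ELSE 0 END),
               COUNT(DISTINCT country)
        FROM customers
    """).fetchone()
    print(profile)  # (3, 1, 1)
    ```

    Dedicated profiling tools and data catalogs automate exactly this kind of query across every table and column, which is why running it by hand in SQL does not scale.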

    Author: Bob Violino

    Source: Information-management

  • Understanding and taking advantage of smart data distancing

    Understanding and taking advantage of smart data distancing

    The ongoing COVID-19 pandemic has made the term 'social distancing' a cynosure of our daily conversations. There have been guidelines issued, media campaigns run on prime time, hashtags created, and memes shared to highlight how social distancing can save lives. When you have young children talking about it, you know the message has cut across the cacophony! This might give data scientists a clue of what they can do to garner enterprise attention towards the importance of better data management.

    While many enterprises kickstart their data management projects with much fanfare, egregious data quality practices can hamper the effectiveness of these projects, leading to disastrous results. In a 2016 research study, IBM estimated that bad quality data costs the U.S. economy around $3.1 trillion every year.

    And bad quality data affects the entire ecosystem; salespeople chase the wrong prospects, marketing campaigns do not reach the target segment, and delivery teams are busy cleaning up flawed projects. The good news is that it doesn’t have to be this way. The solution is 'smart data distancing'.

    What is smart data distancing?

    Smart data distancing is a crucial aspect of data management, more specifically data governance, for businesses to identify, create, maintain, and authenticate data assets and ensure they are devoid of data corruption or mishandling.

    The recent pandemic has forced governments and health experts to issue explicit guidelines on basic health etiquette; washing hands, using hand sanitizer, keeping social distance, etc. At times, even the most rudimentary facts need to be recapped multiple times so that they become accepted practices.

    Enterprises, too, should strongly emphasize the need for their data assets to be accountable, accurate, and consistent to reap the true benefits of data governance.

    The 7 do’s and don’ts of smart data distancing:

    1. Establish clear guidelines based on global best data management practices for the internal or external data lifecycle process. When accompanied by a good metadata management solution, which includes data profiling, classification, management, and organizing diverse enterprise data, this can vastly improve target marketing campaigns, customer service, and even new product development.

    2. Set up quarantine units for regular data cleansing or data scrubbing, matching, and standardization for all inbound and outbound data.

    3. Build centralized data asset management to optimize, refresh, and overcome data duplication issues for overall accuracy and consistency of data quality.

    4. Create data integrity standards using stringent constraint and trigger techniques. These techniques will impose restrictions against accidental damage to your data.

    5. Create periodic training programs for all data stakeholders on the right practices to gather and handle data assets and the need to maintain data accuracy and consistency. A data-driven culture will ensure the who, what, when, and where of your organization’s data and help bring transparency in complex processes.

    6. Don’t focus only on existing data that is readily available but also focus on the process of creating or capturing new and useful data. Responsive businesses create a successful data-driven culture that encompasses people, process, as well as technology.

    7. Don’t take your customer for granted. Always choose ethical data partners.
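    As an illustration of point 4, here is a minimal sketch of constraint and trigger techniques, using SQLite for convenience; the table, trigger, and values are hypothetical.

    ```python
    import sqlite3

    conn = sqlite3.connect(":memory:")
    # A CHECK constraint rejects impossible values at write time
    conn.execute("""
        CREATE TABLE policies (
            holder TEXT NOT NULL,
            premium REAL CHECK (premium > 0)
        )
    """)
    # A trigger records every insert for later auditing
    conn.execute("CREATE TABLE audit (action TEXT)")
    conn.execute("""
        CREATE TRIGGER log_insert AFTER INSERT ON policies
        BEGIN
            INSERT INTO audit VALUES ('insert:' || NEW.holder);
        END
    """)

    conn.execute("INSERT INTO policies VALUES ('Ana', 120.0)")
    try:
        conn.execute("INSERT INTO policies VALUES ('Bob', -5.0)")  # violates CHECK
    except sqlite3.IntegrityError:
        pass  # the bad row never lands in the table

    print(conn.execute("SELECT COUNT(*) FROM policies").fetchone()[0])  # 1
    ```

    The key property is that the restriction lives in the database itself, so accidental damage is blocked no matter which application or user attempts the write.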

    How to navigate your way around third-party data

    The COVID-19 crisis has clearly highlighted how prevention is better than a cure, and the need to maintain safe, minimal human contact has been stressed immensely. The same logic applies when enterprises rely on third-party data: the risks multiply, because an enterprise cannot ensure that a third-party data partner or vendor follows proper data quality processes and procedures.

    The questions that should keep you up at night are:

    • Will my third-party data partner disclose their data assessment and audit processes?
    • What are the risks involved, and how can they be best assessed, addressed, mitigated, and monitored?
    • Does my data partner have an adequate security response plan in case of a data breach?
    • Will a vendor agreement suffice in protecting my business interests?
    • Can an enterprise hold a third-party vendor accountable for data quality and data integrity lapses?  

    Smart data distancing for managing third-party data

    The third-party data risk landscape is complex. If the third-party’s data integrity is compromised, your organization stands to lose vital business data. However, here are a few steps you can take to protect your business:

    • Create a thorough information-sharing policy for protection against data leakage.
    • Streamline data dictionaries and metadata repositories to formulate a single cohesive data management policy that furthers the organization’s objectives.
    • Maintain quality of enterprise metadata to ensure its consistency across all organizational units to increase its trust value.
    • Integrate the linkage between business goals and the enterprise information running across the organization with the help of a robust metadata management system.
    • Schedule periodic training programs that emphasize the value of data integrity and its role in decision-making.

    The functional importance of a data steward in the data management and governance framework is often overlooked. The hallmark of a good data governance framework lies in how well the role of the data steward has been etched and fashioned within an organization. The data steward (or a custodian) determines the fitness levels of your data elements, the establishment of control, and the evaluation of vulnerabilities, and they remain on the frontline in managing any data breach. As a conduit between the IT and end-users, a data steward offers you a transparent overview of an organization’s critical data assets that can help you have nuanced conversations with your customers. 

    Unlock the benefits of smart data distancing

    Smart and unadulterated data is instrumental to the success of data governance. However, many enterprises often are content to just meet the bare minimum standards of compliance and regulation and tend to overlook the priority it deserves. Smart data means cleaner, high-quality data, which in turn means sharper analytics that directly translates to better decisions for better outcomes.

    Gartner says corporate data is valued at 20-25% of the enterprise value. Organizations should learn to monetize and use it wisely. Organizations can reap the benefits of the historical and current data that has been amassed over the years by harnessing and linking them to new business initiatives and projects. Data governance based on smart enterprise data will offer you the strategic competence to gain a competitive edge and improve operational efficiency.


    It is an accepted fact that an enterprise with poor data management will suffer an impact on its bottom line. Not having a properly defined data management framework can create regulatory compliance issues and impact business revenue.

    Enterprises are beginning to see the value of data in driving better outcomes and hence are rushing their efforts in setting up robust data governance initiatives. There are a lot of technology solutions and platforms available. Towards this endeavor, the first step for an enterprise is to develop a mindset of being data-driven and being receptive to a transformative culture.

    The objective is to ensure that enterprise data serves cross-functional business initiatives with insightful information, and for that to happen, the data needs to be accurate, meaningful, and trustworthy. Setting out to be a successful data-driven enterprise can be a daunting objective with a long transformational journey. Take a step in the right direction today with smart data distancing!

    Author: Sowmya Kandregula

    Source: Dataversity

  • What makes your data healthy data?

    What makes your data healthy data?

    If someone asked you what makes data “healthy”, what would you say? What IS data health? Healthy data just means data that is high quality, accessible, trusted, and secure, right? Wrong. 

    • Healthy data is data that provides business value. 
    • Data health depends on how well an organization's data supports its business objectives. 
    • Your data is unhealthy if it does not provide business value. 

    Let's dissect. Data health really has nothing to do with the data itself, if you think about it. It has everything to do with the state of your organization as a whole - whether you’re a university, a government entity, or a commercial business - and how well your data supports your current and long-term business objectives.  

    It’s so easy to think data quality = data health. Think instead: what is the biggest problem we have in the world of data today?   

    It’s not moving data, connecting to data sources, or moving from on-premise to the cloud. It’s not even data quality, integration, or management! In today’s world with SO many solutions to choose from, we have access to more tools than ever that let us connect, store, and move our data. 

    You probably already spend an awful lot of time and money getting your data loaded, managed, movable, and usable. But the question remains: are you getting any REAL value out of that data you spend so much time and money on?   

    The biggest problem businesses face is not getting value from their data 

    According to our 2021 Data Health survey, 64% of executives surveyed work with data every day while 44% of finance executives make the majority of their decisions without data. If you're in the majority, you're already a step ahead - but just working with data isn't enough. It has to be about delivering an outcome.

    Granted, that outcome can look different depending on who you talk to in your organization.

    Data health can mean different things to different roles 

    If you speak to the CEO vs the CMO vs the VP of Sales vs the Head of Compliance or IT, data health is going to mean something different to each of them. This is because every one of these business leaders has a different data health problem – but there's a common thread. 

    They're all not achieving their objectives because their data isn't enabling them to. To create a data health strategy with real business value, you have to start from the bottom: what are you trying to achieve?   

    What’s your most important business objective?  

    Managing data alone does not deliver value. Focus on value first: what are you trying to do? Often, companies have business objectives such as creating an intuitive marketing strategy, improving sales, or meeting regulatory compliance.  

    Once you outline your objectives, recognizing that they may differ by role, you can move to: 

    What data supports that objective? 

    Say you have a massive amount of data in a CRM, with intent data coming from multiple different systems. You want to bring that marketing data together to find your target audiences and tap into their needs, right?  

    Or maybe your marketing efforts for intent data are inefficient because your data is siloed and not being used to deliver the insights you want, as fast as you want them. 

    Consider what data you would need to achieve your business objectives, and then finally: 

    What’s stopping you from achieving that today? 

    You understand your goals, you know what you need to get there, and you're relying on your data to deliver business outcomes - but that’s not enough. You need the platform and the technology to be able to do it.  

    You need a platform that combines the concepts of data quality, trust, and accessibility, and that keeps you focused on achieving business initiatives - not just managing and moving around your data. In a data-driven world with endless options, you need a solution entirely focused on making your business outcomes a reality with (truly) healthy data.

    Author: Stu Garrow

    Source: Talend
