Data Disasters: 8 Infamous Analytics and AI Failures


Insights from data and machine learning algorithms can be invaluable, but mistakes can cost you reputation, revenue, or even lives. These high-profile analytics and AI blunders illustrate what can go wrong.

In 2017, The Economist declared that data, rather than oil, had become the world’s most valuable resource. The refrain has been repeated ever since. Organizations across every industry have invested heavily in data and analytics and continue to do so. But like oil, data and analytics have a dark side.

According to CIO’s State of the CIO 2023 report, 34% of IT leaders say that data and business analytics will drive the most IT investment at their organization this year. And 26% of IT leaders say machine learning/artificial intelligence will drive the most IT investment. Insights gained from analytics and actions driven by machine learning algorithms can give organizations a competitive advantage, but mistakes can be costly in terms of reputation, revenue, or even lives.

Understanding your data and what it’s telling you is important, but so is understanding your tools, knowing your data’s limitations, and keeping your organization’s values firmly in mind.

Here are a handful of high-profile analytics and AI blunders from the past decade to illustrate what can go wrong.

ChatGPT hallucinates court cases

Advances made in 2023 by large language models (LLMs) have stoked widespread interest in the transformative potential of generative AI across nearly every industry. OpenAI’s ChatGPT has been at the center of this surge in interest, foreshadowing how generative AI holds the power to disrupt the nature of work in nearly every corner of business.

But the technology still has a long way to go before it can reliably take over most business processes, as attorney Steven A. Schwartz learned when he found himself in hot water with US District Judge P. Kevin Castel in 2023 after using ChatGPT to research precedents in a suit against Colombian airline Avianca.

Schwartz, an attorney with Levidow, Levidow & Oberman, used the OpenAI generative AI chatbot to find prior cases to support a suit filed by passenger Roberto Mata for injuries he sustained on an Avianca flight in 2019. The only problem? At least six of the cases submitted in the brief did not exist. In a document filed in May, Judge Castel noted the cases submitted by Schwartz included false names and docket numbers, along with bogus internal citations and quotes.

In an affidavit, Schwartz told the court that it was the first time he had used ChatGPT as a legal research source and he was “unaware of the possibility that its content could be false.” He admitted that he had not confirmed the sources provided by the AI chatbot. He also said that he “greatly regrets having utilized generative artificial intelligence to supplement the legal research performed herein and will never do so in the future without absolute verification of its authenticity.”

As of June 2023, Schwartz was facing possible sanctions by the court.

AI algorithms identify everything but COVID-19

Since the COVID-19 pandemic began, numerous organizations have sought to apply machine learning (ML) algorithms to help hospitals diagnose or triage patients faster. But according to the UK’s Turing Institute, a national center for data science and AI, the predictive tools made little to no difference.

MIT Technology Review has chronicled a number of failures, most of which stem from errors in the way the tools were trained or tested. The use of mislabeled data or data from unknown sources was a common culprit.

Derek Driggs, a machine learning researcher at the University of Cambridge, together with his colleagues, published a paper in Nature Machine Intelligence that explored the use of deep learning models for diagnosing the virus. The paper concluded the technique was not fit for clinical use. For example, Driggs’ group found that its own model was flawed because it was trained on a data set that mixed scans of patients who were lying down with scans of patients who were standing up. Because patients who were lying down were much more likely to be seriously ill, the algorithm learned to identify COVID risk based on the position of the person in the scan.

In a similar case, an algorithm was trained with a data set that included scans of the chests of healthy children. The algorithm learned to identify children, not high-risk patients.

Zillow wrote down millions of dollars, slashed workforce due to algorithmic home-buying disaster

In November 2021, online real estate marketplace Zillow told shareholders it would wind down its Zillow Offers operations and cut 25% of the company’s workforce — about 2,000 employees — over the next several quarters. The home-flipping unit’s woes were the result of the error rate in the machine learning algorithm it used to predict home prices.

Zillow Offers was a program through which the company made cash offers on properties based on a “Zestimate” of home values derived from a machine learning algorithm. The idea was to renovate the properties and flip them quickly. But a Zillow spokesperson told CNN that the algorithm had a median error rate of 1.9%, and the error rate could be much higher, as much as 6.9%, for off-market homes.

CNN reported that Zillow bought 27,000 homes through Zillow Offers since its launch in April 2018 but sold only 17,000 through the end of September 2021. Black swan events like the COVID-19 pandemic and a home renovation labor shortage contributed to the algorithm’s accuracy troubles.

Zillow said the algorithm had led it to unintentionally purchase homes at higher prices than its current estimates of future selling prices, resulting in a $304 million inventory write-down in Q3 2021.

In a conference call with investors following the announcement, Zillow co-founder and CEO Rich Barton said it might be possible to tweak the algorithm, but ultimately it was too risky.

UK lost thousands of COVID cases by exceeding spreadsheet data limit

In October 2020, Public Health England (PHE), the UK government body responsible for tallying new COVID-19 infections, revealed that nearly 16,000 coronavirus cases went unreported between Sept. 25 and Oct. 2. The culprit? Data limitations in Microsoft Excel.

PHE uses an automated process to transfer COVID-19 positive lab results as a CSV file into Excel templates used by reporting dashboards and for contact tracing. Unfortunately, Excel spreadsheets can have a maximum of 1,048,576 rows and 16,384 columns per worksheet. Moreover, PHE was listing cases in columns rather than rows. When the cases exceeded the 16,384-column limit, Excel cut off the 15,841 records at the bottom.
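The failure mode described above is easy to guard against programmatically. The following is a minimal sketch (not PHE's actual pipeline) of a pre-export check that validates a table against Excel's per-worksheet limits, showing how a column-per-record layout hits the much smaller column limit long before the row limit:

```python
# Excel per-worksheet limits (.xlsx format)
EXCEL_MAX_ROWS = 1_048_576
EXCEL_MAX_COLS = 16_384

def fits_in_excel_sheet(rows):
    """Return True if a table (list of row lists) fits in one Excel worksheet."""
    n_rows = len(rows)
    n_cols = max((len(r) for r in rows), default=0)
    return n_rows <= EXCEL_MAX_ROWS and n_cols <= EXCEL_MAX_COLS

# A transposed layout -- one case per COLUMN, as in the PHE templates --
# exceeds the 16,384-column limit with only ~32,000 records.
cases_as_columns = [["case"] * 32_225]
print(fits_in_excel_sheet(cases_as_columns))  # False: 32,225 columns

# The same records stored one per ROW are nowhere near the row limit.
cases_as_rows = [["case"] for _ in range(32_225)]
print(fits_in_excel_sheet(cases_as_rows))     # True
```

A check like this, run before each automated transfer, would have flagged the overflow instead of silently truncating records.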

The “glitch” didn’t prevent individuals who got tested from receiving their results, but it did stymie contact tracing efforts, making it harder for the UK National Health Service (NHS) to identify and notify individuals who were in close contact with infected patients. In a statement on Oct. 4, Michael Brodie, interim chief executive of PHE, said NHS Test and Trace and PHE resolved the issue quickly and transferred all outstanding cases immediately into the NHS Test and Trace contact tracing system.

PHE put in place a “rapid mitigation” that splits large files and has conducted a full end-to-end review of all systems to prevent similar incidents in the future.

Healthcare algorithm failed to flag Black patients

In 2019, a study published in Science revealed that a healthcare prediction algorithm, used by hospitals and insurance companies throughout the US to identify patients in need of “high-risk care management” programs, was far less likely to single out Black patients.

High-risk care management programs provide trained nursing staff and primary-care monitoring to chronically ill patients in an effort to prevent serious complications. But the algorithm was much more likely to recommend white patients for these programs than Black patients.

The study found that the algorithm used healthcare spending as a proxy for determining an individual’s healthcare need. But according to Scientific American, the healthcare costs of sicker Black patients were on par with the costs of healthier white people, which meant they received lower risk scores even when their need was greater.

The study’s researchers suggested that a few factors may have contributed. First, people of color are more likely to have lower incomes, which, even when insured, may make them less likely to access medical care. Implicit bias may also cause people of color to receive lower-quality care.

While the study did not name the algorithm or the developer, the researchers told Scientific American they were working with the developer to address the situation.

Dataset trained Microsoft chatbot to spew racist tweets

In March 2016, Microsoft learned that using Twitter interactions as training data for machine learning algorithms can have dismaying results.

Microsoft released Tay, an AI chatbot, on the social media platform. The company described it as an experiment in “conversational understanding.” The idea was the chatbot would assume the persona of a teen girl and interact with individuals via Twitter using a combination of machine learning and natural language processing. Microsoft seeded it with anonymized public data and some material pre-written by comedians, then set it loose to learn and evolve from its interactions on the social network.

Within 16 hours, the chatbot posted more than 95,000 tweets, and those tweets rapidly turned overtly racist, misogynist, and anti-Semitic. Microsoft quickly suspended the service for adjustments and ultimately pulled the plug.

“We are deeply sorry for the unintended offensive and hurtful tweets from Tay, which do not represent who we are or what we stand for, nor how we designed Tay,” Peter Lee, corporate vice president, Microsoft Research & Incubations (then corporate vice president of Microsoft Healthcare), wrote in a post on Microsoft’s official blog following the incident.

Lee noted that Tay’s predecessor, Xiaoice, released by Microsoft in China in 2014, had successfully had conversations with more than 40 million people in the two years prior to Tay’s release. What Microsoft didn’t take into account was that a group of Twitter users would immediately begin tweeting racist and misogynist comments to Tay. The bot quickly learned from that material and incorporated it into its own tweets.

“Although we had prepared for many types of abuses of the system, we had made a critical oversight for this specific attack. As a result, Tay tweeted wildly inappropriate and reprehensible words and images,” Lee wrote.

Amazon AI-enabled recruitment tool only recommended men

Like many large companies, Amazon is hungry for tools that can help its HR function screen applications for the best candidates. In 2014, Amazon started working on AI-powered recruiting software to do just that. There was only one problem: The system vastly preferred male candidates. In 2018, Reuters broke the news that Amazon had scrapped the project.

Amazon’s system gave candidates star ratings from 1 to 5. But the machine learning models at the heart of the system were trained on 10 years’ worth of resumes submitted to Amazon — most of them from men. As a result of that training data, the system started penalizing phrases in the resume that included the word “women’s” and even downgraded candidates from all-women colleges.

At the time, Amazon said the tool was never used by Amazon recruiters to evaluate candidates.

The company tried to edit the tool to make it neutral, but ultimately decided it could not guarantee it would not learn some other discriminatory way of sorting candidates and ended the project.

Target analytics violated privacy

In 2012, an analytics project by retail titan Target showcased how much companies can learn about customers from their data. According to the New York Times, in 2002 Target’s marketing department started wondering how it could determine whether customers were pregnant. That line of inquiry led to a predictive analytics project that would famously lead the retailer to inadvertently reveal to a teenage girl’s family that she was pregnant. That, in turn, would lead to all manner of articles and marketing blogs citing the incident as part of advice for avoiding the “creepy factor.”

Target’s marketing department wanted to identify pregnant individuals because there are certain periods in life — pregnancy foremost among them — when people are most likely to radically change their buying habits. If Target could reach out to customers in that period, it could, for instance, cultivate new behaviors in those customers, getting them to turn to Target for groceries or clothing or other goods.

Like all other big retailers, Target had been collecting data on its customers via shopper codes, credit cards, surveys, and more. It mashed that data up with demographic data and third-party data it purchased. Crunching all that data enabled Target’s analytics team to determine that there were about 25 products sold by Target that could be analyzed together to generate a “pregnancy prediction” score. The marketing department could then target high-scoring customers with coupons and marketing messages.

Additional research would reveal that studying customers’ reproductive status could feel creepy to some of those customers. According to the Times, the company didn’t back away from its targeted marketing, but did start mixing in ads for things they knew pregnant women wouldn’t buy — including ads for lawn mowers next to ads for diapers — to make the ad mix feel random to the customer.

Date: July 3, 2023

Author: Thor Olavsrud

Source: CIO

Chatbots and their Struggle with Negation


Today’s language models are more sophisticated than ever, but they still struggle with the concept of negation. That’s unlikely to change anytime soon.

Nora Kassner suspected her computer wasn’t as smart as people thought. In October 2018, Google released a language model algorithm called BERT, which Kassner, a researcher in the same field, quickly loaded on her laptop. It was Google’s first language model that was self-taught on a massive volume of online data. Like her peers, Kassner was impressed that BERT could complete users’ sentences and answer simple questions. It seemed as if the large language model (LLM) could read text like a human (or better).

But Kassner, at the time a graduate student at Ludwig Maximilian University of Munich, remained skeptical. She felt LLMs should understand what their answers mean — and what they don’t mean. It’s one thing to know that a bird can fly. “A model should automatically also know that the negated statement — ‘a bird cannot fly’ — is false,” she said. But when she and her adviser, Hinrich Schütze, tested BERT and two other LLMs in 2019, they found that the models behaved as if words like “not” were invisible.

Since then, LLMs have skyrocketed in size and ability. “The algorithm itself is still similar to what we had before. But the scale and the performance is really astonishing,” said Ding Zhao, who leads the Safe Artificial Intelligence Lab at Carnegie Mellon University.

But while chatbots have improved their humanlike performances, they still have trouble with negation. They know what it means if a bird can’t fly, but they collapse when confronted with more complicated logic involving words like “not,” which is trivial to a human.

“Large language models work better than any system we have ever had before,” said Pascale Fung, an AI researcher at the Hong Kong University of Science and Technology. “Why do they struggle with something that’s seemingly simple while it’s demonstrating amazing power in other things that we don’t expect it to?” Recent studies have finally started to explain the difficulties, and what programmers can do to get around them. But researchers still don’t understand whether machines will ever truly know the word “no.”

Making Connections

It’s hard to coax a computer into reading and writing like a human. Machines excel at storing lots of data and blasting through complex calculations, so developers build LLMs as neural networks: statistical models that assess how objects (words, in this case) relate to one another. Each linguistic relationship carries some weight, and that weight — fine-tuned during training — codifies the relationship’s strength. For example, “rat” relates more to “rodent” than “pizza,” even if some rats have been known to enjoy a good slice.
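The weighted-relationship idea above can be made concrete with a toy example. The three-dimensional "embeddings" below are hand-picked illustrative numbers, not learned values (real models learn hundreds or thousands of dimensions from data), but they show how cosine similarity encodes that "rat" relates more to "rodent" than to "pizza":

```python
import math

# Hand-picked toy vectors -- illustrative only, not from any real model.
vectors = {
    "rat":    [0.90, 0.80, 0.10],
    "rodent": [0.85, 0.75, 0.15],
    "pizza":  [0.10, 0.20, 0.95],
}

def cosine(a, b):
    """Cosine similarity: 1.0 = same direction, near 0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# "rat" points in nearly the same direction as "rodent", not "pizza".
print(cosine(vectors["rat"], vectors["rodent"]) >
      cosine(vectors["rat"], vectors["pizza"]))  # True
```

During training, gradient updates nudge these numbers so that words appearing in similar contexts end up with similar vectors.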

In the same way that your smartphone’s keyboard learns that you follow “good” with “morning,” LLMs sequentially predict the next word in a block of text. The bigger the data set used to train them, the better the predictions, and as the amount of data used to train the models has increased enormously, dozens of emergent behaviors have bubbled up. Chatbots have learned style, syntax and tone, for example, all on their own. “An early problem was that they completely could not detect emotional language at all. And now they can,” said Kathleen Carley, a computer scientist at Carnegie Mellon. Carley uses LLMs for “sentiment analysis,” which is all about extracting emotional language from large data sets — an approach used for things like mining social media for opinions.

So new models should get the right answers more reliably. “But we’re not applying reasoning,” Carley said. “We’re just applying a kind of mathematical change.” And, unsurprisingly, experts are finding gaps where these models diverge from how humans read.

No Negatives

Unlike humans, LLMs process language by turning it into math. This helps them excel at generating text — by predicting likely combinations of text — but it comes at a cost.

“The problem is that the task of prediction is not equivalent to the task of understanding,” said Allyson Ettinger, a computational linguist at the University of Chicago. Like Kassner, Ettinger tests how language models fare on tasks that seem easy to humans. In 2019, for example, Ettinger tested BERT with diagnostics pulled from experiments designed to test human language ability. The model’s abilities weren’t consistent. For example:

He caught the pass and scored another touchdown. There was nothing he enjoyed more than a good game of ____. (BERT correctly predicted “football.”)

The snow had piled up on the drive so high that they couldn’t get the car out. When Albert woke up, his father handed him a ____. (BERT incorrectly guessed “note,” “letter,” “gun.”)

And when it came to negation, BERT consistently struggled.

A robin is not a ____. (BERT predicted “robin” and “bird.”)

On the one hand, it’s a reasonable mistake. “In very many contexts, ‘robin’ and ‘bird’ are going to be predictive of one another because they’re probably going to co-occur very frequently,” Ettinger said. On the other hand, any human can see it’s wrong.

By 2023, OpenAI’s ChatGPT and Google’s bot, Bard, had improved enough to predict that Albert’s father had handed him a shovel instead of a gun. Again, this was likely the result of increased and improved data, which allowed for better mathematical predictions.

But the concept of negation still tripped up the chatbots. Consider the prompt, “What animals don’t have paws or lay eggs, but have wings?” Bard replied, “No animals.” ChatGPT correctly replied bats, but also included flying squirrels and flying lemurs, which do not have wings. In general, “negation [failures] tended to be fairly consistent as models got larger,” Ettinger said. “General world knowledge doesn’t help.”

Invisible Words

The obvious question becomes: Why don’t the phrases “do not” or “is not” simply prompt the machine to ignore the best predictions from “do” and “is”?

That failure is not an accident. Negations like “not,” “never” and “none” are known as stop words, which are functional rather than descriptive. Compare them to words like “bird” and “rat” that have clear meanings. Stop words, in contrast, don’t add content on their own. Other examples include “a,” “the” and “with.”

“Some models filter out stop words to increase the efficiency,” said Izunna Okpala, a doctoral candidate at the University of Cincinnati who works on perception analysis. Nixing every “a” and so on makes it easier to analyze a text’s descriptive content. You don’t lose meaning by dropping every “the.” But the process sweeps out negations as well, meaning most LLMs just ignore them.
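The effect is easy to demonstrate. The sketch below uses a small hand-rolled stop-word list (not any particular library's list) to show how filtering erases exactly the word that reverses the sentence's meaning:

```python
# Minimal illustration of stop-word filtering sweeping out negation.
# The stop-word set is a small hand-picked sample for demonstration.
STOP_WORDS = {"a", "an", "the", "is", "with", "not", "never", "no"}

def strip_stop_words(text):
    """Drop stop words, keeping only 'descriptive' tokens."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

affirmative = "the bird is able to fly"
negated     = "the bird is not able to fly"

print(strip_stop_words(affirmative))  # ['bird', 'able', 'to', 'fly']
print(strip_stop_words(negated))      # ['bird', 'able', 'to', 'fly'] -- identical
```

After filtering, the two opposite sentences are indistinguishable, which is precisely why a pipeline built this way cannot see negation.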

So why can’t LLMs just learn what stop words mean? Ultimately, because “meaning” is something orthogonal to how these models work. Negations matter to us because we’re equipped to grasp what those words do. But models learn “meaning” from mathematical weights: “Rose” appears often with “flower,” “red” with “smell.” And it’s impossible to learn what “not” is this way.

Kassner says the training data is also to blame, and more of it won’t necessarily solve the problem. Models mainly train on affirmative sentences because that’s how people communicate most effectively. “If I say I’m born on a certain date, that automatically excludes all the other dates,” Kassner said. “I wouldn’t say ‘I’m not born on that date.’”

This dearth of negative statements undermines a model’s training. “It’s harder for models to generate factually correct negated sentences, because the models haven’t seen that many,” Kassner said.

Untangling the Not

If more training data isn’t the solution, what might work? Clues come from an analysis posted in March, in which Myeongjun Jang and Thomas Lukasiewicz, computer scientists at the University of Oxford (Lukasiewicz is also at the Vienna University of Technology), tested ChatGPT’s negation skills. They found that ChatGPT was a little better at negation than earlier LLMs, even though the way LLMs learned remained unchanged. “It is quite a surprising result,” Jang said. He believes the secret weapon was human feedback.

The ChatGPT algorithm had been fine-tuned with “human-in-the-loop” learning, where people validate responses and suggest improvements. So when users noticed ChatGPT floundering with simple negation, they reported that poor performance, allowing the algorithm to eventually get it right.

John Schulman, a developer of ChatGPT, described in a recent lecture how human feedback was also key to another improvement: getting ChatGPT to respond “I don’t know” when confused by a prompt, such as one involving negation. “Being able to abstain from answering is very important,” Kassner said. Sometimes “I don’t know” is the answer.

Yet even this approach leaves gaps. When Kassner prompted ChatGPT with “Alice is not born in Germany. Is Alice born in Hamburg?” the bot still replied that it didn’t know. She also noticed it fumbling with double negatives like “Alice does not know that she does not know the painter of the Mona Lisa.”

“It’s not a problem that is naturally solved by the way that learning works in language models,” Lukasiewicz said. “So the important thing is to find ways to solve that.”

One option is to add an extra layer of language processing to negation. Okpala developed one such algorithm for sentiment analysis. His team’s paper, posted in February, describes applying a library called WordHoard to catch negation words like “not” and antonyms in general. It’s a simple algorithm that researchers can plug into their own tools and language models. “It proves to have higher accuracy compared to just doing sentiment analysis alone,” Okpala said. When he combined his code and WordHoard with three common sentiment analyzers, they all improved in accuracy in extracting opinions — the best one by 35%.
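To illustrate the general idea of a negation-aware layer — this is a toy sketch in the spirit of that approach, not WordHoard's actual API, and the word scores and negation list are invented for the example — a scorer can flip the polarity of a word that follows a negation:

```python
# Illustrative sentiment lexicon and negation list (assumptions, not real data).
SENTIMENT = {"good": 1.0, "great": 1.0, "bad": -1.0, "terrible": -1.0}
NEGATIONS = {"not", "never", "no"}

def score(text):
    """Sum word sentiments, flipping the sign of any word that
    immediately follows a negation word."""
    words = text.lower().split()
    total = 0.0
    for i, word in enumerate(words):
        value = SENTIMENT.get(word, 0.0)
        if i > 0 and words[i - 1] in NEGATIONS:
            value = -value
        total += value
    return total

print(score("the service was good"))      # 1.0
print(score("the service was not good"))  # -1.0: the negation flips the score
```

A plain bag-of-words analyzer that discards "not" would score both sentences identically; the extra layer is what recovers the reversal.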

Another option is to modify the training data. When working with BERT, Kassner used texts with an equal number of affirmative and negated statements. The approach helped boost performance in simple cases where antonyms (“bad”) could replace negations (“not good”). But this is not a perfect fix, since “not good” doesn’t always mean “bad.” The space of “what’s not” is simply too big for machines to sift through. “It’s not interpretable,” Fung said. “You’re not me. You’re not shoes. You’re not an infinite amount of things.” 

Finally, since LLMs have surprised us with their abilities before, it’s possible even larger models with even more training will eventually learn to handle negation on their own. Jang and Lukasiewicz are hopeful that diverse training data, beyond just words, will help. “Language is not only described by text alone,” Lukasiewicz said. “Language describes anything. Vision, audio.” OpenAI’s new GPT-4 integrates text, audio and visuals, making it reportedly the largest “multimodal” LLM to date.

Future Not Clear

But while these techniques, together with greater processing and data, might lead to chatbots that can master negation, most researchers remain skeptical. “We can’t actually guarantee that that will happen,” Ettinger said. She suspects it’ll require a fundamental shift, moving language models away from their current objective of predicting words.

After all, when children learn language, they’re not attempting to predict words, they’re just mapping words to concepts. They’re “making judgments like ‘is this true’ or ‘is this not true’ about the world,” Ettinger said.

If an LLM could separate true from false this way, it would open the possibilities dramatically. “The negation problem might go away when the LLM models have a closer resemblance to humans,” Okpala said.

Of course, this might just be switching one problem for another. “We need better theories of how humans recognize meaning and how people interpret texts,” Carley said. “There’s just a lot less money put into understanding how people think than there is to making better algorithms.”

And dissecting how LLMs fail is getting harder, too. State-of-the-art models aren’t as transparent as they used to be, so researchers evaluate them based on inputs and outputs, rather than on what happens in the middle. “It’s just proxy,” Fung said. “It’s not a theoretical proof.” So what progress we have seen isn’t even well understood.

And Kassner suspects that the rate of improvement will slow in the future. “I would have never imagined the breakthroughs and the gains we’ve seen in such a short amount of time,” she said. “I was always quite skeptical whether just scaling models and putting more and more data in it is enough. And I would still argue it’s not.”

Date: June 2, 2023

Author: Max G. Levy

Source: Quanta Magazine

Helping Business Executives Understand Machine Learning

Helping Business Executives Understand Machine Learning

For data science teams to succeed, business leaders need to understand the importance of MLops, modelops, and the machine learning life cycle. Try these analogies and examples to cut through the jargon.

If you’re a data scientist or you work with machine learning (ML) models, you have tools to label data, technology environments to train models, and a fundamental understanding of MLops and modelops. If you have ML models running in production, you probably use ML monitoring to identify data drift and other model risks.

Data science teams use these essential ML practices and platforms to collaborate on model development, to configure infrastructure, to deploy ML models to different environments, and to maintain models at scale. Others who are seeking to increase the number of models in production, improve the quality of predictions, and reduce the costs in ML model maintenance will likely need these ML life cycle management tools, too.

Unfortunately, explaining these practices and tools to business stakeholders and budget decision-makers isn’t easy. It’s all technical jargon to leaders who want to understand the return on investment and business impact of machine learning and artificial intelligence investments and would prefer staying out of the technical and operational weeds.

Data scientists, developers, and technology leaders recognize that getting buy-in requires defining and simplifying the jargon so stakeholders understand the importance of key disciplines. Following up on a previous article about how to explain devops jargon to business executives, I thought I would write a similar one to clarify several critical ML practices that business leaders should understand.   

What is the machine learning life cycle?

As a developer or data scientist, you have an engineering process for taking new ideas from concept to delivering business value. That process includes defining the problem statement, developing and testing models, deploying models to production environments, monitoring models in production, and enabling maintenance and improvements. We call this a life cycle process, knowing that deployment is the first step to realizing the business value and that once in production, models aren’t static and will require ongoing support.

Business leaders may not understand the term life cycle. Many still perceive software development and data science work as one-time investments, which is one reason why many organizations suffer from tech debt and data quality issues.

Explaining the life cycle with technical terms about model development, training, deployment, and monitoring will make a business executive’s eyes glaze over. Marcus Merrell, vice president of technology strategy at Sauce Labs, suggests providing leaders with a real-world analogy.

“Machine learning is somewhat analogous to farming: The crops we know today are the ideal outcome of previous generations noticing patterns, experimenting with combinations, and sharing information with other farmers to create better variations using accumulated knowledge,” he says. “Machine learning is much the same process of observation, cascading conclusions, and compounding knowledge as your algorithm gets trained.”

What I like about this analogy is that it illustrates generative learning from one crop year to the next but can also factor in real-time adjustments that might occur during a growing season because of weather, supply chain, or other factors. Where possible, it may be beneficial to find analogies in your industry or a domain your business leaders understand.

What is MLops?

Most developers and data scientists think of MLops as the equivalent of devops for machine learning. Automating infrastructure, deployment, and other engineering processes improves collaborations and helps teams focus more energy on business objectives instead of manually performing technical tasks.

But all this is in the weeds for business executives who need a simple definition of MLops, especially when teams need budget for tools or time to establish best practices.

“MLops, or machine learning operations, is the practice of collaboration and communication between data science, IT, and the business to help manage the end-to-end life cycle of machine learning projects,” says Alon Gubkin, CTO and cofounder of Aporia. “MLops is about bringing together different teams and departments within an organization to ensure that machine learning models are deployed and maintained effectively.”

Thibaut Gourdel, technical product marketing manager at Talend, suggests adding some detail for the more data-driven business leaders. He says, “MLops promotes the use of agile software principles applied to ML projects, such as version control of data and models as well as continuous data validation, testing, and ML deployment to improve repeatability and reliability of models, in addition to your teams’ productivity.”

What is data drift?

Whenever you can use words that convey a picture, it’s much easier to connect the term with an example or a story. An executive understands what drift is from examples such as a boat drifting off course because of the wind, but they may struggle to translate it to the world of data, statistical distributions, and model accuracy.

“Data drift occurs when the data the model sees in production no longer resembles the historical data it was trained on,” says Krishnaram Kenthapadi, chief AI officer and scientist at Fiddler AI. “It can be abrupt, like the shopping behavior changes brought on by the COVID-19 pandemic. Regardless of how the drift occurs, it’s critical to identify these shifts quickly to maintain model accuracy and reduce business impact.”

Gubkin offers a second example, in which data drift is a more gradual shift away from the data the model was trained on: “Data drift is like a company’s products becoming less popular over time because consumer preferences have changed.”

David Talby, CTO of John Snow Labs, shares a more general analogy. “Model drift happens when accuracy degrades due to the changing production environment in which it operates,” he says. “Much like a new car’s value declines the instant you drive it off the lot, a model does the same, as the predictable research environment it was trained on behaves differently in production. Regardless of how well it’s operating, a model will always need maintenance as the world around it changes.”

The important message for data science leaders to convey is that because data isn’t static, models must be reviewed for accuracy and retrained on more recent, relevant data.
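One common way teams quantify drift is the population stability index (PSI), which compares how a feature’s values are distributed in training data versus production data. The sketch below is a simplified pure-Python version; the bucket count and the usual PSI rule of thumb (scores above roughly 0.25 indicating significant drift) are conventions, not hard rules:

```python
import math
import random


def psi(expected, actual, buckets=10):
    """Population Stability Index between a training (expected) and a
    production (actual) sample of one numeric feature. Higher = more drift."""
    lo, hi = min(expected), max(expected)
    step = (hi - lo) / buckets or 1.0

    def shares(values):
        counts = [0] * buckets
        for v in values:
            idx = min(max(int((v - lo) / step), 0), buckets - 1)
            counts[idx] += 1
        # Floor each share at a small value so the log below is defined
        return [max(c / len(values), 1e-4) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Usage: compare training data with a shifted production sample
random.seed(0)
train = [random.gauss(0, 1) for _ in range(1000)]
shifted = [random.gauss(1, 1) for _ in range(1000)]
drift_score = psi(train, shifted)
```

An abrupt shift like the pandemic-era behavior change Kenthapadi describes would show up as a sudden jump in a score like this, while gradual preference changes would show up as a slow climb across reporting periods.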

What is ML monitoring?

How does a manufacturer measure quality before their products are boxed and shipped to retailers and customers? Manufacturers use different tools to identify defects, including when an assembly line is beginning to show deviations from acceptable output quality. If we think of an ML model as a small manufacturing plant producing forecasts, then it makes sense that data science teams need ML monitoring tools to check for performance and quality issues. Katie Roberts, data science solution architect at Neo4j, says, “ML monitoring is a set of techniques used during production to detect issues that may negatively impact model performance, resulting in poor-quality insights.”

Manufacturing and quality control is an easy analogy, and Hillary Ashton, chief product officer at Teradata, adds a recommendation with ML model monitoring specifics: “As companies accelerate investment in AI/ML initiatives, AI models will increase drastically from tens to thousands. Each needs to be stored securely and monitored continuously to ensure accuracy.”
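Continuing the assembly-line analogy, a monitor can track a model’s rolling accuracy in production and raise a flag when it deviates too far from the baseline measured at deployment. This is a simplified, hypothetical sketch (class and parameter names are illustrative); real monitoring tools track many more signals, such as latency, drift, and data quality:

```python
from collections import deque


class ModelMonitor:
    """Track rolling accuracy of a deployed model and flag degradation,
    like a quality-control check on an assembly line."""

    def __init__(self, baseline=0.90, tolerance=0.05, window=100):
        self.baseline = baseline      # accuracy measured at deployment
        self.tolerance = tolerance    # acceptable deviation before alerting
        self.outcomes = deque(maxlen=window)  # recent hit/miss results

    def record(self, prediction, actual):
        self.outcomes.append(prediction == actual)

    @property
    def accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def degraded(self):
        acc = self.accuracy
        return acc is not None and acc < self.baseline - self.tolerance
```

At the scale Ashton describes, with thousands of models, checks like this have to be automated and centralized rather than run ad hoc by individual teams.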

What is modelops?

MLops focuses on multidisciplinary teams collaborating on developing, deploying, and maintaining models. But how should leaders decide what models to invest in, which ones require maintenance, and where to create transparency around the costs and benefits of artificial intelligence and machine learning?

These are governance concerns and part of what modelops practices and platforms aim to address. Business leaders want modelops but won’t fully understand the need and what it delivers until it’s partially implemented.

That’s a problem, especially for enterprises that seek investment in modelops platforms. Nitin Rakesh, CEO and managing director of Mphasis, suggests explaining modelops this way: “By focusing on modelops, organizations can ensure machine learning models are deployed and maintained to maximize value and ensure governance for different versions.”

Ashton suggests including one example practice. “Modelops allows data scientists to identify and remediate data quality risks, automatically detect when models degrade, and schedule model retraining,” she says.

There are still many new ML and AI capabilities, algorithms, and technologies with confusing jargon that will seep into a business leader’s vocabulary. When data specialists and technologists take time to explain the terminology in language business leaders understand, they are more likely to get collaborative support and buy-in for new investments.

Author: Isaac Sacolick

Source: InfoWorld