Machine learning is an established discipline and tool set that has proven extremely powerful across a wide range of applications. This power has brought with it a number of challenges and various types of incidents — many of them related directly to the methods used to build the models themselves. Considerations like model robustness, transparency and fairness are critical for risk management and organizational adoption of machine learning. Moreover, many jurisdictions are increasing their regulatory requirements in these areas, adding legal urgency to ethical ML questions.
Join Patrick Hall as he discusses AI incidents, the nascent NIST AI risk management framework and how PiML can streamline your journey to higher quality machine learning implementations.
Patrick Hall
Principal Scientist, BNH.AI
Visiting Faculty at the George Washington University School of Business
January 20, 2023
Transcript
Brian Skinn [00:00:01] Hello, everyone, and welcome to today's OpenTeams Community Tech Shares event. OpenTeams gives you access to the best open source architects in the world to help you train, support, and deliver your software solutions. Today we are privileged to bring together some of the best open source architects in the interpretable ML and ML fairness spaces to share their knowledge with the community. My name is Brian Skinn. I'll be your host for today.
Today's Tech Share is a technology showcase entitled 'Quality AI ML Models by Design with PiML'. PiML is an integrated Python toolbox for interpretable machine learning that exposes numerous tools for evaluating ML models in a user-friendly way. We welcome audience questions on today's sessions, but we plan to hold them for the final roundtable session. So, please post any questions as they come up, and we will address them at that time.
In this first session, Patrick Hall will answer the core question 'Why PiML?'. Patrick is principal scientist at BNH.AI, a boutique law firm focusing on model audit and AI risk management matters. He also serves as visiting faculty in the Department of Decision Sciences at the George Washington University School of Business, teaching and researching AI risk management, data ethics, and machine learning. Prior to co-founding BNH, Patrick led H2O.ai's efforts in responsible AI and worked as a senior machine learning scientist at SAS Institute. So, Patrick, welcome. Thank you for sharing today. Go ahead and pop up your slides if you would.
Patrick Hall [00:01:25] Sure. Let me make sure I can do that.
Brian Skinn [00:01:31] That looks good. All right, and I’ll leave the floor to you. It’s all yours.
Patrick Hall [00:01:37] Thank you so much for that kind introduction. And just before I dive into what I hope is a very interesting slide, I’ll just say that I’m the first of several speakers to be followed by a panel.
I'm going to get into the basic motivation of why you might want to think about explainable, fair, and debugged machine learning models versus the standard black-box workflow that a lot of people work with today. And so I'm going to tee that up, and we'll have lots of later presentations and a panel discussion to get into more details.
But what you see on the screen is what many are starting to call AI incidents. So, this is a sampling of different headlines relating to public reports of AI system problems. And I want to use this to motivate, again, why you might want to start thinking about explainable models, or models that have been debugged to make sure they perform as expected, or models whose implicit bias, learned from their training data and from some of the biased decisions humans make when designing them, has been managed. And so, again, this is a sampling of how things can go wrong. I want to be clear that there are thousands of reports of hundreds of different incidents. So, failures of AI and machine learning based systems are not even particularly rare; they're just something that gets a ton of attention.
So, I'm going to start with this face that you see in the background. I think many people are familiar with Tay now, but just in case you're not: Tay was a chatbot from Microsoft Research. We were talking about ChatGPT just like everybody else before the session, and Microsoft is a leader in responsible AI. And so even an organization like Microsoft Research can release a chatbot and have it go wildly wrong, like Tay did. Tay had a short existence of about, I believe, 16 hours on Twitter. In that time, it tweeted about 90,000 times. So, when we're thinking about these failures of AI systems, we should keep in mind that scale and speed are very critical factors when these systems go wrong. They can go wrong very quickly, and they can go wrong at a global scale.
And so, what happened with Tay? Again, Tay is very illustrative. I would say that Tay was a security incident at first. I'm not sure of the algorithm behind Tay, but it was some kind of adaptive learning or reinforcement learning algorithm where it learned from the things that people were saying to it. And Twitter users very quickly learned that if they said nasty things to Tay, Tay would take up those nasty things and repeat them back. By the end of its 16-hour existence on Twitter, Tay was basically a constant stream of pornographic, racist, horrible, toxic language.
And so, what happened there is, I would argue, that a security incident changed into a bias incident. So that’s another thing. AI systems are very complex, and when they fail, the failure can be multifaceted. So, I would say that Tay was a security incident that quickly morphed into an algorithmic bias incident. And, so there’s a lot that we can learn from Tay, and we’ll talk about how people are not learning from Tay before we move away from this slide.
So, there are many different kinds of incidents involving AI. I would say the most common type, at least in the U.S., is algorithmic discrimination. And we'll get into that later with some of the subsequent speakers. But these systems can go wrong in many other ways. Another thing that you might see on the slide is the very sad and notorious incident in 2018 where a self-driving Uber test car ran over and killed a pedestrian who was crossing the street at night in Arizona. And we'll get into some details on that one in a minute too.
Another item that might be more pertinent for the finance crowd is the very notorious rollout of the Apple Card, which involved allegations of gender discrimination. There was this issue where women were reportedly getting credit lines ten times lower than their husbands'. Now, this was investigated by New York state regulators, and they determined it wasn't a criminal or regulatory violation. But still, it's not the kind of publicity that you want when you're rolling out a flashy new credit product.
Another interesting failure mode, or mode in which these incidents happen, is abuse. I would say that things like Tay, the Uber incident, and the Apple Card rollout essentially boil down to mistakes or accidents. But there are also abuses of AI systems, and we're starting to see this too.
So, people are starting to use deepfakes for generating disinformation. And while I wouldn't call what you see here in this headline disinformation (it was actually a very ingenious use of a deepfake to deliver a political speech in different languages), people are starting to use deepfakes for misinformation, whether that's deepfake videos, which I think is what most people think about, or, now that we have the ability, deepfake text and deepfake images.
So, let's maybe circle back to Tay, and then to this Uber incident. Tay was 2016. Then, in 2021, the South Korean company Scatter Labs released a chatbot called Lee Luda. And Lee Luda had nearly the exact same failure modes as Tay. Lee Luda was a chatbot put on social media, on the very popular South Korean chat app Kakao, and it immediately started getting into offensive language, but with the added twist that it was also handing out people's personal details. It was apparently trained on non-anonymized data, and there were no controls to prevent it from handing out people's personal information from its training data. So, I would argue that Lee Luda is essentially a repeat of Tay. And it's very clear that when the designers built Lee Luda, they were not thinking about failure. Because if they had been thinking about the possibility of failure or incidents or risk management, they would have done some research on what had gone wrong in the past. And they would have said: 'Oh, hey, maybe we should avoid doing the things that Microsoft Research did with Tay, because that did not go very well.'
So, I think that points to a sort of lack of maturity out there in the broader AI space. And I think, if we're honest with ourselves, that's okay, as AI is a relatively new technology, and there are lots of lessons to be learned from people who have been designing machine learning tools for a long time. And that's why I'm so excited to be a part of this PiML group: in certain verticals of the US economy and in national security, machine learning has actually been used for a long time, and people have learned to do smart things in risk management, like looking into past failures and avoiding them.
I will also point out that, in the case of this self-driving Uber incident, there had been 37 previous crashes of self-driving Ubers before one actually killed someone. So, again, it should be a little alarming that not only were there external incidents they could have drawn from, there were internal incidents that they should have been analyzing to try not to repeat. Another striking thing about the Uber crash: after the crash, they were able to do simulation testing and sensitivity analysis and debugging, things that we'll be talking about later in this presentation, and they found that had their software been a little bit different, the car would have been able to stop in time. Of course, you want to do that kind of testing before you release your AI system, not afterwards. And just to be clear, it's not me saying these things about Uber; you can go read the federal government report about it. The federal government report on the 2018 self-driving car crash is very damning. It actually says things like: 'Uber software had no conception of jaywalking pedestrians.'
And so, again, it helps us think through these types of real-world failure modes that aren't going to be reflected in test data AUC or something like that, failure modes that we need to be thinking about before we deploy a system into the real world. So, with that, I'm going to transition to why PiML helps with these problems, and to some other tools and frameworks that can help us with them as well.
So, very exciting: NIST, the United States National Institute of Standards and Technology, is going to release version 1.0 of their AI Risk Management Framework next week. And I think that's actually going to be a pretty big deal for uses of AI in the U.S. economy. NIST, a widely respected government guidance body, is now going to be presenting official but voluntary guidance on how to manage risk in AI and machine learning systems. A precursor of this, getting back to the theme that there are parts of the economy, such as consumer finance, where machine learning has been used for a long time and hard lessons have essentially been learned, is a really interesting framework in consumer finance called model risk management.
And so, on this call and in this presentation, we have some of the world's experts in model risk management. There's a lot to learn there. And, as you might start to suspect, PiML is designed by people who have a long-running history in model risk management, so it's aligned to these known public frameworks.
Well, what do these known public frameworks say? They say things about the process and cultural approaches that we need to manage risk in AI and machine learning systems. So, things like governance: we really need rules and written policies around the humans who interact with, maintain, build, and monitor these systems. There's a very prominent notion of what's called 'effective challenge'. Essentially, that means that if you're designing an AI system, you need to be questioned by someone with the skills, the knowledge, and the ability to make changes to your system; you need to be able to take questions from them, defend your work, and make sure that you've made good choices. That's broadly aligned with notions of accountability: if something goes wrong in an AI system, if harm occurs, there should be some kind of accountability for the people who made or operated the system. And then there are broader topics like transparency and documentation. Do the people who design the system really understand it? Can they write a document attesting to their understanding? And the other side of transparency is: do the end users understand the system?
And a very particular notion of transparency in consumer finance is known as adverse action notices. Certain credit decisions, certain credit denial decisions, should be accompanied by an explanation that would allow a consumer to appeal the decision if it was wrong. Those are some of the high-level things that these frameworks talk about. They then get into more technical details, and it's those technical details, which I'm about to talk about, that enable these higher-level human notions of risk management like governance, accountability, and transparency.
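As a rough illustration of the idea behind adverse action notices, the minimal sketch below turns local, per-feature attributions for a denied applicant (for example, SHAP values or the additive terms of a GAM-style model) into candidate reason codes. The feature names and attribution values are hypothetical, and real reason-code generation involves considerably more care.

```python
# Minimal sketch: deriving candidate adverse action reasons from local
# feature attributions. All names and values below are hypothetical.

def adverse_action_reasons(contributions, n_reasons=4):
    """Return the features that pushed a denied applicant's score down the most.

    contributions: dict mapping feature name -> signed local attribution,
                   where negative values lower the applicant's score.
    """
    negative = [(name, value) for name, value in contributions.items() if value < 0]
    negative.sort(key=lambda item: item[1])  # most negative contribution first
    return [name for name, _ in negative[:n_reasons]]

# Hypothetical local attributions for one denied applicant
applicant = {
    "debt_to_income_ratio": -0.31,
    "recent_delinquencies": -0.22,
    "utilization_rate": -0.18,
    "length_of_credit_history": -0.05,
    "income": 0.12,
}

print(adverse_action_reasons(applicant, n_reasons=3))
# ['debt_to_income_ratio', 'recent_delinquencies', 'utilization_rate']
```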
So, what are some of these technical approaches that enable governance, accountability, and transparency? Well, a key term comes up in SR 11-7, the initial guidance around model risk management. It's a great document, written in pretty plain language, only about 20 pages long, and I certainly suggest you read it. It talks about soundness a lot, and that essentially means that the math behind the model is valid and that the model is a tool that will be reliable. Valid and reliable are characteristics from the NIST AI Risk Management Framework. So, I think these notions of mathematical validity, reliability as a tool, and mathematical soundness are, of course, very important to designing well-functioning machine learning and AI systems.
Safety is a crucial consideration, and I'm not sure that in consumer finance we have to worry about physical safety so much. But you can think back to that Uber incident I was describing and understand why safety is really important. AI systems are starting to interact with the physical world, and they do hurt people. I can't remember if it was on the slide I just showed or not, but another really stark incident to think about: at an Amazon warehouse, a robot got into a case of bear mace, a very powerful chemical irritant, and injured dozens of people. So, these AI systems can physically hurt people. I don't think that's what we'll be talking about today, but it's a really important thing to keep in mind. Then there's bias management: systems learn implicit biases from their training data and human cognitive biases as they're being developed by humans. And security: like we talked about with Tay, these systems can be hacked, they can be manipulated, and people can extract IP from them. And then, really fundamentally, we need the system to be explainable and interpretable. We need people to understand what's going on so that they can think through whether it makes sense or not. And then, of course, data privacy is very serious. Data privacy interacts with machine learning both in the sense that we oftentimes need to collect and use a lot of sensitive training data and in the sense that there are new attacks on machine learning systems that allow people to access training data. So, privacy is a big consideration, too.
And, essentially, there's a lot of good open-source software out there, but PiML is the only library I'm aware of that treats all of these different technical aspects of responsible AI and machine learning risk management. So, what I'm trying to say here is that the explainable, transparent models in PiML, the post hoc explanation and summarization approaches, the testing that the system allows you to do, and especially the bias testing that the system allows you to do, really feed into these characteristics of what makes a system trustworthy, which then enables these higher-level human aspects of governance, accountability, and transparency. So, while it takes more than technical know-how to make a trustworthy AI system, PiML provides the bottom-line technical functionalities that you need to eventually build up to governance, accountability, and transparency. I'm really excited to see what the next speakers have to say, and I'm going to leave my comments there and just hand the mic back.
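For a concrete sense of the workflow being described, here is a minimal sketch of PiML's low-code panels as documented around the time of this talk; the panel method names, the built-in "CoCircles" dataset, and the example model list are recollections from the documentation, not a definitive API reference, and should be checked against the installed release.

```python
# Minimal sketch of a PiML low-code session (intended for a Jupyter notebook).
# Panel/method names and the "CoCircles" demo dataset follow the PiML docs
# at the time of this talk and should be verified against the current version.
from piml import Experiment

exp = Experiment()

exp.data_loader(data="CoCircles")  # load a built-in demo dataset (or pass a pandas DataFrame)
exp.data_summary()                 # inspect distributions and basic statistics
exp.data_prepare()                 # choose the target, train/test split, and preprocessing

exp.model_train()                  # fit inherently interpretable models (e.g., GLM, GAM, tree-based)
exp.model_interpret()              # inspect the structure of the fitted interpretable models
exp.model_explain()                # post hoc explanations (e.g., feature importance, partial dependence)
exp.model_diagnose()               # testing: error analysis, robustness, reliability, overfit checks
exp.model_compare()                # side-by-side comparison of registered models
```

The point of this panel-style design is that interpretation, explanation, and diagnostic testing sit in the same workflow as model training rather than being bolted on afterwards.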
Brian Skinn [00:18:48] Terrific. That was a great introduction to the challenges, Patrick, and it set the stage nicely for the opportunity that PiML represents. Thank you very much.