Quality AI/ML Models by Design with PiML: PiML and Model Fairness

About

One of the significant concerns that emerged after extended use of machine learning at large scale was the propensity for machine learning models to internalize latent biases and inequities in their training datasets and transfer them to their predictions and decisions in production. This property of machine learning models thus necessitates careful consideration of fairness in their implementation, training, and application.

In this session, Nicholas Schmidt provides an overview of the tools PiML provides to help you quickly and easily screen your models for bias and inequity.

Nicholas Schmidt
Chief Technology Officer, SolasAI
Director and AI Practice Leader, BLDS, LLC

January 20, 2023

Transcript

Brian Skinn [00:00:04] Nick, welcome to the event. So again, as Agus mentioned, Nick Schmidt of SolasAI is going to be discussing model fairness in particular, and how PiML is a great tool to help evaluate your models for bias and inequity, among other things. 

Nick is the CTO at SolasAI and Director and AI Practice Leader at BLDS, LLC, and brings a tremendous amount of expertise to this topic of fairness. So let me go ahead and get your slides up, and I will leave the stage to you. It’s all yours. 

Nicholas Schmidt [00:00:38] Excellent. Thank you very much, Brian. And thank you all for coming to this presentation or watching it at home. And so, today what I want to talk about is a little bit about fairness. And I’ll start with a conceptual overview, and then I’ll actually move into a short code demonstration. So, advancing my slides. 

This is the ‘Should you listen to me?’ slide, or ‘Why should you listen to me, why should you care?’. And it asks the question ‘Can AI discriminate?’. When I was working in this area five years ago, this was a much more controversial question according to some people. Now, it is much more accepted. But I really like the pictures in this slide as a way to demonstrate why AI can be a problem. Next to each of these pictures is a number, and those numbers represent the error rate in identifying gender for people who look like the people in those pictures. So, for white men, the error rate was 1%. If we go down to the bottom right, it was 35% for women of color. And if you think about how these facial recognition software systems are used, a 35% error rate is quite dangerous and really quite horrifying. That is 35 times the rate for white men. And what I think is another very important point about this slide and this particular problem is that very few of the problems we usually talk about with discrimination in AI are present here. The data on which the model was trained were pictures. Nothing inherently discriminatory about that. I believe that the modelers who were training these models had every intention of doing the right thing. No one had put in their code ‘If Black women, then falsely identify’. And so, we had no negative intentions, we had good data, we had a good model, a good system. Yet, here we are with this massive problem. 

Imagine what happens when we get into a world like credit modeling, where there is a history of discrimination that is embedded in the data. And so, with that, I think that tees up the idea that there really can be an issue, and it’s something that we need to address. One of the things that I hear a lot is that we can’t define fairness. And fairness is a complicated concept. To give you an example of it, suppose that I tell you that women are receiving offers for credit at 70% of the rate of men. That sounds pretty bad. But what if I tell you that with that same model, men’s default risk is being over-predicted by 10%? So, on the one hand, you have a definition of fairness that implies that women are not getting what they deserve, which is loan offers. And on the other hand, you have a measure of fairness that says men are not receiving what they should get, which is a fair estimate of their probability of default. What do you do with those two contradictory measures? Well, under one of them, you would say we need to give women more loans. Under the other, we would say we need to change the estimates for men, which would ultimately lead to them getting more loans. And so, those are very contradictory things. And that is a problem. 

One of the things that I hear, though, is: we can’t define it; we have this contradictory problem; how can we do anything about it? Should we just let discrimination continue? Of course not. What we can use is our common sense, our intelligence, and existing public policy to frame the question and to frame the solution. And there are a couple of things in that that I think are especially important. In AI and machine learning, like Agus was talking about, there is not one single model. There is a multiplicity of good models, or the Rashomon effect, as he was talking about. And some models may be fairer than others. So that’s what I want you to think about as you’re going forward and building your models. There may be one that is similar in terms of its quality, its ability to predict, but that is fairer on whatever dimension of fairness you want to define. And the other thing is that in credit, for example, we will not get rid of discrimination for the foreseeable future, hopefully not never, but not any time soon. But we can do better. And doing better is better than doing nothing at all. 

There are things that you can do. You can limit your data availability: make sure you understand what’s in your data, and only include that in your models. Ensure that you have a good set of diverse reviewers of your model. I can’t tell you how many things I’ve seen get put into a model that a bunch of white guys wouldn’t have a problem with or wouldn’t realize was a problem. But when you have a diverse set of reviewers, you see that there is an issue. Don’t use protected class data unless it’s necessary. So, do you really need to use race in a model? In credit decisions that’s just plain illegal, so the answer should be ‘No’. But in a marketing model, outside of credit or employment or housing, it may be legal; still, one thing I would encourage every modeler to think about is: ‘Is it wise?’. Race is rarely truly predictive, rarely the truly causal variable. Can you find data that is not discriminatory and use that instead? 

And then finally, I want you to think about monitoring for drift. While Agus was talking about data drift, here we can talk about what I call usage drift. This is particularly a problem in today’s environment, where people are so excited to use algorithms. Make sure that you are monitoring the usage of your algorithms, so that a model you’ve approved for one use doesn’t go off and get used for something else. Make sure that the fairness tests that you’ve done for that first use are still applicable to the next one. 

One thing I wanted to very quickly touch on is this idea of ‘Can you do better?’. One of the things that I hear frequently, in addition to ‘We can’t measure fairness’ or ‘We can’t define fairness, so why do it?’, is the follow-up: ‘Okay, we define it, but what does that do for us? How does that do us any good?’. And I would say it does us a lot of good, because there are a lot of companies, my own included, but others as well, and there are also open-source packages, that you can use to make your models fairer. And that’s actually what we’re showing here. To go through it quickly: we had a baseline credit model that had a given level of model quality and a given level of fairness. And that fairness was not sufficient. So, what we wanted to do was find a fairer but high-quality model. And that’s what this graph shows; each dot represents a different model. We wanted to find models that were far to the right and further up. And ultimately, we were able to find models that had about a 20% reduction in the disparity for a 2% or so reduction in the AUC. And even if you said a 2% reduction in the AUC was too steep, we still found models with a 10% or so reduction in disparity, which is really quite something. You can think of that as amounting to tens of thousands of additional loans to African-American families or Hispanic families that otherwise would not have received a loan. 
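
The kind of search described here, training many candidate models and scoring each one on both quality and disparity, can be sketched in a few lines. The snippet below is only an illustration of the idea, not the SolasAI methodology: it uses synthetic data, a generic gradient-boosting model, a hypothetical protected-group indicator, and a simple threshold on predicted default to define ‘offers’, then records AUC and the adverse impact ratio for each hyperparameter setting so the trade-off can be inspected.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a credit dataset; a real analysis would use the actual data.
X, y = make_classification(n_samples=4000, n_features=10, random_state=0)
group = np.random.default_rng(0).integers(0, 2, size=len(y))  # 1 = protected group (illustrative)
X_tr, X_te, y_tr, y_te, g_tr, g_te = train_test_split(X, y, group, random_state=0)

results = []
for depth in (2, 3, 4):
    for lr in (0.05, 0.1):
        model = GradientBoostingClassifier(max_depth=depth, learning_rate=lr, random_state=0)
        model.fit(X_tr, y_tr)
        pred = model.predict(X_te)   # 1 = predicted default, so the loan is denied
        offer = 1 - pred             # offers go to predicted non-defaulters
        air = offer[g_te == 1].mean() / offer[g_te == 0].mean()
        auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
        results.append({"max_depth": depth, "learning_rate": lr, "AUC": auc, "AIR": air})

# Inspect the quality/fairness trade-off: look for models with high AUC and AIR close to 1.
print(pd.DataFrame(results).sort_values("AUC", ascending=False))
```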

One of the last things I want to talk about before I get into the nitty gritty is the problem of model systems. I’ve spent almost all of my life thinking about the model itself, but there’s really a model system: the model is just a small piece of a much larger funnel, where at the end of it you get the yes-no outcome of whether there is a credit offer, a marketing offer, or something like that. But at the larger end of the funnel is really just the population as a whole. And there are numerous things that happen over time, over the course of that process, that narrow the funnel down. If you’re only thinking about discrimination happening at the model level, you can be missing many other places where discrimination might come in. You may also be missing many other places where there’s an opportunity to do better. Because as much as I did show that credit models, or models in general, usually can be made fairer, quite frequently there are other things that can be done, like expanding the marketing decisions that you’re making or expanding outreach, that can actually do more good than changing your algorithm. And so, it’s very important to look for all the opportunities you can to improve models. 

Now, what I want to do is get into the math of it. Fortunately, for the most part, the math of fairness analysis is pretty easy. What I’m showing here are a couple of different measures of fairness that are very commonly used. The first one, actually kind of the first two, are called impact ratios, adverse impact ratios. They’re maybe the most straightforward measure of discrimination in classification questions. As an example, let’s say that women receive loans 10% of the time: 10% of women who apply for a loan are given an offer. And 20% of men receive offers. Now, we call the women the protected group, and we call the men the reference group. That’s just standard terminology that’s used in litigation and in employment, credit, and housing. Sometimes you hear protected and control; sometimes you hear the terms minority and majority, even if the group is not numerically a minority or majority. But whatever you call it, it’s the discriminated-against group versus the reference group. So that’s 10% over 20%, a value of 0.5. And that’s saying that women are receiving offers at 50% of the rate of men. That would be considered a large difference. Now, there are policy decisions that go along with each of these metrics about how big a difference has to be before it is considered a big difference, meaning one that should be reviewed further to understand why there is such a gap. 
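
To make that arithmetic concrete, here is a minimal sketch in plain pandas (not PiML or Solas code) that reproduces the 10%-versus-20% example. The column and group names are made up for illustration.

```python
import pandas as pd

# Hypothetical loan decisions: 1 = offer made, 0 = no offer.
# 10% of 100 women and 20% of 100 men receive offers, matching the example.
df = pd.DataFrame({
    "gender": ["female"] * 100 + ["male"] * 100,
    "offer":  [1] * 10 + [0] * 90 + [1] * 20 + [0] * 80,
})

protected_rate = df.loc[df["gender"] == "female", "offer"].mean()  # 0.10
reference_rate = df.loc[df["gender"] == "male", "offer"].mean()    # 0.20

air = protected_rate / reference_rate  # adverse impact ratio
print(f"AIR = {air:.2f}")  # 0.50: women receive offers at 50% of the male rate
```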

Now, with the adverse impact ratio, there’s a threshold that’s commonly used called the four-fifths threshold or the four-fifths test. It says that if the adverse impact ratio is less than 80%, or four-fifths, then it is evidence of a problem that requires further review. That four-fifths test has gotten quite a bit of press in the academic literature on fairness in AI. I would caution you in using it, especially if you’re thinking about applying it to other metrics. It is very specific; it was designed just for the AIR and has a very narrow intended use. But assuming it is okay: we’ve got this adverse impact ratio, we see that it’s less than 0.8, and in my example it was 0.5, and we also test for statistical significance. Is this difference just a result of chance, or is it actually statistically meaningful? If it is both statistically meaningful and below that threshold, then we call the result practically significant. If it’s practically significant, then you can’t just turn around and walk away, or at least it’s unwise to. Each different metric has its own standard of practical significance, and I’ll show you two of them in the code example, which I’m now going to get to. So, Brian, I’m going to share the code; let me know if it’s coming through. 
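
As a rough illustration of that ‘practically significant’ logic, the sketch below flags a group only when the AIR falls below the four-fifths threshold and the difference in offer rates is statistically significant. A two-proportion z-test is used here purely for illustration; the actual tests implemented in PiML and solas-disparity may differ.

```python
import numpy as np
from scipy.stats import norm

def practically_significant(prot_offers, prot_total, ref_offers, ref_total,
                            air_threshold=0.8, alpha=0.05):
    """Illustrative check: AIR below threshold AND a statistically significant rate difference."""
    p_prot = prot_offers / prot_total
    p_ref = ref_offers / ref_total
    air = p_prot / p_ref

    # Two-proportion z-test (one of several tests that could be used here).
    p_pool = (prot_offers + ref_offers) / (prot_total + ref_total)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / prot_total + 1 / ref_total))
    z = (p_prot - p_ref) / se
    p_value = 2 * norm.sf(abs(z))  # two-sided p-value

    return air, p_value, (air < air_threshold) and (p_value < alpha)

# The 10%-of-100 vs. 20%-of-100 example from above.
air, p_value, flag = practically_significant(10, 100, 20, 100)
print(f"AIR = {air:.2f}, p = {p_value:.3f}, flag for review: {flag}")
```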

Brian Skinn [00:15:11] I am not seeing it yet. There it is. 

Nicholas Schmidt [00:15:12] You got it? Great. Okay. And just so you know, this code will be available, as will my slides. There’s a lot of detail in this code that I’m not going to be able to go through, but I want it to be there for you to take a look at after the presentation. So, what I do here is import some data, and I use PiML to build a model and explain the model. And then we use PiML to do fairness testing. I also use the library that PiML uses behind the scenes to do fairness testing, and I will show how to use that. That library is called Solas Disparity, and it comes from the company that I founded, SolasAI. Solas Disparity is a free, openly available piece of software that you can download from PyPI. It does common measures of discrimination testing, and it incorporates a lot of nice things about statistical and practical significance and various other things that you may need as a practitioner. 

So, what we do is we import the data. And this data, I think, is actually similar to Agus’s data, except that I added a few things. I added a sample weight column. I also made the race variables proportional, and was a little bit more specific than just saying race equals one or zero: I actually named them Black, Hispanic, and White. And you can see that they’re percentages instead of ones and zeros. That’s very important because when you go out into the real world, very frequently you do not have information on a person’s race or ethnicity, or sometimes even their gender, so you have to estimate it. And there are many ways to estimate it. The one that we use most frequently is called Bayesian Improved Surname Geocoding, or BISG. But what’s important is just that with PiML and with Solas, you can do discrimination testing with these proportional estimates of group membership. 
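
To make that concrete, here is an illustrative sketch (not PiML or Solas code) of how proportional group estimates feed a rate calculation: each row carries estimated membership probabilities rather than a 0/1 label, so a group’s offer rate becomes a probability-weighted average instead of a filter. The column names and numbers are made up, and the BISG estimation step itself is not shown.

```python
import pandas as pd

# Each row carries estimated probabilities of group membership (e.g., from BISG),
# not hard 0/1 labels. Numbers are purely illustrative.
df = pd.DataFrame({
    "Black":    [0.80, 0.05, 0.10, 0.60, 0.02],
    "Hispanic": [0.05, 0.85, 0.10, 0.20, 0.03],
    "White":    [0.15, 0.10, 0.80, 0.20, 0.95],
    "offer":    [0,    1,    1,    0,    1],
})

def weighted_offer_rate(frame, group_col):
    """Probability-weighted offer rate for a group with soft (proportional) membership."""
    w = frame[group_col]
    return (w * frame["offer"]).sum() / w.sum()

air_black = weighted_offer_rate(df, "Black") / weighted_offer_rate(df, "White")
print(f"Proportional AIR (Black vs. White): {air_black:.2f}")
```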

So, we bring this in, and this is actually kind of a high-code usage of PiML. What we’re doing is specifically setting the train and validation set instead of using the GUI that Agus showed. All we have to do to load the data is send in the data that we want. And then, using code, you can tell it which variables to exclude. We don’t want it to include race, gender, age, and other such variables. We only want PiML to see the variables that are ultimately going to be used as features in the model. Then there is a little bit more about preparing the data and just setting it up. And then this is the exploratory data analysis that Agus showed. I think it’s really quite nice and has a lot of characteristics that are very valuable. One thing you can immediately notice (this is default on the right-hand side) is that the size of your mortgage is positively related to default, whereas the rest of the variables are negatively related. And that’s an interesting fact. We then build a model, a relatively simple explainable boosting classifier. Quite a good model. It’s not overfit; we can see the train-test gap is small. 
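
For readers who want to reproduce this flow outside the notebook, here is a rough sketch of the high-code pattern, assuming the PiML Experiment API (data_loader, data_prepare, model_train, model_diagnose) roughly as documented; the file name, column names, and exact argument names are assumptions and may differ from what the notebook actually uses. The protected-class columns are simply dropped before PiML sees the features.

```python
import pandas as pd
from piml import Experiment
from interpret.glassbox import ExplainableBoostingClassifier  # stand-in for the EBM used in the talk

# Hypothetical file and column names, not the notebook's actual data.
data = pd.read_csv("credit_data.csv")

# Keep protected-class and weight columns aside for fairness testing,
# but exclude them from the features the model is allowed to see.
protected_cols = ["Black", "Hispanic", "White", "Gender", "Age", "sample_weight"]
model_data = data.drop(columns=protected_cols)

exp = Experiment()
exp.data_loader(data=model_data)  # high-code: pass a DataFrame directly instead of using the GUI
exp.data_prepare(target="default", task_type="classification", test_ratio=0.2, random_state=0)

# Train an explainable boosting machine; depending on the PiML version,
# exp.model_train or exp.model_register is the right entry point for a non-PiML estimator.
exp.model_train(model=ExplainableBoostingClassifier(), name="EBM")
exp.model_diagnose(model="EBM", show="accuracy_table")  # check the train/test gap for overfitting
```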

Now we get into the question of fairness. We’re going to define the protected groups as Black, Hispanic, Female, and Older. Correspondingly, we define the reference groups as White, White, Male, and Younger. And then we classify this analysis as race or ethnicity, sex, and age. Here are some notes about how you run Solas and PiML, the names of the variables, and things like that; I’ll skip over that. And here you can see the low-code version of how you run fairness testing with PiML. I’m actually going to skip this because of time constraints, but there are some very good examples on the PiML GitHub web page. Now, if you need to specify things at a finer level of detail than PiML provides, you can use Solas directly. And the calls for virtually all of the PiML, sorry, the Solas functions follow this pattern. You put in what we call the group data, which is the information about whether or not you’re Black, White, Hispanic, Young, and so forth. You put in the group categories and the names of the groups you want to test. Then the outcome and the thresholds you’re going to use. So here we’re using an AIR threshold of 0.8, which goes back to that four-fifths ratio I was describing. We run this, and we get a nice couple of tables out of it. And here’s what we end up with. 
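
The hand-rolled sketch below mirrors that call pattern (group data, protected and reference groups, group categories, outcome, and an AIR threshold) so the shape of the inputs is clear. It is not the actual solas-disparity function, whose name and signature may differ; the data here are synthetic and the numbers meaningless.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: soft race/ethnicity probabilities plus 0/1 sex and age flags.
rng = np.random.default_rng(0)
n = 1000
race = rng.dirichlet([2, 2, 6], size=n)  # Black / Hispanic / White membership probabilities
female = rng.integers(0, 2, n)
older = rng.integers(0, 2, n)
df = pd.DataFrame({
    "Black": race[:, 0], "Hispanic": race[:, 1], "White": race[:, 2],
    "Female": female, "Male": 1 - female,
    "Older": older, "Younger": 1 - older,
    "offer": rng.integers(0, 2, n),
})

def adverse_impact_ratio(group_data, protected_groups, reference_groups,
                         group_categories, outcome, air_threshold=0.8):
    """Illustration of the call pattern described in the talk, not the solas-disparity API."""
    rows = []
    for prot, ref, cat in zip(protected_groups, reference_groups, group_categories):
        # Probability-weighted offer rates handle proportional group membership.
        prot_rate = (group_data[prot] * outcome).sum() / group_data[prot].sum()
        ref_rate = (group_data[ref] * outcome).sum() / group_data[ref].sum()
        air = prot_rate / ref_rate
        rows.append({"category": cat, "protected": prot, "reference": ref,
                     "AIR": round(air, 3), "below_threshold": air < air_threshold})
    return pd.DataFrame(rows)

summary = adverse_impact_ratio(
    group_data=df,
    protected_groups=["Black", "Hispanic", "Female", "Older"],
    reference_groups=["White", "White", "Male", "Younger"],
    group_categories=["Race/Ethnicity", "Race/Ethnicity", "Sex", "Age"],
    outcome=df["offer"],
    air_threshold=0.8,
)
print(summary)
```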

I’ll start with the chart, and these charts are what you get with PiML as well. We see here the threshold of 0.8 as a dotted black line. Below it, everything that is statistically significant is going to have crosshairs on it. That means the result is worrisome in some way, or not necessarily worrisome, but certainly worth further review. And so, we see both Black and Hispanic applicants have AIRs in the 35% range, and women do as well, at 35.4%. Whereas the older group is actually very slightly favored. And so, if I were using this, what I would say is I need to concentrate on these three groups and find out why they are getting loan offers at such a lower rate. 

We then have a summary table, and it shows pretty much the same information, just in more detail. 

And finally, there is a model card, or summary card, that shows the specifications you used and, ultimately, the findings. So, in this notebook, I also did the standardized mean difference, which is another measure of discrimination. I will leave it to all of you to look at that on your own, because I’m about out of time. But thank you very much. I hope that you enjoy testing your algorithms with PiML. And if you have any questions, please feel free to reach out to me on LinkedIn. 
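
For reference, the standardized mean difference is typically the difference in mean outcomes or scores between the protected and reference groups, scaled by a standard deviation. The minimal sketch below uses the overall sample standard deviation of the score and made-up numbers; the exact convention used by Solas Disparity and PiML (sign, denominator, thresholds) may differ.

```python
import numpy as np

def standardized_mean_difference(score, is_protected):
    """SMD: (mean protected score - mean reference score) / overall std of the score.
    Sign convention and choice of denominator vary by tool; this is one common form."""
    score = np.asarray(score, dtype=float)
    is_protected = np.asarray(is_protected, dtype=bool)
    diff = score[is_protected].mean() - score[~is_protected].mean()
    return diff / score.std(ddof=1)

# Toy example: predicted default probabilities for protected vs. reference applicants.
scores = np.array([0.30, 0.42, 0.55, 0.20, 0.35, 0.28, 0.50, 0.22])
protected = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=bool)
print(f"SMD = {standardized_mean_difference(scores, protected):.3f}")
```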

Brian Skinn [00:23:04] Terrific. Thank you very much, Nick. Yeah, I mean, this fairness aspect and the ability of PiML to provide this kind of insight and evaluation seems extremely important. We really appreciate you setting this up and presenting it to us.