There is a wide variety of analytical tools that can be used to provide insight into machine learning model reliability, resilience, fairness, and other properties. PiML seeks to make a broad suite of these tools accessible to as great a community of machine learning practitioners as possible.
Join PiML creator and core contributor Agus Sudjianto for a discussion of the key features of PiML and their user-friendly implementations.
Dr. Agus Sudjianto
Executive Vice President & Head of Corporate Model Risk, Wells Fargo & Company
January 20, 2023
Transcript
Brian Skinn [00:00:00] Our next speaker will be Agus Sudjianto. He is Executive Vice President and Head of Corporate Model Risk at Wells Fargo and Company. He is going to survey some of PiML’s key features, and I believe provide us a brief demonstration.
For anyone who might be just joining us, I want to reiterate that this is a TechShares event put on by OpenTeams, which gives you access to the best open-source architects in the world to train, support, and deliver your software solutions. My name is Brian Skinn. I’m your host. Again, the overall event is Quality AI & ML Models by Design with PiML.
So, Agus, thank you for sharing your experience with PiML with us. Take it away.
Agus Sudjianto [00:00:44] Thank you, Brian, and thank you, Patrick, for the very nice introduction. Let me start with a quick introduction to the motivation behind PiML. This is a tool that we built with, as Patrick said, years of experience in model risk management, and the people on this panel contributed significantly to the ideas that shaped what PiML is today and what it should be going forward. So, through this forum today, we are hoping a lot more people will get engaged in looking at the tool, which will be useful in really shaping its future as well.
Patrick talked about SR11-7 and about the very rapid adoption of ML today, particularly in the financial industry. This framework is not only applicable to the financial industry, but to any industry as well. But let me start with the problem that machine learning has today. I have a chart at the top. The horizontal axis is the complexity of the model, measured by degrees of freedom, essentially the number of parameters. The vertical axis is the performance of the model, in this case a predictive model for regression, so mean squared error. Every dot on that line is a model, the performance of a model with a certain set of hyperparameters. In this case, the chart is from XGBoost. So, every circle there is an XGBoost model with a different tree depth, different number of boosting steps, different learning rate, and different regularization. That’s what the chart is. The more complex models fit the data very well. Then we split the data into training and testing, we look at the testing, and the more complex models still work well too.
So, many models in machine learning with different hyperparameters will result in different models performing almost similarly. In overly parameterized models, with a lot of parameters, as in almost all of machine learning, we encounter the phenomenon of ‘model multiplicity’: many models perform almost the same way judged only on the testing data. The great Leo Breiman, the inventor of Random Forest, called this the Rashomon effect; others call it benign overfitting, meaning that from the testing data we don’t see the overfitting. The model looks okay. The classic bias-variance tradeoff in statistical learning, where the more complex model overfits most of the time, doesn’t show up. You can try it yourself: when you do gradient boosting, you boost a lot more, which means your model becomes more complex, and the model gets better and better, until you deploy it in production, the environment changes, and the model does not work well. So, there are a lot of problems in machine learning, and people have to retrain very frequently because of noise, and so on. The overly parameterized model creates many models that perform almost the same.
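For readers who want to reproduce the shape of that chart, here is a minimal sketch of such a hyperparameter sweep, assuming XGBoost and scikit-learn are installed; the synthetic data and the small grid are illustrative, not the ones behind the slide.

```python
# Illustrative sketch (not the talk's actual data or grid): sweep XGBoost
# hyperparameters and record training/testing MSE, to show that many
# differently configured models perform almost identically on the test set.
from itertools import product

from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

X, y = make_regression(n_samples=2000, n_features=10, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

results = []
for depth, n_trees, lr, reg in product([2, 4, 6], [100, 300], [0.05, 0.1], [0.0, 1.0]):
    model = XGBRegressor(max_depth=depth, n_estimators=n_trees, learning_rate=lr,
                         reg_lambda=reg, random_state=0).fit(X_tr, y_tr)
    results.append((depth, n_trees, lr, reg,
                    mean_squared_error(y_tr, model.predict(X_tr)),
                    mean_squared_error(y_te, model.predict(X_te))))

# Many hyperparameter settings, many different models, very similar test MSE.
for depth, n_trees, lr, reg, train_mse, test_mse in sorted(results, key=lambda r: r[-1]):
    print(f"depth={depth} trees={n_trees} lr={lr} lambda={reg} "
          f"train MSE={train_mse:8.1f} test MSE={test_mse:8.1f}")
```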
Some models don’t make sense, some models make sense. So, model explainability is very important to uncover potential problems. It can help us identify and understand how sound the model is. And when we talk about model explainability, most people in the field of XAI, explainable AI, typically talk about post hoc explainability, meaning the model is a black box, and then I’m going to apply a post hoc explainer like LIME, SHAP, PDP, ALE, etc. to explain the model. If the model is still a black box, we apply an explainability tool, but an explainability tool can go wrong very easily. And the more complex the model is, the more the explanation can be wrong too.
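For reference, the post hoc route described here looks roughly like the following with the SHAP library on a black-box tree model; the data and model are placeholders, and, as the talk stresses, explanations obtained this way should be treated with care.

```python
# Post hoc explanation of a black-box tree model with SHAP (placeholder data
# and model). As discussed above, such explanations can themselves be
# misleading when the underlying model is very complex.
import shap
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
blackbox = XGBClassifier(max_depth=6, n_estimators=300, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(blackbox)        # post hoc explainer
shap_values = explainer.shap_values(X[:100])    # local attributions for 100 rows
shap.summary_plot(shap_values, X[:100])         # global summary built from local values
```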
So, with that, the thinking is: can we come up with an inherently interpretable machine learning model? That’s what we were thinking about. What PiML provides is basically a sophisticated machine learning model that is inherently interpretable, so we are not risking ourselves with a post hoc explainer that can go wrong.
Then I will talk about, and we’re going to come back to it a little later, the standard outcome testing. The practice today is to split the data into training and testing. We build a model using the training data, then we evaluate it with testing data, and we feel good about it. A Kaggle competition is like that too, right? You win the Kaggle competition based on the testing data, and when you split it, you basically do random splitting. That has a lot of flaws, which we’re going to see a little bit later, particularly because when you test that way, we assume that the world stays the same, that the world is just like the training data, that the distributions of the training and testing data are i.i.d. And we know that in the real world, the world changes. So, with anything that is outside the training data or with any data drift, the model will perform differently, and that is not detected. That is what we’re trying to address and what motivated us to build this tool – how to build robust, reliable, and fully responsible AI.
When we talk about this, how do we solve this problem? On the right-hand side, there are two key components that Patrick talked about in SR11-7. One is called conceptual soundness. Conceptual soundness deals with explainability and interpretability, as well as causality. In outcome analysis we want to go beyond the standard test, the simple train-test split. We’re talking about weak spots: we identify which regions of the data make the model weak. We cannot just rely on AUC or F1 for the overall model, because the model fails in the weak areas. So, it is about identifying the weak spots. Patrick talked about the example of the Uber accident – identifying the weak spot where the model is weak. The tooling in PiML is designed for that.
Reliability deals with the reliability of the model output, its uncertainty, how uncertain it is. We want to understand which region, which combination of inputs and prediction conditions, is less reliable. That part is in the model diagnostics.
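One common way to put numbers on that kind of output uncertainty is split conformal prediction, sketched below with plain scikit-learn; this illustrates the concept and is not necessarily the exact method PiML implements.

```python
# Split conformal prediction intervals (concept sketch, simplified quantile
# without the finite-sample correction). Regions where the residuals are
# large get wide intervals, i.e. the model is less reliable there.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=3000, n_features=5, noise=10.0, random_state=0)
X_fit, X_cal, y_fit, y_cal = train_test_split(X, y, test_size=0.3, random_state=0)

model = GradientBoostingRegressor(random_state=0).fit(X_fit, y_fit)
resid = np.abs(y_cal - model.predict(X_cal))   # calibration residuals
width = np.quantile(resid, 0.9)                # 90% conformal half-width

for p in model.predict(X_cal[:5]):
    print(f"prediction {p:8.2f}, 90% interval [{p - width:8.2f}, {p + width:8.2f}]")
```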
And the model’s robustness is very important. If there is noise or the data is corrupted – a lot of failures of machine learning in the medical field are because of robustness issues – we have to ask ourselves how robust the model is: if there is a small change in the input, how will the output react? Does it become a completely different output with a small perturbation? There is a lot of research on adversarial attacks, and all of it really deals with that: small changes, and how the model reacts. So, lack of robustness is often a problem, in particular in what I talked about earlier, where the model does not look overfit and the testing data looks good, but when you deploy it, it becomes terrible. You have to refit it all the time, because it is overfitting noise, and that creates a lack of robustness.
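The robustness idea itself is easy to sketch outside any toolkit: perturb the test inputs slightly and watch how the metric degrades. The model, data, and noise levels below are illustrative.

```python
# Illustrative robustness check: perturb test inputs with increasing Gaussian
# noise and track how test AUC degrades. A non-robust model's performance
# drops quickly even for small perturbations.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

rng = np.random.default_rng(0)
scale = X_tr.std(axis=0)                      # perturb relative to each feature's scale
for eps in [0.0, 0.05, 0.1, 0.2, 0.4]:
    X_noisy = X_te + rng.normal(0.0, eps, size=X_te.shape) * scale
    auc = roc_auc_score(y_te, model.predict_proba(X_noisy)[:, 1])
    print(f"noise level {eps:.2f}: test AUC = {auc:.3f}")
```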
Resiliency – we know that the world changes all the time, the environment changes, and the data drifts. Can we detect it, can we test it upfront, before we deploy the model? A lot of people manage resiliency and data drift through monitoring. Yes, model monitoring is important, but during model testing we need to do this first, so that we can anticipate what kind of drift, and which factors’ drift, are really important to monitor. Then, when drift happens, we know exactly what we need to do.
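Resilience can be sketched in the same spirit: re-score the model on shifted slices of the test data to simulate drift. This continues from the robustness sketch above (reusing its `model`, `X_te`, `y_te`), and the shift scenario is illustrative.

```python
# Simulated drift: evaluate the model on progressively shifted slices of the
# test set (here, the rows with the lowest values of one feature) and watch
# the performance degrade. Reuses `model`, `X_te`, `y_te` from above.
import numpy as np
from sklearn.metrics import roc_auc_score

drift_feature = 0                               # illustrative: drift along feature 0
order = np.argsort(X_te[:, drift_feature])      # test rows sorted by that feature
for frac in [1.0, 0.5, 0.3, 0.1]:
    idx = order[: max(int(frac * len(order)), 50)]
    if len(np.unique(y_te[idx])) < 2:           # AUC needs both classes present
        continue
    auc = roc_auc_score(y_te[idx], model.predict_proba(X_te[idx])[:, 1])
    print(f"lowest {int(frac * 100):3d}% of the feature: AUC = {auc:.3f}")
```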
And then the other aspect is bias and fairness, which the next speaker will talk about in more detail. So, this is what PiML covers – it covers all of these aspects.
I’m going to go through this very quickly and then stop on just some of these concepts. When we talk about inherently interpretable models in PiML, there are many such models available; we’ll go through that quickly. And all the outcome analysis testing that I spoke about is available in PiML as well. The workflow is: you get PiML, and you can use it in low-code mode or high-code mode. Low-code mode needs only the bare minimum of coding, or almost no coding at all; I’ll give you a demonstration of low-code later. In low-code you have the interpretable machine learning models, and from those you can get exact interpretation. It’s not post hoc, but an exact interpretation. If you still want to apply post hoc explainers – yes, you can apply them, with more confidence now, to get a simpler explanation. You can diagnose models – robustness, reliability, weak spots, etc. And then you can compare various models and choose based on the testing that you have and all of those things.
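To give a feel for that low-code workflow, here is roughly its shape in a notebook, based on the PiML documentation as I recall it; each call opens an interactive panel, and the exact method names should be checked against the current PiML release.

```python
# Rough shape of the PiML low-code workflow (based on the PiML docs; check
# the current release for exact names). Each call opens an interactive panel
# in the notebook.
from piml import Experiment

exp = Experiment()
exp.data_loader()       # load demo data or upload your own
exp.data_summary()      # summary statistics, exclude variables
exp.data_prepare()      # target, task type, train/test split
exp.feature_select()    # correlation, importance, conditional independence
exp.model_train()       # choose and fit interpretable models (XGB2, GAMI-Net, ...)
exp.model_interpret()   # exact, inherent interpretation
exp.model_explain()     # post hoc explainers (PDP, LIME, SHAP, ...)
exp.model_diagnose()    # weak spot, reliability, robustness, resilience
exp.model_compare()     # compare registered models side by side
```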
Okay, then there is the question of which model you’re going to use. In the real world, for high-risk applications, we choose models not only based on performance; we choose models based on reliability, robustness, resiliency, and bias and fairness. There are all kinds of considerations. That’s what the model comparison is for. If you want more flexibility, you can run it in high-code mode. It’s a lot more coding, but you have a lot more flexibility, including if you want to use a black-box model too. A black-box model doesn’t have inherent interpretability, but you can still explain it using post hoc explainability and compare it. And of course, in PiML, if you already have models, or somebody has already built a model, you can register the pre-trained model and run it through all the diagnostic tools in PiML as well.
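Registering an already-trained black-box model for the same diagnostics looks roughly like the sketch below. The `make_pipeline` and `register` calls and their arguments are my recollection of the PiML docs and may differ in the current release; the LightGBM model and the placeholder data are assumptions, and `exp` is the experiment from the sketch above.

```python
# Hedged sketch of registering a pre-trained black-box model for PiML's
# diagnostics (verify make_pipeline/register against the current PiML docs).
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# placeholder data standing in for your own prepared training/testing sets
X, y = make_classification(n_samples=3000, n_features=10, random_state=0)
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=0)

blackbox = LGBMClassifier(max_depth=6, n_estimators=300).fit(train_x, train_y)
pipeline = exp.make_pipeline(model=blackbox,
                             train_x=train_x, train_y=train_y,
                             test_x=test_x, test_y=test_y)
exp.register(pipeline, "LGBM-blackbox")   # now usable in explain / diagnose / compare
```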
When we talk about interpretable models in PiML, PiML provides the post hoc explainability tools at the bottom, but also intrinsically interpretable models, particularly high-performance ones. It has ReLU DNN, which you can make locally interpretable. If you need a globally interpretable model, it has GAMI-Net, it has the explainable boosting machine, or, if you like to use XGBoost, it has XGB2, which basically constrains the architecture to make the model inherently interpretable. I’m not going to go into much detail on this. I’m going to jump quickly into some of the testing, and then we’ll go to a quick demo, in which we’ll go through things in a bit more detail.
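As a concrete illustration of the XGB2 idea, here is plain XGBoost (not PiML's wrapper) with the depth constraint: every tree then splits on at most two features, so the boosted sum contains only main effects and pairwise interactions, which is what makes an exact interpretation possible.

```python
# Plain-XGBoost illustration of the XGB2 idea (not PiML's wrapper): with
# max_depth=2 each tree uses at most two features, so the model is a sum of
# main effects and two-way interactions only.
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2000, n_features=8, random_state=0)
xgb2_like = XGBClassifier(max_depth=2, n_estimators=300, learning_rate=0.1,
                          random_state=0).fit(X, y)

# Inspect the shallow trees that make the decomposition possible.
print(xgb2_like.get_booster().trees_to_dataframe().head(10))
```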
Feature selection in PiML I would like to highlight a little bit. It has correlation-based selection, it has traditional feature importance, but it also has causality-based selection. If you need to choose features that are causal, PiML has what is called a Randomized Conditional Independence Test with Markov blanket search – finding the variables that directly impact the response Y, rather than variables that only impact other variables. And there are benefits to conditional independence, because the model is typically more resilient when you choose your factors causally. On the failure testing side, Patrick talked about how to test for failure, identifying the regions where the error is bigger. That’s slicing: understanding for which variables, and in which regions of those variables, the error is larger, and which regions have large uncertainty. We talked about which regions have bigger prediction uncertainty, where the model is less reliable. So, reliability testing is in there, and this is important. Both of these are important for identifying model weaknesses, so that the model can be improved further. It has out-of-distribution testing, which relates to model robustness: basically, in robustness testing you perturb the input a little bit and see how quickly the performance changes. And resilience is really about drift. We induce data drift and then see which variable is the most critical – if the variable most critical to performance drifts, the performance will degrade the fastest. This anticipates data drift: it has simulations, it has data drift scenarios, so you can anticipate and understand how the model will change and how performance will degrade, and the same for fairness. Because you can build a model that is fair, but when the data drifts, the model can become unfair too. So, this is very important when we talk about resilience of performance, but also of other attributes as well.
With that, I am going to go into a quick demo just to show you how easy it is to go through this in low-code mode. So let me go to that. Here it is; I hope you can see the screen well. The first step is basically importing PiML – you need to install it from GitHub – and then constructing the PiML experiment. I’m going to demonstrate low-code, so I’m going to run this. Select data: it has some demo data that you can use. In this example, I’m going to upload data from my machine. So, I’m going to upload this data. Once it is uploaded, you get the data summary. You can get the data summary by numerical attribute and by categorical attribute. This is a credit example; these are the input attributes. The objective is that I am going to look at the status, whether they are delinquent or non-delinquent: one is default, zero is non-default. We are going to use it for credit approval in this case. It has categorical attributes – in this case it has race and gender as well. Race and gender, of course, cannot be used, so I’m going to exclude these: I’m going to exclude gender, and I’m going to exclude race from my data. Then I can prepare for training. In the preparation you have a test ratio; I split it 80/20. PiML will read the target, and you can specify it – by default, the last column will be the target. And because the target is categorical, this is classification; you can do classification or regression, and you can supply sample weights as well.
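In high-code form, the data steps just shown look roughly like the sketch below. The method and parameter names follow the PiML docs as I recall them and should be verified; "credit.csv", "Status", "Race", and "Gender" are placeholders standing in for the uploaded data described here.

```python
# High-code sketch of the data steps in the demo (verify names against the
# current PiML docs; the file and column names are placeholders).
import pandas as pd
from piml import Experiment

exp = Experiment()
exp.data_loader(data=pd.read_csv("credit.csv"))        # or a built-in demo dataset
exp.data_summary(feature_exclude=["Race", "Gender"])   # drop protected attributes
exp.data_prepare(target="Status", test_ratio=0.2)      # 80/20 train/test split
```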
Okay. It has simple EDA that I’m not going to go through. I’m just going to show you how, in the low-code environment, it is just one line, and you get all kinds of things. Feature selection: I talked about it. You run feature select, and it does all the calculations – the simple Pearson correlation; distance correlation, which is used to identify nonlinear relationships, so you can see those as well; traditional feature importance, in this case GBM feature importance. It also has the Conditional Independence Test, which is the causality test – which variables are causal and should be used.
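As a side note on why distance correlation is worth having alongside Pearson correlation, the small sketch below (using the third-party dcor package, an assumption on my part, not part of PiML) shows a purely quadratic relationship that Pearson correlation misses entirely.

```python
# Pearson vs. distance correlation on a quadratic relationship: Pearson is
# near zero, distance correlation is clearly positive. (Uses the third-party
# `dcor` package; illustration only, not PiML's internal implementation.)
import numpy as np
import dcor

rng = np.random.default_rng(0)
x = rng.normal(size=2000)
y = x ** 2 + 0.1 * rng.normal(size=2000)

print("Pearson correlation: ", np.corrcoef(x, y)[0, 1])          # ~ 0
print("Distance correlation:", dcor.distance_correlation(x, y))  # clearly > 0
```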
Let me just walk through a quick demo. I’ll just use this; click on ‘Confirm’. Then you train the model. In train model, you have a lot of choices. These are traditional statistical techniques; then you have XGB2, which is basically XGBoost constrained to make it inherently interpretable – XGBoost with depth two, so that you capture up to second-order interactions. GAMI-Net is the counterpart of that using neural networks, and there is also ReLU DNN, which is likewise an inherently interpretable model. So let me do this. I don’t have much time; I just want to demonstrate how easy it is to run. So, this is running, and you get the result. After you get the result, you register the model for further analysis. Then you can do model explain – this is the post hoc explainability.
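The high-code counterpart of this training step looks roughly like the following; the class and method names follow the PiML docs as I recall them, so check casing and availability against your installed version. This continues the experiment prepared in the earlier sketch.

```python
# High-code training of the inherently interpretable models mentioned above
# (names per the PiML docs as I recall them; verify against your version).
from piml.models import XGB2Classifier, GAMINetClassifier, ReluDNNClassifier

exp.model_train(model=XGB2Classifier(), name="XGB2")         # depth-2 XGBoost
exp.model_train(model=GAMINetClassifier(), name="GAMI-Net")  # neural-net counterpart
exp.model_train(model=ReluDNNClassifier(), name="ReLU-DNN")  # locally interpretable DNN
```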
So, I’m going to just run this very quickly to illustrate. There you have the PDP and all of those, and then the interpretation – this is for inherently interpretable models. The explanation is exact; this is explaining the model in a very exact way. You have the main effects, exactly, and you have the interactions. Under interpret, for an inherently interpretable model, you don’t have the problems of post hoc explainability. And you can play around yourself and compare how trustworthy SHAP is compared to the exact interpretability. It has global and local interpretability. If you go local – under explain, in the local view – you will get all those LIME and SHAP explanations, and you can compare them with the exact interpretability. With a model that is inherently interpretable, you can see how trustworthy that post hoc explainability is. There is also a module to look at model resilience, which is one of the diagnostics.
Let me run this one very quickly, and then I’ll turn it over to the next speaker. Let me pick this quickly. Running model diagnostics: the diagnostics look at accuracy, weak spot analysis – I spoke about weak spot analysis – overfitting, reliability, robustness, and resilience. I’m going to pick just resilience for now. You can see how the performance of the model, its accuracy, can degrade very rapidly. And which variable is important in this case? Markets is the most important, with the shift on the Markets side. Let me pick Markets – the distribution has shifted from the original toward the low end of Markets, and the model degrades very rapidly. So, this gives you an idea upfront: under different shift scenarios, you can see how the model will degrade. There are several shift scenarios with which you can do ‘what if’ analysis.
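In low-code mode this whole diagnostics step is again a single panel; a one-line sketch, per the PiML docs as I recall them, with the high-code variant (explicit model and test arguments) left to the documentation:

```python
# Low-code diagnostics panel: accuracy, weak spot, overfit, reliability,
# robustness, and resilience for each registered model. A high-code variant
# with explicit arguments exists; see the PiML documentation for options.
exp.model_diagnose()
```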
So, I don’t have much time to cover the other things, like model fairness, but Nick is going to cover model fairness next. With that, I’ll turn it back to you, Brian.
Brian Skinn [00:23:26] Indeed. Thank you very much, Agus, for giving us this great survey of PiML. Really appreciate it.