Andrew James (Staff Software Engineer, Quansight) and Sean Ross-Ross (Solutions Architect, OpenTeams) discuss recent PyTorch news.
Article: “The Case for Running AI on CPUs Isn’t Dead Yet”
Matthew S Smith, IEEE Spectrum, 1 Jun 2023
https://spectrum.ieee.org/ai-cpu
June 13, 2023
Transcript
Brian Skinn (00:37): Welcome. My name is Brian Skinn. I am the Open Source Architect Community Manager at OpenTeams. OpenTeams’ broad goals… We are seeking to support and sustain open source. Part of the way we’re doing that is through our community of Open Source Architects, and we are also trying to be the … single source of procurement for clients trying to build, connecting them with our partners and our experts who can help them build. For today’s conversation, we’re calling it PyTorch News Corner. It’s a new experiment that we’re playing around with. For the conversation today, we have Andrew James of Quansight and Sean Ross-Ross of OpenTeams. We’ve picked a recent news article–I’ll post the link to that in the chat–an IEEE Spectrum article about the case for running AI on CPUs versus GPUs. Let me post that there. Andrew, Sean, please go ahead and make your entrance, if you would. Good to see you both today. Thank you for being here.
Andrew James (01:47): Hi, Brian. Thanks for having me.
Sean Ross-Ross (01:48): Thanks, Brian.
Brian Skinn (01:50): Of course. So Andrew, why don’t you start by introducing yourself?
Andrew James (01:56): Yeah, of course. Hi everyone. My name is Andrew James. I’m a staff software engineer with Quansight. I’m currently working as a maintainer of torch.sparse, and I’m a tech lead for a team of about a dozen PyTorch maintainers that we have at Quansight right now.
Brian Skinn (02:14): Great, thanks. Sean?
Sean Ross-Ross (02:16): Hi. So my name is Sean Ross-Ross. I am SVP of the OpenTeams Network here at OpenTeams. I am a very large PyTorch and TensorFlow user. And also, thanks to Andrew, I’ve made some contributions to PyTorch as well.
Brian Skinn (02:34): Very cool. So, yeah, the article that we’re going to be digging into, again, is called “The Case for Running AI on CPUs Isn’t Dead Yet,” subtitled “GPUs may dominate, but CPUs could be perfect for smaller AI models.” So, I would assume anyone in the AI space is aware that GPUs are a very common platform and are in very high demand, especially with the recent developments in foundation models and large language models, to the point where there are supply crunches. As I understand it, it’s hard to get compute time on the high-end processors on various cloud platforms. But CPUs, what can they do for you with modern machine learning? Take it away.
Sean Ross-Ross (03:21): I think we could break this into two different components, or at least that’s what I was thinking when I was reading the article. There’s the cost on the implementer–so somebody like me who’s building a PyTorch model and wants to test and train quickly–and I think there are definitely different trade-offs there. And then, on the flip side–Andrew, I won’t put words in your mouth–there are the PyTorch developers who are working on improving the library: where should they put their effort, on improving PyTorch for CPU or for GPU, and things like that? And I think that the demand is going to be driven by what the users are using, and I would love to advocate for a lot more CPU use than there currently is. One of the uses in particular I’d love to talk about is just early development of models. Even just the first few epochs, like 5, 10, 20 minutes of training, can give you a really solid view of where your model is failing. That quick iterative development really speeds up the overall time of developing the model, rather than, if you don’t have a GPU locally, shipping it off to Google Cloud or something like that.
Andrew James (04:39): Yeah, I’ve got to say I agree with some of the points made in the article. I think that there’s certainly going to be innovation in terms of different approximations, different techniques we can use that will give the CPU more utility in the space, particularly the idea of applications which are going to run local models. For inference, I think that that is going to be something that’s really big, and that’s going to motivate a lot of work by folks that have their own hardware. I’m thinking of Apple in particular. They’re going to want to make sure that they can run certain smaller models on their devices well, for new things that they’re going to be developing. I have seen in the PyTorch repo an uptick from Apple developers in supporting the MPS backend, so I don’t think that that’s an unreasonable sort of projection to make. Now, I don’t work with any of them, they haven’t told me anything, so that’s just my informed speculation. But yeah, I think we’re certainly going to see a shift. As PyTorch developers, we do put a lot of emphasis on the NVIDIA backend and CUDA, and CPU is less important. We don’t go out of our way to write badly performing code for it, but we certainly don’t spend as much time optimizing it as we do for the GPU. And as these things develop, the amount of effort on the part of the developer, which is something that Sean brought up, I think things like PyTorch 2.0 and the compiler stack are going to make that a lot easier. In order to generate or emit efficient code for models that you write in PyTorch, someone really only has to work on a core set of about five or six hundred operations, out of the total of somewhere near 2,500 operations that PyTorch supports, simply because of the way the compiler was designed: backends for the compiler which are emitting this code basically work with a limited subset that everything else is decomposed in terms of. So I think that, combined with a shift in demand, will make it easier for us to pivot a little bit faster if performance is needed, whether it’s coming from our users, internal customers, or external customers, because of PyTorch 2.0 and the opportunities it presents us.
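As a minimal sketch of the PyTorch 2.0 workflow described above (the model and sizes are illustrative placeholders, not anything from the discussion), torch.compile traces an ordinary eager-mode model, decomposes it toward the smaller core operator set, and hands the graph to a backend that emits code for the target device:

```python
import torch
import torch.nn as nn

# torch.compile wraps an eager-mode model; the default Inductor backend
# generates device-specific code from the decomposed operator graph.
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
compiled_model = torch.compile(model)   # backend="inductor" is the default

x = torch.randn(32, 64)    # a CPU tensor; the same code works for CUDA tensors
out = compiled_model(x)    # the first call triggers compilation; later calls reuse it
print(out.shape)
```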
Sean Ross-Ross (07:10): I’m super curious, Andrew. They touched on sparsity as well, as like a different avenue. So, I mean, there’s the hardware optimization as well, but then there are also some interesting notes on sparsity. And since you’re the resident sparsity expert, what are your thoughts on how likely that is to come to fruition?
Andrew James (07:30): Yeah, I mean, if you think about this movement towards the CPU, it’s either going to be motivated by cost, that’s the cost of cloud hours, and maybe it takes a little bit longer, but you can get more done for less dollar. Or there’s also the thing that we’ve talked about in terms of developing things that can run locally on a device or on the desktop computer. So, part of that is going to be developing new models that are smaller or do a very specific job; maybe they work in conjunction with larger models that are running in the cloud. And part of it is different techniques to compress the size of a model so that you can do more on a smaller system. One of those they talk about a lot in the article is quantization, and another one would be sparsity. And sparsity kind of has the disadvantage that it’s very difficult to write generic sparse algorithms that vectorize well. So it’s already kind of an uphill battle to efficiently utilize very expensive GPU hours completely with sparse. We’ve only recently made some big breakthroughs where, for certain sparsity levels and the specific layout we’re using, we’re actually able to beat the dense matrix-multiply baseline, which is sort of our signpost for how well we’re going to do on machine learning workloads in general. So I think that movement to the CPU, in terms of compressing models using sparse layers and sparse weights, I think that’s a totally reasonable future to see. We’ve been doing a lot of work on sparsity that’s mainly focused around getting the API to stability so that we can drive adoption with our users and foster more experimentation. But also, I’d say that these two objectives are tied. We are targeting specific workloads where we’ve already identified basically a use case for sparsity, and that’s going to come from different pruning techniques and things like this, where people have done the research and discovered that, yes, you can remove 80% of your weights and still achieve an acceptable level of accuracy. So now that is a workload where we know sparsity will work and we have a strategy for employing it; so now let’s make that fast and make it usable in PyTorch.
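As a hedged illustration of the pruning-then-sparsify idea described here (the layer, sizes, and the 80% figure are placeholders echoing the discussion, not anyone’s actual workload), a minimal sketch in PyTorch might look like this:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Prune 80% of a layer's weights by magnitude, then store what is left in a
# sparse layout. Whether the sparse path beats the dense matmul baseline
# depends on sparsity level, layout, and hardware, as noted above.
layer = nn.Linear(1024, 1024)
prune.l1_unstructured(layer, name="weight", amount=0.8)  # zero out 80% of the weights
prune.remove(layer, "weight")                            # make the pruning permanent

sparse_weight = layer.weight.detach().to_sparse()        # COO layout; CSR/BSR are other options
x = torch.randn(1024)
y = torch.sparse.mm(sparse_weight, x.unsqueeze(1)).squeeze(1) + layer.bias
print(y.shape)
```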
Sean Ross-Ross (09:49): I’m curious Andrew, you mentioned some GPU benchmarks for sparse, but do you guys have, or do you guys actually do CPU benchmarks on your sparse tensors?
Andrew James (10:00): At the moment, no. These are not part of the TorchBench suite or anything like that. Basically, we are experimenting with new technologies for authoring GPU kernels, specifically using Triton in eager mode, which has been a little bit of a technical wrestling match. But using Triton, we’re essentially benchmarking against our existing kernels; we’re essentially replacing functionality that we already have. We’re not introducing any more coverage for different layouts and sizes and shapes and that kind of stuff, but what we’re doing is introducing a faster path. Now that Triton is a dependency of the 2.0 stack, we can guarantee that it’s available, and we can leverage that technology in eager mode. And we’re taking a lot of the lessons that we learned and strategies that we employ, and carrying them forward as we get ready to start designing things like templates for the compiler, to start working with sparse layouts and emit code that way.
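For readers who want to try the kind of comparison being described on their own machine, here is a rough sketch (the sizes, sparsity level, and CSR layout are arbitrary placeholders, not the team’s actual benchmarks) of timing a dense matmul baseline against a sparse one on the CPU:

```python
import torch
import torch.utils.benchmark as benchmark

# Compare a dense matmul baseline against a sparse CSR matmul on the CPU.
n, sparsity = 4096, 0.9
dense = torch.randn(n, n)
mask = torch.rand(n, n) > sparsity            # keep roughly 10% of the entries
sparse = (dense * mask).to_sparse_csr()
rhs = torch.randn(n, n)

t_dense = benchmark.Timer(
    stmt="torch.mm(a, b)", globals={"torch": torch, "a": dense, "b": rhs}
).blocked_autorange()
t_sparse = benchmark.Timer(
    stmt="torch.mm(a, b)", globals={"torch": torch, "a": sparse, "b": rhs}
).blocked_autorange()
print(t_dense)
print(t_sparse)
```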
Brian Skinn (11:01): One question I have, again, I’m a novice at AI/ML, it’s not something I have a lot of experience with, but one question that comes to mind is from the user perspective, PyTorch in particular, there’s been heavy emphasis on GPUs, and I know of CUDA, and if there are alternative implementations, how challenging is it? How much code change, churn, is involved in exploring a potential CPU implementation versus a GPU? Are there shim layers or interfacing layers that make it fairly easy, or is it a fairly deep change?
Andrew James (11:42): Yeah, well, one of the things that I think makes PyTorch such a popular library is the way that it’s able to abstract away these backend details. So if you want to change to a GPU device, and you’re currently working with a sample implementation or some experimental code and you’re running on the CPU, you basically just need to change parameters to indicate that your tensors are now allocated on the device. And when it comes to this kind of shift in particular, as we moved from handwritten CUDA kernels, or kernels provided by NVIDIA’s cuSPARSE library, which is like their cuBLAS, but for sparse layouts, it was actually very simple. All we had to do was essentially insert a branch somewhere where we look for a binding to this function that’s actually defined in Python, and then call it if it’s available. And there are some limitations on the Triton kernel regarding block sizes and data types, but we do all those checks in there, and we didn’t actually remove anything. So if we can’t select the fast new kernel, we’ll always fall back to the existing one for that use case. So it’s easy in the sense that PyTorch has already put a lot of effort into abstracting these things away. It’s difficult in the sense that if you were to build it from scratch, you would be looking at a monumental undertaking. So it’s easy now, but it was hard, that kind of thing.
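As a minimal sketch of that abstraction (the model and sizes are placeholders), the same PyTorch code can target either backend just by changing the device argument:

```python
import torch
import torch.nn as nn

# The model definition and forward pass do not change; only the device does.
device = "cuda" if torch.cuda.is_available() else "cpu"

model = nn.Linear(32, 8).to(device)
x = torch.randn(16, 32, device=device)
y = model(x)            # dispatched to CPU or CUDA kernels automatically
print(y.device)
```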
Sean Ross-Ross (13:15): Yeah, I think that for me, it comes back to cost. Like, I do a lot of development on my MacBook. I don’t have easy access to a machine with CUDA, so I always go to the cloud; I’ll do Google Cloud or something like that for my training runs. So it’s really important for me to do those first few iterations just to make sure that the model is starting to converge, that everything is kind of roughly checking out. And I’ll spend a few hours or days on my laptop doing little micro training runs, just to make sure I can check all the boxes, before I then ship it off to run overnight, or however long I’m going to run the model for.
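A sketch of that kind of quick local smoke test (the model, data, and step count here are stand-ins, not Sean’s actual workflow) might look like:

```python
import torch
import torch.nn as nn

# Train for a few hundred steps on a small batch on the CPU and check that the
# loss is actually decreasing before committing to a long cloud run.
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.randn(256, 20)   # stand-in for a small slice of the real dataset
y = torch.randn(256, 1)

for step in range(200):    # minutes at most on a laptop CPU
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    if step % 50 == 0:
        print(f"step {step}: loss {loss.item():.4f}")
```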
Andrew James (13:55): Yeah, that’s not going to go away; that’s not going to change if CPU starts to rise in popularity or there’s more utility that we can get out of it. CPU is the default backend, the default device type for tensors. It has been, and it will continue to be. And I’d say that even though, in terms of the effort we dedicate to optimization and reducing compute time, GPU gets the lion’s share of that work, CPU is still a first-class citizen. You wouldn’t be able to implement a new operator with a GPU-only implementation, right? You need to have the CPU and GPU functionality at least. And then there are other, let’s call them more exotic, more fringe backends that you can implement on top of that, but they’re not necessarily required for an out-of-the-box, first introduction of a new operator.
Brian Skinn (14:59): So the interfaces … that might be the best, the right word for it … that can route to either GPU or CPU in this case, what was the balance in terms of forethought planning, in terms of the structure of those interfaces versus just organic evolution as needs arose? Is there a spec that everything is being built to? Is it something that could be generalized that might be applicable to both TensorFlow and PyTorch and other machine learning tools? Or is it kind of intrinsically going to be bespoke to PyTorch?
Andrew James (15:35): Yeah, I mean, the abstraction strategy can certainly be extracted and used elsewhere. In terms of taking it as a standalone feature and using it as it exists in another library, that might be more difficult. But what we call that system is the dispatcher. Essentially, every tensor, based on its, say, layout or device type, will have a set of dispatch keys. And basically, at the time when you call a function, the front end is going to select the correct implementation of the API you’re calling based on these dispatch keys, essentially by looking at the highest-priority one. So the more specialized, the higher the priority: GPU will be selected over CPU, sparse would be selected over strided, and so on and so forth. In terms of how much forethought and organic growth went into the design, that kind of predates me, so I couldn’t speak concretely, but a lot of that system was designed to work around limitations of existing solutions. So, pybind11 is a popular method for binding C++ code and exporting it as a Python module, but essentially the way that it works is by making each operator a virtual function, and there’s a v-table in the background selecting the correct overload. If you try to do that with PyTorch, you run into an issue where essentially all of these implementations have to be in the same library, because virtual function calls can’t cross over into another DLL. And that’s problematic because we build the shared object with all the GPU kernels as a separate entity from the CPU kernels, and other backends are implemented the same way. So the dispatcher’s design and initial setup was, in part, about figuring out how to do this while working around the limitations of the technologies that already existed. So in that sense, as far as I’ve experienced, it’s somewhat novel, but it’s also not a very common problem: PyTorch is a rather specialized library dealing with workflows around tensor manipulations and tensor algebras, where a plethora of backends is a very attractive feature. In terms of general software development, that’s not necessarily something that you need to consider in exactly the same way.
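To make the dispatcher’s behavior concrete, here is a small illustration of its observable effect (not its internals): the same torch.mm call routes to different kernels depending on the tensors’ device and layout.

```python
import torch

a = torch.randn(4, 4)
b = torch.randn(4, 4)
out_cpu = torch.mm(a, b)            # strided CPU kernel

a_sparse = a.to_sparse()            # the sparse COO layout adds a sparse dispatch key
out_sparse = torch.mm(a_sparse, b)  # routes to the sparse mm kernel instead

if torch.cuda.is_available():
    out_gpu = torch.mm(a.cuda(), b.cuda())  # CUDA kernel selected over CPU
    print(out_gpu.device)
```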
Sean Ross-Ross (18:12): No, I think if somebody was doing general software development, you’d go with something like the standard tooling for Python, right? Numba and NumPy. And the reason why you would use PyTorch is obviously for machine learning: the neural network capabilities and autograd. But I’d be willing to wager that Numba and NumPy run faster on CPU natively than, say, a PyTorch model would, although I don’t know if those benchmarks exist. I know that JAX has a note in their documentation saying that, yes, NumPy is faster for the CPU case, but they have all of these trade-offs written down.
Andrew James (19:05): Yeah, I haven’t actually seen any data about that. It’d be an interesting experiment to run. I know that recently there’s been a bit of work done to make the compiler compatible with the Array API. So that’s … coming soon. I’m not on that development team, so I can’t tell you when exactly “soon” is. But, coming soon, you will be able to torch.compile code that you’ve written using NumPy, and basically it will be able to pick out the NumPy calls, forward them to the correct Torch function calls, and compile that as if it were a PyTorch program. So I’d be curious to look at it at three levels: looking at that system, looking at Numba, and looking at sort of native Torch on the CPU. That’s not to say that the CPU implementations are patently terrible, particularly when it comes to, say, the most important operations like matrix multiply. We are going to use state-of-the-art implementations from MKL, which is written by Intel. And if you’re using a CPU at this point, I’d say it’s a pretty good bet that you have an Intel CPU, and MKL will be able to do as much as you can hope for it to do when it comes to that rate-limiting step of the matrix multiply.
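As a hedged sketch of the NumPy-compilation path described here (this support later landed in torch.compile; the function and sizes below are made up for illustration):

```python
import numpy as np
import torch

# torch.compile can trace NumPy calls inside a function and lower them to
# Torch operations, so the NumPy code runs through the PyTorch compiler stack.
@torch.compile
def numpy_pipeline(x):
    y = np.sin(x) ** 2 + np.cos(x) ** 2
    return y @ y.T

x = np.random.randn(128, 128).astype(np.float32)
out = numpy_pipeline(x)   # inputs and outputs remain NumPy arrays
print(type(out), out.shape)
```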
Brian Skinn (20:31): So, in terms of the specific, I guess “optimizations” is a reasonable word for it, that are mentioned in the article: the hashing technique, which seems like it’s an approach for dealing with sparsity if I read it right, and then also the down-conversion, the quantization going from 16-bit floating point to 8-bit integer. Are either of those similar to things currently available in PyTorch? Would they be new features if they were added?
Andrew James (20:57): I know that PyTorch has some support already for quantization. I don’t know how far down it goes, in terms of, “Does it go to int8? It might stop at int16,” but it’s certainly there. And then yes, I believe the hashing technique that they talk about is sort of alluding to sparse layouts and that type of stuff, where essentially you have a lookup for the non-zero elements. That’s a very generic way of describing a sparse layout. I talked earlier about some of the things that we’ve been targeting with sparse and how they were motivated by research that essentially showed you could cut out a lot of data and still achieve high accuracy. Quantization is essentially discovering the same properties of these models in a slightly different way, and that’s that you don’t need all the bits of accuracy, the same way you don’t necessarily need all the values in the tensor. You can reduce the amount of data you have, essentially round things or shift priority onto the most important elements, and recover most of the functionality. And I think we’ll consider… I was speaking with someone from work recently, and they told me that there were folks working on quantized models that essentially went down to int3, which is only one bit more than basically not having a range there at all. You essentially have a discrete … sort of … like when you’re doing a survey: “How would you rate us? Five being very satisfactory.” It’s that kind of thing; you just have a few values you can choose from. So it’s pretty cool to see the amount of research that’s still going into shrinking and compressing these models so that they’re more nimble, in a sense. We can move them to more places where, say, memory is not a restriction, or the amount of compute you can sort of hoard at once is not as much of a restriction. At the same time, people are developing innovations with these massive models, and large language models are pretty huge, so I for one am very excited to sit where I am and see where things go and how things develop.
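For a concrete picture of the int8 path being discussed, here is a hedged sketch using PyTorch’s dynamic quantization API (the model and sizes are placeholders, not anything from the article):

```python
import torch
import torch.nn as nn
from torch.ao.quantization import quantize_dynamic

# Convert the Linear layers' weights to int8; activations are quantized
# dynamically at inference time. This shrinks the model and targets CPU inference.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
quantized = quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 512)
print(quantized(x).shape)
print(quantized)   # the Linear layers are replaced by dynamically quantized modules
```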
Brian Skinn (23:15): For sure. So Sean, from the user side, how directly do you choose these pathways? Do you find yourself digging into these optimizations directly and trying to choose which features within PyTorch you’re using for your model work or is most of that invisible to you?
Sean Ross-Ross (23:37): No, and I would encourage everybody to do something similar: don’t worry about that. Right at the beginning, you really just want to get something working. And as your model becomes more and more stable, especially if you’re doing a lot of experimentation or creating some novel models, you really don’t want to do any of these optimizations until you’re sure that it’s a model that’s going to last in the long term. And that comes a long way down the road. Now, that being said, there’s nothing stopping me from just doing a to-device call with CUDA and then seeing the speedups kind of happen once we go to the cloud. But in terms of optimizations, in my experience it’s very rarely worth it to do any of them, because of the cost of compute versus the cost of development versus how long you’re going to be running the model, at least until it becomes very heavily utilized. Like, let’s say people are running inference on the cloud, and you have a cloud service that takes advantage of it. There’s a certain point where you’re going to say, “Okay, I need to start optimizing this to reduce my costs,” whereas to get a service up and running and test it out in the first place, I would say there’s probably very little you should be doing with these optimizations.
Andrew James (25:04): Yeah, and I totally agree with what Sean’s saying. It’s very similar to the approach we take when we go to optimize algorithms on the backend. I’m not so much in the user seat, but it’s almost always going to be a waste of money to optimize prematurely. We need to study, we need to talk to our users, we need to profile. We need to see where our efforts are best spent. And then we need to analyze the algorithms, do the research, figure out if there’s anything for us to gain. And then, at that point, we can go set about trying to realize those gains. So I think it would not be advisable for someone to start employing things like quantization and sparsity from the outset, but it’s another step as they iterate and refine the model, closer to the end. Once they have something developed and working, then they can go to the literature, look at different pruning strategies or quantization methods, and experiment with them to see if they still achieve a result that’s satisfactory for their application. Because again, a lot of this research is … the paper will be published, and they have a particular model that they’re targeting, and that model was trained for a specific purpose. And so for some applications, like perhaps image recognition or video processing, yes, sparsity works very well; you can cut out some of the data and that’s fine. But if you go to other applications, it probably won’t work so well: applications where essentially your model is more highly connected and there’s more importance spread around across the weights in the various layers.
Brian Skinn (26:47): Yeah, that touches on something that occurred to me in terms of model accuracy: when you try to apply these optimizations, you have to … I figured you had to, and it sounds like you do … make sure that your model quality and your model accuracy stay where you need them to be as you’re trying to optimize.
Andrew James (27:03): Absolutely. And depending on the model, it will be more or less sensitive to these types of things. And ultimately the end result is whatever your business need or application is, is it accurate enough for you to provide value to your customers or to accomplish the goal that your software set out to do? And those are questions that you have to answer. Unless you’re in a situation where you’re doing something exactly the way it’s been done before, which is somewhat rare, you can’t necessarily rely on things that people have done to work exactly the same way for you.
Brian Skinn (27:42): Sure. Cool. We’re about on time. Any last thoughts before we wrap? Seems like a good conversation. I enjoyed it. I have a lot to learn when it comes to machine learning and PyTorch in particular, so this definitely was beneficial for me. Really appreciate it. Andrew, Sean, thank you for taking the time for the conversation. Again, I’m Brian Skinn, Community Manager here at OpenTeams. Check us out at openteams.com, and we’ll catch you next time. Have a good day.
Sean Ross-Ross (28:08): Thanks, everyone.
Andrew James (28:10): Thanks, Brian. Thanks, Sean.