“Well, it works on my machine” is a common phrase we have all heard. The other one is “that package isn’t available on the server”. Most organizations regularly contend with both of these issues. In this talk, we will cover how Conda-Store can be a solution to this tension between practitioners and administrators by providing a way to manage the entire lifecycle of data science environments inside organizations. Conda-Store adds namespacing, versioning, fine-grained permissions, and policy enforcement to the environments used in organizations as well as managing the process of building and publishing environments.
Dharhas Pothina
CTO at Quansight
Transcript
Steve: [00:00:01] Here with the second session of the day is Dharhas Pothina, who is the CTO of Quansight. He’s going to be talking about environment lifecycle management across teams with Conda-Store. I’m very excited about this session. Dharhas did an amazing session last time on picking between different open source dashboarding solutions. We’ll be sending you the recordings this week. Now, I’m always amazed at Dharhas and his leadership at Quansight. Quansight is one of the biggest and one of the best consulting companies that has really adopted open source and leads in many ways, especially on the PyData stack and AI/ML. And they have a lot of maintainers and founders walking their halls, working on open source, working on projects. So Dharhas, to kick us off, I mean, that’s impressive. How do you do it? How do you lead in a way that keeps amazing talent in a company when it’s so hard to retain talent? What are your secrets?
Dharhas: [00:01:13] Yeah, so we actually have a fairly low attrition rate for a company our size. And I think part of it is our mission and our atmosphere. Part of our mission at Quansight is to make open source sustainable, and we give people time to do that. We have an associated group called Quansight Labs that does pure open source work. We also make sure that people can work on interesting things, and we support the open source. And atmosphere-wise, we have a learning atmosphere. We work as a team and there’s no question that’s too silly. There’s no question that’s too dumb. And we make sure that people know that if they’re stuck, they can reach out to anyone in the company, and we make it a good place to work. And I think that’s a large part of how we retain people: the focus on our open source mission and making it a place where people enjoy working. I’d say those are the two big things.
Steve: [00:02:19] I’d love to get the background here, because you’re a consulting company, you work with open source, you help big enterprises tackle some massive data problems, and I want to know more about Conda-Store. That’s what you are going to talk to us about today, but how was that incubated? How did that start? That’s my last question, so please get started after this, but I’d be very curious to know.
Dharhas: [00:02:45] Yeah. So primarily we have a fundamental thesis that I think people have started to understand in the last ten years. And it’s the fact that you are not going to be able to succeed as a company unless you leverage open source effectively. But with that, there’s the additional piece that any time you’re building internal software or proprietary software, you’re taking on the maintenance burden of that software and also the intellectual burden of that software on your organization. And what I push companies and enterprises to do is push things back to the open source community as much as you can, and then just maintain a very thin layer of internal software. And when you have the ability to put new features into the upstream open source libraries versus trying to build it on your own, you’ll have access to a larger set of eyes, a larger set of brainpower. Of course, you will have stuff that’s internal to your organization, but that needs to be your special sauce. It needs to be the stuff that’s specific to your own organization. So that’s how I’d say you can be successful with open source.
Steve: [00:04:00] And that’s where Conda-Store came from? You were building it out for clients?
Dharhas: [00:04:07] Conda-Store came from seeing the same problems at multiple clients and saying there is a better way to solve this problem. And if we make it an open source project, we can leverage the community and leverage the problems that everyone has and build something that’s applicable across a lot of domains. So, to go into this, I’m basically going to be talking about one of the biggest pain points most organizations have, and it’s managing software environments. I’m going to specifically focus on data science software environments. I’m going to have a couple of slides to explain why this is such a problem, and this is a major problem. Almost every organization we go to, big and small, has a problem with data science software environments: people doing things in research, trying to get those environments to production, this whole area. And I’ll talk through some of the reasons why this is so difficult and then some of the solutions we’ve come up with.
Steve: [00:05:18] It sounds great. I’ll let you get started here and go ahead.
Dharhas: [00:05:24] Okay, so to start off with, data science is built on an open source ecosystem. Here’s a recent paper that was published in Nature that kind of shows the Pydata stack or the scientific Python stack. It’s based on NumPy at the bottom, and then you have other libraries which are foundational, and then you have the library specific to certain techniques like statistics or network analysis. And then on top of that, you have domain specific libraries and application specific libraries. This whole ecosystem moves very fast. It’s incorporating the latest research all the time. It uses multiple languages and compilers like this Python code or some of this R code. Underneath, it might actually be depending on Fortran or C++ or other languages so this is a multi language ecosystem. It’s not a single language ecosystem like some large Java apps might be. It includes thousands and thousands of individual and corporate contributors. Now, if you take the sheer number of contributors, the sheer number of languages involved, and the fact that everyone is trying to incorporate the latest research and the cutting edge stuff, this whole ecosystem is inherently not backwards compatible. There are some famous examples like back when Sun Microsystems was around. You could take a binary from Solaris one and drop it into Solaris 12 and it would just run. You had that detailed focus on backwards compatibility. That is not true of data science. Data science, the whole scientific ecosystem is designed for moving fast, doing the latest research.
[00:07:20] Now, this means that data science environments are extremely complex. They are very hard to create, very hard to maintain and even harder to share. And you will see this a lot. In most organizations, you hear this phrase a lot: it works on my machine. This is one of the most common issues organizations face. Part of this is that data scientists, engineers, and developers often think of all these libraries as independent software tools and not as libraries. And a typical pattern for most people, especially when they’re on their laptop or on a desktop, is they decide they need a tool, they need Python or they need a library like pandas or similar. And so they will install that software. They will either download and double click something or they’ll do something like a pip install or a conda install, and then they’ll install Python and then maybe pandas. And then they will realize they want scikit-learn, and they’ll install that. A month later, they may realize, oh, actually, I think for this project I need to use PyTorch, and they’ll install PyTorch. And then they realize that for a new project they need a different version of some software, and then they’ll install that. This happens on laptops and desktops, but it can also happen on servers. And you eventually end up with this very unique unicorn software environment that only works on that laptop or that server. Then you try to run it somewhere else, or move it to another machine, or move it into production. If you have a research team or a team building a machine learning model, and now you want to take that machine learning model and move it into production, you hit this issue: it works on my machine. And you can’t transfer it.
[00:09:28] So what do IT departments do? They dockerize everything. They put everything in containers. They ship the containers. Great. That is actually a good solution for reproducibility, except it’s not the entire solution, because you end up with the opposite situation. Usually the process of creating these containers and shipping things to production is laborious and slow, and it requires DevOps expertise and people with the right permissions on infrastructure. You end up with this situation where the software that is being used in the research environments is not available in the production system, and adding new software to the production systems can take months. In some cases it is longer: we had one client where it took a year to get a new version of a particular library onto the production system because of the change request required and the process to create the containers and move them into the production systems. So containers are part of the solution, but they’re not the entire solution. And the issue is, again, that the data science and AI/ML world moves really fast. You cannot anticipate ahead of time which software projects or packages you will need to solve the problem that you need to solve. And all these packages change continuously and they get updated and they have new features. And those new features are critical to moving effectively into the future.
[00:11:24] Now, yes, you could say we’re going to freeze all our software requirements at the beginning of the year and everyone has to use these libraries. And that is a reasonable choice. You can make that and you will solve one problem. And you’ll solve the problem of making things work in production, but you are going to severely hamper your competitive edge and the ability of your organization to use the latest technology and just imagine, a year ago we didn’t have some of the machine learning and AI tools that we have even today, like with DALL-E and some of these other things or GPT-3. These all didn’t exist a few years ago. And if you have a long cycle and if you restrict your researchers and your data scientists and your people doing analytics to a set of tools that was decided at a very slow pace, you’re going to severely limit the ability of your company to compete. So to some extent, this is a solved problem at the individual scale. One solution is Conda. Conda is a cross-platform package manager. And if you’re on an individual laptop, a lot of the problems with dealing with data science software and environments can be solved with Conda if you also put some best practices into place.
[00:13:02] Unfortunately, these best practices are not well known, and the tooling for these best practices is inadequate when you move away from thinking of how an individual will handle software environments to how an organization or a team will handle software environments. And so that’s why we came in and said, okay, we have a reasonable solution for an individual. How do we expand this solution for an organization or a team? And how do we enforce the best practices? You could enforce best practices by having a lot of training and teaching and telling people what the best practices are. But that is fragile, and we wanted to make something more robust. And I wanted to give a case study here of a project we’re actually currently working on at Quansight for a client, where we’re doing some brand recognition on imagery. We’re trying to work out what brands are in an image. And there’s an open source machine learning algorithm that is very powerful and has lots of good papers. And we wanted to test it out and see how well it would work for our client. And if you look at this screenshot, this software was released on November 30th, 2021, and that’s only like 11 months ago. And they wrote a requirements.txt file which says it requires PyTorch greater than 1.4 and torchvision and NumPy and a few things. Oh great, they wrote the requirements.txt.
[00:14:45] Unfortunately, we weren’t able to run this, because this is not sufficient to recreate the environment. In the last 11 months, all of the packages there have advanced and improved. And with the latest versions of PyTorch and torchvision, this algorithm does not work. All the tweaks were minor tweaks if we wanted to get it working with the latest version of torchvision and PyTorch, and that’s fine, but that took a couple of days. The other option we had was to try and go back in time and think: if we ran an install in November 2021, what would we actually get? And if we run an install today, what would we get in terms of version numbers? And so this is one of the key issues. You can do some forensics and you can go back and try and rebuild what the state of the data science world was at a particular date, but there’s no easy way to do that. Recently, I was trying to pick up some software that I wrote about 10 or 15 years ago. I had a colleague who wanted to use it for something, and I tried to see if I could recreate the environment that it was created in. It was very difficult. And a part of it is that the tooling is not there to capture what you need to capture.
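To make the point about loose requirements concrete, here is a minimal Python sketch of why a constraint such as "PyTorch greater than 1.4" resolves differently depending on the install date. The version lists are illustrative assumptions, not a record of what any package index actually contained, and `newest_allowed` is a hypothetical helper rather than how pip or Conda resolve packages.

```python
def parse_version(text):
    """Turn a dotted version string such as '1.10.0' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def newest_allowed(available, minimum):
    """Return the newest available version that satisfies '>= minimum'."""
    floor = parse_version(minimum)
    allowed = [v for v in available if parse_version(v) >= floor]
    return max(allowed, key=parse_version)


# Hypothetical snapshots of what an index offered on two different dates.
available_at_release = ["1.4.0", "1.9.1", "1.10.0"]
available_a_year_later = available_at_release + ["1.12.1", "1.13.0"]

# The same requirement line picks a different version on each date.
print(newest_allowed(available_at_release, "1.4"))
print(newest_allowed(available_a_year_later, "1.4"))
```

The requirement text never changed between the two calls; only the set of published versions did, which is why a lower bound alone cannot recreate the original environment.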
[00:16:17] So with Conda-Store, which is a new tool, we’ve got three major design goals. At the top is reproducibility. We need to make sure that any software environment you create is fully reproducible and fully captured, and that it can be used in multiple contexts. We don’t want to say you have to use it in a Docker image and a Docker container. That’s one of the contexts, but you might also want something lighter weight. And so we want to be able to capture the environment, and you might want to use it in a cloud setting, in a local setting, in a research environment, in a production environment. We want to make sure it’s transportable across all those environments or contexts. We want it to be easy to use, because the data scientists and engineers and software developers need to be able to quickly and easily create environments and install the software they need without going through a long process to do that. And I’m going to talk a little bit more about this, because on the flip side, in an organization, you have governance issues. You want to enforce policies. You want to make sure folks are only installing approved software. You might have constraints on what software goes into production. And we want to enable this balance. What I’ve seen in organizations is you have two extremes. On one extreme, you have a completely locked down production environment, and adding packages to that environment happens rarely and involves a very long change control process. And then when they update the environment with the new packages, anyone who depended on the older packages is now done for, because that’s the new production environment.
[00:18:06] The other thing I’ve seen is a free-for-all where the data scientists and engineers can install whatever they want, and soon it becomes a mess because no one knows what was installed for each project and there’s no way to share. You share a project or an algorithm, and no one else can run it because no one knows which versions of software were used in it. So we’re trying to find the balance between these three things: reproducibility, ease of use and governance. And we do this with Conda-Store, and I’m going to go through this UI a little bit in the next couple of slides. But essentially Conda-Store controls the environment lifecycle and it enforces best practices. It manages environment specifications. It builds those specifications in a scalable manner, and it serves the environments in multiple contexts: on a file system, as tarballs, or by pushing things to a Docker registry. And I’m going to go through the individual features of the software. This is client-server software, so Conda-Store runs as a server. There’s a REST API and a UI, and then you can pull environments from it or you can have it build environments and place them in different locations.
[00:19:27] So first of all, both the API and the UI enforce and guide you to best practices. Instead of you progressively adding packages and saying, okay, I need pandas, now I need scikit-learn, now I need NumPy, now I need PyTorch, we give you an API and an interface where we make a clear distinction between the packages you’ve requested with version constraints, like I want Python 3.8 or I want a certain version of pandas, and places where you haven’t made version constraints. And instead of you installing things one by one, we always build a specification and then we use the specification to build an environment for you. This means that at all times you know exactly what you specified, and then you have an environment that was built from that. We distinguish between packages that you requested and packages that were installed as dependencies. Going to the next slide. The way we’re enforcing reproducibility is we build lock files and then we build artifacts. I’m missing a few images on this slide. I’m not sure what happened. The last two build artifacts should say Docker and a few other things. So the idea is that to make reproducible environments, you need to know the exact hashes of every package that was installed, and you need to have a lock file. This is best practice. This is what you use when you build a Docker container. Anyone who has started building environments ends up with some sort of lock file in whichever language they’re working in. But end users don’t want to specify the lock file. It’s too complicated. They want to say, I want pandas, I want scikit-learn, I want PyTorch.
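The distinction between what a user requested and what was actually installed can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `LockedPackage` class, the `split_locked` helper, the package versions, and the placeholder digests are all hypothetical and do not reflect Conda-Store’s real data model or lock file format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LockedPackage:
    """One fully pinned entry, as a lock file would record it."""
    name: str
    version: str
    sha256: str  # placeholder digests below, not real package hashes


def split_locked(requested, locked):
    """Separate locked packages into requested ones and pulled-in dependencies."""
    asked = [p for p in locked if p.name in requested]
    pulled_in = [p for p in locked if p.name not in requested]
    return asked, pulled_in


# The specification: what the user asked for. An empty string means
# "no version constraint".
requested = {"python": "3.8", "pandas": ""}

# The lock: what a solver hypothetically resolved that specification to.
locked = [
    LockedPackage("python", "3.8.13", "placeholder-digest-1"),
    LockedPackage("pandas", "1.4.2", "placeholder-digest-2"),
    LockedPackage("numpy", "1.22.3", "placeholder-digest-3"),
    LockedPackage("pytz", "2022.1", "placeholder-digest-4"),
]

asked, pulled_in = split_locked(requested, locked)
print("requested:   ", [p.name for p in asked])
print("dependencies:", [p.name for p in pulled_in])
```

Keeping the short specification and the exhaustive lock as two separate records is what lets the user edit the former while the tool regenerates the latter.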
[00:21:33] So we auto-generate the lock file and we version it, and we auto-generate the build artifacts like the Docker containers and the tarballs and so on. We version all those things, and so on one hand we have the usability of letting people build things. On the other hand, we have the exact lock files and artifacts that are required to go back and forth between versions of the environment. Now again, one of the things I mentioned is we’re talking about environments as a versioned thing. And so what we do with Conda-Store is you have an environment, you give it a name, you give it metadata, and then we version the entire environment. People think of versioning packages, you know, I have Conda, I have pandas 1.4 or something, but here we’re actually taking that concept and saying, I have a collection of packages and now I’m versioning it. This is the production data science environment or production machine learning environment, and this is the version as of May 31st, 2022. But I can also switch it to a different environment or an older version. And if I add packages, it will create a new version from scratch. And so you can move backwards and forwards through these environments. And since these have unique hashes, you can also embed the environment hash in your script or in your Jupyter Notebook, so you have exact knowledge of what environment was used in a particular project.
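One way to picture a hash that identifies a whole environment version is to hash a canonical form of its pinned package list. The sketch below is an assumption for illustration only: `environment_hash` is a hypothetical function, the package pins are made up, and Conda-Store may derive its build identifiers differently.

```python
import hashlib
import json


def environment_hash(locked_packages):
    """Hash a set of (name, version, build) pins independent of their order."""
    canonical = json.dumps(sorted(locked_packages), separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical pins for one version of an environment.
version_1 = [
    ("python", "3.8.13", "build_a"),
    ("pandas", "1.4.2", "build_b"),
]

# Adding a package yields a new environment version with a new hash.
version_2 = version_1 + [("scikit-learn", "1.1.1", "build_c")]

hash_1 = environment_hash(version_1)
hash_2 = environment_hash(version_2)

# A script or notebook could record the hash it was developed against.
ENVIRONMENT_HASH_USED = hash_1
print(hash_1 != hash_2)
```

Sorting before hashing means the identifier depends only on which packages are pinned, not on the order in which they were listed.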
[00:23:10] The other part of this is the access control. As I said, we’re trying to balance the needs of, I guess, the few with the needs of the many. So we have multiple namespaces within Conda-Store. And if you look at this example, we’ve set it up where a particular person, John Doe here, has his personal environments, the ones he’s created. There’s a machine learning one and a web scraping one here. But then we also have shared environments for different groups, like a client or a division in your organization. And those can be read-only environments that John can use but not modify. Or if he is a team lead in one of those projects, maybe he’s the one who can modify them. And so this is the way we’re trying to balance the ability for individuals to create what they need and for management to have the governance they need, while everything is captured. Even if John Doe makes his own environment for some experiment, that is completely captured and could be migrated to a shared environment; all the artifacts are there and available. So three years from now, if you want to rerun that analysis with that script, you should be able to grab that environment again. With that, we’ve added some other governance features. We can restrict which mirrors or channels you’re using, or if you have an internal JFrog Artifactory or Anaconda team server and you have an internal mirror, we can say that it can only pull packages from those mirrors. We’ve also added some features which are pretty much unique to Conda-Store. For example, you may want to allow users to create environments but have certain required packages. You might have an internal package that you want every environment to have, one that has access to your internal data sets. You can say, okay, these internal packages are required and will be inserted into every environment. You can also require certain versions of packages, like you have to use pandas 1.2, and so you can enforce that requirement.
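The governance rules described here (allowed channels, packages that must be in every environment, and mandated versions) can be sketched as a function applied to a specification before it is built. Everything in this Python example is hypothetical: the `apply_policy` function, the dictionary layout, the channel name `internal-mirror`, and the package `acme-data-access` are illustrative assumptions, not Conda-Store’s configuration or API. Overriding a user’s version with the mandated one, rather than rejecting the specification, is also just one possible design choice.

```python
import copy


class PolicyError(ValueError):
    """Raised when a specification violates an organizational policy."""


def package_name(requirement):
    """Extract the package name from a requirement such as 'pandas>=1.0'."""
    for separator in ("=", "<", ">", " "):
        requirement = requirement.split(separator, 1)[0]
    return requirement.strip()


def apply_policy(spec, allowed_channels, required_packages, pinned_versions):
    """Return a copy of spec with policy applied, or raise PolicyError."""
    disallowed = [c for c in spec["channels"] if c not in allowed_channels]
    if disallowed:
        raise PolicyError(f"channels not allowed: {disallowed}")

    result = copy.deepcopy(spec)
    dependencies = []
    for requirement in result["dependencies"]:
        name = package_name(requirement)
        if name in pinned_versions:
            # Replace whatever the user wrote with the mandated version.
            dependencies.append(f"{name}={pinned_versions[name]}")
        else:
            dependencies.append(requirement)

    # Insert any required package the user did not list.
    present = {package_name(d) for d in dependencies}
    for name in required_packages:
        if name not in present:
            dependencies.append(name)

    result["dependencies"] = dependencies
    return result


user_spec = {
    "name": "web-scraping",
    "channels": ["internal-mirror"],
    "dependencies": ["pandas", "scikit-learn"],
}

final_spec = apply_policy(
    user_spec,
    allowed_channels={"internal-mirror"},
    required_packages=["acme-data-access"],
    pinned_versions={"pandas": "1.2"},
)
print(final_spec["dependencies"])
```

The user’s original specification is left unmodified, so both what was requested and what policy produced remain available for auditing.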
And so we’re adding these governance features as we get requests and as things make sense. Last thing I’ll say is, again, I have an image not showing up. We actually use Conda-Store in production with several clients. We’ve worked heavily with Morningstar. They’ve helped co-develop the project with us. We have an open source data science platform called Nebari, and Morningstar has a data science platform called Analytics Lab. And Conda-Store is at the heart of both of those platforms. Unfortunately, the images and logos of those two did not show up. I’m not sure why. I’ll fix that when we post the slides. That’s it. I’ll take questions.
Steve: [00:26:13] Yes. First one is, how can they get a hold of this now? Do they go to conda.store? Can they start using this today? How do people get access to this?
Dharhas: [00:26:25] So conda.store is the website for the software, and the software is up on GitHub. Currently we’re getting ready to do a documentation revamp. And so if folks are trying it on their own, they might need to reach out to us on the GitHub repo if the documentation is not clear enough on how to use it. Also, Quansight consults around Conda-Store, and we can help you set it up and also add features as needed.
Steve: [00:27:03] It sounds good. This is amazing. I think Lua has been impressed with it and others as well. This is a very exciting technology to share. I don’t see any other questions here, but we appreciate your time Dharhas for sharing this with us. I think you’re going to make a bigger announcement at PyData in New York, correct? Are you going to be talking about this then?
Dharhas: [00:27:31] We submitted a talk for PyData in New York. It did not get accepted. We have another talk about our data science platform, Nebari, and there’ll be a little tutorial about Nebari, and we’ll be talking a little bit about Conda-Store in Nebari. But I’m in the process of writing a blog post that will go into more detail about these.
Steve: [00:27:47] Yes, and we will for sure feature you more on this one. We appreciate you here today. And we’re going to get the word out about this and what you have. We just had a quick one here. Lua says looks really useful indeed, downloading it now. So there we go. Lua please let Dharhas and us know what you think about it.
Dharhas: [00:28:09] Yeah, I will say we have an older UI that’s not as fancy as the one I showed in the presentation. The new UI is going to be released in the next week or two.
Steve: [00:28:24] A week or two, so Lua, stay tuned, but we’re looking forward to getting your feedback there. So, Dharhas, we appreciate that. Thank you so much. And we’re going to take a couple of minutes’ break, a one minute break here, to get Rob ready for his presentation. So thank you.