scikit-learn and Hugging Face join forces

Hugging Face is happy to announce that we’re partnering with scikit-learn to further our support of the machine learning tools and ecosystem.

At Hugging Face, we’ve been putting a lot of effort into supporting deep learning, but we believe that machine learning as a whole can benefit from the tools we release. Statistical machine learning is essential to the field, and scikit-learn is its dominant library, so we’re excited to partner and move forward together.

As of September 2022, the Hugging Face Hub already hosts nearly 4,000 tabular classification and tabular regression model checkpoints, and we are working to keep that trend going.

Support for the scikit-learn consortium

As of June 2022, Hugging Face is an official sponsor of the scikit-learn consortium, hosted at the Inria Foundation. Through this support, Hugging Face actively promotes the development and sustainability of scikit-learn, and we will now participate in the consortium’s technical committee.

Development support

To help sustain the development of the library, we’re happy to welcome Adrin Jalali and Benjamin Bossan to the Hugging Face team. Adrin is a core developer of scikit-learn as well as fairlearn, while Benjamin is the author of the skorch library and is now a contributor to scikit-learn.

Hugging Face is happy to support the development of scikit-learn through code contributions, issues, pull requests, reviews, and discussions.

Integrations to and from the Hugging Face Hub

“Skops” is the framework being actively developed to link the scikit-learn and Hugging Face ecosystems. With Skops, we hope to facilitate these essential workflows:

  • The ability to push scikit-learn models to the Hugging Face Hub (see the sketch after this list)
  • The possibility to try out models directly in the browser
  • The automatic creation of model cards, to improve model documentation and understanding
  • The ability to collaborate with others on machine learning projects

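To make the first of these workflows concrete, here is a minimal sketch of pushing a fitted scikit-learn model to the Hub with skops’ hub_utils module. It follows the pattern in the skops documentation, but skops is under active development, so argument names may shift; the repository id and token below are placeholders.

```python
# Minimal sketch: push a scikit-learn model to the Hugging Face Hub with skops.
# Assumes the hub_utils interface from the skops docs; the repo id and token
# are placeholders, and argument names may change as skops evolves.
import pickle
from pathlib import Path
from tempfile import mkdtemp

import sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from skops import hub_utils

# Train a small model on a toy dataset.
X, y = load_iris(return_X_y=True, as_frame=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Persist the fitted model to a local file.
model_path = Path(mkdtemp(prefix="skops-")) / "model.pkl"
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Initialize a local repository with the model, a data sample, and metadata.
local_repo = mkdtemp(prefix="skops-repo-")
hub_utils.init(
    model=model_path,
    requirements=[f"scikit-learn={sklearn.__version__}"],
    dst=local_repo,
    task="tabular-classification",
    data=X.head(),
)

# Push the repository to the Hub (needs a write-enabled access token).
hub_utils.push(
    repo_id="my-user/iris-logistic-regression",  # placeholder repo id
    source=local_repo,
    token="hf_...",  # placeholder token
)
```
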
Snapshot of your work

Working at the intersection of scikit-learn and the Hub presents challenges tied to both platforms. One of these challenges is secure persistence: the ability to serialize models in a secure, safe manner.

scikit-learn models (estimators, predictors, …) are usually saved with pickle, a format that is notoriously insecure. Sharing scikit-learn models as pickles exposes recipients to potentially malicious payloads that can execute arbitrary code as soon as the file is loaded.
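
To make the risk concrete, here is a small self-contained demonstration (our illustration, not taken from the pull requests below) of why loading untrusted pickles is dangerous: any object can define __reduce__ so that pickle.loads calls an arbitrary function.

```python
# Why unpickling untrusted data is unsafe: __reduce__ lets an attacker make
# pickle.loads call any callable with arguments of their choosing.
import pickle


class Malicious:
    def __reduce__(self):
        # On unpickling, pickle will call print(...) here; a real attacker
        # would return something like (os.system, ("<shell command>",)).
        return (print, ("arbitrary code executed during unpickling!",))


payload = pickle.dumps(Malicious())
pickle.loads(payload)  # the code runs just by loading the bytes
```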

That’s where secure persistence comes in: as the Hugging Face Hub aims to provide a platform for models, the ability to share safe, secure objects is essential. We’ve been working on adding secure persistence for scikit-learn models in skops#128 and skops#145 (doc preview). Instead of serializing using pickle, the object’s contents are put into a zip file with an accompanying schema JSON file.
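
For a sense of what this looks like in practice, below is a sketch based on the skops.io interface that grew out of this work (dump, load, and get_untrusted_types). At the time of the pull requests above the API was still in flux, so treat the names and arguments, especially `trusted`, as indicative rather than definitive.

```python
# Sketch of skops' secure persistence (skops.io). The exact API, notably the
# `trusted` argument, may differ between skops versions.
from sklearn.datasets import load_iris
from sklearn.ensemble import HistGradientBoostingClassifier
from skops.io import dump, get_untrusted_types, load

X, y = load_iris(return_X_y=True)
model = HistGradientBoostingClassifier().fit(X, y)

# Save: the estimator's contents go into a zip archive with a JSON schema,
# rather than a pickle.
dump(model, "model.skops")

# Before loading, inspect which types the file would require you to trust.
unknown_types = get_untrusted_types(file="model.skops")
print(unknown_types)

# Load, explicitly trusting only the reported types.
loaded = load("model.skops", trusted=unknown_types)
print(loaded.score(X, y))
```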

Read about the Skops library in the following blog post: Introducing Skops.

Improving interoperability

Skops is one example of integrating scikit-learn with our tools, but it is not the only one! We will strive to integrate scikit-learn with the rest of our ecosystem so that Hugging Face users can benefit from scikit-learn tools and vice versa.

An example is the evaluate library, dedicated to efficiently evaluating machine learning models and datasets. We aim for this tool to natively support scikit-learn metrics in its API.
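
As a rough sketch of how the two already fit together, an evaluate metric can score predictions coming from a scikit-learn model; this reflects current usage rather than the planned native support, and the details may change.

```python
# Sketch: scoring a scikit-learn model's predictions with the evaluate library.
# This reflects current usage, not the planned native scikit-learn support.
import evaluate
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
predictions = model.predict(X_test)

# Load a metric from the evaluate hub and compute it on the predictions.
accuracy = evaluate.load("accuracy")
print(accuracy.compute(predictions=predictions, references=y_test))
```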


Through these efforts, we hope to kickstart a lasting relationship between the two ecosystems and provide simple, efficient bridges to lower the barrier of entry. We believe that educating and sharing models is the best way to foster inclusive machine learning from which all can benefit. We’re excited to partner with scikit-learn for this endeavor.
