Interview with Lucy Liu, scikit-learn Team Member

Lucy Liu joined the scikit-learn Team in September 2020. In this interview, learn more about Lucy’s journey through open source, from rstats to scikit-learn.

  1. Tell us about yourself.

    My name is Lucy; I grew up in New Zealand and am culturally Chinese. I currently live in Australia and work for Quansight Labs.

  2. How did you first become involved in open source?

    I first discovered open source when I started a research Masters, after finding my clinical Optometry job unfulfilling. I loved learning to program but was initially not game enough to contribute as I was only a beginner. After my Masters, while working as a bioinformatician, I wrote some R packages for analysing niche biomedical data and put them on GitHub. My first contribution to an existing open source project came later, when I worked at INRIA (French National Institute for Research in Digital Science and Technology) alongside the INRIA scikit-learn core developers. They helped me put up my first pull request and I have been contributing ever since!

  3. How did you get involved in scikit-learn? Can you share a few of the pull requests to scikit-learn that resonate with you?

    I’m very interested in statistics and code, so I was super keen to contribute to scikit-learn. Being a relative beginner in both areas, I started by contributing to documentation, then bug fixes and features. My first PR to scikit-learn was submitted in October 2019 to improve the multiclass classification documentation. I have contributed the most to the calibration module in scikit-learn (including refactoring CalibratedClassifierCV), which has been very interesting and proved useful when I later worked on post-processing of weather forecasts at the Bureau of Meteorology in Australia.
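    For context, here is a minimal sketch of how scikit-learn's CalibratedClassifierCV is typically used (not Lucy's specific changes); the toy data and parameter choices are illustrative assumptions:

    ```python
    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.svm import LinearSVC

    # Hypothetical toy data, purely for illustration.
    X, y = make_classification(n_samples=500, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # LinearSVC has no predict_proba; CalibratedClassifierCV fits the
    # classifier on cross-validation folds and calibrates its decision
    # scores so calibrated probability estimates become available.
    calibrated = CalibratedClassifierCV(LinearSVC(random_state=0), method="sigmoid", cv=5)
    calibrated.fit(X_train, y_train)

    print(calibrated.predict_proba(X_test)[:5])
    ```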

    Reference: Lucy’s list of pull requests

  4. To which OSS projects and communities do you contribute?

    I contribute to Sphinx-Gallery and scikit-learn. Sphinx-Gallery was a great introduction to open source for me as it is a small package that does not get a large number of issues and pull requests (unlike scikit-learn!).

  5. What do you find alluring about OSS?

    I think the ability to see the source code and contribute back to the project are the best parts. If there is a feature you are interested in you can suggest and add it yourself, all the while learning from code reviews during the process!

  6. What pain points do you observe in community-led OSS?

    I think some of the positive aspects of the OSS community can also lead to pain. While it is great that you can get many different perspectives from people of various backgrounds, this also makes forming a consensus more difficult and slows progress. People from any geographical location can work together asynchronously, but this can also mean people work in their own silos, making it hard to maintain a cohesive direction for the project. Large projects also have a steep learning curve, which is daunting both for new contributors and for contributors interested in becoming core developers. The latter is a problem when the project lacks core-developer time for maintenance and reviewing PRs.

  7. If we discuss how far OS has evolved in 10 years, what would you like to see happen?

    Some system that enables continuity of funding and can combine funds from public and private sources. This would enable long-term planning of OS projects and give developers more job stability. Better coordination between projects within the same area (e.g., scientific Python) would also give users of Python a better experience for their projects.

  8. What are your favorite resources, books, courses, conferences, etc?

    Real Python has great tutorials, and regex101 makes regular expressions so much easier to write and review!

    I also love the YouTube channel StatQuest, which explains statistical concepts in a very accessible manner and introduces videos with a jingle – what more could you want?

  9. What are your hobbies, outside of work and open source?

    I love cycling and feel strongly about designing cities for people instead of cars. I also enjoy rock climbing (indoors and outdoors), though sadly have not had much time for this recently.
