Everyone involved in exploratory data analysis has a preferred workflow. These workflows usually combine inspecting raw data, calculating summary statistics, creating visualizations, and cleaning data, applied iteratively. In Python, Jupyter and pandas are the de facto standard tools for this task. However, their default interfaces often make data exploration inefficient: tabular data rendered as formatted text is hard to explore, and common data cleaning and summarization tasks require tedious, error-prone manual code entry.
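To make that pain point concrete, here is a minimal sketch of the manual inspect, clean, and summarize loop in plain pandas. The toy dataframe, its column names, and the chosen summary measures are hypothetical and only stand in for a real dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for a freshly loaded dataset.
df = pd.DataFrame({
    "city": ["Boston", "boston ", None, "Chicago"],
    "temp_f": ["71", "68", "n/a", "75"],
})

# Step 1: inspect the raw rows (rendered as formatted text by default).
print(df.head())

# Step 2: clean by hand, one column at a time.
cleaned = df.copy()
cleaned["city"] = cleaned["city"].str.strip().str.title()
# Non-numeric strings such as "n/a" become NaN.
cleaned["temp_f"] = pd.to_numeric(cleaned["temp_f"], errors="coerce")

# Step 3: assemble per-column summary statistics by hand.
summary = pd.DataFrame({
    "dtype": cleaned.dtypes.astype(str),
    "null_count": cleaned.isna().sum(),
    "distinct": cleaned.nunique(),
})
print(summary)
```

Each new dataset, and each new question about it, tends to mean retyping some variation of these steps.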
Buckaroo is a Jupyter-based wrapper around pandas dataframes that streamlines exploratory data analysis. Beyond an improved interface, its built-in capabilities for computing summary statistics, cleaning data, and displaying dataframes can accelerate many workflow steps without further customization. Where the defaults aren’t suitable, Buckaroo can be customized and extended to fit the need. Specific Buckaroo features include:
- A performant table widget based on AG Grid
- Sensible, override-able defaults for data cleaning
- Automatic summary statistic calculations for every column
- Lightweight histograms rendered directly in column headers
- Smart downsampling for faster rendering
- A low-code UI with codegen to quickly perform common tasks like group-by and fillNA
- Interactive extensibility for every feature (summary measures, data cleaning, and more)
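As a rough illustration of what the histogram and downsampling features save you from writing, the sketch below computes both by hand with NumPy and pandas. It is not Buckaroo's implementation or API: the random data, the bin count, and the `MAX_ROWS` threshold are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric column with 100,000 rows.
rng = np.random.default_rng(0)
df = pd.DataFrame({"value": rng.normal(size=100_000)})

# Bin counts of the kind a small per-column histogram could be drawn from.
counts, edges = np.histogram(df["value"].dropna(), bins=10)

# Downsample before display so rendering cost stays bounded.
MAX_ROWS = 5_000  # assumed display budget, not a Buckaroo setting
if len(df) > MAX_ROWS:
    # Random sample, then restore the original row order.
    sample = df.sample(n=MAX_ROWS, random_state=0).sort_index()
else:
    sample = df

print(counts)
print(len(sample))
```

A plain random sample is the simplest possible choice here; a tool built for exploration may well use a smarter strategy.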
This talk will demonstrate Buckaroo as a tool, compare Buckaroo workflows to the default Jupyter/pandas flows, illustrate how it can be extended to fit different needs, and review the architecture of the project. Come prepared with pain points from your current EDA workflow with pandas & Jupyter! Paddy will demonstrate how Buckaroo can help smooth them out.
View the recording here.