The PyData ecosystem is home to one of the largest and most successful open source communities. It’s both where most newcomers to data science start and also where cutting edge research takes place. It has been able to support the diverse needs of its users through its decentralized nature, promoting creativity and collaboration.
As the size of data has increased and our compute has moved off of our single CPUs, the nature of libraries has evolved. Whereas in the past client code would generally call out to fast pre-compiled libraries (SciPy, NumPy, etc.), now it often works via calls to a variety of distributed, out-of-core, and specialized compilation and computation backends (PyTorch, Dask, Numba, Ibis, etc.). This means a growing number of libraries do not eagerly execute a computation in the CPython interpreter, but instead optimize and translate it to some other target.
At a high level, we can see this ecosystem as a large decentralized, embedded, domain-specific compiler, translating from high-level user expressions to different low-level primitives. This calls for an exploration of tooling to help enable this translation of programs between different representations, to facilitate the efficient use of code across this distributed ecosystem.
One approach to automating this translation among different representations is the rewriting technique called “equality saturation.” This allows us to construct a data structure of equivalent programs (an ‘e-graph’), and then search that space for a functionally-equivalent program that has desirable characteristics such as improved performance or memory efficiency. Building this translation tooling once can enhance sharing and collaboration between the libraries which use it.
In this talk, Saul Shanabrook goes over how e-graphs work, how they were developed, and different ways they can be used in the PyData ecosystem. Saul also surveys the egglog library, which is one specific tool for using e-graphs in Python.
Saul Shanabrook @ GitHub: https://github.com/saulshanabrook
egglog @ GitHub: https://github.com/egraphs-good/egglog
egglog Python bindings docs: https://egg-smol-python.readthedocs.io/en/latest/
More on e-graphs and equality saturation: https://blog.sigplan.org/2021/04/06/equality-saturation-with-egg/
July 20, 2023