PyData-APIs: An Open Source Standard for Tensors and DataFrames
Machine learning is becoming critical infrastructure for many organizations. At the same time, the tensor and dataframe infrastructure at the foundation of machine learning is fragmented, which makes it difficult for a robust ecosystem to develop around the state of the art.
Even as Python has become a common language for expressing workflows, the foundational data structures used in Python, such as n-dimensional arrays (tensors) and dataframes, have become more divided. For example, tensors are fragmented among TensorFlow, PyTorch, NumPy, CuPy, MXNet, Xarray, Dask, and others. Dataframes are fragmented among Pandas, PySpark, Arrow, RAPIDS, TuriCreate, Vaex, Modin, Dask, Ray, and more.
The differences between these projects provide marginal benefit but also cause inconsistencies and incompatibilities that require additional effort, limit participation, prevent re-use, and slow progress. Innovation and differentiation are most valuable when created using standard building blocks that ensure consistency and compatibility.
Quansight Labs is forming a consortium of leading companies to construct a standard for both tensors and dataframes, the critical building blocks of machine-learning workflows. The standard will include APIs and specifications for both tensors and dataframes, and its creation and later dissemination will be coordinated with the wider open-source community. Consortium members will be able to participate early in the process of defining the standard, including a basic reference implementation. The early requirements-gathering phase, which leads to the first draft of the standard, will be conducted with consortium members, including key community participants. This drafting phase will be followed by public dissemination, discussion, and refinement of the standard within the NumPy, Pandas, and wider PyData communities. During these public phases, participating consortium members will also receive status reports and consultation from Quansight Labs, which may lead to further private refinement.
Quansight Labs is well positioned to establish this Consortium because of its expertise with the technologies involved, its understanding of the open-source process, its strong relationships with many of the relevant projects, and its lack of commercial interest in the particulars of the standard.
We are looking for around 10 member companies to join the Consortium. Members will be able to participate in the private sessions in which the standard is drafted. Membership will allow members to significantly influence the standard and will give them insight into its details before public dissemination. The funding obligation is $50,000 per member, which supports the development of the standard and an early reference implementation. This work is expected to take about 6 months.
Definition of the standard has commenced, and there is a limited window to participate in and influence it. Several organizations have already committed to the Consortium, including Quansight, OmniSci, Anaconda, Intel, Microsoft, and other participants who are not named here. We expect to close membership in the Consortium by the first week of May 2020.
N-dimensional array and computational libraries for Python.
Module designed for scientific Python that provides accessible solutions to machine learning problems.
CuPy is an open-source matrix library accelerated with NVIDIA CUDA. It also uses CUDA-related libraries including cuBLAS, cuDNN, cuRAND, cuSOLVER, cuSPARSE, cuFFT, and NCCL to make full use of the GPU architecture.
Dask is an open-source library for natively scaling Python. It builds on existing Python libraries like NumPy, pandas, and scikit-learn to enable scalable computation on large datasets. In addition, Dask provides a general-purpose framework that lets advanced users build their own parallel applications. Dask enables analysts to scale from a multi-core laptop to a thousand-node cluster.
pandas is an open source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. pandas’ data analysis and modeling features enable users to carry out their entire data analysis workflow in Python without having to switch to a more domain-specific language like R.
SciPy provides many user-friendly and efficient numerical routines for Python, such as routines for numerical integration, interpolation, optimization, linear algebra, and statistics.