Python Data APIs


A massive amount of effort is going into n-dimensional array (or tensor) libraries for deep learning and numerical computing. The Python array computing, data science, and deep learning space is currently fragmented. Community projects like NumPy, SciPy, and Scikit-learn do not support execution on GPUs, and deep learning frameworks are largely not interoperable. From both community and business points of view, there is a need for a platform to collaborate and for tools and agreements to build an ecosystem.

Python Data APIs: Open Source Standards for Tensors and DataFrames

Machine Learning is becoming critical infrastructure for many organizations. At the same time, the current tensor and dataframe infrastructure at the foundation of machine learning is fragmented. This makes it difficult for a robust ecosystem to develop around the state of the art.  


Even as Python has become a common language for expressing these workflows, the foundational data structures used in Python, such as n-dimensional arrays (tensors) and dataframes, have become more divided. For example, tensors are fragmented between TensorFlow, PyTorch, NumPy, CuPy, MXNet, Xarray, Dask, and others. Dataframes are fragmented between Pandas, PySpark, Arrow, RAPIDS, TuriCreate, Vaex, Modin, Dask, Ray, and more.
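To make the fragmentation concrete: even a basic operation like concatenating arrays along an axis is spelled differently across libraries today (NumPy's `concatenate(..., axis=)`, PyTorch's `cat(..., dim=)`, TensorFlow's `concat(..., axis=)`). The sketch below illustrates the kind of per-library dispatch that downstream code is currently forced to write; the `standard_concat` wrapper is a hypothetical illustration, not part of any existing standard.

```python
# The same logical operation -- concatenation along an axis -- has a
# different function name and a different axis keyword in each library.
# (library module name, function name, axis keyword argument)
CONCAT_SPELLINGS = {
    "numpy":      ("concatenate", "axis"),  # np.concatenate(arrays, axis=0)
    "torch":      ("cat",         "dim"),   # torch.cat(tensors, dim=0)
    "tensorflow": ("concat",      "axis"),  # tf.concat(values, axis=0)
}

def standard_concat(namespace, arrays, axis=0):
    """Dispatch one 'standard' call onto a library-specific spelling.

    `namespace` is an imported module (e.g. numpy or torch); the lookup
    table above translates the call into that library's own API.
    """
    fn_name, axis_kw = CONCAT_SPELLINGS[namespace.__name__]
    fn = getattr(namespace, fn_name)
    return fn(arrays, **{axis_kw: axis})
```

A shared standard would make such adapter layers unnecessary: code written against the standard API would run unchanged on any conforming library.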


The differences between these projects provide marginal benefit but also cause inconsistencies and incompatibilities that require additional effort, limit participation, prevent re-use, and slow progress. Innovation and differentiation are most valuable when created using standard building blocks that ensure consistency and compatibility.


Quansight Labs is forming a consortium of leading companies to construct a standard for both tensors and dataframes, the critical building blocks of machine-learning workflows. The standard will include APIs and specifications for both data structures, and its creation and later dissemination will be coordinated with the wider open-source community. Consortium members will be able to participate early in the process of defining the standard, including a basic reference implementation. The early requirements-gathering phase leading to the first draft of the standard will be conducted with consortium members, including key community participants. This drafting phase will be followed by public dissemination, discussion, and refinement of the standard within the NumPy, Pandas, and wider PyData communities. Participating consortium members will also receive status reports and consultation from Quansight Labs during the public phases of the standard's construction, which may lead to further private refinement.


Quansight Labs is well positioned to establish this Consortium because of its expertise with the technologies involved, understanding of the open-source process, strong relationships with many of the relevant projects, and because Quansight Labs does not have commercial interest in the particulars of the standard.


We are looking for around 10 member companies to join the Consortium. Members will be able to participate in the private sessions that result in the drafting of the standard. Membership will not only allow members to significantly influence the standard; it will also provide insight into its details prior to public dissemination. The funding obligation is $50,000 per member, which supports the development of the standard and an early reference implementation, expected to take about 6 months.


Definition of the standard has commenced, and there is a limited window in which to participate and influence it. Several organizations have already committed to the consortium, including Quansight, OmniSci, Anaconda, Intel, Microsoft, and other participants who are not named here. We expect to close membership in the Consortium by the first week of May 2020.