OpenTensors: High Level APIs for Arrays, DataFrames, and DataTypes

A massive amount of effort is going into n-dimensional array (or tensor) libraries for deep learning and numerical computing. The Python array computing, data science, and deep learning space is currently fragmented. Community projects like NumPy, SciPy and scikit-learn do not support GPUs, and deep learning frameworks are largely not interoperable. From both community and business points of view, there is a need for a platform to collaborate and for tools and agreements to build an ecos...
  • OpenTensors_Proposal.pdf

Goals Include:

  • A community-driven alternative to the PyTorch and TensorFlow ecosystems that can also work with them, based on a collection of interoperable projects and standards for deep learning. This requires auto-differentiation, graph construction and optional lazy evaluation, and support for GPUs and for distributed and sparse arrays across libraries.
  • The ability to use OpenTensors in domains other than deep learning, such as general data science and scientific computing. This is as important as deep learning: many applications can use differentiable computing and need the performance and flexibility that lazy evaluation and GPUs provide.
  • Building a community of companies, projects and people to reduce duplication of effort and to accelerate the development of the Python AI, data science and scientific computing ecosystem.
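
To make the terms in the first goal concrete, here is a minimal pure-Python sketch of graph construction, lazy evaluation and reverse-mode auto-differentiation on scalars. It is an illustration only and not part of any OpenTensors design: the names `Node`, `constant`, `topo_order`, `evaluate` and `backward` are hypothetical and were chosen for this example.

```python
class Node:
    """A node in a lazily built expression graph (hypothetical sketch)."""

    def __init__(self, op, parents=(), value=None):
        self.op = op            # "const", "add" or "mul"
        self.parents = parents  # input nodes of this operation
        self.value = value      # set for constants, or by evaluate()
        self.grad = 0.0         # set by backward()

    def __add__(self, other):
        return Node("add", (self, other))

    def __mul__(self, other):
        return Node("mul", (self, other))


def constant(x):
    return Node("const", value=float(x))


def topo_order(root):
    """Return the nodes so that every node comes after its parents."""
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent in node.parents:
            visit(parent)
        order.append(node)

    visit(root)
    return order


def evaluate(root):
    """Forward pass: no arithmetic happens until this is called."""
    for node in topo_order(root):
        if node.op == "add":
            a, b = node.parents
            node.value = a.value + b.value
        elif node.op == "mul":
            a, b = node.parents
            node.value = a.value * b.value
    return root.value


def backward(root):
    """Reverse pass: accumulate d(root)/d(node) into node.grad."""
    order = topo_order(root)
    for node in order:
        node.grad = 0.0
    root.grad = 1.0
    for node in reversed(order):
        if node.op == "add":
            a, b = node.parents
            a.grad += node.grad
            b.grad += node.grad
        elif node.op == "mul":
            a, b = node.parents
            a.grad += node.grad * b.value
            b.grad += node.grad * a.value


x, y = constant(3.0), constant(4.0)
z = x * y + x          # builds the graph only; nothing is computed yet
result = evaluate(z)   # forward pass over the graph
backward(z)            # fills x.grad (= y + 1) and y.grad (= x)
```

Real libraries apply the same idea to whole arrays and can optimize or compile the graph before running it, which is where the choice of intermediate representation comes in.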

Achieving these goals will require innovation and community consensus building in equal measure. Community projects need more capacity to engage with faster-paced deep learning frameworks and to define or adopt extension points and API alignments.

Innovation will consist of a mix of new libraries (e.g. for auto-differentiation), accelerating existing projects (e.g. pydata/sparse for sparse arrays) and choosing between, or bridging, competing technologies (e.g. MLIR, TVM IR or Weld IR as the intermediate representation to target).

Quansight is aiming to bring companies and community projects together, and coordinate a concerted effort to build the OpenTensors community, standards and tools, and reference implementations.

Theano

Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently.

Pomegranate

pomegranate is a Python module for fast and flexible probabilistic modeling inspired by the design of scikit-learn.

Statsmodels

Statsmodels is a Python package that provides a complement to SciPy for statistical computations including descriptive statistics and estimation of statistical models.

CuPy

CuPy is an open-source matrix library accelerated with NVIDIA CUDA. It also uses CUDA-related libraries including cuBLAS, cuDNN, cuRAND, cuSOLVER, cuSPARSE, cuFFT and NCCL to make full use of the GPU architecture.

Chainer

Chainer is a powerful, flexible and intuitive deep learning framework. Chainer supports CUDA computation. It only requires a few lines of code to leverage a GPU. It also runs on multiple GPUs with little effort.

Dask

Dask is an open source library for natively scaling Python. It builds on existing Python libraries like NumPy, pandas, and scikit-learn to enable scalable computation on large datasets. In addition, Dask provides a general purpose framework to enable advanced users to build their own parallel applications. Dask enables analysts to scale from their multi-core laptop to a thousand-node cluster.
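
The general pattern behind this kind of scaling is to split the data into chunks, reduce each chunk independently, then combine the partial results. The sketch below shows that pattern using only the Python standard library. It is not Dask's API: `chunked` and `parallel_sum_of_squares` are hypothetical helpers written for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(seq, size):
    """Split seq into consecutive chunks of at most `size` items."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def parallel_sum_of_squares(values, chunk_size=1000, workers=4):
    """Reduce each chunk independently, then combine the partial results."""
    chunks = chunked(values, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda c: sum(v * v for v in c), chunks))
    return sum(partials)


data = list(range(10_000))
total = parallel_sum_of_squares(data)
```

Threads in CPython do not speed up pure-Python arithmetic like this, so the sketch shows only the structure of the computation. A scheduler can place the same chunked tasks on multiple processes or machines.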

Astropy

The Astropy Project provides software tools and infrastructure to facilitate research by professional astronomers. In addition to maintaining a core Python package, the Astropy Project supports the development of high-grade affiliated packages by members of the astronomical community.

PyMC3

PyMC3 is a Python package for Bayesian statistical modeling and Probabilistic Machine Learning which focuses on advanced Markov chain Monte Carlo and variational fitting algorithms. PyMC3 features intuitive model specification syntax, powerful sampling algorithms, variational inference, and transparent support for missing value imputation. It relies on Theano, which provides computation optimization and dynamic C compilation, NumPy broadcasting and advanced indexing, linear algebra operators, and simple extensibility.
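
For readers unfamiliar with Markov chain Monte Carlo, the following is a minimal random-walk Metropolis sampler in plain Python. It is a much simpler algorithm than the advanced samplers mentioned above and does not use PyMC3's API. The function names, the step size and the standard normal target are assumptions made for the illustration.

```python
import math
import random


def metropolis(log_density, start, n_samples, step=1.0, seed=0):
    """Draw samples from an unnormalized 1-D density via random-walk Metropolis."""
    rng = random.Random(seed)
    current = start
    current_logp = log_density(current)
    samples = []
    for _ in range(n_samples):
        # Propose a move by adding Gaussian noise to the current state.
        proposal = current + rng.gauss(0.0, step)
        proposal_logp = log_density(proposal)
        # Accept with probability min(1, p(proposal) / p(current)).
        accept_prob = math.exp(min(0.0, proposal_logp - current_logp))
        if rng.random() < accept_prob:
            current, current_logp = proposal, proposal_logp
        samples.append(current)
    return samples


def standard_normal_logp(x):
    """Log density of a standard normal, up to an additive constant."""
    return -0.5 * x * x


draws = metropolis(standard_normal_logp, start=0.0, n_samples=20_000)
mean = sum(draws) / len(draws)
variance = sum((d - mean) ** 2 for d in draws) / len(draws)
```

With enough draws, `mean` and `variance` should approach 0 and 1, the moments of the target distribution.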

Pandas

pandas is an open source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. pandas’ data analysis and modeling features enable users to carry out their entire data analysis workflow in Python without having to switch to a more domain-specific language like R.

SciPy

SciPy provides many user-friendly and efficient numerical routines for Python such as routines for numerical integration, interpolation, optimization, linear algebra and statistics.

NumPy

NumPy provides N-dimensional arrays and computational libraries for Python.
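
As a brief illustration of the N-dimensional array model, the sketch below uses broadcasting and boolean indexing, two NumPy features that several of the libraries above build on. The array contents are arbitrary example values.

```python
import numpy as np

# A 2-D array: 3 rows (samples) by 4 columns (features).
data = np.arange(12, dtype=float).reshape(3, 4)

# Broadcasting: the (4,) vector of column means is applied to every row.
centered = data - data.mean(axis=0)

# Boolean indexing: keep only the rows whose first column exceeds 3.
selected = data[data[:, 0] > 3]
```

After centering, each column of `centered` has zero mean, and `selected` holds the last two rows of `data`.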