TensorFlow Datasets is turning 4!

Posted by the TensorFlow Datasets team

The datasets landscape has changed a lot since TensorFlow Datasets (TFDS) was introduced about 4 years ago: TFDS made sharing and re-using datasets significantly easier, and transformed the landscape by inspiring other ML tools, libraries and services.

Loading a dataset went from complicated scripts to:

import tensorflow_datasets as tfds
ds = tfds.load('mnist', split='train')

for example in ds:  # example is `{'image': tf.Tensor, 'label': tf.Tensor}`
    print(list(example.keys()))
    image = example["image"]
    label = example["label"]
    print(image.shape, label)

Read the documentation for a more extensive introduction.

Over the years, TFDS has grown to become a recognized way to load datasets. To celebrate our latest release, 4.8.2, we would like to take some time to reflect on the progress and improvements made over the past years and thank the community for their support.

TFDS is still a library that facilitates downloading, preparing and loading datasets for ML pipelines, but it now supports hundreds of datasets and offers the following main features:

  1. A large variety of features with encoding and decoding, ranging from text to images, videos, audio and even RL-specific types (e.g. dataset of datasets).
  2. Large datasets support: TFDS is successfully used within Google to prepare and load large datasets (PBs) using high performance input pipelines.
  3. Dataset collections, to arbitrarily group together a number of existing TFDS datasets, for example used in a benchmark.
  4. Support for all main ML Python frameworks: yes there is “TF” in “TFDS”, but besides TensorFlow, one can use TFDS with Torch, Jax, NumPy, Keras and any other Python ML framework that can consume a tf.data.Dataset or a NumPy Iterator.
  5. Global shuffling at preparation time: it is good practice to shuffle training data, so TFDS optionally performs a global shuffle at preparation time in case the source data wasn’t already shuffled.
  6. Splits and slicing: datasets can specify their splits, and readers can specify which split(s), or which slices of splits, they want to read, e.g. test[:10%] to “load the first 10 percent of the test split”.
  7. Versioning and determinism: TFDS datasets and collections are versioned, so it is possible to reproduce experiments reliably. Loading a dataset pinned at a particular version will always return the same set of examples. This works with slicing and global shuffling too, as those are deterministic.
  8. Code-less sharing: TFDS can read TFDS prepared datasets even if the code used to prepare the dataset is not available. This facilitates sharing and versioning datasets.
  9. Community datasets and support for internal datasets within organizations: TFDS allows organizations to manage different corpuses of datasets and make them available to their internal users.
  10. Format-specific builders: to easily define datasets based on well-known formats such as CoNLL.
  11. GCS integration: TFDS works well with GCS.
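
As an illustration of the slicing, versioning and NumPy points above, here is a minimal sketch. The helper `percent_slice` and the function `load_first_tenth_of_test` are hypothetical names introduced for this example, not part of TFDS. The sketch assumes MNIST is available under a 3.x.y version and that the first call is allowed to download and prepare the data.

```python
def percent_slice(split, start=None, stop=None):
    """Build a TFDS split string such as 'test[:10%]' (hypothetical helper)."""
    lo = "" if start is None else f"{start}%"
    hi = "" if stop is None else f"{stop}%"
    return f"{split}[{lo}:{hi}]"


def load_first_tenth_of_test():
    """Load a version-pinned slice of MNIST and iterate over it as NumPy."""
    # Imported lazily so that the helper above works without TFDS installed.
    import tensorflow_datasets as tfds

    # 'mnist:3.*.*' pins the major version (assumed here to be 3).
    ds = tfds.load("mnist:3.*.*", split=percent_slice("test", stop=10))

    # tfds.as_numpy converts the tf.data.Dataset into an iterable of
    # NumPy structures, which non-TensorFlow code can consume.
    for example in tfds.as_numpy(ds.take(1)):
        print(example["image"].shape, example["label"])
```

Calling `load_first_tenth_of_test()` should print the shape and label of one example. Pinning a full version string instead of a wildcard is the stricter choice when an experiment has to be reproduced exactly.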

Thank you to all of our contributors and users!

What’s next?

TFDS is under active development to bring you the best datasets to use as input in your ML pipelines.

Notably, we are working on making transformations seamless. Sometimes, a dataset is derived from another dataset by a few transformations (e.g., data augmentation or column renaming). We want those transformations to be as easy to implement as possible. This feature is already available experimentally, so don’t hesitate to give feedback on GitHub!
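
For context, here is a rough sketch of how a column renaming can be done by hand today with the standard `tf.data.Dataset.map` method. It does not use the experimental TFDS transformation feature mentioned above, and the function names `rename_label_to_target` and `load_renamed_mnist` are hypothetical, chosen only for this example.

```python
def rename_label_to_target(example):
    """Return a copy of `example` with the 'label' key renamed to 'target'."""
    renamed = dict(example)  # shallow copy, the input dict is left untouched
    renamed["target"] = renamed.pop("label")
    return renamed


def load_renamed_mnist():
    """Load MNIST and apply the renaming to every example."""
    # Imported lazily so that the renaming function works without TFDS installed.
    import tensorflow_datasets as tfds

    ds = tfds.load("mnist", split="train")
    # Dataset.map applies the function to each example dictionary.
    return ds.map(rename_label_to_target)
```

The derived dataset here exists only in the input pipeline: it is recomputed on each load and is not stored or versioned as a dataset of its own.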

We are also working on making the TensorFlow dependency optional. TFDS aims to be a framework-agnostic library that provides datasets and tools to support machine learning research, so it should not have to rely on any specific machine learning framework.

We have other plans too: smaller ones, such as support for partitioned datasets, and longer-term ones that could durably influence the field. Follow us on GitHub to receive future updates about those upcoming developments!
