safetensors 0.1.0

Announcing safetensors, a new R package allowing for reading and writing files in the safetensors format.

Related Articles

Announcing CPP-based S3 IO DataPipes

Training large deep learning models requires large datasets. Amazon Simple Storage Service (Amazon S3) is a scalable cloud object store service used for storing large training datasets. Machine learning (ML) practitioners need an efficient data pipe that can download data from Amazon S3, transform the data, and feed the data to GPUs for training models with high throughput and low latency. In this post, we introduce the new S3 IO DataPipes for PyTorch, S3FileLister and S3FileLoader. For memory efficiency and fast runs, the new DataPipes use the C++ extension to access Amazon S3. Benchmarking shows that S3FileLoader is 59.8% faster than FSSpecFileOpener for downloading a natural language processing (NLP) dataset from Amazon S3. You can build IterDataPipe training pipelines with the new DataPipes. We also demonstrate that the new DataPipe can reduce overall Bert and ResNet50 training time by 7%. The new DataPipes have been upstreamed to the open-source TorchData 0.4.0 with PyTorch 1.12.0.

PyJanitor Book Open Source Development

Originally a port of the R package, pyjanitor has evolved from a set of convenient data cleaning routines into an experiment with the method chaining paradigm. Data preprocessing usually consists of a series of steps that involve transforming raw data into an understandable/usable format.