Posit AI Blog: safetensors 0.1.0

safetensors is a new, simple, fast, and safe file format for storing tensors. The design of the file format and its original implementation are being led
by Hugging Face, and it is being widely adopted in their popular 'transformers' framework. The safetensors R package is a pure-R implementation that can both read and write safetensors files.

The initial version (0.1.0) of safetensors is now on CRAN.

Motivation

The main motivation for safetensors in the Python community is security. As noted
in the official documentation:

"The main rationale for this crate is to remove the need to use pickle on PyTorch which is used by default."

Pickle is considered an unsafe format, as the action of loading a Pickle file can
trigger the execution of arbitrary code. This has never been a concern for torch
for R users, since the Pickle parser that is included in LibTorch only supports a subset
of the Pickle format, which doesn't include executing code.

However, the file format has additional advantages over other commonly used formats, including:

Support for lazy loading: You can choose to read a subset of the tensors stored in the file.
Zero copy: Reading the file does not require more memory than the file itself.
(Technically the current R implementation does make a single copy, but that can
be optimized out if we really need it at some point).
Simple: Implementing the file format is simple, and doesn't require complex dependencies.
This means that it's a good format for exchanging tensors between ML frameworks and
between different programming languages. For instance, you can write a safetensors file
in R and load it in Python, and vice-versa.

There are additional advantages compared to other file formats common in this space;
the safetensors project provides a comparison table.

Format

The safetensors format is described in the figure below. It's basically a header
containing some metadata, followed by raw tensor buffers.
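
To make the layout concrete, here is a minimal sketch in base R that writes a tiny safetensors-style file by hand and reads one tensor back without touching the other. It assumes the layout documented by the safetensors project: an 8-byte little-endian header length, a JSON header, then the raw buffers, with data_offsets counted from the start of the data section. The tensor names, values, and hard-coded offsets are illustrative only, and this is not how the R package is implemented; a real reader would parse the JSON header to obtain the offsets.

```r
# Build a tiny safetensors-style file by hand, using base R only.
path <- tempfile(fileext = ".safetensors")

# JSON header: dtype, shape, and byte offsets (relative to the data section).
header <- paste0(
  '{"x":{"dtype":"F32","shape":[2],"data_offsets":[0,8]},',
  '"y":{"dtype":"F32","shape":[3],"data_offsets":[8,20]}}'
)
header_raw <- charToRaw(header)

con <- file(path, "wb")
# 8-byte little-endian header length. The length fits in 4 bytes here,
# so the upper 4 bytes are written as zero.
writeBin(length(header_raw), con, size = 4, endian = "little")
writeBin(0L, con, size = 4, endian = "little")
writeBin(header_raw, con)
# Raw tensor buffers: float32 values, little-endian.
writeBin(c(1, 2), con, size = 4, endian = "little")        # x: bytes 0..8
writeBin(c(10, 20, 30), con, size = 4, endian = "little")  # y: bytes 8..20
close(con)

# Read the header back, then load only `y` by seeking to its offset.
con <- file(path, "rb")
header_len <- readBin(con, "integer", n = 1, size = 4, endian = "little")
invisible(readBin(con, "integer", n = 1, size = 4, endian = "little"))  # upper 4 bytes
header_json <- rawToChar(readBin(con, "raw", n = header_len))

data_start <- 8 + header_len  # the data section begins right after the header
# Offsets for `y` are hard-coded from the header written above (assumption:
# a real reader would take them from the parsed JSON instead).
seek(con, where = data_start + 8, origin = "start")
y <- readBin(con, "numeric", n = 3, size = 4, endian = "little")
close(con)

cat(header_json, "\n")
print(y)
```

Because each tensor's byte range is recorded in the header, a reader can skip straight to the tensors it needs, which is what makes lazy loading straightforward.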

Basic usage

safetensors can be installed from CRAN using:

install.packages("safetensors")

library(torch)
library(safetensors)

tensors <- list(
  x = torch_randn(10, 10),
  y = torch_ones(10, 10)
)

str(tensors)
#> List of 2
#>  $ x:Float [1:10, 1:10]
#>  $ y:Float [1:10, 1:10]

tmp <- tempfile()
safe_save_file(tensors, tmp)

tensors <- safe_load_file(tmp)
str(tensors)
#> List of 2
#>  $ x:Float [1:10, 1:10]
#>  $ y:Float [1:10, 1:10]
#>  - attr(*, "metadata")=List of 2
#>   ..$ x:List of 3
#>   .. ..$ shape       : int [1:2] 10 10
#>   .. ..$ dtype       : chr "F32"
#>   .. ..$ data_offsets: int [1:2] 0 400
#>   ..$ y:List of 3
#>   .. ..$ shape       : int [1:2] 10 10
#>   .. ..$ dtype       : chr "F32"
#>   .. ..$ data_offsets: int [1:2] 400 800
#>  - attr(*, "max_offset")= int 929

Currently, safetensors only supports writing torch tensors, but we plan to add
support for writing plain R arrays and tensorflow tensors in the future.

Future directions

The next version of torch will use safetensors as its serialization format,
meaning that when calling torch_save() on a model, list of tensors, or other
types of objects supported by torch_save(), you will get a valid safetensors file.

This is an improvement over the previous implementation because:

It's much faster: more than 10x for medium-sized models, and possibly even more for large files.
This also improves the performance of parallel dataloaders by ~30%.
It enhances cross-language and cross-framework compatibility. You can train your model
in R and use it in Python (and vice-versa), or train your model in tensorflow and run it
with torch.

If you want to try it out, you can install the development version of torch with:

remotes::install_github("mlverse/torch")

Photo by Nick Fewings on Unsplash

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: “Figure from …”.

Citation

For attribution, please cite this work as

Falbel (2023, June 15). Posit AI Blog: safetensors 0.1.0. Retrieved from https://blogs.rstudio.com/tensorflow/posts/2023-06-15-safetensors/

BibTeX citation

@misc{safetensors,
  author = {Falbel, Daniel},
  title = {Posit AI Blog: safetensors 0.1.0},
  url = {https://blogs.rstudio.com/tensorflow/posts/2023-06-15-safetensors/},
  year = {2023}
}
