safetensors is a new, simple, fast, and safe file format for storing tensors. The design of the file format and its original implementation are being led
by Hugging Face, and it’s being widely adopted in their popular ‘transformers’ framework. The safetensors R package is a pure-R implementation that can both read and write safetensors files.
The initial version (0.1.0) of safetensors is now on CRAN.
Motivation
The main motivation for safetensors in the Python community is security. As noted
in the official documentation:
“The main rationale for this crate is to remove the need to use pickle on PyTorch which is used by default.”
Pickle is considered an unsafe format, as the action of loading a Pickle file can
trigger the execution of arbitrary code. This has never been a concern for torch
for R users, since the Pickle parser that’s included in LibTorch only supports a subset
of the Pickle format, which doesn’t include executing code.
However, the file format has additional advantages over other commonly used formats, including:
- Support for lazy loading: You can choose to read a subset of the tensors stored in the file.
- Zero copy: Reading the file doesn’t require more memory than the file itself.
(Technically the current R implementation does make a single copy, but that can
be optimized out if we really need it at some point.)
- Simple: Implementing the file format is simple, and doesn’t require complex dependencies.
This means that it’s a good format for exchanging tensors between ML frameworks and
between different programming languages. For instance, you can write a safetensors file
in R and load it in Python, and vice-versa.
There are additional advantages compared to other file formats common in this space, and
you can see a comparison table in the safetensors repository README.
Format
The safetensors format is described in the figure below. It’s basically a header
containing some metadata, followed by raw tensor buffers.
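To make the layout concrete, here is a minimal base-R sketch that writes a tiny file by hand and reads part of it back. It assumes the layout from the safetensors format description: an 8-byte little-endian header length, a JSON header, then the raw tensor bytes. The header text, the tensor names `x` and `y`, and the hard-coded offset are illustrative assumptions, and this is not how the safetensors R package is implemented.

```r
# Write a tiny two-tensor file by hand, using base R only.
# The header below is a hand-written example, not output from the package.
header <- paste0(
  '{"x":{"dtype":"F32","shape":[2],"data_offsets":[0,8]},',
  '"y":{"dtype":"F32","shape":[2],"data_offsets":[8,16]}}'
)
header_raw <- charToRaw(header)

path <- tempfile(fileext = ".safetensors")
con <- file(path, "wb")
# 8-byte little-endian header length, written as a low word plus a zero high word.
writeBin(c(length(header_raw), 0L), con, size = 4L, endian = "little")
writeBin(header_raw, con)                              # JSON header
writeBin(c(1, 2), con, size = 4L, endian = "little")   # x stored as float32
writeBin(c(3, 4), con, size = 4L, endian = "little")   # y stored as float32
close(con)

# Read it back: header length, then the header, then only the bytes of `y`.
con <- file(path, "rb")
len_bytes <- readBin(con, what = "raw", n = 8L)
header_len <- sum(as.integer(len_bytes) * 256^(0:7))
header_json <- rawToChar(readBin(con, what = "raw", n = header_len))

# data_offsets are relative to the first byte after the header.
# The offset 8 is y's first data_offset; a real reader would parse the JSON.
data_start <- 8 + header_len
invisible(seek(con, where = data_start + 8, origin = "start"))
y <- readBin(con, what = "numeric", n = 2L, size = 4L, endian = "little")
close(con)

print(header_json)
print(y)
```

Because the header records where each tensor’s bytes begin and end, a reader can jump straight to one tensor without touching the others, which is what makes lazy loading cheap.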
Basic usage
safetensors can be installed from CRAN using:
install.packages("safetensors")
We can then write any named list of torch tensors:
library(torch)
library(safetensors)
# A named list of torch tensors to save
tensors <- list(
  x = torch_randn(10, 10),
  y = torch_ones(10, 10)
)
str(tensors)
#> List of 2
#>  $ x:Float [1:10, 1:10]
#>  $ y:Float [1:10, 1:10]
# Write the tensors to a temporary file
tmp <- tempfile()
safe_save_file(tensors, tmp)
It’s possible to pass additional metadata to the saved file by providing a metadata
parameter containing a named list.
Reading safetensors files is handled by safe_load_file, and it returns the named
list of tensors along with the metadata attribute containing the parsed file header.
tensors <- safe_load_file(tmp)
str(tensors)
#> List of 2
#>  $ x:Float [1:10, 1:10]
#>  $ y:Float [1:10, 1:10]
#>  - attr(*, "metadata")=List of 2
#>   ..$ x:List of 3
#>   .. ..$ shape       : int [1:2] 10 10
#>   .. ..$ dtype       : chr "F32"
#>   .. ..$ data_offsets: int [1:2] 0 400
#>   ..$ y:List of 3
#>   .. ..$ shape       : int [1:2] 10 10
#>   .. ..$ dtype       : chr "F32"
#>   .. ..$ data_offsets: int [1:2] 400 800
#>  - attr(*, "max_offset")= int 929
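As a rough sketch of the metadata parameter mentioned above, the snippet below saves one tensor with a small named list of metadata and then inspects the parsed header. The tensor name `w` and the metadata keys and values are made up for illustration. The safetensors format keeps user metadata under a `__metadata__` header key; that the R package exposes it at that position in the parsed header is an assumption here, not something stated in this post.

```r
library(torch)
library(safetensors)

tensors <- list(w = torch_ones(2, 3))
path <- tempfile(fileext = ".safetensors")

# `metadata` takes a named list; the entries here are illustrative
# and kept as plain strings.
safe_save_file(
  tensors, path,
  metadata = list(author = "example", note = "demo run")
)

loaded <- safe_load_file(path)

# The parsed file header is attached as the "metadata" attribute.
header <- attr(loaded, "metadata")
str(header$w)  # shape, dtype and data_offsets for tensor `w`

# Assumption: user metadata appears under the `__metadata__` key.
str(header[["__metadata__"]])
```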
Currently, safetensors only supports writing torch tensors, but we plan to add
support for writing plain R arrays and tensorflow tensors in the future.
Future directions
The next version of torch will use safetensors as its serialization format,
meaning that when calling torch_save() on a model, list of tensors, or other
types of objects supported by torch_save, you will get a valid safetensors file.
This is an improvement over the previous implementation because:
- It’s much faster: more than 10x for medium-sized models, and possibly even more for large files.
This also improves the performance of parallel dataloaders by ~30%.
- It enhances cross-language and cross-framework compatibility. You can train your model
in R and use it in Python (and vice-versa), or train your model in tensorflow and run it
with torch.
If you want to try it out, you can install the development version of torch with:
remotes::install_github("mlverse/torch")
Photo by Nick Fewings on Unsplash
Reuse
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don’t fall under this license and can be recognized by a note in their caption: “Figure from …”.
Citation
For attribution, please cite this work as
Falbel (2023, June 15). Posit AI Blog: safetensors 0.1.0. Retrieved from https://blogs.rstudio.com/tensorflow/posts/2023-06-15-safetensors/
BibTeX citation
@misc{safetensors, author = {Falbel, Daniel}, title = {Posit AI Blog: safetensors 0.1.0}, url = {https://blogs.rstudio.com/tensorflow/posts/2023-06-15-safetensors/}, year = {2023} }