
Are there any official checksums available? I'm happy to see this, even if it's an unsanctioned stunt, because I think it's really pathetic of Meta to want to gatekeep their "open" model. But ML models can generally execute arbitrary code, so I'd want to make sure it's the real version at least.


    But ML models generally can execute arbitrary code
Is it the case if we're only talking about weights? I thought the rest is actually "open".


My understanding is that weights are normally stored as pickled python blobs, which means arbitrary code execution as they are unpickled.
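For anyone wondering how a bag of weights can run code: pickle lets any object name a callable to be invoked at load time via `__reduce__`. A minimal sketch of the mechanism (the `Payload` class and the harmless `eval` payload are just for illustration; a real attack would call `os.system` or similar):

```python
import os
import pickle

# Stand-in for a malicious tensor blob. __reduce__ tells pickle which
# callable to invoke at load time, with which arguments.
class Payload:
    def __reduce__(self):
        return (eval, ("__import__('os').getcwd()",))

blob = pickle.dumps(Payload())

# "Loading the weights" silently runs the embedded code:
result = pickle.loads(blob)
print(result == os.getcwd())  # the attacker's expression really ran
```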


A few months ago I made a small library to sanitize pytorch checkpoints, here it is: https://github.com/kir-gadjello/safer_unpickle

The usage boils down to

    import safer_unpickle

    safer_unpickle.patch_torch_load()

This overrides the default torch unpickler with a relatively safe one. Hope this helps.
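For the curious, the general shape of such a patch is an allowlisting unpickler. This is a sketch of the idea, not the actual safer_unpickle code, and the `SAFE` set here is deliberately tiny:

```python
import io
import pickle

# Globals we are willing to resolve during unpickling. A real tool would
# also allow torch's tensor-rebuild helpers (torch._utils et al.).
SAFE = {
    ("collections", "OrderedDict"),
}

class SaferUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Every GLOBAL/STACK_GLOBAL lookup funnels through here, so an
        # allowlist check blocks arbitrary callables like os.system.
        if (module, name) in SAFE:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"blocked global: {module}.{name}")

def safer_loads(data: bytes):
    return SaferUnpickler(io.BytesIO(data)).load()
```

Plain containers of numbers load fine; anything that tries to resolve an unlisted callable raises instead of executing.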


Sounds like this should be the default. Maybe you can submit a PR to the official Torch repo? There is no reason why a static model checkpoint should be potentially dangerous to run.


"They turned the model into a pickle? Funniest shit I've ever seen."

But seriously, why not something more human readable and text-based if it's just weights?


Because human-readable text-based formats are really inefficient to both download and load, especially when in the hundreds of GB range. And no human cares to read billions of weights.


Agreed. However, there are much better formats than Python pickles for exchanging binary data. As it is, using PyTorch means that you force your users to also use PyTorch, which is a shame, as libtorch (which is what makes PyTorch work) offers a much more portable format (which I suspect might also be more efficient at least in terms of raw size, but I haven't checked).


... why not CBOR or other efficient binary format?


If it's PyTorch, it can definitely contain and execute arbitrary code.

One of the reasons I'm not a huge fan of PyTorch.


They could contain arbitrary code... but typically don't. That means that, with the right viewer application, it's trivial to know for sure.

It isn't like a multi-gigabyte game, for example, where determining whether there's any malicious code could easily be a multi-month reverse-engineering project ending in "probably not, but we don't have time to check every byte with a fine-tooth comb".


In theory, this could be done, sure.

In practice, who's going to bother checking the language model? All the code that runs Stable Diffusion or other Hugging Face models that I've seen just downloads the model dynamically, then uses it without asking questions. That's a pretty low-hanging supply chain attack waiting to happen, I believe.


It shouldn't be much work to verify that the file is just a set of floats.
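Roughly: a pickle stream can be walked without executing it, so a scanner only has to look for the opcodes that resolve globals or invoke callables. A toy version of what tools like picklescan do (the opcode set is my guess at a reasonable blocklist; real torch checkpoints do legitimately use a few of these for tensor-rebuild helpers, which a real scanner allowlists):

```python
import pickle
import pickletools

# Opcodes that can resolve globals or call things. A pure dump of floats
# shouldn't need any of them.
SUSPECT = {"GLOBAL", "STACK_GLOBAL", "REDUCE", "INST", "OBJ", "NEWOBJ", "NEWOBJ_EX"}

def suspicious_opcodes(blob: bytes):
    # pickletools.genops disassembles the stream without running it.
    return sorted({op.name for op, arg, pos in pickletools.genops(blob)
                   if op.name in SUSPECT})

print(suspicious_opcodes(pickle.dumps([1.0, 2.0, 3.0])))  # []
print(suspicious_opcodes(pickle.dumps(eval)))             # flags the global lookup
```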


I mentioned this above, but Trail of Bits has a tool just for this purpose called Fickling. https://github.com/trailofbits/fickling


I only found this picklescan[0] serving this purpose, but it doesn't seem to be a finished project.

[0] - https://github.com/mmaitre314/picklescan


Curious how long that would take on 200 GB.


Anything that loads pickles from sources you're unsure of can contain executable code. There were a few samples a couple of months ago showing distribution on Hugging Face.

Some solutions for checking: https://huggingface.co/docs/hub/security-pickle

or run them in an isolated env.


You're right! You should probably use Trail of Bits' Fickling tool to investigate. https://github.com/trailofbits/fickling


Thanks for the tip. I tried this on the 7B parameter model and got an error.

    $ fickling --check-safety consolidated.00.pth
      File "/usr/lib/python3.10/pickletools.py", line 359, in read_stringnl
        data = codecs.escape_decode(data)[0].decode("ascii")
    UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 63: ordinal not in range(128)
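If I had to guess, that's because a .pth saved by recent torch.save is a zip archive wrapping the pickle, not a bare pickle, so pickle tools trip over the zip header. Pulling out the inner data.pkl first might get the scan working (a sketch, untested against fickling itself):

```python
import zipfile

def extract_inner_pickle(path: str) -> bytes:
    """Return the raw pickle bytes from a torch zip-format checkpoint.

    torch.save stores the pickle under <archive_name>/data.pkl.
    """
    with zipfile.ZipFile(path) as zf:
        pkl_name = next(n for n in zf.namelist() if n.endswith("data.pkl"))
        return zf.read(pkl_name)
```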


I am running it in docker to be safe, which works just fine.


Docker escapes exist, and if this was released by spooks then including a sandbox escape is par for the course. Unlikely, sure, but that confidence is naïve.


I’m aware that they exist. I figured that if someone inserted a hack, they wouldn't bother with Docker escapes, since they'd already catch plenty of people who run it without Docker. A calculated risk.



