
Are there any official checksums available? I'm happy to see this, even if it's an unsanctioned stunt, because I think it's really pathetic of Meta to want to gatekeep their "open" model. But ML models can generally execute arbitrary code, so I'd want to make sure it's the real version at least.


    But ML models generally can execute arbitrary code
Is it the case if we're only talking about weights? I thought the rest is actually "open".


My understanding is that weights are normally stored as pickled python blobs, which means arbitrary code execution as they are unpickled.
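For anyone wondering how a bag of weights can run code: pickle lets any object name a callable to be invoked at load time via `__reduce__`. A minimal sketch of the mechanism (the `Payload` class and the harmless `eval` payload are just for illustration; a real attack would call `os.system` or similar):

```python
import os
import pickle

# Stand-in for a malicious tensor blob. __reduce__ tells pickle which
# callable to invoke at load time, with which arguments.
class Payload:
    def __reduce__(self):
        return (eval, ("__import__('os').getcwd()",))

blob = pickle.dumps(Payload())

# "Loading the weights" silently runs the embedded code:
result = pickle.loads(blob)
print(result == os.getcwd())  # the attacker's expression really ran
```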


A few months ago I made a small library to sanitize pytorch checkpoints, here it is: https://github.com/kir-gadjello/safer_unpickle

The usage boils down to

    import safer_unpickle

    safer_unpickle.patch_torch_load()

This overrides the default torch unpickler with a relatively safe one. Hope this helps.
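For the curious, the general shape of such a patch is an allowlisting unpickler. This is a sketch of the idea, not the actual safer_unpickle code, and the `SAFE` set here is deliberately tiny:

```python
import io
import pickle

# Globals we are willing to resolve during unpickling. A real tool would
# also allow torch's tensor-rebuild helpers (torch._utils et al.).
SAFE = {
    ("collections", "OrderedDict"),
}

class SaferUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Every GLOBAL/STACK_GLOBAL lookup funnels through here, so an
        # allowlist check blocks arbitrary callables like os.system.
        if (module, name) in SAFE:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"blocked global: {module}.{name}")

def safer_loads(data: bytes):
    return SaferUnpickler(io.BytesIO(data)).load()
```

Plain containers of numbers load fine; anything that tries to resolve an unlisted callable raises instead of executing.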


Sounds like this should be the default. Maybe you can submit a PR to the official Torch repo? There is no reason why a static model checkpoint should be potentially dangerous to run.


"They turned the model into a pickle? Funniest shit I've ever seen."

But seriously, why not something more human readable and text-based if it's just weights?


Because human-readable text-based formats are really inefficient to both download and load, especially when in the hundreds of GB range. And no human cares to read billions of weights.


Agreed. However, there are much better formats than Python pickles for exchanging binary data. As it is, using PyTorch means that you force your users to also use PyTorch, which is a shame, as libtorch (which is what makes PyTorch work) offers a much more portable format (which I suspect might also be more efficient at least in terms of raw size, but I haven't checked).


... why not CBOR or other efficient binary format?


If it's PyTorch, it can definitely contain and execute arbitrary code.

One of the reasons I'm not a huge fan of PyTorch.


They could contain arbitrary code... but typically don't. That means that, with the right viewer application, it's trivial to know for sure.

It isn't like a multi-gigabyte game, for example, where determining whether there's any malicious code could easily be a multi-month reverse-engineering project ending in "probably not, but we don't have time to check every byte with a fine-tooth comb".


In theory, this could be done, sure.

In practice, who's going to bother checking the language model? All the code that runs Stable Diffusion or other Hugging Face models that I've seen just downloads the model dynamically, then uses it without asking questions. That's a pretty low-hanging supply chain attack waiting to happen, I believe.


It shouldn't be much work to verify that the file is just a set of floats.
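Roughly: a pickle stream can be walked without executing it, so a scanner only has to look for the opcodes that resolve globals or invoke callables. A toy version of what tools like picklescan do (the opcode set is my guess at a reasonable blocklist; real torch checkpoints do legitimately use a few of these for tensor-rebuild helpers, which a real scanner allowlists):

```python
import pickle
import pickletools

# Opcodes that can resolve globals or call things. A pure dump of floats
# shouldn't need any of them.
SUSPECT = {"GLOBAL", "STACK_GLOBAL", "REDUCE", "INST", "OBJ", "NEWOBJ", "NEWOBJ_EX"}

def suspicious_opcodes(blob: bytes):
    # pickletools.genops disassembles the stream without running it.
    return sorted({op.name for op, arg, pos in pickletools.genops(blob)
                   if op.name in SUSPECT})

print(suspicious_opcodes(pickle.dumps([1.0, 2.0, 3.0])))  # []
print(suspicious_opcodes(pickle.dumps(eval)))             # flags the global lookup
```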


I mentioned this above, but Trail of Bits has a tool just for this purpose called Fickling. https://github.com/trailofbits/fickling


I only found this picklescan[0] serving this purpose, but it doesn't seem to be a finished project.

[0] - https://github.com/mmaitre314/picklescan


Curious how long that would take on 200 GB.


Anything that loads pickles from sources you're unsure of can contain executable code. There were a few samples a couple of months ago showing distribution on Hugging Face.

Some solutions for checking: https://huggingface.co/docs/hub/security-pickle

or run them in an isolated env.


You're right! You should probably use Trail of Bits' Fickling tool to investigate. https://github.com/trailofbits/fickling


Thanks for the tip. I tried this on the 7B parameter model and got an error.

    $ fickling --check-safety consolidated.00.pth
      File "/usr/lib/python3.10/pickletools.py", line 359, in read_stringnl
        data = codecs.escape_decode(data)[0].decode("ascii")
    UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 63: ordinal not in range(128)
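If I had to guess, that's because a .pth saved by recent torch.save is a zip archive wrapping the pickle, not a bare pickle, so pickle tools trip over the zip header. Pulling out the inner data.pkl first might get the scan working (a sketch, untested against fickling itself):

```python
import zipfile

def extract_inner_pickle(path: str) -> bytes:
    """Return the raw pickle bytes from a torch zip-format checkpoint.

    torch.save stores the pickle under <archive_name>/data.pkl.
    """
    with zipfile.ZipFile(path) as zf:
        pkl_name = next(n for n in zf.namelist() if n.endswith("data.pkl"))
        return zf.read(pkl_name)
```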


I am running it in docker to be safe, which works just fine.


Docker escapes exist, and if this was released by spooks then including a sandbox escape is par for the course. Unlikely, sure, but that confidence is naïve.


I’m aware that they exist. I figured that if someone inserted a hack, they wouldn't bother with Docker escapes, since they'd already catch plenty of people who run it without Docker. A calculated risk.



