The checkpoint for the 7B parameter model is 13.5GB, so maybe? Larger models come in multiple chunks of 13.6GB or 16.3GB each. I am hoping I will be able to run it on my 16GB of VRAM, but I don't know how much overhead is needed. Maybe people on reddit will do their tricks and squeeze the models into smaller cards.
EDIT2: actually, using the tip in that link, I got it to run on a P5000 with 16GB of VRAM! It just barely fits, so I had to log out of GNOME and run it terminal-only.
Following up: after rebooting into the GUI, there was enough room for it to fit; I guess Xorg had just accumulated some cruft during my last boot. So I can run it alongside GNOME.
nvidia-smi reports this model is using 15475MiB after changing the max batch size from 32 to 8 (see the link in the post above).
As others have stated, someone may have injected unknown code into the pickled checkpoint, so I recommend running this in Docker. I use this command to run the Docker image after getting the nvidia-docker stuff configured:
docker run --runtime=nvidia -it --mount type=bind,source=/MY_LLAMA_SOURCE_PATH,target=/llama --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04
Then install the necessary dependencies in that container (obviously you could make a Dockerfile), put your model and tokenizer files (from the root dir in the download) into some directory (here models/LLaMA-7B), and run this:
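The exact command isn't shown in the comment; for reference, the invocation from the facebookresearch/llama example README looks roughly like the sketch below. The paths are assumptions matching the models/LLaMA-7B layout mentioned above:

```shell
# Sketch of the example invocation (paths are illustrative).
# --nproc_per_node must match the number of model-parallel shards
# in the checkpoint (1 for the 7B model).
torchrun --nproc_per_node 1 example.py \
    --ckpt_dir models/LLaMA-7B \
    --tokenizer_path models/tokenizer.model
```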
Enjoy! The 7B parameter model is not amazing from my initial testing. I am very curious whether larger models can be run on 1 GPU by sequentially loading checkpoints. I don't know how all this stuff is organized.
Example output below.
Prompt:
Please respond to this question: As a large language model, what are three things that you find most important?
Output:
To get students to analyze what they are doing in their learning
To get students to analyze what they are doing in their learning so that they can find the best practices
To get students to analyze what they are doing in their learning so that they can find the best practices and then to make changes to those practices so that they can improve their learning
To get students to analyze what they are doing in their learning so that they can find the best practices and then to make changes to those practices so that they can improve their learning
I don't think this question really made much sense because the sentence in the question is incomplete
To get students to analyze what they are doing in their learning so that they can find the best practices and then to make changes to those practices so that they can improve their learning
I don't think this question really made much sense because the sentence in the question is incomplete.
I think this question is in fact a good question, even though the initial sentence is incomplete, and I do think it makes sense.
I think this question is in fact a good question, even though the initial sentence is incomplete, and I do think it makes sense. (I think it is a good question but I am not sure it makes sense).
Beginner PyTorch user here... it looks like it is using only one CPU core on my machine. Is it feasible to use more than one? If so, what options/env vars/code changes are necessary?
Perhaps try setting `OMP_NUM_THREADS`, for example `OMP_NUM_THREADS=4 torchrun ...`.
But on my machine, it automatically used all 12 available physical cores. Setting OMP_NUM_THREADS=2, for example, lets me decrease the number of cores being used, but increasing it to try to use all 24 logical threads has no effect. YMMV.
If I may tack on a question as someone with zero clue about ML: when, if ever, will someone like me be able to run this on a Mac Studio with an M1 Ultra and 128GB of RAM?
You can run 7B (equal to GPT-3 175B), 13B (better than GPT-3 175B), or 30B (better than anything else publicly available) but probably not 65B with that much RAM on an M1.
That would be using the CPU, as the M1 GPU is not yet supported.
As much as I dislike the loss of socketed RAM, I have to say it's working out well for Apple users so far. I wonder if CXL will change the situation for consumer devices or if it will only be useful at the scale of a server rack.
- any inference optimization we can use similar to StableDiffusion, to bring down the vRAM requirements?
I only know about these:
- use 8bit precision
- https://github.com/bigscience-workshop/petals
- https://github.com/FMInference/FlexGen
- https://github.com/microsoft/DeepSpeed
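To illustrate why 8-bit precision helps with VRAM: each weight is stored in one byte instead of four (fp32) or two (fp16). Here is a toy sketch of absmax int8 quantization in plain NumPy; it is not how bitsandbytes actually implements it (which quantizes per block and keeps outliers in higher precision), just the basic idea:

```python
import numpy as np

def quantize_int8(w):
    """Per-tensor absmax quantization: map floats to [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values."""
    return q.astype(np.float32) * scale

# A 4096x4096 weight matrix: 64 MiB in fp32, 16 MiB in int8.
w = np.random.randn(4096, 4096).astype(np.float32)
q, scale = quantize_int8(w)
print(w.nbytes // 2**20, "MiB fp32 ->", q.nbytes // 2**20, "MiB int8")
```

The memory saving is exact (4x vs fp32, 2x vs fp16); the cost is a small per-weight rounding error bounded by half the scale.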
Is there anything that could bring this to a 10GB 3080 or 24GB 3090 without taking 60s per token?