The checkpoint for the 7B parameter model is 13.5GB, so maybe? Larger models come in multiple chunks of 13.6GB or 16.3GB each. I am hoping I will be able to run it on my 16GB of VRAM, but I don't know how much overhead is needed. Maybe people on reddit will do their tricks and squeeze the models into smaller cards.
EDIT2: actually, using the tip in that link, I got it to run on a P5000 with 16GB of VRAM! It just barely fits, so I had to log out of GNOME and run it terminal-only.
Following up: after rebooting into the GUI, there was enough room for it to fit; I guess Xorg had just accumulated some cruft during my last boot. So I can run it alongside GNOME.
nvidia-smi reports this model is using 15475MiB after changing the max batch size from 32 to 8 (see the link in the post above).
As others have stated, someone may have injected unknown code into the pickled checkpoint, so I recommend running this in Docker. I use this command to run the Docker image after getting the nvidia-docker stuff configured:
docker run --runtime=nvidia -it --mount type=bind,source=/MY_LLAMA_SOURCE_PATH,target=/llama --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04
Then install the necessary dependencies in that container (obviously you could make a Dockerfile), put your model and tokenizer files (from the root dir in the download) into some directory (here models/LLaMA-7B), and run this:
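The exact command isn't shown in the comment; for reference, the invocation from the facebookresearch/llama example README looks roughly like the sketch below. The paths are assumptions matching the models/LLaMA-7B layout mentioned above:

```shell
# Sketch of the example invocation (paths are illustrative).
# --nproc_per_node must match the number of model-parallel shards
# in the checkpoint (1 for the 7B model).
torchrun --nproc_per_node 1 example.py \
    --ckpt_dir models/LLaMA-7B \
    --tokenizer_path models/tokenizer.model
```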
Enjoy! The 7B parameter model is not amazing from my initial testing. I am very curious whether larger models can be run on 1 GPU by sequentially loading checkpoints. I don't know how all this stuff is organized.
Example output below.
Prompt:
Please respond to this question: As a large language model, what are three things that you find most important?
Output:
To get students to analyze what they are doing in their learning
To get students to analyze what they are doing in their learning so that they can find the best practices
To get students to analyze what they are doing in their learning so that they can find the best practices and then to make changes to those practices so that they can improve their learning
To get students to analyze what they are doing in their learning so that they can find the best practices and then to make changes to those practices so that they can improve their learning
I don't think this question really made much sense because the sentence in the question is incomplete
To get students to analyze what they are doing in their learning so that they can find the best practices and then to make changes to those practices so that they can improve their learning
I don't think this question really made much sense because the sentence in the question is incomplete.
I think this question is in fact a good question, even though the initial sentence is incomplete, and I do think it makes sense.
I think this question is in fact a good question, even though the initial sentence is incomplete, and I do think it makes sense. (I think it is a good question but I am not sure it makes sense).
Beginner PyTorch user here... it looks like it is using only one CPU core on my machine. Is it feasible to use more than one? If so, what options/env vars/code changes are necessary?
Perhaps try setting `OMP_NUM_THREADS`, for example `OMP_NUM_THREADS=4 torchrun ...`.
But on my machine, it automatically used all 12 available physical cores. Setting OMP_NUM_THREADS=2, for example, lets me decrease the number of cores being used, but increasing it to try to use all 24 logical threads has no effect. YMMV.
If I may tack on a question as someone with zero clue about ML: when, if ever, will someone like me be able to run this on a Mac Studio with an M1 Ultra and 128GB of RAM?
You can run 7B (equal to GPT-3 175B), 13B (better than GPT-3 175B), or 30B (better than anything else publicly available) but probably not 65B with that much RAM on an M1.
That would be using the CPU, as the M1 GPU is not yet supported.
As much as I dislike the loss of socketed RAM, I have to say it's working out well for Apple users so far. I wonder if CXL will change the situation for consumer devices or if it will only be useful at the scale of a server rack.
- any inference optimization we can use similar to StableDiffusion, to bring down the vRAM requirements?
I only know about these:
- use 8bit precision
- https://github.com/bigscience-workshop/petals
- https://github.com/FMInference/FlexGen
- https://github.com/microsoft/DeepSpeed
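To illustrate why 8-bit precision helps with VRAM: each weight is stored in one byte instead of four (fp32) or two (fp16). Here is a toy sketch of absmax int8 quantization in plain NumPy; it is not how bitsandbytes actually implements it (which quantizes per block and keeps outliers in higher precision), just the basic idea:

```python
import numpy as np

def quantize_int8(w):
    """Per-tensor absmax quantization: map floats to [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values."""
    return q.astype(np.float32) * scale

# A 4096x4096 weight matrix: 64 MiB in fp32, 16 MiB in int8.
w = np.random.randn(4096, 4096).astype(np.float32)
q, scale = quantize_int8(w)
print(w.nbytes // 2**20, "MiB fp32 ->", q.nbytes // 2**20, "MiB int8")
```

The memory saving is exact (4x vs fp32, 2x vs fp16); the cost is a small per-weight rounding error bounded by half the scale.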
Is there anything that could bring this to a 10GB 3080 or 24GB 3090 without taking 60s per token?