Update July 2023: Llama-2 has been released. Llama-2 was trained on 40% more data than LLaMA and scores very highly across a number of benchmarks. Here are the Llama-2 installation instructions, and here's a more comprehensive guide to running LLMs on your computer.
On March 3rd, user ‘llamanon’ leaked Meta's LLaMA model on 4chan’s technology board /g/, enabling anybody to torrent it. A troll later attempted to add the torrent magnet link to Meta's official LLaMA Github repo.
This made LLaMA the most powerful language model freely available to the public at the time.
- There are four different pre-trained LLaMA models, with 7B (billion), 13B, 30B, and 65B parameters.
- Meta reports that the LLaMA-13B model outperforms GPT-3 in most benchmarks.
- Meta reports the 65B model is on par with Google's PaLM-540B in terms of performance.
4-bit LLaMA Installation
4-bit quantization is a technique for reducing the size of models so they can run on less powerful hardware.
Thanks to the efforts of many developers, we can now run 4-bit LLaMA on most consumer grade computers.
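To make the idea concrete, here's a minimal sketch of naive absmax 4-bit quantization in PyTorch. This is illustrative only; GPTQ, which we install below, chooses quantized values much more carefully to preserve model outputs:

```python
import torch

def quantize_4bit(w: torch.Tensor):
    """Naive absmax quantization to 4-bit integer codes (illustration only)."""
    scale = w.abs().max() / 7                      # map [-max, max] onto [-7, 7]
    q = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8)
    return q, scale                                # real implementations pack two codes per byte

def dequantize_4bit(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(4, 4)
q, scale = quantize_4bit(w)
err = (w - dequantize_4bit(q, scale)).abs().max()
print(f"max reconstruction error: {err:.4f}")
```

Each weight now takes 4 bits instead of 16, at the cost of a small rounding error per weight.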
Here are some user-reported requirements for each model:
| Model | Model Size | Minimum Total VRAM | Card Examples | RAM/Swap to Load* |
|---|---|---|---|---|
| LLaMA-7B | 3.5 GB | 6 GB | GTX 1660, RTX 2060, AMD 5700 XT, RTX 3050, RTX 3060 | 16 GB |
| LLaMA-13B | 6.5 GB | 10 GB | AMD 6900 XT, RTX 2060 12GB, RTX 3060 12GB, RTX 3080, A2000 | 32 GB |
| LLaMA-30B | 15.8 GB | 20 GB | RTX 3080 20GB, A4500, A5000, RTX 3090, RTX 4090, 6000, Tesla V100 | 64 GB |
| LLaMA-65B | 31.2 GB | 40 GB | A100 40GB, 2×RTX 3090, 2×RTX 4090, A40, RTX A6000, 8000, Titan Ada | 128 GB |
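As a rough sanity check on the Model Size column: at 4 bits per weight, a model needs about parameters × 0.5 bytes on disk. Here's the back-of-the-envelope arithmetic (my own estimate, not a measurement; quantization scales and actual parameter counts account for the small differences from the table):

```python
# 4 bits = 0.5 bytes per parameter, using nominal parameter counts
for name, params in [("7B", 7e9), ("13B", 13e9), ("30B", 30e9), ("65B", 65e9)]:
    print(f"LLaMA-{name}: ~{params * 0.5 / 1e9:.1f} GB")
# LLaMA-7B: ~3.5 GB ... LLaMA-65B: ~32.5 GB -- in the ballpark of the table above
```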
These instructions are for Windows & Linux. For Mac M1/M2, please look at these instructions instead.
1. Install Prerequisites
Build Tools for Visual Studio 2019
Download "2019 Visual Studio and other products" (requires creating a Microsoft account). You must download the 2019 version.
In the Visual Studio Build Tools installer, check the Desktop development with C++ option and install it.
Download and install miniconda. All default settings are OK.
Install Git if you don't already have it.
2. Create Conda Environment
Open the application Anaconda Prompt (miniconda3) and run these commands one at a time.
It will take some time for the packages to download. If you get conda issues, you'll need to add conda to your PATH.
```
conda create -n textgen python=3.10.9
conda activate textgen
pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu116
```
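Before moving on, it's worth confirming that PyTorch can actually see your GPU. A quick sanity check you can run with the textgen environment's Python (my addition, not one of the original steps):

```python
import torch

print(torch.__version__)          # expect 1.13.1+cu116 at this point
print(torch.cuda.is_available())  # expect True
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # your GPU model
```

If this prints False, PyTorch was installed without CUDA support; recheck the pip command above before continuing.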
3. Oobabooga WebUI & GPTQ-for-LLaMA
Oobabooga is a good UI for running your models. It's like AUTOMATIC1111's Stable Diffusion WebUI, except for language instead of images. GPTQ-for-LLaMA is the 4-bit quantization implementation for LLaMA.
Navigate to the directory you want to put the Oobabooga folder in. Enter these commands one at a time:
```
git clone https://github.com/oobabooga/text-generation-webui.git
cd text-generation-webui
pip install -r requirements.txt
pip install torch==1.12+cu113 -f https://download.pytorch.org/whl/torch_stable.html
mkdir repositories
cd repositories
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git
cd GPTQ-for-LLaMa
pip install ninja
conda install -c conda-forge cudatoolkit-dev
python setup_cuda.py install
```
If python setup_cuda.py install doesn't work (error: [WinError 2] The system cannot find the file specified), try this instead:
Download and unzip this .whl wheel file. It doesn't matter where you put the file; you just have to install it. But since your command prompt is already in the GPTQ-for-LLaMa folder, you might as well place the .whl file there. Then enter in command prompt:
```
pip install quant_cuda-0.0.0-cp310-cp310-win_amd64.whl
```
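Whichever route you took (building from source or installing the wheel), you can verify the CUDA kernel installed correctly. The module name quant_cuda is taken from the wheel filename above:

```python
# Run with the textgen environment's Python.
# An ImportError here means the build/install failed.
import quant_cuda
print("quant_cuda kernel loaded OK")
```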
Windows only: fix bitsandbytes library
Download libbitsandbytes_cuda116.dll and put it in the bitsandbytes package folder inside your textgen environment's site-packages.
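If you're not sure where that folder is, this small helper (my addition, not part of the original instructions) will print it when run with the textgen environment's Python:

```python
import os
import sysconfig

# Prints the bitsandbytes package folder for the current interpreter:
# the DLL goes here, and cuda_setup/main.py lives inside it.
print(os.path.join(sysconfig.get_paths()["purelib"], "bitsandbytes"))
```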
Then navigate to the file \bitsandbytes\cuda_setup\main.py (inside that same folder) and open it with your favorite text editor. Search for the line:
```python
if not torch.cuda.is_available(): return 'libsbitsandbytes_cpu.so', None, None, None, None
```

(The "libsbitsandbytes" misspelling is in the original source; search for it exactly as written.)
and replace with this line:
```python
if torch.cuda.is_available(): return 'libbitsandbytes_cuda116.dll', None, None, None, None
```
In the same file, search for this. It will appear twice:
```python
self.lib = ct.cdll.LoadLibrary(binary_path)
```
and replace both instances with:

```python
self.lib = ct.cdll.LoadLibrary(str(binary_path))
```

(The str() cast is needed because ctypes can't load a library from a pathlib.Path object.)
4. Download Model Weights
Here's the latest torrent (timestamped 3-26-23, i.e. March 26, 2023):
Torrent file: Safe-LLaMA-HF (3-26-23).zip
Magnet link: magnet:?xt=urn:btih:496ee41a35f8d845f6d6cba11baa8b332f3c3318&dn=Safe-LLaMA-HF%20(3-26-23)&tr=http%3a%2f%2fbt2.archive.org%3a6969%2fannounce&tr=http%3a%2f%2fbt1.archive.org%3a6969%2fannounce
I recommend qBittorrent if you don't already have a torrent client.
You don't have to download all the models. I suggest starting with 7B to check that everything is working properly.
There are many model weight versions floating around. The most up-to-date ones are the HF-converted weights in the torrent linked above.
After the download finishes, move the folder llama-?b into the models folder inside text-generation-webui.
With the most up-to-date weights, you will not need any additional files.
Now you can start the WebUI. In command prompt:

```
python server.py --cai-chat --model llama-7b --no-stream
```
Remember to change llama-7b to whatever model you are actually using.
Wait for the success message.
Then open the WebUI by navigating to the local address printed in the console (by default http://localhost:7860/).
You can get better results from most LLaMA models by setting repetition_penalty to ~1/0.85 (≈1.18) and temperature to 0.7 in model.generate().
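If you're loading the model in your own script rather than through the WebUI, a minimal sketch with Hugging Face transformers might look like this. It assumes unquantized HF-format weights at models/llama-7b (loading the 4-bit GPTQ weights requires GPTQ-for-LLaMa's own loading code); the parameter values follow the tip above:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("models/llama-7b")
model = AutoModelForCausalLM.from_pretrained("models/llama-7b")

inputs = tokenizer("Building a website can be done in 10 simple steps:", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,               # sampling so temperature has an effect
    temperature=0.7,
    repetition_penalty=1 / 0.85,  # ~1.18, per the tip above
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```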
Troubleshooting thread: https://github.com/oobabooga/text-generation-webui/issues/147
After you start prompting you'll notice that the results aren't as good as those you would expect from ChatGPT. What gives?
LLaMA hasn't been fine-tuned for chat functionality yet.
Enter Alpaca. Stanford researchers have fine-tuned LLaMA into Stanford Alpaca to behave more like ChatGPT. While they didn't release the weights publicly, they shared the process required to replicate Alpaca. You can try an online version here.
Now that you have Oobabooga working, you can also try it with some other open source models here.
Thank you to the devs who are working tirelessly to make this stuff possible.
This is bleeding edge tech. Things will be updated and other things will break. Let us know if you have any updates or corrections in the comments below or in the Discord.
Troubleshooting

- Your NVIDIA GPU must have Pascal architecture or newer. Check this thread.
- Fix the bitsandbytes library with these instructions (the Windows fix is described above).
- If the model doesn't fit in your VRAM, try starting with the command:

  ```
  python server.py --cai-chat --model llama-7b --no-stream --gpu-memory 5
  ```

  The --gpu-memory flag sets the maximum GPU memory in GiB to be allocated per GPU. Example: --gpu-memory 10 for a single GPU, --gpu-memory 10 5 for two GPUs. Adjust the value based on how much memory your GPU can allocate.
- Download and install CUDA Toolkit 11.7 here: https://developer.nvidia.com/cuda-11-7-0-download-archive