
How to run Meta's LLaMA on your computer (Windows)

By Yubin

Update July 2023: Llama 2 has been released. Llama 2 was trained on 40% more data than LLaMA and scores very highly across a number of benchmarks. Here are the Llama 2 installation instructions, and here's a more comprehensive guide to running LLMs on your computer.

On March 3rd, 2023, user ‘llamanon’ leaked Meta's LLaMA model on 4chan’s technology board /g/, enabling anybody to torrent it. A troll later attempted to add the torrent magnet link to Meta's official LLaMA GitHub repo.

This makes LLaMA the most powerful language model with weights available to the public.

LLaMA quickfacts:

  • There are four different pre-trained LLaMA models, with 7B (billion), 13B, 30B, and 65B parameters.
  • Meta reports that the LLaMA-13B model outperforms GPT-3 in most benchmarks.
  • Meta reports the 65B model is on par with Google's PaLM-540B in terms of performance.

4-bit LLaMA Installation

4-bit quantization is a technique for reducing the size of models so they can run on less powerful hardware.

Thanks to the efforts of many developers, we can now run 4-bit LLaMA on most consumer-grade computers.

Here are some user-reported requirements for each model:

Model | Model Size | Minimum Total VRAM | Card Examples | RAM/Swap to Load
LLaMA-7B | 3.5 GB | 6 GB | GTX 1660, 2060, AMD 5700xt, RTX 3050, 3060 | 16 GB
LLaMA-13B | 6.5 GB | 10 GB | AMD 6900xt, RTX 2060 12GB, 3060 12GB, 3080, A2000 | 32 GB
LLaMA-30B | 15.8 GB | 20 GB | RTX 3080 20GB, A4500, A5000, 3090, 4090, 6000, Tesla V100 | 64 GB
LLaMA-65B | 31.2 GB | 40 GB | A100 40GB, 2x3090, 2x4090, A40, RTX A6000, 8000, Titan Ada | 128 GB
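
As a rough sanity check on the table, the 4-bit sizes follow from simple arithmetic: a parameter stored in 16-bit floating point takes 2 bytes, while a 4-bit quantized parameter takes about half a byte. Here's a quick back-of-the-envelope sketch (nominal parameter counts; real file sizes differ slightly because some layers are left unquantized):

# Rough size estimates; real checkpoints differ slightly.
for name, params in [("7B", 7e9), ("13B", 13e9), ("30B", 30e9), ("65B", 65e9)]:
    fp16_gb = params * 2 / 1e9    # 2 bytes per parameter in fp16
    int4_gb = params * 0.5 / 1e9  # 0.5 bytes per parameter in 4-bit
    print(f"LLaMA-{name}: ~{fp16_gb:.0f} GB in fp16 vs ~{int4_gb:.1f} GB in 4-bit")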

These instructions are for Windows & Linux. For Mac M1/M2, please look at these instructions instead.

1. Install Prerequisites

Build Tools for Visual Studio 2019

Download "2019 Visual Studio and other products" (requires creating a Microsoft account). You must download the 2019 version.

In the Visual Studio Build Tools installer, check the Desktop development with C++ workload and install it.

Install Miniconda

Download and install miniconda. All default settings are OK.

Install Git

Install Git if you don't already have it.

2. Create Conda Environment

Open the application Anaconda Prompt (miniconda3) and run these commands one at a time.

It will take some time for the packages to download. If the conda command isn't recognized, you'll need to add conda to your PATH.

conda create -n textgen python=3.10.9

conda activate textgen

pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu116
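
(Optional) To confirm that PyTorch installed with CUDA support, you can run this quick check from the same prompt; it should print the torch version and True:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"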

3. Oobabooga WebUI & GPTQ-for-LLaMa

Oobabooga is a good UI to run your models with. It's like AUTOMATIC1111's Stable Diffusion WebUI, except for language instead of images. GPTQ-for-LLaMa is the 4-bit quantization implementation for LLaMA.

Navigate to the directory you want to put the Oobabooga folder in. Enter these commands one at a time:

git clone https://github.com/oobabooga/text-generation-webui.git

cd text-generation-webui

pip install -r requirements.txt

pip install torch==1.12+cu113 -f https://download.pytorch.org/whl/torch_stable.html

mkdir repositories

cd repositories

git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git

cd GPTQ-for-LLaMa

pip install ninja

conda install -c conda-forge cudatoolkit-dev

python setup_cuda.py install

If python setup_cuda.py install doesn't work (error: [WinError 2] The system cannot find the file specified), try this instead:

Download and unzip this .whl wheel file. It doesn't matter where you put the file; you just have to install it. But since your command prompt is already navigated to the GPTQ-for-LLaMa folder, you might as well place the .whl file there. Then enter in the command prompt:

pip install quant_cuda-0.0.0-cp310-cp310-win_amd64.whl

Windows only: fix bitsandbytes library

Download libbitsandbytes_cuda116.dll and put it in C:\Users\MYUSERNAME\miniconda3\envs\textgen\Lib\site-packages\bitsandbytes\

Then, open the file cuda_setup\main.py inside that same bitsandbytes folder with your favorite text editor. Search for the line:

if not torch.cuda.is_available(): return 'libsbitsandbytes_cpu.so', None, None, None, None

and replace with this line:

if torch.cuda.is_available(): return 'libbitsandbytes_cuda116.dll', None, None, None, None

In the same file, search for this. It will appear twice: 

self.lib = ct.cdll.LoadLibrary(binary_path)

and replace both instances with: 

self.lib = ct.cdll.LoadLibrary(str(binary_path))
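
To check that the patch took effect, you can try importing the library from the textgen environment (an optional sanity check, not part of the original fix). If the import finishes without a "CUDA Setup failed" error, the DLL was found:

python -c "import bitsandbytes"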

4. Download Model Weights

Here's the latest torrent (timestamp 3-26-23 or March 26, 2023):

Torrent file: Safe-LLaMA-HF (3-26-23).zip

Magnet link: magnet:?xt=urn:btih:496ee41a35f8d845f6d6cba11baa8b332f3c3318&dn=Safe-LLaMA-HF%20(3-26-23)&tr=http%3a%2f%2fbt2.archive.org%3a6969%2fannounce&tr=http%3a%2f%2fbt1.archive.org%3a6969%2fannounce

I recommend qBittorrent if you don't already have a torrent client.

You don't have to download all the models. I suggest starting with 7B to check that everything is working properly.

There are many model weight versions floating around. The most up-to-date ones are .safetensors files.

After the download finishes, move the model folder (llama-7b, llama-13b, etc.) into the folder text-generation-webui/models.

With the most up-to-date weights, you will not need any additional files.
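
For reference, the directory layout should end up looking roughly like this (the filenames below are placeholders; the exact files vary between weight releases):

text-generation-webui/
    models/
        llama-7b/
            config.json
            tokenizer.model
            tokenizer_config.json
            (one or more .safetensors weight files)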

Start

Now you can start the WebUI. In the command prompt:

python server.py --cai-chat --model llama-7b --no-stream

Remember to change llama-7b to whatever model you are actually using.

Wait for the success message.

Then open the WebUI by navigating to:

http://localhost:7860/

You can obtain better results by passing a repetition_penalty of about 1/0.85 (≈1.18) and temperature=0.7 to model.generate() for most LLaMA models.
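
If you're experimenting with the model directly from Python rather than through the web UI, a minimal sketch looks like this (it assumes HF-format weights in models/llama-7b and a transformers version recent enough to ship the Llama classes; loading in fp16 like this needs far more VRAM than the 4-bit path above):

import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("models/llama-7b")
model = LlamaForCausalLM.from_pretrained("models/llama-7b", torch_dtype=torch.float16).cuda()

inputs = tokenizer("The capital of France is", return_tensors="pt").to("cuda")
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,                # sampling is needed for temperature to take effect
    temperature=0.7,
    repetition_penalty=1 / 0.85,   # ~1.18, discourages repeated tokens
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))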

Troubleshooting thread: https://github.com/oobabooga/text-generation-webui/issues/147

What now?

After you start prompting you'll notice that the results aren't as good as those you would expect from ChatGPT. What gives?

LLaMA hasn't been fine-tuned for chat functionality yet.

Enter Alpaca. Stanford researchers have fine-tuned LLaMA into Stanford Alpaca to behave more like ChatGPT. While they didn't release the weights publicly, they shared the process required to replicate Alpaca. You can try an online version here.

Now that you have Oobabooga working, you can also try it with some other open source models here.

Thank you to the devs who are working tirelessly to make this stuff possible.

This is bleeding edge tech. Things will be updated and other things will break. Let us know if you have any updates or corrections in the comments below or in the Discord.

Troubleshooting

quant_cuda_kernel.cu(654): error: identifier "__hfma2" is undefined

Your NVIDIA GPU must have Pascal architecture or newer. Check this thread.
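
A quick way to check your card's generation from the textgen environment (Pascal corresponds to CUDA compute capability 6.x, so anything reporting 6.0 or higher is new enough):

python -c "import torch; print(torch.cuda.get_device_capability())"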

❗ RuntimeError: CUDA Setup failed despite GPU being available.

Fix bitsandbytes library with these instructions.

❗ RuntimeError: CUDA out of memory. Tried to allocate...

Try starting with the command:
python server.py --cai-chat --model llama-7b --no-stream --gpu-memory 5

The --gpu-memory flag sets the maximum GPU memory (in GiB) to be allocated per GPU. Example: --gpu-memory 10 for a single GPU, --gpu-memory 10 5 for two GPUs.
Adjust the value based on how much memory your GPU can allocate.

❗ missing hip_runtime_api.h

Download and install CUDA Toolkit 11.7 here:  https://developer.nvidia.com/cuda-11-7-0-download-archive

That's the article!
56 thoughts on “How to run Meta's LLaMA on your computer (Windows)”

  1. After following the steps, I cannot find the folder ‘anaconda3\env\textgen\Lib\site-packages\bitsandbytes\cuda_setup)’ with the main.py file in it to continue

    Reply
  2. Getting issue with the last step.
    C:\Users\Eman.QTIS\WebUI\text-generation-webui\models>python server.py –load-in-4bit –model llama-7b
    python: can’t open file ‘C:\\Users\\Eman.QTIS\\WebUI\\text-generation-webui\\models\\server.py’: [Errno 2] No such file or directory

    I don’t see any server.py in any folder in the models directory

    Reply
  3. Hello! Great article! Thank you! I need help 🙁
    I stuck at #3 (Win 10 x64, VS 2019 is installed ok, all previous steps are ok too)
    The line I stuck :
    python setup_cuda.py install
    It produces me these lines : ( 🙁 )

    (textgen) λ python setup_cuda.py install
    running install
    C:\Users\nuke\miniconda3\envs\textgen\lib\site-packages\setuptools\command\install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
    warnings.warn(
    C:\Users\nuke\miniconda3\envs\textgen\lib\site-packages\setuptools\command\easy_install.py:144: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
    warnings.warn(
    running bdist_egg
    running egg_info
    writing quant_cuda.egg-info\PKG-INFO
    writing dependency_links to quant_cuda.egg-info\dependency_links.txt
    writing top-level names to quant_cuda.egg-info\top_level.txt
    C:\Users\nuke\miniconda3\envs\textgen\lib\site-packages\torch\utils\cpp_extension.py:476: UserWarning: Attempted to use ninja as the BuildExtension backend but we could not find ninja.. Falling back to using the slow distutils backend.
    warnings.warn(msg.format(‘we could not find ninja.’))
    reading manifest file ‘quant_cuda.egg-info\SOURCES.txt’
    writing manifest file ‘quant_cuda.egg-info\SOURCES.txt’
    installing library code to build\bdist.win-amd64\egg
    running install_lib
    running build_ext
    C:\Users\nuke\miniconda3\envs\textgen\lib\site-packages\torch\utils\cpp_extension.py:359: UserWarning: Error checking compiler version for cl: [WinError 2] The system cannot find the file specified
    warnings.warn(f’Error checking compiler version for {compiler}: {error}’)
    error: [WinError 2] The system cannot find the file specified

    Please help 🙁

    Reply
  4. I’m getting this error when running this comand:

    (textgen) PS C:\Users\weefl\oobabooga\text-generation-webui\repositories\GPTQ-for-LLaMa> python setup_cuda.py install
    Traceback (most recent call last):
    File “C:\Users\weefl\oobabooga\text-generation-webui\repositories\GPTQ-for-LLaMa\setup_cuda.py”, line 6, in
    ext_modules=[cpp_extension.CUDAExtension(
    File “C:\ProgramData\miniconda3\envs\textgen\lib\site-packages\torch\utils\cpp_extension.py”, line 1048, in CUDAExtension
    library_dirs += library_paths(cuda=True)
    File “C:\ProgramData\miniconda3\envs\textgen\lib\site-packages\torch\utils\cpp_extension.py”, line 1186, in library_paths
    paths.append(_join_cuda_home(lib_dir))
    File “C:\ProgramData\miniconda3\envs\textgen\lib\site-packages\torch\utils\cpp_extension.py”, line 2223, in _join_cuda_home
    raise EnvironmentError(‘CUDA_HOME environment variable is not set. ‘
    OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root.

    Reply
    • Run the following command in the terminal:

      export CUDA_HOME=/usr/local/cuda-X.X

      Where you replace X.X by the first two digits of your version number (can be found out e.g. via nvcc –version).

      Reply
  5. Hello,
    When i want to launch Llama with “python server.py –cai-chat –wbits 4 –model llama-7b –no-stream”
    Command writes “server.py: error: unrecognized arguments: –cai-chat –wbits 4 –model llama-7b –no-stream”
    Can you help me pls ?

    Reply
  6. hello, script setup_cuda.py is not avilable in folder C:\Users\…\text-generation-webui\repositories\GPTQ-for-LLaMa

    28/03/2023 14:13 1 143 convert_llama_weights_to_hf.py
    28/03/2023 14:13 6 455 datautils.py
    28/03/2023 14:13 5 622 gptq.py
    28/03/2023 14:13 16 831 llama.py
    28/03/2023 14:13 3 946 llama_inference.py
    28/03/2023 14:13 12 438 llama_inference_offload.py
    28/03/2023 14:13 413 modelutils.py
    28/03/2023 14:13 18 935 opt.py
    28/03/2023 14:13 12 739 quant.py
    28/03/2023 14:13 8 603 README.md
    28/03/2023 14:13 122 requirements.txt

    which one should be executed instead of setup_cuda.py ?

    Reply
  7. I tries also this

    (base) C:\Users\xxxxx\miniconda3\envs\textgen\Lib\site-packages\bitsandbytes\cuda_setup>python main.py install
    Traceback (most recent call last):
    File “C:\Users\xxxxx\miniconda3\envs\textgen\Lib\site-packages\bitsandbytes\cuda_setup\main.py”, line 22, in
    import torch
    ModuleNotFoundError: No module named ‘torch’

    Reply
  8. I have this problem anybody knows why ?
    python server.py -cai-chat -wbits 4 -model llama-13b -no-stream
    usage: server.py [-h] [–model MODEL] [–lora LORA] [–notebook] [–chat] [–cai-chat] [–cpu] [–load-in-8bit]
    [–gptq-bits GPTQ_BITS] [–gptq-model-type GPTQ_MODEL_TYPE] [–gptq-pre-layer GPTQ_PRE_LAYER]
    [–wbits WBITS] [–model_type MODEL_TYPE] [–groupsize GROUPSIZE] [–pre_layer PRE_LAYER] [–bf16]
    [–auto-devices] [–disk] [–disk-cache-dir DISK_CACHE_DIR]
    [–gpu-memory GPU_MEMORY [GPU_MEMORY …]] [–cpu-memory CPU_MEMORY] [–no-cache] [–flexgen]
    [–percent PERCENT [PERCENT …]] [–compress-weight] [–pin-weight [PIN_WEIGHT]] [–deepspeed]
    [–nvme-offload-dir NVME_OFFLOAD_DIR] [–local_rank LOCAL_RANK] [–rwkv-strategy RWKV_STRATEGY]
    [–rwkv-cuda-on] [–no-stream] [–settings SETTINGS] [–extensions EXTENSIONS [EXTENSIONS …]]
    [–model-dir MODEL_DIR] [–lora-dir LORA_DIR] [–verbose] [–listen] [–listen-port LISTEN_PORT]
    [–share] [–auto-launch] [–gradio-auth-path GRADIO_AUTH_PATH]
    server.py: error: unrecognized arguments: -cai-chat -wbits 4 -model llama-13b -no-stream

    Reply
  9. On “python setup_cuda.py install” step I get an error saying “No CUDA runtime is found, using CUDA_HOME=’C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.3′ ” What have I done wrong?

    Reply
  10. Naturally, the instruction STILL doesn’t work, though I must congratulate you that using this version I managed to get nearly to the end.

    1. You forgot to add change directory instructions when calling download-model.py and server.py.
    2. More importantly, the launch command is incorrect, the proper form is
    “python server.py –cai-chat –wbits 4 –model llama-13b-hf –no-stream” (notice -hf part – yes, this is how the directory is actually called).
    3. Lastly, launch never happens, instead you are presented with a gazillion of “size mismatch for model.layers…” messages (tried both for 13 and 30 versions).

    The verdict:

    – You still didn’t bother actually following your own instructions to see if they actually work.

    – The method described is not working.

    Reply
    • I deserve your criticism. The guide had multiple errors when you first read it.

      The “size mismatch” errors come from linking to the old model weights… have replaced this with the most recent magnet link. Have also updated the dashes and cd instructions.

      Let me know if you can get it working.

      Reply
      • I managed to get the bloody thing working, and in GPU mode no less too. Sadly only llama-13 fits into my puny GPU’s 24 GBs of memory, but at least it pushes out answers in about 10 seconds or less. I’ve been playing with bot a lot.

        Took me a while to figure out all the quirks of its mechanics. This thing is dumb as a rock.

        For example, I have two characters: bot’s character, named Bob, and me, portraying Bob’s father character. The moment I, as Bob’s father, told that Bob’s father’s name is Jack, bot created a THIRD person by the name of Jack, with Jack being Bob’s best friend and pretty much being the duplicate of his father, but NOT actually his father. Since I continued to chat to bot as Jack bot would start complaining to himself as his father’s character that Bob doesn’t spend enough time with him, preferring Jack’s company instead.

        The most sad part is how little bot remembers. Well, truth be told ChatGPT 3.5 isn’t much better.

        Reply
  11. Hello,
    First of all, thank you for your work and all this information.
    I’m stuck at step 3.
    For step 3 “Install Oobabooga”, at the point where you need to write this command:
    conda install -c conda-forge cudatoolkit-dev
    I’m stuck, this command does not run. Here’s what I have on the screen:
    “E:\interf_oobanooga\text-generation-webui>conda install -c conda-forge cudatoolkit-dev ‘conda’ is not recognized as an internal or external command, operable program or batch file.
    Could you help me ?

    Reply
  12. Hello,
    Thanks a lot for the updated guide (I had an issue using CUDA 11.6 and installing 11.7 did it).
    Just wanted to say that the link you posted for the llama-65b-4bit.pt is the llama-30b-4bit.pt one.

    Reply
  13. Hello you have a problem on this article at the section 4. Download Model Weights
    when download llama-65b-4bit.pt i see the same file with llama-30b-4bit.pt

    Thanks for everything !

    Reply
  14. Hey !
    Just downloaded the new weights for the 7b model with the torrent file but I still have the following errors:
    -Missing key(s) in state_dict: “model.layers.0.self_attn.k_proj.qzeros”, …
    -Unexpected key(s) in state_dict: “model.layers.0.self_attn.k_proj.zeros”,…
    -lots of “size mismatch for model.layers”
    The sizes mismatch are always like this:
    copying a param with shape torch.Size([11008, 1]) from checkpoint, the shape in current model is torch.Size([1, 11008])
    Do you know how this cuold be solved ?
    Thank you !

    Reply
    • Hello Cedric, you have downloaded the 3-26-23 weights?

      It is strange, the new weights removed the size mismatch error for me.

      One thing you can try is conda remove -n textgen --all to remove the conda env and then try recreating it.

      Reply
  15. C:\Users\stb13\miniconda3\envs\textgen\python.exe: can’t open file ‘C:\\WINDOWS\\system32\\text-generation-webui\\repositories\\GPTQ-for-LLaMa\\server.py’: [Errno 2] No such file or directory
    (textgen) PS C:\WINDOWS\system32\text-generation-webui\repositories\GPTQ-for-LLaMa> python server.py –cai-chat –model llama-7b –no-stream

    Sorry but where is the server.py ?

    Reply
  16. Hello Ybin Ma, thank you so much for your articles, they are very helpful. However, I am experiencing some issues with the Start section.

    torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB (GPU 0; 8.00 GiB total capacity; 7.28 GiB already allocated; 0 bytes free; 7.28 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

    Screenshot : https://i.imgur.com/PEE6CKi.png

    I tried some fixes without success.
    – python -c “import torch; torch.cuda.set_device(0); torch.cuda.empty_cache()”
    – $env:PYTORCH_CUDA_ALLOC_CONF=”max_split_size_mb:512″

    Here is some informations :
    – GPU = RTX 2060 Super 8GB
    – torch.cuda.is_available() = True
    – torch.__version__ = 1.13.1+cu116
    – python –version = Python 3.10.9

    Do you have any suggestions or solutions for this issue?

    .
    Amaster

    Reply
    • Hello Amaster, try starting with the command: python server.py --cai-chat --model llama-7b --no-stream --gpu-memory 5

      The --gpu-memory flag sets the maximum GPU memory (in GiB) to be allocated per GPU. You can adjust the value based on how much memory your GPU can allocate.

      Reply
  17. Great news! Thank you so much, Yubin Ma, for your help. Your solution worked perfectly and I really appreciate your assistance. Thanks again for your time and expertise!

    Reply
  18. I have .pth model files, not .pt. Will the .pth files still work or do I need to do some kind of conversion/transform on them??

    Reply
  19. Thanks a lot Yubin Ma !
    I followed several guide before I found you and none of them were working ! your guide is definitely the best resource we can found !

    Working like a charm !

    Reply
  20. Ty !

    i have another problem :

    CUDA SETUP: Loading binary C:\Users\stb13\miniconda3\envs\textgen\lib\site-packages\bitsandbytes\libbitsandbytes_cuda116.dll…
    Loading llama-7b…
    Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 2.77it/s]
    Traceback (most recent call last):
    File "C:\WINDOWS\system32\text-generation-webui\server.py", line 273, in
    shared.model, shared.tokenizer = load_model(shared.model_name)
    File “C:\WINDOWS\system32\text-generation-webui\modules\models.py”, line 56, in load_model
    model = model.cuda()
    File “C:\Users\stb13\miniconda3\envs\textgen\lib\site-packages\torch\nn\modules\module.py”, line 689, in cuda
    return self._apply(lambda t: t.cuda(device))
    File “C:\Users\stb13\miniconda3\envs\textgen\lib\site-packages\torch\nn\modules\module.py”, line 579, in _apply
    module._apply(fn)
    File “C:\Users\stb13\miniconda3\envs\textgen\lib\site-packages\torch\nn\modules\module.py”, line 579, in _apply
    module._apply(fn)
    File “C:\Users\stb13\miniconda3\envs\textgen\lib\site-packages\torch\nn\modules\module.py”, line 579, in _apply
    module._apply(fn)
    [Previous line repeated 2 more times]
    File “C:\Users\stb13\miniconda3\envs\textgen\lib\site-packages\torch\nn\modules\module.py”, line 602, in _apply
    param_applied = fn(param)
    File “C:\Users\stb13\miniconda3\envs\textgen\lib\site-packages\torch\nn\modules\module.py”, line 689, in
    return self._apply(lambda t: t.cuda(device))
    RuntimeError: CUDA out of memory. Tried to allocate 86.00 MiB (GPU 0; 4.00 GiB total capacity; 3.40 GiB already allocated; 0 bytes free; 3.40 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
    (textgen) PS C:\WINDOWS\system32\text-generation-webui>

    you can help me please ? ty

    Reply
  21. maybe i have a bad configuration with my computer ? i have try this :

    python server.py –cai-chat –model llama-7b –no-stream –gpu-memory 4

    i can lauch the weblocal but when i have speack with the assistant i have a error

    ty for everything

    Reply
  22. Hello ! i got a problem on the “start” section. While trying to run this cmd : python server.py –cai-chat –model llama-7b –no-stream

    I got this error :

    Traceback (most recent call last):
    File “C:\Users\a\text-generation-webui\server.py”, line 10, in
    import gradio as gr
    ModuleNotFoundError: No module named ‘gradio’

    Could you please help me ? 🙂

    Reply
  23. When I launch this command “python server.py –cai-chat –model llama-7b –no-stream”
    I have this answer:
    Traceback (most recent call last):
    File “/home/matrix/text-generation-webui/server.py”, line 10, in
    import gradio as gr
    ModuleNotFoundError: No module named ‘gradio’

    How to fix “server.py”?
    As you see I’m on LInux, it may have nothing to do with the issue?

    Reply
  24. Hi,

    I have another question.
    What do I need to do to have the program up-to-date?
    Do I need to go through all the steps from the beginning?

    Regards.

    Reply
