[Bug] Windows Studio llama-server runs CPU-only unless torch\lib is added to PATH for CUDA 13 DLLs #5491

@MilleniumGenAI

Description

On Windows, Unsloth Studio detects CUDA and launches llama-server with GPU offload enabled (-ngl -1), but the GGUF model still runs CPU/RAM-only unless the PyTorch CUDA DLL directory is added to the child process PATH.

This was reproduced with Unsloth Studio 2026.5.2 and then fixed locally by prepending the Studio venv's torch\lib directory to the PATH used when spawning llama-server.

Possibly related to #5106 and #4949, but this issue has a confirmed Windows PATH / CUDA DLL root cause.

Environment

  • OS: Windows
  • GPU: NVIDIA GeForce RTX 5060 Ti, 16 GB VRAM
  • Driver: NVIDIA 591.86
  • CUDA shown by nvidia-smi: 13.1
  • CUDA_PATH: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.9
  • Unsloth Studio: 2026.5.2
  • Torch: 2.10.0+cu130
  • torch.cuda.is_available(): True
  • Model: unsloth/gemma-4-E2B-it-GGUF
  • GGUF: gemma-4-E2B-it-UD-Q4_K_XL.gguf
  • llama.cpp build: b9174-59778f019

Symptoms

Studio launched llama-server with GPU offload enabled:

llama-server.exe -m ...\gemma-4-E2B-it-UD-Q4_K_XL.gguf --port 54538 -c 131072 --parallel 1 --flash-attn on --no-context-shift -ngl -1 --threads -1 --jinja ...

But the model did not move into VRAM:

  • nvidia-smi stayed at about 1.4 GB of 16 GB used, mostly desktop/graphics usage
  • no llama-server.exe compute process appeared in nvidia-smi
  • llama-server.exe used about 4.25 GB system RAM
  • inference worked, but appeared to be CPU/RAM-only
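The second symptom can be checked from a script. Below is a minimal sketch, assuming nvidia-smi is on PATH; the helper names (parse_compute_apps, is_on_gpu, query_compute_apps) are made up for illustration and are not part of Studio. On Windows the used_memory column may be reported as [N/A], so the check looks only at the process name.

```python
import csv
import io
import subprocess


def parse_compute_apps(csv_text: str) -> list[dict]:
    """Parse `nvidia-smi --query-compute-apps=... --format=csv,noheader` output."""
    rows = []
    for fields in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        pid, name, used = (f.strip() for f in fields[:3])
        rows.append({"pid": pid, "process_name": name, "used_memory": used})
    return rows


def is_on_gpu(csv_text: str, exe_name: str = "llama-server") -> bool:
    """Return True if any listed compute process name contains exe_name."""
    return any(
        exe_name.lower() in row["process_name"].lower()
        for row in parse_compute_apps(csv_text)
    )


def query_compute_apps() -> str:
    """Run nvidia-smi and return its raw CSV output (needs an NVIDIA driver)."""
    return subprocess.run(
        [
            "nvidia-smi",
            "--query-compute-apps=pid,process_name,used_memory",
            "--format=csv,noheader",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
```

Calling is_on_gpu(query_compute_apps()) while a model is loaded should return False in the failing state described here and True once the model is actually offloaded.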

Root Cause

The Windows launch path in studio/backend/core/inference/llama_cpp.py adds only:

  • the llama.cpp binary directory
  • CUDA_PATH\bin

before spawning llama-server.

On this machine, CUDA_PATH points to CUDA 12.9. However, the Studio/PyTorch environment includes CUDA 13 runtime DLLs in:

C:\Users\Anton\.unsloth\studio\unsloth_studio\Lib\site-packages\torch\lib

That directory contains the required CUDA 13 DLLs:

cudart64_13.dll
cublas64_13.dll
cublasLt64_13.dll

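ggml-cuda.dll exists next to llama-server.exe, but it appears to depend on the CUDA 13 runtime DLLs listed above. CUDA_PATH\bin points at a 12.9 toolkit, which ships cudart64_12.dll rather than cudart64_13.dll, so without torch\lib in PATH the CUDA backend apparently fails to load and llama-server falls back to CPU, even though -ngl -1 is passed.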
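To see which directory, if any, would supply each DLL, a small diagnostic like the following can help. This is only a sketch: locate_dlls is a hypothetical helper, and it is a simplification of the real Windows loader, which also searches the application directory and the system directories before PATH.

```python
import os
from pathlib import Path


def locate_dlls(path_dirs, dll_names):
    """Map each DLL name to the first directory in path_dirs containing it.

    A DLL that is not found in any directory maps to None.
    """
    found = {}
    for name in dll_names:
        found[name] = next(
            (d for d in path_dirs if Path(d, name).is_file()),
            None,
        )
    return found


if __name__ == "__main__":
    # Inspect the PATH of the current process; the DLL names are the ones
    # reported in this issue.
    dirs = [d for d in os.environ.get("PATH", "").split(os.pathsep) if d]
    wanted = ["cudart64_13.dll", "cublas64_13.dll", "cublasLt64_13.dll"]
    for dll, where in locate_dlls(dirs, wanted).items():
        print(f"{dll}: {where or 'NOT FOUND'}")
```

Running it with the PATH that Studio passes to llama-server should show NOT FOUND for all three DLLs in the failing configuration.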

Local Fix

Adding the PyTorch DLL directory to PATH before spawning llama-server fixed the issue.

Minimal local patch in studio/backend/core/inference/llama_cpp.py:

if sys.platform == "win32":
    path_dirs = [binary_dir]
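    # parents[4] is the directory that contains the "studio" package; on this
    # install that is expected to be the venv's site-packages directory.
    torch_lib = Path(__file__).resolve().parents[4] / "torch" / "lib"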
    if torch_lib.is_dir():
        # PyTorch wheels bundle the CUDA runtime DLLs Studio's
        # llama.cpp build may need (for example cudart64_13.dll).
        path_dirs.append(str(torch_lib))

    cuda_path = os.environ.get("CUDA_PATH", "")
    if cuda_path:
        cuda_bin = os.path.join(cuda_path, "bin")
        if os.path.isdir(cuda_bin):
            path_dirs.append(cuda_bin)

After restarting Studio / reloading the model, the same model correctly loaded into VRAM.

Expected Behavior

When Studio launches a CUDA-capable llama-server with -ngl -1, the model should load into VRAM if it fits, independent of whether the user's system-level CUDA_PATH points to an older toolkit than the CUDA runtime bundled with the PyTorch wheel.

Actual Behavior

Studio launches with -ngl -1, but llama-server runs CPU/RAM-only unless torch\lib is available in the spawned process PATH.

Suggested Fix

On Windows, include the active Python environment's torch\lib directory in the PATH used for the llama-server subprocess, preferably before CUDA_PATH\bin, so the bundled CUDA runtime DLLs are available to ggml-cuda.dll.
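One way this could look as a standalone helper is sketched below. The function name build_child_env and its parameters are hypothetical, not Studio's actual API; the ordering (binary directory, then torch\lib, then CUDA_PATH\bin, then the inherited PATH) follows the suggestion above.

```python
import os


def build_child_env(binary_dir, torch_lib=None, cuda_path=None, base_env=None):
    """Return a copy of the environment with DLL directories at the front of PATH.

    Order: binary_dir, torch_lib (if it exists), cuda_path/bin (if it exists),
    then whatever PATH the base environment already had.
    """
    env = dict(os.environ if base_env is None else base_env)
    path_dirs = [str(binary_dir)]

    if torch_lib and os.path.isdir(torch_lib):
        path_dirs.append(str(torch_lib))

    if cuda_path:
        cuda_bin = os.path.join(cuda_path, "bin")
        if os.path.isdir(cuda_bin):
            path_dirs.append(cuda_bin)

    existing = env.get("PATH", "")
    if existing:
        path_dirs.append(existing)

    env["PATH"] = os.pathsep.join(path_dirs)
    return env
```

The result would be passed as env= to subprocess.Popen when launching llama-server. Because the child gets its own copy, the parent process PATH is left unchanged.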

A more defensive version could attempt this only when torch\lib contains CUDA DLLs such as cudart64_13.dll, but simply prepending the directory seems consistent with how PyTorch wheels bundle CUDA dependencies on Windows.
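A minimal sketch of that defensive check, assuming the CUDA runtime DLL follows the cudart64_*.dll naming seen above; torch_lib_with_cuda is a hypothetical name:

```python
from pathlib import Path


def torch_lib_with_cuda(site_packages):
    """Return site_packages/torch/lib if it holds a CUDA runtime DLL, else None."""
    torch_lib = Path(site_packages) / "torch" / "lib"
    # Match any CUDA major version (cudart64_12.dll, cudart64_13.dll, ...).
    if torch_lib.is_dir() and any(torch_lib.glob("cudart64_*.dll")):
        return torch_lib
    return None
```

The site-packages directory of the active environment could be obtained with sysconfig.get_paths()["platlib"], for example. A CPU-only PyTorch wheel would then leave PATH untouched, since its torch\lib would not contain a cudart DLL.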
