[Bug] Windows Studio llama-server runs CPU-only unless torch\lib is added to PATH for CUDA 13 DLLs #5491

## Description

On Windows, Unsloth Studio detects CUDA and launches `llama-server` with GPU offload enabled (`-ngl -1`), but the GGUF model still runs CPU/RAM-only unless the PyTorch CUDA DLL directory is added to the child process `PATH`.

This was reproduced with Unsloth Studio `2026.5.2` and then fixed locally by prepending the Studio venv's `torch\lib` directory to the `PATH` used when spawning `llama-server`.

Possibly related to #5106 and #4949, but this issue has a confirmed Windows PATH / CUDA DLL root cause.

## Environment

- OS: Windows
- GPU: NVIDIA GeForce RTX 5060 Ti, 16 GB VRAM
- Driver: NVIDIA 591.86
- CUDA shown by `nvidia-smi`: 13.1
- `CUDA_PATH`: `C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.9`
- Unsloth Studio: `2026.5.2`
- Torch: `2.10.0+cu130`
- `torch.cuda.is_available()`: `True`
- Model: `unsloth/gemma-4-E2B-it-GGUF`
- GGUF: `gemma-4-E2B-it-UD-Q4_K_XL.gguf`
- llama.cpp build: `b9174-59778f019`

## Symptoms

Studio launched `llama-server` with GPU offload enabled:

```text
llama-server.exe -m ...\gemma-4-E2B-it-UD-Q4_K_XL.gguf --port 54538 -c 131072 --parallel 1 --flash-attn on --no-context-shift -ngl -1 --threads -1 --jinja ...
```

But the model did not move into VRAM:

- `nvidia-smi` memory usage stayed at about `1.4 GB / 16 GB`, mostly desktop/graphics usage
- no `llama-server.exe` compute process appeared in `nvidia-smi`
- `llama-server.exe` used about `4.25 GB` system RAM
- inference worked, but appeared to be CPU/RAM-only
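The second symptom can be checked from a script rather than by eye. The following is a minimal sketch, assuming `nvidia-smi` is on `PATH` and that its `--query-compute-apps=pid,process_name` query reports process names on the machine in question. The helper names (`parse_compute_apps`, `gpu_process_present`, `query_compute_apps`) are made up for illustration and are not part of Studio.

```python
import subprocess


def parse_compute_apps(csv_text):
    """Parse `pid, process_name` rows produced by nvidia-smi (csv,noheader)."""
    apps = []
    for line in csv_text.splitlines():
        line = line.strip()
        if not line:
            continue
        pid, _, name = line.partition(",")
        if pid.strip().isdigit():
            apps.append((int(pid), name.strip()))
    return apps


def gpu_process_present(csv_text, exe_name="llama-server"):
    """True if any compute process name contains exe_name (case-insensitive)."""
    return any(
        exe_name.lower() in name.lower()
        for _, name in parse_compute_apps(csv_text)
    )


def query_compute_apps():
    """Run nvidia-smi and return its raw CSV output (requires an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid,process_name",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Usage on a machine with an NVIDIA driver installed:
#   print(gpu_process_present(query_compute_apps()))
```

In the failing state described above, this check would be expected to report `False` while `llama-server.exe` is serving requests.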

## Root Cause

The Windows launch path in `studio/backend/core/inference/llama_cpp.py` adds only:

- the llama.cpp binary directory
- `CUDA_PATH\bin`

before spawning `llama-server`.

On this machine, `CUDA_PATH` points to CUDA 12.9. However, the Studio/PyTorch environment includes CUDA 13 runtime DLLs in:

```text
C:\Users\Anton\.unsloth\studio\unsloth_studio\Lib\site-packages\torch\lib
```

That directory contains the required CUDA 13 DLLs:

```text
cudart64_13.dll
cublas64_13.dll
cublasLt64_13.dll
```

`ggml-cuda.dll` exists next to `llama-server.exe`, but it appears to depend on these CUDA 13 runtime DLLs. Without the PyTorch CUDA DLL directory in `PATH`, the CUDA backend does not appear to load, and `llama-server` falls back to CPU even though `-ngl -1` is passed.
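To confirm this on another machine, it can help to see which `PATH` entry (if any) provides each DLL. This is a rough sketch only: the helper `locate_dlls` is hypothetical, and it inspects `PATH` alone, whereas the real Windows loader also searches other locations such as the executable's own directory.

```python
import os

# DLL names taken from the listing above.
REQUIRED_DLLS = ("cudart64_13.dll", "cublas64_13.dll", "cublasLt64_13.dll")


def locate_dlls(dll_names, path_value):
    """Map each DLL name to the first PATH directory containing it, or None."""
    found = {}
    for name in dll_names:
        found[name] = None
        for directory in path_value.split(os.pathsep):
            if directory and os.path.isfile(os.path.join(directory, name)):
                found[name] = directory
                break
    return found


# Usage: inspect the PATH that would be handed to the child process.
#   for dll, where in locate_dlls(REQUIRED_DLLS, os.environ.get("PATH", "")).items():
#       print(dll, "->", where or "NOT FOUND")
```

If every entry comes back as not found for the `PATH` Studio passes to `llama-server`, that matches the failure described here.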

## Local Fix

Adding the PyTorch DLL directory to `PATH` before spawning `llama-server` fixed the issue.

Minimal local patch in `studio/backend/core/inference/llama_cpp.py`:

```python
if sys.platform == "win32":
    path_dirs = [binary_dir]
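    # parents[4] is the site-packages directory in this install, so this
    # resolves to site-packages\torch\lib.
    torch_lib = Path(__file__).resolve().parents[4] / "torch" / "lib"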
    if torch_lib.is_dir():
        # PyTorch wheels bundle the CUDA runtime DLLs Studio's
        # llama.cpp build may need (for example cudart64_13.dll).
        path_dirs.append(str(torch_lib))

    cuda_path = os.environ.get("CUDA_PATH", "")
    if cuda_path:
        cuda_bin = os.path.join(cuda_path, "bin")
        if os.path.isdir(cuda_bin):
            path_dirs.append(cuda_bin)
```

After restarting Studio / reloading the model, the same model correctly loaded into VRAM.

## Expected Behavior

When Studio launches a CUDA-capable `llama-server` with `-ngl -1`, the model should load into VRAM if it fits, independent of whether the user's system-level `CUDA_PATH` points to an older toolkit than the CUDA runtime bundled with the PyTorch wheel.

## Actual Behavior

Studio launches with `-ngl -1`, but `llama-server` runs CPU/RAM-only unless `torch\lib` is available in the spawned process `PATH`.

## Suggested Fix

On Windows, include the active Python environment's `torch\lib` directory in the `PATH` used for the `llama-server` subprocess, preferably before `CUDA_PATH\bin`, so the bundled CUDA runtime DLLs are available to `ggml-cuda.dll`.
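One way to find the active environment's `torch\lib` without relying on the source file's position in the tree is to ask the import system where `torch` lives. The sketch below uses `importlib.util.find_spec`, which locates a top-level package without importing it. The function name `find_package_lib_dir` is hypothetical, and whether this suits Studio's startup path is an assumption.

```python
import importlib.util
from pathlib import Path


def find_package_lib_dir(package="torch", subdir="lib"):
    """Return <package>/<subdir> in the active environment, or None if absent."""
    try:
        spec = importlib.util.find_spec(package)
    except (ImportError, ValueError):
        return None
    if spec is None or not spec.submodule_search_locations:
        return None
    for location in spec.submodule_search_locations:
        candidate = Path(location) / subdir
        if candidate.is_dir():
            return candidate
    return None


# Usage:
#   torch_lib = find_package_lib_dir()  # e.g. ...\site-packages\torch\lib
```

Because the lookup goes through the interpreter that runs Studio, it follows whichever venv is active rather than a hard-coded directory depth.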

A more defensive version could add the directory only when `torch\lib` actually contains CUDA DLLs such as `cudart64_13.dll`, but simply placing the directory ahead of `CUDA_PATH\bin` seems consistent with how PyTorch wheels bundle CUDA dependencies on Windows.
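The defensive variant could look roughly like the following. This is a sketch under stated assumptions, not the Studio implementation: `has_cuda_runtime` and `build_child_path` are hypothetical names, the `cudart64_*.dll` glob is just one possible marker for a bundled CUDA runtime, and appending the inherited `PATH` at the end is assumed rather than taken from the existing code.

```python
import os
from pathlib import Path


def has_cuda_runtime(directory):
    """True if the directory holds a CUDA runtime DLL such as cudart64_13.dll."""
    return any(Path(directory).glob("cudart64_*.dll"))


def build_child_path(binary_dir, torch_lib=None, cuda_path=None, base_path=""):
    """Assemble PATH for llama-server: binary dir, torch\\lib, CUDA_PATH\\bin."""
    dirs = [str(binary_dir)]
    # Only add torch\lib when it really bundles a CUDA runtime.
    if torch_lib and Path(torch_lib).is_dir() and has_cuda_runtime(torch_lib):
        dirs.append(str(torch_lib))
    # The system toolkit comes after torch\lib so the bundled DLLs win.
    if cuda_path:
        cuda_bin = Path(cuda_path) / "bin"
        if cuda_bin.is_dir():
            dirs.append(str(cuda_bin))
    # Assumption: the inherited PATH is kept as the lowest-priority tail.
    if base_path:
        dirs.append(base_path)
    return os.pathsep.join(dirs)


# Usage (binary_dir and torch_lib are whatever the launcher already resolved):
#   env = dict(os.environ)
#   env["PATH"] = build_child_path(binary_dir, torch_lib,
#                                  os.environ.get("CUDA_PATH"), env.get("PATH", ""))
```

The guard keeps CPU-only PyTorch installs from adding an irrelevant directory, at the cost of one directory scan per launch.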
