Description
On Windows, Unsloth Studio detects CUDA and launches llama-server with GPU offload enabled (-ngl -1), but the GGUF model still runs CPU/RAM-only unless the PyTorch CUDA DLL directory is added to the child process PATH.
This was reproduced with Unsloth Studio 2026.5.2 and then fixed locally by prepending the Studio venv's torch\lib directory to the PATH used when spawning llama-server.
Possibly related to #5106 and #4949, but this issue has a confirmed Windows PATH / CUDA DLL root cause.
Environment
- OS: Windows
- GPU: NVIDIA GeForce RTX 5060 Ti, 16 GB VRAM
- Driver: NVIDIA 591.86
- CUDA shown by nvidia-smi: 13.1
- CUDA_PATH: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.9
- Unsloth Studio: 2026.5.2
- Torch: 2.10.0+cu130
- torch.cuda.is_available(): True
- Model: unsloth/gemma-4-E2B-it-GGUF
- GGUF: gemma-4-E2B-it-UD-Q4_K_XL.gguf
- llama.cpp build: b9174-59778f019
Symptoms
Studio launched llama-server with GPU offload enabled:
llama-server.exe -m ...\gemma-4-E2B-it-UD-Q4_K_XL.gguf --port 54538 -c 131072 --parallel 1 --flash-attn on --no-context-shift -ngl -1 --threads -1 --jinja ...
But the model did not move into VRAM:
- nvidia-smi stayed at about 1.4 GB / 16 GB, mostly desktop/graphics usage
- no llama-server.exe compute process appeared in nvidia-smi
- llama-server.exe used about 4.25 GB system RAM
- inference worked, but appeared to be CPU/RAM-only
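The "no compute process" symptom can also be checked from a script instead of by eye. The following is a minimal sketch, assuming nvidia-smi is on PATH and that its `--query-compute-apps` CSV output is available on the machine. The helper names are hypothetical and not part of Studio.

```python
import subprocess


def parse_compute_apps(csv_text):
    """Parse the output of
    `nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader`
    into (pid, process_name, used_memory) tuples."""
    apps = []
    for line in csv_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split from both ends so a comma inside the process path survives.
        pid, rest = line.split(",", 1)
        name, used = rest.rsplit(",", 1)
        apps.append((int(pid), name.strip(), used.strip()))
    return apps


def has_compute_process(apps, exe_name="llama-server.exe"):
    """Return True if any listed compute process path ends with exe_name."""
    return any(name.lower().endswith(exe_name.lower()) for _, name, _ in apps)


def query_compute_apps():
    """Run nvidia-smi and return the parsed compute apps (needs an NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid,process_name,used_memory",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_compute_apps(out)
```

With a model loaded, `has_compute_process(query_compute_apps())` returning False would match the symptom described above. The per-process memory column may not be reported on every Windows driver configuration, so the sketch only relies on the process name.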
Root Cause
The Windows launch path in studio/backend/core/inference/llama_cpp.py adds only:
- the llama.cpp binary directory
- CUDA_PATH\bin
before spawning llama-server.
On this machine, CUDA_PATH points to CUDA 12.9. However, the Studio/PyTorch environment includes CUDA 13 runtime DLLs in:
C:\Users\Anton\.unsloth\studio\unsloth_studio\Lib\site-packages\torch\lib
That directory contains the required CUDA 13 DLLs:
- cudart64_13.dll
- cublas64_13.dll
- cublasLt64_13.dll
ggml-cuda.dll exists next to llama-server.exe, but without the PyTorch CUDA DLL directory in PATH, the CUDA backend does not appear to load/use the GPU correctly, even though -ngl -1 is passed.
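To see which PATH entries, if any, would supply the CUDA 13 DLLs to the child process, a small diagnostic like the one below can help. It is only a sketch: the function names are hypothetical, and it approximates DLL resolution by scanning PATH alone, whereas the real Windows loader also checks the application directory and system directories first.

```python
import os


def dirs_providing(dll_name, path_value, sep=os.pathsep):
    """Return the PATH entries (in order) that contain a file named dll_name."""
    hits = []
    for entry in path_value.split(sep):
        if entry and os.path.isfile(os.path.join(entry, dll_name)):
            hits.append(entry)
    return hits


def missing_dlls(dll_names, path_value, sep=os.pathsep):
    """Return the DLL names that no PATH entry provides."""
    return [n for n in dll_names if not dirs_providing(n, path_value, sep)]
```

For example, `missing_dlls(["cudart64_13.dll", "cublas64_13.dll", "cublasLt64_13.dll"], os.environ["PATH"])` lists the DLLs that the given PATH cannot resolve. Running it against the PATH that Studio passes to llama-server, with and without torch\lib, is one way to confirm the diagnosis.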
Local Fix
Adding the PyTorch DLL directory to PATH before spawning llama-server fixed the issue.
Minimal local patch in studio/backend/core/inference/llama_cpp.py:
if sys.platform == "win32":
    path_dirs = [binary_dir]
    torch_lib = Path(__file__).resolve().parents[4] / "torch" / "lib"
    if torch_lib.is_dir():
        # PyTorch wheels bundle the CUDA runtime DLLs Studio's
        # llama.cpp build may need (for example cudart64_13.dll).
        path_dirs.append(str(torch_lib))
    cuda_path = os.environ.get("CUDA_PATH", "")
    if cuda_path:
        cuda_bin = os.path.join(cuda_path, "bin")
        if os.path.isdir(cuda_bin):
            path_dirs.append(cuda_bin)
After restarting Studio / reloading the model, the same model correctly loaded into VRAM.
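The patch excerpt stops at building path_dirs and does not show how the list reaches the child process. A minimal sketch of that last step follows, assuming the directories are prepended to PATH in a copied environment that is then passed to the subprocess. `build_child_env` is a hypothetical helper, not Studio's actual code.

```python
import os


def build_child_env(path_dirs, base_env=None):
    """Return a copy of base_env whose PATH starts with path_dirs, in order."""
    env = dict(os.environ if base_env is None else base_env)
    # Windows environment variable names are case-insensitive, so reuse
    # whatever spelling ("PATH", "Path") the base environment already has.
    path_key = next((k for k in env if k.upper() == "PATH"), "PATH")
    existing = env.get(path_key, "")
    parts = [d for d in path_dirs if d]
    if existing:
        parts.append(existing)
    env[path_key] = os.pathsep.join(parts)
    return env
```

The result would be used as `subprocess.Popen(cmd, env=build_child_env(path_dirs))`. Copying the environment keeps the change scoped to llama-server, so the Studio process itself keeps its original PATH.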
Expected Behavior
When Studio launches a CUDA-capable llama-server with -ngl -1, the model should load into VRAM if it fits, independent of whether the user's system-level CUDA_PATH points to an older toolkit than the CUDA runtime bundled with the PyTorch wheel.
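The toolkit/runtime mismatch described here can be detected from the two version strings in the Environment section. The sketch below assumes the default installer layout, where CUDA_PATH ends in `vMAJOR.MINOR`, and the usual `+cuXYZ` local tag in PyTorch wheel versions (last digit is the minor version). Both function names are hypothetical.

```python
import re


def cuda_version_from_path(cuda_path):
    """Extract (major, minor) from a toolkit path ending in vMAJOR.MINOR."""
    m = re.search(r"v(\d+)\.(\d+)[\\/]*$", cuda_path)
    return (int(m.group(1)), int(m.group(2))) if m else None


def cuda_version_from_torch(torch_version):
    """Extract (major, minor) from a version with a local tag such as '+cu130'."""
    m = re.search(r"\+cu(\d+)(\d)$", torch_version)
    return (int(m.group(1)), int(m.group(2))) if m else None
```

When both functions return a version and the toolkit's tuple is lower than the wheel's, Studio could log a warning or prefer torch\lib. Either function returns None for strings that do not follow the assumed conventions, so callers should treat None as "unknown" rather than as a mismatch.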
Actual Behavior
Studio launches with -ngl -1, but llama-server runs CPU/RAM-only unless torch\lib is available in the spawned process PATH.
Suggested Fix
On Windows, include the active Python environment's torch\lib directory in the PATH used for the llama-server subprocess, preferably before CUDA_PATH\bin, so the bundled CUDA runtime DLLs are available to ggml-cuda.dll.
A more defensive version could attempt this only when torch\lib contains CUDA DLLs such as cudart64_13.dll, but simply prepending the directory seems consistent with how PyTorch wheels bundle CUDA dependencies on Windows.
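The more defensive variant could look roughly like the sketch below. It makes two assumptions that are not part of the local patch: torch\lib is located through `importlib.util.find_spec` rather than a fixed `parents[4]` offset, and the presence of any `cudart64_*.dll` is taken as the signal that the wheel bundles a CUDA runtime. The helper names are hypothetical.

```python
import importlib.util
from pathlib import Path


def find_torch_lib():
    """Locate <site-packages>/torch/lib without importing torch; None if absent."""
    try:
        spec = importlib.util.find_spec("torch")
    except (ImportError, ValueError):
        return None
    if spec is None or not spec.submodule_search_locations:
        return None
    lib_dir = Path(next(iter(spec.submodule_search_locations))) / "lib"
    return lib_dir if lib_dir.is_dir() else None


def has_cuda_runtime(lib_dir):
    """Return True if lib_dir holds at least one cudart64_*.dll."""
    return lib_dir is not None and any(Path(lib_dir).glob("cudart64_*.dll"))
```

A caller would append `str(torch_lib)` to path_dirs only when `has_cuda_runtime(torch_lib)` is True. Resolving the directory through the import system avoids depending on where llama_cpp.py sits relative to site-packages, at the cost of assuming torch is installed in the same environment that runs Studio.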