
Add JarvisLabs backend #3875

Open
peterschmidt85 wants to merge 2 commits into master from jarvislabs

peterschmidt85 (Contributor) commented May 12, 2026

Adds JarvisLabs as a dstack backend.

Implementation notes:

  • Adds backend registration, config models, configurator, API client, compute implementation, docs, and backend tests.
  • Uses the JarvisLabs provider from gpuhunt for offer selection. This branch depends on gpuhunt#231 ("Add JarvisLabs provider") until the provider is released.
  • Supports JarvisLabs VM workloads only. GPU VMs and CPU VMs use separate JarvisLabs create/destroy APIs.
  • Supports GPU spot by passing the selected offer's spot flag to JarvisLabs GPU VM creation. CPU spot is not emitted by gpuhunt and is not supported.
  • Does not select a JarvisLabs template or custom image; provisioning uses the provider default VM image.
  • Validates configured regions against gpuhunt's JarvisLabs supported-region map and fails closed if an unsupported region reaches a regional API call.
  • Registers the project SSH key in JarvisLabs before creating an instance.
  • Starts the dstack shim over SSH and persists hostname only after shim startup succeeds, so provisioning can retry after a server restart.
  • Maps immediate and delayed JarvisLabs create capacity failures to NoCapacityError and destroys any failed machine ID returned by JarvisLabs before retrying another offer. A failed create status that is not capacity-related raises ProvisioningError. After a VM is running, interruption/unreachability is handled by the generic VM health path, as with other VM backends.
  • Wraps JarvisLabs request failures and malformed success responses as BackendError instead of leaking raw transport/JSON exceptions.
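
To illustrate the create-failure mapping described above, here is a minimal sketch rather than the code in this branch. `CreateResult`, its fields, `CAPACITY_MARKERS`, and `handle_create_result` are hypothetical names, the exception classes are local stand-ins for dstack's `NoCapacityError` and `ProvisioningError`, and the status and reason strings are assumed for the example, not taken from the JarvisLabs API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class NoCapacityError(Exception):
    """Local stand-in for dstack's NoCapacityError."""


class ProvisioningError(Exception):
    """Local stand-in for dstack's ProvisioningError."""


@dataclass
class CreateResult:
    # Hypothetical shape of a create response; not the real JarvisLabs schema.
    machine_id: Optional[str]
    status: str  # assumed values: "Running" or "Failed"
    reason: str = ""


# Assumed substrings that signal a capacity failure; the real signals may differ.
CAPACITY_MARKERS = ("no capacity", "out of stock")


def is_capacity_failure(reason: str) -> bool:
    lowered = reason.lower()
    return any(marker in lowered for marker in CAPACITY_MARKERS)


def handle_create_result(result: CreateResult, destroy: Callable[[str], None]) -> str:
    """Return the machine id on success, otherwise raise a mapped error."""
    if result.status != "Failed":
        if result.machine_id is None:
            raise ProvisioningError("create succeeded without a machine id")
        return result.machine_id
    if is_capacity_failure(result.reason):
        # Clean up the failed machine, if one was returned, so the caller
        # can retry another offer without leaking resources.
        if result.machine_id is not None:
            destroy(result.machine_id)
        raise NoCapacityError(result.reason)
    # Any other failed status is not retried as a capacity problem.
    raise ProvisioningError(result.reason)
```

In this sketch the caller would catch `NoCapacityError` and move on to the next offer, while `ProvisioningError` ends the attempt.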

E2E validation:

  • CPU on-demand task provisioned and completed on JarvisLabs.
  • L4 GPU on-demand task provisioned and completed a CUDA tensor matmul on the GPU.
  • H100 GPU spot task provisioned with JarvisLabs is_spot: true and completed a CUDA tensor matmul on the GPU.
  • Requested 120GB/200GB disks were visible inside containers in the live disk checks.
  • Server restart was tested while JarvisLabs runs were active; provisioning resumed instead of losing the run.
  • L4 spot no-capacity was observed from JarvisLabs and handled as a capacity failure.

Added tests cover config validation, API payloads, API error normalization, spot flag propagation, region failure behavior, capacity-failure mapping and cleanup, CPU/GPU provisioning data, disk sizing, SSH username parsing, termination routing, and restart-safe hostname persistence.
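
As an illustration of the API error normalization mentioned in the test list, the following is a minimal sketch and not the client code from this branch. `parse_create_response` and the `machine_id` field are hypothetical, and `BackendError` is a local stand-in for dstack's exception of the same name.

```python
import json


class BackendError(Exception):
    """Local stand-in for dstack's BackendError."""


def parse_create_response(raw_body: str) -> str:
    """Return the machine id from a JSON body, or raise BackendError."""
    try:
        payload = json.loads(raw_body)
    except json.JSONDecodeError as e:
        # Do not leak the raw JSON exception to callers.
        raise BackendError("JarvisLabs returned a non-JSON response") from e
    # "machine_id" is an assumed field name used only for this example.
    if not isinstance(payload, dict) or not isinstance(payload.get("machine_id"), str):
        raise BackendError("JarvisLabs response is missing machine_id")
    return payload["machine_id"]
```

A test in this style would assert that both a non-JSON body and a well-formed body without the expected field surface as `BackendError`.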

peterschmidt85 force-pushed the jarvislabs branch 4 times, most recently from c8850b2 to 3ad620f (May 12, 2026 21:01)
peterschmidt85 marked this pull request as ready for review (May 12, 2026 21:17)
peterschmidt85 requested a review from jvstme (May 12, 2026 21:17)