[Bug]: dstack apply fails with UnicodeDecodeError when a modified tracked file contains non-UTF-8 bytes #3880

@jkbhagatio

Description

dstack version

0.20.19

Python version

3.13.7

Host OS

Linux 5.15.0-135-generic (Ubuntu)

Host Arch

x86_64

What happened?

dstack apply -f <run config> aborts during repo-diff packaging when the working
tree contains a modified tracked file whose content has any sequence the UTF-8
decoder rejects (e.g., a stray Latin-1 byte, BOM artifact, or otherwise
malformed UTF-8). In my case the offending file was a LaTeX paper
(*.tex) unrelated to the run.

Workaround: git stash the offending file before dstack apply, then
git stash pop afterwards. This does not help when the offending file itself
has to be part of the diff.

Steps to reproduce

  1. In a git repo, modify a tracked text file so it contains at least one
    non-UTF-8 byte sequence (e.g., a Latin-1-encoded character not valid in
    UTF-8, or a malformed multi-byte sequence).
  2. Without committing, run dstack apply -f <any run config> from that repo.
  3. The CLI aborts with the trace below before any plan is shown.
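
For step 1, a small script along these lines can plant a non-UTF-8 byte sequence in a tracked file. This is only a sketch: the helper names `append_latin1_line` and `is_valid_utf8` and the sample text are made up for illustration and are not part of dstack.

```python
from pathlib import Path


def append_latin1_line(path: Path, text: str = "café naïve\n") -> bytes:
    """Append `text` encoded as Latin-1 and return the bytes that were written.

    The Latin-1 bytes for 'é' (0xE9) and 'ï' (0xEF) are UTF-8 lead bytes that
    are not followed by valid continuation bytes here, so a strict UTF-8
    decoder rejects the file.
    """
    payload = text.encode("latin-1")
    with path.open("ab") as f:
        f.write(payload)
    return payload


def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes cleanly as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Calling `append_latin1_line(Path("paper.tex"))` on a tracked file and leaving the change uncommitted should be enough to set up step 2.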

Relevant log output

File ".../dstack/_internal/cli/services/configurators/run.py", line 567, in get_repo
    repo = get_repo_from_dir(local_path)
File ".../dstack/_internal/core/models/repos/remote.py", line 372, in _repo_diff_verbose
    _interactive_git_proc(repo.git.diff(repo_hash, as_process=True), collector)
File ".../dstack/_internal/core/models/repos/remote.py", line 363, in _interactive_git_proc
    collector.write(stdout)
File ".../dstack/_internal/core/models/repos/remote.py", line 259, in write
    self.buffer.write(v.decode())
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 28317: invalid continuation byte

Root cause

_DiffCollector.write at remote.py:259 calls v.decode() (i.e.,
v.decode("utf-8") with default strict errors) on the raw bytes returned by
git diff. Any non-UTF-8 byte sequence in a tracked file makes this raise,
which propagates out of _repo_diff_verbose and kills dstack apply before
the user can see the plan.
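
The failure can be reproduced in isolation with a stand-in collector. This is a sketch under assumptions: `StrictCollector` and `try_write` are hypothetical names, and the `io.StringIO` buffer is a guess. Only the `write` body is taken from the trace above.

```python
import io
from typing import Optional


class StrictCollector:
    """Hypothetical stand-in for _DiffCollector; only write() mirrors the trace."""

    def __init__(self) -> None:
        # Assumption: the real collector buffers decoded text in some str sink.
        self.buffer = io.StringIO()

    def write(self, v: bytes) -> None:
        # Same call as remote.py:259: strict UTF-8 decoding.
        self.buffer.write(v.decode())


def try_write(chunk: bytes) -> Optional[str]:
    """Return the decoder's failure reason, or None if the chunk decoded."""
    collector = StrictCollector()
    try:
        collector.write(chunk)
    except UnicodeDecodeError as exc:
        return exc.reason
    return None
```

With this stand-in, `try_write(b"\xe2(")` reports an invalid continuation byte, the same reason as in the log: 0xE2 opens a three-byte sequence, but the next byte is not a continuation byte.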

Suggested fix

In dstack/_internal/core/models/repos/remote.py, change:

```python
def write(self, v: bytes):
    self.buffer.write(v.decode())
```

to either:

```python
def write(self, v: bytes):
    self.buffer.write(v.decode("utf-8", errors="replace"))
```

(simplest), or use `codecs.getincrementaldecoder("utf-8")(errors="replace")` if
chunked decoding across multiple `write` calls is expected. The same fix is
likely wanted in the `untracked_files` loop in `_repo_diff_verbose`, for parity
with the spirit of #390.
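
A rough sketch of the incremental variant follows, in case it helps the discussion. The class name `IncrementalDiffCollector`, the `io.StringIO` buffer, and the `finish` method are assumptions for illustration, not the existing dstack code. The reason to consider it is that per-chunk `decode(..., errors="replace")` turns a valid multi-byte character into replacement characters when a chunk boundary falls in the middle of it. An incremental decoder keeps the partial bytes until the next `write`.

```python
import codecs
import io


class IncrementalDiffCollector:
    """Hypothetical collector that tolerates bad bytes and split characters."""

    def __init__(self) -> None:
        self.buffer = io.StringIO()  # assumption: text is accumulated in memory
        # Invalid sequences become U+FFFD instead of raising UnicodeDecodeError.
        self._decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")

    def write(self, v: bytes) -> None:
        # Incomplete trailing bytes are held back until the next chunk arrives.
        self.buffer.write(self._decoder.decode(v))

    def finish(self) -> str:
        # Flush held-back bytes; a dangling partial sequence becomes U+FFFD.
        self.buffer.write(self._decoder.decode(b"", final=True))
        return self.buffer.getvalue()


def per_chunk_replace(chunks: list[bytes]) -> str:
    """The simpler fix applied chunk by chunk, shown for comparison."""
    return "".join(c.decode("utf-8", errors="replace") for c in chunks)
```

Feeding the three bytes of "€" as `b"\xe2"` followed by `b"\x82\xac"` shows the difference: the incremental collector yields "€", while `per_chunk_replace` yields only replacement characters. Whether git's output actually reaches `write` in chunks that can split a character is not something the trace confirms, so this may be unnecessary.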

Related

#390 (closed): same exception class, but for untracked binary files. The fix
there handled the untracked-binary path; tracked-text-with-non-UTF-8 still
falls through to the strict decoder.
