
Common failures

Quickly recognize a failure class, understand the root cause, and take the highest-leverage next step.

Not sure where to start? Begin with the Diagnostics ladder.

All paths below are relative to .loom/.runtime/logs/<run_id>/ unless stated otherwise.


Schema validation failures (workflow YAML)

Symptom: loom check or loom run exits non-zero with "schema validation" errors referencing unknown fields, wrong types, or missing required keys.
Root cause: The workflow YAML does not match the expected schema version (field name, nesting, or type).
First step: Fix the first reported error; later errors are often cascading.
loom check

Cross-reference the exact field name and type against Workflow schema v1. If the error only reproduces during execution, verify you are running loom check from the repo that contains .loom/workflow.yml and compare that file with any alternate path you pass to loom run.

Example error signatures (from schema validation tests):

WF_SCHEMA_V1 /include/0/local: set include.local to .loom/templates/<name>.yml or .yaml
WF_SCHEMA_V1 /.a/extends: resolve extends cycle by removing one extends edge: .a -> .b -> .a
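
Taking the first signature above, a fix might look like the fragment below. The error path /include/0/local indicates entry 0 of the include list; the filename and surrounding structure here are assumptions inferred from the message, not confirmed schema:

```yaml
# Hypothetical include block; "base.yml" is a placeholder name.
# Per the message, include.local must point at a .yml or .yaml
# file under .loom/templates/.
include:
  - local: .loom/templates/base.yml
```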

See also: Syntax v1 · loom check · loom run


Docker provider issues (jobs using image:)

Symptom: Jobs with image: fail to start, hang at startup, or fail during pull. Errors mention "cannot connect to Docker", "pull access denied", "no such image", or "exec format error".
Root cause: Docker is not running or not reachable, the provider backend does not match your environment, or there is an image pull / auth / platform mismatch (e.g. arm64 vs amd64).
First step: Verify Docker is reachable from the shell where you run Loom.
docker version
docker info
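
A quick preflight from the same shell rules out the most common cause before touching Loom at all. This sketch is plain docker CLI, nothing Loom-specific, and prints a status line whether or not the daemon is up:

```shell
# Preflight sketch: is the Docker daemon reachable from this shell?
if command -v docker >/dev/null 2>&1 && docker info >/dev/null 2>&1; then
  echo "docker: reachable"
  # Compare the daemon architecture against your image's platform;
  # an arm64/amd64 mismatch surfaces as "exec format error".
  docker version --format 'daemon arch: {{.Server.Arch}}'
else
  echo "docker: NOT reachable (start Docker Desktop or the docker service)"
fi
```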

If Docker is reachable, confirm which workspace mount mode Loom is using:

  • CLI flag: --docker-workspace-mount bind_mount or --docker-workspace-mount ephemeral_volume
  • Environment variable: LOOM_DOCKER_WORKSPACE_MOUNT=bind_mount

If you have run logs, follow the pointer chain to the failing provider events:

  1. pipeline/manifest.json → failing_job_id
  2. jobs/<job_id>/manifest.json → system_sections[].events_path (typically jobs/<job_id>/system/provider/events.jsonl)

Example provider failure (from Docker provider tests):

Cannot connect to the Docker daemon at unix:///var/run/docker.sock.

See also: Docker provider · loom run · Diagnostics ladder


Cache divergence quarantine (--cache-diff)

Symptom: A run with --cache-diff reports divergence and quarantines cache-hit keys (or refuses to trust a previously cached result).
Root cause: Loom detected a correctness risk: a cached result does not match a recomputation. Common causes include nondeterministic scripts, untracked inputs, environment drift, or tool version changes.
First step: Treat this as a correctness signal, not "cache being flaky". Localize where divergence was detected.

Follow the pointer chain to the failing unit events:

  1. pipeline/summary.json → status and exit code
  2. pipeline/manifest.json → failing_job_id
  3. jobs/<job_id>/manifest.json → failing_step_events_path
  4. Open the pointed-to events.jsonl and look for messages containing "quarantine", "diverged", "cache-diff", or "recompute"
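
Step 4 above can be sketched with grep. The events.jsonl content here is a fabricated stand-in for illustration; Loom's actual event format may differ:

```shell
# Fabricated events.jsonl fixture standing in for a real run's step events.
mkdir -p /tmp/loom-demo
cat > /tmp/loom-demo/events.jsonl <<'EOF'
{"event": "cache_check", "message": "cache hit for key abc123"}
{"event": "cache_diff", "message": "recompute diverged from cached result; key abc123 quarantined"}
EOF
# Surface only the divergence-related lines.
grep -E 'quarantine|diverged|cache-diff|recompute' /tmp/loom-demo/events.jsonl
```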

To establish a known-good baseline, rerun without --cache-diff first to confirm the step is stable, then re-enable --cache-diff to isolate what diverges.

See also: Cache concept · Cache workflow · loom run


Variable precedence surprises (unexpected values)

Symptom: A command behaves as if a variable has a different value than expected (wrong branch, wrong token, wrong flags). You set it in the workflow, but the step sees something else.
Root cause: Multiple layers provide values (workflow vars, job vars, step env, CLI overrides, provider defaults) and a higher-precedence layer is overriding yours.
First step: Inspect what the failing step actually received; do not reason from memory.

Loom does not currently emit a dedicated "effective env" artifact. Add a temporary diagnostic line to your workflow to capture the resolved environment:

script:
  - env | grep -E '^CI_|^LOOM_|^YOURPREFIX_' | sort
  - <your real command>

Then open the step's events.jsonl (via the pointer chain in jobs/<job_id>/manifest.json → failing_step_events_path) and look for JSONL lines with:

  • event: "step_output"
  • stream: "stdout"
  • message: containing your printed env lines
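
Filtering for those lines can be sketched as below. The fixture is fabricated; the field names follow the bullets above, but the exact JSON shape is an assumption:

```shell
# Fabricated step events fixture; field names follow the bullets above.
mkdir -p /tmp/loom-env-demo
cat > /tmp/loom-env-demo/events.jsonl <<'EOF'
{"event": "step_output", "stream": "stdout", "message": "CI_BRANCH=main"}
{"event": "step_output", "stream": "stderr", "message": "warning: deprecated flag"}
{"event": "step_output", "stream": "stdout", "message": "LOOM_PROFILE=release"}
EOF
# Keep only stdout step_output lines, where the env dump from the
# temporary diagnostic step lands.
grep '"event": "step_output"' /tmp/loom-env-demo/events.jsonl | grep '"stream": "stdout"'
```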

Compare the resolved values against each source, from highest to lowest precedence:

  1. Step-level explicit environment / overrides
  2. Job-level variables / environment
  3. Workflow-level variables / environment
  4. CLI flags / shell environment passed to loom run
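
The layering can be illustrated at the shell level. This is an analogy for the precedence rule, not Loom's own resolution code: the inner (step-level) assignment wins over the outer (shell-level) one:

```shell
# Outer layer: value from the invoking shell (lower precedence).
export TOKEN=from-shell
# Inner layer: a per-command assignment overrides it for that step only.
TOKEN=from-step sh -c 'echo "step sees: $TOKEN"'   # prints: step sees: from-step
echo "shell still has: $TOKEN"                     # prints: shell still has: from-shell
```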

See also: Variables concept · Variables workflow · loom run


Secrets resolution failures (SECRETS_*)

Symptom: A job fails before or during script execution with a SECRETS_* error code.
Root cause: The secret reference cannot be resolved: provider unavailable, ref malformed, entry missing, or required secret absent.
First step: Read the error code from CLI output or job events, then match it to the codes below.
  • SECRETS_PROVIDER_UNAVAILABLE: provider backend not available (e.g. keepass:// is currently stubbed). Fix: use env:// instead.
  • SECRETS_REF_INVALID: malformed or unsupported ref URI. Fix: correct the ref value; check the scheme and syntax.
  • SECRETS_REF_NOT_FOUND: provider resolved, but the entry/field does not exist. Fix: verify the env var is exported (env://) or the entry path is correct (keepass://).
  • SECRETS_REQUIRED_MISSING: required secret could not be resolved. Fix: export the env var, or set required: false if the secret is optional.
  • SECRETS_UNSAFE_DEBUG_TRACE: CI_DEBUG_TRACE=true with file: false secrets. Fix: disable debug trace, or switch secrets to file: true.
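
Several of these fixes are declarative. A hedged sketch of an env-backed secret entry follows; only the env:// scheme and the required/file fields come from the codes above, while the surrounding structure is an assumption and may not match the real workflow schema:

```yaml
# Hypothetical secrets entry; structure is illustrative only.
secrets:
  DEPLOY_TOKEN:
    ref: env://DEPLOY_TOKEN   # env:// scheme, as in the codes above
    required: false           # avoids SECRETS_REQUIRED_MISSING when the var is absent
    file: true                # avoids SECRETS_UNSAFE_DEBUG_TRACE under CI_DEBUG_TRACE=true
```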

If the error is in runtime logs, follow the pointer chain:

  1. pipeline/manifest.json → failing_job_id
  2. jobs/<job_id>/manifest.json → failing unit events path

Events include diagnostic metadata (provider scheme, secret variable name) but never secret bytes.

Example — required env-based secret where the var is not exported:

SECRETS_REQUIRED_MISSING: secret "DEPLOY_TOKEN" (env://DEPLOY_TOKEN) could not be resolved

Fix: export DEPLOY_TOKEN="your-value" before running Loom.

See also: Secrets error codes · Concepts → Secrets · Workflows → Secrets


Pointer mismatch confusion (can't find "the real error")

Symptom: You have a run id and a logs directory, but you can't tell which job or step failed. Large log files do not reveal the root cause.
Root cause: You are reading output-heavy files instead of following the pointer documents that identify the exact failing unit.
First step: Restart from the canonical pointer sequence.

The pointer chain is designed to be low-noise:

  1. pipeline/summary.json → pipeline status + exit code
  2. pipeline/manifest.json → failing_job_id + failing_job_manifest_path
  3. jobs/<job_id>/summary.json → job status + exit code
  4. jobs/<job_id>/manifest.json → failing_step_events_path (user step) or system_sections[].events_path (provider/system)
  5. The pointed-to events.jsonl → the actual failure evidence

If the failure is in a provider or system section (not a user script), manifest.json will not contain failing_step_events_path. Instead, look at the system_sections array for the relevant events_path.
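
The chain can be walked mechanically. The sketch below uses grep against a fabricated log directory (real runs live under .loom/.runtime/logs/<run_id>/, and a JSON-aware tool like jq would be more robust); the events path value is invented for the fixture:

```shell
# Fabricated minimal log layout mirroring the pointer chain above.
ROOT=/tmp/loom-chain-demo
mkdir -p "$ROOT/pipeline" "$ROOT/jobs/check-pnpm"
cat > "$ROOT/pipeline/manifest.json" <<'EOF'
{"failing_job_id": "check-pnpm", "failing_job_manifest_path": "jobs/check-pnpm/manifest.json"}
EOF
cat > "$ROOT/jobs/check-pnpm/manifest.json" <<'EOF'
{"failing_step_events_path": "jobs/check-pnpm/steps/0/events.jsonl"}
EOF
# Step 2: extract failing_job_id (grep -o keeps only the matching fragment).
job=$(grep -o '"failing_job_id": *"[^"]*"' "$ROOT/pipeline/manifest.json" | cut -d'"' -f4)
echo "failing job: $job"
# Step 4: extract the events path from that job's manifest.
grep -o '"failing_step_events_path": *"[^"]*"' "$ROOT/jobs/$job/manifest.json" | cut -d'"' -f4
```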

Example pointer excerpt (from a failing local run):

{
  "failing_job_id": "check-pnpm",
  "failing_job_manifest_path": "jobs/check-pnpm/manifest.json"
}

See also: Diagnostics ladder · Runtime logs contract


Still stuck?

  1. Follow the full diagnostic flow: Diagnostics ladder
  2. Prepare a minimal report: What to share
  3. Include in your report:
    • The exact command you ran (including flags)
    • The run id or receipt path
    • pipeline/summary.json
    • pipeline/manifest.json
    • The failing unit events.jsonl (user step or system section)