Common failures
Quickly recognize a failure class, understand the root cause, and take the highest-leverage next step.
Not sure where to start? Begin with the Diagnostics ladder.
All paths below are relative to .loom/.runtime/logs/<run_id>/ unless stated otherwise.
Schema validation failures (workflow YAML)
| | |
|---|---|
| Symptom | loom check or loom run exits non-zero with "schema validation" errors referencing unknown fields, wrong types, or missing required keys. |
| Root cause | Workflow YAML does not match the expected schema version (field name, nesting, or type). |
| First step | Fix the first reported error; later errors are often cascading. |
loom check
Cross-reference the exact field name and type against Workflow schema v1. If the error only reproduces during execution, verify you are running loom check from the repo that contains .loom/workflow.yml and compare that file with any alternate path you pass to loom run.
Example error signatures (from schema validation tests):
WF_SCHEMA_V1 /include/0/local: set include.local to .loom/templates/<name>.yml or .yaml
WF_SCHEMA_V1 /.a/extends: resolve extends cycle by removing one extends edge: .a -> .b -> .a
See also: Syntax v1 · loom check · loom run
Docker provider issues (jobs using image:)
| | |
|---|---|
| Symptom | Jobs with image: fail to start, hang at startup, or fail during pull. Errors mention "cannot connect to Docker", "pull access denied", "no such image", or "exec format error". |
| Root cause | Docker is not running or not reachable, the provider backend does not match your environment, or there is an image pull / auth / platform mismatch (e.g. arm64 vs amd64). |
| First step | Verify Docker is reachable from the shell where you run Loom. |
docker version
docker info
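If you want to script this preflight, a minimal sketch is below. The function name and timeout are mine, not part of Loom; it simply shells out to the docker info command shown above and reports whether the daemon answered.

```python
import subprocess

def docker_reachable(timeout_seconds=15):
    """Return True if the Docker daemon answers `docker info`.

    Illustrative preflight sketch; mirrors the manual check above.
    """
    try:
        result = subprocess.run(
            ["docker", "info"],
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except FileNotFoundError:
        # `docker` binary not on PATH at all.
        return False
    except subprocess.TimeoutExpired:
        # Daemon hung or unreachable.
        return False
    return result.returncode == 0
```

Run this from the same shell (and user) you use for loom run, since daemon reachability can differ between shells.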
If Docker is reachable, confirm which workspace mount mode Loom is using:
| Method | Example |
|---|---|
| CLI flag | --docker-workspace-mount bind_mount or --docker-workspace-mount ephemeral_volume |
| Environment variable | LOOM_DOCKER_WORKSPACE_MOUNT=bind_mount |
If you have run logs, follow the pointer chain to the failing provider events:
- pipeline/manifest.json → failing_job_id
- jobs/<job_id>/manifest.json → system_sections[].events_path (typically jobs/<job_id>/system/provider/events.jsonl)
Example provider failure (from Docker provider tests):
Cannot connect to the Docker daemon at unix:///var/run/docker.sock.
See also: Docker provider · loom run · Diagnostics ladder
Cache divergence quarantine (--cache-diff)
| | |
|---|---|
| Symptom | A run with --cache-diff reports divergence and quarantines cache-hit keys (or refuses to trust a previously cached result). |
| Root cause | Loom detected a correctness risk: a cached result does not match a recomputation. Common causes include nondeterministic scripts, untracked inputs, environment drift, or tool version changes. |
| First step | Treat this as a correctness signal, not "cache being flaky". Localize where divergence was detected. |
Follow the pointer chain to the failing unit events:
- pipeline/summary.json → status and exit code
- pipeline/manifest.json → failing_job_id
- jobs/<job_id>/manifest.json → failing_step_events_path
- Open the pointed-to events.jsonl and look for messages containing "quarantine", "diverged", "cache-diff", or "recompute"
To establish a known-good baseline, rerun without --cache-diff first to confirm the step is stable, then re-enable --cache-diff to isolate what diverges.
See also: Cache concept · Cache workflow · loom run
Variable precedence surprises (unexpected values)
| | |
|---|---|
| Symptom | A command behaves as if a variable has a different value than expected: wrong branch, wrong token, wrong flags. You set it in the workflow, but the step sees something else. |
| Root cause | Multiple layers provide values (workflow vars, job vars, step env, CLI overrides, provider defaults) and a higher-precedence layer is overriding yours. |
| First step | Inspect what the failing step actually received; do not reason from memory. |
Loom does not currently emit a dedicated "effective env" artifact. Add a temporary diagnostic line to your workflow to capture the resolved environment:
script:
  - env | grep -E '^CI_|^LOOM_|^YOURPREFIX_' | sort
  - <your real command>
Then open the step's events.jsonl (via the pointer chain in jobs/<job_id>/manifest.json → failing_step_events_path) and look for JSONL lines with:
- event: "step_output"
- stream: "stdout"
- message containing your printed env lines
Compare the resolved values against each source, from highest to lowest precedence:
- Step-level explicit environment / overrides
- Job-level variables / environment
- Workflow-level variables / environment
- CLI flags / shell environment passed to loom run
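The layering above can be sketched as a simple merge in which higher-precedence layers win. This is an illustrative model, not Loom's actual resolution code, and the variable names and values are hypothetical.

```python
def resolve_vars(*layers):
    """Merge variable layers given from highest to lowest precedence.

    Illustrative model only -- not Loom's actual resolution code.
    """
    resolved = {}
    for layer in layers:
        for key, value in layer.items():
            resolved.setdefault(key, value)  # first (highest-precedence) value wins
    return resolved

# Hypothetical layers mirroring the precedence list above.
step_env      = {"DEPLOY_ENV": "staging"}
job_vars      = {"DEPLOY_ENV": "prod", "REGION": "eu-1"}
workflow_vars = {"REGION": "us-1", "VERBOSE": "0"}
cli_env       = {"VERBOSE": "1"}

effective = resolve_vars(step_env, job_vars, workflow_vars, cli_env)
print(effective)
# {'DEPLOY_ENV': 'staging', 'REGION': 'eu-1', 'VERBOSE': '0'}
```

If a step sees "staging" when you set "prod" at the job level, a step-level layer is shadowing your value, exactly as in this sketch.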
See also: Variables concept · Variables workflow · loom run
Secrets resolution failures (SECRETS_*)
| | |
|---|---|
| Symptom | A job fails before or during script execution with a SECRETS_* error code. |
| Root cause | The secret reference cannot be resolved: provider unavailable, ref malformed, entry missing, or required secret absent. |
| First step | Read the error code from CLI output or job events, then match it to the table below. |
| Code | Meaning | Fix |
|---|---|---|
| SECRETS_PROVIDER_UNAVAILABLE | Provider backend not available (e.g. keepass:// is currently stubbed) | Use env:// instead |
| SECRETS_REF_INVALID | Malformed or unsupported ref URI | Fix the ref value; check scheme and syntax |
| SECRETS_REF_NOT_FOUND | Provider resolved but entry/field does not exist | Verify the env var is exported (env://) or the entry path is correct (keepass://) |
| SECRETS_REQUIRED_MISSING | Required secret could not be resolved | Export the env var, or set required: false if the secret is optional |
| SECRETS_UNSAFE_DEBUG_TRACE | CI_DEBUG_TRACE=true with file: false secrets | Disable debug trace, or switch secrets to file: true |
If the error is in runtime logs, follow the pointer chain:
- pipeline/manifest.json → failing_job_id
- jobs/<job_id>/manifest.json → failing unit events path
Events include diagnostic metadata (provider scheme, secret variable name) but never secret bytes.
Example — required env-based secret where the var is not exported:
SECRETS_REQUIRED_MISSING: secret "DEPLOY_TOKEN" (env://DEPLOY_TOKEN) could not be resolved
Fix: export DEPLOY_TOKEN="your-value" before running Loom.
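You can preflight env:// refs yourself before a run. A hedged sketch, assuming refs look like env://VAR_NAME as in the example above; the function and its return shape are mine, not a Loom API, and the error-code strings merely mimic the table for readability.

```python
import os

def check_env_secret(ref, required=True):
    """Return (ok, detail) for an env:// secret ref.

    Illustrative preflight only; assumes refs look like env://VAR_NAME.
    """
    if not ref.startswith("env://"):
        return False, f"SECRETS_REF_INVALID: unsupported scheme in {ref!r}"
    var = ref[len("env://"):]
    if not var:
        return False, f"SECRETS_REF_INVALID: empty variable name in {ref!r}"
    if os.environ.get(var) is None:
        if required:
            return False, f"SECRETS_REQUIRED_MISSING: {var} is not exported"
        return True, f"{var} is missing but optional"
    return True, f"{var} resolves"
```

Running this over every env:// ref in your workflow before loom run surfaces missing exports without leaking secret values.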
See also: Secrets error codes · Concepts → Secrets · Workflows → Secrets
Pointer mismatch confusion (can't find "the real error")
| | |
|---|---|
| Symptom | You have a run id and a logs directory, but you can't tell which job or step failed. Large log files do not reveal the root cause. |
| Root cause | You are reading output-heavy files instead of following the pointer documents that identify the exact failing unit. |
| First step | Restart from the canonical pointer sequence. |
The pointer chain is designed to be low-noise:
| Step | File | What it tells you |
|---|---|---|
| 1 | pipeline/summary.json | Pipeline status + exit code |
| 2 | pipeline/manifest.json | failing_job_id + failing_job_manifest_path |
| 3 | jobs/<job_id>/summary.json | Job status + exit code |
| 4 | jobs/<job_id>/manifest.json | failing_step_events_path (user step) or system_sections[].events_path (provider/system) |
| 5 | The pointed-to events.jsonl | The actual failure evidence |
If the failure is in a provider or system section (not a user script), manifest.json will not contain failing_step_events_path. Instead, look at the system_sections array for the relevant events_path.
Example pointer excerpt (from a failing local run):
{
"failing_job_id": "check-pnpm",
"failing_job_manifest_path": "jobs/check-pnpm/manifest.json"
}
See also: Diagnostics ladder · Runtime logs contract
Still stuck?
- Follow the full diagnostic flow: Diagnostics ladder
- Prepare a minimal report: What to share
- Include in your report:
  - The exact command you ran (including flags)
  - The run id or receipt path
  - pipeline/summary.json
  - pipeline/manifest.json
  - The failing unit events.jsonl (user step or system section)