
Common failures

Quickly recognize a failure class, understand the root cause, and take the highest-leverage next step.

Not sure where to start? Begin with the Diagnostics ladder.

All paths below are relative to .loom/.runtime/logs/<run_id>/ unless stated otherwise.


Schema validation failures (workflow YAML)

Symptom: loom check or loom run exits non-zero with "schema validation" errors referencing unknown fields, wrong types, or missing required keys.
Root cause: The workflow YAML does not match the expected schema version (field name, nesting, or type).
First step: Fix the first reported error; later errors are often cascading.
loom check

Cross-reference the exact field name and type against Workflow schema v1. If the error only reproduces during execution, verify you are running loom check from the repo that contains .loom/workflow.yml and compare that file with any alternate path you pass to loom run.

Example error signatures (from schema validation tests):

WF_SCHEMA_V1 /include/0/local: set include.local to .loom/templates/<name>.yml or .yaml
WF_SCHEMA_V1 /.a/extends: resolve extends cycle by removing one extends edge: .a -> .b -> .a
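
Taking the first signature above, a fix might look like the fragment below. The error path /include/0/local indicates entry 0 of the include list; the filename and surrounding structure here are assumptions inferred from the message, not confirmed schema:

```yaml
# Hypothetical include block; "base.yml" is a placeholder name.
# Per the message, include.local must point at a .yml or .yaml
# file under .loom/templates/.
include:
  - local: .loom/templates/base.yml
```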

See also: Syntax v1 · loom check · loom run


Docker provider issues (jobs using image:)

Symptom: Jobs with image: fail to start, hang at startup, or fail during pull. Errors mention "cannot connect to Docker", "pull access denied", "no such image", or "exec format error".
Root cause: Docker is not running or not reachable, the provider backend does not match your environment, or there is an image pull / auth / platform mismatch (e.g. arm64 vs amd64).
First step: Verify Docker is reachable from the shell where you run Loom.
docker version
docker info
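
A quick preflight from the same shell rules out the most common cause before touching Loom at all. This sketch is plain docker CLI, nothing Loom-specific, and prints a status line whether or not the daemon is up:

```shell
# Preflight sketch: is the Docker daemon reachable from this shell?
if command -v docker >/dev/null 2>&1 && docker info >/dev/null 2>&1; then
  echo "docker: reachable"
  # Compare the daemon architecture against your image's platform;
  # an arm64/amd64 mismatch surfaces as "exec format error".
  docker version --format 'daemon arch: {{.Server.Arch}}'
else
  echo "docker: NOT reachable (start Docker Desktop or the docker service)"
fi
```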

If Docker is reachable, confirm which workspace mount mode Loom is using:

  • CLI flag: --docker-workspace-mount bind_mount or --docker-workspace-mount ephemeral_volume
  • Environment variable: LOOM_DOCKER_WORKSPACE_MOUNT=bind_mount

If you have run logs, follow the pointer chain to the failing provider events:

  1. pipeline/manifest.json → failing_job_id
  2. jobs/<job_id>/manifest.json → system_sections[].events_path (typically jobs/<job_id>/system/provider/events.jsonl)

Example provider failure (from Docker provider tests):

Cannot connect to the Docker daemon at unix:///var/run/docker.sock.

See also: Docker provider · loom run · Diagnostics ladder


Cache divergence quarantine (--cache-diff)

Symptom: A run with --cache-diff reports divergence and quarantines cache-hit keys (or refuses to trust a previously cached result).
Root cause: Loom detected a correctness risk: a cached result does not match a recomputation. Common causes include nondeterministic scripts, untracked inputs, environment drift, or tool version changes.
First step: Treat this as a correctness signal, not "cache being flaky". Localize where divergence was detected.

Follow the pointer chain to the failing unit events:

  1. pipeline/summary.json → status and exit code
  2. pipeline/manifest.json → failing_job_id
  3. jobs/<job_id>/manifest.json → failing_step_events_path
  4. Open the pointed-to events.jsonl and look for messages containing "quarantine", "diverged", "cache-diff", or "recompute"
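
Step 4 above can be sketched with grep. The events.jsonl content here is a fabricated stand-in for illustration; Loom's actual event format may differ:

```shell
# Fabricated events.jsonl fixture standing in for a real run's step events.
mkdir -p /tmp/loom-demo
cat > /tmp/loom-demo/events.jsonl <<'EOF'
{"event": "cache_check", "message": "cache hit for key abc123"}
{"event": "cache_diff", "message": "recompute diverged from cached result; key abc123 quarantined"}
EOF
# Surface only the divergence-related lines.
grep -E 'quarantine|diverged|cache-diff|recompute' /tmp/loom-demo/events.jsonl
```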

To establish a known-good baseline, rerun without --cache-diff first to confirm the step is stable, then re-enable --cache-diff to isolate what diverges.

See also: Cache concept · Cache workflow · loom run


Variable precedence surprises (unexpected values)

Symptom: A command behaves as if a variable has a different value than expected (wrong branch, wrong token, wrong flags). You set it in the workflow, but the step sees something else.
Root cause: Multiple layers provide values (workflow vars, job vars, step env, CLI overrides, provider defaults) and a higher-precedence layer is overriding yours.
First step: Inspect what the failing step actually received; do not reason from memory.

Loom does not currently emit a dedicated "effective env" artifact. Add a temporary diagnostic line to your workflow to capture the resolved environment:

script:
  - env | grep -E '^CI_|^LOOM_|^YOURPREFIX_' | sort
  - <your real command>

Then open the step's events.jsonl (via the pointer chain in jobs/<job_id>/manifest.json → failing_step_events_path) and look for JSONL lines with:

  • event: "step_output"
  • stream: "stdout"
  • message: containing your printed env lines
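
Filtering for those lines can be sketched as below. The fixture is fabricated; the field names follow the bullets above, but the exact JSON shape is an assumption:

```shell
# Fabricated step events fixture; field names follow the bullets above.
mkdir -p /tmp/loom-env-demo
cat > /tmp/loom-env-demo/events.jsonl <<'EOF'
{"event": "step_output", "stream": "stdout", "message": "CI_BRANCH=main"}
{"event": "step_output", "stream": "stderr", "message": "warning: deprecated flag"}
{"event": "step_output", "stream": "stdout", "message": "LOOM_PROFILE=release"}
EOF
# Keep only stdout step_output lines, where the env dump from the
# temporary diagnostic step lands.
grep '"event": "step_output"' /tmp/loom-env-demo/events.jsonl | grep '"stream": "stdout"'
```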

Compare the resolved values against each source, from highest to lowest precedence:

  1. Step-level explicit environment / overrides
  2. Job-level variables / environment
  3. Workflow-level variables / environment
  4. CLI flags / shell environment passed to loom run
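
The layering can be illustrated at the shell level. This is an analogy for the precedence rule, not Loom's own resolution code: the inner (step-level) assignment wins over the outer (shell-level) one:

```shell
# Outer layer: value from the invoking shell (lower precedence).
export TOKEN=from-shell
# Inner layer: a per-command assignment overrides it for that step only.
TOKEN=from-step sh -c 'echo "step sees: $TOKEN"'   # prints: step sees: from-step
echo "shell still has: $TOKEN"                     # prints: shell still has: from-shell
```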

See also: Variables concept · Variables workflow · loom run


Secrets resolution failures (SECRETS_*)

Symptom: A job fails before or during script execution with a SECRETS_* error code.
Root cause: The secret reference cannot be resolved: provider unavailable, ref malformed, entry missing, or required secret absent.
First step: Read the error code from CLI output or job events, then match it to the codes below.
  • SECRETS_PROVIDER_UNAVAILABLE: provider backend not available (e.g. keepass:// is currently stubbed). Fix: use env:// instead.
  • SECRETS_REF_INVALID: malformed or unsupported ref URI. Fix: correct the ref value; check the scheme and syntax.
  • SECRETS_REF_NOT_FOUND: provider resolved, but the entry/field does not exist. Fix: verify the env var is exported (env://) or the entry path is correct (keepass://).
  • SECRETS_REQUIRED_MISSING: required secret could not be resolved. Fix: export the env var, or set required: false if the secret is optional.
  • SECRETS_UNSAFE_DEBUG_TRACE: CI_DEBUG_TRACE=true with file: false secrets. Fix: disable debug trace, or switch secrets to file: true.
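
Several of these fixes are declarative. A hedged sketch of an env-backed secret entry follows; only the env:// scheme and the required/file fields come from the codes above, while the surrounding structure is an assumption and may not match the real workflow schema:

```yaml
# Hypothetical secrets entry; structure is illustrative only.
secrets:
  DEPLOY_TOKEN:
    ref: env://DEPLOY_TOKEN   # env:// scheme, as in the codes above
    required: false           # avoids SECRETS_REQUIRED_MISSING when the var is absent
    file: true                # avoids SECRETS_UNSAFE_DEBUG_TRACE under CI_DEBUG_TRACE=true
```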

If the error is in runtime logs, follow the pointer chain:

  1. pipeline/manifest.json → failing_job_id
  2. jobs/<job_id>/manifest.json → failing unit events path

Events include diagnostic metadata (provider scheme, secret variable name) but never secret bytes.

Example — required env-based secret where the var is not exported:

SECRETS_REQUIRED_MISSING: secret "DEPLOY_TOKEN" (env://DEPLOY_TOKEN) could not be resolved

Fix: export DEPLOY_TOKEN="your-value" before running Loom.

See also: Secrets error codes · Concepts → Secrets · Workflows → Secrets


Pointer mismatch confusion (can't find "the real error")

Symptom: You have a run id and a logs directory, but you can't tell which job or step failed. Large log files do not reveal the root cause.
Root cause: You are reading output-heavy files instead of following the pointer documents that identify the exact failing unit.
First step: Restart from the canonical pointer sequence.

The pointer chain is designed to be low-noise:

  1. pipeline/summary.json → pipeline status + exit code
  2. pipeline/manifest.json → failing_job_id + failing_job_manifest_path
  3. jobs/<job_id>/summary.json → job status + exit code
  4. jobs/<job_id>/manifest.json → failing_step_events_path (user step) or system_sections[].events_path (provider/system)
  5. The pointed-to events.jsonl → the actual failure evidence

If the failure is in a provider or system section (not a user script), manifest.json will not contain failing_step_events_path. Instead, look at the system_sections array for the relevant events_path.
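
The chain can be walked mechanically. The sketch below uses grep against a fabricated log directory (real runs live under .loom/.runtime/logs/<run_id>/, and a JSON-aware tool like jq would be more robust); the events path value is invented for the fixture:

```shell
# Fabricated minimal log layout mirroring the pointer chain above.
ROOT=/tmp/loom-chain-demo
mkdir -p "$ROOT/pipeline" "$ROOT/jobs/check-pnpm"
cat > "$ROOT/pipeline/manifest.json" <<'EOF'
{"failing_job_id": "check-pnpm", "failing_job_manifest_path": "jobs/check-pnpm/manifest.json"}
EOF
cat > "$ROOT/jobs/check-pnpm/manifest.json" <<'EOF'
{"failing_step_events_path": "jobs/check-pnpm/steps/0/events.jsonl"}
EOF
# Step 2: extract failing_job_id (grep -o keeps only the matching fragment).
job=$(grep -o '"failing_job_id": *"[^"]*"' "$ROOT/pipeline/manifest.json" | cut -d'"' -f4)
echo "failing job: $job"
# Step 4: extract the events path from that job's manifest.
grep -o '"failing_step_events_path": *"[^"]*"' "$ROOT/jobs/$job/manifest.json" | cut -d'"' -f4
```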

Example pointer excerpt (from a failing local run):

{
  "failing_job_id": "check-pnpm",
  "failing_job_manifest_path": "jobs/check-pnpm/manifest.json"
}

See also: Diagnostics ladder · Runtime logs contract


Still stuck?

  1. Follow the full diagnostic flow: Diagnostics ladder
  2. Prepare a minimal report: What to share
  3. Include in your report:
    • The exact command you ran (including flags)
    • The run id or receipt path
    • pipeline/summary.json
    • pipeline/manifest.json
    • The failing unit events.jsonl (user step or system section)