Playbook: diagnose a failing run
This playbook is the longer-form version of the Diagnostics ladder.
If you only remember one thing: follow pointers (summary.json → manifest.json → events.jsonl). Do not start with “big logs”.
Related references (keep open)
- Diagnostics ladder (canonical sequence)
- Runtime logs contract (file layout + pointer fields)
- Receipts contract (what a receipt contains)
- What to share (minimal, safe evidence set)
Step-by-step
All paths below are relative to the run logs directory:
.loom/.runtime/logs/<run_id>/
1) Locate the receipt (pointer to the run)
- Receipt path:
  .loom/.runtime/receipts/<...>.json
- In the receipt, look for:
  - logs_dir (the run logs directory)
  - phase_report_path (optional run-level phase validation pointer)
The current loom run --local receipt does not carry a separate run_id field. Use the directory name at the end of logs_dir.
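As a minimal sketch of that rule, the run id can be derived from the receipt's logs_dir value. The helper names below are hypothetical and are not part of loom; only the logs_dir field comes from the receipt contract described above.

```python
import json
from pathlib import Path, PurePosixPath


def load_receipt(path):
    """Read a receipt JSON file into a dict."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def run_id_from_receipt(receipt):
    """Return the last path component of logs_dir, used here as the run id."""
    logs_dir = receipt["logs_dir"]
    # Strip a trailing slash so the final component is never empty.
    return PurePosixPath(logs_dir.rstrip("/")).name
```

For example, a receipt whose logs_dir is .loom/.runtime/logs/loom-run-local-1772917135143116000 yields loom-run-local-1772917135143116000.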
2) Confirm pipeline status (one file)
Open:
pipeline/summary.json
This answers:
- Did the pipeline fail?
- What was the exit_code?
- What schema_version is this run using (for example loom.runtime.logs.v2)?
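The three questions above can be answered with a few lines of Python. This is a sketch, not a loom tool: the function names are hypothetical, and treating a nonzero exit_code as "failed" is an assumption, since this playbook does not spell out every field in summary.json.

```python
import json
from pathlib import Path


def load_pipeline_summary(logs_dir):
    """Read pipeline/summary.json from a run logs directory."""
    path = Path(logs_dir) / "pipeline" / "summary.json"
    return json.loads(path.read_text(encoding="utf-8"))


def summarize_pipeline(summary):
    """Pull out the fields this step cares about."""
    exit_code = summary.get("exit_code")
    return {
        "exit_code": exit_code,
        "schema_version": summary.get("schema_version"),
        # Assumption: a nonzero exit_code means the pipeline failed.
        "failed": exit_code is not None and exit_code != 0,
    }
```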
3) Find the failing job pointer (one file)
Open:
pipeline/manifest.json
If the pipeline failed, this file should contain:
- failing_job_id
- failing_job_manifest_path (usually jobs/<job_id>/manifest.json)
4) Jump into the failing job (two files)
Open:
- jobs/<job_id>/summary.json
- jobs/<job_id>/manifest.json
The job manifest is your “routing table”. It includes:
- failing_section
- failing_step_events_path (when the failure is a user step)
- system_sections[].events_path (when the failure is in a system/provider section)
5) Read only the failing unit event stream (one file)
Pick one path from the job manifest:
- User step failure (most common): open failing_step_events_path, which points to something like:
  jobs/<job_id>/user/execution/script/<NN>/events.jsonl
- System/provider failure: open the relevant system_sections[].events_path, which points to something like:
  jobs/<job_id>/system/provider/events.jsonl
  jobs/<job_id>/system/<system_section>/events.jsonl
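Steps 3 to 5 amount to following two pointers. The sketch below shows one way to do that in Python. The function names are hypothetical, and it assumes that pointer paths are relative to the run logs directory and that system_sections is a list of objects carrying an events_path key. It cannot tell which system section is the relevant one, so it returns every candidate.

```python
import json
from pathlib import Path


def load_json(path):
    """Read a JSON file into a Python object."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def failing_events_paths(logs_dir):
    """Follow pipeline manifest -> job manifest -> events.jsonl path(s)."""
    logs = Path(logs_dir)
    pipeline = load_json(logs / "pipeline" / "manifest.json")
    job_manifest_rel = pipeline.get("failing_job_manifest_path")
    if not job_manifest_rel:
        return []  # no failing-job pointer recorded
    job = load_json(logs / job_manifest_rel)
    step_path = job.get("failing_step_events_path")
    if step_path:
        return [logs / step_path]  # user step failure
    # Otherwise fall back to system/provider sections (assumed shape).
    return [
        logs / section["events_path"]
        for section in job.get("system_sections", [])
        if section.get("events_path")
    ]
```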
6) Extract the smallest excerpt that explains the failure (10–20 lines)
From the events.jsonl you opened, copy only:
- the failing command line (if present)
- the error message
- the exit code / failure marker
- 1–2 lines above/below for context
Then use What to share to package it safely.
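The extraction in step 6 can also be scripted. The following is a rough sketch with a hypothetical function name: it keeps the events leading up to the first phase_finish event that has a nonzero exit_code, and it assumes the event fields (event, message, stream, seq, status, exit_code) look like those in the example run in this playbook.

```python
import json


def failure_excerpt(jsonl_lines, context=10):
    """Return short text lines covering the failure marker and the events before it."""
    events = [json.loads(line) for line in jsonl_lines if line.strip()]
    # Default to the last event if no failing phase_finish is found.
    end = len(events) - 1
    for i, ev in enumerate(events):
        if ev.get("event") == "phase_finish" and ev.get("exit_code") not in (0, None):
            end = i
            break
    start = max(0, end - context)
    excerpt = []
    for ev in events[start:end + 1]:
        if ev.get("event") == "output":
            excerpt.append(f'{ev.get("seq")} {ev.get("stream")}: {ev.get("message")}')
        else:
            excerpt.append(
                f'{ev.get("seq")} {ev.get("event")} '
                f'status={ev.get("status")} exit_code={ev.get("exit_code")}'
            )
    return excerpt
```

Pass it an open file handle or a list of lines. The result still needs a manual read and a redaction pass before it is shared.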
Example (real run excerpt: pointers + failing unit events)
This excerpt is from a real local run:
- run_id: loom-run-local-1772917135143116000
- failing job: build
- failing unit: jobs/build/user/execution/script/02/events.jsonl
Pointer 1: pipeline manifest → failing job id
From pipeline/manifest.json (trimmed):
{
  "status": "failure",
  "failing_job_id": "build",
  "failing_job_manifest_path": "jobs/build/manifest.json"
}
Pointer 2: job manifest → failing unit events path
From jobs/build/manifest.json (trimmed):
{
  "failing_section": "script",
  "failing_step_index": 2,
  "failing_step_events_path": "jobs/build/user/execution/script/02/events.jsonl"
}
Failing unit: excerpt from events.jsonl
From jobs/build/user/execution/script/02/events.jsonl (trimmed to the failure):
{"event":"output","job_id":"build","job_name":"build","level":"info","message":"\u003e nx run loom-docs:build","phase_code":"execution.script","phase_family":"user","pipeline_id":"loom-local-1772917135143116000","run_id":"loom-run-local-1772917135143116000","schema_version":"loom.runtime.logs.v2","scope":"step","section":"execution","section_family":"user","seq":37,"step_id":"script-02","step_index":2,"stream":"stdout"}
{"event":"output","job_id":"build","job_name":"build","level":"info","message":"[INFO] [en] Creating an optimized production build...","phase_code":"execution.script","phase_family":"user","pipeline_id":"loom-local-1772917135143116000","run_id":"loom-run-local-1772917135143116000","schema_version":"loom.runtime.logs.v2","scope":"step","section":"execution","section_family":"user","seq":38,"step_id":"script-02","step_index":2,"stream":"stdout"}
{"event":"output","job_id":"build","job_name":"build","level":"info","message":"NX Running target build for 3 projects and 12 tasks they depend on failed","phase_code":"execution.script","phase_family":"user","pipeline_id":"loom-local-1772917135143116000","run_id":"loom-run-local-1772917135143116000","schema_version":"loom.runtime.logs.v2","scope":"step","section":"execution","section_family":"user","seq":43,"step_id":"script-02","step_index":2,"stream":"stdout"}
{"event":"output","job_id":"build","job_name":"build","level":"info","message":"Failed tasks: loom-docs:build","phase_code":"execution.script","phase_family":"user","pipeline_id":"loom-local-1772917135143116000","run_id":"loom-run-local-1772917135143116000","schema_version":"loom.runtime.logs.v2","scope":"step","section":"execution","section_family":"user","seq":45,"step_id":"script-02","step_index":2,"stream":"stdout"}
{"event":"output","job_id":"build","job_name":"build","level":"info","message":"ELIFECYCLE Command failed with exit code 1.","phase_code":"execution.script","phase_family":"user","pipeline_id":"loom-local-1772917135143116000","run_id":"loom-run-local-1772917135143116000","schema_version":"loom.runtime.logs.v2","scope":"step","section":"execution","section_family":"user","seq":46,"step_id":"script-02","step_index":2,"stream":"stdout"}
{"duration_ms":40794,"event":"phase_finish","exit_code":1,"job_id":"build","job_name":"build","level":"error","phase_code":"execution.script","phase_family":"user","pipeline_id":"loom-local-1772917135143116000","run_id":"loom-run-local-1772917135143116000","schema_version":"loom.runtime.logs.v2","scope":"step","section":"execution","section_family":"user","seq":47,"status":"failed","step_id":"script-02","step_index":2}
Notes:
- The current event contract uses phase_start, output, and phase_finish.
- JSONL files are one JSON object per line; you generally only need a small excerpt.
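Because each line is an independent JSON object, an events file can be read lazily, one line at a time, without loading it whole. The sketch below (hypothetical helper names) tallies event kinds, which is a quick way to check that a file contains the phase_start, output, and phase_finish events you expect.

```python
import json
from collections import Counter


def iter_events(lines):
    """Yield one parsed event per non-blank JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_event_kinds(lines):
    """Count events by their "event" field."""
    return Counter(ev.get("event") for ev in iter_events(lines))
```

Usage: open the events.jsonl file and pass the file handle directly, for example `with open(path, encoding="utf-8") as fh: print(count_event_kinds(fh))`.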
Escalation
- If failing unit events are insufficient: widen to job-level events:
  jobs/<job_id>/events.jsonl (if present)
- If you suspect cross-job or runner-level issues: widen to pipeline-level events:
  pipeline/events.jsonl (if present)
- If structured events are still insufficient: only then fall back to legacy stdout/stderr logs. [NEEDS SOURCE] (document exact legacy filenames and when they appear)
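Since the wider event files are only present in some runs, it can help to check which ones exist before widening. A minimal sketch, with a hypothetical function name, covering only the two structured paths named above (legacy logs are left out because their filenames are not documented here):

```python
from pathlib import Path


def escalation_candidates(logs_dir, job_id):
    """Return the wider events files that exist, narrowest first."""
    logs = Path(logs_dir)
    candidates = [
        logs / "jobs" / job_id / "events.jsonl",  # job-level events
        logs / "pipeline" / "events.jsonl",       # pipeline-level events
    ]
    return [path for path in candidates if path.is_file()]
```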
Privacy & redaction checklist (before you paste)
Before sharing any excerpt, scan and redact:
- secrets (*_TOKEN, *_KEY, *_SECRET, passwords, private keys)
- credentials embedded in URLs
- internal hostnames/IPs (if sensitive)
- usernames/home directories (optional)
- environment variable values that may include secrets
If you redact, say what you redacted (“redacted an S3 bucket name”) so helpers can reason about the missing piece.
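A first automated pass can catch the most mechanical items on this checklist. The sketch below is illustrative only and uses hypothetical names: it masks values assigned to variables ending in _TOKEN, _KEY, or _SECRET, and user:password pairs embedded in URLs. It does not detect private keys, hostnames, or secrets in other shapes, so a manual scan is still required.

```python
import re

# NAME_TOKEN=value, NAME_KEY=value, NAME_SECRET=value
SECRET_ASSIGNMENT = re.compile(r"\b(\w*_(?:TOKEN|KEY|SECRET))=(\S+)")
# scheme://user:password@host
URL_CREDENTIALS = re.compile(r"(\b[a-zA-Z][a-zA-Z0-9+.-]*://)[^/\s:@]+:[^/\s@]+@")


def redact(text):
    """Mask URL credentials and secret-looking variable assignments."""
    text = URL_CREDENTIALS.sub(r"\1[REDACTED]@", text)
    text = SECRET_ASSIGNMENT.sub(r"\1=[REDACTED]", text)
    return text
```

For example, `redact("API_TOKEN=abc123 ok")` returns `"API_TOKEN=[REDACTED] ok"`.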