Playbook: diagnose a failing run

This playbook is the longer-form version of the Diagnostics ladder.

If you only remember one thing: follow pointers (summary.json → manifest.json → events.jsonl). Do not start with “big logs”.

Related references (keep open)

Diagnostics ladder (canonical sequence)
Runtime logs contract (file layout + pointer fields)
Receipts contract (what a receipt contains)
What to share (minimal, safe evidence set)

Step-by-step

Unless a full path is given (as for the receipt in step 1), all paths below are relative to the run logs directory:

.loom/.runtime/logs/<run_id>/

1) Locate the receipt (pointer to the run)

Receipt path: .loom/.runtime/receipts/<...>.json
In the receipt, look for:
- logs_dir (the run logs directory)
- phase_report_path (optional run-level phase validation pointer)

The current loom run --local receipt does not carry a separate run_id field. Use the directory name at the end of logs_dir.
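As a minimal sketch of this step, the helper below reads a receipt and derives the run id from the last component of logs_dir. The function name run_id_from_receipt is hypothetical, and the code assumes the receipt is a JSON object with a top-level logs_dir string, as described above.

```python
import json
from pathlib import Path


def run_id_from_receipt(receipt_path):
    """Return (logs_dir, run_id) for a receipt file.

    Assumption: the receipt is a JSON object with a "logs_dir" string.
    The run id is taken from the directory name at the end of logs_dir.
    """
    receipt = json.loads(Path(receipt_path).read_text(encoding="utf-8"))
    logs_dir = Path(receipt["logs_dir"])
    return logs_dir, logs_dir.name
```

Pass the path of a receipt under .loom/.runtime/receipts/ and use the returned logs_dir as the base for every relative path in the following steps.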

2) Confirm pipeline status (one file)

Open:

pipeline/summary.json

This answers:

Did the pipeline fail?
What was the exit_code?
What schema_version is this run using (for example loom.runtime.logs.v2)?
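The three questions above can be answered with a few lines of code. This is a sketch under stated assumptions: pipeline_status is a hypothetical helper, exit_code and schema_version are assumed to be top-level fields of pipeline/summary.json, and a non-zero exit code is treated as failure. The real summary may carry additional status fields that this sketch does not read.

```python
import json
from pathlib import Path


def pipeline_status(logs_dir):
    """Summarize pipeline/summary.json for a run logs directory.

    Assumption: "exit_code" and "schema_version" are top-level fields.
    """
    path = Path(logs_dir) / "pipeline" / "summary.json"
    summary = json.loads(path.read_text(encoding="utf-8"))
    exit_code = summary.get("exit_code")
    return {
        # A missing exit code is reported as "not known to have failed".
        "failed": exit_code is not None and exit_code != 0,
        "exit_code": exit_code,
        "schema_version": summary.get("schema_version"),
    }
```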

3) Find the failing job pointer (one file)

Open:

pipeline/manifest.json

If the pipeline failed, this file should contain:

failing_job_id
failing_job_manifest_path (usually jobs/<job_id>/manifest.json)
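Following this pointer programmatically might look like the sketch below. The helper name failing_job_pointer is hypothetical. The code assumes failing_job_manifest_path is relative to the run logs directory (as in the example later on this page) and falls back to the usual jobs/<job_id>/manifest.json layout when the field is absent.

```python
import json
from pathlib import Path


def failing_job_pointer(logs_dir):
    """Return (failing_job_id, job_manifest_path), or None if no job is flagged."""
    logs_dir = Path(logs_dir)
    manifest_path = logs_dir / "pipeline" / "manifest.json"
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    job_id = manifest.get("failing_job_id")
    if not job_id:
        return None
    # Assumption: fall back to the usual layout when the explicit path is missing.
    rel = manifest.get("failing_job_manifest_path") or f"jobs/{job_id}/manifest.json"
    return job_id, logs_dir / rel
```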

4) Jump into the failing job (two files)

Open:

jobs/<job_id>/summary.json
jobs/<job_id>/manifest.json

The job manifest is your “routing table”. It includes:

failing_section
failing_step_events_path (when the failure is a user step)
system_sections[].events_path (when the failure is in a system/provider section)

5) Read only the failing unit event stream (one file)

Pick one path from the job manifest:

User step failure (most common): open failing_step_events_path, which points to something like:
- jobs/<job_id>/user/execution/script/<NN>/events.jsonl
System/provider failure: open the relevant system_sections[].events_path, which points to something like:
- jobs/<job_id>/system/provider/events.jsonl
- jobs/<job_id>/system/<system_section>/events.jsonl
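The choice between the two routes can be sketched as a small function over the parsed job manifest. The name events_paths_to_read is hypothetical. Because this page does not say how a system section entry identifies itself, the sketch returns every system_sections[].events_path and leaves picking the relevant one to the reader.

```python
def events_paths_to_read(job_manifest):
    """Return candidate events.jsonl paths from a parsed job manifest (a dict).

    A user step failure yields exactly one path. Otherwise every system
    section events_path is returned, in manifest order.
    """
    step_path = job_manifest.get("failing_step_events_path")
    if step_path:
        return [step_path]
    sections = job_manifest.get("system_sections") or []
    # Skip entries that are not objects or that lack an events_path.
    return [s["events_path"] for s in sections
            if isinstance(s, dict) and s.get("events_path")]
```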

6) Extract the smallest excerpt that explains the failure (10–20 lines)

From the events.jsonl you opened, copy only:

the failing command line (if present)
the error message
the exit code / failure marker
1–2 lines above/below for context

Then use What to share to package it safely.
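One way to cut such an excerpt is sketched below. It is an illustration, not part of the loom tooling: failure_excerpt is a hypothetical helper, and it assumes the failure marker is the first phase_finish event with a non-zero exit_code, which matches the example excerpt on this page. If no marker is found, it falls back to the last lines of the file.

```python
import json


def failure_excerpt(lines, context=10):
    """Return the failure marker line plus up to `context` lines before it."""
    lines = [ln.rstrip("\n") for ln in lines if ln.strip()]
    marker = len(lines) - 1  # fallback: treat the last line as the marker
    for i, ln in enumerate(lines):
        try:
            event = json.loads(ln)
        except json.JSONDecodeError:
            continue  # tolerate malformed lines
        if (isinstance(event, dict)
                and event.get("event") == "phase_finish"
                and event.get("exit_code", 0) != 0):
            marker = i
            break
    return lines[max(0, marker - context):marker + 1]
```

Call it with an open file, for example failure_excerpt(open(path, encoding="utf-8")), and review the result against the redaction checklist before sharing.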

Example (real run excerpt: pointers + failing unit events)

This excerpt is from a real local run:

run_id: loom-run-local-1772917135143116000
failing job: build
failing unit: jobs/build/user/execution/script/02/events.jsonl

Pointer 1: pipeline manifest → failing job id

From pipeline/manifest.json (trimmed):

{
  "status": "failure",
  "failing_job_id": "build",
  "failing_job_manifest_path": "jobs/build/manifest.json"
}

Pointer 2: job manifest → failing unit events path

From jobs/build/manifest.json (trimmed):

{
  "failing_section": "script",
  "failing_step_index": 2,
  "failing_step_events_path": "jobs/build/user/execution/script/02/events.jsonl"
}

Failing unit: excerpt from `events.jsonl`

From jobs/build/user/execution/script/02/events.jsonl (trimmed to the failure):

{"event":"output","job_id":"build","job_name":"build","level":"info","message":"\u003e nx run loom-docs:build","phase_code":"execution.script","phase_family":"user","pipeline_id":"loom-local-1772917135143116000","run_id":"loom-run-local-1772917135143116000","schema_version":"loom.runtime.logs.v2","scope":"step","section":"execution","section_family":"user","seq":37,"step_id":"script-02","step_index":2,"stream":"stdout"}
{"event":"output","job_id":"build","job_name":"build","level":"info","message":"[INFO] [en] Creating an optimized production build...","phase_code":"execution.script","phase_family":"user","pipeline_id":"loom-local-1772917135143116000","run_id":"loom-run-local-1772917135143116000","schema_version":"loom.runtime.logs.v2","scope":"step","section":"execution","section_family":"user","seq":38,"step_id":"script-02","step_index":2,"stream":"stdout"}
{"event":"output","job_id":"build","job_name":"build","level":"info","message":"NX Running target build for 3 projects and 12 tasks they depend on failed","phase_code":"execution.script","phase_family":"user","pipeline_id":"loom-local-1772917135143116000","run_id":"loom-run-local-1772917135143116000","schema_version":"loom.runtime.logs.v2","scope":"step","section":"execution","section_family":"user","seq":43,"step_id":"script-02","step_index":2,"stream":"stdout"}
{"event":"output","job_id":"build","job_name":"build","level":"info","message":"Failed tasks: loom-docs:build","phase_code":"execution.script","phase_family":"user","pipeline_id":"loom-local-1772917135143116000","run_id":"loom-run-local-1772917135143116000","schema_version":"loom.runtime.logs.v2","scope":"step","section":"execution","section_family":"user","seq":45,"step_id":"script-02","step_index":2,"stream":"stdout"}
{"event":"output","job_id":"build","job_name":"build","level":"info","message":"ELIFECYCLE Command failed with exit code 1.","phase_code":"execution.script","phase_family":"user","pipeline_id":"loom-local-1772917135143116000","run_id":"loom-run-local-1772917135143116000","schema_version":"loom.runtime.logs.v2","scope":"step","section":"execution","section_family":"user","seq":46,"step_id":"script-02","step_index":2,"stream":"stdout"}
{"duration_ms":40794,"event":"phase_finish","exit_code":1,"job_id":"build","job_name":"build","level":"error","phase_code":"execution.script","phase_family":"user","pipeline_id":"loom-local-1772917135143116000","run_id":"loom-run-local-1772917135143116000","schema_version":"loom.runtime.logs.v2","scope":"step","section":"execution","section_family":"user","seq":47,"status":"failed","step_id":"script-02","step_index":2}

Notes:

The current event contract uses phase_start, output, and phase_finish.
JSONL files are one JSON object per line; you generally only need a small excerpt.

Escalation

If the failing unit's events are insufficient, widen to job-level events:
- jobs/<job_id>/events.jsonl (if present)
If you suspect cross-job or runner-level issues, widen to pipeline-level events:
- pipeline/events.jsonl (if present)
Only if structured events are still insufficient, fall back to legacy stdout/stderr logs. [NEEDS SOURCE] (document exact legacy filenames and when they appear)
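The widening order for structured events can be expressed as a short sketch. The helper name escalation_candidates is hypothetical. It only lists the event files named on this page, narrowest first, and keeps those that exist, since the job-level and pipeline-level files are marked "if present". Legacy logs are left out because their filenames are not documented here.

```python
from pathlib import Path


def escalation_candidates(logs_dir, job_id, unit_events_path):
    """Return existing event files, ordered from narrowest to widest scope."""
    logs_dir = Path(logs_dir)
    ordered = [
        logs_dir / unit_events_path,                  # failing unit
        logs_dir / "jobs" / job_id / "events.jsonl",  # job level
        logs_dir / "pipeline" / "events.jsonl",       # pipeline level
    ]
    return [p for p in ordered if p.is_file()]
```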

Privacy & redaction checklist (before you paste)

Before sharing any excerpt, scan and redact:

secrets (*_TOKEN, *_KEY, *_SECRET, passwords, private keys)
credentials embedded in URLs
internal hostnames/IPs (if sensitive)
usernames/home directories (optional)
environment variable values that may include secrets

If you redact, say what you redacted (“redacted an S3 bucket name”) so helpers can reason about the missing piece.
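A first automated pass over an excerpt could look like the sketch below. It is deliberately narrow and is not a substitute for reading the excerpt yourself: the hypothetical redact helper only covers NAME=value or NAME: value assignments whose name ends in _TOKEN, _KEY, _SECRET, or PASSWORD, plus user:password credentials embedded in URLs. The patterns are assumptions chosen for illustration, and they do not detect private keys, hostnames, or home directories. The function also returns notes you can paste alongside the excerpt to say what was redacted.

```python
import re

# NAME=value or NAME: value, where NAME ends in a secret-looking suffix.
_SECRET_ASSIGNMENT = re.compile(
    r"\b([A-Za-z0-9_]*(?:_TOKEN|_KEY|_SECRET|PASSWORD))(\s*[=:]\s*)([^\s\"']+)",
    re.IGNORECASE,
)
# scheme://user:password@host
_URL_CREDENTIALS = re.compile(
    r"(\b[a-z][a-z0-9+.-]*://)[^/\s:@]+:[^/\s@]+@", re.IGNORECASE
)


def redact(text):
    """Return (redacted_text, notes) where notes describe what was removed."""
    notes = []

    def _assignment(match):
        notes.append(f"value of {match.group(1)}")
        return f"{match.group(1)}{match.group(2)}[REDACTED]"

    text = _SECRET_ASSIGNMENT.sub(_assignment, text)
    text, count = _URL_CREDENTIALS.subn(r"\1[REDACTED]@", text)
    if count:
        notes.append(f"credentials in {count} URL(s)")
    return text, notes
```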
