Playbook: diagnose a failing run

This playbook is the longer-form version of the Diagnostics ladder.

If you only remember one thing: follow pointers (summary.json → manifest.json → events.jsonl). Do not start with “big logs”.

Step-by-step

All paths below are relative to the run logs directory:

.loom/.runtime/logs/<run_id>/

1) Locate the receipt (pointer to the run)

  • Receipt path: .loom/.runtime/receipts/<...>.json
  • In the receipt, look for:
    • logs_dir (the run logs directory)
    • phase_report_path (optional run-level phase validation pointer)

The current loom run --local receipt does not carry a separate run_id field. Use the directory name at the end of logs_dir.
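As a minimal sketch of that last point, the run ID can be derived from the receipt by taking the final path component of logs_dir. The function names below are hypothetical helpers, not part of loom, and the sketch assumes logs_dir uses forward slashes.

```python
import json
from pathlib import Path, PurePosixPath


def load_receipt(path: str) -> dict:
    """Read a receipt JSON file from disk (hypothetical helper)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def run_id_from_receipt(receipt: dict) -> str:
    """Derive the run ID from logs_dir, since the receipt has no run_id field."""
    # PurePosixPath ignores a trailing slash, so ".../<run_id>/" also works.
    return PurePosixPath(receipt["logs_dir"]).name
```

For a receipt whose logs_dir ends in loom-run-local-1772917135143116000, run_id_from_receipt returns that directory name.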

2) Confirm pipeline status (one file)

Open:

  • pipeline/summary.json

This answers:

  • Did the pipeline fail?
  • What was the exit_code?
  • What schema_version is this run using (for example loom.runtime.logs.v2)?
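The three questions above can be answered with a few lines of Python. This is a sketch only: exit_code and schema_version are named in this playbook, while the presence of a status key in pipeline/summary.json is an assumption, so missing keys are returned as None rather than raising.

```python
import json
from pathlib import Path


def pipeline_status(logs_dir: str) -> dict:
    """Return only the fields needed to confirm pipeline status."""
    summary_path = Path(logs_dir) / "pipeline" / "summary.json"
    summary = json.loads(summary_path.read_text(encoding="utf-8"))
    # "status" is an assumed key; .get() keeps the sketch tolerant if it is absent.
    return {key: summary.get(key) for key in ("status", "exit_code", "schema_version")}
```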

3) Find the failing job pointer (one file)

Open:

  • pipeline/manifest.json

If the pipeline failed, this file should contain:

  • failing_job_id
  • failing_job_manifest_path (usually jobs/<job_id>/manifest.json)
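A small sketch of reading that pointer from an already-parsed pipeline manifest. The helper name is hypothetical, and the fallback to jobs/<job_id>/manifest.json simply mirrors the "usually" note above; it is an assumption, not a documented guarantee.

```python
from typing import Optional, Tuple


def failing_job_pointer(pipeline_manifest: dict) -> Optional[Tuple[str, str]]:
    """Return (job_id, job_manifest_path), or None if no failing job is recorded."""
    job_id = pipeline_manifest.get("failing_job_id")
    if not job_id:
        return None
    # Assumed fallback: the conventional location when the explicit path is missing.
    manifest_path = (
        pipeline_manifest.get("failing_job_manifest_path")
        or f"jobs/{job_id}/manifest.json"
    )
    return job_id, manifest_path
```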

4) Jump into the failing job (two files)

Open:

  • jobs/<job_id>/summary.json
  • jobs/<job_id>/manifest.json

The job manifest is your “routing table”. It includes:

  • failing_section
  • failing_step_events_path (when the failure is a user step)
  • system_sections[].events_path (when the failure is in a system/provider section)

5) Read only the failing unit event stream (one file)

Pick one path from the job manifest:

  • User step failure (most common): open failing_step_events_path, which points to something like:
    • jobs/<job_id>/user/execution/script/<NN>/events.jsonl
  • System/provider failure: open the relevant system_sections[].events_path, which points to something like:
    • jobs/<job_id>/system/provider/events.jsonl
    • jobs/<job_id>/system/<system_section>/events.jsonl
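The choice between the two cases can be sketched as follows. The helper name is hypothetical. Because this playbook does not say how to match failing_section to a single system_sections entry, the sketch returns every system events_path as a candidate and leaves picking the relevant one to the reader.

```python
from typing import List


def events_paths_to_read(job_manifest: dict) -> List[str]:
    """Return the events.jsonl path(s) worth opening for a failing job."""
    step_path = job_manifest.get("failing_step_events_path")
    if step_path:
        # User step failure: exactly one file to read.
        return [step_path]
    # System/provider failure: list candidates from system_sections[].events_path.
    return [
        section["events_path"]
        for section in job_manifest.get("system_sections", [])
        if section.get("events_path")
    ]
```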

6) Extract the smallest excerpt that explains the failure (10–20 lines)

From the events.jsonl you opened, copy only:

  • the failing command line (if present)
  • the error message
  • the exit code / failure marker
  • 1–2 lines above/below for context

Then use the “What to share” guide to package it safely.
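One way to pull such an excerpt automatically is sketched below. It assumes the failure marker is a phase_finish event with a non-zero exit_code, as in the example excerpt later on this page; other failure shapes would need their own check. The function names and the output format are illustrative only.

```python
import json
from typing import Iterable, List


def _format(event: dict) -> str:
    """Render one event as a short, shareable line."""
    if event.get("event") == "output":
        detail = event.get("message", "")
    else:
        detail = f"status={event.get('status')} exit_code={event.get('exit_code')}"
    return f"seq={event.get('seq')} {event.get('event')}: {detail}"


def failure_excerpt(jsonl_lines: Iterable[str], before: int = 5) -> List[str]:
    """Return the failure marker plus up to `before` events preceding it."""
    events = [json.loads(line) for line in jsonl_lines if line.strip()]
    for index, event in enumerate(events):
        is_failure = (
            event.get("event") == "phase_finish"
            and event.get("exit_code") not in (None, 0)
        )
        if is_failure:
            window = events[max(0, index - before): index + 1]
            return [_format(e) for e in window]
    return []  # no failure marker found in this stream
```

Calling failure_excerpt(open(path, encoding="utf-8")) on the file you opened yields a handful of lines that still need the redaction pass before sharing.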

Example (real run excerpt: pointers + failing unit events)

This excerpt is from a real local run:

  • run_id: loom-run-local-1772917135143116000
  • failing job: build
  • failing unit: jobs/build/user/execution/script/02/events.jsonl

Pointer 1: pipeline manifest → failing job id

From pipeline/manifest.json (trimmed):

{
  "status": "failure",
  "failing_job_id": "build",
  "failing_job_manifest_path": "jobs/build/manifest.json"
}

Pointer 2: job manifest → failing unit events path

From jobs/build/manifest.json (trimmed):

{
  "failing_section": "script",
  "failing_step_index": 2,
  "failing_step_events_path": "jobs/build/user/execution/script/02/events.jsonl"
}

Failing unit: excerpt from events.jsonl

From jobs/build/user/execution/script/02/events.jsonl (trimmed to the failure):

{"event":"output","job_id":"build","job_name":"build","level":"info","message":"\u003e nx run loom-docs:build","phase_code":"execution.script","phase_family":"user","pipeline_id":"loom-local-1772917135143116000","run_id":"loom-run-local-1772917135143116000","schema_version":"loom.runtime.logs.v2","scope":"step","section":"execution","section_family":"user","seq":37,"step_id":"script-02","step_index":2,"stream":"stdout"}
{"event":"output","job_id":"build","job_name":"build","level":"info","message":"[INFO] [en] Creating an optimized production build...","phase_code":"execution.script","phase_family":"user","pipeline_id":"loom-local-1772917135143116000","run_id":"loom-run-local-1772917135143116000","schema_version":"loom.runtime.logs.v2","scope":"step","section":"execution","section_family":"user","seq":38,"step_id":"script-02","step_index":2,"stream":"stdout"}
{"event":"output","job_id":"build","job_name":"build","level":"info","message":"NX Running target build for 3 projects and 12 tasks they depend on failed","phase_code":"execution.script","phase_family":"user","pipeline_id":"loom-local-1772917135143116000","run_id":"loom-run-local-1772917135143116000","schema_version":"loom.runtime.logs.v2","scope":"step","section":"execution","section_family":"user","seq":43,"step_id":"script-02","step_index":2,"stream":"stdout"}
{"event":"output","job_id":"build","job_name":"build","level":"info","message":"Failed tasks: loom-docs:build","phase_code":"execution.script","phase_family":"user","pipeline_id":"loom-local-1772917135143116000","run_id":"loom-run-local-1772917135143116000","schema_version":"loom.runtime.logs.v2","scope":"step","section":"execution","section_family":"user","seq":45,"step_id":"script-02","step_index":2,"stream":"stdout"}
{"event":"output","job_id":"build","job_name":"build","level":"info","message":"ELIFECYCLE Command failed with exit code 1.","phase_code":"execution.script","phase_family":"user","pipeline_id":"loom-local-1772917135143116000","run_id":"loom-run-local-1772917135143116000","schema_version":"loom.runtime.logs.v2","scope":"step","section":"execution","section_family":"user","seq":46,"step_id":"script-02","step_index":2,"stream":"stdout"}
{"duration_ms":40794,"event":"phase_finish","exit_code":1,"job_id":"build","job_name":"build","level":"error","phase_code":"execution.script","phase_family":"user","pipeline_id":"loom-local-1772917135143116000","run_id":"loom-run-local-1772917135143116000","schema_version":"loom.runtime.logs.v2","scope":"step","section":"execution","section_family":"user","seq":47,"status":"failed","step_id":"script-02","step_index":2}

Notes:

  • The current event contract uses phase_start, output, and phase_finish.
  • JSONL files are one JSON object per line; you generally only need a small excerpt.

Escalation

  • If the failing unit's events are insufficient, widen to job-level events:
    • jobs/<job_id>/events.jsonl (if present)
  • If you suspect cross-job or runner-level issues, widen to pipeline-level events:
    • pipeline/events.jsonl (if present)
  • If structured events are still insufficient: only then fall back to legacy stdout/stderr logs. [NEEDS SOURCE] (document exact legacy filenames and when they appear)
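Since both wider event streams are marked "if present", a quick existence check can save a wasted lookup. The sketch below uses a hypothetical helper name and covers only the two structured files named above; legacy log filenames are left out because they are not documented here.

```python
from pathlib import Path
from typing import List


def escalation_candidates(logs_dir: str, job_id: str) -> List[str]:
    """Return the wider event streams that exist, narrowest first."""
    ordered = [
        f"jobs/{job_id}/events.jsonl",  # job-level events
        "pipeline/events.jsonl",        # pipeline-level events
    ]
    root = Path(logs_dir)
    return [relative for relative in ordered if (root / relative).is_file()]
```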

Privacy & redaction checklist (before you paste)

Before sharing any excerpt, scan and redact:

  • secrets (*_TOKEN, *_KEY, *_SECRET, passwords, private keys)
  • credentials embedded in URLs
  • internal hostnames/IPs (if sensitive)
  • usernames/home directories (optional)
  • environment variable values that may include secrets

If you redact, say what you redacted (“redacted an S3 bucket name”) so helpers can reason about the missing piece.
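A first-pass redaction helper might look like the sketch below. It covers only two items from the checklist (NAME=value assignments whose name ends in _TOKEN, _KEY, _SECRET, or PASSWORD, and user:password credentials in URLs); the patterns and the <redacted> placeholder are assumptions, and the result still needs a manual scan for hostnames, private keys, and anything else the patterns miss.

```python
import re

# NAME=value where NAME ends in a secret-looking suffix; the value stops at
# whitespace or a quote so the rest of a JSON line is left intact.
SECRET_ASSIGNMENT = re.compile(
    r"\b(\w*(?:_TOKEN|_KEY|_SECRET|PASSWORD))=[^\s\"']+",
    re.IGNORECASE,
)

# scheme://user:password@host
URL_CREDENTIALS = re.compile(r"(\w+://)[^/\s:@]+:[^/\s@]+@")


def redact(text: str) -> str:
    """Replace obvious secrets with a placeholder before sharing an excerpt."""
    text = URL_CREDENTIALS.sub(r"\1<redacted>@", text)
    return SECRET_ASSIGNMENT.sub(r"\1=<redacted>", text)
```

For example, redact("export API_TOKEN=abc123 done") keeps the variable name but replaces its value, so helpers can still see which variable was involved.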