Add CI design docs
Inline v0 markers separate the current implementation reality
(in-process eval, hook-as-transport, single-context eval) from
the locked-in target so future readers can tell firm decisions
from temporary deviations.
Assisted-by: Claude Opus 4.7 (1M context) via Claude Code
diff --git a/docs/CI-FENNEL.md b/docs/CI-FENNEL.md
new file mode 100644
index 0000000..a8554c0
--- /dev/null
+++ b/docs/CI-FENNEL.md
@@ -0,0 +1,287 @@
+# quire — `.quire/ci.fnl` design
+
+The job spec language. Sibling to CI.md (which covers the runtime). This doc is about what `.quire/ci.fnl` *looks like* and the model it expresses.
+
+## Framing
+
+A CI config is a **dataflow graph of jobs**. Each job is a function from inputs to outputs. Edges are input references: job B taking job A as an input creates an A → B edge.
+
+Inputs come in two flavors, used uniformly:
+
+- **Job references.** `[:build]` — depend on another job's outputs.
+- **Source references.** `[:quire/push]` — depend on an external event. The runner provides the outputs. Builtins live under the `quire/` namespace; user job ids cannot contain `/`.
+
+There's no structural distinction between "trigger jobs" and "regular jobs." Sources are just things you list as inputs, in the same place as job references.
+
+The mental model: **jobs are functions from inputs to outputs; sources are reserved input names whose outputs the runner provides; runs are slices of the graph that fire when a source's event arrives.**
+
+This is closer to Concourse's resources-and-jobs model than to GitHub Actions' triggers-and-jobs model. More elegant, less familiar. Worth being deliberate about.
+
+## The `job` primitive
+
+```fennel
+(job id inputs run)
+```
+
+Three positional arguments:
+
+- **`id`** — keyword. The job's identity. Cannot contain `/`.
+- **`inputs`** — list of names. Each is a job id or a source ref. Must be non-empty. v1: strings/keywords only (see "Future: input args" below).
+- **`run`** — function from inputs to outputs (a table) or `nil` (skipped).
+
+That's the entire surface. Image selection, conditional firing, output extraction — all done inside `run` using runtime primitives. Fennel-as-code means there's no need for config-language conveniences when a function will do.
+
+If a fourth concept ever genuinely needs to be expressible at the job level (per-job timeout, retry policy, secret scoping), that's the moment to introduce a map-form variant — `(job id {:inputs ... :run ... :timeout ...})`. Migration would be mechanical. Until then, the positional form is shorter and reads better for the actual surface.
+
+## Inputs
+
+```fennel
+[:quire/push :compute-version]
+```
+
+A list of names. Each name is either a job id (defined elsewhere via `(job :foo ...)`) or a source ref (a reserved name in the `quire/` namespace — for v1, just `:quire/push`). The runner gathers each named input's outputs and passes them in as a table on the function's argument.
+
+The dependency graph is *derived* from the inputs list. No separate `:needs` field. Topological sort and cycle detection run over the input graph. Sources are leaf nodes (nothing flows into them).
+
+**Dependency without data.** If you depend on a job for ordering but don't read its outputs, list it anyway: `[:setup :quire/push]`. The runner doesn't enforce that you read what you list. (If a convention helps readability, prefix unused inputs with underscore: `[:_setup :quire/push]`.)
+
+### Accessing inputs
+
+The function receives an outer table with an `:inputs` key. Standard pattern is to destructure:
+
+```fennel
+(fn [{: inputs}]
+ (.. "checkout " inputs.build.sha))
+```
+
+For source inputs whose names contain `/`, the dot-access syntax is awkward. **Destructure at the function arg** — both cleaner and less error-prone:
+
+```fennel
+(fn [{:inputs {:quire/push push}}]
+ (.. "checkout " push.sha))
+```
+
+The `push` local rebinding is the recommended idiom for any source input. Use the same pattern when destructuring multiple inputs:
+
+```fennel
+(fn [{:inputs {:quire/push push : build : compute-version}}]
+ (.. "deploying " compute-version.version " from " push.sha))
+```
+
+### Sources
+
+For v1, the only source is `:quire/push`. Outputs:
+
+```
+{:sha "abc123..."
+ :ref "refs/heads/main"
+ :branch "main" ; nil if the ref is a tag
+ :tag nil ; set if the ref is a tag
+ :previous-sha "def456..." ; the ref's previous head; nil for new refs
+ :files-changed [...] ; paths changed between previous-sha and sha
+ :pusher "alice"
+ :commit-message "..."}
+```
+
+Every push to any ref fires a run that includes every job whose transitive inputs include `:quire/push`. Filtering "which pushes do I care about" happens inside `run` — return `nil` to skip:
+
+```fennel
+(job :test-main [:quire/push]
+ (fn [{:inputs {:quire/push push}}]
+ (when (= "main" push.branch)
+ (container {:image "rust:1.75"
+ :cmd (.. "git checkout " push.sha " && cargo test")}))))
+
+(job :release [:quire/push]
+ (fn [{:inputs {:quire/push push}}]
+ (when (and push.tag (string.match push.tag "^v"))
+ (container {:image "alpine"
+ :cmd (.. "publish " push.tag)}))))
+```
+
+Fennel's `when` returns `nil` if the predicate is false, otherwise the body. That nil propagates out as the `run` return value, the runner records the job as skipped. The gate and the work are in the same expression.
+
+This means **every push starts a run**, even if no job's predicate matches. Skipped jobs show in the run record as skipped. For a personal forge this is fine; the run-creation cost is small and the explicit skip record is useful debugging information ("did my filter match?"). If push frequency ever makes this wasteful, a future pre-execution skip hook is the escape valve — but the v1 model is "skipping is just early-return."
+
+### Future: input args
+
+Source types that need configuration — cron schedules, webhook paths — can't be expressed as bare keywords. The planned shape is a constructor call returning a value the runner recognizes:
+
+```fennel
+(job :nightly-audit [(cron :daily)]
+ (fn [{:inputs {: cron}}] ...))
+
+(job :hourly-check [(cron :every "1h" :as :hourly)]
+ (fn [{:inputs {: hourly}}] ...))
+```
+
+`cron`, `webhook`, etc. would be quire-provided functions in the eval scope. They return marker values; the runner inspects the inputs list for them, registers their event sources, instantiates runs when they fire. The `:as` keyword names the binding when the default name (the source type) would collide.
+
+The same constructor form is the natural place for **cherry-picking job outputs** when that becomes desired: `(output :build :sha :as :commit)` would name a single output of an upstream job. Same mechanism, different target.
+
+**v1 supports only string/keyword inputs.** Constructor calls are the planned extension for cron, webhook, and cherry-picking — not implemented. The shape above is settled enough to commit to; the implementation waits until cron is the second source we want.
+
+### Validation
+
+Three structural rules at registration eval, plus one at parse time. All fail-closed.
+
+1. **Acyclic.** No cycles in the input graph. Detected by Kahn's algorithm; error names the cycle.
+2. **Non-empty inputs.** Every job must list at least one input. The error tells the user what to fix:
+ `Job 'setup' has empty inputs. Pass [:quire/push] (or another input) as the second argument so it has something to fire it.`
+3. **Reachability.** Every job's transitive inputs must include at least one source ref. Pure job-to-job chains with no source at the root are dead code; the error names the orphaned jobs.
+4. **No `/` in user job ids** (parse time). Error: `Job id 'foo/bar' contains '/', which is reserved for the 'quire/' source namespace. Use 'foo-bar' or another delimiter.`
+
+A bad `ci.fnl` push gets a CI run that fails immediately with the parse error, same path as a Fennel syntax error.
+
+## `run` — the only primitive
+
+`run` is a host-side Fennel function (the container can't run Fennel) called when the job is about to execute, with all upstream inputs resolved. It returns either:
+
+- **A table** — the job's outputs. Whatever keys are in it become available to dependent jobs as `inputs.<this-job>.<key>`.
+- **`nil`** — the job is skipped. Dependents see `inputs.<this-job>` as `nil`.
+
+That's the whole contract. No sugar layer, no introspection, no defaulting. The runner records what was returned.
+
+Inside `run`, the function uses **runtime primitives** to do work. The most important is `(container {...})`, which runs a container and returns a result table:
+
+```fennel
+(job :test [:quire/push]
+ (fn [{:inputs {:quire/push push}}]
+ (container {:image "rust:1.75"
+ :cmd (.. "git checkout " push.sha " && cargo test")})))
+```
+
+`(container ...)` returns `{:exit :stdout :stderr :duration}`. That's what `run` returns. The runner records it as the outputs.
+
+For more complex jobs, the function does its own orchestration: multiple containers, host-side work between them, computed outputs derived from intermediate results:
+
+```fennel
+(job :test-and-package [:quire/push]
+ (fn [{:inputs {:quire/push push}}]
+ (let [test (container {:image "rust:1.75"
+ :cmd ["git checkout" push.sha "&&" "cargo test"]})]
+ (when (= 0 test.exit)
+ (let [pkg (container {:image "alpine"
+ :cmd "tar czf out.tar.gz target/release"})]
+ {:exit pkg.exit
+ :artifacts ["out.tar.gz"]
+ :test-stdout test.stdout})))))
+```
+
+If the test fails, the outer `(when ...)` returns nil → job skipped. If it passes, the package step runs and the function returns a custom output table. One mechanism, scales from "run a command" to "orchestrate a multi-step pipeline."
+
+### Why `run` is "just a function"
+
+Earlier drafts of this design had three return shapes (string, list of strings, table) plus an `:outputs` field for declarative output extension plus a `:when` field for conditional firing plus an `:image` field for the default container image. All gone. They were paying for conveniences that aren't conveniences in a code-first config:
+
+- **String sugar.** `:run "cargo test"` saves about ten characters over `(fn [_] (container {:image "rust:1.75" :cmd "cargo test"}))`. Not worth a second mental model.
+- **`:outputs` declarative extension.** "Read coverage.json after the container exits" is a Fennel one-liner inside `run`: `(let [r (container {...})] {:exit r.exit :coverage (read-json "coverage.json")})`. Helpers compose to clean up repetition.
+- **`:when`.** Returning `nil` from `run` already means "skip." Filtering and work end up in the same expression, which makes the intent more visible, not less.
+- **`:image`.** Image lives on the `(container ...)` call where it's actually used. Lets a single job legitimately use multiple images.
+
+The residual things that *aren't* "just functions" — the inputs list and the id — are the ones that genuinely need to be language-level. They define the graph and the identity. Everything else is user-space.
+
+## Runtime primitives
+
+Functions in scope inside `run`:
+
+- `(container {opts})` — run a container, return `{:exit :stdout :stderr :duration}`. Opts: `:image`, `:cmd` (string or list), `:env`, `:cwd`, `:cache` (cache dir mount, defaults to job's image-keyed cache).
+- `(sh cmd)` — run a command on the host, no container. For cheap utility work. Returns the same shape as `container`.
+- `(read-file path)`, `(read-json path)`, `(write-file path content)` — workspace I/O. Paths relative to the workspace.
+- `(log msg)` — append to the job's log file. Visible in the web UI.
+- `(env name)` — read an environment variable from the runner's environment (typically secrets).
+
+Each of these blocks the Fennel function until it returns. Multi-container parallelism inside one job is a v2 want; the v1 model is "the function runs sequentially, calling primitives that block."
+
+The wallclock and memory limits on Fennel eval (10s, 512 MB by default — see CI.md) **don't apply to time spent inside primitives**, because the function blocks on real work. The runner accounts for container time separately. The eval budget is for Fennel-side computation between primitive calls.
+
+## A worked example
+
+```fennel
+;; Helper: a parameterized test job
+(fn rust-test [version]
+ (job (.. "test-" version) [:quire/push]
+ (fn [{:inputs {:quire/push push}}]
+ (when (= "main" push.branch)
+ (container {:image (.. "rust:" version)
+ :cmd [(.. "git checkout " push.sha)
+ "cargo test --all-features"]})))))
+
+;; Matrix testing on every push to main
+(each [_ v (ipairs [:1.75 :1.76 :stable])]
+ (rust-test v))
+
+;; Build only if all tests passed
+(job :build [:test-1.75 :test-1.76 :test-stable :quire/push]
+ (fn [{:inputs {:quire/push push : test-1.75 : test-1.76 : test-stable}}]
+ (when (and test-1.75 test-1.76 test-stable
+ (= 0 test-1.75.exit)
+ (= 0 test-1.76.exit)
+ (= 0 test-stable.exit))
+ (let [r (container {:image "rust:1.75"
+ :cmd [(.. "git checkout " push.sha)
+ "cargo build --release"]})]
+ {:exit r.exit
+ :artifacts ["target/release/quire"]}))))
+
+;; Deploy on push to main only
+(job :deploy [:build]
+ (fn [{:inputs {: build}}]
+ (when build
+ (container {:image "alpine"
+ :cmd "scp target/release/quire host:/usr/local/bin/"}))))
+
+;; Tagged release: publish to a registry
+(job :publish [:quire/push]
+ (fn [{:inputs {:quire/push push}}]
+ (when (and push.tag (string.match push.tag "^v"))
+ (container {:image "rust:1.75"
+ :cmd [(.. "git checkout " push.tag)
+ "cargo publish"]}))))
+```
+
+What this expresses:
+- Every push fires a run. Test jobs check `push.branch` and return nil for non-main pushes; build/deploy chain skips with them (their inputs are nil, their `(when ...)` checks see nil).
+- Tagged pushes additionally fire `:publish`, which has its own predicate.
+- The "all tests passed" check in `:build` is now visible in code rather than implicit. More verbose than a `:when` field, but the verbosity is honest about what's happening — and a helper (`(all-passed test-1.75 test-1.76 test-stable)`) would clean it up if the pattern repeats.
+
+## Evaluation timing
+
+> **v0 status:** the three-context model below is the eventual target. Initial implementation collapses to a single in-process eval per run — registration and per-job execution happen together at run start. The model expands back out to three contexts when cross-job inputs (job B consuming job A's outputs) and the subprocess sandbox land.
+
+`ci.fnl` is evaluated in **three contexts**, all using the subprocess machinery from CI.md (10s wallclock, 512 MB memory cap):
+
+1. **Registration eval.** When `ci.fnl` changes on the default branch. The runner walks the resulting job set, runs structural validation (cycles, non-empty inputs, reachability, namespace rule). For v1, nothing else needs to happen here — `:quire/push` is implicit, requires no registration. When source types that need registration arrive (cron schedules, webhook routes), they'll be discovered here via the constructor form in inputs.
+2. **Run eval.** When a push arrives and a run starts. The runner evaluates `ci.fnl` to get the current job set, computes which jobs are reachable from `:quire/push`, schedules them.
+3. **Per-job eval.** When a job is about to execute, its `run` function is invoked with concrete input values. Same subprocess, same limits, but per job.
+
+The three-context model means **`ci.fnl` is re-evaluated more than you might expect.** Pure functions, no caching across runs. This is fine — eval is fast and bounded — but worth knowing if a future helper does expensive work at the top level (parsing a large file, hitting a network endpoint). Top-level work runs three times per change, plus once per job. Move expensive work into `run` where it runs once per job execution.
+
+## Open questions
+
+- **Source events with no matching jobs.** If `ci.fnl` has no jobs whose transitive inputs include `:quire/push`, do pushes still create empty runs? Probably no — skip silently. But worth being explicit.
+- **What's the exact set of runtime primitives?** `container`, `sh`, `read-file` are obvious. Less obvious: do we expose `tcp-connect`, `http-get`? They'd enable real "jobs as observers" patterns, but they're a long road into "Fennel is a real programming environment." Probably no, defer.
+- **Artifacts as inputs.** Job B with `[:build]` as inputs — does B's workspace start with build's artifacts already in place? Probably yes; otherwise the `:artifacts` output is data-only and you can't use them in subsequent containers. Implementation: artifacts unpacked into B's workspace before B's container starts.
+- **Image pre-pull discoverability.** Without a top-level `:image` field, the runner can't statically know what images a job uses — it has to actually run the function (or analyze it, which is fragile). Probably acceptable for v1: pull-on-demand from `(container ...)` calls works fine, just with a one-time latency per new image. A `quire ci pull <image>` command lets users warm explicitly.
+- **Error semantics inside `run`.** What if it throws? Job marked failed, exception text into the log. What if it returns a malformed value (not nil, not a table)? Mark failed, log a schema warning.
+- **Push payload size.** `:quire/push.files-changed` could be huge for a large merge. Do we cap it? Stream it differently? Defer to first time it bites.
+- **Composition across files.** A `quire/stdlib.fnl` of common helpers, or per-repo Fennel modules. Real want eventually; not v1.
+- **Pre-execution skip hook.** "Every push starts a run" is fine for personal scale. If it ever isn't, a hook that runs *before* workspace materialization to skip the whole run is the escape valve. Currently you can return nil from any `run` to skip that job, but the run still happens.
+- **Map-form variant trigger.** What's the threshold for switching from `(job id inputs run)` positional to `(job id {:inputs ... :run ... :extra ...})` map-form? First option that genuinely needs to exist at the job level — likely candidates would be per-job timeout or retry policy. None planned for v1.
+
+## Locked-in decisions
+
+- **`(job id inputs run)`** — three positional arguments. No options map; if a fourth option ever needs to exist, that's the moment to introduce a map-form variant.
+- **`id`** is a keyword; cannot contain `/`. Validation rule, parse-time error.
+- **`inputs`** is a non-empty list of names. Each is either a job id or a source ref (reserved name in the `quire/` namespace).
+- **v1 supports only strings/keywords in `inputs`.** Constructor calls (for cron, webhook, output cherry-picks) are the planned extension; shape settled, implementation deferred.
+- **Builtins live under `quire/`**; user job ids cannot contain `/`.
+- **For v1, the only source is `:quire/push`.** Cron, webhook, manual deferred.
+- **Filtering happens inside `run`** by returning `nil`. Every push starts a run; jobs that return nil from `run` are skipped.
+- **Destructure source inputs at the function arg** — `(fn [{:inputs {:quire/push push}}] ...)` — to avoid awkward dot-access on `/`-containing keys.
+- **Dependency graph derived from the inputs list**, not declared separately. No `:needs`.
+- **Four structural validations**: acyclic (registration eval), non-empty inputs (registration eval), reachability from a source (registration eval), no `/` in user job ids (parse time). All fail-closed with named-target error messages.
+- **`run` is a function** `(fn [{: inputs}] ...)`. Returns a table (the outputs) or `nil` (skipped). No sugar.
+- **`(container {opts})` is the primary primitive** for running containers. Opts include `:image`, so a single job can use multiple images by making multiple container calls.
+- **Three eval contexts** — registration, run start, per job — all using the same subprocess machinery and limits.
+- **Source registration sourced from the default branch only** (relevant once registration becomes meaningful — for v1 it's a no-op since `:quire/push` needs no registration).
diff --git a/docs/CI.md b/docs/CI.md
new file mode 100644
index 0000000..239b5e3
--- /dev/null
+++ b/docs/CI.md
@@ -0,0 +1,267 @@
+# quire — CI design
+
+How CI works in quire. Slots alongside PLAN.md; will likely fold in once the open questions settle.
+
+## Shape
+
+The runner lives **in-process with `quire serve`**, as a long-lived tokio task in the same binary. It owns a queue of pending runs (in-memory, reconstructed from disk on startup), watches it for new entries, materializes a workspace per run, evaluates `.quire/ci.fnl`, and shells out to execute each job. Jobs are **ephemeral** — fresh sandbox per job, torn down on exit.
+
+The runner itself is not a container. It's a tokio task. The thing the runner *spawns* is the sandbox.
+
+```
+quire (one process)
+ ├── HTTP server (quire serve)
+ ├── ci-runner task
+ │ ├── run #1: <sandbox> rust:1.75 ... (ephemeral)
+ │ ├── run #2: <sandbox> python:3.12 ... (ephemeral)
+ │ └── ...
+ └── (shared state: run queue, log broadcasts)
+```
+
+Not the long-lived-per-image runner pool that GitHub Actions and GitLab use. That model amortizes startup at the cost of hermeticity — job N+1 inherits whatever job N left behind in the filesystem, which becomes a permanent class of "fails after the previous job" debugging. The speedup mostly comes from cache reuse, which is achievable with bind-mounted cache directories without taking on the statefulness debt. Personal forge doing dozens of runs/week, not thousands/day; container-per-job is strictly better here.
+
+The runner doesn't get its own process because **it doesn't execute user code in its address space**. With container-per-job, the runner reads files, builds a `docker run` argv, spawns it, copies stdout to a log file, reads exit code. None of those steps run user code in-process. A bug in `cargo test` can't crash the runner because it's running in a different container with its own kernel namespace. Process isolation between web and runner would buy nothing here — the docker boundary is doing that work. Don't pay twice for it.
+
+## Communication: filesystem as state of record, channels as optimization
+
+Run records on disk are the **durable truth** once written. The hook is a thin transport: it sends a push event over a Unix socket to `quire serve`, which is the sole writer of run records on disk.
+
+| Component | Reads from disk | Writes to disk | In-memory comms |
+|---|---|---|---|
+| Hook (`post-receive`) | — | — | push event → `quire serve` socket listener |
+| Runner (in-process with `quire serve`) | run records on startup | `meta.json`, `state.json`, `jobs/*/`, logs | wakeup from listener (mpsc); broadcast logs → web |
+| Web (`quire serve`) | run records on demand | — | subscribe to log broadcasts |
+
+The listener task (also inside `quire serve`) bridges the hook process boundary to the in-process runner. It binds `/var/quire/server.sock` on startup, parses incoming events, writes the initial run record, and signals the runner via mpsc. The wakeup signal carries no payload — the runner re-derives state by scanning `pending/`, so missed or duplicated wakes are idempotent.
+
+On startup, the runner walks `runs/pending/` and `runs/active/`, reconstructs the queue, and reconciles orphans (any `active/` entry whose container is no longer running gets marked failed). Crash resilience covers `quire serve` restart: any run record committed before the crash gets picked up.
+
+**v1 limitation: zero-loss-on-server-down is not provided.** If `quire serve` is down at push time, the hook's socket connect fails, the pusher sees a stderr warning, and no run is created. The push itself remains accepted by git (post-receive runs after acceptance). The v1 mitigation is "run `quire serve` under a supervisor that restarts it"; a hook fallback that writes `meta.json` directly when the socket is unreachable is a deferred follow-up if this ever bites in practice.
+
+The "we could one day extract the runner into its own process" door stays open: the on-disk schema doesn't change, the listener-to-runner mpsc becomes a Unix socket. Not building it now.
+
+## Storage: no database
+
+No SQLite in v1. Run records, job state, logs, and the queue all live as files under `/var/quire/runs/`. None of the queries the v1 web UI wants are slow at this scale; the in-memory queue handles low-latency enqueue/dequeue.
+
+**The commitment, written down so it doesn't drift:** if SQLite ever earns its keep — most likely trigger is FTS5 over logs — it enters as a **secondary index over the filesystem**, never as a primary store. Files remain canonical. The database is rebuildable: `quire reindex` walks `runs/` and repopulates. If the database is corrupted or lost, recovery is mechanical. The rule: `rm /var/quire/quire.db && quire reindex` returns the system to a working state. If that ever stops being true, the database has crossed a line it shouldn't have.
+
+This is the principle that prevents drift. The temptation to migrate state into SQLite ("just this one thing, it's so much easier") is constant once it exists; without the rule written down, the second time you reach for SQLite you'll have forgotten why you originally said no.
+
+## Concurrency: max one run at a time
+
+**One run executes at a time across the entire forge.** Job 2 of repo A waits for job 1 of repo B to finish.
+
+Implications:
+- Cache contention disappears entirely — no two jobs ever touch the same cache dir simultaneously.
+- Resource limits are trivial: the box is dedicated to whatever's running. No `--cpus`/`--memory` math, no oversubscription.
+- Queueing is FIFO from `runs/pending/`. No fairness story needed.
+
+The cost is latency under load: push to repo A while a 5-minute build of repo B is running, and you wait. For personal scale this is almost never the experience. The escape valve is documented and small: add a `max_concurrent_runs` config knob and a per-repo cache file lock; spawn N runner tasks instead of 1. The queue, supersede logic, and on-disk schema don't change.
+
+Within a run, **jobs form a DAG** (see next section), but the executor schedules them serially in topological order. Same constraint, same escape valve: the executor changes from "pick one ready job" to "pick up to N ready jobs"; the spec doesn't change.
+
+### Supersede semantics
+
+When a new push arrives for a ref that already has work in flight or queued for the same `(repo, ref)`:
+
+- **Queued, not yet started:** new push replaces the queued one. Old run marked `superseded`. If you pushed twice in 30 seconds, you almost certainly only care about the second result.
+- **Currently running:** kill the in-flight sandbox (`docker kill <id>`), mark the run `superseded`, enqueue the new one.
+- **Different ref of same repo:** unaffected. Pushing to `feature-branch` should not kill a running build of `main`.
+
+Cheap to get right *if* the run record stores the ref it's building from the start, and queue lookups are "any pending or active runs for `<repo>:<ref>`?" Both are one-line conditions.
+
+## The job DAG
+
+Jobs declare dependencies via `:needs`. Missing `:needs` means no dependencies — ready immediately. Failure of a job marks all transitive dependents as `skipped`, unless the failing job has `:allow-failure true` (in which case dependents proceed normally).
+
+```fennel
+{:jobs
+ [{:id "setup"
+ :image "rust:1.75"
+ :run "rustup component add clippy rustfmt"}
+
+ {:id "lint"
+ :image "rust:1.75"
+ :needs ["setup"]
+ :allow-failure true
+ :run "cargo clippy -- -D warnings"}
+
+ {:id "test"
+ :image "rust:1.75"
+ :needs ["setup"]
+ :run "cargo test"}
+
+ {:id "deploy"
+ :image "alpine"
+ :needs ["test"]
+ :run "scp target/release/quire host:/usr/local/bin/"}]}
+```
+
+With max-concurrency 1, executor topo-sorts and picks one ready job at a time (FIFO among ready jobs = spec order). `lint` and `test` are both ready after `setup`; lint runs first, then test, then deploy. If `setup` fails, all three skip.
+
+Schema decisions baked in:
+- `:needs` is `needs-all` (job runs only when *all* listed jobs succeed). `needs-any` is a real but rare want; the schema can grow `:needs-any` later without breaking existing specs.
+- Job ids are arbitrary non-empty strings. Cycle detection at parse time via Kahn's algorithm — fails closed, error message names the cycle.
+- `:allow-failure` exists from v1. Without it, the only way to express "lint can fail and we still want to deploy" is to remove the dependency, which loses the ordering signal.
+
+## Fennel evaluation
+
+`.quire/ci.fnl` is **executed**, not parsed. The Fennel program runs to completion and produces a value — a table of jobs with `:needs` references. The runner schedules against the resulting structure. Eval happens once at run start; the DAG is static after eval returns.
+
+Code, not data, means matrix builds, helpers, and conditionals fall out for free without dedicated schema features:
+
+```fennel
+(local rust-versions [:1.75 :1.76 :stable])
+
+{:jobs
+ (collect [_ v (ipairs rust-versions)]
+ {:id (.. "test-" v)
+ :image (.. "rust:" v)
+ :needs ["setup"]
+ :run "cargo test"})}
+```
+
+### Eval is sandboxed by subprocess + timeouts
+
+> **v0 status:** initial implementation evaluates `ci.fnl` in-process inside `quire serve`, with no wallclock or memory cap. A buggy or hostile `ci.fnl` can hang or OOM the server. Subprocess + the limits described below are the eventual target; they land when ci.fnl runaway becomes a real liability.
+
+Eval runs as a subprocess: `quire eval-ci-config <workspace>`. The child reads the file, runs the Fennel evaluator, serializes the result table to JSON on stdout, exits. Runner reads stdout, kills the child after the deadline:
+
+```rust
+let mut cmd = Command::new(env::current_exe()?);
+cmd.args(["eval-ci-config", "--workspace"]).arg(workspace)
+ .stdout(Stdio::piped()).stderr(Stdio::piped());
+
+unsafe {
+ cmd.pre_exec(|| {
+ let lim = libc::rlimit {
+ rlim_cur: 512 * 1024 * 1024,
+ rlim_max: 512 * 1024 * 1024,
+ };
+ libc::setrlimit(libc::RLIMIT_AS, &lim);
+ Ok(())
+ });
+}
+
+let child = cmd.spawn()?;
+let output = timeout(Duration::from_secs(10), child.wait_with_output()).await
+ .map_err(|_| anyhow!("ci.fnl evaluation exceeded 10s deadline"))??;
+```
+
+Defaults, both global config in `config.fnl`:
+- **10 second wallclock.** If `ci.fnl` needs longer than 10s to *decide what jobs to run*, something's wrong — that's design-time work, not runtime work.
+- **512 MB memory.** Same logic; eval shouldn't be doing heavy computation.
+
+Per-repo overrides intentionally not supported: the only reason a repo would need more is that the repo is doing something it shouldn't.
+
+This isn't bwrap because it doesn't need to be. The threat model is "did I accidentally write a `ci.fnl` that infinite-loops or eats the disk," not "is someone attacking me." Wallclock kill works regardless of what the eval is doing in C; OOM kills the child, not the runner; crash isolation is free. Filesystem and network isolation that bwrap would add buy nothing — eval is reading files in a workspace it has legitimate access to anyway.
+
+The in-process Lua `debug.sethook` approach is clever but has a blind spot inside C functions (`string.rep("x", 2^30)` returns instantly from Lua's perspective, the hook never fires during it, the runner OOMs). Subprocess + kill-on-timeout is boring and correct.
+
+## Run lifecycle
+
+1. **`post-receive` hook** sends a push event (one JSON line: `{type, repo, pushed_at, refs: [{ref, old_sha, new_sha}, ...]}`) over `/var/quire/server.sock` and exits. The listener task in `quire serve` parses the event, allocates a run-id per ref, writes `runs/<repo>/<run-id>/{meta.json, state.json}`, and signals the runner via mpsc. No CI work runs in the hook itself.
+2. **Runner picks up** the entry from the queue. Atomic rename `pending/<id>` → `active/<id>` for state-machine clarity.
+3. **Materialize workspace.** `git --git-dir=repos/foo.git archive <sha> | tar -x -C workspace/`. No worktree, no checkout state on the bare repo. Workspace is throwaway; deleted at end of run.
+4. **Evaluate `.quire/ci.fnl`** via subprocess (see above). Result is the job DAG.
+5. **Per ready job:** spawn the sandbox with workspace + caches mounted, stream stdout/stderr to `jobs/<job-id>/log` (and broadcast for live web tailing), capture exit code, record container ID for cancellation.
+6. **Aggregate.** Write final status to the run directory. Move `active/<id>` → `complete/<id>` (or `failed/<id>`).
+
+## Run record schema
+
+```
+runs/<repo>/<run-id>/
+ meta.json # immutable: sha, ref, pusher, pushed_at
+ state.json # mutable: status, started_at, finished_at, runner_pid, sandbox_id
+ jobs/
+ <job-id>/
+ spec.json # immutable: image, cmds, env, needs (extracted from ci.fnl)
+ state.json # mutable: status, started_at, finished_at, exit_code, sandbox_id
+ log # append-only stdout+stderr
+ cancel # touch-file; runner checks before each job
+```
+
+Two principles fall out:
+- **Immutable vs. mutable files are separate.** `meta.json` is written once and never touched. Readers (the web UI) can cache `meta.json` indefinitely and only re-read `state.json`.
+- **Append-only logs.** Web UI tails the log file; runner appends; no coordination needed. Live tailing also goes through a `tokio::sync::broadcast` channel for sub-second latency, but the file is the source of truth.
+
+## Sandbox backend — the real fork in the road
+
+Polyglot toolchains rule out "just bind-mount host `/`" — that path requires every toolchain on the host. So the sandbox is either Docker images or OCI images extracted to disk and run under bubblewrap. Both work. They imply different overall architectures.
+
+### Path A: Docker (DooD)
+
+`docker run --rm -v <ws>:/workspace -w /workspace --cpus=N --memory=M <image> sh -c '<cmds>'` per job. Shared image cache, well-trodden, every CI system on earth has done this.
+
+Quire stays containerized. The container talks to the host's docker daemon via bind-mounted `/var/run/docker.sock`. Anyone with that socket effectively has root on the host — fine here since quire and the operator account already share the box.
+
+The gotcha that will bite once and never again: when quire calls `docker run -v /var/quire/runs/foo/123/workspace:/workspace`, the host path is resolved by the *host* daemon, not interpreted from inside the quire container. So `/var/quire` must be at the *same path* on host and inside the quire container. Get this wrong in compose and you'll spend an hour on empty workspaces.
+
+Cost: socket mount, the path-pinning rule, daemon-talking-to-daemon, quire stays containerized.
+
+### Path B: OCI + bubblewrap
+
+`skopeo copy docker://rust:1.75-slim oci:images/rust-1.75:latest`, then `umoci unpack`, then bwrap binds the rootfs and runs the job:
+
+```
+bwrap --bind rootfs/rust-1.75 / \
+ --bind <workspace> /workspace --chdir /workspace \
+ --bind <cache>/cargo /cache/cargo \
+ --setenv CARGO_HOME /cache/cargo \
+ --proc /proc --dev /dev --tmpfs /tmp \
+ --unshare-pid --unshare-ipc --die-with-parent \
+ sh -c 'cargo test'
+```
+
+Full Docker Hub image catalog. No daemon, no socket, no privilege, no DinD/DooD question. The cascade: quire becomes a systemd unit on the host; one process tree; the `/var/quire` path-pinning rule becomes irrelevant because nothing crosses a container boundary.
+
+Costs that need real work:
+- **Writable rootfs.** Most images expect to write outside the workspace (apt, scripts dropping files in `/etc`). Bwrap's `--overlay-src` gives a writable union with a throwaway upper layer. ~30 lines, but mandatory by the second image you try.
+- **Image refresh.** No auto-pull on tag updates. Either explicit `quire ci pull` or digest-check before each run.
+- **Resource limits.** No `--cpus`/`--memory`. Wrap with `systemd-run --user --scope -p MemoryMax=2G -p CPUQuota=200% bwrap ...` or write the bwrap PID into a cgroup directly.
+- **OCI config.** Images carry `entrypoint`/`cmd`/`USER` in their config; bwrap doesn't read it. Parse the JSON yourself if you want to honor it. For CI it barely matters since you're overriding the command anyway.
+
+Roughly 200-400 lines of Rust beyond the bind-host case, mostly shelling to `skopeo`/`umoci` and assembling the bwrap argv.
+
+### Recommendation
+
+**DooD for v1, OCI+bwrap as a known migration path.**
+
+- DooD gets CI working in a week. Polyglot is free.
+- The runner is one tokio task in one binary. Swapping its backend is a contained change. The Fennel job spec doesn't care which backend ran it.
+- Once the system has been used enough to know what's actually needed from it, the OCI+bwrap migration removes the last reason for quire to be containerized at all — which is the more on-brand endpoint given the rest of the design.
+
+If the impulse is to skip straight to OCI+bwrap on aesthetic grounds: defensible, but you're paying ~2 weeks of sandbox plumbing before any CI runs at all. The intermediate state of "DooD works, here's what I actually want from it" is worth a lot.
+
+## Caching
+
+Per-repo named volume (DooD) or bind-mounted directory (bwrap) at `/cache`, with `CARGO_HOME=/cache/cargo`, `npm_config_cache=/cache/npm`, etc. set via env. Cache lives on the same volume as everything else under `/var/quire/cache/<repo>/`. Same model in both backends; just plumbed differently.
+
+Punt on cache invalidation until it actually annoys. "Delete the cache dir" is a fine v1 escape hatch.
+
+## Open questions
+
+- **Fennel stdlib surface.** What does `eval-ci-config` expose? At minimum: env access (`(env :GITHUB_TOKEN)`, scoped to repo secrets), table-building for jobs, maybe a `matrix` helper. Bigger question: does eval get to read files from the workspace (`(read-file "Cargo.toml")` to decide what jobs to register)? "Yes" is the thin end of the dynamic-jobs wedge; "no" keeps the model strict.
+- **Image pre-warming.** First run of any image pulls hundreds of MB. Want both implicit pull-on-demand and an explicit `quire ci pull <image>` to warm before pushing.
+- **Log streaming UX.** SSE tailing the log file works for the web UI, but the broadcast-channel-vs-file-tail interaction has subtleties around "client connects mid-job, wants backlog + live."
+- **Image GC.** Host accumulates layers. Weekly `docker image prune` via host cron is the dumb correct answer for DooD; OCI+bwrap needs a `quire ci gc` that walks `images/` and `rootfs/` against last-used timestamps.
+- **Services / sidecars.** Some jobs want postgres or redis alongside. The shape is "bring up sidecar, run job against it, tear down." Adds a small orchestration layer. Not v1.
+- **Secrets.** CI jobs that need API tokens. Probably env-injected from `config.fnl`, scoped per-repo. Worth designing the surface area before the first job needs one.
+- **Cycle detection error UX.** Where do parse errors surface — does the push fail (post-receive returns nonzero) or does the run start and immediately error? Probably the latter, since hooks should be fast and CI errors belong in CI history.
+
+## Locked-in decisions
+
+- **Runner is in-process** with `quire serve` as a tokio task; not a separate process. Filesystem is the state of record; channels are the wakeup optimization.
+- **No SQLite in v1.** If it enters later, it's a secondary index over the filesystem, never primary. `rm quire.db && quire reindex` must always recover.
+- **Container-per-job**, not long-lived runners.
+- **DooD for v1**; OCI+bwrap as planned migration path.
+- **Workspace materialized via `git archive`**, not worktree.
+- **Max concurrency 1** across the whole forge. Escape valve is `max_concurrent_runs` config + per-repo cache file lock; not building it now.
+- **Jobs are a DAG** with `:needs` (needs-all). Executor schedules serially in topological order under max-concurrency 1; lifting that constraint changes the executor, not the spec.
+- **`:allow-failure`** flag exists from v1.
+- **Supersede on same `(repo, ref)`**: replace queued, kill running.
+- **`.quire/ci.fnl` is executed**, returns the DAG.
+- **Eval runs as a subprocess** with 10s wallclock and 512 MB memory cap. Not bwrap. **v0 deviation:** initial implementation evaluates in-process; subprocess sandbox lands when ci.fnl runaway becomes a real liability.
+- **Hook is a transport, not a writer.** `post-receive` sends a push event over `/var/quire/server.sock`; `quire serve` writes the run record. Hook never touches `runs/`. Tradeoff: zero-loss-on-server-down is dropped in v1 (push lands but no run is created). Fallback to direct disk write is a deferred follow-up.
+- **Caches** are bind-mounted directories under `/var/quire/cache/<repo>/`.