Console & REST API¶

Web Console¶

TODO

Document the Turbine web console (React + TypeScript SPA in console/): how to build it (pnpm --prefix console build), how to point it at a node's REST API, and what views it exposes (cluster topology, partition assignments, state browsing, metrics).

REST API¶

Every Turbine node exposes an HTTP API (default port: 8400, configurable via api_port or TURBINE_API_PORT).

Status & monitoring¶

Endpoint	Method	Description
`/health`	GET	Node status, index, worker count
`/assignments`	GET	Partition assignments for this node
`/metrics`	GET	Prometheus metrics (see below)

# Check node health
curl http://localhost:8400/health

# List assigned partitions
curl http://localhost:8400/assignments

# Scrape Prometheus metrics
curl http://localhost:8400/metrics

Exposed metrics¶

Metric	Type	Description
`turbine_messages_total`	counter	Total messages processed
`turbine_batches_total`	counter	Total batches processed
`turbine_throughput_10s`	gauge	Messages/second over the last 10s
`turbine_throughput_60s`	gauge	Messages/second over the last 60s
`turbine_state_ops_total`	counter	Total state operations committed (user puts/deletes + offset)
`turbine_state_ops_10s`	gauge	State operations/second over the last 10s
`turbine_state_ops_60s`	gauge	State operations/second over the last 60s
`turbine_consumer_lag`	gauge	Consumer lag (messages behind high watermark)
`turbine_event_time_watermark_ms`	gauge	Event-time watermark per partition, epoch ms (event-time subscriptions only)
`turbine_event_time_watermark_lag_ms`	gauge	Wall clock − watermark, refreshed ~10 s — keeps growing on a silent stream
`turbine_event_time_null_total`	counter	Rows with missing/unparsable event time, dropped from event-time bucketing
`turbine_dlq_messages_total`	counter	Dead-lettered / dropped records (labels: `topic`, `partition`, `phase`, `action`)
`turbine_dlq_messages_60m`	gauge	Dead-lettered / dropped records over the last 60 min
`turbine_last_checkpoint_timestamp_ms`	gauge	Wall-clock time of the last uploaded checkpoint
`turbine_last_checkpoint_offset`	gauge	Consumer offset captured by that checkpoint — a recovery replays everything after it
`turbine_last_checkpoint_size_bytes`	gauge	On-disk size of that checkpoint
`turbine_poll_seconds`	histogram	Time waiting for messages from the broker
`turbine_decode_seconds`	histogram	JSON → Arrow deserialization time
`turbine_process_seconds`	histogram	User business logic time
`turbine_produce_seconds`	histogram	Encoding + sending to the output topic
`turbine_commit_seconds`	histogram	Writing state + offset to RocksDB
`turbine_batch_duration_seconds`	histogram	Total batch time (all phases)
`turbine_worker_kind`	gauge (info)	Constant `1` carrying the worker's nature as a `kind` label: `stateful`, `stateless` or `repartition`. Join it onto any worker-labeled series with PromQL `group_left(kind)` to slice dashboards by worker nature
`turbine_worker_restarts_total`	counter	Supervisor restarts of a worker. A steadily growing value is the crash-loop signature — alert on it: throughput at zero alone cannot distinguish a crash-loop from an idle topic
`turbine_worker_fatal_total`	counter	Fatal worker halts (the app shuts down after recording one)
`turbine_repartition_records_total`	counter	Records re-emitted by a repartition hop, labeled by destination `partition` — compare partitions to spot key skew
`turbine_repartition_null_keys_total`	counter	Repartitioned rows whose grouping column was null (routed to the shared empty-key group)
`turbine_repartition_transit_ms`	histogram	Repartition hop transit: re-emit time minus source record timestamp. Near zero when the hop keeps up; grows when it falls behind

Histograms are exposed as summaries with per-worker quantile="0.5|0.9|0.99|…" lines plus _sum/_count. The quantiles are computed per worker over a sliding window — read them per worker (take a max for a "worst worker" view), never sum or average them across workers.
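
The "worst worker" reading described above can be sketched as follows. This is a minimal illustration that assumes the summary lines follow the standard Prometheus text exposition layout, with `worker` and `quantile` labels on each sample. The helper name `worst_worker_quantile` and the sample line in the comment are hypothetical, not taken from Turbine's actual output.

```python
import math
import re

# Assumed sample layout (standard Prometheus text format), for example:
#   turbine_batch_duration_seconds{worker="work-00-p00",quantile="0.99"} 0.12
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def worst_worker_quantile(metrics_text, metric, quantile):
    """Return (worker, value) for the worker with the highest quantile value.

    Takes a max across workers; quantiles are never summed or averaged.
    Returns None when no matching sample is found.
    """
    worst = None
    for line in metrics_text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if match is None or match.group("name") != metric:
            continue  # comments, other metrics, _sum/_count series
        labels = dict(LABEL_RE.findall(match.group("labels")))
        if labels.get("quantile") != quantile:
            continue
        value = float(match.group("value"))
        if math.isnan(value):
            continue  # an empty window may report NaN; skip it
        if worst is None or value > worst[1]:
            worst = (labels.get("worker"), value)
    return worst
```

Feeding it the body of `GET /metrics` with `metric="turbine_batch_duration_seconds"` and `quantile="0.99"` would yield the slowest worker at p99, provided the label layout matches the assumption above.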

All metrics are labeled with worker="work-{sub_index}-p{partition}", where sub_index is the subscription's declaration order. A worker belongs to exactly one subscription, so when several subscriptions read one topic they have distinct worker labels — group by worker (or join on the sub_index from /assignments) rather than by topic to keep them apart.
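
To group series by subscription, the worker label has to be split back into its parts. A small sketch, assuming both parts are decimal integers (zero-padded in the examples on this page, such as `work-00-p00`); `parse_worker_label` is an illustrative helper, not part of Turbine:

```python
import re

# work-{sub_index}-p{partition}, both parts assumed to be decimal digits
WORKER_RE = re.compile(r"^work-(\d+)-p(\d+)$")


def parse_worker_label(worker):
    """Split a worker label into (sub_index, partition)."""
    match = WORKER_RE.match(worker)
    if match is None:
        raise ValueError(f"unexpected worker label: {worker!r}")
    return int(match.group(1)), int(match.group(2))
```

Two workers with the same `sub_index` belong to the same subscription; two workers reading the same topic do not necessarily.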

Cluster management (Raft mode)¶

These endpoints are available when running in Raft cluster mode. They support ForwardToLeader: if you hit a follower, you get a redirect response with the leader's address.

Endpoint	Method	Description	Body
`/cluster/add_learner`	POST	Add a node as Raft learner	`{"node_id": 4, "addr": "http://10.0.0.4:8400"}`
`/cluster/change_membership`	POST	Promote learners to voters	`{"add_voter_ids": [4]}`
`/cluster/remove_voter`	POST	Remove a voter (drain before maintenance)	`{"remove_voter_ids": [3]}`

Response format:

// Success
{"status": "Ok"}

// Redirect (hit a follower)
{"status": "ForwardToLeader", "leader_id": 1, "leader_addr": "http://10.0.0.1:8400"}

// Error
{"status": "Error", "message": "..."}
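
A client can follow the redirect itself. The sketch below assumes the three response shapes above arrive as a JSON body with a 2xx HTTP status (this page does not state the status code, and `urllib` raises on 4xx/5xx). The names `post_json` and `cluster_call` and the `max_hops` limit are illustrative choices.

```python
import json
import urllib.request


def post_json(url, body, timeout=5.0):
    """POST a JSON body and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def cluster_call(base_url, path, body, send=post_json, max_hops=3):
    """Call a cluster endpoint, following ForwardToLeader responses.

    `send` is injectable so the redirect logic can be exercised offline.
    """
    for _ in range(max_hops):
        reply = send(base_url.rstrip("/") + path, body)
        status = reply.get("status")
        if status == "Ok":
            return reply
        if status == "ForwardToLeader":
            base_url = reply["leader_addr"]  # retry against the reported leader
            continue
        raise RuntimeError(reply.get("message", f"unexpected reply: {reply!r}"))
    raise RuntimeError(f"no leader reached after {max_hops} hops")
```

For example, `cluster_call("http://10.0.0.2:8400", "/cluster/add_learner", {"node_id": 4, "addr": "http://10.0.0.4:8400"})` would retry once against the leader if the first node answers `ForwardToLeader`.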

State introspection¶

Query the RocksDB state stores across workers for debugging, alerting dashboards, or operational visibility.

`POST /state` — per-node query¶

Query the state stores on a single node. Available in both standalone and cluster mode.

Request body:

Field	Type	Default	Description
`worker_id`	`string | null`	`null`	Filter to a specific worker. `null` = all workers on this node.
`prefix`	`string | null`	`null`	Only return entries whose key starts with this prefix. `null` = no filter.
`limit`	`int | null`	`100`	Maximum entries per worker. Capped at 1000.
`include_internal`	`bool | null`	`false`	Include `_turbine_*` keys (offsets, timer state).

Response:

{
  "workers": [
    {
      "worker_id": "work-00-p00",
      "topic": "dev-input",
      "partition": 0,
      "truncated": false,
      "entries": [
        {"key": "agg|error_rate_short|tenant_1|region_eu|1700000000000", "value": "{\"sum\":450.2,\"count\":10}"},
        {"key": "cool|error_rate_short|tenant_1|region_eu", "value": "{\"severity\":\"warning\"}"}
      ]
    }
  ]
}

truncated is true when the worker has more entries than limit. Increase limit or narrow prefix to see more.
value is returned as UTF-8 text when the bytes are valid UTF-8, otherwise as b64:<base64>.
Keys starting with _turbine_ are excluded by default (these store offsets and timer metadata). Set include_internal: true to see them.
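
Client code has to undo the `b64:` encoding before using a value. A minimal sketch, with `decode_state_value` as an illustrative helper name:

```python
import base64

B64_PREFIX = "b64:"


def decode_state_value(value):
    """Return text values unchanged and `b64:`-prefixed values as bytes."""
    if value.startswith(B64_PREFIX):
        # validate=True rejects characters outside the base64 alphabet
        return base64.b64decode(value[len(B64_PREFIX):], validate=True)
    return value
```

A stored text value that itself begins with `b64:` would be indistinguishable from an encoded one under this scheme. This page does not say how that case is handled, so treat it as an open assumption.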

Examples:

# All state across all workers on this node
curl -X POST http://localhost:8400/state \
  -H 'Content-Type: application/json' -d '{}'

# Aggregates for a specific rule, limited to 50 entries per worker
curl -X POST http://localhost:8400/state \
  -H 'Content-Type: application/json' \
  -d '{"prefix": "agg|error_rate_short", "limit": 50}'

# All state for one worker, including internal keys
curl -X POST http://localhost:8400/state \
  -H 'Content-Type: application/json' \
  -d '{"worker_id": "work-00-p00", "include_internal": true}'

`POST /cluster/state` — cluster-wide query¶

Fans out to all Raft members and aggregates their /state responses into a single result. Only available in Raft cluster mode.

Same request body as /state. The response includes an additional unreachable field listing node addresses that did not respond within 5 seconds.

Response:

{
  "workers": [
    {"worker_id": "work-00-p00", "topic": "dev-input", "partition": 0, "truncated": false, "entries": [...]},
    {"worker_id": "work-00-p01", "topic": "dev-input", "partition": 1, "truncated": true, "entries": [...]}
  ],
  "unreachable": ["10.0.1.3:8400"]
}
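
Because a cluster-wide answer can be partial in two ways (a worker hit `limit`, or a node did not respond), it helps to check both before trusting the result. A sketch operating on the response shape shown above; `summarize_cluster_state` is an illustrative helper:

```python
def summarize_cluster_state(response):
    """Summarize a /cluster/state response and flag partial results."""
    workers = response.get("workers", [])
    truncated = sorted(w["worker_id"] for w in workers if w.get("truncated"))
    unreachable = list(response.get("unreachable", []))
    return {
        "entries": sum(len(w.get("entries", [])) for w in workers),
        "truncated_workers": truncated,
        "unreachable": unreachable,
        # complete only if every node answered and no worker hit the limit
        "complete": not truncated and not unreachable,
    }
```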

Example:

# All aggregate state across the entire cluster
curl -X POST http://any-node:8400/cluster/state \
  -H 'Content-Type: application/json' \
  -d '{"prefix": "agg|"}'

State key conventions¶

The alerting app (examples/alerting_app.py) uses the following key layout. Prefix-based queries make it easy to find data per tenant or rule:

Key pattern	Prefix for queries	Content
`agg|{rule}|{group}|{window_start}`	`agg|` or `agg|{rule}|`	JSON aggregate state (sum, count, etc.)
`cool|{rule}|{group}`	`cool|` or `cool|{rule}|`	Cooldown state machine (severity, ts)
`_turbine_offset`	`_turbine_`	Last committed offset (internal)
`_turbine_timer:{id}`	`_turbine_timer:`	Scheduled timer fire times (internal)
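
The key layout in the table can be taken apart as sketched below. The sketch assumes that `{rule}` never contains `|`, that `{group}` may contain `|` (as in `tenant_1|region_eu`), and that `{window_start}` is an integer. `parse_state_key` is an illustrative helper, not part of the alerting app.

```python
def parse_state_key(key):
    """Split an alerting-app state key into its documented components."""
    if key.startswith("_turbine_"):
        return {"kind": "internal", "name": key}
    kind, _, rest = key.partition("|")
    if kind == "agg":
        rule, _, tail = rest.partition("|")
        # the group may itself contain "|", so split the window off the end
        group, _, window_start = tail.rpartition("|")
        return {"kind": kind, "rule": rule, "group": group,
                "window_start": int(window_start)}
    if kind == "cool":
        rule, _, group = rest.partition("|")
        return {"kind": kind, "rule": rule, "group": group}
    raise ValueError(f"unrecognized state key: {key!r}")
```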

Topology¶

GET /topology returns the application's static topology: one entry per subscription with its consumed, produced, and dead-letter topics, error and delivery policies, and declared windows. It is a pure function of the app's @subscribe declarations, so it is identical on every node and any one of them can answer. Live numbers (rates, lag, DLQ counts) come from /cluster/nodes; join the two on sub_index (encoded in each worker ID) rather than on topic, because several subscriptions may read the same topic.

curl http://localhost:8400/topology

{
  "subscriptions": [
    {
      "sub_index": 0,
      "name": "score_cpu_usage_per_user",
      "source_topic": "events",
      "output_topic": "scores",
      "dlq_topic": "events-dlq",
      "on_error": "dlq",
      "processing_guarantee": "exactly_once",
      "partition_key": "tenant_id",
      "event_time": true,
      "windows": ["risk_score"]
    }
  ]
}

name is the handler's function or class name (or an explicit @app.subscribe(name=…)). windows lists statically declared windows only; windows created lazily inside process() appear in /cluster/windows once the app is running, but not here. Optional fields (output_topic, dlq_topic, partition_key, and windows when empty) are omitted from the response when unset.

sub_index is the subscription's declaration order, and matches the sub_index on /assignments. Since several subscriptions may read one topic, a topic can appear on several rows — join a subscription to its workers on sub_index, never on source_topic, or you will merge the stats of every subscription reading that topic.
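
The join on `sub_index` might look like the sketch below. The shape of the worker rows is an assumption: each is taken to be a dict carrying a `sub_index` field, which this page implies for /assignments but does not show. `join_workers_to_subscriptions` is an illustrative name.

```python
from collections import defaultdict


def join_workers_to_subscriptions(subscriptions, workers):
    """Attach worker rows to /topology subscriptions by sub_index.

    `workers` is assumed to be a list of dicts that each carry a
    `sub_index` field (hypothetical shape, e.g. derived from /assignments).
    """
    by_index = defaultdict(list)
    for worker in workers:
        by_index[worker["sub_index"]].append(worker)
    return {
        sub["sub_index"]: {
            "name": sub["name"],
            "source_topic": sub["source_topic"],
            "workers": by_index.get(sub["sub_index"], []),
        }
        for sub in subscriptions
    }
```

Keying the result on `sub_index` keeps two subscriptions that read the same topic on separate rows, which a join on `source_topic` would merge.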

Events¶

Each node keeps a small in-memory ring (512 entries) of operational events: worker starts, stops, failures, restarts. This is what makes a crash-looping worker visible: gauges can show the same value on two consecutive polls, while the event feed records every restart in between.

Endpoint	Method	Description
`/events`	GET	This node's events, oldest first. `?since=<seq>` is a cursor on the per-node sequence number; `?limit=` caps the number of events returned (default 200, max 512)
`/cluster/events`	GET	Every node's feed merged, newest first, each event stamped with its `node_id`

curl 'http://localhost:8400/cluster/events?limit=50'

{
  "events": [
    {
      "seq": 42,
      "timestamp_ms": 1781278835645,
      "kind": "worker_failed",
      "worker_id": "work-00-p00",
      "topic": "events",
      "partition": 0,
      "node_id": 1,
      "message": "state store error: ..."
    }
  ]
}

kind is one of worker_started, worker_stopped, worker_failed, worker_fatal (the error that stops the app), worker_restarted. The ring is in-memory: it resets on process restart, and long-horizon history belongs to your logging stack — this feed answers "what just happened?", not "what happened last week?".
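
A poller for a single node's /events feed could track its cursor and count restarts per worker as sketched here. Whether `since` is inclusive or exclusive is not stated on this page, so the sketch only computes the highest `seq` seen and leaves de-duplication to the caller. Both helper names are illustrative.

```python
from collections import Counter


def restart_counts(events):
    """Count worker_restarted events per worker_id."""
    return Counter(
        event["worker_id"]
        for event in events
        if event.get("kind") == "worker_restarted"
    )


def next_cursor(events, cursor=0):
    """Return the highest seq seen so far, to pass as ?since= next time."""
    return max([cursor] + [event["seq"] for event in events])
```

Because `seq` is per node, a cursor computed this way applies to one node's /events feed only, not to the merged /cluster/events feed.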

Dead-letter queue (DLQ) inspection¶

Surface which subscriptions route bad records where, how many have been dead-lettered, and what the dead-lettered records actually look like — without dropping to rpk. See error handling for the on_error / dlq_topic configuration these endpoints report on.

Endpoint	Method	Description
`/dlq`	GET	This node's subscriptions with their DLQ config + counters
`/cluster/dlq`	GET	Same, aggregated across the cluster (Raft mode)
`/dlq/peek`	GET	The last N raw records off a dead-letter topic

A lightweight recent-DLQ count is also folded into each worker on /cluster/nodes (dlq_60m, the rolling 60-minute count) so a node card shows DLQ activity at a glance.

`GET /dlq` and `GET /cluster/dlq`¶

curl http://localhost:8400/dlq

{
  "subscriptions": [
    {
      "name": "process_orders",
      "source_topic": "orders",
      "dlq_topic": "orders_dlq",
      "on_error": "dlq",
      "counters": {
        "total": 128,
        "recent_60m": 12,
        "by_phase_action": [
          {"phase": "deserialize", "action": "dlq", "count": 120},
          {"phase": "processing", "action": "dlq", "count": 8}
        ]
      }
    }
  ]
}

Every subscription is listed with its configured on_error policy, so a subscription using skip (no dlq_topic) still shows up with its drop counters. total is cumulative since process start; recent_60m is the rolling 60-minute count. /cluster/dlq sums each subscription's counters across nodes and adds an unreachable list for any peer that did not respond.

A subscription is identified by (source_topic, name), since several subscriptions may read one topic.

Drop counters are per topic, not per subscription

Dead-letter counters are recorded against the source topic. When several subscriptions read the same topic they each report that topic's counters, so the same drops appear on each of those rows — don't add them up. The on_error / dlq_topic config on each row is per subscription and exact.
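
To total drops without double counting, keep one copy of the counters per source topic. A sketch over the `subscriptions` array of a /dlq response; `drops_by_topic` is an illustrative helper:

```python
def drops_by_topic(subscriptions):
    """Return {source_topic: total}, counting each topic's drops once."""
    totals = {}
    for sub in subscriptions:
        # Rows sharing a source topic repeat the same counters; keep one copy.
        totals.setdefault(sub["source_topic"], sub["counters"]["total"])
    return totals
```

Summing the values of the returned dict gives a node-wide total that does not inflate when several subscriptions read one topic.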

`GET /dlq/peek`¶

curl 'http://localhost:8400/dlq/peek?topic=orders_dlq&limit=10'

topic must be a configured dlq_topic (the endpoint refuses arbitrary topics). limit defaults to 20 (max 200). The read is non-destructive — it never commits an offset.

{
  "topic": "orders_dlq",
  "records": [
    {
      "partition": 0,
      "offset": 41,
      "timestamp_ms": 1781190000000,
      "key": "tenant_42",
      "payload": "{not valid json",
      "payload_truncated": false,
      "headers": [
        {"key": "turbine_dlq_source_topic", "value": "orders"},
        {"key": "turbine_dlq_source_partition", "value": "0"},
        {"key": "turbine_dlq_source_offset", "value": "41"},
        {"key": "turbine_dlq_phase", "value": "deserialize"},
        {"key": "turbine_dlq_error", "value": "Json error: ..."},
        {"key": "turbine_dlq_worker", "value": "work-00-p00"}
      ]
    }
  ]
}

Keys, payloads, and header values are UTF-8 text where possible, otherwise base64 with a b64: prefix. Payloads over 2 KB are truncated (payload_truncated: true). Replay (re-producing from the DLQ back to the source topic) is out of scope — peek + counts only.
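
The `turbine_dlq_*` headers are enough to trace a dead-lettered record back to its origin. A sketch that reads them from one peeked record, assuming the header values are plain text rather than `b64:`-encoded; `dlq_origin` is an illustrative helper:

```python
def dlq_origin(record):
    """Extract the source coordinates and failure details from a DLQ record."""
    headers = {h["key"]: h["value"] for h in record.get("headers", [])}

    def as_int(name):
        value = headers.get(name)
        return int(value) if value is not None else None

    return {
        "topic": headers.get("turbine_dlq_source_topic"),
        "partition": as_int("turbine_dlq_source_partition"),
        "offset": as_int("turbine_dlq_source_offset"),
        "phase": headers.get("turbine_dlq_phase"),
        "error": headers.get("turbine_dlq_error"),
        "worker": headers.get("turbine_dlq_worker"),
    }
```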