Windowing — advanced

Everyday window usage is covered on the Windowing (shared mechanics) and Window types (the kinds) pages. This page covers what you reach for less often: the failover guarantees of persistent windows, the in-memory trade-offs in depth, wiring one window's result into another with Ref / join_with, deferring window construction with the recovery hook, and the one caveat sessions carry at very high key cardinality.

Failover-resilient persistent windows

PersistentTumbling, PersistentSliding and PersistentSession survive node failure. Accumulator state and the fire-time timer are both persisted and checkpointed to durable storage. When a partition migrates — because a node crashed and Raft redistributed its partitions, or during a rolling upgrade — the new owner restores from the latest snapshot. Any in-flight window whose deadline has already passed fires immediately with the fully-merged value; windows still in the future fire on schedule.

This is a real strength: long windows (hours, days) don't restart when a node dies. There is one correctness boundary — allowed_lateness_ms. If failover completes before window_end + allowed_lateness_ms, the window closes with all the events it would have seen in a steady-state cluster. If failover runs longer than that, the window still fires (the timer is overdue and fires as soon as the new worker boots), but events produced during the outage whose timestamps fall inside the closing window may arrive after the timer and be treated as late.

Pick allowed_lateness_ms wide enough to absorb your expected failover budget. The liveness grace period defaults to 60 s, so a minute or two of lateness tolerance is a reasonable floor for event-time windows you need to be complete.
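
The boundary described above reduces to one comparison. The following is a minimal sketch of that arithmetic, not SDK code: the function closes_complete and its parameter names are hypothetical, and all times are assumed to be epoch milliseconds.

```python
def closes_complete(window_end_ms: int, allowed_lateness_ms: int,
                    failover_done_ms: int) -> bool:
    """True if failover finished before the window's lateness deadline."""
    return failover_done_ms < window_end_ms + allowed_lateness_ms


# A window ending at t = 600 s with 120 s of allowed lateness.
window_end_ms = 600_000
allowed_lateness_ms = 120_000

# Failover done 90 s after the window end: inside the budget.
fast = closes_complete(window_end_ms, allowed_lateness_ms, 690_000)
# Failover done 150 s after the window end: outage events may count as late.
slow = closes_complete(window_end_ms, allowed_lateness_ms, 750_000)
```

Running the same check against your own measured failover times is a quick way to see whether a given allowed_lateness_ms leaves any headroom.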

In-memory windows

InMemoryTumbling / InMemorySliding / InMemorySession keep window state in process memory instead of the persistent state store. Same API as their persistent counterparts — the difference is entirely in storage, recovery, and close timing.

The payoff is throughput. Every event in a persistent window pays a serialise + durable-write + commit on the state path; an in-memory window is a plain in-memory merge. That cost scales with how many windows are open and how fast events arrive, so the in-memory advantage is largest exactly where it's needed most: a large number of concurrent windows (high key × window cardinality) and/or a high event rate. If profiling shows the state-store write path dominating, or you simply expect a very large number of open windows, the in-memory variant is the first lever to pull, provided the durability trade-off below is acceptable.

Close is traffic-driven. Each batch advances a per-partition watermark and emits any window whose deadline has been crossed. To bound how long a window can stay open on a partition that has gone quiet, the runtime also runs a best-effort idle tick approximately every 10 s that advances the watermark to wall-clock time. Net effect: a window normally closes within ~10 s of its deadline, even with no traffic (the tick is best-effort, so treat this as a typical bound rather than a hard guarantee). The persistent kinds are tighter (sub-second); pick them when emission timing is correctness-critical rather than UX-critical.
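
The close mechanics can be pictured with a toy model. This is an illustrative sketch only, not the runtime's implementation: the class PartitionWindows and its methods are hypothetical, and it assumes a batch advances the watermark to its newest event time.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionWindows:
    """Toy model of traffic-driven close on one partition."""
    watermark_ms: int = 0
    deadlines: dict[str, int] = field(default_factory=dict)  # window id -> deadline

    def _advance(self, to_ms: int) -> list[str]:
        # The watermark only moves forward.
        self.watermark_ms = max(self.watermark_ms, to_ms)
        closed = sorted(w for w, d in self.deadlines.items()
                        if d <= self.watermark_ms)
        for w in closed:
            del self.deadlines[w]
        return closed

    def on_batch(self, max_event_ts_ms: int) -> list[str]:
        """A batch advances the watermark to its newest event time (assumed)."""
        return self._advance(max_event_ts_ms)

    def on_idle_tick(self, wall_clock_ms: int) -> list[str]:
        """The idle tick advances the watermark to wall-clock time."""
        return self._advance(wall_clock_ms)


part = PartitionWindows(deadlines={"w1": 60_000, "w2": 120_000})
closed_by_batch = part.on_batch(61_000)      # traffic crosses w1's deadline
closed_by_tick = part.on_idle_tick(125_000)  # partition is quiet; the tick closes w2
```

In this model w2 would stay open indefinitely without the idle tick, which is the gap the tick exists to bound.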

Trade-offs vs. the persistent kinds:

❌ No crash recovery. Window state lives in process memory only. A worker crash loses the partial aggregates for windows that hadn't closed yet — the source replay recreates them, but if your close callback has non-idempotent side effects (e.g. writes a billing event), you'll double-emit on the replayed messages.
⚠️ Loose close timing. Quiet partitions close within ~10 s of the deadline, not at a precise instant. Fine for analytics and dashboards; not fine for emissions that must hit a precise wall-clock moment.
❌ No state introspection. The REST state-browsing endpoints and the console only see the persistent state store, so in-memory windows don't show up there.
✅ No state-store overhead per event. Each event is a cheap in-memory merge: no per-event serialisation, no durable write, no commit cost. On workloads where the state-store write path was the dominant cost (cheap aggregates, high cardinality, hot per-key counters), throughput typically improves by anywhere from tens of percent to roughly 2× over the persistent kind. The exact gain depends on the workload, so benchmark both variants on yours (see Configuration).

When to pick which:

Persistent when state correctness across restarts matters, when windows are long enough that losing them hurts (minutes-to-hours), or when emission timing must be sub-second precise.
In-memory when many windows are open at once or the event rate is high enough that the per-event state-store write dominates, the window is short, and the close-callback side-effect is idempotent (or downstream dedupes). The more windows and the higher the throughput, the bigger the win. Typical: per-second top-K, rolling rate gauges, lightweight sampling, short user-activity bursts on high-cardinality keys.

max_active_windows is a safety trigger against unbounded growth: when concurrent windows on a partition exceed it, the oldest one is evicted and emitted early so the partition's memory stays bounded. Pick a value comfortably above the steady-state concurrent-window count for your workload (distinct keys × concurrent windows); the default (1 M) is rarely hit in practice.
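
The eviction behaviour can be sketched as a bounded, insertion-ordered map. This is a toy model under stated assumptions, not the runtime's data structure: BoundedWindows is a hypothetical name, "oldest" is taken to mean the window that was opened first, and the aggregate is a plain sum.

```python
from collections import OrderedDict


class BoundedWindows:
    """Toy model: evict and emit the oldest window once the cap is exceeded."""

    def __init__(self, max_active_windows: int) -> None:
        self.max_active_windows = max_active_windows
        self.open: OrderedDict[tuple, float] = OrderedDict()  # insertion order = age
        self.emitted_early: list[tuple[tuple, float]] = []

    def add(self, window_id: tuple, value: float) -> None:
        # Merge into an existing window, or open a new one.
        self.open[window_id] = self.open.get(window_id, 0.0) + value
        while len(self.open) > self.max_active_windows:
            oldest_id, partial = self.open.popitem(last=False)
            self.emitted_early.append((oldest_id, partial))


wins = BoundedWindows(max_active_windows=2)
wins.add(("a", 0), 1.0)
wins.add(("b", 0), 1.0)
wins.add(("a", 0), 2.0)   # merges into the existing window, no eviction
wins.add(("c", 0), 1.0)   # third concurrent window: the oldest is evicted
```

Note that the evicted window is emitted with a partial aggregate, which is why the cap should sit well above the steady-state count rather than near it.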

Window-to-window references (`Ref` / `join_with`)

Sometimes a window needs to compare its metric against another window over the same key: "the average latency this minute, falling back to the rolling 15-minute average when this minute is sparse", or "this 5-minute error rate against a 24-hour baseline". Rather than re-deriving the baseline inside every reader, one window can reference another window's last-closed result by name, straight from inside its aggregate expression.

Two pieces wire it up:

The referenced window must give its output a name — that's what Ref(...) looks up. Use .alias("name") for a single output, or Multi(name=...) for several.
The reader declares the dependency with join_with=[upstream] and pulls a named value into its own expression with Ref("upstream_name").output_name.

from turbine import KafkaBroker, RecordBatch, TurbineState
from turbine import aggregates as agg, functions as fn
from turbine import windowing as win

# `app` and `kafka` are assumed to be defined elsewhere in the module.
@app.subscribe(kafka.topic("events", event_time="ts", event_time_unit="ms"), output=kafka.topic("metrics"))
class tenant_metrics:
    def __init__(self) -> None:
        # Coarse, longer-horizon baseline — keyed on a SUBSET of the reader's keys.
        self._rollup = win.PersistentTumbling(
            self, name="rollup_15m", size_ms=15 * 60_000, key=["tenant"],
            value=agg.Mean("latency_ms").alias("avg_latency"),   # named → referenceable
            # no close callback: this window exists only to feed the reader
        )
        # Reader: when this 5-min window is sparse, fall back to the rolling baseline.
        self._win = win.PersistentTumbling(
            self, name="metrics_5m", size_ms=5 * 60_000, key=["tenant", "region"],
            value=agg.Mean(fn.coalesce(fn.col("latency_ms"),
                                       win.Ref("rollup_15m").avg_latency)),
            join_with=[self._rollup],
            on_close_each=self._emit,
        )

    def process(self, batch: RecordBatch, state: TurbineState) -> RecordBatch | None:
        self._rollup.update(batch, state)   # both windows see every event —
        self._win.update(batch, state)      # update each one you declared
        return None
    ...

Each window you build still gets its own update() in process() — join_with= wires the reference, not the ingestion. The order of the two update() calls doesn't matter (the reader reads the upstream's last closed value, never its in-flight state — see below).

Rules

Name the output. A bare Ref("w") (no field) is rejected at construction — always pick a named output (.alias(name) or Multi(name=...)). Forcing the field keeps the call site self-documenting and refactor-safe: adding a sibling metric later can't silently change which value resolves.
Last-closed value, per key. A Ref resolves to the upstream's most recently closed value for that key — never its currently-open partial. That makes it deterministic on replay and free of ordering races: a given close always reads the same upstream value, regardless of arrival order during accumulation. Until the upstream has closed at least once for a key, the reference is null, so wrap it in fn.coalesce(...) (or another null-tolerant expression). The currently-open upstream state is never exposed; if you genuinely need it, model a second window of equal cadence and accumulate explicitly.
Keys must be a subset. The referenced window's key= must be a subset of the reader's key= — rollup_15m keyed by ["tenant"] is readable from a reader keyed by ["tenant", "region"], and the reader projects the extra dimensions out when it looks up. Referencing a finer-keyed window is rejected (one lookup would map to many rows). Both windows must also share the same subscription partition_key.
Persistent only. Both ends must be persistent (PersistentTumbling / PersistentSliding / PersistentSession, in any combination). In-memory windows are rejected on either side, because ref values live in the durable state store — where they're persisted and recovered with the rest of the window state, so references survive restart and rescale.
Reference several, no cycles. join_with= takes a list, so one reader can blend baselines: join_with=[w15, w24h] with fn.coalesce(fn.col("x"), win.Ref("w15").avg, win.Ref("w24h").avg). Each Ref resolves independently (N references = N lookups, never combinatorial). Within a tick the runtime closes referenced windows before their readers, so a reader sees the value produced this tick; cycles (A references B references A) are rejected at construction.
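
The lookup semantics in these rules (last-closed value per key, subset-key projection, null until the first close) can be modelled in a few lines. This is an illustrative sketch in plain Python, not the SDK's resolution code: last_closed, resolve_ref and coalesce are hypothetical stand-ins.

```python
from typing import Optional

# Last-closed upstream values, keyed by the upstream's key tuple ("tenant",).
last_closed: dict[tuple, float] = {("acme",): 42.0}

UPSTREAM_KEY = ["tenant"]
READER_KEY = ["tenant", "region"]


def resolve_ref(reader_group: dict[str, str]) -> Optional[float]:
    """Project the reader's group onto the upstream key and look up the value."""
    assert set(UPSTREAM_KEY) <= set(READER_KEY), "upstream key must be a subset"
    lookup = tuple(reader_group[k] for k in UPSTREAM_KEY)
    return last_closed.get(lookup)   # None until the upstream has closed once


def coalesce(*values: Optional[float]) -> Optional[float]:
    """Return the first non-null value, like a SQL-style coalesce."""
    return next((v for v in values if v is not None), None)


seen = resolve_ref({"tenant": "acme", "region": "eu"})       # upstream has closed
unseen = resolve_ref({"tenant": "globex", "region": "us"})   # no close yet
filled = coalesce(None, seen)        # null row value falls back to the baseline
still_null = coalesce(None, unseen)  # nothing to fall back to yet
```

The projection step is why a finer-keyed upstream cannot work: dropping dimensions always yields one lookup key, while adding them would not.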

Tumbling vs sliding as the reference. A sliding upstream makes a good rolling baseline: size_ms=15min, slide_ms=1min refreshes its ref value every minute, so the reader compares against a near-current value rather than one that only updates every 15 minutes. Tumbling-as-reference fits harder horizon resets — a daily or hourly baseline.

Cost. A referenced window writes one ref-value row per key on each close — but only if at least one window actually references it (statically known at construction; references nobody consumes cost nothing). A high-cardinality upstream that closes often (a short slide) writes proportionally more, so weigh that against the cadence you pick for the reference.
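
A back-of-the-envelope estimate makes the cadence trade-off concrete. The sketch below is plain arithmetic with made-up numbers: ref_rows_per_hour is a hypothetical helper, and it assumes every key closes on every close, so the result is an upper bound.

```python
def ref_rows_per_hour(distinct_keys: int, close_interval_ms: int) -> int:
    """Ref-value rows written per hour: one row per key on each close."""
    closes_per_hour = 3_600_000 // close_interval_ms
    return distinct_keys * closes_per_hour


keys = 50_000
sliding_1m = ref_rows_per_hour(keys, 60_000)         # closes every slide_ms = 1 min
tumbling_15m = ref_rows_per_hour(keys, 15 * 60_000)  # closes every size_ms = 15 min
```

With these example figures the 1-minute slide writes 3,000,000 rows per hour against 200,000 for the 15-minute tumbling window, a 15× difference for the fresher baseline.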

Dynamic windows

Most pipelines build their windows eagerly in __init__. When the set of windows isn't known at construction time, you can defer building them until the first batch arrives. The canonical example is a multi-tenant alerting app (examples/alerting_app.py) that creates one PersistentTumbling per (tenant, rule) pair, with tenants and rules added or removed at runtime — you can't enumerate every window in __init__.

The trade-off: until the first batch reaches process() and triggers construction, the SDK has no record of that window. If a previously running instance had open windows that are due to close before any fresh traffic reaches the new instance, they need to be rebuilt before they can fire. The on_recover_persistent_windows(timer_id, state) hook closes that gap: the SDK calls it for each recovered window the processor hasn't built yet, giving you the chance to construct it via your usual lazy code path so it can fire correctly.

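import pyarrow as pa   # needed for pa.RecordBatch.from_pylist in _emit
from turbine import RecordBatch, TurbineState
from turbine import aggregates as agg
from turbine import windowing as win

# `app` and `kafka` are assumed to be defined elsewhere in the module.
@app.subscribe(kafka.topic("events"), output=kafka.topic("counts"))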
class event_counters:
    def __init__(self) -> None:
        self._wins: dict[str, win.PersistentTumbling] = {}

    def _build(self, field: str) -> win.PersistentTumbling:
        if field not in self._wins:
            self._wins[field] = win.PersistentTumbling(
                self,
                name=f"events_per_{field}",
                size_ms=60_000,
                key=[field],
                value=agg.Count(),
                on_close_each=self._emit,
            )
        return self._wins[field]

    def process(self, batch: RecordBatch, state: TurbineState) -> RecordBatch | None:
        self._build("region").update(batch, state)
        self._build("application").update(batch, state)
        return None

    def on_recover_persistent_windows(self, timer_id: str, state: TurbineState) -> None:
        # SDK calls this for any recovered window the processor hasn't built yet.
        # Route by name so the right factory rebuilds.
        name = win.PersistentTumbling.name_from_timer_id(timer_id)
        if name == "events_per_region":
            self._build("region")
        elif name == "events_per_application":
            self._build("application")

    def _emit(self, state: TurbineState, window: win.Window, value: float) -> RecordBatch:
        return pa.RecordBatch.from_pylist([{
            **window.group,                    # "region" or "application" depending on the window
            "count": float(value),
            "window_start_ms": window.start_ms,
            "window_end_ms": window.end_ms,
        }])

Notes:

name_from_timer_id is the routing primitive: it parses the window name back out of a recovered timer id, so the hook can dispatch to the matching factory call.
The hook covers all persistent kinds — PersistentTumbling, PersistentSliding and PersistentSession. Use the matching class's name_from_timer_id(timer_id) when the processor manages a mix.
The in-memory kinds don't need this hook — they have no persistent state, so there's nothing to recover. Eager construction is sufficient (and required).
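
When the number of dynamic windows grows, the if/elif routing in the hook can be replaced by a lookup table. The following is a self-contained sketch of that pattern with stand-in objects rather than real windows: built, build, FACTORIES and recover are all hypothetical names, and the window name is assumed to have already been extracted from the timer id.

```python
from typing import Callable

built: dict[str, str] = {}   # stand-in for the processor's lazily built windows


def build(field: str) -> str:
    """Stand-in for the lazy factory: build once, then reuse."""
    return built.setdefault(field, f"window(events_per_{field})")


# Window name -> factory call, replacing a growing if/elif chain.
FACTORIES: dict[str, Callable[[], str]] = {
    "events_per_region": lambda: build("region"),
    "events_per_application": lambda: build("application"),
}


def recover(window_name: str) -> bool:
    """Rebuild the named window if a factory is registered for it."""
    factory = FACTORIES.get(window_name)
    if factory is None:
        return False   # unknown name: nothing to rebuild
    factory()
    return True


recovered = recover("events_per_region")
unknown = recover("events_per_tenant")
```

Returning a flag for unknown names keeps the decision about logging or raising with the caller, which is a design choice of this sketch rather than SDK behaviour.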

Sessions at scale

Session windows carry one performance characteristic worth knowing before you key one by a very high-cardinality column. Because a session close inherently merges, sessions are computed on a per-group path rather than the vectorised state-table layout that tumbling and sliding windows use — so the work scales with the number of distinct keys a batch touches, not just its row count. Keying by a raw user_id with millions of distinct values, where a single batch touches on the order of 100 k distinct sessions, is the regime where this shows up. It applies to both variants — it's a property of the session shape, not of persistence.

Both variants are built to stay up in that regime; neither collapses. The per-key work is batched (one bulk state load + one bulk write per batch for PersistentSession; one bulk emission per tick for both), so a single partition sustains tens of thousands of rows/s while holding on the order of a million concurrent open sessions, and the two variants land in the same ballpark there. The catch is the absolute level: high-cardinality session throughput is several-fold below what the same window achieves at low cardinality (where it runs at full speed), because the per-group path can't vectorise the way a fixed-boundary window can.

Practical guidance:

Low-to-moderate distinct-key counts — thousands of active keys per batch, the typical "sessions per tenant / per device class / per account" shape — run at full session throughput. There PersistentSession stays within a few percent of the in-memory variant and adds negligible commit cost under exactly-once.
Very high-cardinality keys (hundreds of thousands of distinct sessions per batch) are supported but slower. Spread the load across more shards with partition_key + parallelism (see Partitioning) to recover aggregate throughput, and size the deployment for the reduced per-partition rate. If that still isn't enough, a tumbling approximation keyed on a coarser bucket sidesteps the per-group path entirely.
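
One way to derive the coarser bucket mentioned above is a stable hash of the high-cardinality id. This is a minimal sketch, assuming string ids and a bucket count you choose yourself: coarse_bucket is a hypothetical helper, and how the resulting column is added to the batch and used as the window key is left to your pipeline.

```python
import zlib


def coarse_bucket(user_id: str, n_buckets: int = 1024) -> int:
    """Map a high-cardinality id to one of n_buckets stable buckets."""
    # crc32 is deterministic across processes, unlike the built-in hash().
    return zlib.crc32(user_id.encode("utf-8")) % n_buckets


user_ids = [f"user-{i}" for i in range(100_000)]
buckets = {coarse_bucket(u) for u in user_ids}
```

Here 100,000 distinct ids collapse to at most 1,024 keys. The cost is that results become per-bucket aggregates over fixed boundaries instead of per-user sessions, so this only fits when that approximation is acceptable.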