Kafka & Partitioning

This page covers the Kafka partitioning constraints Turbine imposes for stateful workloads, the role of partition_key for sub-partition parallelism, repartition_by= and combine_by= for grouping on a column the producer didn't key by, message_key for keying the output, and broker-scoped topic references.

Topic Partitioning

For any stateful subscription (one that uses state.put, windows, or partition_key), the input Kafka topic's partition count must be a divisor of 960. The 28 valid counts are:

Range	Valid partition counts
Small (dev, single-broker)	1, 2, 3, 4, 5, 6, 8, 10, 12
Medium (3-6 broker clusters)	15, 16, 20, 24, 30, 32
Large (high-resource deployments)	40, 48, 60, 64, 80, 96, 120
Very large (1 partition per Turbine worker thread)	160, 192, 240, 320, 480, 960

Common Kafka choices that are not valid: 7, 9, 11, 13, 17, 100, 128, 256, 512. The boot banner warns when a subscribed topic uses an invalid count and suggests the closest valid alternatives; turbine-cli rescale rejects invalid --new-partitions values with the same suggestion.
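
The table above is simply the list of divisors of 960, so it can be regenerated or checked in a few lines. The sketch below is illustrative only: `valid_partition_counts` and `closest_valid` are hypothetical helper names, not part of Turbine or turbine-cli, and the real tools may pick their suggestions differently.

```python
from typing import List, Optional, Tuple

GROUPS = 960  # fixed number of internal state groups, as stated on this page


def valid_partition_counts(groups: int = GROUPS) -> List[int]:
    """Every divisor of `groups`, in ascending order."""
    return [n for n in range(1, groups + 1) if groups % n == 0]


def closest_valid(count: int, groups: int = GROUPS) -> Tuple[Optional[int], Optional[int]]:
    """Nearest valid counts at or below and at or above `count`.

    Hypothetical helper: it only mimics the kind of suggestion the boot
    banner is described as making.
    """
    valid = valid_partition_counts(groups)
    below = max((n for n in valid if n <= count), default=None)
    above = min((n for n in valid if n >= count), default=None)
    return below, above


if __name__ == "__main__":
    print(len(valid_partition_counts()))  # 28
    print(closest_valid(100))             # (96, 120)
    print(closest_valid(128))             # (120, 160)
```

A count that is already valid comes back as both neighbours, for example `closest_valid(24)` returns `(24, 24)`.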

Why this constraint exists. Turbine shards state into a fixed number of internal groups (960) so that the topic can be repartitioned without moving state around. Each Kafka partition owns a contiguous range of those groups, and rows produced with Kafka's standard partitioner land on the worker that already holds their state — but only when the partition count divides the group count evenly. Picking a non-divisor count works at steady state but cannot be rescaled cleanly later.
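
To make the "contiguous range of groups" idea concrete, here is a toy model of group ownership. It assumes partition p owns the p-th block of groups in ascending order; the page only says the ranges are contiguous, so the exact ordering, and the `group_range` name, are assumptions made for illustration.

```python
GROUPS = 960  # fixed internal group count, as stated on this page


def group_range(partition: int, partition_count: int) -> range:
    """Contiguous block of groups owned by one partition.

    Assumption: blocks are handed out in ascending partition order.
    """
    if GROUPS % partition_count != 0:
        # A non-divisor count cannot split 960 groups into equal whole blocks.
        raise ValueError(f"{partition_count} does not divide {GROUPS}")
    if not 0 <= partition < partition_count:
        raise ValueError("partition index out of range")
    width = GROUPS // partition_count
    return range(partition * width, (partition + 1) * width)


if __name__ == "__main__":
    print(group_range(0, 12))   # range(0, 80)
    print(group_range(11, 12))  # range(880, 960)
    print(group_range(1, 24))   # range(40, 80)
```

In this model every group belongs to exactly one partition for any valid count, so a rescale reassigns whole groups and never has to split one.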

Stateless subscriptions (no state parameter, no windows, no partition_key) are unaffected — any partition count works.

Repartitioning a Stream (`repartition_by`)

A stateful subscription can only group correctly on a column when all records sharing a value of that column arrive at the same worker — which is true when the producer keyed the topic by it, and false otherwise. repartition_by= closes that gap:

from turbine import KafkaBroker, Turbine

app = Turbine(app_id="fraud")  # explicit app_id required
kafka = KafkaBroker(bootstrap="localhost:9092")

@app.subscribe(kafka.topic("transactions"), repartition_by="user_id")
def per_user(batch, state):
    ...  # every record of a given user_id arrives here together,
         # even though "transactions" is keyed by tenant_id

Declaring repartition_by="user_id" inserts a re-key hop in front of your handler: Turbine re-emits each record onto an internal topic, keyed and placed by the user_id value (using the same hash as Kafka's standard Java partitioner), and your handler consumes that internal topic instead. Records keep their original bytes and their original broker timestamps — event-time windows and with_kafka_timestamp behave exactly as they would on the source. The subscription's partition_key automatically follows repartition_by.
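
For readers who want to predict where a given value will land: Kafka's standard Java partitioner hashes the key bytes with 32-bit murmur2, clears the sign bit, and takes the result modulo the partition count. The sketch below reproduces that calculation in plain Python. It assumes the column value is hashed as its UTF-8 bytes, which this page does not state, and `partition_for` is a hypothetical name, not a Turbine API.

```python
def murmur2(data: bytes) -> int:
    """32-bit murmur2 with the seed Kafka uses, returned as an unsigned int."""
    mask = 0xFFFFFFFF
    m = 0x5BD1E995
    length = len(data)
    h = (0x9747B28C ^ length) & mask
    body_end = length - length % 4
    # Mix the input four bytes at a time, little-endian.
    for i in range(0, body_end, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * m) & mask
        k ^= k >> 24
        k = (k * m) & mask
        h = (h * m) & mask
        h ^= k
    # Fold in the remaining 0 to 3 bytes.
    tail = data[body_end:]
    if len(tail) == 3:
        h ^= tail[2] << 16
    if len(tail) >= 2:
        h ^= tail[1] << 8
    if len(tail) >= 1:
        h ^= tail[0]
        h = (h * m) & mask
    # Final avalanche.
    h ^= h >> 13
    h = (h * m) & mask
    h ^= h >> 15
    return h


def partition_for(key: str, partition_count: int) -> int:
    """Partition a non-null key maps to under Kafka's keyed placement."""
    # Assumption: the column value is hashed as its UTF-8 encoding.
    return (murmur2(key.encode("utf-8")) & 0x7FFFFFFF) % partition_count


if __name__ == "__main__":
    for user_id in ("alice", "bob", "alice"):
        print(user_id, partition_for(user_id, 24))
```

The same key always maps to the same partition for a fixed partition count, which is what lets every record of one user_id reach the same handler instance. Null keys are left out of the sketch because the page does not say how they are placed beyond "all on one partition".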

What you control:

repartition_by="col" — the grouping column. Must exist in the decoded payload and be a string column. Rows where it is null are grouped together (they all land on one partition, like a SQL null group).
repartition_partitions=N (optional) — partition count of the internal topic. Defaults to the source topic's count. Prefer a divisor of 960 (same rule as any stateful input, see above).

What Turbine guarantees:

Several subscriptions declaring the same (topic, repartition_by) share one hop and one internal topic — the re-emit happens once, and each subscription remains an independent consumer of it.
The hop inherits your delivery guarantee: if any sharing subscription is exactly_once, the hop runs exactly-once too, and end-to-end exactly-once holds across the chain (each stage commits its own transaction; readers only see committed records).
Undecodable records follow the input topic's on_error / dlq policy at the hop, same semantics as everywhere else.

What it costs / does not do:

One extra broker round-trip per record (produce + re-consume), plus the internal topic's storage. This is the price of a correct cross-key shuffle; there is no hidden worker-to-worker networking.
Latency across the hop is bounded by the hop's batch cadence (and, under exactly-once, by its transaction commits) — sub-second in practice, but not zero.
It does not retract or reorder anything: it is an identity re-emit, only the placement changes.

Operational notes. The internal topic is created automatically on the input topic's cluster, named turbine-{app_id}-rep-{topic}-by-{column} — this is why an explicit app_id is required (two apps with a default id would collide). It is a normal topic: it shows up in your broker tooling, can be inspected for debugging, and follows the broker's default retention. If it already exists with a different partition count than requested, Turbine refuses to start (changing the count would re-route every key); delete the topic deliberately or match its count. Renaming your handler does not change the internal topic or lose its position — the hop's identity comes from the topic and the column, not the handler name.

Monitoring the hop. The web console's Topology page draws the hop as a single "repartition by column" node: its input rate and lag are the upstream side (is the hop keeping up with the source topic?), while the lag shown on each downstream subscription is the downstream side (are your handlers keeping up with the re-keyed stream?). Expanding the node reveals the internal topic's real name and the per-partition output distribution — a single dominant bar means one hot key concentrates the downstream load, which no amount of extra partitions will fix. A "null keys" counter appears when rows are missing the grouping column. The same signals are exported on /metrics (turbine_repartition_*) for alerting.

One thing Turbine does not watch for you: the internal topic's disk footprint. It follows the broker's default retention, so on a high-volume stream, set a retention policy on it (or monitor it) with your broker tooling like any other large topic.
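
Because the internal topic's name is deterministic, a retention override can be scripted. The following is a minimal sketch, assuming the stock `kafka-configs.sh` tool is available and that the app id, topic, and column are substituted into the name verbatim (no escaping rule is described on this page). The helper names and the 24-hour value are illustrative only.

```python
import shlex
from typing import List


def repartition_topic_name(app_id: str, topic: str, column: str) -> str:
    """Internal topic name, following the pattern given on this page."""
    return f"turbine-{app_id}-rep-{topic}-by-{column}"


def retention_command(bootstrap: str, topic: str, retention_ms: int) -> List[str]:
    """argv for kafka-configs.sh that sets a per-topic retention.ms override."""
    return [
        "kafka-configs.sh",
        "--bootstrap-server", bootstrap,
        "--alter",
        "--entity-type", "topics",
        "--entity-name", topic,
        "--add-config", f"retention.ms={retention_ms}",
    ]


if __name__ == "__main__":
    name = repartition_topic_name("fraud", "transactions", "user_id")
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(retention_command("localhost:9092", name, 24 * 60 * 60 * 1000)))
```

Choose a retention long enough to cover the worst downstream lag you expect, since the handlers read from this topic rather than from the source.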

Cross-Key Aggregation (`combine_by`)

combine_by= solves the same problem as repartition_by= — "group by a column the producer didn't key the topic by" — with a different strategy built for windowed aggregation. Instead of moving every record across the broker, each worker pre-aggregates its own slice locally and ships only the partial aggregation state; a merge stage combines the partials per (window, key) and runs your close callback:

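from turbine import KafkaBroker, Turbine
# `win` and `agg` are Turbine's window and aggregate helpers; their import path is not shown on this page

app = Turbine(app_id="metrics")  # explicit app_id required
kafka = KafkaBroker(bootstrap="localhost:9092")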

@app.subscribe(kafka.topic("events"), combine_by="user_id", output=kafka.topic("scores"))
class per_user:
    def __init__(self):
        self._w = win.PersistentTumbling(
            self, name="risk", size_ms=60_000,
            key=["user_id"],
            value=agg.Mean("cpu_usage"),
            on_close=self.emit,
        )

    def process(self, batch, state):
        self._w.update(batch, state)   # pre-aggregates locally
        # must return None: results come out of on_close, below

    def emit(self, state, panes):
        return panes.to_batches()[0]   # published to "scores"

The handler looks exactly like the repartition_by version of the same program. What changes is the wire cost: with a million events per minute spread over ten thousand users, repartition_by re-emits a million records through the broker; combine_by ships ten thousand small accumulator states. The aggregates are identical either way.
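
The pre-aggregate-then-merge idea can be shown without Turbine at all. This toy model, for a single window and a Mean aggregate, has each worker fold its slice into a (sum, count) pair per key and a merge step combine the pairs. The function names are hypothetical, and the real partial-state format is not described on this page.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Partial = Dict[str, Tuple[float, int]]  # key -> (sum, count)


def pre_aggregate(records: Iterable[Tuple[str, float]]) -> Partial:
    """One worker's local pass over its own slice of the window."""
    acc = defaultdict(lambda: [0.0, 0])
    for key, value in records:
        acc[key][0] += value
        acc[key][1] += 1
    return {key: (total, count) for key, (total, count) in acc.items()}


def merge_means(partials: List[Partial]) -> Dict[str, float]:
    """Merge stage: combine every worker's partials, then finish the mean."""
    acc = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for key, (total, count) in partial.items():
            acc[key][0] += total
            acc[key][1] += count
    return {key: total / count for key, (total, count) in acc.items()}


if __name__ == "__main__":
    worker_0 = [("u1", 10), ("u2", 20), ("u1", 30)]
    worker_1 = [("u1", 50), ("u2", 60)]
    partials = [pre_aggregate(worker_0), pre_aggregate(worker_1)]
    print(sum(len(p) for p in partials), "partials for", len(worker_0) + len(worker_1), "records")
    print(merge_means(partials))  # {'u1': 30.0, 'u2': 40.0}
```

In this model each worker ships one partial per key it actually saw, so the number of partials is bounded by workers times distinct keys and does not grow with the number of events per key.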

Choosing between the two:

	`repartition_by`	`combine_by`
What crosses the broker	every record	one partial state per key per window
Best when	you need the raw rows grouped (arbitrary handler logic, joins, row-level output)	you aggregate with the built-in aggregate functions
Handler constraints	none	one tumbling window, mergeable aggregates, `process()` returns `None`
Wire cost	grows with event volume	grows with distinct-key count

They are mutually exclusive on a subscription — declaring both raises at subscribe() time. There is no case where stacking them wins: once records are re-keyed by a column, every key lives on one worker and there is nothing left for a combine stage to merge; and combining works regardless of how the input is placed, so a re-key in front of it buys nothing. Pick one per subscription.

What you control:

combine_by="col" — the aggregation axis. Must be a string column and a member of the window's key= (a multi-column key like ["tenant_id", "user_id"] with combine_by="user_id" is fine). Null values form one group, like a SQL null group. The subscription's partition_key automatically follows combine_by; declaring a different one is refused.
combine_partitions=N (optional) — partition count of the internal topic, which is the merge stage's scaling axis. Defaults to the source topic's count.
Everything else is your normal subscription: decode config, error policy and batching apply to the ingesting side; output and the delivery guarantee describe your output.

What Turbine guarantees:

Merged aggregates equal a single-stage GROUP BY over all events — the built-in aggregate functions (Sum, Count, Mean, Min/Max, Variance, First/Last, the sketches, Multi bundles of them) merge across workers without approximation beyond what the aggregate itself documents.
A window closes only once every source partition has confirmed that it has shipped its data for that window, so a slow partition delays results rather than corrupting them. The delay is observable through the turbine_combine_min_watermark_lag_ms metric.
Under processing_guarantee="exactly_once", partials ride the same transactions as everything else: a crash and restart never loses a window's contribution and never emits a pane twice.
Under at-least-once, a redelivered partial after a window already closed is dropped and counted (turbine_combine_late_partials_total), never merged twice into a fresh duplicate pane. As with any at-least-once pipeline, a redelivery before the close can double-count — use exactly-once when the aggregates must be exact.
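
The close rule and the late-partial rule can be modelled in a few lines. The sketch below is a simplified illustration, not Turbine's implementation: `MergeStage`, its methods, and the use of a per-partition watermark in milliseconds are all assumptions chosen to mirror the guarantees listed above.

```python
from typing import Dict, Iterable, List, Optional, Set


class MergeStage:
    """Toy merge stage: closes a window once every partition has passed it."""

    def __init__(self, source_partitions: Iterable[int], size_ms: int) -> None:
        self.size_ms = size_ms
        # None means the partition has not confirmed any progress yet.
        self.watermark: Dict[int, Optional[int]] = {p: None for p in source_partitions}
        self.open_windows: Set[int] = set()
        self.closed_windows: Set[int] = set()
        self.late_partials = 0

    def on_partial(self, window_start: int) -> bool:
        """Accept a partial; return False if it was dropped as late."""
        if window_start in self.closed_windows:
            self.late_partials += 1  # counted, never merged into a new pane
            return False
        self.open_windows.add(window_start)
        return True

    def on_confirm(self, partition: int, watermark_ms: int) -> List[int]:
        """Record a partition's progress; return windows that can now close."""
        previous = self.watermark[partition]
        if previous is None or watermark_ms > previous:
            self.watermark[partition] = watermark_ms
        if any(w is None for w in self.watermark.values()):
            return []  # a silent partition holds every window open
        low = min(self.watermark.values())
        ready = sorted(w for w in self.open_windows if w + self.size_ms <= low)
        self.open_windows.difference_update(ready)
        self.closed_windows.update(ready)
        return ready


if __name__ == "__main__":
    stage = MergeStage(source_partitions=[0, 1], size_ms=60_000)
    stage.on_partial(0)
    print(stage.on_confirm(0, 60_000))  # []  (partition 1 has not confirmed)
    print(stage.on_confirm(1, 60_000))  # [0]
    print(stage.on_partial(0))          # False (late, dropped and counted)
```

The minimum over all partitions is what makes a slow or silent partition visible as lag instead of as a wrong result.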

What it costs / current limits:

Results appear roughly one window-boundary after the window ends (the merge stage waits for every source partition's confirmation, which flows at window-boundary cadence) — bounded and deterministic, not immediate.
One tumbling persistent window per handler; sliding and session windows, early-fire, and cross-window references are not combinable today.
Custom aggregate implementations must declare themselves mergeable (IS_MERGEABLE + a merge() that is order-insensitive); Turbine refuses the subscription otherwise and names the offending aggregate.
A source partition whose records are all dropped by your error policy never confirms progress, and the merge stage waits on it indefinitely — visible as a frozen turbine_combine_min_watermark_lag_ms.
Not available in cluster (Raft) mode yet.
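
As an illustration of what "mergeable" means for a custom aggregate, here is a standalone value-range aggregate (max minus min). Only the `IS_MERGEABLE` flag and the `merge()` method are named on this page; the base class, `update()`, and `result()` are hypothetical stand-ins, so treat this as a sketch of the property rather than of Turbine's aggregate interface.

```python
from typing import Iterable, Optional


class ValueRange:
    """Tracks max - min; merging two instances gives the same answer in any order."""

    IS_MERGEABLE = True  # flag named on this page

    def __init__(self) -> None:
        self.low: Optional[float] = None
        self.high: Optional[float] = None

    def update(self, value: float) -> None:  # hypothetical method name
        self.low = value if self.low is None else min(self.low, value)
        self.high = value if self.high is None else max(self.high, value)

    def merge(self, other: "ValueRange") -> "ValueRange":
        # min and max are commutative and associative, so order does not matter.
        for bound in (other.low, other.high):
            if bound is not None:
                self.update(bound)
        return self

    def result(self) -> Optional[float]:  # hypothetical method name
        return None if self.low is None else self.high - self.low


def build(values: Iterable[float]) -> ValueRange:
    """Helper for the demo: fold a list of values into one aggregate."""
    agg = ValueRange()
    for v in values:
        agg.update(v)
    return agg


if __name__ == "__main__":
    print(build([3, 9]).merge(build([1, 4])).result())  # 8
    print(build([1, 4]).merge(build([3, 9])).result())  # 8
```

An aggregate such as "last value seen" would not pass this check without extra state (for example a timestamp), because its result depends on the order in which partials are merged.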

Operational notes. Same model as the repartition hop: an internal topic named turbine-{app_id}-cmb-{topic}-{window_name} is created automatically (explicit app_id required; count mismatches with an existing topic are refused; broker-default retention — watch its footprint yourself). The identity comes from the window name, so renaming the handler changes nothing, while renaming the window starts a fresh internal topic — the same contract as window state. The console's Topology page draws the pair as one "combine by column" node feeding your subscription; the turbine_combine_* metric family covers partial/marker volume, late drops and the merge stage's watermark lag.

`partition_key` & Sub-Partition Parallelism

TODO

Dedicated explainer needed:

What partition_key is and why you'd use it (sharding inside a single Kafka partition for higher per-partition throughput on free-threaded Python).
The parallelism=N knob and how records are dispatched across shards.
The windowing constraint (every window must be keyed by the same column as the subscription) — already enforced at construction time.
Trade-offs vs. simply increasing the Kafka topic's partition count (sub-partition parallelism is intra-process; multi-partition is cluster-wide).
Migration path between the two when scaling up.

Keying the Output (`message_key`)

By default the messages a handler publishes carry no Kafka key. Set an output key when you need downstream consumers to see each entity's messages on the same partition (per-entity ordering, co-location) or when the output topic is log-compacted (compaction keeps the last message per key).

The output key is a column of the RecordBatch your handler returns. Name it on the output topic:

from turbine import KafkaBroker, Turbine

kafka = KafkaBroker(bootstrap="localhost:9092")
app = Turbine(app_id="scoring")

@app.subscribe(
    input=kafka.topic("transactions"),
    output=kafka.topic("scores", message_key="tenant_id"),
)
def score(batch):
    ...
    return result  # each row's `tenant_id` becomes that message's Kafka key

Behaviour and rules:

Per-row keying. Every produced message is keyed by that row's value of the named column. Two rows with the same value are guaranteed to land on the same partition of the output topic.
Type. The column must be a string (Utf8). A missing or non-string column fails fast at startup with a clear error.
Null rows produce an unkeyed message (no key), which Kafka spreads across partitions — the documented behaviour for a row whose key value is null.
The column stays in the message value. Keying does not strip it; tenant_id remains a field of the JSON payload as well.
Requires an output. message_key= is meaningful only on an output= topic; it has no meaning on an input= topic and is rejected there.
Distinct from partition_key. message_key sets the raw Kafka record key on messages you produce; partition_key is the keying axis of the stream you consume (state routing, sharding). They often name the same column but answer different questions.
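
The per-row rules above can be summarised as a small model of how rows turn into (key, value) pairs. This is a sketch under stated assumptions: rows are plain dicts instead of a RecordBatch, values are JSON-encoded, and the type check runs per row here, whereas the page describes a schema check at startup. `to_kafka_messages` is a hypothetical name.

```python
import json
from typing import Any, Dict, List, Optional, Tuple


def to_kafka_messages(
    rows: List[Dict[str, Any]], message_key: str
) -> List[Tuple[Optional[bytes], bytes]]:
    """Turn rows into (key, value) pairs following the rules listed above."""
    messages = []
    for row in rows:
        if message_key not in row:
            raise KeyError(f"output has no column named {message_key!r}")
        key_value = row[message_key]
        if key_value is not None and not isinstance(key_value, str):
            raise TypeError(f"{message_key!r} must be a string column")
        # A null key value yields an unkeyed message.
        key = None if key_value is None else key_value.encode("utf-8")
        # The key column stays in the payload; keying does not strip it.
        value = json.dumps(row).encode("utf-8")
        messages.append((key, value))
    return messages


if __name__ == "__main__":
    rows = [
        {"tenant_id": "t1", "score": 0.5},
        {"tenant_id": None, "score": 0.1},
    ]
    for key, value in to_kafka_messages(rows, "tenant_id"):
        print(key, value)
```

Both rows keep `tenant_id` in the value; only the first one carries a Kafka key.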

This is not repartition_by

message_key keys a topic you publish to for a downstream consumer's benefit. It does not move data between Turbine workers or change how this app is partitioned — for that (grouping on a column your producer didn't key by), see repartition_by above. Placement uses Kafka's standard keyed partitioner (consistent per key), which is all that co-partitioning and compaction require.

Brokers as objects (`input=` / `output=`)

A subscription's endpoints are broker-scoped topic references, not bare strings. KafkaBroker is a handle to a Kafka cluster; broker.topic(name) (and broker.topic(name, message_key=...) for an output) names a topic on it:

from turbine import KafkaBroker, Turbine

kafka = KafkaBroker(bootstrap="localhost:9092")
app = Turbine()

# producer
@app.subscribe(input=kafka.topic("transactions"), output=kafka.topic("scores"))
def score(batch): ...

# sink (no output)
@app.subscribe(kafka.topic("metrics"))
def audit(batch): ...

input is required and is the first positional argument — pass a topic reference (kafka.topic(...)), never a bare string.
Omit output for a sink; set output=kafka.topic(name, message_key=...) to publish (and optionally key — see above).
An app talks to one Kafka cluster: every input/output topic must be on a broker whose bootstrap matches the app's brokers=. A topic on a different cluster is refused at startup.
