
Zenoh vs Kafka: Choosing the Right Messaging Backbone for Your System

#architecture #kafka #zenoh #messaging #distributed-systems

You’re building a system where latency is not a nice-to-have; it’s the feature. Maybe it’s a fleet of autonomous vehicles sharing sensor data. Maybe it’s a financial platform where stale prices are dangerous. Maybe it’s an IoT edge network where half your nodes are behind NAT, on spotty connections, running on constrained hardware. You reach for Kafka because that’s what engineers reach for when they think “messaging at scale.” But then you benchmark it, and the numbers don’t work. Kafka’s median latency is in the milliseconds. For your use case, you need microseconds.

This is where Zenoh enters the picture — and why it’s worth understanding what these two systems are actually optimised for, rather than treating them as interchangeable.

The Problem With “Messaging” as a Category

Calling both Kafka and Zenoh “messaging systems” is like calling both a freight train and a courier motorcycle “transportation.” They move things from A to B, yes. But the design decisions underneath are so different that picking the wrong one will cost you either a complete rewrite or years of painful workarounds.

The core split comes down to two fundamentally different problems:

Problem 1: Durable, replayable event streams. You want a log of everything that happened. Consumers can fall behind, replay from any offset, and the data must survive broker restarts. Throughput matters more than latency. Think audit logs, event sourcing, analytics pipelines.

Problem 2: Low-latency, location-transparent communication. You want data to move between any two nodes as fast as the network allows — regardless of topology. Think real-time telemetry, robotics, edge computing, pub/sub across heterogeneous networks.

Kafka was built for Problem 1. Zenoh was built for Problem 2. Understanding this distinction saves you from a lot of pain.

How Kafka Actually Works

Kafka’s design is anchored on the distributed commit log. Every topic is a partitioned, replicated, ordered sequence of records written to disk. Producers append to the log. Consumers read from it by tracking an offset. The broker is the authority — producers push to it, consumers pull from it.

Producer → [Broker Cluster] → Consumer Group

              └── Partitions replicated across brokers
                  (writes go to disk via page cache)

The key architectural insight in Kafka is that disk is fast if you use it sequentially. By writing records append-only to a commit log and relying on the OS page cache for reads, Kafka achieves remarkable throughput. At Revolut, I saw partitions handling hundreds of thousands of events per second without breaking a sweat.

flowchart LR
    P1[Producer A] -->|append| B[Kafka Broker]
    P2[Producer B] -->|append| B
    B -->|replicate| R[Replica Brokers]
    B -->|pull by offset| C1[Consumer Group 1]
    B -->|pull by offset| C2[Consumer Group 2]
    B -->|persistent log| D[(Disk)]

The broker stores the log. Consumers pull from the broker. Nobody talks to each other directly — everything routes through the broker cluster.
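
To make the log-and-offset model concrete, here is a minimal in-memory sketch of a single partition. The `PartitionLog` type and its methods are hypothetical names invented for illustration; real Kafka persists the log to disk, replicates it, and tracks committed offsets in an internal topic.

```rust
use std::collections::HashMap;

/// A toy, in-memory stand-in for one Kafka partition: an append-only list of
/// records plus one committed offset per consumer group.
/// (Hypothetical type for illustration only.)
struct PartitionLog {
    records: Vec<String>,
    committed: HashMap<String, usize>,
}

impl PartitionLog {
    fn new() -> Self {
        PartitionLog { records: Vec::new(), committed: HashMap::new() }
    }

    /// Producers only ever append; the return value is the record's offset.
    fn append(&mut self, record: &str) -> usize {
        self.records.push(record.to_string());
        self.records.len() - 1
    }

    /// Consumers pull: read up to `max` records starting at the group's
    /// committed offset, then advance that offset.
    fn poll(&mut self, group: &str, max: usize) -> Vec<String> {
        let start = *self.committed.get(group).unwrap_or(&0);
        let end = (start + max).min(self.records.len());
        self.committed.insert(group.to_string(), end);
        self.records[start..end].to_vec()
    }

    /// Replay is just moving the offset back; the log itself never changes.
    fn seek(&mut self, group: &str, offset: usize) {
        self.committed.insert(group.to_string(), offset.min(self.records.len()));
    }
}

fn main() {
    let mut log = PartitionLog::new();
    for event in ["created", "paid", "shipped"] {
        log.append(event);
    }
    println!("{:?}", log.poll("billing", 2)); // the first two records
    println!("{:?}", log.poll("analytics", 10)); // independent offset: all three
    log.seek("billing", 0);
    println!("{:?}", log.poll("billing", 10)); // replay from the start
}
```

Each consumer group keeps its own position, which is why a slow analytics consumer never holds back a fast billing consumer, and why replay costs nothing more than rewinding a number.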

The implications of this design are significant. Kafka is broker-centric by nature. Every message touches the broker, which gives you a well-defined place to apply retention policies, replication guarantees, and consumer group management. It also gives you a central point of failure if you misconfigure your replication factor or your ZooKeeper (now KRaft) cluster.

Latency-wise, Kafka optimises for batching. The producer accumulates records and flushes them in batches, which is where the throughput comes from. The cost is added latency for small or infrequent messages. In practice, end-to-end latency for a well-tuned Kafka setup lands between 5 and 15 ms for typical configurations. With linger.ms=0 and aggressive flushing, you can get below 5 ms, but you pay in throughput. You’re not getting sub-millisecond latency without compromising the guarantees Kafka is built around.
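
The batching trade-off is easy to see in a toy model. The sketch below is not Kafka's real record accumulator; `batching_delays` is a hypothetical function that assumes sorted arrival times and a batch that flushes either when it is full or when a linger timer (counted from the batch's first record) expires.

```rust
/// Simplified model of a batching producer (an assumption for illustration,
/// not Kafka's actual accumulator): a batch is flushed when it holds
/// `batch_size` records, or when `linger_us` microseconds have passed since
/// its first record arrived. `arrivals_us` must be sorted ascending.
/// Returns, for each record, how long it waited before being flushed.
fn batching_delays(arrivals_us: &[u64], batch_size: usize, linger_us: u64) -> Vec<u64> {
    let mut delays = Vec::with_capacity(arrivals_us.len());
    let mut i = 0;
    while i < arrivals_us.len() {
        let deadline = arrivals_us[i] + linger_us;
        // Extend the batch while records fit and arrive before the deadline.
        let mut j = i + 1;
        while j < arrivals_us.len() && j - i < batch_size && arrivals_us[j] <= deadline {
            j += 1;
        }
        // A full batch flushes when its last record arrives; a partial batch
        // waits for the linger deadline.
        let flush_at = if j - i == batch_size { arrivals_us[j - 1] } else { deadline };
        for &t in &arrivals_us[i..j] {
            delays.push(flush_at - t);
        }
        i = j;
    }
    delays
}

fn main() {
    // One record per millisecond, timestamps in microseconds.
    let arrivals = [0, 1000, 2000, 3000];
    // Long linger, big batches: every record waits for the timer.
    println!("{:?}", batching_delays(&arrivals, 100, 5000));
    // No linger: each record is sent immediately, one request per record.
    println!("{:?}", batching_delays(&arrivals, 100, 0));
    // Small batches fill quickly, so the timer rarely matters.
    println!("{:?}", batching_delays(&arrivals, 2, 5000));
}
```

With infrequent messages the linger timer dominates the added latency; with no linger the wait disappears, but so does the batching that gives Kafka its throughput.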

How Zenoh Actually Works

Zenoh (pronounced “zeno”) is an Eclipse Foundation project, designed from the ground up for geo-distributed, heterogeneous networks where you can’t assume reliable connectivity, low latency, or even a fixed topology.

The abstraction Zenoh uses is the key/value space: every piece of data has a key, and any node can subscribe to keys matching a pattern. Zenoh handles the routing. Crucially, it does this without requiring a central broker. Nodes can communicate peer-to-peer, through infrastructure nodes (called Zenoh routers), or in a mixed topology, and the protocol adapts transparently.
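
Pattern matching on keys is the heart of this model, so a rough sketch may help. The code below is a simplified stand-in for Zenoh's key expressions, not the library's implementation: it assumes keys are `/`-separated chunks, that `*` matches exactly one chunk, and that `**` matches zero or more chunks, and it ignores the finer-grained sub-chunk wildcards. The function names are hypothetical.

```rust
/// Returns true if the concrete `key` (no wildcards) matches `pattern`.
/// Simplified rules: `*` = exactly one chunk, `**` = zero or more chunks.
fn key_matches(pattern: &str, key: &str) -> bool {
    let p: Vec<&str> = pattern.split('/').collect();
    let k: Vec<&str> = key.split('/').collect();
    chunks_match(&p, &k)
}

fn chunks_match(pattern: &[&str], key: &[&str]) -> bool {
    match pattern.split_first() {
        // Pattern exhausted: match only if the key is exhausted too.
        None => key.is_empty(),
        // `**` may absorb zero or more chunks, so try every split point.
        Some((head, rest)) if *head == "**" => {
            (0..=key.len()).any(|n| chunks_match(rest, &key[n..]))
        }
        // Any other chunk must match literally, or be a single-chunk wildcard.
        Some((head, rest)) => match key.split_first() {
            Some((first, key_rest)) => {
                (*head == "*" || head == first) && chunks_match(rest, key_rest)
            }
            None => false,
        },
    }
}

fn main() {
    // One subscription covers the speed of every car.
    println!("{}", key_matches("sensors/*/speed", "sensors/car1/speed"));
    // `*` does not cross chunk boundaries.
    println!("{}", key_matches("sensors/*/speed", "sensors/car1/gps/speed"));
    // `**` does.
    println!("{}", key_matches("sensors/**", "sensors/car1/gps/speed"));
}
```

A subscriber declares a pattern once, and any publisher whose key matches is routed to it, wherever it sits in the topology.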

flowchart LR
    A[Edge Node A] -->|put: sensors/car1/speed| R[Zenoh Router]
    B[Edge Node B] -->|put: sensors/car2/speed| R
    R -->|route| C[Cloud Subscriber]
    A <-->|peer-to-peer| B
    R -->|bridge| R2[Remote Router]
    R2 --> D[Datacenter Subscriber]

What makes Zenoh genuinely different is its protocol efficiency. The Zenoh wire protocol is designed to be extremely compact, with per-message overhead smaller than MQTT or AMQP headers and far lighter than Kafka’s protocol. This makes it viable on constrained devices (think microcontrollers with 256KB of RAM) where Kafka’s JVM footprint is a non-starter.

Zenoh also supports multiple communication modes in the same session:

  • Publisher/Subscriber — classic pub/sub with pattern matching on keys
  • Get/Reply (queryable) — request/response where any node can serve as a data store
  • Put/Delete — key-value operations that can be persisted in storage backends

The latency profile is fundamentally different from Kafka. In peer-to-peer mode on a local network, Zenoh achieves sub-100 microsecond latencies. Even through a router, it’s routinely in the low hundreds of microseconds range. That’s 20–100x better than a well-tuned Kafka cluster.

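// zenoh/examples/z_pub.rs
// Written against the pre-1.0 Zenoh Rust API (the `r#async` prelude and `.res()` resolvers).
use zenoh::prelude::r#async::*;

#[async_std::main]
async fn main() {
    // Open a session with the default config: no broker address is needed.
    let session = zenoh::open(config::default()).res().await.unwrap();
    // Declare the publisher once and reuse it for every sample.
    let publisher = session
        .declare_publisher("sensors/car1/speed")
        .res()
        .await
        .unwrap();

    loop {
        let payload = "87.3"; // speed in km/h
        publisher.put(payload).res().await.unwrap();
        // Publish at roughly 1 kHz.
        async_std::task::sleep(std::time::Duration::from_millis(1)).await;
    }
}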

The key thing to notice here: no broker address in sight. The default Zenoh config discovers peers automatically on the local network. When you do need to cross network boundaries, you configure router endpoints — but the application code stays identical.

The Alternatives Worth Knowing

Kafka and Zenoh are not the only players. Depending on your constraints, these four systems are worth serious consideration.

NATS / NATS JetStream

NATS is the pragmatic middle ground. The core NATS protocol is extraordinarily simple: a plain-text pub/sub protocol with a tiny server binary (single Go binary, ~20MB). Latency is sub-millisecond on a local network. NATS JetStream adds persistence, consumer group management, and at-least-once delivery with optional exactly-once semantics (via message deduplication and double acknowledgements), bringing it much closer to Kafka’s feature set at the cost of some operational complexity.

flowchart LR
    P[Producer] -->|publish: orders.new| N[NATS Server]
    N -->|fanout| S1[Subscriber 1]
    N -->|fanout| S2[Subscriber 2]
    N -->|persist| J[(JetStream Store)]
    J -->|pull/push| C[Durable Consumer]

Where NATS shines: cloud-native microservices where you want Kafka-like durability without Kafka’s operational weight. JetStream has clustering, replication, and replay. The NATS server cluster is dramatically easier to operate than a Kafka+ZooKeeper (or KRaft) setup.

Where NATS struggles: when you need Kafka’s extreme throughput (millions of events/second across a cluster), the battle-tested ecosystem (Kafka Connect, Kafka Streams, ksqlDB), or strict per-partition ordering at that scale.

MQTT (with a broker like EMQX or Mosquitto)

MQTT was designed for IoT — constrained devices, unreliable networks, and simple publish/subscribe. It’s the protocol behind most consumer IoT deployments (home automation, industrial sensors). MQTT 5 added features like shared subscriptions and message expiry that bring it closer to a general-purpose messaging system.

MQTT’s strength is its ubiquity. Nearly every IoT device speaks it. Hardware SDKs ship with MQTT clients. AWS IoT Core and Azure IoT Hub support it natively, as did Google Cloud IoT Core before it was retired in 2023.

Its weakness is the broker model. EMQX can handle millions of concurrent connections, but you’re still routing everything through a central broker. Horizontal scaling MQTT brokers is harder than it looks, and message ordering is best-effort by default.

Apache Pulsar

Pulsar is Kafka’s most direct challenger in the durable-streaming space. Pulsar was built at Yahoo and later donated to the Apache Software Foundation. Its key architectural departure from Kafka is the separation of compute and storage: Pulsar brokers are stateless, and data lives in Apache BookKeeper. This makes broker scaling independent of storage scaling, which is a meaningful operational advantage at large scale.

flowchart LR
    P[Producer] --> BR[Pulsar Broker]
    BR -->|write| BK[(BookKeeper Ledger)]
    BR -->|serve| C[Consumer]
    BK -->|replicate| BK2[(BookKeeper Replica)]
    ZK[ZooKeeper] -->|metadata| BR

Pulsar also has first-class multi-tenancy, geo-replication built into the protocol (not bolted on), and tiered storage to offload cold data to S3 or GCS. These are real advantages if you’re running a multi-tenant SaaS platform or a globally distributed event store.

The honest downside: Pulsar is significantly more operationally complex than Kafka. In a default deployment you’re managing Pulsar brokers, BookKeeper nodes, and ZooKeeper, three separate systems with their own failure modes. Kafka with KRaft has eliminated its own ZooKeeper dependency; Pulsar still defaults to ZooKeeper for metadata, although recent releases allow alternative metadata stores.

Redis Streams

Redis Streams (added in Redis 5.0) is often overlooked in this conversation, but it deserves a mention for specific use cases. If you’re already running Redis and you need a simple, low-overhead event stream within a single datacenter, Redis Streams is surprisingly capable — consumer groups, at-least-once delivery, XREAD blocking calls.

The ceiling is obvious: Redis is single-threaded for writes, data lives in memory (with optional persistence), and horizontal partitioning requires Redis Cluster which adds its own complexity. Redis Streams tops out at hundreds of thousands of events/second on powerful hardware; Kafka handles millions. But for medium-scale, low-ceremony setups, it’s genuinely underrated.

The Tradeoffs and What Goes Wrong

Kafka: The Operational Weight Is Real

Kafka’s guarantees come with a cost. At Revolut, operating a Kafka cluster that handled real-money transactions meant obsessing over replication settings (replication factor and min.insync.replicas), partition count (you can’t reduce it later without a full topic rebuild), and consumer lag monitoring. Getting this wrong means either data loss or a cascading consumer backlog that takes hours to drain.
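
The interaction between replication factor and min.insync.replicas is mostly arithmetic, sketched below. The function names are hypothetical; the rule they encode is that a producer using `acks=all` has its write accepted only while the partition's in-sync replica set is at least `min.insync.replicas` large.

```rust
/// With `acks=all`, the partition leader accepts a write only while the
/// in-sync replica set has at least `min_insync_replicas` members.
fn write_accepted(in_sync_replicas: u32, min_insync_replicas: u32) -> bool {
    in_sync_replicas >= min_insync_replicas
}

/// How many replicas can fall out of sync before `acks=all` producers
/// start getting errors instead of acknowledgements.
fn tolerated_failures(replication_factor: u32, min_insync_replicas: u32) -> u32 {
    replication_factor.saturating_sub(min_insync_replicas)
}

fn main() {
    // A common pairing: three replicas, at least two must acknowledge.
    let (rf, min_isr) = (3, 2);
    println!("tolerated failures: {}", tolerated_failures(rf, min_isr));
    for in_sync in (0..=rf).rev() {
        println!("in-sync={} accepted={}", in_sync, write_accepted(in_sync, min_isr));
    }
    // Setting min.insync.replicas equal to the replication factor leaves no
    // headroom: a single slow replica blocks every acks=all write.
    println!("tolerated failures: {}", tolerated_failures(3, 3));
}
```

Set the minimum too low and an acknowledged write may live on a single broker; set it equal to the replication factor and routine broker restarts make the topic unwritable.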

The other failure mode engineers discover late: partition count determines parallelism. If you under-partition a topic, you can’t scale consumers beyond the partition count. Over-partitioning wastes resources and slows down leader election during failures. There’s no magic number — it depends on your throughput, your retention period, and your consumer count.
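
A small sketch shows why the partition count caps consumer parallelism. It uses plain round-robin assignment within one consumer group, which is a simplification (Kafka ships several assignment strategies), and `assign` is a hypothetical helper.

```rust
/// Round-robin assignment of partitions to the consumers of one group.
/// Each partition goes to exactly one consumer; the result holds the list
/// of partition ids owned by each consumer.
fn assign(partitions: usize, consumers: usize) -> Vec<Vec<usize>> {
    let mut assignment = vec![Vec::new(); consumers];
    if consumers == 0 {
        return assignment;
    }
    for p in 0..partitions {
        assignment[p % consumers].push(p);
    }
    assignment
}

fn main() {
    // 6 partitions, 2 consumers: each consumer owns three partitions.
    println!("{:?}", assign(6, 2));

    // 3 partitions, 5 consumers: two consumers have nothing to read.
    let assignment = assign(3, 5);
    let idle = assignment.iter().filter(|owned| owned.is_empty()).count();
    println!("{:?} idle={}", assignment, idle);
}
```

Adding a sixth or seventh consumer to a three-partition topic changes nothing except the number of idle processes.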

Kafka also has no native support for request/reply patterns or dynamic routing. Building an RPC layer on top of Kafka is possible (a reply-to topic per consumer), but it’s inelegant and operational overhead compounds quickly.
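
For a sense of what that RPC layer involves, here is an in-memory sketch of the reply-to pattern. Everything in it is hypothetical (the `Message` fields, the `Topics` toy broker, the topic names); the point is only that the request must carry a reply topic and a correlation id by convention, because the log itself knows nothing about requests and responses.

```rust
use std::collections::HashMap;

/// A message on a topic. `reply_to` and `correlation_id` are the extra
/// fields a request/reply layer has to add by convention.
#[derive(Clone, Debug, PartialEq)]
struct Message {
    correlation_id: u64,
    reply_to: String,
    body: String,
}

/// Toy broker: named topics holding queued messages (no partitions, no disk).
#[derive(Default)]
struct Topics {
    queues: HashMap<String, Vec<Message>>,
}

impl Topics {
    fn publish(&mut self, topic: &str, msg: Message) {
        self.queues.entry(topic.to_string()).or_default().push(msg);
    }

    /// Remove and return everything currently queued on `topic`.
    fn drain(&mut self, topic: &str) -> Vec<Message> {
        self.queues.remove(topic).unwrap_or_default()
    }
}

/// The "server": consumes requests and publishes each answer to the
/// requester's own reply topic, echoing the correlation id.
fn serve(topics: &mut Topics) {
    for req in topics.drain("requests") {
        let reply = Message {
            correlation_id: req.correlation_id,
            reply_to: String::new(),
            body: req.body.to_uppercase(),
        };
        topics.publish(&req.reply_to, reply);
    }
}

fn main() {
    let mut topics = Topics::default();
    // Two clients, each with a dedicated reply topic.
    for (id, reply_topic) in [(1u64, "replies.client-a"), (2u64, "replies.client-b")] {
        topics.publish("requests", Message {
            correlation_id: id,
            reply_to: reply_topic.to_string(),
            body: format!("ping {}", id),
        });
    }
    serve(&mut topics);
    println!("{:?}", topics.drain("replies.client-a"));
    println!("{:?}", topics.drain("replies.client-b"));
}
```

Even in this toy form, every client needs its own reply topic and its own bookkeeping to pair responses with requests, which is the overhead that compounds in a real deployment.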

Zenoh: The Ecosystem Gap

Zenoh’s primary weakness right now is ecosystem maturity. Kafka has a decade of battle-tested connectors (Debezium, Confluent ecosystem, hundreds of community connectors), monitoring tools (Cruise Control, Burrow, Confluent Control Center), and cloud-managed offerings (MSK, Confluent Cloud). Zenoh has nothing comparable yet.

If you’re building robotics middleware or an IoT edge platform, that gap may not matter: Zenoh’s ROS 2 integration is first-class, and for peer-to-peer telemetry scenarios the ecosystem you need is smaller. But if you’re building a data pipeline that needs to feed an analytics warehouse, Kafka’s connector ecosystem is a concrete advantage.

There’s also the question of persistence semantics. Zenoh has queryable storage backends, but it’s not a durable log by design. If you need replay, you’re building it on top. Kafka’s offset-based replay is foundational to its model — it’s one less thing to build.

NATS: The Scalability Ceiling

NATS is genuinely easier to operate than Kafka. A three-node JetStream cluster is manageable by a small team. The trade-off is that NATS doesn’t match Kafka at extreme throughput. When you’re pushing millions of events per second with strict ordering guarantees across hundreds of partitions, NATS starts showing strain. This is a ceiling most teams never hit — but it’s worth knowing where it is before you build a system that depends on it.

Takeaways

  • If you need durable, replayable event streams with high throughput and a mature ecosystem: Kafka. Accept the operational complexity as the cost of its guarantees.

  • If you need sub-millisecond latency, peer-to-peer communication, or constrained hardware: Zenoh. Especially compelling for robotics, autonomous systems, and IoT edge. Just plan for the ecosystem gap.

  • If you want Kafka’s semantics with lower operational overhead: NATS JetStream. It’s the pragmatic choice for teams that can’t afford a dedicated Kafka operations specialist.

  • If you’re building a multi-tenant platform with global geo-replication: Apache Pulsar. The stateless broker architecture pays off at that scale, despite the initial setup complexity.

  • If you’re already on Redis and your scale is moderate: Redis Streams. Don’t over-engineer it if Redis already serves your persistence layer.

  • Match the abstraction to the problem. Kafka thinks in partitioned logs; Zenoh thinks in location-transparent key spaces; NATS thinks in subjects. The abstraction shapes how you model your data — pick the one that fits your mental model of the domain.

The most expensive mistake I’ve seen engineers make in this space is choosing a messaging system based on brand recognition rather than latency and durability requirements. Kafka is famous because it solved real problems at LinkedIn and Twitter. That doesn’t mean it’s the right tool for your fleet of edge sensors talking to each other at 1kHz.

If you found this useful or want to discuss it further, connect with me on GitHub or LinkedIn.