Essential Kafka Client Configurations for Production

Kafka's defaults are fine for development, but they're a liability in production. This guide covers the essential configurations to ensure reliability, performance, and operational stability in real-world applications.

Apache Kafka is an incredibly powerful and stable platform. However, many of its out-of-the-box settings are not a solid foundation for production. Many of these defaults are the product of historical decisions and have been kept for the sake of backwards compatibility. In this article, we provide a set of recommendations based on our extensive experience.

This guide is not about hyper-optimizing for the likes of LinkedIn or Netflix, where millions of messages per second are the norm. It’s about a handful of crucial configuration changes that will make your Kafka cluster reliable, robust, and easy to operate in any standard business environment.

Our Recommendations for the Producer Configuration

The producer’s job is to send data to Kafka. Your primary concern should be ensuring that data is delivered reliably, with no loss and no duplicates, while maintaining good performance and keeping resource consumption low.

These two settings are non-negotiable for any production application where data integrity is crucial.

  • acks=all: (the default since Kafka 3.0) Ensures that the producer waits until the message has been written to all in-sync replicas.

  • enable.idempotence=true: (the default since Kafka 3.0) Ensures that messages are delivered exactly once. The guarantee even survives broker restarts.

enable.idempotence=true guarantees that messages produced by a single producer instance are written to Kafka exactly once and in the correct order. It does not guarantee that a consumer will process each message exactly once; for that, you need transactions.

  • delivery.timeout.ms: (default: 120000, i.e. 2 minutes) The producer keeps retrying to send a message for up to this amount of time. After the timeout, it gives up and reports an error.

If your own code simply retries after this timeout anyway, it is usually better to increase delivery.timeout.ms instead. Read more about reliability in this article about reliable producing.

  • client.id={hostname}: (default: auto-generated) Set it for better monitoring and logging. If you are using Kubernetes, set it to the hostname of the pod. A minimal configuration sketch follows below.
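
The following is a minimal sketch of a Java producer that combines these reliability settings. The broker address, topic name, and the hostname lookup via the HOSTNAME environment variable are placeholders you would adapt to your environment.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class ReliableProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");      // placeholder address
            props.put(ProducerConfig.ACKS_CONFIG, "all");                          // wait for all in-sync replicas
            props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");           // no duplicates, correct order
            props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, "120000");        // keep retrying for 2 minutes
            props.put(ProducerConfig.CLIENT_ID_CONFIG, System.getenv("HOSTNAME")); // pod name on Kubernetes

            try (KafkaProducer<String, String> producer =
                     new KafkaProducer<>(props, new StringSerializer(), new StringSerializer())) {
                producer.send(new ProducerRecord<>("orders", "key", "value"));     // hypothetical topic and record
            }
        }
    }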

Throughput Tuning

For most "medium data" applications, you can get a significant performance boost by simply batching your messages without sacrificing (too much) latency.

  • batch.size=900000 (for example): (default: 16 KiB) We believe this default is too small for most use cases; increase it to up to 1 MB. A larger batch lets the producer send more data in a single network request, reducing overhead and improving throughput.

  • linger.ms=5, 10 or 100: (default: 0 ms) This setting works hand in hand with batch.size. It tells the producer to wait up to that many milliseconds for a batch to fill up before sending it. A few milliseconds of lingering can significantly improve throughput with only a minor impact on latency.

  • compression.type=lz4 or zstd: (default: none) Compressing messages before they are sent to Kafka saves network bandwidth and improves throughput; see the snippet below.
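
As a rough sketch, these settings extend the producer properties from the example above; the exact values (here 900 KB batches, 10 ms linger, lz4) depend on your message sizes and latency budget.

    props.put(ProducerConfig.BATCH_SIZE_CONFIG, "900000");    // ~900 KB per batch instead of the 16 KiB default
    props.put(ProducerConfig.LINGER_MS_CONFIG, "10");         // wait up to 10 ms for batches to fill
    props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4"); // or "zstd" for a better compression ratio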

Additional Settings

  • transactional.id={hostname}: This setting is only needed if you are using producer transactions. The transactional.id must be unique per producer instance but stable across restarts. When a producer instance restarts, Kafka uses this ID to prevent "zombie" producers from committing transactions. In Kubernetes, this is a key reason to use StatefulSets for your transactional producers, as they provide a stable identity. A sketch of the transactional flow follows below.

  • partitioner=murmur2_random: (default: murmur2-based hashing in the Java client, consistent_random in librdkafka) The partitioner determines which partition a message is sent to. If you are using both Java and non-Java producers, set the partitioner of all non-Java producers to murmur2_random so that the same key always lands on the same partition.
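
If you do use transactions, the produce flow looks roughly like the following sketch, which extends the producer properties from above; the topic name and record contents are made up for illustration.

    props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, System.getenv("HOSTNAME")); // stable identity, e.g. the StatefulSet pod name

    try (KafkaProducer<String, String> producer =
             new KafkaProducer<>(props, new StringSerializer(), new StringSerializer())) {
        producer.initTransactions();      // registers the transactional.id and fences zombie producers
        producer.beginTransaction();
        try {
            producer.send(new ProducerRecord<>("payments", "order-42", "captured")); // hypothetical topic
            producer.commitTransaction();
        } catch (Exception e) {
            producer.abortTransaction();  // read_committed consumers never see the aborted records
            throw new RuntimeException(e);
        }
    }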

Consumer Configuration

A consumer’s job is to read and process data from Kafka topics. Proper configuration is key to building a stable, resilient, and manageable application.

  • group.id={service-name}: All consumers that belong to the same group.id will work together as a single unit to consume messages from a topic. Use a clear, descriptive name for your consumer group, such as the name of your service.

  • client.id={hostname}: Similar to the producer, giving your consumer a meaningful ID is essential for monitoring and debugging.

  • isolation.level=read_committed: (default: read_uncommitted) Makes sure that your consumer code never sees messages from uncommitted or aborted transactions. It is a best practice to set this on all consumers by default, even if you are not currently using transactions.

If your producer is transactional and your consumer is configured with isolation.level=read_uncommitted, the consumer can read data from transactions that were aborted, which leads to severe data integrity issues. Always use read_committed.

  • auto.offset.reset=earliest: (default: latest) This setting dictates what happens when a consumer starts up and cannot find a valid offset for a topic partition. If you want your consumers to read all messages from the beginning, set it to earliest. If you are OK with missing messages produced before the group first committed offsets, set it to latest.

  • enable.auto.commit=true: (default: true) Handles automatic offset commits. Disable this only if you know exactly what you are doing (like when handling transactions manually).

  • auto.commit.interval.ms=5000: (default: 5000) This setting determines how often the consumer’s offsets are committed to Kafka. The default of 5 seconds means that, in the event of a crash, at most about 5 seconds of data will be reprocessed. Do not change this unless you have a very specific reason to do so. A consumer sketch combining these settings follows below.
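
Put together, a consumer configuration along these lines reflects the recommendations above; the group name, broker address, topic, and hostname lookup are placeholders.

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class OrderServiceConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");      // placeholder address
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-service");            // descriptive service name
            props.put(ConsumerConfig.CLIENT_ID_CONFIG, System.getenv("HOSTNAME")); // pod name on Kubernetes
            props.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");
            props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
            props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "true");            // default, kept explicit
            props.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, "5000");       // default, kept explicit

            try (KafkaConsumer<String, String> consumer =
                     new KafkaConsumer<>(props, new StringDeserializer(), new StringDeserializer())) {
                consumer.subscribe(List.of("orders"));                              // hypothetical topic
                while (true) {
                    for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                        System.out.printf("%s -> %s%n", record.key(), record.value());
                    }
                }
            }
        }
    }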

If you need exactly-once processing, you either need a consume-process-produce loop with transactions (sketched below) or Kafka Streams with the setting processing.guarantee=exactly_once_v2.
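
A sketch of the consume-process-produce pattern could look like this. It combines the transactional producer and the consumer from the earlier examples; the output topic and the process method are hypothetical, the consumer needs enable.auto.commit=false, and offsets are committed through the producer as part of the transaction.

    producer.initTransactions();
    consumer.subscribe(List.of("orders"));
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        if (records.isEmpty()) continue;
        producer.beginTransaction();
        try {
            Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
            for (ConsumerRecord<String, String> record : records) {
                producer.send(new ProducerRecord<>("orders-enriched", record.key(), process(record.value())));
                offsets.put(new TopicPartition(record.topic(), record.partition()),
                            new OffsetAndMetadata(record.offset() + 1));
            }
            // Commit the consumed offsets inside the same transaction instead of via the consumer
            producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
            producer.commitTransaction();
        } catch (Exception e) {
            producer.abortTransaction();  // the batch will be re-read and reprocessed
        }
    }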

Avoiding Rebalances in Kubernetes and Improving Cost-Efficiency

Rebalances happen when consumers join or leave a consumer group. In automated infrastructures like Kubernetes, pod failures are handled by the platform, so we can often avoid expensive rebalances in Kafka.

  • group.instance.id={hostname}: Makes sure that a consumer "remembers" its previous partition assignment after a restart and thus avoids an expensive rebalance. But: the consumer must come back within session.timeout.ms, otherwise a rebalance is triggered anyway.

  • session.timeout.ms=60000 (for example): (default: 45000) This setting defines the maximum time a consumer can be unavailable before it is considered failed and the consumer group rebalances. Increase it to a value that gives the infrastructure enough time to detect the failure, restart the pod, and let the consumer rejoin. Only do this together with group.instance.id.

group.instance.id and session.timeout.ms should be used together, and they only make sense if the identity of the pod survives the restart, for example with a StatefulSet for your consumers in Kubernetes.

  • group.protocol=consumer: (default: classic) Use the next generation of the consumer rebalance protocol (see KIP-848 for more information). Requires Kafka 4.0 or higher and makes rebalancing much faster.

If your Kafka cluster is older than 4.0, you cannot use the new consumer rebalance protocol and have to stay on the classic protocol. You can still speed up rebalances by setting partition.assignment.strategy=CooperativeStickyAssignor (default: RangeAssignor, CooperativeStickyAssignor).

  • client.rack={dc-name}: If you are running your Kafka cluster and your clients in a multi-availability-zone (AZ) cloud environment, set broker.rack on the brokers and client.rack on the consumers. The consumer then prefers reading from a replica in the same AZ, which can significantly reduce inter-AZ data transfer costs. A snippet with these settings follows below.
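
A sketch of the additional consumer properties for this setup, extending the consumer example above; the rack name is a placeholder for your availability zone.

    props.put(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG, System.getenv("HOSTNAME")); // stable StatefulSet pod name
    props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "60000");                  // give Kubernetes time to restart the pod
    props.put(ConsumerConfig.CLIENT_RACK_CONFIG, "eu-central-1a");                 // placeholder: the AZ this pod runs in
    // On Kafka 4.0+ you can switch to the new rebalance protocol instead;
    // the session timeout is then managed on the broker side:
    // props.put(ConsumerConfig.GROUP_PROTOCOL_CONFIG, "consumer");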

Topic Configuration

A topic’s configuration is the blueprint for how Kafka handles your data. The correct settings here are fundamental to the reliability and performance of your entire system.

Reliability and Performance

  • num.partitions=12: (default: 1) 12 is a good default for the partition count because it can be divided evenly among 1, 2, 3, 4, 6, or 12 consumers. Read this article for more information.

  • replication-factor=3: (default: 1) The number of times a partition’s data is replicated across brokers. 3 is the gold standard for most production environments, providing high availability at a reasonable cost.

  • min.insync.replicas=2: (default: 1) The number of in-sync replicas that must be available for producers to write to a partition. With a replication factor of 3, setting this to 2 means that one broker can fail without affecting production and another one can fail without losing data. An example of creating such a topic with the Admin client follows below.
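
One way to apply these settings is at topic creation time with the Java Admin client; the topic name and bootstrap address are placeholders in this sketch.

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateOrdersTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");   // placeholder address

            try (Admin admin = Admin.create(props)) {
                NewTopic topic = new NewTopic("orders", 12, (short) 3)             // 12 partitions, replication factor 3
                        .configs(Map.of("min.insync.replicas", "2"));
                admin.createTopics(List.of(topic)).all().get();
            }
        }
    }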

Data Retention and Lifecycle

  • cleanup.policy: (default: delete) This defines how Kafka handles old messages. The default is delete, which means messages are deleted after a set period. For use cases like change data capture or event sourcing, you might use compact, which removes old messages for the same key.

  • retention.ms: (default: 604800000, i.e. 7 days) This setting controls how long Kafka retains messages before they are deleted. The default is 7 days, but you should adjust it based on your business requirements; an example of changing it on an existing topic follows below.
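
Reusing the Admin client from the previous example, the retention of an existing topic could be adjusted roughly like this; the topic name and the 3-day value are illustrative.

    ConfigResource ordersTopic = new ConfigResource(ConfigResource.Type.TOPIC, "orders");
    AlterConfigOp setRetention = new AlterConfigOp(
            new ConfigEntry("retention.ms", "259200000"),   // 3 days, purely illustrative
            AlterConfigOp.OpType.SET);
    admin.incrementalAlterConfigs(Map.of(ordersTopic, List.of(setRetention))).all().get();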

Conclusion

Sadly, in Kafka, the default settings are often not "good enough" for production. I hope this article helped you to find the right settings for your production environment. If you have any questions, please feel free to contact us any time.

About Anatoly Zelenin
Hi, I’m Anatoly! I love to spark that twinkle in people’s eyes. As an Apache Kafka expert and book author, I’ve been bringing IT to life for over a decade—with passion instead of boredom, with real experiences instead of endless slides.
