Top 5 best practices for using Kafka for data pipelines


Kafka can double up as an event pipeline as well as a data pipeline. In this article, we talk about how to use Kafka as a data pipeline and the best practices surrounding it.

What is Kafka?

Kafka is a message broker system: it receives events and messages and delivers them to the systems that have subscribed to them. It does this very efficiently, even when the messages number in the millions or billions.

Where is Kafka used?

Kafka is used in organizations that deal with a lot of event-driven systems, and it can also replace traditional messaging buses. Kafka helps scale smaller messaging setups into large architectures while keeping the different systems loosely coupled and maintainable. Kafka also has client support for many languages, though it was primarily built for Java.

Best practices for using Kafka

We have used different messaging buses in our work with clients, but Kafka has always been our favorite for various reasons. We have also found that many dev teams struggle with Kafka, so we are documenting our experience as a list of best practices.

1. Use specific topics

A vague, catch-all topic leaves every consumer reading and discarding messages it does not care about. So create specific topics for specific types of events and let the consumers decide which topics to subscribe to.
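As a minimal sketch, here is how narrowly scoped topics can be created with the Java AdminClient. The broker address, the topic names (order-created, order-shipped), and the partition and replication counts are all illustrative assumptions:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.List;
import java.util.Properties;

public class CreateTopics {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // One narrowly scoped topic per event type, instead of a
            // catch-all "orders" topic. Names are illustrative.
            List<NewTopic> topics = List.of(
                    new NewTopic("order-created", 3, (short) 3),
                    new NewTopic("order-shipped", 3, (short) 3));
            admin.createTopics(topics).all().get(); // block until the topics exist
        }
    }
}
```

A consumer that only cares about shipments subscribes to order-shipped alone and never has to filter out creation events.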

2. Multi-partition is the way to go

Give each topic multiple partitions. Partitions are Kafka's unit of parallelism: the partitions of a topic are spread across the consumers in a consumer group, so the partition count caps how many consumers can work on a topic in parallel.
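As a sketch, a consumer only needs a shared group.id to join a group; run several copies of this process and Kafka splits the topic's partitions among them. The broker address, topic, and group name are assumptions carried over from the example above:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class OrderWorker {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "order-workers"); // consumers sharing this id split the partitions
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("order-created"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Each instance of this process owns a disjoint subset of
                    // the partitions, so the work is parallelized.
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```

With three partitions, up to three such workers can consume in parallel; a fourth would sit idle.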

3. Seek to a known offset when recovering from a failure

When recovering from a failure, seek back to the last successfully processed offset and continue from there. If messages end up being re-delivered in the process, make sure the duplicates do not break data consistency, which in practice means keeping the processing idempotent.
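Here is a sketch of that recovery path with a manually assigned partition. The loadSavedOffset helper is hypothetical; it stands in for wherever your pipeline persists its progress, such as the same database the processed data lands in:

```java
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

import java.util.List;

public class RecoveringConsumer {
    // Hypothetical helper: returns the last offset this pipeline
    // successfully processed and recorded for the given partition.
    static long loadSavedOffset(TopicPartition tp) { return 0L; }

    static void recover(KafkaConsumer<String, String> consumer) {
        TopicPartition tp = new TopicPartition("order-created", 0);
        consumer.assign(List.of(tp)); // manual assignment for explicit control
        consumer.seek(tp, loadSavedOffset(tp)); // resume from the last processed offset

        // poll() now continues from the saved position. Anything processed but
        // not yet recorded before the crash will be re-delivered, so downstream
        // writes must be idempotent, e.g. upserts keyed by a message id.
    }
}
```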

4. Ordering must be an architecture constraint

Ordering can be a deal breaker in an architecture, so make sure this constraint is well-thought-out and designed for. Kafka does preserve message ordering, but only if we are careful with the partitions: ordering is maintained within a partition, not across partitions. If related events must be processed in order, route them to the same partition, typically by giving them the same message key.
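A minimal producer sketch of key-based routing follows. Assuming the order id makes sense as a key in your data model (an assumption, like the order-status topic name), every event for one order lands on the same partition and so stays in order:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class OrderEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String orderId = "order-42"; // illustrative key
            // Same key => same partition => these two events stay ordered
            // relative to each other. Events for other orders may interleave.
            producer.send(new ProducerRecord<>("order-status", orderId, "created"));
            producer.send(new ProducerRecord<>("order-status", orderId, "shipped"));
        }
    }
}
```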

5. Take care of replication

Replication is what keeps data consistent and available during failures, so configure it deliberately: give each topic a replication factor greater than one and require acknowledgement from the in-sync replicas, so that a write is not lost when a single broker goes down.
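A sketch of the settings involved, with illustrative values: a replication factor of 3, min.insync.replicas of 2, and acks=all on the producer together mean each write is acknowledged by at least two brokers before it is confirmed, so losing a single broker loses no data:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class ReplicatedTopicSetup {
    public static void main(String[] args) throws Exception {
        Properties adminProps = new Properties();
        adminProps.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(adminProps)) {
            // 3 replicas per partition; at least 2 must be in sync for writes.
            NewTopic topic = new NewTopic("order-status", 3, (short) 3)
                    .configs(Map.of("min.insync.replicas", "2"));
            admin.createTopics(List.of(topic)).all().get();
        }

        // On the producer side, wait for all in-sync replicas to acknowledge
        // each write before treating it as successful.
        Properties producerProps = new Properties();
        producerProps.put("acks", "all");
    }
}
```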

Conclusion

Kafka is a great tool when used properly. We hope these best practices help you get more mileage out of it.

