Introducing Apache Kafka


Marvyn Hoang

When we build a large system, we always have at least two main challenges to solve: how to handle a large volume of data and how to analyze it. Kafka emerged as a solution: it is designed specifically for distributed, high-throughput systems.
This blog is written for developers who want to use Kafka in a service. If you want to dig deeper into Kafka, please read the official documentation at https://kafka.apache.org. Everything below is what I think you need to know before implementing a service that uses Kafka.


What is a Messaging System (MS)?

A messaging system is responsible for transferring data from one service/application to another. There are two types of messaging system: point-to-point and publish-subscribe. Most messaging systems follow the pub-sub pattern.


Point-to-point pattern

In a point-to-point system, messages are persisted in a queue. One or more consumers can consume the messages in the queue, but a particular message can be consumed by at most one consumer. Once a consumer reads a message from the queue, that message is removed.

Publish-subscribe pattern

In a publish-subscribe (producer-consumer) pattern, messages are persisted in a topic. A consumer can subscribe to one or many topics and receives all the messages in those topics.


What is Apache Kafka?

Now we know that Apache Kafka is a messaging system. More specifically, Apache Kafka is a distributed data store optimized for ingesting and processing streaming data in real time. Streaming data is data that is continuously generated by thousands of data sources, which typically send the data records simultaneously. A streaming platform needs to handle this constant influx of data and process it sequentially and incrementally. Kafka offers both a pub-sub and a queue-based messaging system.

Kafka main components

The main Kafka components are topics, producers, consumers, consumer groups, clusters, brokers, partitions, replicas, leaders, and followers. But if you are a developer who only works with Kafka through code and services, then you just need to know about topics, brokers, producers, consumers, consumer groups and partitions.

Image from Narayan Kumar - Medium

Producer

A producer (or publisher) is an application that is the source of the data stream. Kafka provides a producer API that helps build messages (records) carrying some data and deliver them to the brokers/servers. Every message sent from a producer needs a destination: a topic on a broker.
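
To make that concrete, here is a minimal sketch of a producer using the official Java client (org.apache.kafka:kafka-clients). The broker address, topic name, key and payload are placeholders chosen for illustration.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // address of any broker in the cluster
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // A record carries the destination topic, an optional key and the payload
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("hold_message_topic", "some-key", "hello kafka");
            producer.send(record); // asynchronous send; the client batches and delivers to the broker
        }
    }
}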

Broker - Topic - Partition

A Kafka broker (Kafka server) is a node of a Kafka cluster. A Kafka cluster consists of one or more brokers.
Each broker can host one or more topics. Kafka topics are divided into a number of partitions. The partitions of a topic can be spread across separate machines, which allows multiple consumers to read from a topic in parallel. Within each partition, messages sent from producers are stored with ascending offsets.
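
As an illustration of how topics and partitions come into being, here is a small sketch using the Java AdminClient; the topic name, partition count and replication factor are arbitrary values chosen for the example.

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // any broker of the cluster

        try (AdminClient admin = AdminClient.create(props)) {
            // A topic with 3 partitions; each partition is kept on 1 broker (replication factor 1)
            NewTopic topic = new NewTopic("hold_message_topic", 3, (short) 1);
            admin.createTopics(Collections.singleton(topic)).all().get(); // block until the broker confirms
        }
    }
}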

Consumer - Consumer group

A Kafka consumer is the one that consumes or reads data from the Kafka cluster via a topic. A consumer also knows from which broker it should read the data. The consumer reads the data within each partition in order: it will not read the message at offset 1 before reading the one at offset 0. Consumers can also read data from multiple brokers at the same time.
A Kafka consumer group is a group of multiple consumers. Each consumer in a group reads from partitions assigned exclusively to it. If there are more consumers than partitions, some of the consumers will be in an inactive state. Then, if we lose any active consumer in the group, an inactive one can take over, become active and read the data.
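
Below is a minimal consumer sketch with the Java client, again with placeholder broker, topic and group names. Every consumer started with the same group.id joins the same consumer group and gets its own share of the partitions.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "group_id_1"); // consumers sharing this id form one consumer group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("hold_message_topic"));
            while (true) {
                // poll() returns the next batch of records from the partitions assigned to this consumer
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}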

Summary and pseudo-code

Okayyy, if you are new to Kafka and the matrix of text above hasn't fully clicked yet, don't worry. We will now use this pseudo-code to see how the pieces fit together.


Suppose we have a broker with the address 1.1.1.1:9092,
and we work with a topic named hold_message_topic.

[Global config or Environment]

const BROKER_ADDRESS = "1.1.1.1:9092"
const TOPIC_NAME = "hold_message_topic"

[Producer service]

import Producer from kafka;
// ...
//Create a new producer instance with the broker address
Producer producer = new Producer(BROKER_ADDRESS)
public void kafkaProducerHandler(String msg) {
    //Wrap the message with the producer API
    Producer.Record record = new Producer.Record(msg)
    //Send the message to the topic
    producer.send(record, TOPIC_NAME)
}

[Consumer service]

import Consumer from kafka;
//...
const GROUP_ID = "group_id_1" //Consumer group id
const POLL_NUMBER = 1 //Maximum number of messages fetched per poll/commit
const POLL_TIME = 1000 //Delay between consumer polls (ms)
const READ_OFFSET = "latest" //Consumer only reads messages produced from now on
public void kafkaConsumerHandler() {
    //Create a new consumer instance with the broker address and group id
    Consumer consumer = new Consumer(BROKER_ADDRESS, GROUP_ID)
    //Start consuming the topic with the config above
    consumer.consume(TOPIC_NAME, {READ_OFFSET, POLL_TIME})
    //Handle each message after it is received
    while (true) {
        Consumer.Record consumerRecords = consumer.poll(POLL_NUMBER)
        for (record : consumerRecords) {
            printOrHandleSomething(record)
        }
    }
}

Some problems when using Kafka

How to choose the number of topics/partitions

If you are producing 100 messages per second and each of your consumers can process 10 messages per second, then you will want at least 10 partitions (produce rate / consume rate) with 10 instances of your consumer. If you want this topic to be able to handle future growth, you will want to increase the partition count even higher so that you can add more consumer instances to handle the new volume.
Note: Kafka assigns at most one consumer of a group to each partition. If there are more consumers than partitions, Kafka runs out of partitions to assign and the extra consumers sit idle.
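
As a back-of-the-envelope sketch of that rule (the throughput figures below are just the assumptions from the example above, not measurements):

public class PartitionSizing {
    public static void main(String[] args) {
        double targetThroughput = 100;      // messages/second the topic must absorb
        double producerPerPartition = 100;  // messages/second one producer can push into a partition (assumed)
        double consumerPerPartition = 10;   // messages/second one consumer instance can process (assumed)

        // Take the larger of the producer-side and consumer-side requirements
        // so that neither side becomes the bottleneck
        int partitions = (int) Math.ceil(Math.max(
                targetThroughput / producerPerPartition,
                targetThroughput / consumerPerPartition));

        System.out.println("minimum partitions: " + partitions); // prints 10 for these numbers
    }
}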

Rebalance

When a new consumer joins a consumer group, or an existing consumer goes down, the set of consumers attempts to "rebalance" the load and reassign partitions to each consumer. If the set of consumers changes while this assignment is taking place, the rebalance will fail and be retried; Kafka gives up after a configurable maximum number of attempts.
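
For reference, here is a sketch of the consumer settings that most commonly influence when a rebalance is triggered. The group id and the values are illustrative only; they would go into the same Properties object as in the consumer sketch above.

import java.util.Properties;

public class RebalanceSettings {
    // Returns consumer properties that influence group rebalancing; the values are illustrative only
    public static Properties rebalanceProps() {
        Properties props = new Properties();
        props.put("group.id", "group_id_1");
        // How long the group coordinator waits without heartbeats before declaring a consumer dead
        props.put("session.timeout.ms", "45000");
        // How often the consumer sends heartbeats to the group coordinator
        props.put("heartbeat.interval.ms", "3000");
        // Maximum time between two poll() calls before the consumer is considered stuck and removed
        props.put("max.poll.interval.ms", "300000");
        return props;
    }
}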

When and why to use Kafka


Image from Rinu Gour - Medium

The first thing anyone working with streaming applications should understand is the concept of an event. An event is an atomic piece of data. For example, when a user registers with the system, that action creates an event. You can also think of an event as a message with data.
You should consider using Kafka if:

  • Your application needs real-time data processing: many modern systems require data to be processed as soon as it becomes available. Kafka is useful here since it can transmit data from producers to data handlers and then on to data stores
  • Your application needs activity tracking, logging or monitoring: it is possible to publish logs into Kafka topics. The logs can be stored in a Kafka cluster for some time, where they can be aggregated or processed. It is possible to build pipelines consisting of several producers/consumers where the logs are transformed in a certain way
  • Microservices: Kafka can be used in microservices as well. They benefit from Kafka by using it as a central intermediary that lets them communicate with each other, following the publish-subscribe model

Extra: Why not RabbitMQ?


Image from Tarun Batra

Of course, every design has trade-offs. We are not saying Kafka is greater than RabbitMQ, but there are three application-level differences to weigh when choosing what to use:

  • Kafka supports re-reading of consumed messages while RabbitMQ doesn't (see the sketch after this list).
  • Kafka supports ordering of messages within a partition, while RabbitMQ only supports it under constraints such as one exchange routing to one queue with a single consumer.
  • Kafka is faster at publishing data to partitions than RabbitMQ.
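
To illustrate the first point, re-reading in Kafka just means moving the consumer's offset back. Here is a minimal sketch with the Java client, assuming the consumer has already joined its group and received a partition assignment (for example after an initial poll()):

import java.time.Duration;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ReplayExample {
    // Rewind every partition currently assigned to this consumer to its earliest available offset,
    // so previously consumed messages are delivered again on the next poll()
    static ConsumerRecords<String, String> replayFromBeginning(KafkaConsumer<String, String> consumer) {
        consumer.seekToBeginning(consumer.assignment());
        return consumer.poll(Duration.ofSeconds(1));
    }
}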