Apache Kafka is a distributed streaming platform for publishing and subscribing to streams of records; it can also serve as an enterprise messaging system. It is fast, horizontally scalable, and fault-tolerant. Kafka has four core APIs:
- Producer API: allows clients to connect to the brokers in the cluster and publish streams of records to one or more Kafka topics (see the producer sketch after this list).
- Consumer API: allows clients to connect to the brokers in the cluster and consume streams of records from one or more Kafka topics (a consumer sketch also follows the list).
- Streams API: allows clients to act as stream processors, consuming streams from one or more input topics and producing streams to one or more output topics, transforming the input streams into output streams along the way (a small Streams example follows the list).
- Connector API: allows writing reusable producer and consumer code. For example, to read data from an RDBMS and publish it to a topic, or to consume data from a topic and write it back to an RDBMS, we can create reusable source and sink connector components for various data sources.
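To make the Producer API concrete, here is a minimal Java sketch; the topic name `orders` and the `localhost:9092` broker address are assumptions for illustration:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // The bootstrap server address is an assumption; point it at your cluster.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish one record to a hypothetical "orders" topic.
            producer.send(new ProducerRecord<>("orders", "order-1", "created"));
            producer.flush();
        }
    }
}
```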
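And a matching Consumer API sketch, assuming the same topic and broker address plus a hypothetical consumer group `orders-consumers`:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed address
        props.put("group.id", "orders-consumers");        // hypothetical group
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));
            while (true) {
                // Poll the brokers for new records on the subscribed topics.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records)
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                            record.partition(), record.offset(), record.key(), record.value());
            }
        }
    }
}
```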
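Finally, a minimal Kafka Streams sketch that consumes from one topic, transforms each value, and produces to another output topic; the application ID and both topic names are assumptions:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class UppercaseStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-app"); // hypothetical
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Consume from an input topic, transform each value, produce to an output topic.
        KStream<String, String> input = builder.stream("orders");
        input.mapValues(value -> value.toUpperCase()).to("orders-uppercased");

        new KafkaStreams(builder.build(), props).start();
    }
}
```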
Use cases
Kafka is commonly used for the following use cases:
- Messaging System: Kafka is used as an enterprise messaging system to decouple source and target systems while exchanging data. Compared to JMS, Kafka provides higher throughput through partitioning and fault tolerance through replication.
- Web Activity Tracking: tracking user journey events on a website for analytics and offline data processing.
- Log Aggregation: collecting logs from various systems, especially in distributed environments such as microservices architectures where services are deployed on many hosts, and making them available in a central place for analysis.
- Metrics Collection: Kafka is used to collect metrics from various systems and networks for operations monitoring. Kafka metrics reporters are available for monitoring tools such as Ganglia and Graphite.
Broker
An instance in a Kafka cluster is called a broker.
If you connect to any single broker in a Kafka cluster, you can access the entire cluster. The broker instance that we connect to in order to reach the cluster is known as a bootstrap server.
Each broker is identified by a numeric ID in the cluster.
Three brokers is a good starting point for a Kafka cluster, but there are production clusters with hundreds of brokers.
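As a minimal sketch of this bootstrap behavior (the `localhost:9092` address is an assumption), connecting to a single broker with the AdminClient is enough to discover every broker in the cluster along with its numeric ID:

```java
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.Node;

public class DescribeCluster {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // One reachable broker acts as the bootstrap server for the whole cluster.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            for (Node node : admin.describeCluster().nodes().get()) {
                // Each broker is identified by a numeric ID in the cluster.
                System.out.printf("broker id=%d host=%s:%d%n", node.id(), node.host(), node.port());
            }
        }
    }
}
```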
Topic
A topic is a logical name to which records are published. Internally, a topic is divided into partitions, and that is where the data is actually written. These partitions are distributed across the brokers in the cluster. For example, if a topic has three partitions and the cluster has three brokers, each broker holds one partition. Data published to a partition is append-only, with the offset incremented for each new record.
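As a sketch, the three-partition topic from the example above could be created with the AdminClient; the topic name `orders`, the replication factor, and the broker address are assumptions:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Three partitions spread across the brokers, each replicated three times.
            NewTopic topic = new NewTopic("orders", 3, (short) 3);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```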
Below are some points to remember when working with partitions:
- Topics are identified by name. We can have many named topics in a cluster.
- The order of messages is maintained at the partition level, not across the partitions of a topic.
- Once data is written to a partition, it is never overwritten. This is called immutability.
- Messages in partitions are stored with keys, values, and timestamps. For a given key, Kafka guarantees that messages are published to the same partition (see the sketch after this list).
- Each partition has a leader in the cluster that handles the read/write operations for that partition.
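To make the key-to-partition guarantee concrete, here is a minimal sketch (the topic name, key, and broker address are assumptions): sending several records with the same key and printing the returned metadata shows them landing on the same partition with increasing offsets.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class KeyedProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 3; i++) {
                // Same key => same partition, so per-key ordering is preserved.
                RecordMetadata meta = producer
                        .send(new ProducerRecord<>("orders", "customer-42", "event-" + i))
                        .get();
                System.out.printf("partition=%d offset=%d%n", meta.partition(), meta.offset());
            }
        }
    }
}
```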