Apache Kafka is a distributed streaming platform originally developed at LinkedIn and later open-sourced as an Apache top-level project. It is built for large-scale, high-throughput, real-time message streams, with design goals of high performance, low latency, and fault tolerance. It is commonly used for log collection, event sourcing, real-time data pipelines, and message queuing.
Simply put, Kafka is a messaging system built on the publish-subscribe model:
- Producer: Sends messages to Kafka.
- Consumer: Reads messages from Kafka.
- Kafka Cluster: Responsible for storing and distributing messages.
Kafka Core Concepts
1. Topic
A Topic is a category of messages in Kafka, similar to a table name in a database. Producers send messages to a specific Topic, and consumers subscribe to messages from specified Topics.
For example, in an e-commerce project you might create two Topics, order-events and user-events, to carry order-related and user-related events respectively.
2. Partition
- Each Topic can be divided into multiple Partitions, which are key to Kafka’s high throughput and parallel processing.
- Messages within a partition are ordered, but there’s no ordering guarantee across partitions.
- Partitions can be distributed across different servers (Brokers), enabling horizontal scaling.
- By dividing a Topic into multiple partitions, consumers can consume in parallel, increasing processing speed.
3. Broker
- A Kafka cluster consists of multiple servers, each called a Broker. Brokers are responsible for storing and managing messages.
- Multiple partitions of a Topic are distributed across different Brokers, enhancing fault tolerance (if one Broker goes down, others can still work).
4. Producer
Producers are responsible for sending messages to specified Topics. When sending, they can specify the partition or let Kafka automatically assign it (based on the message key).
For example: messages keyed by order ID always go to the same partition, so events for a given order are processed in order.
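A minimal Go sketch of key-based partitioning, using the github.com/IBM/sarama client that the development section below builds on (the broker address, topic name, and key here are illustrative):

package main

import (
    "log"

    "github.com/IBM/sarama"
)

func main() {
    config := sarama.NewConfig()
    // sarama's hash partitioner (the default) routes equal keys to the same partition
    config.Producer.Partitioner = sarama.NewHashPartitioner
    config.Producer.Return.Successes = true

    producer, err := sarama.NewSyncProducer([]string{"localhost:9092"}, config)
    if err != nil {
        log.Fatalf("Failed to create producer: %v", err)
    }
    defer producer.Close()

    // Both events share the key "order-1001", so they land in the same
    // partition and are consumed in the order they were produced.
    for _, status := range []string{"created", "paid"} {
        msg := &sarama.ProducerMessage{
            Topic: "order-events",
            Key:   sarama.StringEncoder("order-1001"),
            Value: sarama.StringEncoder(status),
        }
        if _, _, err := producer.SendMessage(msg); err != nil {
            log.Fatalf("Failed to send message: %v", err)
        }
    }
}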
5. Consumer
- Consumers subscribe to one or more Topics to read messages.
- Consumers can form Consumer Groups, where partitions are automatically assigned among group members, achieving load balancing.
- Each partition can only be consumed by one consumer in a consumer group, but multiple consumer groups can independently consume the same Topic.
6. Offset
- Each message in a partition has a unique offset, similar to an array index.
- Consumers record the offset to mark which message they’ve consumed, so they can continue from the last offset when restarted.
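To make offsets concrete, here is a minimal sketch using sarama's partition-level consumer (topic name, partition number, and broker address are illustrative). The consumer picks a starting offset and reads forward; each delivered message carries its offset:

package main

import (
    "fmt"
    "log"

    "github.com/IBM/sarama"
)

func main() {
    consumer, err := sarama.NewConsumer([]string{"localhost:9092"}, sarama.NewConfig())
    if err != nil {
        log.Fatalf("Failed to create consumer: %v", err)
    }
    defer consumer.Close()

    // Start from the oldest retained offset of partition 0; a concrete
    // offset (e.g. 42) could be passed instead to resume mid-stream.
    pc, err := consumer.ConsumePartition("my-topic", 0, sarama.OffsetOldest)
    if err != nil {
        log.Fatalf("Failed to start partition consumer: %v", err)
    }
    defer pc.Close()

    for msg := range pc.Messages() {
        fmt.Printf("offset=%d value=%s\n", msg.Offset, string(msg.Value))
    }
}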
7. Replication
- Kafka supports partition replication to improve data reliability. If the Broker hosting a partition's Leader goes down, a Follower replica can take over.
- Leader and Follower mechanism: The Leader handles read and write requests, while Followers synchronize data.
8. Zookeeper
- Kafka uses Zookeeper to manage cluster metadata, such as Broker status, partition Leader election, etc.
- Although Kafka has been reducing its dependency on Zookeeper (newer versions introduce KRaft, the Kafka Raft consensus mode that removes Zookeeper entirely), it's still important to understand its role.
Kafka Basic Principles
Kafka’s core design is based on the following points:
Log-Structured Storage
- Kafka stores messages as append-only logs, rather than traditional database CRUD operations.
- This design makes write speeds extremely fast, suitable for high-throughput scenarios.
- Messages are not deleted immediately but cleaned up automatically based on configured retention period or size.
Distributed Architecture
- Through partitions and replicas, Kafka achieves high availability and scalability.
- Producers and consumers can operate in parallel, and throughput scales roughly linearly as Brokers and partitions are added.
Zero-Copy
Kafka uses the operating system’s zero-copy technology to transfer data directly from disk to network, avoiding user space copying and improving performance.
Push-Pull Model Combination
Producers push messages to Kafka, and consumers pull messages, combining the advantages of both:
- Producers don’t need to care about consumer status.
- Consumers can consume at their own pace, avoiding overload.
What You Need to Master in Development
1. Installation and Configuration
Deploying Kafka: download Kafka (2.8 or a 3.x release is recommended), then start Zookeeper and the Kafka server.
Basic configuration:
- broker.id: Unique identifier for each Broker.
- log.dirs: Message log storage path.
- zookeeper.connect: Zookeeper address.
- num.partitions: Default number of partitions for new Topics.
- log.retention.hours: Log retention time.
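For reference, a minimal server.properties excerpt using these settings might look like this (values are illustrative; 168 hours is Kafka's default retention):

broker.id=0
log.dirs=/var/lib/kafka/logs
zookeeper.connect=localhost:2181
num.partitions=3
log.retention.hours=168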
2. Producer Development
Dependency: the examples use Go's github.com/IBM/sarama library.
Code example:
package main

import (
    "fmt"
    "log"

    "github.com/IBM/sarama"
)

func main() {
    // Configure producer
    config := sarama.NewConfig()
    config.Producer.RequiredAcks = sarama.WaitForAll // Wait for all in-sync replicas to acknowledge
    config.Producer.Retry.Max = 5                    // Number of retries on failure
    config.Producer.Return.Successes = true          // Required for SyncProducer: report successful sends

    // Create producer
    producer, err := sarama.NewSyncProducer([]string{"localhost:9092"}, config)
    if err != nil {
        log.Fatalf("Failed to create producer: %v", err)
    }
    defer func() {
        if err := producer.Close(); err != nil {
            log.Fatalf("Failed to close producer: %v", err)
        }
    }()

    // Create message
    msg := &sarama.ProducerMessage{
        Topic: "my-topic",
        Key:   sarama.StringEncoder("key"),
        Value: sarama.StringEncoder("value"),
    }

    // Send message
    partition, offset, err := producer.SendMessage(msg)
    if err != nil {
        log.Fatalf("Failed to send message: %v", err)
    }
    fmt.Printf("Message sent successfully, partition: %d, offset: %d\n", partition, offset)
}
Key points:
- Configure RequiredAcks to control the acknowledgment level (NoResponse, WaitForLocal, WaitForAll).
- Choose between synchronous (SyncProducer) and asynchronous (AsyncProducer) sending; see the sketch below.
- Set the retry mechanism through Producer.Retry.Max.
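For comparison, here is a minimal AsyncProducer sketch, assuming the same local broker and example topic as above. Sends are non-blocking writes to the Input() channel, and delivery results come back on the Successes() and Errors() channels:

package main

import (
    "log"

    "github.com/IBM/sarama"
)

func main() {
    config := sarama.NewConfig()
    config.Producer.Return.Successes = true // deliver acks on the Successes() channel

    producer, err := sarama.NewAsyncProducer([]string{"localhost:9092"}, config)
    if err != nil {
        log.Fatalf("Failed to create async producer: %v", err)
    }
    defer producer.AsyncClose()

    // Sending does not block on the broker round trip
    producer.Input() <- &sarama.ProducerMessage{
        Topic: "my-topic",
        Value: sarama.StringEncoder("value"),
    }

    // For a one-message sketch, wait for the ack inline; a real service
    // would drain Successes() and Errors() in dedicated goroutines.
    select {
    case msg := <-producer.Successes():
        log.Printf("Message sent, partition: %d, offset: %d", msg.Partition, msg.Offset)
    case perr := <-producer.Errors():
        log.Printf("Failed to send message: %v", perr)
    }
}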
3. Consumer Development
Code example:
package main

import (
    "context"
    "fmt"
    "log"
    "os"
    "os/signal"
    "sync"

    "github.com/IBM/sarama"
)

func main() {
    // Configure consumer
    config := sarama.NewConfig()
    config.Consumer.Return.Errors = true
    config.Consumer.Offsets.Initial = sarama.OffsetOldest // Start consuming from the earliest message

    // Create consumer group
    consumer, err := sarama.NewConsumerGroup([]string{"localhost:9092"}, "my-group", config)
    if err != nil {
        log.Fatalf("Failed to create consumer group: %v", err)
    }

    // Capture termination signal
    ctx, cancel := context.WithCancel(context.Background())
    signals := make(chan os.Signal, 1)
    signal.Notify(signals, os.Interrupt)

    var wg sync.WaitGroup
    wg.Add(1)
    go func() {
        defer wg.Done()
        for {
            // Consume runs a single session; loop so the consumer rejoins
            // the group after each rebalance.
            if err := consumer.Consume(ctx, []string{"my-topic"}, &ConsumerGroupHandler{}); err != nil {
                // Log rather than Fatalf, so a canceled context still
                // allows the graceful shutdown below to run.
                log.Printf("Consume error: %v", err)
            }
            // Exit once the context is canceled
            if ctx.Err() != nil {
                return
            }
        }
    }()

    <-signals
    cancel()
    wg.Wait()
    if err := consumer.Close(); err != nil {
        log.Fatalf("Failed to close consumer: %v", err)
    }
}

// ConsumerGroupHandler implements the sarama.ConsumerGroupHandler interface
type ConsumerGroupHandler struct{}

func (ConsumerGroupHandler) Setup(_ sarama.ConsumerGroupSession) error   { return nil }
func (ConsumerGroupHandler) Cleanup(_ sarama.ConsumerGroupSession) error { return nil }

func (h ConsumerGroupHandler) ConsumeClaim(session sarama.ConsumerGroupSession, claim sarama.ConsumerGroupClaim) error {
    for message := range claim.Messages() {
        fmt.Printf("Message consumed: topic=%s, partition=%d, offset=%d, key=%s, value=%s\n",
            message.Topic, message.Partition, message.Offset, string(message.Key), string(message.Value))
        session.MarkMessage(message, "") // Mark message as processed
    }
    return nil
}
Key points:
- Use ConsumerGroup to implement consumer group functionality; partition assignment and rebalancing are handled automatically.
- Implement the ConsumerGroupHandler interface to process messages.
- Use MarkMessage to record processed offsets, or configure automatic offset commits (see the sketch below).
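As a hedged sketch of the two commit styles (the AutoCommit settings and session.Commit() exist in recent sarama versions; check your version's documentation):

// Auto-commit (the default): sarama periodically commits marked offsets
// in the background. To commit explicitly instead, disable it:
config.Consumer.Offsets.AutoCommit.Enable = false

// Then flush offsets yourself in ConsumeClaim after marking:
func (h ConsumerGroupHandler) ConsumeClaim(session sarama.ConsumerGroupSession, claim sarama.ConsumerGroupClaim) error {
    for message := range claim.Messages() {
        // ... process the message ...
        session.MarkMessage(message, "") // stage the offset in memory
        session.Commit()                 // synchronously flush staged offsets to Kafka
    }
    return nil
}

Committing after every message is simple to reason about but adds a round trip per message; committing every N messages is a common compromise.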
4. Topic Management
Create a Topic (note that --replication-factor 2 requires at least two Brokers in the cluster):
kafka-topics.sh --create --topic my-topic --bootstrap-server localhost:9092 --partitions 3 --replication-factor 2
View Topic:
kafka-topics.sh --describe --topic my-topic --bootstrap-server localhost:9092
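Topics can also be managed programmatically. A minimal sketch using sarama's ClusterAdmin API, mirroring the create command above (broker address and topic settings are the same illustrative values):

package main

import (
    "log"

    "github.com/IBM/sarama"
)

func main() {
    admin, err := sarama.NewClusterAdmin([]string{"localhost:9092"}, sarama.NewConfig())
    if err != nil {
        log.Fatalf("Failed to create cluster admin: %v", err)
    }
    defer admin.Close()

    // Equivalent of: kafka-topics.sh --create --topic my-topic --partitions 3 --replication-factor 2
    if err := admin.CreateTopic("my-topic", &sarama.TopicDetail{
        NumPartitions:     3,
        ReplicationFactor: 2,
    }, false); err != nil {
        log.Fatalf("Failed to create topic: %v", err)
    }
}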
5. Performance Optimization
- Producer: Adjust batch.size and linger.ms to send messages in batches.
- Consumer: Set fetch.max.bytes and max.poll.records to control the amount of data pulled.
- Broker: Increase the number of partitions, adjust num.io.threads and num.network.threads.
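The property names above are the Java client and broker settings. In sarama, the closest client-side equivalents are sketched below as a config fragment (values are illustrative starting points, not recommendations; the time package is assumed to be imported):

config := sarama.NewConfig()

// Producer batching, rough analogues of batch.size / linger.ms:
config.Producer.Flush.Bytes = 64 * 1024                 // flush once 64 KiB is buffered...
config.Producer.Flush.Frequency = 10 * time.Millisecond // ...or after 10 ms, whichever comes first

// Consumer fetch sizing, rough analogue of fetch.max.bytes:
config.Consumer.Fetch.Default = 1024 * 1024 // bytes requested per fetch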
6. Error Handling
- Handle exceptions like network interruptions, unavailable partitions, etc.
- Use retry mechanisms (retry.backoff.ms).
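In sarama these retry settings live on the client config; a brief fragment with illustrative values:

config := sarama.NewConfig()
config.Producer.Retry.Max = 5                          // retry a failed send up to 5 times
config.Producer.Retry.Backoff = 100 * time.Millisecond // analogue of retry.backoff.ms
config.Metadata.Retry.Max = 3                          // retries for metadata refresh (e.g. after a leader change)
config.Metadata.Retry.Backoff = 250 * time.Millisecond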
7. Monitoring and Operations
- Use tools like Kafka Manager or Confluent Control Center to monitor Topic, Broker, and consumer status.
- Pay attention to Lag (the number of messages a consumer falls behind).
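Lag can also be checked with the kafka-consumer-groups.sh script that ships with Kafka; its LAG column shows, per partition, how far the group is behind the latest offset:

kafka-consumer-groups.sh --describe --group my-group --bootstrap-server localhost:9092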
Typical Application Scenarios
- Log Collection: Ship server logs to Kafka in real time, then have consumers process them downstream.
- Event-Driven Architecture: Microservices communicate through events via Kafka.
- Data Pipeline: Real-time synchronization from databases to data warehouses.
- Stream Processing: Combine with Kafka Streams or Flink for real-time analysis.
Summary
Kafka is a powerful distributed messaging system, with its core being the design of Topics, partitions, and consumer groups. In development, you need to:
- Master the basic APIs of producers and consumers.
- Understand the role of partitions and Offsets.
- Optimize configuration based on project requirements.