Kafka Introduction

From bibbleWiki

Introduction

Definition

This is an introduction to Kafka, which describes itself as a messaging system.

Architecture

Cluster

A cluster is a group of computers, each running an instance of the Kafka broker.

Broker

Broker is simply the name given to a Kafka server. A Kafka producer does not interact with the consumer directly; both use the broker as an intermediary. A cluster can contain more than one broker.

Brokers are stateless in the sense that they do not hold the cluster state themselves, so they use ZooKeeper to maintain it.

Zookeeper

ZooKeeper is used for managing and coordinating the Kafka brokers. The ZooKeeper service is mainly used to notify producers and consumers when a new broker joins the Kafka cluster or an existing broker fails. Based on the notification from ZooKeeper about the presence or failure of a broker, producers and consumers decide what to do and start coordinating their work with another broker.

Producers

A producer is a component that pushes data to the brokers. It can be configured not to wait for acknowledgements from the brokers, in which case it sends data as fast as the brokers can handle. There can be more than one producer, depending on the use case.

Consumers

Because Kafka brokers are stateless, each consumer has to keep track of how many messages it has consumed, using the partition offset. When a consumer acknowledges a particular message offset, it implies that the consumer has consumed all prior messages. The consumer issues an asynchronous pull request to the broker to have a buffer of bytes ready to consume. A consumer can rewind or skip to any point in a partition simply by supplying an offset value. In the ZooKeeper-based setup described here, the consumer offset value is made available through ZooKeeper.
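
To make the offset idea concrete, here is a minimal sketch in plain Python. It is a toy model, not the Kafka client API: PartitionLog and ToyConsumer are hypothetical names invented for illustration, and the "partition" is just an in-memory list. It only aims to show that the reader, not the log, owns the position, and that rewinding or skipping is just a matter of supplying a different offset.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionLog:
    """Toy stand-in for one partition: an append-only list of messages."""
    messages: list = field(default_factory=list)

    def append(self, message):
        self.messages.append(message)
        return len(self.messages) - 1  # offset of the message just written

    def read(self, offset, max_messages):
        # The log keeps no record of readers; the caller says where to start.
        return self.messages[offset:offset + max_messages]


@dataclass
class ToyConsumer:
    """Hypothetical consumer that tracks its own position in the partition."""
    log: PartitionLog
    offset: int = 0  # next offset to read; everything before it counts as consumed

    def poll(self, max_messages=2):
        batch = self.log.read(self.offset, max_messages)
        self.offset += len(batch)
        return batch

    def seek(self, offset):
        # Rewind or skip simply by supplying a different offset.
        self.offset = offset


log = PartitionLog()
for text in ["m0", "m1", "m2", "m3"]:
    log.append(text)

consumer = ToyConsumer(log)
first = consumer.poll()     # reads from offset 0
consumer.seek(0)            # rewind to the start
replayed = consumer.poll()  # same messages again
consumer.seek(3)            # skip ahead
tail = consumer.poll()      # only the last message
```

A real consumer would also need to persist its offset somewhere durable so that it survives a restart; the sketch leaves that out.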

LinkedIn Worked Example

Here are some stats from LinkedIn.

And here is their architecture pre-2010.

Post 2010 architecture.

This is the current scale.

Installation

Doing the install

To install, I followed the instructions at https://www.digitalocean.com/community/tutorials/how-to-install-apache-kafka-on-ubuntu-20-04. To enable the stat command for ZooKeeper, I added it to the four-letter-word whitelist by changing kafka/config/zookeeper.properties to include

4lw.commands.whitelist=stat, ruok, conf, isro
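
Once the commands are whitelisted, they can be issued over a plain TCP connection to the ZooKeeper client port: the client sends the four letters and the server replies, then closes the connection. The sketch below is one way to do that from Python. The function name four_letter_word is my own, and localhost:2181 is assumed from the commands used elsewhere on this page.

```python
import socket


def four_letter_word(cmd, host="localhost", port=2181, timeout=5.0):
    """Send a ZooKeeper four-letter command and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


# Example usage against a running ZooKeeper (not executed here):
#   four_letter_word("ruok")  # a healthy server answers "imok"
#   four_letter_word("stat")  # only works if "stat" is on the whitelist
```

A command that is not on the whitelist will not return the normal output, so an unexpected reply from stat is a hint that the properties file was not picked up.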

Creating A Topic

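We do this with the command below. The --zookeeper flag applies to older Kafka releases; from Kafka 3.0 onwards kafka-topics.sh takes --bootstrap-server localhost:9092 instead.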

~/kafka/bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic my_topic

Listing Topics

We do this with the command

~/kafka/bin/kafka-topics.sh --list --zookeeper localhost:2181

Send Message to a Topic

We send a message with the command

~/kafka/bin/kafka-console-producer.sh --broker-list localhost:9092 --topic my_topic
>Message 1
>Test Message 2

Receive Message from a Topic

We receive a message with the command

~/kafka/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic my_topic --from-beginning

Partitions

What are Partitions

A partition is a subdivision of a topic's messages, i.e. when a message is received it is delivered to one and only one of the available partitions.

Partition Trade-offs

  • The more partitions, the greater the ZooKeeper overhead
  • Message ordering can become complex
  • The more partitions, the longer the leader fail-over time

Replication Factor

This is the number of copies of the data to keep, i.e. how many brokers hold a copy of each message.

Replication Status

We can view the status of the replication using the describe command. In the screenshot below we killed one of the brokers and then brought it back up. By comparing the ReplicationFactor with the Isr (in-sync replicas) we can see the outage and then the return to the normal state.
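
Separately from the screenshot, the same comparison can be scripted. The sketch below parses describe-style text and reports, per partition, which replicas are missing from the Isr list. The sample text is hand-written for illustration, not captured from a real cluster, and the exact layout of the describe output can differ between Kafka versions, so treat the regular expression as an assumption to adjust.

```python
import re

# Hand-written sample in the general shape of the describe output (assumed layout).
SAMPLE_DESCRIBE = """\
Topic: my_topic\tPartitionCount: 2\tReplicationFactor: 3
\tTopic: my_topic\tPartition: 0\tLeader: 1\tReplicas: 1,2,3\tIsr: 1,2,3
\tTopic: my_topic\tPartition: 1\tLeader: 2\tReplicas: 2,3,1\tIsr: 2,1
"""

PARTITION_LINE = re.compile(
    r"Partition:\s*(?P<partition>\d+).*?"
    r"Replicas:\s*(?P<replicas>[\d,]+).*?"
    r"Isr:\s*(?P<isr>[\d,]*)"
)


def under_replicated(describe_output):
    """Map partition number -> broker ids listed in Replicas but absent from Isr."""
    result = {}
    for line in describe_output.splitlines():
        match = PARTITION_LINE.search(line)
        if not match:
            continue  # summary line or anything else we do not recognise
        replicas = {b for b in match.group("replicas").split(",") if b}
        isr = {b for b in match.group("isr").split(",") if b}
        missing = replicas - isr
        if missing:
            result[int(match.group("partition"))] = sorted(missing)
    return result


lagging = under_replicated(SAMPLE_DESCRIBE)
```

In the sample, partition 1 lists broker 3 as a replica but not in the Isr, which is what an outage of broker 3 would look like; once the broker catches up again the result would be empty.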

Producers

Introduction

Here is a visual representation of the Producer

Where to Direct Messages

The producer looks at the message and makes a decision on where to send it based on configurable rules.
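
As an illustration of what such rules can look like, here is a minimal sketch of a partition chooser. ToyPartitioner is a hypothetical class, not part of any Kafka library, and the rule order (explicit partition first, then a hash of the key, then round-robin) is an assumption chosen for illustration. Kafka's own default partitioner uses a different hash (murmur2), so the partition numbers here will not match a real cluster.

```python
import itertools
import zlib


class ToyPartitioner:
    """Hypothetical rule set for picking a partition for each message."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._round_robin = itertools.cycle(range(num_partitions))

    def choose(self, key=None, partition=None):
        if partition is not None:
            # Rule 1: the caller named a partition, so use it as given.
            return partition
        if key is not None:
            # Rule 2: hash the key so the same key always lands on the same partition.
            # crc32 is a stand-in for the hash a real client would use.
            return zlib.crc32(key.encode("utf-8")) % self.num_partitions
        # Rule 3: no key, so spread messages evenly across partitions.
        return next(self._round_robin)


partitioner = ToyPartitioner(num_partitions=3)
explicit = partitioner.choose(key="user-1", partition=2)
same_key = (partitioner.choose(key="user-1"), partitioner.choose(key="user-1"))
keyless = [partitioner.choose() for _ in range(4)]
```

The key-based rule is the one that matters for ordering: because every message with the same key goes to the same partition, the order of those messages is preserved even though there is no ordering across partitions.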

Terms

  • Log shipping is the process of automating the backup of transaction log files on a primary (production) database server, and then restoring them onto a standby server.
  • A message broker is an intermediary program module that translates a message from the formal messaging protocol of the sender to the formal messaging protocol of the receiver.
  • ZooKeeper is essentially a service for distributed systems offering a hierarchical key-value store, which is used to provide a distributed configuration service, synchronization service, and naming registry for large distributed systems. ZooKeeper was a sub-project of Hadoop but is now a top-level Apache project in its own right.