Apache Kafka

What is Apache Kafka?

Apache Kafka is a distributed event-streaming platform that helps different programs talk to each other and share information smoothly. It’s like a big digital messaging system where different applications, servers, and computers can send and receive data. Apache Kafka was first created at LinkedIn, which later donated it to the Apache Software Foundation; the foundation maintains the open-source project today, while Confluent, a company founded by Kafka’s original creators, builds a commercial platform around it. The cool thing about Apache Kafka is that it solves the problem of moving data between systems slowly and unreliably: it makes sure data gets from the sender to the receiver quickly, efficiently, and at scale.

What is a messaging system?

A messaging system is like sending texts or emails: it’s simply a way for applications or devices to exchange messages. In a publish-subscribe messaging system, it’s a bit like writing a note and passing it along. The one who writes the note is the “sender” (in Apache Kafka, called a “producer”), and the one who reads the note is the “receiver” (in Apache Kafka, called a “consumer”). So a producer publishes messages, and a consumer reads those messages by subscribing to the topic they were published to. It’s like sharing notes in a big digital classroom.
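Here’s a quick taste of that producer/consumer flow using the console tools that ship with Kafka. This is only a sketch: it assumes a broker is already running on localhost:9092 (we set one up later in this post), and the topic name notes is just an example:

    # Producer: publish one note to the "notes" topic
    echo "hello class" | kafka-console-producer.sh --topic notes --bootstrap-server localhost:9092

    # Consumer: subscribe to the topic and read the notes
    kafka-console-consumer.sh --topic notes --from-beginning --bootstrap-server localhost:9092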

Need for a Messaging System:

Data pipeline: In a real-world scenario, various systems need to communicate with each other in real time, and this communication is achieved through data pipelines.

Apache Kafka Data Pipeline

Example: The chat server needs to communicate with the database server for storing messages. 

Complex Data Pipelines: As the number of systems grows, every source system needs its own pipeline to every target system, and this tangle of point-to-point connections quickly becomes hard to build and maintain.

Solutions to the Complex Data Pipelines: A messaging system like Kafka decouples the two sides: source systems publish their data to Kafka once, and any target system that needs the data simply subscribes to it, so adding a new system no longer means wiring up a web of new pipelines.

Why Apache Kafka?

  • Apache Kafka is capable of handling millions of messages per second. 
  • In a typical setup, a source system generates data (the producer) and a target system needs that data (the consumer). Apache Kafka acts as the middle layer: the source hands its data to Kafka, and the target picks it up from there, so the two systems stay decoupled, like two LEGO pieces that can be unhooked without breaking anything. 
  • Apache Kafka is fast: it can deliver data with end-to-end latency as low as 10 milliseconds. It’s like sending a text message and having the other person read it almost instantly. 
  • Apache Kafka is highly reliable and fault-tolerant. Organizations like Netflix, Uber, Walmart, and many more make use of Apache Kafka. 

Kafka Terminologies: 

  1. Producer: A producer can be any application that can publish a message on a topic. 
  2. Consumer: A consumer can be any application that subscribes to a topic and consumes the messages. 
  3. Partition: Topics are broken up into ordered commit logs called partitions.
  4. Broker: A Kafka cluster is a set of servers, each of which is called a broker.
  5. Topic: A topic is a category or feed name to which records are published. 
  6. Zookeeper: Zookeeper is used for managing and coordinating Kafka brokers.
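To make these terms concrete, here’s a small sketch using Kafka’s CLI tools. It assumes a broker is running locally on port 9092, and the topic name orders is just an example:

    # Topic + partitions: create a topic named "orders" with 3 partitions
    kafka-topics.sh --create --topic orders --bootstrap-server localhost:9092 \
      --partitions 3 --replication-factor 1

    # Broker: describe shows which broker leads each partition
    kafka-topics.sh --describe --topic orders --bootstrap-server localhost:9092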

Kafka Cluster:  

Here, we can see multiple producers producing data to Kafka brokers, and those brokers reside inside a Kafka cluster.  

Again, we have multiple consumers consuming the data from the Kafka brokers, and these brokers are managed by Zookeeper. 
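One way to see the broker/Zookeeper relationship for yourself, assuming Zookeeper is running on port 2181: every broker registers itself in Zookeeper, and the shell bundled with Kafka can list the registered broker IDs.

    # List the broker IDs currently registered with Zookeeper
    zookeeper-shell.sh localhost:2181 ls /brokers/ids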

Apache Kafka Cluster

Kafka Features:  

  1. High Throughput: Provides support for hundreds of thousands of messages per second on modest hardware. 
  2. Scalability: A highly scalable distributed system that can be expanded with no downtime. 
  3. No Data Loss: Kafka ensures no data loss once configured properly. 
  4. Stream Processing: Kafka can be used along with real-time streaming applications like Spark and Storm. 
  5. Durability: Provides support for persisting messages on disk. 
  6. Replication: Messages can be replicated across clusters, which supports multiple subscribers (see the sketch below).
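For instance, replication can be observed by creating a topic with a replication factor greater than one. A sketch, assuming a cluster of at least 3 brokers (the single-broker setup later in this post only supports a factor of 1), with replicated-demo as an example topic name:

    # Create a topic whose single partition is copied to 3 brokers
    kafka-topics.sh --create --topic replicated-demo --bootstrap-server localhost:9092 \
      --partitions 1 --replication-factor 3

    # Describe shows the leader, replicas, and in-sync replicas (Isr) per partition
    kafka-topics.sh --describe --topic replicated-demo --bootstrap-server localhost:9092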

Apache Kafka Architecture: 

Here we have producers producing messages to a topic, and this topic has 3 partitions. There are also 3 consumers consuming from these partitions. 

Say we have 2 producers, Producer A and Producer B. Producer A produces to partition 1 and partition 2, and Producer B produces to partition 3. We then have one consumer for each partition. This is the best-case scenario, where you get full parallelism in processing those messages. 
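A sketch of that one-consumer-per-partition layout with the console tools, using demo-topic (3 partitions) and demo-group as example names: consumers that share a group id split the topic’s partitions among themselves.

    # Run this same command in 3 separate terminals; each consumer
    # in the group will be assigned one of the 3 partitions
    kafka-console-consumer.sh --topic demo-topic --group demo-group \
      --bootstrap-server localhost:9092

    # Check which consumer owns which partition
    kafka-consumer-groups.sh --describe --group demo-group --bootstrap-server localhost:9092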

Kafka Use-Cases: 

1. Messaging:

An application can create and publish messages to Kafka without needing to worry about how those messages will be delivered or who will read them. 

A single, separate application can then read all the messages and handle them consistently. This application can do the following: 

  • Ensure a common formatting for all messages to maintain a consistent look. 
  • Group multiple messages together in a single notification. 
  • Receive messages in a way that aligns with the user’s preferences. 

2. Activity Tracking: 

Kafka was created at LinkedIn to help keep track of what users do on their platform. Imagine a user interacting with a front-end application, like a website or app. The application generates messages that describe the actions the user is taking, such as clicking on links, submitting forms, and so on. Kafka plays a crucial role in keeping a record of this user activity, from simple things like tracking clicks to more complex tasks like storing information from a user’s profile. 

3. Metrics and Logging: 

Kafka is not just for tracking user actions; it’s also great for gathering information about application and system performance. Applications regularly share metrics (data about how well they’re performing) by publishing this data to a dedicated Kafka topic, and monitoring systems read those metrics to keep an eye on performance and make adjustments when needed. Additionally, Kafka can be used to publish log messages from various parts of the system. These log messages can then be directed to specific destinations, such as Elasticsearch for searching through logs or a security analysis application for evaluating system security.
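A minimal sketch of an application publishing a metric, with metrics as an example topic name and a broker assumed on localhost:9092:

    # Publish one JSON metric record to the "metrics" topic
    echo '{"app":"checkout","p99_ms":42}' | \
      kafka-console-producer.sh --topic metrics --bootstrap-server localhost:9092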

4. Commit Log: 

Imagine your database is like a dynamic book: when changes occur in that book, they can be shared with other applications through Kafka, so those applications stay up to date in real time. Kafka is also quite clever when it comes to handling these updates. It can make sure that changes in the database are not only sent to local applications but also replicated to a remote system, which collects changes from various applications and presents a consolidated view of the database. Moreover, for long-term storage of changes, you can use log-compacted topics. They are like keeping only the most important pages of the book: only the latest change for each key is retained, giving you a longer history of changes while avoiding redundancy.
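A sketch of creating such a log-compacted topic, with db-changes as an example topic name:

    # cleanup.policy=compact tells Kafka to keep only the latest record per key
    kafka-topics.sh --create --topic db-changes --bootstrap-server localhost:9092 \
      --partitions 1 --replication-factor 1 \
      --config cleanup.policy=compact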

5. Stream Processing: 

Stream processing is a term often used to describe applications that do something similar to Hadoop’s map/reduce processing, except that the data is handled in real time, just as fast as messages are created. Think of it as a small program that works with messages from Kafka as they come in. These programs can do various tasks, like keeping track of metrics (counting things), transforming messages to make them easier for other applications to work with, and more. It’s all about processing and reacting to data as it flows in.

Apache Kafka- Installation:

Prerequisites for Apache Kafka: You should have a good understanding of Java, distributed messaging systems, and Linux environments.

Step 1: Download the latest version of Apache Kafka from https://kafka.apache.org/downloads under Binary downloads. 

Step 2: Extract the contents to a directory of your choice, for example ~/kafka_2.13-3.6.0.

Set up the $PATH environment variable:

Now copy the path of the extracted Kafka directory and add its bin folder to $PATH in your user profile’s bash file, so the Kafka scripts can be run from anywhere. Then save the file. 
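A minimal sketch, assuming Kafka was extracted to ~/kafka_2.13-3.6.0 (adjust the path to match your own directory):

    # Add the Kafka scripts to PATH; append this line to ~/.bashrc
    export PATH=$PATH:~/kafka_2.13-3.6.0/bin

    # Reload the profile so the change takes effect in the current shell
    source ~/.bashrc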

Now we will start Zookeeper.

Zookeeper listens for clients on port 2181, which is set by the clientPort property in config/zookeeper.properties; whenever we need to connect to Zookeeper, we use this port. Once the startup command below completes, our Zookeeper has been started. 
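A sketch of the startup, assuming the command is run from the extracted Kafka directory (so the config/ path resolves) and the bin directory is on $PATH:

    # Start Zookeeper with the default properties file shipped with Kafka
    # (config/zookeeper.properties sets clientPort=2181)
    zookeeper-server-start.sh config/zookeeper.properties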

Now we will start the Kafka broker.

Once it is up, you can see that the Kafka broker has been registered with Zookeeper. Now open the config/server.properties file. Here you can see broker.id=0; if you are creating a new broker, you always need to give it a unique broker ID. You can also set delete.topic.enable=true so that topics can be deleted later. 
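A sketch of starting the broker, under the same assumptions as above (by default it listens on port 9092):

    # Start the Kafka broker with the default configuration
    kafka-server-start.sh config/server.properties

    # Relevant entries in config/server.properties:
    #   broker.id=0                # must be unique for every broker in the cluster
    #   delete.topic.enable=true   # allows topics to be deleted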

Let’s Configure a Single-Node, Single-Broker Cluster:

Step 1: Start the Zookeeper server 
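As before, assuming you are in the Kafka directory with bin on $PATH:

    # Terminal 1: start the Zookeeper server
    zookeeper-server-start.sh config/zookeeper.properties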

Step 2: Start the Kafka server 
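In a second terminal:

    # Terminal 2: start the Kafka broker
    kafka-server-start.sh config/server.properties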

Step 3: Create a topic with the name “Supriya-testdemo” 
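A sketch of the create command; one partition and a replication factor of 1 match our single-broker setup:

    # Terminal 3: create the topic on the local broker
    kafka-topics.sh --create --topic Supriya-testdemo \
      --bootstrap-server localhost:9092 \
      --partitions 1 --replication-factor 1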

Step 4: Start a producer 
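For example, with the console producer that ships with Kafka:

    # Each line typed here is published as one message to the topic
    kafka-console-producer.sh --topic Supriya-testdemo --bootstrap-server localhost:9092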

Step 5: Start a consumer 
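In another terminal, with the console consumer (--from-beginning also replays messages published before the consumer started):

    # Subscribe to the topic and print every message
    kafka-console-consumer.sh --topic Supriya-testdemo --from-beginning \
      --bootstrap-server localhost:9092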

Step 6: The producer can now publish a message on the topic, and it will be received by the consumer that has subscribed to that topic. 
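A session might look roughly like this (the > prompt is printed by the producer tool):

    # Producer terminal
    > Hello Kafka
    > This is my first message

    # Consumer terminal output
    Hello Kafka
    This is my first message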

Conclusion: 

Apache Kafka is like a highly organized postal system for data, enabling businesses to efficiently and reliably exchange information across different parts of their operations. It acts as a central hub, collecting, managing, and delivering data to its intended recipients, making it a crucial tool for handling real-time data streams and ensuring smooth communication in the digital world. Kafka’s scalability, fault tolerance, and versatility have made it an essential player in the world of data streaming and event-driven architecture, empowering organizations to harness the power of data for making informed decisions and staying ahead in a rapidly changing technological landscape. 

You can also read: Can Kafka be used for Video Streaming?

Stay tuned for more blogs!