What is Apache Kafka?
Kafka was developed around 2010 at LinkedIn by a team that included Jay Kreps, Jun Rao, and Neha Narkhede. Apache Kafka is a distributed publish-subscribe messaging system: multiple producers send data to the Kafka cluster, which in turn serves it to consumers. In the publish-subscribe model, the senders of messages are called publishers and the receivers are called subscribers. Kafka acts as a robust queue that handles a high volume of data and passes it from one point to another. It guards against data loss by persisting messages on disk and replicating them within the cluster.
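To make the publish-subscribe idea concrete, here is a minimal in-memory sketch. It is a toy model, not Kafka's client API: the `ToyBroker` class, its `publish` and `poll` methods, and the topic and subscriber names are all made up for illustration. It only shows that publishers append to a log and that each subscriber reads from its own position.

```python
from collections import defaultdict


class ToyBroker:
    """Hypothetical in-memory stand-in for a broker: one append-only log per topic."""

    def __init__(self):
        self.logs = defaultdict(list)     # topic name -> list of messages
        self.offsets = defaultdict(int)   # (subscriber, topic) -> next offset to read

    def publish(self, topic, message):
        """Append a message to the topic's log and return its offset."""
        self.logs[topic].append(message)
        return len(self.logs[topic]) - 1

    def poll(self, subscriber, topic):
        """Return the messages this subscriber has not read yet."""
        start = self.offsets[(subscriber, topic)]
        messages = self.logs[topic][start:]
        self.offsets[(subscriber, topic)] = len(self.logs[topic])
        return messages


broker = ToyBroker()
broker.publish("page-views", "user-1 viewed /home")
broker.publish("page-views", "user-2 viewed /pricing")

# Each subscriber keeps its own position, so both see every message.
analytics_batch = broker.poll("analytics", "page-views")
audit_batch = broker.poll("audit", "page-views")

# A later poll returns only what was published since the previous poll.
broker.publish("page-views", "user-1 viewed /docs")
analytics_next = broker.poll("analytics", "page-views")
```

Here `analytics_batch` and `audit_batch` both contain the first two messages, while `analytics_next` contains only the third. Reading does not remove a message from the log, which is why several independent subscribers can consume the same data.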
Topic: A stream of messages of a particular type is called a topic.
Producer: A producer is a source of data for the Kafka cluster. It publishes messages to one or more Kafka topics.
Consumer: A consumer reads records from the Kafka cluster. Multiple consumers can read messages from topics in parallel.
Broker: A broker acts as a bridge between producers and consumers. A Kafka cluster may contain multiple brokers, scaling to 10, 100, or 1,000 if needed. Each broker has a unique numeric identifier.
Record: Messages sent to Kafka take the form of records. Each record is a key-value pair.
ZooKeeper: It is used to track the status of Kafka cluster nodes. It also maintains information about Kafka topics, partitions, etc.
Kafka Cluster: A Kafka cluster is a system that comprises brokers, topics, and their respective partitions. Producers write data to topics within the cluster, and consumers read that data back from the cluster.
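The terms above fit together roughly as in the sketch below: records are key-value pairs, a topic is split into partitions, and a record's key decides which partition it lands in. This is an illustration under assumptions, not Kafka's implementation. The `Record` class, the `choose_partition` helper, the partition count, and the sample keys are hypothetical, and CRC32 is used here only as a simple stand-in for the hash that Kafka's own partitioner applies.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """A record is a key-value pair."""
    key: str
    value: str


def choose_partition(key: str, num_partitions: int) -> int:
    """Map a key to a partition. CRC32 is a stand-in, not Kafka's actual hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


NUM_PARTITIONS = 3  # hypothetical partition count for one topic
partitions = [[] for _ in range(NUM_PARTITIONS)]

records = [
    Record("rider-7", "requested"),
    Record("rider-9", "requested"),
    Record("rider-7", "picked-up"),
    Record("rider-7", "dropped-off"),
]

# Producer side: append each record to the partition chosen by its key.
for record in records:
    partitions[choose_partition(record.key, NUM_PARTITIONS)].append(record)

# Consumer side: each partition can be read by a separate consumer in parallel.
for index, partition in enumerate(partitions):
    print(index, [(r.key, r.value) for r in partition])
```

Because the partition depends only on the key, all records for `rider-7` end up in the same partition in the order they were sent, while records for other keys may be handled by a different consumer at the same time.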
Who uses Kafka?
Many companies have adopted Kafka over the last few years. Here are some of them.
Pinterest uses Kafka to handle critical events like impressions, clicks, close-ups, and repins. According to Kafka Summit 2018, Pinterest runs more than 2,000 brokers on Amazon Web Services, which transport about 800 billion messages and more than 1.2 petabytes per day and handle more than 15 million messages per second during peak hours.
Uber requires a lot of real-time processing. It collects event data from the rider and driver apps and makes that data available to downstream consumers via Kafka.
Netflix uses Kafka clusters together with Apache Flink for stream processing.
Apache Kafka originated at LinkedIn. LinkedIn uses Kafka for monitoring, user activity tracking, the newsfeed, and stream data.
SwiftKey uses Kafka for analytics event processing.
Apart from the companies listed above, many others, such as Adidas, Line, The New York Times, Agoda, Airbnb, Oracle, and PayPal, use Kafka.
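To get a feel for the Pinterest figures quoted above, the short calculation below turns the daily totals into per-second and per-message averages. It is back-of-the-envelope arithmetic only: it assumes 1 petabyte means 10**15 bytes and treats the "about" and "more than" figures as exact.

```python
SECONDS_PER_DAY = 24 * 60 * 60        # 86,400

messages_per_day = 800e9              # about 800 billion messages per day
bytes_per_day = 1.2e15                # 1.2 petabytes per day, assuming 1 PB = 10**15 bytes
peak_messages_per_second = 15e6       # 15 million messages per second at peak

average_messages_per_second = messages_per_day / SECONDS_PER_DAY
average_message_bytes = bytes_per_day / messages_per_day
peak_to_average = peak_messages_per_second / average_messages_per_second

print(f"average rate: {average_messages_per_second / 1e6:.2f} million messages/s")
print(f"average size: {average_message_bytes:.0f} bytes per message")
print(f"peak / average: {peak_to_average:.2f}")
```

Under these assumptions the average works out to roughly 9.26 million messages per second at about 1,500 bytes per message, so the quoted peak is about 1.6 times the daily average.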
Advantages of Kafka:
- High throughput – Kafka handles high-volume, high-velocity data on modest hardware and sustains throughput of thousands of messages per second or more.
- Low latency – Kafka delivers messages with very low latency, in the range of milliseconds.
- Scalability – Kafka is a distributed messaging system, so it scales out easily without downtime. It handles terabytes of data with little overhead and can scale to trillions of messages per day.
- Durability – Kafka persists messages on disk, which makes it a highly durable messaging system. Replication adds to this durability: because each message is copied to several brokers, it is not lost when a single broker fails.
Apart from the advantages mentioned above, Kafka also offers reliability, fault tolerance, high concurrency, batch handling, and real-time handling.
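The link between replication and durability can be sketched with a small toy model. This is not how Kafka implements replication, and it leaves out details such as leaders, followers, and disk persistence. The `ToyCluster` class, its methods, and the broker ids are hypothetical and exist only to show why a copy on a second broker matters.

```python
class ToyCluster:
    """Hypothetical toy model: every message is copied to `replication_factor` brokers."""

    def __init__(self, broker_ids, replication_factor):
        self.replication_factor = replication_factor
        self.logs = {broker_id: [] for broker_id in broker_ids}  # broker id -> stored messages
        self.alive = set(broker_ids)

    def publish(self, message):
        """Store the message on the first `replication_factor` live brokers."""
        targets = sorted(self.alive)[: self.replication_factor]
        if len(targets) < self.replication_factor:
            raise RuntimeError("not enough live brokers to replicate")
        for broker_id in targets:
            self.logs[broker_id].append(message)

    def fail(self, broker_id):
        """Simulate a broker crash; its copy becomes unreachable."""
        self.alive.discard(broker_id)

    def live_copies(self, message):
        """Return the ids of live brokers that still hold the message."""
        return [b for b in sorted(self.alive) if message in self.logs[b]]


cluster = ToyCluster(broker_ids=[1, 2, 3], replication_factor=2)
cluster.publish("order-created")     # copied to brokers 1 and 2

cluster.fail(1)
survivors = cluster.live_copies("order-created")   # broker 2 still has it

cluster.fail(2)
lost = cluster.live_copies("order-created")        # no live copy remains
```

With a replication factor of 2, the message survives the loss of one broker (`survivors` is `[2]`) but not the loss of both brokers that held it (`lost` is empty). A higher replication factor tolerates more simultaneous failures at the cost of more storage.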
In this blog, we have shared details about Apache Kafka, a highly scalable publish-subscribe messaging system.