Kafka was developed around 2010 at LinkedIn by a team that included Jay Kreps, Jun Rao, and Neha Narkhede. Apache Kafka is a distributed publish-subscribe messaging system in which multiple producers send data to the Kafka cluster, which in turn serves that data to consumers. In the publish-subscribe model, message producers are called publishers, and message consumers are called subscribers. Kafka provides a robust queue that handles a high volume of data and passes it from one point to another. Kafka guards against data loss by persisting messages on disk and replicating them across the cluster.

Kafka Architecture:

Topic: A stream of messages of a particular type is called a topic.

Producer: A producer is a source of data for the Kafka cluster. It publishes messages to one or more Kafka topics.

Consumer: A consumer reads records from the Kafka cluster. Multiple consumers can read messages from topics in parallel.

Brokers: A Kafka cluster may contain multiple brokers: 10, 100, or 1,000 if needed. A broker acts as a bridge between producers and consumers. Each Kafka broker has a unique numeric identifier.

Record: Messages sent to Kafka take the form of records. A record is a key-value pair.

ZooKeeper: It is used to track the status of Kafka cluster nodes. It also maintains information about Kafka topics, partitions, etc.

Kafka Cluster: A Kafka cluster is a system that comprises different brokers, topics, and their respective partitions. Producers write data to topics within the cluster, and consumers read it back from the cluster.
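To make the vocabulary above concrete, here is a minimal in-memory sketch of a topic, its partitions, key-value records, and consumers. It is a toy model only, not the Kafka client API: the class names (Record, Topic, Consumer), the "page-views" topic, and the CRC32-based partitioner are all assumptions made for illustration, and nothing here talks to a real broker.

```python
import zlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Record:
    """A key-value pair, the unit of data sent to a topic."""
    key: str
    value: str


@dataclass
class Topic:
    """A named stream of records, split into append-only partitions."""
    name: str
    num_partitions: int = 3
    partitions: list = field(default_factory=list)

    def __post_init__(self):
        # One independent, ordered log per partition.
        self.partitions = [[] for _ in range(self.num_partitions)]

    def append(self, record):
        # Toy partitioner (an assumption, not Kafka's own): CRC32 of the key
        # modulo the partition count, so equal keys share a partition.
        p = zlib.crc32(record.key.encode("utf-8")) % self.num_partitions
        self.partitions[p].append(record)
        return p, len(self.partitions[p]) - 1  # (partition, offset)


class Consumer:
    """Reads records from a topic, tracking its own offset per partition."""

    def __init__(self, topic):
        self.topic = topic
        self.offsets = [0] * topic.num_partitions

    def poll(self):
        # Return everything appended since this consumer's last poll.
        out = []
        for p, log in enumerate(self.topic.partitions):
            out.extend(log[self.offsets[p]:])
            self.offsets[p] = len(log)
        return out


# A producer publishing three records to one topic.
topic = Topic("page-views")
for user, page in [("alice", "/home"), ("bob", "/cart"), ("alice", "/checkout")]:
    topic.append(Record(user, page))

# Two independent consumers each see every record.
consumer_a = Consumer(topic)
consumer_b = Consumer(topic)
seen_a = consumer_a.poll()
seen_b = consumer_b.poll()
```

Because reading does not remove records from the log, both consumers receive all three records, and records that share a key stay in the order they were written.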

Who uses Kafka?

Many companies have adopted Kafka over the last few years. Here are some of them.

1) Netflix

Netflix uses Kafka clusters together with Apache Flink for distributed processing of video streaming data.

2) Pinterest

Pinterest uses Kafka to handle critical events like impressions, clicks, close-ups, and repins. According to figures presented at Kafka Summit 2018, Pinterest runs more than 2,000 brokers on Amazon Web Services, which transport about 800 billion messages and more than 1.2 petabytes per day and handle more than 15 million messages per second during peak hours.

3) Uber

Uber requires a lot of real-time processing. It collects event data from the rider and driver apps and makes that data available to downstream consumers via Kafka for processing.

4) LinkedIn

Apache Kafka originated at LinkedIn. LinkedIn uses Kafka for monitoring, user activity tracking, the newsfeed, and stream data.

5) SwiftKey

SwiftKey uses Kafka for analytics event processing.

Apart from the companies listed above, many others, such as Adidas, Line, The New York Times, Agoda, Airbnb, Oracle, and PayPal, use Kafka.
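As a rough sanity check on the Pinterest figures quoted above, the daily totals can be converted into an average rate and an average message size. This is back-of-the-envelope arithmetic only: the source gives "about" and "more than" values, and the sketch assumes 1 petabyte = 10**15 bytes and traffic spread evenly over 24 hours.

```python
SECONDS_PER_DAY = 24 * 60 * 60

messages_per_day = 800 * 10**9        # "about 800 billion messages"
bytes_per_day = 12 * 10**14           # "more than 1.2 petabytes" (assumed decimal PB)
peak_messages_per_sec = 15 * 10**6    # "more than 15 million messages per second"

# Average rate if the daily volume were spread evenly over the day.
avg_messages_per_sec = messages_per_day / SECONDS_PER_DAY

# Average payload per message implied by the two daily totals.
avg_message_bytes = bytes_per_day / messages_per_day

# How far the reported peak sits above the daily average.
peak_to_avg = peak_messages_per_sec / avg_messages_per_sec

print(f"average rate: {avg_messages_per_sec / 1e6:.2f} million messages/s")
print(f"average message size: {avg_message_bytes:.0f} bytes")
print(f"peak-to-average ratio: {peak_to_avg:.2f}")
```

Under these assumptions the average works out to roughly 9.26 million messages per second at about 1,500 bytes each, so the reported peak is around 1.6 times the daily average.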

Why can Apache Kafka be used for video streaming?

  1. High throughput – Kafka handles high-volume, high-velocity data on modest hardware and supports throughput of thousands of messages per second.
  2. Low latency – Kafka handles messages with very low latency, in the range of milliseconds.
  3. Scalability – Kafka is a distributed messaging system that scales out easily without downtime. It handles terabytes of data with little overhead and can scale to trillions of messages per day.
  4. Durability – Kafka persists messages on disk, which makes it a highly durable messaging system. Message replication adds to this durability, so messages survive the failure of individual brokers.

Other reasons to consider Kafka for video streaming are reliability, fault tolerance, high concurrency, batch handling, real-time handling, etc.
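The durability and fault-tolerance points can be illustrated with a small toy simulation of replication. This is a simplified sketch, not how Kafka is implemented: the Broker class, the replicated_write and read_log helpers, the replication factor of 3, and the "frame-…" payloads are all hypothetical, logs live in memory rather than on disk, and real Kafka replicates each partition from a leader broker to follower brokers rather than writing to all copies directly.

```python
class Broker:
    """A toy broker: an ID, an in-memory log, and an up/down flag."""

    def __init__(self, broker_id):
        self.broker_id = broker_id
        self.log = []
        self.alive = True


def replicated_write(replicas, message):
    """Append the message to every live replica; return the number of copies."""
    live = [b for b in replicas if b.alive]
    for broker in live:
        broker.log.append(message)
    return len(live)


def read_log(replicas):
    """Return the log from the first replica that is still up."""
    for broker in replicas:
        if broker.alive:
            return list(broker.log)
    raise RuntimeError("all replicas are down")


# Hypothetical replication factor of 3.
replicas = [Broker(1), Broker(2), Broker(3)]
for frame in ["frame-001", "frame-002", "frame-003"]:
    replicated_write(replicas, frame)

# Simulate losing one broker; the data is still readable from another copy.
replicas[0].alive = False
recovered = read_log(replicas)
print(recovered)
```

With three copies of every message, losing a single broker does not lose data; the trade-off is that every write costs three times the storage.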

Neova has expertise in message broker services and can help build micro-services based distributed applications that can leverage the power of a system like Kafka.


