
How to Ingest Data from Kafka into ClickHouse

Learn how to ingest data from a Kafka topic into ClickHouse using the Kafka table engine, materialized views, and MergeTree tables.

Overview

This article walks through the process of sending data from a Kafka topic to a ClickHouse table. We’ll use the Wikimedia recent changes feed, which provides a stream of events that represent changes made to various Wikimedia properties. The steps include:

  1. Set up Kafka on Ubuntu
  2. Ingest a stream of data into a Kafka topic
  3. Create a ClickHouse table that subscribes to the topic

1. Set Up Kafka on Ubuntu

  1. Create an Ubuntu EC2 instance and SSH into it.
  2. Install Kafka, based on the instructions here: https://www.linode.com/docs/guides/how-to-install-apache-kafka-on-ubuntu/ (the commands for steps 2–6 are sketched after this list).
  3. Start ZooKeeper.
  4. Open a new console and launch Kafka.
  5. Open a third console and create a topic named wikimedia.
  6. Verify the topic was created successfully.
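A minimal sketch of those commands (steps 2–6), assuming Kafka 3.x installed under /opt/kafka with its bundled ZooKeeper and a single broker on localhost:9092; the version number and paths are placeholders to adjust for your environment:

```bash
# 2. Install Java and Kafka (version and paths are assumptions)
sudo apt-get update && sudo apt-get install -y openjdk-11-jdk
wget https://downloads.apache.org/kafka/3.7.0/kafka_2.13-3.7.0.tgz
tar -xzf kafka_2.13-3.7.0.tgz && sudo mv kafka_2.13-3.7.0 /opt/kafka

# 3. Start ZooKeeper (console 1)
/opt/kafka/bin/zookeeper-server-start.sh /opt/kafka/config/zookeeper.properties

# 4. Launch the Kafka broker (console 2)
/opt/kafka/bin/kafka-server-start.sh /opt/kafka/config/server.properties

# 5. Create a topic named wikimedia (console 3)
/opt/kafka/bin/kafka-topics.sh --create --topic wikimedia \
  --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1

# 6. Verify the topic was created
/opt/kafka/bin/kafka-topics.sh --list --bootstrap-server localhost:9092
```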

2. Ingest the Wikimedia Stream into Kafka

  1. We need some utilities first.
  2. The data is sent to Kafka using a clever curl command that grabs the latest Wikimedia events, parses out the JSON data, and sends it to the Kafka topic (the commands are sketched after this list).
  3. You can "describe" the topic.
  4. Let's verify everything is working by consuming some events.
  5. Hit Ctrl+C to kill the previous command.
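A hedged sketch of these steps. The endpoint below is the public Wikimedia EventStreams URL; the "clever curl command" in the original may extract the JSON differently (for example with jq, one of the utilities worth installing):

```bash
# 1. Utilities: jq is handy for inspecting the JSON events (assumed)
sudo apt-get install -y jq

# 2. Stream recent changes into the wikimedia topic. The SSE feed
#    prefixes each JSON payload with "data: ", so strip that prefix
#    before handing the line to the console producer. Leave this
#    running in its own console.
curl -N https://stream.wikimedia.org/v2/stream/recentchange |
  awk '/^data: /{sub(/^data: /, ""); print}' |
  /opt/kafka/bin/kafka-console-producer.sh \
    --bootstrap-server localhost:9092 --topic wikimedia

# 3. Describe the topic
/opt/kafka/bin/kafka-topics.sh --describe --topic wikimedia \
  --bootstrap-server localhost:9092

# 4. Consume events to verify the stream is flowing
/opt/kafka/bin/kafka-console-consumer.sh --topic wikimedia \
  --bootstrap-server localhost:9092

# 5. Hit Ctrl+C when you have seen enough
```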

3. Ingest the Data into ClickHouse

  1. Here is what the incoming data looks like: each message is a JSON document describing a single recent-change event.
  2. We will need the Kafka table engine to pull the data from the Kafka topic.
  3. For some reason the Kafka table engine seems to take the public EC2 URL and convert it to the private DNS name (likely because the broker advertises its private hostname to clients), so I had to add that to my local /etc/hosts file.
  4. You can read directly from a Kafka table; you just have to enable a setting (sketched below).

The rows should come back nicely parsed, based on the columns defined in the wikiQueue table.
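A sketch of steps 2–4, assuming a subset of the well-known recentchange fields for the columns (the article's wikiQueue table likely defines more) and a placeholder broker address:

```sql
-- 2. Kafka table engine: arguments are broker list, topic, consumer
--    group, and message format; kafka-host:9092 is a placeholder
CREATE TABLE wikiQueue
(
    id UInt32,
    type String,
    title String,
    user String,
    bot Bool,
    server_name String,
    wiki String,
    timestamp UInt64
)
ENGINE = Kafka('kafka-host:9092', 'wikimedia', 'clickhouse-group', 'JSONEachRow');

-- 4. Direct reads from a Kafka table require this setting
SELECT *
FROM wikiQueue
LIMIT 10
SETTINGS stream_like_engine_allow_direct_select = 1;
```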

  5. We need a MergeTree table to store these incoming events (steps 5–9 are sketched after this list).
  6. Let's define a materialized view that gets triggered when an insert occurs on the Kafka table and sends the data to our rawEvents table.
  7. You should start seeing data going into rawEvents almost immediately.
  8. Let's view some of the rows.
  9. Let's see what types of events are coming in.
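A sketch of steps 5 through 9, reusing the assumed column set from above; the rawEvents name comes from the article, the rest is illustrative:

```sql
-- 5. MergeTree table that stores the incoming events
CREATE TABLE rawEvents
(
    id UInt32,
    type String,
    title String,
    user String,
    bot Bool,
    server_name String,
    wiki String,
    dateTime DateTime64(3, 'UTC')
)
ENGINE = MergeTree
ORDER BY dateTime;

-- 6. Materialized view: fires on each insert into the Kafka table
--    and forwards the (lightly transformed) rows to rawEvents
CREATE MATERIALIZED VIEW rawEventsMv TO rawEvents AS
SELECT
    id, type, title, user, bot, server_name, wiki,
    toDateTime64(timestamp, 3, 'UTC') AS dateTime
FROM wikiQueue;

-- 7. Rows should appear almost immediately
SELECT count() FROM rawEvents;

-- 8. View some of the rows
SELECT * FROM rawEvents ORDER BY dateTime DESC LIMIT 5;

-- 9. What types of events are coming in?
SELECT type, count() AS c
FROM rawEvents
GROUP BY type
ORDER BY c DESC;
```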

Let's define a materialized view chained to our current materialized view. We will keep track of some aggregated stats per minute:
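A sketch of that chained view, assuming an AggregatingMergeTree target keyed by the start of each minute, with -State combinators holding the partial aggregates; the byMinute name and the choice of stats are assumptions:

```sql
-- Target table storing partial aggregate states per minute
CREATE TABLE byMinute
(
    dateTime DateTime64(3, 'UTC'),
    users AggregateFunction(uniq, String),
    pages AggregateFunction(uniq, String),
    updates AggregateFunction(count)
)
ENGINE = AggregatingMergeTree
ORDER BY dateTime;

-- Chained materialized view: fires when the first view inserts
-- rows into rawEvents, rolling them up per minute
CREATE MATERIALIZED VIEW byMinuteMv TO byMinute AS
SELECT
    toStartOfMinute(dateTime) AS dateTime,
    uniqState(user) AS users,
    uniqState(title) AS pages,
    countState() AS updates
FROM rawEvents
GROUP BY dateTime;
```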

Finally, we will need -Merge functions to view the results.
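For example, a query along these lines, where each -Merge combinator finalizes the matching -State column from the sketch above:

```sql
SELECT
    dateTime,
    uniqMerge(users) AS users,
    uniqMerge(pages) AS pages,
    countMerge(updates) AS updates
FROM byMinute
GROUP BY dateTime
ORDER BY dateTime DESC
LIMIT 10;
```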