How to query Apache Arrow with chDB

Apache Arrow is a standardized column-oriented memory format that's gained popularity in the data community. In this guide, we will learn how to query an Apache Arrow table from chDB using the Python table function.

Setup

Let's first create a virtual environment, for example with python -m venv .venv followed by source .venv/bin/activate.

And now we'll install chDB, for example with pip install "chdb>=2.0.2". Make sure you have version 2.0.2 or higher.
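If you'd like to confirm the installed version from Python, the following is a minimal sketch. The helper names version_tuple and is_supported are hypothetical, and the parsing assumes a plain numeric version string such as 2.0.2:

```python
from importlib.metadata import PackageNotFoundError, version

MINIMUM = (2, 0, 2)  # minimum chDB version named in this guide


def version_tuple(text):
    """Turn '2.0.4' into (2, 0, 4); trailing non-digits such as 'b1' are dropped."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def is_supported(text, minimum=MINIMUM):
    """Return True when the version string is at least the minimum."""
    return version_tuple(text) >= minimum


try:
    installed = version("chdb")
    print(installed, "OK" if is_supported(installed) else "is too old, upgrade chdb")
except PackageNotFoundError:
    print("chdb is not installed in this environment")
```

Comparing tuples of integers rather than raw strings avoids mistakes such as treating 2.0.10 as older than 2.0.2.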

Next, we'll install PyArrow, pandas, and ipython, for example with pip install pyarrow pandas ipython.

We're going to use ipython to run the commands in the rest of the guide, which you can launch by running the ipython command.

You can also use the code in a Python script or in your favorite notebook.

Creating an Apache Arrow table from a file

Let's first download one of the Parquet files for the Ookla dataset, using the AWS CLI tool:

Note

If you want to download more files, use aws s3 ls to get a list of all the files and then update the above command.

Next, we'll import the Parquet module from the pyarrow package, pyarrow.parquet, which is conventionally aliased as pq.

And then we can read the Parquet file into an Apache Arrow table:

The table's schema, available through its schema attribute, lists each column's name and Arrow type.

And we can get the row and column count from the shape attribute:

Querying Apache Arrow

Now let's query the Arrow table from chDB. First, let's import the chdb module.

And then we can describe the table:

We can also count the number of rows:

Now, let's do something a bit more interesting. The following query excludes the quadkey and tile.* columns and then computes the average and max values for all remaining columns: