Skip to main content
Skip to main content

How to query Pandas DataFrames with chDB

Pandas is a popular library for data manipulation and analysis in Python. In version 2 of chDB, we've improved the performance of querying Pandas DataFrames and introduced the Python table function. In this guide, we will learn how to query Pandas using the Python table function.

Setup

Let's first create a virtual environment:

And now we'll install chDB. Make sure you have version 2.0.2 or higher:

And now we're going to install Pandas and a couple of other libraries:

We're going to use ipython to run the commands in the rest of the guide, which you can launch by running:

You can also use the code in a Python script or in your favorite notebook.

Creating a Pandas DataFrame from a URL

We're going to query some data from the StatsBomb GitHub repository. Let's first import requests and pandas:

Then, we'll load one of the matches JSON files into a DataFrame:

Let's have a look what data we'll be working with:

Next, we'll load one of the events JSON files and also add a column called match_id to that DataFrame:

And again, let's have a look at the first row:

Querying Pandas DataFrames

Next, let's see how to query these DataFrames using chDB. We'll import the library:

We can query Pandas DataFrames by using the Python table function:

So, if we wanted to list the columns in matches_df, we could write the following:

We could then find out which referees have officiated more than one match by writing the following query:

Now, let's explore events_df.

Joining Pandas DataFrames

We can also join DataFrames together in a query. For example, to get an overview of the match, we could write the following query:

Populating a table from a DataFrame

We can also create and populate ClickHouse tables from DataFrames. If we want to create a table in chDB, we need to use the Stateful Session API.

Let's import the session module:

Initialize a session:

Next, we'll create a database:

Then, create an events table based on events_df:

We can then run the query that returns the top pass recipient:

Joining a Pandas DataFrame and table

Finally, we can also update our join query to join the matches_df DataFrame with the statsbomb.events table: