Reading data from S3 that is stored in folders and in Parquet file format, and then inserting it into a ClickHouse database with a specific schema, requires a few additional steps compared to handling CSV files. Here's how you can do it in Python:
1. Install Required Packages: You need `boto3` for AWS S3 access, `pandas` for data handling, `pyarrow` as the Parquet engine for pandas, and `clickhouse_connect` for ClickHouse database interaction. Install these packages using pip:
pip install boto3 pandas pyarrow clickhouse_connect
2. Set Up AWS Credentials: Ensure your AWS credentials are correctly configured for access to S3 (a configuration sketch follows this list).
3. Read Parquet Data from S3: Use Boto3 to list and access the Parquet files in the specified S3 bucket and folders.
4. Convert Parquet Data to DataFrame: Read the Parquet files into a Pandas DataFrame.
5. Insert Data into ClickHouse: Use `clickhouse_connect` to insert the data into the ClickHouse database.
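If you would rather configure credentials in code than rely on `aws configure`, environment variables, or an IAM role, boto3 accepts them explicitly. This is only a sketch; the key values and region below are placeholders:

import boto3

# Placeholder credentials: in practice, prefer environment variables,
# an IAM role, or the shared credentials file over hardcoding secrets.
s3 = boto3.client(
    's3',
    aws_access_key_id='YOUR_ACCESS_KEY_ID',
    aws_secret_access_key='YOUR_SECRET_ACCESS_KEY',
    region_name='us-east-1',  # placeholder: use your bucket's region
)

The full script below assumes credentials are resolved by boto3's default chain.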
import boto3
import pandas as pd
import clickhouse_connect
from io import BytesIO

# AWS S3 Configuration
s3 = boto3.client('s3')
bucket_name = 'your-bucket-name'
prefix = 'your-folder-path/'  # Include trailing slash

# ClickHouse Configuration
clickhouse_host = 'your-clickhouse-host'
clickhouse_port = 8443  # HTTPS port used by clickhouse_connect (8123 without TLS)
clickhouse_user = 'your-user'
clickhouse_password = 'your-password'
clickhouse_db = 'your-db'
clickhouse_table = 'your-table'

# Connect to ClickHouse
client = clickhouse_connect.get_client(host=clickhouse_host, secure=True, port=clickhouse_port,
                                       username=clickhouse_user, password=clickhouse_password,
                                       database=clickhouse_db)

# List Parquet files from S3 and insert each one as it is read
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
    for obj in page.get('Contents', []):
        file_key = obj['Key']
        # Skip folder placeholder objects and non-Parquet files
        if not file_key.endswith('.parquet'):
            continue
        response = s3.get_object(Bucket=bucket_name, Key=file_key)
        file_content = response['Body'].read()

        # Read Parquet file into DataFrame (pandas uses pyarrow under the hood)
        df = pd.read_parquet(BytesIO(file_content))

        # Process your DataFrame as needed
        # df = process_your_dataframe(df)

        # Insert data into ClickHouse
        # Ensure that the DataFrame columns match the ClickHouse table schema
        columns = list(df.columns)
        data = df.values.tolist()
        client.insert(clickhouse_table, data, column_names=columns)

print("Data inserted successfully into ClickHouse.")
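As a side note, `clickhouse_connect` also provides `insert_df`, which inserts a DataFrame directly and avoids the intermediate list conversion, assuming the frame's columns and dtypes already line up with the table:

# Alternative to the list-based insert above:
# assumes df's columns and dtypes already match the target table.
client.insert_df(clickhouse_table, df)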
Important Notes:
Replace `your-folder-path/` with the specific folder path in your S3 bucket where the Parquet files are stored.
Ensure that the DataFrame columns match the schema of your ClickHouse table.
The `process_your_dataframe(df)` placeholder represents any data transformations you might need to align the DataFrame with your ClickHouse schema; a hypothetical example follows these notes.
Handle exceptions and errors, especially for network operations and database interactions; a minimal pattern is sketched at the end of this section.
The code assumes your Parquet files are structured in a way that pandas can read them directly into a DataFrame. If your Parquet files have a complex structure, you might need additional processing.
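For illustration, here is one possible `process_your_dataframe`. It assumes a hypothetical target table with columns `event_time` (DateTime64), `sensor_id` (String), and `value` (Float64); adapt the renames and casts to your actual schema:

import pandas as pd

def process_your_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical schema alignment: rename source columns and cast dtypes
    # to match a table (event_time DateTime64, sensor_id String, value Float64).
    df = df.rename(columns={'ts': 'event_time', 'id': 'sensor_id'})
    df['event_time'] = pd.to_datetime(df['event_time'])
    df['sensor_id'] = df['sensor_id'].astype(str)
    df['value'] = df['value'].astype('float64')
    # Keep only the target table's columns, in the table's column order.
    return df[['event_time', 'sensor_id', 'value']]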
Run the Script:
In the command line interface, navigate to the directory where your script is located.
Run the script using Python. For example:
python s3_to_clickhouse.py
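Following the error-handling note above, a minimal pattern is to wrap the per-file work in a try/except so one bad object does not abort the whole run. This sketch reuses the names from the script above:

# Per-file error handling (sketch): log failures and continue.
failed_keys = []
for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
    for obj in page.get('Contents', []):
        file_key = obj['Key']
        if not file_key.endswith('.parquet'):
            continue
        try:
            body = s3.get_object(Bucket=bucket_name, Key=file_key)['Body'].read()
            df = pd.read_parquet(BytesIO(body))
            client.insert(clickhouse_table, df.values.tolist(), column_names=list(df.columns))
        except Exception as exc:  # broad on purpose in this sketch; narrow it in production
            print(f"Failed to process {file_key}: {exc}")
            failed_keys.append(file_key)
print(f"Done; {len(failed_keys)} file(s) failed.")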