
Migrate data from S3 bucket (Parquet files) and insert it into a ClickHouse database using Python

Reading data from S3 that is stored in folders and in Parquet file format, and then inserting it into a ClickHouse database with a specific schema, requires a few additional steps compared to handling CSV files. Here's how you can do it in Python:

1. Install Required Packages: You need `boto3` for AWS S3 access, `pandas` and `pyarrow` for reading Parquet data, and `clickhouse_connect` for ClickHouse database interaction. Install these packages using pip:

   pip install boto3 pandas pyarrow clickhouse_connect

2. Set Up AWS Credentials: Ensure your AWS credentials are correctly configured for access to S3 (a short sketch follows this list).

3. Read Parquet Data from S3: Use Boto3 to list and access the Parquet files in the specified S3 bucket and folders.

4. Convert Parquet Data to DataFrame: Read the Parquet files into a Pandas DataFrame.

5. Insert Data into ClickHouse: Use `clickhouse_connect` to insert the data into the ClickHouse database.
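
For step 2, boto3 resolves credentials automatically from the standard chain (environment variables, the ~/.aws/credentials file, or an attached IAM role), so often no code is needed. If you do want to pass them explicitly, here is a minimal sketch; the key values and region are placeholders:

import boto3

# A sketch of explicit credential configuration; in practice, prefer
# environment variables, ~/.aws/credentials, or an IAM role.
session = boto3.session.Session(
    aws_access_key_id='YOUR_ACCESS_KEY_ID',          # placeholder
    aws_secret_access_key='YOUR_SECRET_ACCESS_KEY',  # placeholder
    region_name='us-east-1',                         # your bucket's region
)
s3 = session.client('s3')

With credentials in place, the full script looks like this: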

import boto3
import pandas as pd
import clickhouse_connect
from io import BytesIO

# AWS S3 Configuration
s3 = boto3.client('s3')
bucket_name = 'your-bucket-name'
prefix = 'your-folder-path/'  # Include trailing slash

# ClickHouse Configuration
clickhouse_host = 'your-clickhouse-host'
clickhouse_port = 8443  # clickhouse_connect uses the HTTP(S) interface; 8443 is the secure HTTP port, not the native 9000
clickhouse_user = 'your-user'
clickhouse_password = 'your-password'
clickhouse_db = 'your-db'
clickhouse_table = 'your-table'

# Connect to ClickHouse
client = clickhouse_connect.get_client(
    host=clickhouse_host,
    port=clickhouse_port,
    username=clickhouse_user,
    password=clickhouse_password,
    database=clickhouse_db,
    secure=True,  # TLS over the HTTP interface; pairs with port 8443
)

# List and Read Parquet files from S3
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
    for obj in page.get('Contents', []):  # 'Contents' is missing when the prefix matches nothing
        file_key = obj['Key']
        if not file_key.endswith('.parquet'):
            continue  # skip folder markers and any non-Parquet objects
        response = s3.get_object(Bucket=bucket_name, Key=file_key)
        file_content = response['Body'].read()
        
        # Read Parquet file into DataFrame
        df = pd.read_parquet(BytesIO(file_content))

        # Process your DataFrame as needed
        # df = process_your_dataframe(df)

        # Insert data into ClickHouse
        # Ensure that the DataFrame columns match the ClickHouse table schema
        columns = list(df.columns)
        data = df.values.tolist()

        client.insert(clickhouse_table, data, column_names=columns)

print("Data inserted successfully into ClickHouse.")

Important Notes:

  • Replace `your-folder-path/` with the specific folder path in your S3 bucket where the Parquet files are stored.

  • Ensure that the DataFrame columns match the schema of your ClickHouse table.

  • The `process_your_dataframe(df)` placeholder represents any data transformations you might need to align the DataFrame with your ClickHouse schema (a sketch follows this list).

  • Handle exceptions and errors, especially for network operations and database interactions.

  • The code assumes your Parquet files are structured in a way that Pandas can directly convert them into a DataFrame. If your Parquet files have a complex structure, you might need additional processing.
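
To make the transformation and error-handling notes concrete, here is a minimal sketch. The column names (event_time, device_id, value) and their target types are hypothetical placeholders for your own schema, and load_one_file is simply the body of the per-file loop refactored into a function:

import pandas as pd
from io import BytesIO

def process_your_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical schema: (event_time DateTime, device_id UInt32, value Float64).
    # Cast each column so its dtype matches the ClickHouse column type.
    df['event_time'] = pd.to_datetime(df['event_time'])
    df['device_id'] = df['device_id'].astype('uint32')
    df['value'] = df['value'].astype('float64')
    return df

def load_one_file(client, table, file_key, file_content):
    # Read, transform, and insert a single Parquet object; return True on success.
    try:
        df = process_your_dataframe(pd.read_parquet(BytesIO(file_content)))
        client.insert(table, df.values.tolist(), column_names=list(df.columns))
        return True
    except Exception as exc:
        # Log and move on so one bad file does not abort the whole migration
        print(f'Failed to load {file_key}: {exc}')
        return False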


Run the Script:

  • In the command line interface, navigate to the directory where your script is located.

  • Run the script using Python. For example:

python s3_to_clickhouse.py
