
CSVs to dataframe to BigQuery - GCS best practice

Hi

Consider the code below, where I am doing the following steps:

1) Reading several dataframes from CSVs in a bucket

2) Appending each into a list

3) Concatenating

4) Putting into a BigQuery table

5) After reading a dataframe from a CSV in the bucket, I copy the CSV into an archive folder and delete the original

 

My question is:

a) Is the code written following best practices?

b) Is there any harm in moving the CSV into an archive folder in the same bucket immediately after reading it?

c) Is there any chance of the CSV getting corrupted?

@ms4446 can you please comment?


Here is the code:

    from io import BytesIO
    import logging

    import pandas as pd
    from google.cloud import storage

    # gcs_bucket, COLUMN_MAPPING, destination_project, destination_dataset and
    # destination_table are defined elsewhere in the module.
    storage_client = storage.Client()
    source_bucket = storage_client.bucket(gcs_bucket)

    blobs = list(source_bucket.list_blobs(match_glob="output/*.csv"))

    dataframes = []
    dataframes_consolidate = []

    for blob in blobs:
        logging.info("Processing blob: %s", blob.name)
        df = pd.read_csv(
            BytesIO(blob.download_as_bytes()), encoding="unicode_escape", dtype="string"
        )

        # Keep a copy of the original dataframe before renaming columns
        original_df = df.copy()

        # Add any expected columns that are missing, then rename and reorder
        # to match the target schema
        for column in COLUMN_MAPPING:
            if column not in df.columns:
                df[column] = pd.NA
        df = df.rename(columns=COLUMN_MAPPING)[[*COLUMN_MAPPING.values()]]

        if df["repository_name"].isna().all():
            df["repository_name"] = blob.name
            original_df["repository_name"] = blob.name
            dataframes_consolidate.append(original_df)
            # Archive the source CSV: copy it into archive/ and delete the original
            destination_blob_name = "archive/" + blob.name.split("/")[-1]
            _ = source_bucket.copy_blob(blob, source_bucket, destination_blob_name)
            source_bucket.delete_blob(blob.name)
        dataframes.append(df)

    merged_df = pd.concat(dataframes, ignore_index=True)
    merged_df.to_gbq(
        destination_table=f"{destination_dataset}.{destination_table}",
        project_id=destination_project,
        if_exists="replace",
        progress_bar=False,
    )
ACCEPTED SOLUTION

Hello,

Thank you for contacting Google Cloud Community!

Here are the answers to your questions:
1. Yes, the provided code incorporates best practices.
2. Moving CSVs to an archive folder within the same bucket after reading them is a common practice, and it is safe if implemented with proper error handling and with your data-retention requirements in mind (see the sketch below).
3. While there is a slight possibility of CSV corruption, it is generally low when using GCS and following best practices (the checksum note in the sketch below covers this).
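On points 2 and 3, here is a minimal sketch of how the archive step could be hardened, assuming the same source_bucket and blob objects as in your snippet and a reasonably recent google-cloud-storage client; archive_blob is a hypothetical helper name, not a library API. Generation preconditions make the copy and delete fail loudly if the object changes underneath you, and requesting a checksum on download (the client's checksum argument, "md5" by default) makes a corrupted transfer raise an error instead of returning bad bytes:

    import logging

    from google.api_core import exceptions

    def archive_blob(source_bucket, blob):
        """Copy a blob into archive/ and delete the original, guarded by
        generation preconditions so a concurrent change is not silently lost."""
        destination_blob_name = "archive/" + blob.name.split("/")[-1]
        try:
            # Fail the copy if the source object changed since it was listed.
            source_bucket.copy_blob(
                blob,
                source_bucket,
                destination_blob_name,
                if_source_generation_match=blob.generation,
            )
            # Fail the delete if the object was replaced between copy and delete.
            source_bucket.delete_blob(blob.name, if_generation_match=blob.generation)
        except exceptions.PreconditionFailed:
            logging.warning("Blob %s changed during archiving; leaving it in place", blob.name)
        except exceptions.GoogleAPIError:
            logging.exception("Failed to archive blob %s", blob.name)
            raise

    # Usage inside the loop from the original snippet:
    #     data = blob.download_as_bytes(checksum="md5")  # raises on checksum mismatch
    #     archive_blob(source_bucket, blob)

One retention consideration: deferring the delete until after to_gbq succeeds means a failed load still leaves the source CSVs in place for a rerun, at the cost of re-reading them.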

Regards,
Jai Ade
