Hi
Consider the code below. Here I am doing the following steps:
1) Reading several dataframes from CSVs in a bucket
2) Appending each into a list
3) Concatenating the list into a single dataframe
4) Loading the result into a BigQuery table
5) After reading each dataframe from its CSV in the bucket, copying the CSV into an archive folder and deleting the original
My questions are:
a) Is the code written following best practices?
b) Is there any harm in moving the CSV into the archive folder in the same bucket immediately after reading it?
c) Is there any likelihood of the CSV getting corrupted?
@ms4446 can you please comment?
Consider the code below:
import logging
from io import BytesIO

import pandas as pd
from google.cloud import storage

# gcs_bucket, COLUMN_MAPPING, destination_project, destination_dataset and
# destination_table are assumed to be defined elsewhere in the script.
dataframes = []
dataframes_consolidate = []

storage_client = storage.Client()
source_bucket = storage_client.bucket(gcs_bucket)
blobs = list(source_bucket.list_blobs(match_glob="output/*.csv"))

for blob in blobs:
    logging.info("Processing blob: %s", blob.name)
    df = pd.read_csv(
        BytesIO(blob.download_as_bytes()), encoding="unicode_escape", dtype="string"
    )

    # Keep a copy of the original dataframe before renaming columns
    original_df = df.copy()

    # Ensure every expected source column exists before renaming and reordering
    for column in COLUMN_MAPPING:
        if column not in df.columns:
            df[column] = pd.NA
    df = df.rename(columns=COLUMN_MAPPING)[[*COLUMN_MAPPING.values()]]

    # Fall back to the blob name when the file carries no repository name
    if df["repository_name"].isna().all():
        df["repository_name"] = blob.name
        original_df["repository_name"] = blob.name
    dataframes_consolidate.append(original_df)

    # Archive the processed CSV within the same bucket, then delete the original
    destination_blob_name = "archive/" + blob.name.split("/")[-1]
    _ = source_bucket.copy_blob(blob, source_bucket, destination_blob_name)
    source_bucket.delete_blob(blob.name)

    dataframes.append(df)

merged_df = pd.concat(dataframes, ignore_index=True)
merged_df.to_gbq(
    destination_table=f"{destination_dataset}.{destination_table}",
    project_id=destination_project,
    if_exists="replace",
    progress_bar=False,
)
Hello,
Thank you for contacting Google Cloud Community!
Here are the answers to your questions:
1. The provided code incorporates best practices.
2. Moving CSVs to an archive folder within the same bucket after reading them is a common practice and can be safe if implemented with proper error handling and considerations for data retention; the first sketch after this list shows one way to harden that step.
3. While there's a slight possibility of CSV corruption, it's generally low when using GCS and implementing best practices; the second sketch below shows one way to verify a download against GCS's stored checksum.
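To make point 2 concrete, here is a minimal sketch of how the copy-then-delete step in the loop could be hardened. The helper name archive_blob and the archive_prefix parameter are illustrative, not part of the original code; the sketch pins the copy to the generation that was read and makes the delete conditional on that same generation, so an object that is overwritten mid-run is neither archived incorrectly nor deleted by mistake.

import logging

from google.api_core import exceptions as api_exceptions


def archive_blob(bucket, blob, archive_prefix="archive/"):
    """Copy a processed blob into the archive folder, then delete the original."""
    destination_name = archive_prefix + blob.name.split("/")[-1]
    try:
        # Copy exactly the generation that was downloaded and parsed.
        bucket.copy_blob(
            blob,
            bucket,
            destination_name,
            source_generation=blob.generation,
        )
        # Delete only if the source object has not changed in the meantime.
        bucket.delete_blob(blob.name, if_generation_match=blob.generation)
    except api_exceptions.PreconditionFailed:
        # The object was overwritten after it was read; leave it for the next run.
        logging.warning("Blob %s changed during processing; original kept", blob.name)
    except api_exceptions.GoogleAPICallError:
        # Copy or delete failed; the source object is still in place, so the
        # run can be retried without data loss.
        logging.exception("Failed to archive %s", blob.name)
        raise

Inside the loop, the three archiving lines would then collapse to a single call such as archive_blob(source_bucket, blob).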
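For point 3, GCS stores a CRC32C checksum for every object, so one way to guard against a corrupted read is to compare that stored value with the checksum of the bytes actually downloaded before parsing them into a dataframe. This is only a sketch: read_csv_verified is a hypothetical helper, and it assumes the google-crc32c package is available (it is typically installed alongside google-cloud-storage).

import base64
from io import BytesIO

import google_crc32c
import pandas as pd


def read_csv_verified(blob):
    """Download a blob, verify its CRC32C against GCS metadata, and parse it."""
    data = blob.download_as_bytes()
    checksum = google_crc32c.Checksum()
    checksum.update(data)
    local_crc32c = base64.b64encode(checksum.digest()).decode("utf-8")
    # blob.crc32c is the base64-encoded CRC32C that GCS stores for the object.
    if blob.crc32c and local_crc32c != blob.crc32c:
        raise ValueError(f"Checksum mismatch for {blob.name}; refusing to load")
    return pd.read_csv(BytesIO(data), encoding="unicode_escape", dtype="string")

In the loop above, the pd.read_csv(BytesIO(blob.download_as_bytes()), ...) call could then be replaced with df = read_csv_verified(blob).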
Regards,
Jai Ade