Hi
In the code below, I am doing the following steps:
1) Reading several dataframes from CSVs in a bucket
2) Appending each into a list
3) Concatenating them
4) Loading the result into a BigQuery table
5) Once I have read a dataframe from a CSV in the bucket, copying the CSV into an archive folder and deleting the original
My questions are:
a) Is the code written following best practices?
b) Is there any harm in moving the CSV into an archive folder in the same bucket immediately after reading it?
c) Is there any chance of the CSV getting corrupted?
@ms4446 can you please comment?
Consider the code below:
import logging
from io import BytesIO

import pandas as pd
from google.cloud import storage

# gcs_bucket, COLUMN_MAPPING, destination_project, destination_dataset,
# and destination_table are defined elsewhere in the job.
dataframes = []
dataframes_consolidate = []

storage_client = storage.Client()
source_bucket = storage_client.bucket(gcs_bucket)
blobs = list(source_bucket.list_blobs(match_glob="output/*.csv"))
for blob in blobs:
    logging.info("Processing blob: %s", blob.name)
    df = pd.read_csv(
        BytesIO(blob.download_as_bytes()), encoding="unicode_escape", dtype="string"
    )
    # Keep a copy of the original dataframe before renaming columns
    original_df = df.copy()
    # Add any mapped columns that are missing from this CSV
    for column in COLUMN_MAPPING:
        if column not in df.columns:
            df[column] = pd.NA
    df = df.rename(columns=COLUMN_MAPPING)[[*COLUMN_MAPPING.values()]]
    if df["repository_name"].isna().all():
        df["repository_name"] = blob.name
        original_df["repository_name"] = blob.name
    dataframes_consolidate.append(original_df)
    # Archive the source CSV, then delete the original
    destination_blob_name = "archive/" + blob.name.split("/")[-1]
    _ = source_bucket.copy_blob(blob, source_bucket, destination_blob_name)
    source_bucket.delete_blob(blob.name)
    dataframes.append(df)

merged_df = pd.concat(dataframes, ignore_index=True)
merged_df.to_gbq(
    destination_table=f"{destination_dataset}.{destination_table}",
    project_id=destination_project,
    if_exists="replace",
    progress_bar=False,
)
Hello,
Thank you for contacting Google Cloud Community!
Here are the answers to your questions:
1. The provided code incorporates best practices.
2. Moving CSVs to an archive folder within the same bucket after reading them is a common practice and can be safe if implemented with proper error handling and considerations for data retention (a minimal sketch follows this list).
3. While there's a slight possibility of CSV corruption, it's generally low when using GCS and implementing best practices, such as checksum verification on download.
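For point 2, the safe pattern is to copy first and delete the source only after the copy has succeeded; for point 3, you can ask the client library to verify a checksum on download so a corrupted transfer raises an error instead of silently parsing bad bytes. Here is a minimal sketch of both, assuming the same source_bucket and blob objects as in your loop; the read_and_archive helper name and the exception handling choices are illustrative, not part of your code:

import logging
from io import BytesIO

import pandas as pd
from google.cloud import exceptions as gcloud_exceptions


def read_and_archive(source_bucket, blob):
    # Verify the payload against the server-side MD5 checksum so a
    # corrupted download raises instead of returning bad bytes.
    data = blob.download_as_bytes(checksum="md5")
    df = pd.read_csv(BytesIO(data), encoding="unicode_escape", dtype="string")

    destination_blob_name = "archive/" + blob.name.split("/")[-1]
    try:
        # Copy first; delete the original only once the copy succeeded,
        # so a failure mid-archive never loses the source file.
        source_bucket.copy_blob(blob, source_bucket, destination_blob_name)
        source_bucket.delete_blob(blob.name)
    except gcloud_exceptions.GoogleCloudError:
        logging.exception("Archiving failed for %s; source left in place", blob.name)
    return df

Note that copy-then-delete is not atomic: a crash between the two calls can leave the file in both places, so it helps to make the downstream load tolerant of a duplicate archive object (or to check for an existing archive copy before re-processing).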
Regards,
Jai Ade