Hi
Consider the code below. Here I am doing the following steps:
1) Reading several dataframes from CSVs in a bucket
2) Appending each into a list
3) Concatenating the list into a single dataframe
4) Loading the result into a BigQuery table
5) After reading each dataframe from its CSV in the bucket, copying the CSV into an archive folder and deleting the original
My questions are:
a) Is the code written following best practices?
b) Is there any harm in moving the CSV into the archive folder in the same bucket immediately after reading it?
c) Is there any likelihood of the CSV getting corrupted?
@ms4446 can you please comment?
Consider the code below:
import logging
from io import BytesIO

import pandas as pd
from google.cloud import storage

# gcs_bucket, COLUMN_MAPPING, destination_project, destination_dataset and
# destination_table are assumed to be defined elsewhere in the script.
dataframes = []
dataframes_consolidate = []

storage_client = storage.Client()
source_bucket = storage_client.bucket(gcs_bucket)
blobs = list(source_bucket.list_blobs(match_glob="output/*.csv"))

for blob in blobs:
    logging.info("Processing blob: %s", blob.name)
    df = pd.read_csv(
        BytesIO(blob.download_as_bytes()), encoding="unicode_escape", dtype="string"
    )

    # Keep a copy of the original dataframe before renaming columns
    original_df = df.copy()

    # Ensure every expected source column exists before renaming and reordering
    for column in COLUMN_MAPPING:
        if column not in df.columns:
            df[column] = pd.NA
    df = df.rename(columns=COLUMN_MAPPING)[[*COLUMN_MAPPING.values()]]

    # Fall back to the blob name when the file carries no repository name
    if df["repository_name"].isna().all():
        df["repository_name"] = blob.name
        original_df["repository_name"] = blob.name
    dataframes_consolidate.append(original_df)

    # Archive the processed CSV within the same bucket, then delete the original
    destination_blob_name = "archive/" + blob.name.split("/")[-1]
    _ = source_bucket.copy_blob(blob, source_bucket, destination_blob_name)
    source_bucket.delete_blob(blob.name)

    dataframes.append(df)

merged_df = pd.concat(dataframes, ignore_index=True)
merged_df.to_gbq(
    destination_table=f"{destination_dataset}.{destination_table}",
    project_id=destination_project,
    if_exists="replace",
    progress_bar=False,
)
Hello,
Thank you for contacting Google Cloud Community!
Here are the answers to your questions:
1. The provided code incorporates best practices.
2. Moving CSVs to an archive folder within the same bucket after reading them is a common practice and can be safe if implemented with proper error handling and considerations for data retention; the first sketch after this list shows one way to harden that step.
3. While there's a slight possibility of CSV corruption, it's generally low when using GCS and implementing best practices; the second sketch below shows one way to verify a download against GCS's stored checksum.
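To make point 2 concrete, here is a minimal sketch of how the copy-then-delete step in the loop could be hardened. The helper name archive_blob and the archive_prefix parameter are illustrative, not part of the original code; the sketch pins the copy to the generation that was read and makes the delete conditional on that same generation, so an object that is overwritten mid-run is neither archived incorrectly nor deleted by mistake.

import logging

from google.api_core import exceptions as api_exceptions


def archive_blob(bucket, blob, archive_prefix="archive/"):
    """Copy a processed blob into the archive folder, then delete the original."""
    destination_name = archive_prefix + blob.name.split("/")[-1]
    try:
        # Copy exactly the generation that was downloaded and parsed.
        bucket.copy_blob(
            blob,
            bucket,
            destination_name,
            source_generation=blob.generation,
        )
        # Delete only if the source object has not changed in the meantime.
        bucket.delete_blob(blob.name, if_generation_match=blob.generation)
    except api_exceptions.PreconditionFailed:
        # The object was overwritten after it was read; leave it for the next run.
        logging.warning("Blob %s changed during processing; original kept", blob.name)
    except api_exceptions.GoogleAPICallError:
        # Copy or delete failed; the source object is still in place, so the
        # run can be retried without data loss.
        logging.exception("Failed to archive %s", blob.name)
        raise

Inside the loop, the three archiving lines would then collapse to a single call such as archive_blob(source_bucket, blob).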
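For point 3, GCS stores a CRC32C checksum for every object, so one way to guard against a corrupted read is to compare that stored value with the checksum of the bytes actually downloaded before parsing them into a dataframe. This is only a sketch: read_csv_verified is a hypothetical helper, and it assumes the google-crc32c package is available (it is typically installed alongside google-cloud-storage).

import base64
from io import BytesIO

import google_crc32c
import pandas as pd


def read_csv_verified(blob):
    """Download a blob, verify its CRC32C against GCS metadata, and parse it."""
    data = blob.download_as_bytes()
    checksum = google_crc32c.Checksum()
    checksum.update(data)
    local_crc32c = base64.b64encode(checksum.digest()).decode("utf-8")
    # blob.crc32c is the base64-encoded CRC32C that GCS stores for the object.
    if blob.crc32c and local_crc32c != blob.crc32c:
        raise ValueError(f"Checksum mismatch for {blob.name}; refusing to load")
    return pd.read_csv(BytesIO(data), encoding="unicode_escape", dtype="string")

In the loop above, the pd.read_csv(BytesIO(blob.download_as_bytes()), ...) call could then be replaced with df = read_csv_verified(blob).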
Regards,
Jai Ade