Hi,
Can someone please help with the concept below? We are planning to create our own customized audit table to hold info such as job_name, table_name, record_count, status, etc.
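For context, a rough sketch of the kind of audit record we have in mind (the timestamp field is just a placeholder we may add later):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class AuditRecord:
    job_name: str            # name of the load/ETL job
    table_name: str          # target table the job wrote to
    record_count: int        # number of records processed
    status: str              # e.g. "SUCCESS" or "FAILED"
    run_timestamp: datetime  # placeholder field, when the job finished
```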
I tried researching this but couldn't get a clear understanding. Suppose I have a Parquet file stored in a GCS bucket, e.g. /tmp/file/data.parquet, and let's assume the file has 100 records.
Now, when adding a new record to the existing Parquet file, is the new record stored in a different block, or is it appended to the same block?
Note: I do understand that GCS is an object storage service, but I think that at a lower level, data addition and retrieval happen at the block level on disk.
In Hadoop, we have a health-check command (hdfs fsck) that details a file's storage information. Do we have any similar command in GCS?
Thanks!
Hi there. That's a good question. Cloud Storage objects are immutable. There is no notion of blocks like you might see in Azure's Blob Storage. So the only way to write new records or to delete records is to create new objects. This means that any such operations would generate standard admin audit logs.
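To make that concrete, here is a minimal sketch of what an "append" ends up being in practice, assuming pandas with pyarrow and gcsfs installed; the bucket name and column names are placeholders:

```python
# Because the object is immutable, the pattern is read -> modify -> rewrite,
# which replaces the object rather than appending to it in place.
import pandas as pd

path = "gs://your-bucket/tmp/file/data.parquet"  # placeholder bucket name

df = pd.read_parquet(path)                       # download the existing 100 records
new_row = pd.DataFrame([{"id": 101, "value": "new record"}])  # hypothetical columns
df = pd.concat([df, new_row], ignore_index=True)

df.to_parquet(path, index=False)                 # writes a brand-new object under the same name
```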
Hi, thanks for the reply. So it means that we cannot append new records to the existing GCS Parquet file, since it is immutable?
Yes 🙂 The closest you can get is composing objects. The typical pattern, if you need to append, is to stream small files into a landing area and have them aggregated by a separate piece of code. You can do this either by depositing smaller files into GCS (don't do this unless you can pre-aggregate into files larger than ~1MB) or by landing the events or records in Cloud Pub/Sub or Apache Kafka. You may also consider just writing your data to BigQuery using the streaming API, or using Cloud Pub/Sub, which has a built-in integration with BigQuery.
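As a rough illustration of the compose option, here is a sketch using the google-cloud-storage Python client; the bucket and object names are placeholders. Note that compose concatenates the raw bytes of up to 32 source objects, so it suits record-oriented formats (e.g. newline-delimited files) better than Parquet:

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("your-landing-bucket")    # placeholder bucket name

# Smaller objects previously deposited into the landing area.
sources = [bucket.blob("landing/part-001"), bucket.blob("landing/part-002")]

# Compose creates one new, larger object from the sources' bytes.
combined = bucket.blob("aggregated/combined")
combined.compose(sources)
```

If the audit rows themselves end up in BigQuery, the streaming path (insert_rows_json in the Python client) avoids touching GCS objects entirely.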
Thank you for your help.