
GCS Storage: Parquet file single-record operations

Hi,

Can someone please help with the concept below? We are planning to create our own customized audit table to hold info such as job_name, table_name, record_count, status, etc.

I tried researching this but couldn't get a clear understanding. Suppose I have a Parquet file stored in a GCS bucket, e.g. /tmp/file/data.parquet, and let's assume the file has 100 records.

Now, when adding a new record to the existing Parquet file, is the new record stored in a different block, or is it appended to the same block?

Note: I do understand that GCS is an object storage service, but I think that at a lower level, data addition and retrieval happen at the block level on disk.

In Hadoop, we have a health-check command that details a file's storage information. Do we have any similar command in GCS?

Thanks!

Solved
1 ACCEPTED SOLUTION

Yes 🙂 The closest you can get is composing objects. If you need to append, the typical pattern is to stream small files into a landing area and have a separate piece of code aggregate them. You can do this either by depositing smaller files into GCS (don't do this unless you can pre-aggregate into files larger than ~1 MB), or by landing the events or records in Cloud Pub/Sub or Apache Kafka. You may also consider writing your data straight to BigQuery using the streaming API, or using Cloud Pub/Sub, which has a built-in integration with BigQuery.
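As an illustration, here is a minimal sketch of composing objects with the Python client library; the bucket and object names are hypothetical. Keep in mind that compose concatenates object bytes server-side, so it suits line-oriented formats such as CSV or NDJSON; concatenating two Parquet files this way does not produce a valid Parquet file, because Parquet keeps its metadata in a footer.

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-audit-bucket")  # hypothetical bucket

# Compose concatenates the source objects' bytes into a new object
# server-side. GCS objects are immutable, so this is the closest
# thing to an "append" the service offers.
base = bucket.blob("landing/events-part1.csv")
delta = bucket.blob("landing/events-part2.csv")

combined = bucket.blob("landing/events-combined.csv")
combined.compose([base, delta])
```

For the audit-table use case in the question, a sketch of the BigQuery streaming API instead; the project, dataset, table, and schema here are assumptions:

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.audit_dataset.job_audit"  # hypothetical table

# Each dict maps column names to values; the table must already
# exist with a matching schema.
rows = [{
    "job_name": "daily_sales_load",
    "table_name": "sales",
    "record_count": 100,
    "status": "SUCCESS",
}]

errors = client.insert_rows_json(table_id, rows)
if errors:
    raise RuntimeError(f"Streaming insert failed: {errors}")
```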
