
GCS Storage: Parquet file single-record operations

Hi,

Can someone please help with the concept below? We are planning to create our own customized audit table to hold info such as job_name, table_name, record_count, status, etc.

I tried researching this but couldn't get a clear understanding. Suppose I have a Parquet file stored in a GCS bucket, e.g. /tmp/file/data.parquet, and let's assume the file has 100 records.

Now, when adding a new record to the existing Parquet file, is the new record stored in a different block, or is it appended to the same block?

Note: I do understand that GCS is an object storage service, but I think that at a lower level, data addition and retrieval happen at the block level on disk.

In Hadoop, we have a health-check command that details a file's storage information. Do we have any similar command in GCS?

Thanks!

Solved
1 ACCEPTED SOLUTION

Yes 🙂 The closest you can get is composing objects. If you need to append, the typical pattern is to stream small files into a landing area and have a separate piece of code aggregate them. You can do this either by depositing smaller files into GCS (don't do this unless you can pre-aggregate into files larger than ~1 MB), or by landing the events or records in Cloud Pub/Sub or Apache Kafka. You may also consider writing your data straight to BigQuery using the streaming API, or using Cloud Pub/Sub, which has a built-in integration with BigQuery.
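As an illustration, here is a minimal sketch of composing objects with the Python client library; the bucket and object names are hypothetical. Keep in mind that compose concatenates object bytes server-side, so it suits line-oriented formats such as CSV or NDJSON; concatenating two Parquet files this way does not produce a valid Parquet file, because Parquet keeps its metadata in a footer.

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-audit-bucket")  # hypothetical bucket

# Compose concatenates the source objects' bytes into a new object
# server-side. GCS objects are immutable, so this is the closest
# thing to an "append" the service offers.
base = bucket.blob("landing/events-part1.csv")
delta = bucket.blob("landing/events-part2.csv")

combined = bucket.blob("landing/events-combined.csv")
combined.compose([base, delta])
```

For the audit-table use case in the question, a sketch of the BigQuery streaming API instead; the project, dataset, table, and schema here are assumptions:

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.audit_dataset.job_audit"  # hypothetical table

# Each dict maps column names to values; the table must already
# exist with a matching schema.
rows = [{
    "job_name": "daily_sales_load",
    "table_name": "sales",
    "record_count": 100,
    "status": "SUCCESS",
}]

errors = client.insert_rows_json(table_id, rows)
if errors:
    raise RuntimeError(f"Streaming insert failed: {errors}")
```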
