
Need help finding a suitable data lake option in Google Cloud

I have a limited understanding of Google Cloud. I'm seeking assistance in finding a suitable solution for implementing a data lake, given the requirements below:

  • I need to implement a data warehouse using BigQuery.

  • I am planning to create both a data lake and a data warehouse, so that I can migrate only the required tables to the data warehouse while keeping all the data/tables in the data lake for on-demand availability and readability.

  • The source contains fully structured data in Google Cloud SQL (PostgreSQL).

  • I am not able to decide which cloud service I should use for implementing the data lake.

My data lake expectations are:

  • Initially, I need to migrate whole tables from Google Cloud SQL (source) to the data lake (target).

  • After that, I need to schedule a daily batch to sync the previous day's incremental transactions into the data lake.

  • I may need to query these data lake tables on a daily basis.
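
The daily incremental extract described above can be sketched as follows. This is only an illustration under assumed names: the `orders` table and `updated_at` timestamp column are hypothetical placeholders, and the source table is assumed to carry such a last-modified column.

```python
from datetime import date, timedelta

def incremental_query(table: str, ts_column: str, day: date) -> str:
    """Build the SQL for one day's incremental extract.

    `table` and `ts_column` are placeholders -- substitute your own
    Cloud SQL table and its created-at/updated-at timestamp column.
    """
    start = day.isoformat()
    end = (day + timedelta(days=1)).isoformat()
    return (
        f"SELECT * FROM {table} "
        f"WHERE {ts_column} >= '{start}' AND {ts_column} < '{end}'"
    )

# Yesterday's slice, e.g. from a daily scheduled batch:
print(incremental_query("orders", "updated_at", date(2024, 1, 1)))
# SELECT * FROM orders WHERE updated_at >= '2024-01-01' AND updated_at < '2024-01-02'
```

A scheduler such as Cloud Scheduler or cron would call this once per day with yesterday's date.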

I need a decision on:

  • I am not sure about using BigQuery as the data lake, because read costs in BigQuery are very high. The read requirements on these data lake tables will be very simple (`select * from table`).

  • I am not sure if I can implement the data lake in GCS (Google Cloud Storage), because files stored in GCS are immutable. As per my understanding, I would have one .avro file for all historical data, and then a separate .avro file would be created in GCS for each day.

    • I am not sure how I will be able to read all these files as a single table.

    • I am assuming that I would need to create a separate external table in BigQuery whenever I need to read data from a GCS .avro file. This may cost less, as I will only be reading one day's data.

    • I will be able to debug things related to finding the data's origin.

    • But analysis may not be possible. Please correct me if I am wrong.

  • The last option I found is a Cloud SQL managed instance. This option also does not seem cost efficient, as I do not need to perform many write transactions. It is just one write per day and multiple reads per day.

I need help finding a suitable data lake option, as I am a beginner in Google Cloud. Please help me decide on a suitable one. I know my understanding may be wrong, so please feel free to correct me. I am open to new suggestions, but only within Google Cloud.


Hi @RegularDE,

Welcome to Google Cloud Community!

Based on my understanding, you need a data lake and a data warehouse to process the data you store in Cloud SQL.

Below are the details provided in response to your queries:

1. I am not sure about using BigQuery as the data lake, because read costs in BigQuery are very high. The read requirements on these data lake tables will be very simple (`select * from table`).

BigQuery can function as a data lake; however, it is better suited to analytics than to serving as a repository for raw data. You are right that BigQuery can incur higher costs, particularly when performing full table scans, because pricing is based on the volume of data processed. To mitigate this, you can apply optimization techniques such as partitioning and clustering to improve query performance and manage costs effectively.
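
As a hedged illustration of one such technique (the dataset, table, and column names below are assumptions, not taken from your setup), a date-partitioned table means a query filtered to one day is billed for that partition's bytes rather than the whole table:

```python
# Hypothetical names throughout (lake.orders, updated_at, customer_id).
# BigQuery DDL: a date-partitioned, clustered copy of the staging data.
ddl = """\
CREATE TABLE IF NOT EXISTS lake.orders
PARTITION BY DATE(updated_at)
CLUSTER BY customer_id
AS SELECT * FROM lake.orders_staging
"""

# Because the filter hits the partition column, this scans one
# partition's data rather than the full table:
daily_query = (
    "SELECT * FROM lake.orders "
    "WHERE DATE(updated_at) = '2024-01-01'"
)
print(daily_query)
```

Clustering on a frequently filtered column (here the hypothetical `customer_id`) can further prune the data scanned within each partition.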

2. I am not sure if I can implement the data lake in GCS (Google Cloud Storage), because files stored in GCS are immutable. As per my understanding, I would have one .avro file for all historical data, and then a separate .avro file would be created in GCS for each day.

Regarding Google Cloud Storage, you are correct about its immutable nature: once an object is successfully uploaded, it cannot be modified in place. As mentioned in the discussion and confirmed by the official documentation, you can only update an object by overwriting it with a new upload.
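
One common way to work with that immutability, as you already describe, is to write one new object per table per day rather than rewriting anything. The sketch below uses assumed bucket and table names and an assumed hive-style `dt=` naming convention, which is not a GCS requirement:

```python
from datetime import date

def daily_avro_path(bucket: str, table: str, day: date) -> str:
    """One immutable Avro object per table per day; nothing is rewritten.

    The gs:// layout below is an assumed convention, chosen because
    hive-style `dt=` prefixes are easy to query by date later.
    """
    return f"gs://{bucket}/{table}/dt={day.isoformat()}/data.avro"

print(daily_avro_path("my-lake", "orders", date(2024, 1, 1)))
# gs://my-lake/orders/dt=2024-01-01/data.avro
```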

- I am not sure how I will be able to read all these files as a single table. I am assuming that I would need to create a separate external table in BigQuery whenever I need to read data from a GCS .avro file. This may cost less, as I will only be reading one day's data.

You do not need a separate external table per file. For guidance on reading data from Cloud Storage into a single table in BigQuery, you may refer to this documentation for detailed insights.
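
If the daily Avro files share a common prefix, a single BigQuery external table with a wildcard URI can expose all of them as one queryable table. The bucket name, dataset name, and hive-style `dt=YYYY-MM-DD` layout below are assumptions for illustration:

```python
# Assumed layout: gs://my-lake/orders/dt=YYYY-MM-DD/data.avro
# BigQuery DDL: one external table over every daily file; the
# hive-style `dt` prefix becomes a queryable partition column.
external_ddl = """\
CREATE EXTERNAL TABLE lake.orders_ext
WITH PARTITION COLUMNS
OPTIONS (
  format = 'AVRO',
  uris = ['gs://my-lake/orders/*'],
  hive_partition_uri_prefix = 'gs://my-lake/orders'
)
"""

# Filtering on `dt` reads only that day's object, matching your goal
# of paying to read just one day's data:
one_day = "SELECT * FROM lake.orders_ext WHERE dt = '2024-01-01'"
print(one_day)
```

This also answers the analysis concern: the external table supports the same SQL you would run on a native BigQuery table, just with somewhat lower query performance than BigQuery-managed storage.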

- But analysis may not be possible. Please correct me if I am wrong.

GCS is mostly used as a storage solution, but you can use BigQuery external tables to query and analyze the data stored there.

3. The last option I found is a Cloud SQL managed instance. This option also does not seem cost efficient, as I do not need to perform many write transactions. It is just one write per day and multiple reads per day.

While Cloud SQL supports structured data and SQL querying, it is best suited to transactional workloads, not large-scale analytical storage.

This Medium article on Building a Data Lake and Warehouse on GCP could be particularly helpful for you. You may also want to review this related Stack Overflow discussion, which explains the respective purposes of BigQuery and Cloud SQL.

I hope the above information is helpful.