Get hands-on experience with 20+ free Google Cloud products and $300 in free credit for new customers.

Need help in finding suitable Data lake options in Google Cloud

I have limited understanding of google cloud. I'm seeking assistance in finding a suitable solution for implementing data lake due to below requirement:

  • I need to implement a data warehouse using BigQuery.

  • I am planning to create data lake and data warehouse both. So that i can only migrate the required tables to data warehouse and also keep all the data/tables in data lake for on demand availability and readability.

  • The Source has completely structured data in Google Cloud SQL (Postgresql).

  • I am not able to decide which cloud service i can use for implementing the data lake.

My Data lake expectations are:

  • I need to migrate the whole table from Google Cloud SQL (Source) to Data Lake (Target) initially.

  • After that i need to schedule a daily batch to sync the incremental/last day transactions on daily basis into data lake.

  • I may need to query this data lake table on daily basis.

Need Decision on:

  • I am not sure about using BigQuery as the data lake because read cost is very high in BigQuery. Reading requirements from these data lake tables will be very simple (select * from table).

  • I am not sure if i can implement data lake in GCS (Google Cloud Storage) because files stored in GCS are immutable. So, As per my understanding i will be having one .avro file for all time data and then for each day separate .avro file will be created in GCS.

    • I am not sure how i will be able to read all the files as a single table.

    • I am assuming that i need to create separate external table in BigQuery whenever i need to read data from GCS .avro file. It may result in less cost as i will be only reading one day data.

    • I will able to debug things related to finding data origin and all.

    • But analysis may not be possible. Please correct me if i am wrong.

  • The last option i find is to use Cloud SQL managed instance. This options is also not seems to be cost efficient as i do not need to perform many write transactions. It is just a write once a day and read multiple times a day.

I need solution for finding suitable data lake option as i am a beginner in google cloud. Please help me in deciding the suitable data lake option. I know i can be wrong about my understanding. Please feel free to correct me. I am open to new suggestion but only in google cloud.

0 1 370
1 REPLY 1