I have limited understanding of google cloud. I'm seeking assistance in finding a suitable solution for implementing data lake due to below requirement:
I need to implement a data warehouse using BigQuery.
I am planning to create data lake and data warehouse both. So that i can only migrate the required tables to data warehouse and also keep all the data/tables in data lake for on demand availability and readability.
The Source has completely structured data in Google Cloud SQL (Postgresql).
I am not able to decide which cloud service i can use for implementing the data lake.
My Data lake expectations are:
I need to migrate the whole table from Google Cloud SQL (Source) to Data Lake (Target) initially.
After that i need to schedule a daily batch to sync the incremental/last day transactions on daily basis into data lake.
I may need to query this data lake table on daily basis.
Need Decision on:
I am not sure about using BigQuery as the data lake because read cost is very high in BigQuery. Reading requirements from these data lake tables will be very simple (select * from table).
I am not sure if i can implement data lake in GCS (Google Cloud Storage) because files stored in GCS are immutable. So, As per my understanding i will be having one .avro file for all time data and then for each day separate .avro file will be created in GCS.
I am not sure how i will be able to read all the files as a single table.
I am assuming that i need to create separate external table in BigQuery whenever i need to read data from GCS .avro file. It may result in less cost as i will be only reading one day data.
I will able to debug things related to finding data origin and all.
But analysis may not be possible. Please correct me if i am wrong.
The last option i find is to use Cloud SQL managed instance. This options is also not seems to be cost efficient as i do not need to perform many write transactions. It is just a write once a day and read multiple times a day.
I need solution for finding suitable data lake option as i am a beginner in google cloud. Please help me in deciding the suitable data lake option. I know i can be wrong about my understanding. Please feel free to correct me. I am open to new suggestion but only in google cloud.