
Data Lakehouse Medallion Architecture (Bronze/Silver/Gold): Silver Zone and Data Persistence

When it comes to Data Lakehouse architecture and promoting data from Bronze (Raw) to Gold (Curated), what practices have you actually seen in use?

Silver (Enriched): which data transformations happen in Silver? Is the data model kept close to the source data model, or is it remodeled to present enterprise data entities? The Silver zone is where I have found the most variation in definitions across publications.

  • Some still advocate keeping Silver source-aligned: just deduplicating the data (extracting the latest records), maybe standardising data types and, where applicable, doing some technical data cleansing.
  • Others already transform and integrate data in Silver to create enterprise-level tables for further use cases (such as combined master tables), which means appending or merging data from different sources and renaming fields. In this case, the Gold layer contains use-case-specific derived datasets.
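
To make the two variants concrete, here is a minimal sketch in plain Python rather than BigQuery SQL. It is only an illustration of the distinction, not a recommended implementation: the two source systems (`crm_bronze`, `erp_bronze`), all field names, and the helper functions `latest_per_key` and `master_customers` are hypothetical.

```python
from datetime import datetime

# Hypothetical Bronze records from two source systems (all field names are assumptions).
crm_bronze = [
    {"customer_id": "1", "name": "Ann", "updated_at": "2024-01-01T10:00:00"},
    {"customer_id": "1", "name": "Ann B.", "updated_at": "2024-02-01T09:00:00"},
    {"customer_id": "2", "name": "Bob", "updated_at": "2024-01-15T12:00:00"},
]
erp_bronze = [
    {"cust_no": "2", "country": "DE", "updated_at": "2024-01-20T08:00:00"},
    {"cust_no": "3", "country": "FR", "updated_at": "2024-01-05T08:00:00"},
]


def latest_per_key(rows, key):
    """Variant 1, source-aligned Silver: keep the latest record per key
    and standardise the timestamp type. The source schema is unchanged."""
    latest = {}
    for row in rows:
        typed = dict(row, updated_at=datetime.fromisoformat(row["updated_at"]))
        current = latest.get(typed[key])
        if current is None or typed["updated_at"] > current["updated_at"]:
            latest[typed[key]] = typed
    return latest


def master_customers(crm, erp):
    """Variant 2, enterprise-level Silver: rename keys and merge both
    sources into one combined master customer entity."""
    merged = {}
    for cid, row in crm.items():
        merged[cid] = {"customer_id": cid, "name": row["name"], "country": None}
    for cid, row in erp.items():
        entry = merged.setdefault(
            cid, {"customer_id": cid, "name": None, "country": None}
        )
        entry["country"] = row["country"]
    return merged


# Variant 1 yields one Silver table per source, still in the source's shape.
silver_crm = latest_per_key(crm_bronze, "customer_id")
silver_erp = latest_per_key(erp_bronze, "cust_no")

# Variant 2 additionally integrates the sources into one enterprise entity.
silver_master = master_customers(silver_crm, silver_erp)
```

In the first variant, something like `silver_master` would only appear in Gold (or in a layer in between); in the second variant it already lives in Silver, and Gold holds the use-case-specific datasets derived from it.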

Data Persistence: where and when should data be persisted, and where should views or cached views be used instead, across Bronze (Raw), Silver (Enriched) and Gold (Curated)? Let's assume all zones are represented in BigQuery.

Thank you for sharing your thoughts.

Just as an example, here is the Silver definition from Databricks: "In the Silver layer of the lakehouse, the data from the Bronze layer is matched, merged, conformed and cleansed ("just-enough") so that the Silver layer can provide an "Enterprise view" of all its key business entities, concepts and transactions. (e.g. master customers, stores, non-duplicated transactions and cross-reference tables)." But in the hands-on content I have read (blogs and so on), Silver does not work towards an "enterprise view".

Google's "Building the analytics lakehouse on Google Cloud" whitepaper uses a Raw/Enriched/Curated zone approach, but the Silver (Enriched) zone is not specifically defined, apart from: "In the Enriched Zone, schema is well enforced, data governance and quality rules have been applied on this data (e.g., sensitive data is anonymized), and data is cleansed and optimized for most common consumption patterns".
