Hi,
We're using Hive tables with Dataproc Metastore synced to Data Catalog. Data Catalog captures the metadata of the Hive tables but not their lineage. We have enabled "dataproc:dataproc.lineage.enabled=true" on our Dataproc cluster, but lineage is still not captured for Hive tables, although it is captured for BigQuery tables.
Please let us know whether GCP Data Lineage supports Hive and, if so, what the implementation steps are.
Thanks, Venkata
Dataplex is designed to capture lineage from Hive tables when you use a Dataproc Metastore integrated with Data Catalog. This setup is the primary way to obtain Hive table lineage in Dataplex. One limitation to note: not every Hive operation is supported for lineage tracking. Complex transformations and certain Data Definition Language (DDL) operations may not be captured in full.
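For reference, the cluster property quoted in the question is usually supplied when the cluster is created. The following is a minimal Python sketch that only assembles the corresponding gcloud command line; it does not run it. The function name and the cluster name, region, and image version values are placeholders chosen for illustration, and whether this property alone is sufficient for Hive lineage is not confirmed here.

```python
import shlex
from typing import List


def build_create_command(cluster: str, region: str, image_version: str) -> List[str]:
    """Assemble (but do not execute) a cluster-creation command.

    All argument values are placeholders supplied by the caller; the
    lineage property is the one quoted in the question above.
    """
    return [
        "gcloud", "dataproc", "clusters", "create", cluster,
        f"--region={region}",
        f"--image-version={image_version}",
        "--properties=dataproc:dataproc.lineage.enabled=true",
    ]


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    cmd = build_create_command("my-cluster", "us-central1", "2.1.22-debian11")
    print(shlex.join(cmd))
```

Printing the command instead of executing it makes it easy to review the flags before creating anything.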
Troubleshooting
Dataproc Version Requirements: For Hive lineage capture, ensure your Dataproc clusters run at least one of the following image versions:
Dataproc on Compute Engine 2.0.74+
Dataproc on Compute Engine 2.1.22+
Dataplex Feature Activation:
Lake Level: In the Dataplex console, navigate to your Lake and verify that "Data Lineage" is enabled.
Zone Level: Perform the same verification within the specific Zone hosting your Hive tables.
Lineage Event Processing Time: Lineage does not appear in Dataplex immediately. Allow some processing time after a job finishes before concluding that nothing was captured.
Enhanced Logging: For detailed troubleshooting, increase the logging levels for the Dataproc Metastore and Dataplex Data Lineage components.
BigQuery Lineage Prioritization: In workflows involving both Hive and BigQuery, Dataplex may prioritize BigQuery lineage. For clearer Hive lineage tracking, consider segregating your transformation jobs.
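The version requirement in the troubleshooting list above can be checked with a small helper. This is a sketch only: the function name is made up for illustration, and it assumes the image version string begins with major.minor.patch (for example "2.1.22-debian11"). Image tracks not named in the list return None, because this thread does not state a minimum for them.

```python
import re
from typing import Optional

# Minimum patch level per image track, taken from the list above.
MIN_PATCH = {(2, 0): 74, (2, 1): 22}


def meets_lineage_minimum(image_version: str) -> Optional[bool]:
    """Return True/False for tracks listed above, None if unknown or unparseable."""
    match = re.match(r"^(\d+)\.(\d+)\.(\d+)", image_version)
    if match is None:
        return None  # no patch component, so nothing to compare
    major, minor, patch = (int(part) for part in match.groups())
    required = MIN_PATCH.get((major, minor))
    if required is None:
        return None  # track not covered by the list above
    return patch >= required


if __name__ == "__main__":
    for version in ("2.0.73-debian10", "2.1.22-debian11", "2.2.5-debian12"):
        print(version, meets_lineage_minimum(version))
```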
Implementation (Assuming Correct Setup)
Automatic Lineage Capture: With the correct setup, Dataplex is expected to record lineage automatically for supported Hive operations, with no additional configuration.
Viewing Lineage: Locate your Hive table in Dataplex to open its lineage graph, which shows both upstream and downstream dependencies.
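Because lineage can take a while to show up, a scripted check is better written as a poll with backoff than as a single lookup. The sketch below is generic Python: `wait_for` and its parameters are names invented for this example, and the actual lineage lookup is left to the caller as the `check` callable, since this thread does not specify how to query it.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_for(
    check: Callable[[], Optional[T]],
    timeout_s: float = 600.0,
    initial_delay_s: float = 5.0,
    max_delay_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[T]:
    """Call `check` until it returns a truthy value or the timeout expires.

    The delay doubles after each failed attempt, capped at `max_delay_s`.
    Returns the truthy result, or None if the timeout is reached first.
    """
    deadline = clock() + timeout_s
    delay = initial_delay_s
    while True:
        result = check()
        if result:
            return result
        remaining = deadline - clock()
        if remaining <= 0:
            return None
        # Never sleep past the deadline.
        sleep(min(delay, remaining))
        delay = min(delay * 2, max_delay_s)
```

A caller would pass a function that returns the lineage entries found for the table (or an empty result if none exist yet) and treat a None return as "still nothing after the timeout", at which point the other troubleshooting items are worth revisiting.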