
Data Quality

Hi,

I have a DQ job which essentially fetches data from a Cloud Storage bucket, checks the data quality, and on success persists the data to a BigQuery dataset. I actually want to persist the successful data to another Cloud Storage bucket instead of BigQuery. I tried but had no luck. Please help.


While Dataplex Data Quality jobs are well suited to assessing data against quality rules, they typically write only results or metrics to BigQuery; redirecting the validated data itself to GCS requires a tailored approach. Here’s how to adjust your workflow:

Steps for Data Transformation and Movement

Since direct data movement isn't a native function of Dataplex Data Quality jobs, integrating a data transformation step is essential. Consider these options:

  • Cloud Dataflow: Incorporate a Cloud Dataflow task within your Dataplex job (see the pipeline sketch after this list) to:

    • Read data that passes quality checks.

    • Write this data to the desired Cloud Storage bucket in an appropriate format (e.g., CSV, Avro, Parquet).

  • Dataproc: For those proficient with Spark, a Dataproc Spark task can serve a similar purpose, leveraging Spark for data transformation and movement.
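For illustration, here is a minimal Apache Beam (Python) sketch of the Dataflow option: it reads text records from a source bucket, keeps rows that pass a placeholder predicate, and writes them to a destination bucket. The bucket names, project ID, and `is_valid` rule are all hypothetical; substitute your own, and express the predicate in terms of the checks your DQ task actually enforces.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def is_valid(line: str) -> bool:
    # Placeholder quality predicate -- replace with your own rule,
    # e.g. re-checking the fields your Dataplex DQ task validated.
    return bool(line.strip()) and "null" not in line


def run():
    options = PipelineOptions(
        runner="DataflowRunner",                   # or "DirectRunner" for local testing
        project="my-project",                      # hypothetical project ID
        region="us-central1",
        temp_location="gs://my-temp-bucket/tmp",   # hypothetical staging bucket
    )
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadSource" >> beam.io.ReadFromText("gs://source-bucket/input/*.csv")
            | "KeepValidRows" >> beam.Filter(is_valid)
            | "WriteToGcs" >> beam.io.WriteToText(
                "gs://destination-bucket/validated/part",
                file_name_suffix=".csv",
            )
        )


if __name__ == "__main__":
    run()
```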

Dataplex Task Configuration

  • Configure your Dataplex task to execute the Cloud Dataflow pipeline or Dataproc Spark job after the Data Quality task (a client-library sketch follows this list).

  • Ensure task dependencies are correctly set, so the transformation task executes only after successful Data Quality validation.
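As a rough sketch of that configuration, the snippet below creates an on-demand Dataplex Spark task with the google-cloud-dataplex client library. The lake path, service account, and script URI are placeholders, and the field names follow `dataplex_v1` as I understand it; verify them against the current API reference.

```python
from google.cloud import dataplex_v1


def create_post_dq_task():
    client = dataplex_v1.DataplexServiceClient()
    task = dataplex_v1.Task(
        trigger_spec=dataplex_v1.Task.TriggerSpec(
            type_=dataplex_v1.Task.TriggerSpec.Type.ON_DEMAND,
        ),
        execution_spec=dataplex_v1.Task.ExecutionSpec(
            # Hypothetical service account with access to both buckets.
            service_account="dq-runner@my-project.iam.gserviceaccount.com",
        ),
        spark=dataplex_v1.Task.SparkTaskConfig(
            # Hypothetical PySpark script that moves the validated data.
            python_script_file="gs://my-scripts/move_validated_data.py",
        ),
    )
    operation = client.create_task(
        parent="projects/my-project/locations/us-central1/lakes/my-lake",
        task_id="move-validated-data",
        task=task,
    )
    return operation.result()  # blocks until the task resource is created
```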

If opting for Dataflow, the workflow might conceptually proceed as follows:

  1. Dataplex Data Quality Task: Executes defined data quality rules on input data from Cloud Storage.

  2. Conditional Branching: Upon successful validation, trigger the Dataflow task (a sketch of this gating step follows the list).

  3. Dataflow Task: Initiates a pipeline to read quality-approved data, potentially transform it, and then write it to the specified GCS bucket.
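Here is a hedged sketch of step 2, assuming your DQ task exports a summary table to BigQuery (the table and column names below are placeholders for your own output schema): query the summary for a given run, and launch the pipeline only when no rules failed.

```python
from google.cloud import bigquery


def dq_run_passed(project: str, summary_table: str, invocation_id: str) -> bool:
    """Return True if the given DQ run recorded zero failed rules."""
    client = bigquery.Client(project=project)
    query = f"""
        SELECT SUM(failed_count) AS failures
        FROM `{summary_table}`
        WHERE invocation_id = @invocation_id
    """
    job = client.query(
        query,
        job_config=bigquery.QueryJobConfig(
            query_parameters=[
                bigquery.ScalarQueryParameter(
                    "invocation_id", "STRING", invocation_id
                )
            ]
        ),
    )
    row = next(iter(job.result()))
    return (row.failures or 0) == 0


# Hypothetical table and run ID -- adapt to your CloudDQ output schema.
if dq_run_passed("my-project", "my-project.dq_results.dq_summary", "run-001"):
    pass  # launch the Dataflow pipeline from the earlier sketch here
```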

Additional Considerations

  • Cost and Complexity: Incorporating Dataflow or Dataproc adds both operational complexity and compute costs. It's crucial to account for these in your planning and budgeting.

  • Permissions and Security: Ensure proper IAM roles and permissions are set for Dataplex, Dataflow, and Dataproc services to access necessary resources and perform operations.

  • Monitoring and Logging: Implement robust monitoring and logging practices to oversee the performance and outcomes of your tasks. Utilize Google Cloud's operations suite for comprehensive insights.
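As a small, generic example, the google-cloud-logging client library can route your transformation script's logs into Cloud Logging, so each run is visible in the operations suite alongside your Dataplex task history:

```python
import logging

import google.cloud.logging

client = google.cloud.logging.Client()
client.setup_logging()  # attaches a Cloud Logging handler to the root logger

# Standard logging calls now land in Cloud Logging; the path is hypothetical.
logging.info("Validated data written to gs://destination-bucket/validated/")
```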