Re: Dataform optimisation

ayushmaheshwari · 01-18-2024 12:26 PM

I have this query in Dataform which I am trying to optimise. I cannot find a way to do it. Any help will be appreciated. Scanning records with timestamp is not helping unfortunately

Any help will be gratefully appreciated.

config {
    type: "table",
    description: "Information from PHY diagnostics",
    assertions: {
        uniqueKey: ["eqp","force_vector","scalar_date","update_status","due_date","visual_character"],
        nonNull: [
          "session_id",
          "eqp",
          "visual_character",
          "system"
        ],
    
    }

}

WITH PHY_diagnostics_table AS (

    SELECT 
    PHY.now_time AS visual_character, 
    'N/A' AS retailer_name, 
    PHY.session_id,
    'N/A' AS ufn, 
    'N/A' AS uln, 
    PHY.eqp AS EQP, 
    'N/A' AS rc,
    'N/A' AS md, 
    'N/A' AS di, 
    'N/A' AS h, 
    'PHY_DIAGNOSTICS' AS System,
    V.scalar_date, 
    V.Due_date, 
    V.force_vector, 
    V.force_region,
    V.update_status
    FROM ${ref("PHY_DIAGNOSTICS_INTERACTIONS")} PHY
    JOIN ${ref('result_card')} V ON PHY.eqp = V.EQP
                AND DATE(PHY.now_time) >= DATE_ADD( V.scalar_date , INTERVAL -7 DAY )
                AND (DATE(PHY.now_time) < V.Due_date OR Due_date IS NULL)

)

SELECT DISTINCT * FROM PHY_diagnostics_table

ms4446

Optimization Strategies for Your Dataform Query

Partitioning:

Consider partitioning large tables (such as PHY_DIAGNOSTICS_INTERACTIONS and result_card) by a relevant date column (e.g., now_time or scalar_date). This can significantly reduce the amount of data scanned during JOIN operations.
Refer to your data warehouse's documentation (e.g.,https://cloud.google.com/bigquery/docs/partitioned-tables) for instructions on how to create partitioned tables.

Filtering and Pruning:

Apply filters directly in the WHERE clause to limit the data processed. For example:

FROM ${ref("PHY_DIAGNOSTICS_INTERACTIONS")} PHY 
WHERE DATE(PHY.now_time) >= DATE_ADD(V.scalar_date, INTERVAL -7 DAY)
  AND (DATE(PHY.now_time) < V.Due_date OR V.Due_date IS NULL)

If your data warehouse supports it, utilize column statistics to aid the query optimizer in choosing the best execution plan.

Denormalization:

If you frequently join PHY_DIAGNOSTICS_INTERACTIONS and result_card, consider creating a denormalized table that combines data from both.
Be aware of the trade-off: increased storage and potential maintenance overhead.

Indexing:

If supported by your data warehouse, create indexes on columns frequently used in JOINs and WHERE clauses (e.g., EQP, now_time,scalar_date).

Materialized Views:

For computationally expensive, commonly used CTEs (like PHY_diagnostics_table),consider using a materialized view. This pre-computes and stores the results, improving query performance.

Profiling and Experimentation:

Use query profiling tools provided by your data warehouse to understand where bottlenecks occur in the query.
Experiment with different optimization strategies and measure their impact.

Additional Considerations:

If you have data skew (certain values, like specific EQP values, occur much more frequently than others), explore techniques to handle data skew.
Make sure your data warehouse has appropriate settings for resource allocation and query caching.

Please Note:

These strategies serve as general guidance. The best approach depends on your specific dataset, table structures, and the features of your data warehouse. It is crucial to test any optimization changes in a non-production environment before deploying them.