Solved: Re: BigQuery - Loop hrough each row, left join and...

ayushmaheshwari · 05-28-2024 08:33 AM

I am struggling to achieve this logic in BigQuery. I shall be grateful if someone can help

Consider these 2 CTEs

WITH MaxCycle AS (
  SELECT MAX(my_cycle) AS max_cycle
  FROM `project_a.dataset_A.table_A`
),
CycleNumbers AS (
  SELECT cycle
  FROM UNNEST(GENERATE_ARRAY(1, (SELECT max_cycle FROM MaxCycle))) AS cycle
)

Now if I do:

SELECT * FROM CycleNumbers

I get 3 rows with values

cycle
-----
1
2
3

This is perfect

Now, this is what I want:

SELECT
  inputs.*
FROM
  `project_a.dataset_A.table_B` AS A
LEFT JOIN
  `project_a.dataset_A.table_A` AS B
ON
  LOWER(A.X) = LOWER(B.X)
  AND LOWER(A.Y) = LOWER(B.Y)
  AND A.Z = 1 (--note: 1 is the first row of CycleNumbers)


UNION ALL

SELECT
  inputs.*
FROM
  `project_a.dataset_A.table_B` AS A
LEFT JOIN
  `project_a.dataset_A.table_A` AS B
ON
  LOWER(A.X) = LOWER(B.X)
  AND LOWER(A.Y) = LOWER(B.Y)
   AND A.Z = 2 (--note: 2 is the second row of CycleNumbers)

UNION ALL

SELECT
  inputs.*
FROM
  `project_a.dataset_A.table_B` AS A
LEFT JOIN
  `project_a.dataset_A.table_A` AS B
ON
  LOWER(A.X) = LOWER(B.X)
  AND LOWER(A.Y) = LOWER(B.Y)
   AND A.Z = 3 (--note: 3 is the second row of CycleNumbers)

Can someone help me generalize this query in BigQuery?

Effectively I want to loop through each row of CycleNumbers, left join and union

@ms4446 , I shall be grateful if you can take a look at this

ms4446

Your approach using CROSS JOIN with CycleNumbers looks good and effectively achieves the desired result. Here are a few suggestions to ensure the query is optimized and readable:

Alias for CycleNumbers: Improves readability.
Check for Necessity of SELECT DISTINCT: Use it only if necessary.
Explicit Column Selection: Always list the specific columns you need.

Here's the updated query with these best practices applied:

 

WITH MaxCycle AS (
  SELECT MAX(my_cycle) AS max_cycle
  FROM `project_a.dataset_A.table_A`
),
CycleNumbers AS (
  SELECT cycle
  FROM UNNEST(GENERATE_ARRAY(1, (SELECT max_cycle FROM MaxCycle))) AS cycle
)
SELECT DISTINCT
  A.column1, A.column2, A.column3, B.column4, B.column5, CN.cycle AS cycle_number
FROM
  `project_a.dataset_A.table_B` AS A
CROSS JOIN
  CycleNumbers AS CN
LEFT JOIN
  `project_a.dataset_A.table_A` AS B
ON
  LOWER(A.X) = LOWER(B.X)
  AND LOWER(A.Y) = LOWER(B.Y)
  AND A.Z = CN.cycle;

MaxCycle CTE: Computes the maximum cycle value from table_A.
CycleNumbers CTE: Generates an array of cycle numbers from 1 to the maximum cycle value.
Main Query:
- CROSS JOIN with CycleNumbers to ensure each row from table_B is combined with each cycle number.
- LEFT JOIN with table_A on the specified conditions.
- Explicit Column Selection: Lists specific columns from table_B and table_A along with the cycle number.
- SELECT DISTINCT: Ensures no duplicate rows, which you might need to use based on your data.

To determine if SELECT DISTINCT is needed, you can run the following query to check for duplicates:

 

SELECT X, Y, Z, COUNT(*) AS count
FROM `project_a.dataset_A.table_B`
GROUP BY X, Y, Z
HAVING count > 1;

If this query returns no rows, it indicates there are no duplicates, and you can safely remove SELECT DISTINCT from the main query.

Without SELECT DISTINCT if duplicates are not present:

 

WITH MaxCycle AS (
  SELECT MAX(my_cycle) AS max_cycle
  FROM `project_a.dataset_A.table_A`
),
CycleNumbers AS (
  SELECT cycle
  FROM UNNEST(GENERATE_ARRAY(1, (SELECT max_cycle FROM MaxCycle))) AS cycle
)
SELECT
  A.column1, A.column2, A.column3, B.column4, B.column5, CN.cycle AS cycle_number
FROM
  `project_a.dataset_A.table_B` AS A
CROSS JOIN
  CycleNumbers AS CN
LEFT JOIN
  `project_a.dataset_A.table_A` AS B
ON
  LOWER(A.X) = LOWER(B.X)
  AND LOWER(A.Y) = LOWER(B.Y)
  AND A.Z = CN.cycle;

View solution in original post

ayushmaheshwari

@ms4446 I managed to get it through cross join.

WITH MaxCycle AS (
  SELECT MAX(my_cycle) AS max_cycle
  FROM `project_a.dataset_A.table_A`
),
CycleNumbers AS (
  SELECT cycle
  FROM UNNEST(GENERATE_ARRAY(1, (SELECT max_cycle FROM MaxCycle))) AS cycle
)
SELECT
  A.*, B.*
FROM
  `project_a.dataset_A.table_B` AS A
CROSS JOIN
  CycleNumbers
LEFT JOIN
  `project_a.dataset_A.table_A` AS B
ON
  LOWER(A.X) = LOWER(B.X)
  AND LOWER(A.Y) = LOWER(B.Y)
  AND A.Z = CycleNumbers.cycle;

See above. Any suggestions?

ms4446

Your approach using CROSS JOIN with CycleNumbers looks good and effectively achieves the desired result. Here are a few suggestions to ensure the query is optimized and readable:

Alias for CycleNumbers: Improves readability.
Check for Necessity of SELECT DISTINCT: Use it only if necessary.
Explicit Column Selection: Always list the specific columns you need.

Here's the updated query with these best practices applied:

 

WITH MaxCycle AS (
  SELECT MAX(my_cycle) AS max_cycle
  FROM `project_a.dataset_A.table_A`
),
CycleNumbers AS (
  SELECT cycle
  FROM UNNEST(GENERATE_ARRAY(1, (SELECT max_cycle FROM MaxCycle))) AS cycle
)
SELECT DISTINCT
  A.column1, A.column2, A.column3, B.column4, B.column5, CN.cycle AS cycle_number
FROM
  `project_a.dataset_A.table_B` AS A
CROSS JOIN
  CycleNumbers AS CN
LEFT JOIN
  `project_a.dataset_A.table_A` AS B
ON
  LOWER(A.X) = LOWER(B.X)
  AND LOWER(A.Y) = LOWER(B.Y)
  AND A.Z = CN.cycle;

MaxCycle CTE: Computes the maximum cycle value from table_A.
CycleNumbers CTE: Generates an array of cycle numbers from 1 to the maximum cycle value.
Main Query:
- CROSS JOIN with CycleNumbers to ensure each row from table_B is combined with each cycle number.
- LEFT JOIN with table_A on the specified conditions.
- Explicit Column Selection: Lists specific columns from table_B and table_A along with the cycle number.
- SELECT DISTINCT: Ensures no duplicate rows, which you might need to use based on your data.

To determine if SELECT DISTINCT is needed, you can run the following query to check for duplicates:

 

SELECT X, Y, Z, COUNT(*) AS count
FROM `project_a.dataset_A.table_B`
GROUP BY X, Y, Z
HAVING count > 1;

If this query returns no rows, it indicates there are no duplicates, and you can safely remove SELECT DISTINCT from the main query.

Without SELECT DISTINCT if duplicates are not present:

 

WITH MaxCycle AS (
  SELECT MAX(my_cycle) AS max_cycle
  FROM `project_a.dataset_A.table_A`
),
CycleNumbers AS (
  SELECT cycle
  FROM UNNEST(GENERATE_ARRAY(1, (SELECT max_cycle FROM MaxCycle))) AS cycle
)
SELECT
  A.column1, A.column2, A.column3, B.column4, B.column5, CN.cycle AS cycle_number
FROM
  `project_a.dataset_A.table_B` AS A
CROSS JOIN
  CycleNumbers AS CN
LEFT JOIN
  `project_a.dataset_A.table_A` AS B
ON
  LOWER(A.X) = LOWER(B.X)
  AND LOWER(A.Y) = LOWER(B.Y)
  AND A.Z = CN.cycle;

BigQuery - Loop hrough each row, left join and unionall