I am struggling to achieve this logic in BigQuery. I shall be grateful if someone can help
Consider these 2 CTEs
WITH MaxCycle AS (
SELECT MAX(my_cycle) AS max_cycle
FROM `project_a.dataset_A.table_A`
),
CycleNumbers AS (
SELECT cycle
FROM UNNEST(GENERATE_ARRAY(1, (SELECT max_cycle FROM MaxCycle))) AS cycle
)
Now if I do:
SELECT * FROM CycleNumbers
I get 3 rows with values
cycle
-----
1
2
3
This is perfect
Now, this is what I want:
SELECT
inputs.*
FROM
`project_a.dataset_A.table_B` AS A
LEFT JOIN
`project_a.dataset_A.table_A` AS B
ON
LOWER(A.X) = LOWER(B.X)
AND LOWER(A.Y) = LOWER(B.Y)
AND A.Z = 1 (--note: 1 is the first row of CycleNumbers)
UNION ALL
SELECT
inputs.*
FROM
`project_a.dataset_A.table_B` AS A
LEFT JOIN
`project_a.dataset_A.table_A` AS B
ON
LOWER(A.X) = LOWER(B.X)
AND LOWER(A.Y) = LOWER(B.Y)
AND A.Z = 2 (--note: 2 is the second row of CycleNumbers)
UNION ALL
SELECT
inputs.*
FROM
`project_a.dataset_A.table_B` AS A
LEFT JOIN
`project_a.dataset_A.table_A` AS B
ON
LOWER(A.X) = LOWER(B.X)
AND LOWER(A.Y) = LOWER(B.Y)
AND A.Z = 3 (--note: 3 is the second row of CycleNumbers)
Can someone help me generalize this query in BigQuery?
Effectively I want to loop through each row of CycleNumbers, left join and union
@ms4446 , I shall be grateful if you can take a look at this
Solved! Go to Solution.
Your approach using CROSS JOIN
with CycleNumbers
looks good and effectively achieves the desired result. Here are a few suggestions to ensure the query is optimized and readable:
Here's the updated query with these best practices applied:
WITH MaxCycle AS (
SELECT MAX(my_cycle) AS max_cycle
FROM `project_a.dataset_A.table_A`
),
CycleNumbers AS (
SELECT cycle
FROM UNNEST(GENERATE_ARRAY(1, (SELECT max_cycle FROM MaxCycle))) AS cycle
)
SELECT DISTINCT
A.column1, A.column2, A.column3, B.column4, B.column5, CN.cycle AS cycle_number
FROM
`project_a.dataset_A.table_B` AS A
CROSS JOIN
CycleNumbers AS CN
LEFT JOIN
`project_a.dataset_A.table_A` AS B
ON
LOWER(A.X) = LOWER(B.X)
AND LOWER(A.Y) = LOWER(B.Y)
AND A.Z = CN.cycle;
table_A
.CycleNumbers
to ensure each row from table_B
is combined with each cycle number.table_A
on the specified conditions.table_B
and table_A
along with the cycle number.To determine if SELECT DISTINCT is needed, you can run the following query to check for duplicates:
SELECT X, Y, Z, COUNT(*) AS count
FROM `project_a.dataset_A.table_B`
GROUP BY X, Y, Z
HAVING count > 1;
If this query returns no rows, it indicates there are no duplicates, and you can safely remove SELECT DISTINCT from the main query.
Without SELECT DISTINCT if duplicates are not present:
WITH MaxCycle AS (
SELECT MAX(my_cycle) AS max_cycle
FROM `project_a.dataset_A.table_A`
),
CycleNumbers AS (
SELECT cycle
FROM UNNEST(GENERATE_ARRAY(1, (SELECT max_cycle FROM MaxCycle))) AS cycle
)
SELECT
A.column1, A.column2, A.column3, B.column4, B.column5, CN.cycle AS cycle_number
FROM
`project_a.dataset_A.table_B` AS A
CROSS JOIN
CycleNumbers AS CN
LEFT JOIN
`project_a.dataset_A.table_A` AS B
ON
LOWER(A.X) = LOWER(B.X)
AND LOWER(A.Y) = LOWER(B.Y)
AND A.Z = CN.cycle;
@ms4446 I managed to get it through cross join.
WITH MaxCycle AS (
SELECT MAX(my_cycle) AS max_cycle
FROM `project_a.dataset_A.table_A`
),
CycleNumbers AS (
SELECT cycle
FROM UNNEST(GENERATE_ARRAY(1, (SELECT max_cycle FROM MaxCycle))) AS cycle
)
SELECT
A.*, B.*
FROM
`project_a.dataset_A.table_B` AS A
CROSS JOIN
CycleNumbers
LEFT JOIN
`project_a.dataset_A.table_A` AS B
ON
LOWER(A.X) = LOWER(B.X)
AND LOWER(A.Y) = LOWER(B.Y)
AND A.Z = CycleNumbers.cycle;
See above. Any suggestions?
Your approach using CROSS JOIN
with CycleNumbers
looks good and effectively achieves the desired result. Here are a few suggestions to ensure the query is optimized and readable:
Here's the updated query with these best practices applied:
WITH MaxCycle AS (
SELECT MAX(my_cycle) AS max_cycle
FROM `project_a.dataset_A.table_A`
),
CycleNumbers AS (
SELECT cycle
FROM UNNEST(GENERATE_ARRAY(1, (SELECT max_cycle FROM MaxCycle))) AS cycle
)
SELECT DISTINCT
A.column1, A.column2, A.column3, B.column4, B.column5, CN.cycle AS cycle_number
FROM
`project_a.dataset_A.table_B` AS A
CROSS JOIN
CycleNumbers AS CN
LEFT JOIN
`project_a.dataset_A.table_A` AS B
ON
LOWER(A.X) = LOWER(B.X)
AND LOWER(A.Y) = LOWER(B.Y)
AND A.Z = CN.cycle;
table_A
.CycleNumbers
to ensure each row from table_B
is combined with each cycle number.table_A
on the specified conditions.table_B
and table_A
along with the cycle number.To determine if SELECT DISTINCT is needed, you can run the following query to check for duplicates:
SELECT X, Y, Z, COUNT(*) AS count
FROM `project_a.dataset_A.table_B`
GROUP BY X, Y, Z
HAVING count > 1;
If this query returns no rows, it indicates there are no duplicates, and you can safely remove SELECT DISTINCT from the main query.
Without SELECT DISTINCT if duplicates are not present:
WITH MaxCycle AS (
SELECT MAX(my_cycle) AS max_cycle
FROM `project_a.dataset_A.table_A`
),
CycleNumbers AS (
SELECT cycle
FROM UNNEST(GENERATE_ARRAY(1, (SELECT max_cycle FROM MaxCycle))) AS cycle
)
SELECT
A.column1, A.column2, A.column3, B.column4, B.column5, CN.cycle AS cycle_number
FROM
`project_a.dataset_A.table_B` AS A
CROSS JOIN
CycleNumbers AS CN
LEFT JOIN
`project_a.dataset_A.table_A` AS B
ON
LOWER(A.X) = LOWER(B.X)
AND LOWER(A.Y) = LOWER(B.Y)
AND A.Z = CN.cycle;