On runniing the exactly same query i get different results.
Details:
Total results are 64155
since i have a limit retrieval limit of 50000 max rows, I am trying to retrieve them in two steps. On fetching first batch of 50k rows i get different result on every execution
IF i don't use Limit Results are Ok.
Findings:
Cache is turned off
Query 1:
select count(*), sum(fact__values_sum) from (
SELECT
`dim_product`.PH_1 `dim_product__ph_1`, `dim_product`.PH_3 `dim_product__ph_3`, `dim_geography`.H1_8 `dim_geography__h1_8`, `dim_measure`.label `dim_measure__label`, `dim_time_period`.label `dim_time_period__label`, `dim_geography`.H1_5 `dim_geography__h1_5`, `dim_geography`.H1_6 `dim_geography__h1_6`, DATETIME_TRUNC(DATETIME(`fact`.`reported_date`, 'UTC'), MONTH) `fact__reporteddate_month`, sum(`fact`.values) `fact__values_sum`, sum(`fact`.valuesly) `fact__values_ly_sum`, sum(`fact`.valuespp) `fact__values_p_p_sum`
FROM
plexus-336107.plexusdata.fact AS `fact`
LEFT JOIN plexus-336107.plexusdata.dim_product AS `dim_product` ON `fact`.product_id = `dim_product`.id
LEFT JOIN plexus-336107.plexusdata.dim_geography AS `dim_geography` ON `fact`.geography_id = `dim_geography`.id
LEFT JOIN plexus-336107.plexusdata.dim_measure AS `dim_measure` ON `fact`.measure_id = `dim_measure`.id
LEFT JOIN plexus-336107.plexusdata.dim_time_period AS `dim_time_period` ON `fact`.time_period_id = `dim_time_period`.id WHERE (`fact`.`reported_date` >= TIMESTAMP('2023-01-01T00:00:00.000Z') AND `fact`.`reported_date` <= TIMESTAMP('2023-01-01T23:59:59.999Z')) AND (`fact`.data_type_id = CAST('100' AS FLOAT64)) GROUP BY 1, 2, 3, 4, 5, 6, 7, 8 ORDER BY 8 ASC LIMIT 50000 ) where dim_product__ph_1
="Gucci" and dim_time_period__label='MTD' and dim_measure__label='Quantity Sold'
Query 1 Results:
results--> 2283 , 24085.0 job_id -->bquxjob_476e8876_18ea056aa87 results--> 2263 , 23977.0 job_id--> bquxjob_78a92fd0_18ea0570760
Query 2:
select count(*), sum(fact__values_sum) from (
SELECT
`dim_product`.PH_1 `dim_product__ph_1`, `dim_product`.PH_3 `dim_product__ph_3`, `dim_geography`.H1_8 `dim_geography__h1_8`, `dim_measure`.label `dim_measure__label`, `dim_time_period`.label `dim_time_period__label`, `dim_geography`.H1_5 `dim_geography__h1_5`, `dim_geography`.H1_6 `dim_geography__h1_6`, DATETIME_TRUNC(DATETIME(`fact`.`reported_date`, 'UTC'), MONTH) `fact__reporteddate_month`, sum(`fact`.values) `fact__values_sum`, sum(`fact`.valuesly) `fact__values_ly_sum`, sum(`fact`.valuespp) `fact__values_p_p_sum`
FROM
plexus-336107.plexusdata.fact AS `fact`
LEFT JOIN plexus-336107.plexusdata.dim_product AS `dim_product` ON `fact`.product_id = `dim_product`.id
LEFT JOIN plexus-336107.plexusdata.dim_geography AS `dim_geography` ON `fact`.geography_id = `dim_geography`.id
LEFT JOIN plexus-336107.plexusdata.dim_measure AS `dim_measure` ON `fact`.measure_id = `dim_measure`.id
LEFT JOIN plexus-336107.plexusdata.dim_time_period AS `dim_time_period` ON `fact`.time_period_id = `dim_time_period`.id WHERE `fact`.`reported_date` ="2023-01-01" AND (`fact`.data_type_id = CAST('100' AS FLOAT64)) GROUP BY 1, 2, 3, 4, 5, 6, 7, 8 ORDER BY 8 ASC LIMIT 50000 ) where dim_product__ph_1
="Gucci" and dim_time_period__label='MTD' and dim_measure__label='Quantity Sold'
Query 2 Results :
Job_id--> bquxjob_3e391a4c_18ea0511613 result--> 2297 25048.0 Job_id--> bquxjob_58310e22_18ea051eb25 result--> 2268 25329.0 Job_id--> bquxjob_3c8eb262_18ea052ba77 result--> 2269 24752.0
given below are relevant schemas
Any help is appreciated.
Thanks In Advance
Are any of the tables changing between queries? Maybe save the results of the subqueries between each of your job runs and see how they change? Try simplify the query one part at a time (remove columns and joins) and see how minimal you can get that still results in non-deterministic results from the queries.