Solved: TABLESAMPLE is not allowed with array scans at [1:...

mdideles · 10-23-2024 11:19 AM

We are trying TABLESAMPLE function with the query below.. When trying to execute the below query we see the error TABLESAMPLE is not allowed with array scans at [1:613].

Do you have other approach to accomplish the same tasks as this query? or another way with TABLESAMPLE that avoids this error?

select
Min(als_0.light_id) as cvdraw_payloadmetal_id_min_value,
Max (als_0.light_id) as cvdraw_payloadmetal_id_max_value,
count (DISTINCT (als_0.light_id)) as cvdraw_payloadmetadatal_id_distinct_cnt,
STRING_AGG(DISTINCT(als_0.light_id),';' limit 10 ) AS cvdraw_payloadmetadatal_id_samp
from `project.dataset.table` sampled_data
left join unnest (cvdc62_raw_payload_metadata_light_id_x) as als_0
TABLESAMPLE SYSTEM (0.01 PERCENT)
WHERE
DATE(df_partition_date) <= CURRENT_DATE

ms4446

BigQuery does not support using TABLESAMPLE in conjunction with array scans, such as when using the UNNEST function. Specifically, TABLESAMPLE cannot be used with queries that involve UNNEST operations.

To achieve the same result without this error, you can use an alternative sampling method. One common approach is to use the RAND() function to randomly sample rows before the UNNEST operation. Here's how you can modify your query:

WITH sampled_data AS (
SELECT *
FROM `project.dataset.table`
WHERE DATE(df_partition_date) <= CURRENT_DATE
AND RAND() < 0.01 -- Adjust the sampling rate as needed
)
SELECT
MIN(als_0.light_id) AS cvdraw_payloadmetal_id_min_value,
MAX(als_0.light_id) AS cvdraw_payloadmetal_id_max_value,
COUNT(DISTINCT als_0.light_id) AS cvdraw_payloadmetadatal_id_distinct_cnt,
STRING_AGG(DISTINCT als_0.light_id, ';' LIMIT 10) AS cvdraw_payloadmetadatal_id_samp
FROM sampled_data
LEFT JOIN UNNEST(cvdc62_raw_payload_metadata_light_id_x) AS als_0;

View solution in original post

caryna

Hi @mdideles,

In addition to @ms4446 suggestion, you can use TABLESAMPLE within the CTE clause. By sampling the data at this stage, the UNNEST function will only be applied to the sampled subset, preventing any conflicts that could arise between TABLESAMPLE and UNNEST.

You may also refer to this documentation for more details on using the TABLESAMPLE and RAND functions.

I hope the above information is helpful.

View solution in original post

ms4446

BigQuery does not support using TABLESAMPLE in conjunction with array scans, such as when using the UNNEST function. Specifically, TABLESAMPLE cannot be used with queries that involve UNNEST operations.

To achieve the same result without this error, you can use an alternative sampling method. One common approach is to use the RAND() function to randomly sample rows before the UNNEST operation. Here's how you can modify your query:

WITH sampled_data AS (
SELECT *
FROM `project.dataset.table`
WHERE DATE(df_partition_date) <= CURRENT_DATE
AND RAND() < 0.01 -- Adjust the sampling rate as needed
)
SELECT
MIN(als_0.light_id) AS cvdraw_payloadmetal_id_min_value,
MAX(als_0.light_id) AS cvdraw_payloadmetal_id_max_value,
COUNT(DISTINCT als_0.light_id) AS cvdraw_payloadmetadatal_id_distinct_cnt,
STRING_AGG(DISTINCT als_0.light_id, ';' LIMIT 10) AS cvdraw_payloadmetadatal_id_samp
FROM sampled_data
LEFT JOIN UNNEST(cvdc62_raw_payload_metadata_light_id_x) AS als_0;

caryna

Hi @mdideles,

In addition to @ms4446 suggestion, you can use TABLESAMPLE within the CTE clause. By sampling the data at this stage, the UNNEST function will only be applied to the sampled subset, preventing any conflicts that could arise between TABLESAMPLE and UNNEST.

You may also refer to this documentation for more details on using the TABLESAMPLE and RAND functions.

I hope the above information is helpful.