Re: Time and Geographically clustered joins in Big...

nategate6 · 06-25-2024 11:27 AM

Hi All,

I am working with astronomical data, using Bigquery Geogpoints data to represent RA/Dec, and am testing using BigQuery to perform a time and location based crossmatch between two datasets. Both tables represent point source observations - for our purposes they have an obstime (timestamp), an observatory_code, and an st_geo (st_geogpoint(ra, dec)). Essentially I am trying to answer: For every row in Table A, find all rows in Table B such that st_geo_a and st_geo_b are within a certain distance, the observatory code is shared, and the times are within a tolerance.

I have been having trouble getting this query to be performant, and I think I am not understanding how the query resolver utilizes the table clustering. Both tables are clustered on first obstime, and then st_geo.

Here is a sample query I am running. As written, the query times out. When I subsample both tables to be only a subset of time (ex, a month), the query runs and is fast, but my expectation is that there is a way to rewrite this to make use of the analagous clustering, and be able to perform the crossmatch-as-join in one go (rather than having to go by, say, monthly bins, which will also work). For context, the candidates table is ~200k rows here, and the mpc table is of order 500m.

Additionally, the analysis usage seems high, also leading me to believe that every row is being scanned, and that I'm not leveraging the clustering properly here.

SELECT
*, st_distance(geo_1, geo_2) / 30.8874796235 as dist_arcsec
FROM (
SELECT
    observation_id, cand.obstime, cand.st_geo as geo_1,
    mpc.obstime, mpc.st_geo as geo_2,
    TIMESTAMP_DIFF(cand.obstime, mpc.obstime, SECOND) AS timedelta
FROM
    `ai_natetellis_crossmatch.candidates_clustered`
    AS cand
INNER JOIN (
    SELECT
    obstime, st_geo, stn
    FROM
    moeyens-thor-dev.mpc_sbn_views.obs_sbn_clustered
    and stn in ("W84", "695", "V00")) AS mpc
ON
    ST_DWITHIN(cand.st_geo, mpc.st_geo, 5.0 * 30.8874796235)
    and cand.obscode = mpc.stn) AS result_table
WHERE
ABS(timedelta) < 300.0

Is the second join condition (cand.obscode = mpc.stn) interfering here? My understanding is that predicate functions like ST_DWITHIN should make use of the clustering, but is the geogpoint being the second-order clustered field preventing that from happening?

Thank you for any help in understanding the intricacies here, and if there is a better place I should be asking this question, please let me know.

B/R,

Nate

ms4446

The core issue is that the query optimizer isn't leveraging the clustering effectively in its current form. The existing query faces several performance bottlenecks:

1. Filtering After the Join: The ST_DWITHIN predicate, which checks the spatial proximity between points, is applied during the join operation. This requires BigQuery to compute distances for all potential row combinations before any filtering occurs. Given the size of the datasets (with one table containing approximately 200,000 rows and the other around 500 million rows), this computation is extremely expensive.
2. Secondary Clustering: Although both tables are clustered on obstime and st_geog, the primary clustering is on obstime. This means the data is primarily ordered by time, making spatial filtering less efficient when not pre-filtered by obstime.
3. Complex Join Condition: The additional condition cand.obscode = mpc.stn in the join clause adds complexity, which can hinder the optimizer's ability to effectively utilize the spatial index on st_geog.
To address these issues, an optimized query approach is necessary. This involves filtering data earlier in the process and combining conditions in a manner that allows the query optimizer to utilize clustering effectively.

WITH filtered_candidates AS (
  SELECT observation_id, obstime, st_geo, obscode
  FROM ai_natetellis_crossmatch.candidates_clustered
),

filtered_mpc AS (
  SELECT obstime, st_geo, stn
  FROM moeyens-thor-dev.mpc_sbn_views.obs_sbn_clustered
  WHERE stn IN ("W84", "695", "V00")
)

SELECT observation_id, cand.obstime, cand.st_geo AS geo_1, 
       mpc.obstime, mpc.st_geo AS geo_2, 
       ST_DISTANCE(cand.st_geo, mpc.st_geo) / 30.8874796235 AS dist_arcsec
FROM filtered_candidates AS cand
INNER JOIN filtered_mpc AS mpc
  ON cand.obscode = mpc.stn
  AND ABS(TIMESTAMP_DIFF(cand.obstime, mpc.obstime, SECOND)) < 300
  AND ST_DWITHIN(cand.st_geo, mpc.st_geo, 5.0 * 30.8874796235)

Pre-filtering with CTEs: The filtered_candidates and filtered_mpc CTEs (Common Table Expressions) filter out unnecessary rows before the join, reducing the number of distance calculations.
Combining Conditions in the Join: The join now includes all conditions (obscode, obstime, and ST_DWITHIN) together. This allows the query optimizer to leverage the clustering on both obstime and st_geo more effectively.
Reordering Conditions: The obscode and obstime conditions are placed before ST_DWITHIN. This is because they are likely more selective than the spatial condition, further reducing the number of rows that need distance calculations.

Additional Considerations

Partitioning: If your data has a natural time-based partitioning scheme (e.g., by month or year), consider partitioning your tables on obstime. This can dramatically improve performance for queries that filter on specific time ranges.
Indexing: If you frequently filter on obscode or stn, creating indexes on these columns might be beneficial.
Alternative Approaches: If you still face performance issues, exploring spatial indexing techniques like quadtrees or H3 might be worthwhile, but these can be more complex to implement.

Time and Geographically clustered joins in BigQuery