Solved: Re: BigQuery check if two strings are different by...

Jeff_Franklin · 08-05-2024 04:04 PM

I have a long (18 million) list of short (sub ten character) strings. I need to compare them all to each other and find "matches" where the two strings are exactly 1 character apart (ABC123 and ABD123 would be returned, but ABC123 and AXX123 would not be returned). I think you can do this in Oracle using the utl_match package, but....is something similar available in BQ?

ms4446

In BigQuery, there isn't a direct equivalent to Oracle's utl_match package for fuzzy string matching. However, you can achieve similar results using SQL with a combination of BigQuery's string functions.

To find strings that differ by exactly one character, you can use the following approach:

Self-Join: Use a CROSS JOIN to compare each string with every other string in the table.
Character Comparison: Compare characters at each position in the strings using SUBSTR and a virtual table of positions generated by UNNEST(GENERATE_ARRAY(1, LEAST(LENGTH(a.string), LENGTH(b.string)))).
Difference Calculation: Calculate the number of differing characters using SUM(CASE...END).
Filtering: Keep only pairs of strings with exactly one differing character.

Here is the SQL query that implements this logic:

WITH StringComparisons AS (
  SELECT
    a.string AS string_a,
    b.string AS string_b,
    SUM(CASE WHEN SUBSTR(a.string, pos, 1) != SUBSTR(b.string, pos, 1) THEN 1 ELSE 0 END) AS difference_count
  FROM
    `your_dataset.your_table` a
  CROSS JOIN
    `your_dataset.your_table` b
  CROSS JOIN
    UNNEST(GENERATE_ARRAY(1, LEAST(LENGTH(a.string), LENGTH(b.string)))) AS pos
  WHERE
    a.string != b.string AND LENGTH(a.string) = LENGTH(b.string)  -- Added length check
  GROUP BY
    a.string, b.string
)
SELECT
  string_a,
  string_b
FROM
  StringComparisons
WHERE
  difference_count = 1;

For extremely large datasets, the use of CROSS JOIN may be computationally expensive. Consider using advanced techniques such as window functions, breaking the task into smaller chunks, or leveraging parallel processing for better performance.
If you need to find strings with a wider range of differences, consider exploring algorithms like Levenshtein distance, although this would require custom implementation as BigQuery does not have a built-in function for this.

View solution in original post

ms4446

In BigQuery, there isn't a direct equivalent to Oracle's utl_match package for fuzzy string matching. However, you can achieve similar results using SQL with a combination of BigQuery's string functions.

To find strings that differ by exactly one character, you can use the following approach:

Self-Join: Use a CROSS JOIN to compare each string with every other string in the table.
Character Comparison: Compare characters at each position in the strings using SUBSTR and a virtual table of positions generated by UNNEST(GENERATE_ARRAY(1, LEAST(LENGTH(a.string), LENGTH(b.string)))).
Difference Calculation: Calculate the number of differing characters using SUM(CASE...END).
Filtering: Keep only pairs of strings with exactly one differing character.

Here is the SQL query that implements this logic:

WITH StringComparisons AS (
  SELECT
    a.string AS string_a,
    b.string AS string_b,
    SUM(CASE WHEN SUBSTR(a.string, pos, 1) != SUBSTR(b.string, pos, 1) THEN 1 ELSE 0 END) AS difference_count
  FROM
    `your_dataset.your_table` a
  CROSS JOIN
    `your_dataset.your_table` b
  CROSS JOIN
    UNNEST(GENERATE_ARRAY(1, LEAST(LENGTH(a.string), LENGTH(b.string)))) AS pos
  WHERE
    a.string != b.string AND LENGTH(a.string) = LENGTH(b.string)  -- Added length check
  GROUP BY
    a.string, b.string
)
SELECT
  string_a,
  string_b
FROM
  StringComparisons
WHERE
  difference_count = 1;

For extremely large datasets, the use of CROSS JOIN may be computationally expensive. Consider using advanced techniques such as window functions, breaking the task into smaller chunks, or leveraging parallel processing for better performance.
If you need to find strings with a wider range of differences, consider exploring algorithms like Levenshtein distance, although this would require custom implementation as BigQuery does not have a built-in function for this.

BigQuery check if two strings are different by exactly 1 character