I have a long (18 million) list of short (sub ten character) strings. I need to compare them all to each other and find "matches" where the two strings are exactly 1 character apart (ABC123 and ABD123 would be returned, but ABC123 and AXX123 would not be returned). I think you can do this in Oracle using the utl_match package, but....is something similar available in BQ?
Solved! Go to Solution.
In BigQuery, there isn't a direct equivalent to Oracle's utl_match package for fuzzy string matching. However, you can achieve similar results using SQL with a combination of BigQuery's string functions.
To find strings that differ by exactly one character, you can use the following approach:
Here is the SQL query that implements this logic:
WITH StringComparisons AS (
SELECT
a.string AS string_a,
b.string AS string_b,
SUM(CASE WHEN SUBSTR(a.string, pos, 1) != SUBSTR(b.string, pos, 1) THEN 1 ELSE 0 END) AS difference_count
FROM
`your_dataset.your_table` a
CROSS JOIN
`your_dataset.your_table` b
CROSS JOIN
UNNEST(GENERATE_ARRAY(1, LEAST(LENGTH(a.string), LENGTH(b.string)))) AS pos
WHERE
a.string != b.string AND LENGTH(a.string) = LENGTH(b.string) -- Added length check
GROUP BY
a.string, b.string
)
SELECT
string_a,
string_b
FROM
StringComparisons
WHERE
difference_count = 1;
In BigQuery, there isn't a direct equivalent to Oracle's utl_match package for fuzzy string matching. However, you can achieve similar results using SQL with a combination of BigQuery's string functions.
To find strings that differ by exactly one character, you can use the following approach:
Here is the SQL query that implements this logic:
WITH StringComparisons AS (
SELECT
a.string AS string_a,
b.string AS string_b,
SUM(CASE WHEN SUBSTR(a.string, pos, 1) != SUBSTR(b.string, pos, 1) THEN 1 ELSE 0 END) AS difference_count
FROM
`your_dataset.your_table` a
CROSS JOIN
`your_dataset.your_table` b
CROSS JOIN
UNNEST(GENERATE_ARRAY(1, LEAST(LENGTH(a.string), LENGTH(b.string)))) AS pos
WHERE
a.string != b.string AND LENGTH(a.string) = LENGTH(b.string) -- Added length check
GROUP BY
a.string, b.string
)
SELECT
string_a,
string_b
FROM
StringComparisons
WHERE
difference_count = 1;