I am currently running a pipeline through Airflow every 12 hours. Each time the pipeline runs, the BigQuery table is rewritten (WRITE_TRUNCATE). This is working fine.
However, I want to append only the new data to the table. Instead of using WRITE_APPEND, can I use another means to append to the table? The last time I used WRITE_APPEND, I faced issues because the table was populated with repeated rows in many cases.
Can I have a timestamp column (CURRENT_TIMESTAMP) so that only the rows after that timestamp get appended into the BigQuery table?
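For example, something along these lines is what I have in mind (just a sketch; the dataset, table, and column names such as event_ts and load_time are placeholders, not my real schema):

-- Stamp each appended row with the load time, and only pull rows newer than what is already loaded
INSERT INTO my_dataset.target_table (id, payload, event_ts, load_time)
SELECT id, payload, event_ts, CURRENT_TIMESTAMP() AS load_time
FROM my_dataset.source_table
WHERE event_ts > (
  SELECT IFNULL(MAX(event_ts), TIMESTAMP '1970-01-01')
  FROM my_dataset.target_table
);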
Based on your use case of ingesting data from a BigQuery table in Project X into your table in Project Y, here's an enhanced approach to ensure efficient and accurate data transfer:
Ingesting New Data from BigQuery Table in Project X to Project Y
1. Identifying New Data: Use the timestamp column to select only the rows in Project X's table (Table A) that are newer than the latest timestamp already present in Table B in Project Y.
2. Loading New Data into a Staging Table: Insert those rows into a staging table in Project Y, keeping the incremental load separate from the main table (a one-time creation sketch for the staging table follows the example below).
3. Transferring Data to Main Table (Table B): Use a MERGE statement to efficiently transfer data from the staging table to Table B in Project Y. Structure the MERGE statement to avoid duplicates, typically by matching on a unique identifier.

Simplified Example:
-- Query to select new data from Project X's table and load it into the staging table
-- (in practice, use fully qualified project.dataset.table names)
INSERT INTO ProjectY.staging_table
SELECT *
FROM ProjectX.TableA
WHERE timestamp_column > (
  -- if Table B can be empty, wrap MAX() in IFNULL with a default timestamp
  SELECT MAX(timestamp_column)
  FROM ProjectY.TableB
);
-- Merge statement to insert only the rows that are not already in the main table
MERGE INTO ProjectY.TableB AS target
USING ProjectY.staging_table AS source
ON target.id = source.id
WHEN NOT MATCHED THEN
  INSERT (columns...)   -- list the target columns
  VALUES (values...);   -- and the matching source values
-- (if the staging and target schemas are identical, INSERT ROW can replace the column/value lists)

-- Optionally, clean up the staging table after a successful merge
TRUNCATE TABLE ProjectY.staging_table;
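If the staging table does not already exist, it can be created once up front; here is a minimal sketch, assuming the same placeholder names as above (CREATE TABLE ... LIKE copies Table B's schema without copying any data):

-- One-time setup: empty staging table with the same schema as Table B
CREATE TABLE IF NOT EXISTS ProjectY.staging_table
LIKE ProjectY.TableB;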
Additional Considerations:
- Permissions: the account running the Airflow job needs read access to Table A in Project X and write access to the staging table and Table B in Project Y.
- Scheduling: your existing 12-hour Airflow schedule can run the two statements above in sequence; because the MERGE only inserts unmatched rows, re-running the pipeline does not create duplicates.
- Existing duplicates: if Table B already contains repeated rows from earlier WRITE_APPEND runs, clean them up once before switching to this approach (see the sketch at the end of this post).
This enhanced approach should provide a robust framework for transferring only new data from Project X to Project Y, ensuring data integrity, efficiency, and compliance with security standards.
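Since the repeated rows from earlier WRITE_APPEND runs were the original problem, here is a rough one-time cleanup sketch for Table B before switching to the incremental approach (it assumes id uniquely identifies a row and timestamp_column picks the most recent version; note that CREATE OR REPLACE resets table options such as partitioning and expiration):

-- One-time deduplication: keep only the latest row per id
CREATE OR REPLACE TABLE ProjectY.TableB AS
SELECT * EXCEPT(rn)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY timestamp_column DESC) AS rn
  FROM ProjectY.TableB
)
WHERE rn = 1;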