
Large File Data Ingestion to BigQuery

I have two positional files containing customer data.

The first file (file size 130 GB) is structured as

customer_identifier| surname|name|dob| ...

Every row is 1,000 characters long and contains data for a single customer.

The second file (file size 180 GB) contains other customer details and is structured as

customer_identifier| gender| address| phone_number| ...

I need to merge these two files on customer_identifier (a 22-character key), filter some columns, and load the result into Google BigQuery.

What is the best approach to do this?

 


To merge these two large positional files and load them into Google Cloud BigQuery efficiently, follow these steps:

1. Upload Files to Google Cloud Storage (GCS)

  • Upload both files (file1.txt and file2.txt) to a GCS bucket. This is essential for further processing with BigQuery or Dataflow; a minimal upload sketch follows below.
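
If you want to script the upload rather than use gcloud storage cp or gsutil cp (which, with parallel composite uploads, is usually the more practical route for 130-180 GB files), here is a minimal sketch with the google-cloud-storage Python client. The bucket and object names are placeholders, not something from your post.

from google.cloud import storage

def upload_to_gcs(bucket_name: str, local_path: str, blob_name: str) -> None:
    """Upload one local file to gs://bucket_name/blob_name."""
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_name)
    # A larger chunk size means fewer resumable-upload requests for multi-GB files.
    blob.chunk_size = 256 * 1024 * 1024  # 256 MB; must be a multiple of 256 KB
    blob.upload_from_filename(local_path)
    print(f"Uploaded {local_path} to gs://{bucket_name}/{blob_name}")

if __name__ == "__main__":
    upload_to_gcs("my-ingestion-bucket", "file1.txt", "file1.txt")
    upload_to_gcs("my-ingestion-bucket", "file2.txt", "file2.txt")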

2. Create an External Table for Each File in BigQuery

  • You can create external tables in BigQuery directly over the files stored in GCS, treating '|' as the field delimiter.

  • Create External Table for the First File:

CREATE OR REPLACE EXTERNAL TABLE `project.dataset.customer_file1`
(
  customer_identifier STRING,
  surname STRING,
  name STRING,
  dob DATE,  -- assumes YYYY-MM-DD in the file; otherwise declare STRING and parse later
  -- define other columns according to your file structure
)
OPTIONS (
  format = 'CSV',
  skip_leading_rows = 0,
  field_delimiter = '|',
  uris = ['gs://bucket/file1.txt']
);

  • Create External Table for the Second File:

CREATE OR REPLACE EXTERNAL TABLE `project.dataset.customer_file2`
(
  customer_identifier STRING,
  gender STRING,
  address STRING,
  phone_number STRING,
  -- define other columns according to your file structure
)
OPTIONS (
  format = 'CSV',
  skip_leading_rows = 0,
  field_delimiter = '|',
  uris = ['gs://bucket/file2.txt']
);

3. Create a BigQuery Table to Store Merged Data

Create a native table to hold the merged result; with CREATE TABLE ... AS SELECT, the schema is derived from the query itself.

CREATE OR REPLACE TABLE `project.dataset.merged_customers` AS
SELECT
  f1.customer_identifier,
  f1.surname,
  f1.name,
  f1.dob,
  f2.gender,
  f2.address,
  f2.phone_number
FROM
  `project.dataset.customer_file1` f1
JOIN
  `project.dataset.customer_file2` f2
ON
  f1.customer_identifier = f2.customer_identifier;

4. Optional: Dataflow for Advanced Transformations

  • If you need more advanced transformations or filtering (e.g., parsing truly fixed-width records or applying complex logic), you could use Google Cloud Dataflow. Dataflow handles large files efficiently and can apply the necessary transformations before loading the data into BigQuery; a minimal sketch follows below.
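
If you go the Dataflow route, a minimal Apache Beam (Python) sketch along these lines could parse both files, join on customer_identifier, keep a subset of columns, and write to BigQuery, assuming the fields really are pipe-delimited (for truly fixed-width records you would slice each line by position instead). The project, region, bucket, table name, and field positions are placeholders.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def parse_file1(line):
    # file1: customer_identifier|surname|name|dob|...
    f = line.split("|")
    return f[0].strip(), {"surname": f[1].strip(), "name": f[2].strip(), "dob": f[3].strip()}

def parse_file2(line):
    # file2: customer_identifier|gender|address|phone_number|...
    f = line.split("|")
    return f[0].strip(), {"gender": f[1].strip(), "address": f[2].strip(), "phone_number": f[3].strip()}

def merge(element):
    key, grouped = element
    # One output row per matching pair; keys missing from either file are dropped,
    # mirroring the INNER JOIN used in the SQL above.
    for left in grouped["f1"]:
        for right in grouped["f2"]:
            yield {"customer_identifier": key, **left, **right}

def run():
    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",
        region="us-central1",
        temp_location="gs://bucket/tmp",
    )
    with beam.Pipeline(options=options) as p:
        f1 = (p | "ReadFile1" >> beam.io.ReadFromText("gs://bucket/file1.txt")
                | "ParseFile1" >> beam.Map(parse_file1))
        f2 = (p | "ReadFile2" >> beam.io.ReadFromText("gs://bucket/file2.txt")
                | "ParseFile2" >> beam.Map(parse_file2))
        ({"f1": f1, "f2": f2}
         | "JoinOnKey" >> beam.CoGroupByKey()
         | "MergeRows" >> beam.FlatMap(merge)
         | "WriteToBQ" >> beam.io.WriteToBigQuery(
               "my-project:dataset.merged_customers",
               schema=("customer_identifier:STRING,surname:STRING,name:STRING,"
                       "dob:STRING,gender:STRING,address:STRING,phone_number:STRING"),
               write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
               create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))

if __name__ == "__main__":
    run()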

5. Filtering Columns

  • Apply filtering directly in the SQL query that loads data into the final BigQuery table. For instance, to exclude certain columns or rows, adjust the SELECT statement from step 3; a programmatic variant is sketched below.
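
If you prefer to drive that final query from a script instead of the console, here is a minimal sketch with the google-cloud-bigquery Python client; the project, dataset, selected columns, and WHERE clause are placeholders you would adjust to your actual filtering needs.

from google.cloud import bigquery

client = bigquery.Client(project="my-project")

sql = """
SELECT
  f1.customer_identifier,
  f1.surname,
  f1.name,
  f2.phone_number
FROM `my-project.dataset.customer_file1` AS f1
JOIN `my-project.dataset.customer_file2` AS f2
  ON f1.customer_identifier = f2.customer_identifier
WHERE f2.phone_number IS NOT NULL  -- example row filter, adjust or remove
"""

# Write the query result straight into the final table.
destination = bigquery.TableReference.from_string("my-project.dataset.merged_customers")
job_config = bigquery.QueryJobConfig(
    destination=destination,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
client.query(sql, job_config=job_config).result()  # blocks until the job finishes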

With files that large, you'd probably want to stage them in Cloud Storage first, then use Dataflow or a custom Python script in Cloud Functions to parse, filter, and join on customer_identifier before loading into BigQuery. (URL Removed by Staff) gives a good overview of ingestion strategies that could help shape your pipeline.