
DATAFORM - Create table on BigQuery using Parquet files in GCS

Hello,

I would like to know whether it is possible to create a table from Parquet data in GCS using Dataform.

The idea would be to do exactly what the Terraform resource below does:

resource "google_bigquery_table" "my_table" {
  dataset_id  = google_bigquery_dataset.landing.dataset_id
  table_id    = "my_table"
  description = "table using parquet in GCS"
  external_data_configuration {
    autodetect    = true
    source_format = "PARQUET"
    connection_id = google_bigquery_connection.cloud_resource_connection_southamerica_east1.name
    source_uris   = ["gs://data/schema/table/*.parquet"]
  }
}

Hello @airtonchagas,

Yes, you can create an external table over Parquet data in GCS using Dataform. However, Dataform doesn't have a dedicated table type for this like Terraform's google_bigquery_table resource. Instead, you run BigQuery's CREATE EXTERNAL TABLE statement from a SQLX file of type "operations".

Here's a breakdown of how you can do it:

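  1. Make sure the connection exists:

The Cloud resource connection is a BigQuery resource, not something you declare in Dataform. Create it outside Dataform; the google_bigquery_connection resource referenced in your Terraform already does this. Your dataform.json (workflow_settings.yaml in newer Dataform core versions) holds only project-level defaults and has no connections section. Here's an example with placeholder values:

{
  "warehouse": "bigquery",
  "defaultDatabase": "your-project-id",
  "defaultSchema": "your_dataset",
  "defaultLocation": "southamerica-east1"
}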
  2. Define your External Table:

Create a SQLX file of type "operations" (for example definitions/my_table.sqlx) that runs BigQuery's CREATE EXTERNAL TABLE statement. The project and connection IDs below are placeholders, so replace them with your own:

config {
  type: "operations",
  hasOutput: true
}

CREATE OR REPLACE EXTERNAL TABLE ${self()}
WITH CONNECTION `your-project-id.southamerica-east1.your-connection-id`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://data/schema/table/*.parquet']
)
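  3. Reference the table from other actions:

You don't need to define a graph yourself. Dataform compiles every file under definitions/ and builds the dependency graph automatically. If the action that creates the table is an operations file with hasOutput: true, downstream actions can depend on it with ref(). Here's an example of a view that reads the external table:

config { type: "view" }

SELECT * FROM ${ref("my_table")}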
  4. Run your Dataform project:

Finally, execute the project. In Dataform in Google Cloud, start a workflow execution from your workspace. With the open-source CLI, use the command:

dataform run

Explanation:

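  • type: "operations" tells Dataform to run the SQL as written, which is what lets you execute DDL such as CREATE EXTERNAL TABLE.
  • hasOutput: true together with ${self()} registers the created table as the output of the action, so other actions can reference it with ref().
  • WITH CONNECTION names the BigQuery connection used to read from GCS, in the form project.location.connection_id.
  • uris is a list of GCS paths containing your Parquet files.
  • format specifies the data format (in this case, 'PARQUET').
  • No autodetect option is needed: when the statement has no column list, BigQuery infers the schema from the Parquet files.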
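
If you have several tables like this, the CREATE EXTERNAL TABLE statement can be generated from JavaScript instead of being written by hand each time. Below is a minimal sketch, not an official Dataform API: externalTableDdl and its parameter names are hypothetical, and the project, dataset, and connection IDs are placeholders. It only builds the SQL string.

```javascript
// Hypothetical helper (not part of Dataform): builds the BigQuery DDL for an
// external table over Parquet files. All IDs used below are placeholders.
function externalTableDdl({ target, connection, uris }) {
  if (!Array.isArray(uris) || uris.length === 0) {
    throw new Error("at least one source URI is required");
  }
  // Quote each GCS URI as a SQL string literal.
  const uriList = uris.map((uri) => `'${uri}'`).join(", ");
  // No column list is emitted, so BigQuery infers the schema from the files.
  return [
    `CREATE OR REPLACE EXTERNAL TABLE \`${target}\``,
    `WITH CONNECTION \`${connection}\``,
    "OPTIONS (",
    "  format = 'PARQUET',",
    `  uris = [${uriList}]`,
    ")",
  ].join("\n");
}

const ddl = externalTableDdl({
  target: "your-project-id.your_dataset.my_table",
  connection: "your-project-id.southamerica-east1.your-connection-id",
  uris: ["gs://data/schema/table/*.parquet"],
});
console.log(ddl);
```

In a Dataform project, a function like this would typically live in a file under includes/ and its result would be used as the query of an operations action; treat that wiring as something to check against the Dataform documentation for your version.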

Important Considerations:

  • Make sure the service account of your BigQuery connection can read the bucket, for example by granting it the Storage Object Viewer role. The account that runs Dataform also needs permission to use the connection.
  • If you don't want to rely on schema inference, declare the columns explicitly in a column list after the table name in the CREATE EXTERNAL TABLE statement.
  • For further transformations, define regular Dataform tables or views in SQLX that select from the external table.
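
When the schema is declared by hand, BigQuery's CREATE EXTERNAL TABLE statement takes a column list between the table name and the connection clause. The sketch below shows what that statement could look like. The columnList helper, the column names and types, and the project, dataset, and connection IDs are all made up for illustration and are not taken from your data.

```javascript
// Hypothetical helper: renders "name TYPE" pairs as a DDL column list.
function columnList(columns) {
  return columns.map(({ name, type }) => `  ${name} ${type}`).join(",\n");
}

// Example columns are invented; replace them with your real Parquet schema.
const explicitSchemaDdl = [
  "CREATE OR REPLACE EXTERNAL TABLE `your-project-id.your_dataset.my_table` (",
  columnList([
    { name: "id", type: "INT64" },
    { name: "payload", type: "STRUCT<kind STRING, tags ARRAY<STRING>>" },
  ]),
  ")",
  "WITH CONNECTION `your-project-id.southamerica-east1.your-connection-id`",
  "OPTIONS (",
  "  format = 'PARQUET',",
  "  uris = ['gs://data/schema/table/*.parquet']",
  ")",
].join("\n");

console.log(explicitSchemaDdl);
```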

With these steps, Dataform creates a BigQuery external table over your Parquet files in GCS, equivalent to the Terraform resource in your question.

I hope this helps.