Hello,
I would like to know if it is possible to create a table from Parquet data in GCS using Dataform.
The idea would be to do exactly what the Terraform feature below does:
resource "google_bigquery_table" "my_table" {
  dataset_id  = google_bigquery_dataset.landing.dataset_id
  table_id    = "my_table"
  description = "table using parquet in GCS"

  external_data_configuration {
    autodetect    = true
    source_format = "PARQUET"
    connection_id = google_bigquery_connection.cloud_resource_connection_southamerica_east1.name
    source_uris   = ["gs://data/schema/table/*.parquet"]
  }
}
Hello @airtonchagas,
Yes, you can create a table over Parquet data in GCS with Dataform. Dataform has no dedicated resource for this like Terraform's google_bigquery_table, though. Instead, you run BigQuery's CREATE EXTERNAL TABLE statement from a SQLX file of type "operations", and Dataform executes it as part of the workflow.
Here's a breakdown of how you can do it:
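The BigQuery connection is not defined in Dataform. It must already exist in BigQuery (for example, created by your Terraform google_bigquery_connection resource), and its service account needs read access to the bucket. In dataform.json you only set the project, the default dataset, and the location, which should match the connection's location. In newer Dataform core versions the same settings live in workflow_settings.yaml. Here's an example, assuming the connection lives in southamerica-east1 as its name suggests:
{
  "warehouse": "bigquery",
  "defaultDatabase": "your-project-id",
  "defaultSchema": "landing",
  "defaultLocation": "southamerica-east1"
}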
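One detail worth checking when moving from Terraform: as far as I know, the name attribute of google_bigquery_connection is the full resource path (projects/PROJECT/locations/LOCATION/connections/CONNECTION), while the WITH CONNECTION clause in BigQuery DDL is usually written in the dotted PROJECT.LOCATION.CONNECTION form. Below is a minimal JavaScript sketch of that conversion. toDdlConnectionId is a hypothetical helper name, not part of Dataform or BigQuery.

```javascript
// Hypothetical helper (not a Dataform or BigQuery API): converts a connection
// resource path into the dotted identifier used in BigQuery DDL.
function toDdlConnectionId(resourceName) {
  const pattern = /^projects\/([^/]+)\/locations\/([^/]+)\/connections\/([^/]+)$/;
  const match = pattern.exec(resourceName);
  if (!match) {
    // Fail loudly rather than emit a malformed identifier into the DDL.
    throw new Error(`Unexpected connection resource name: ${resourceName}`);
  }
  const [, project, location, connection] = match;
  return `${project}.${location}.${connection}`;
}
```

If your connection ID is already in dotted form, you can skip this step entirely.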
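You can create the external table with an operations file that runs the BigQuery DDL. Dataform actions are written in SQLX or JavaScript, so this goes in a file such as definitions/my_table.sqlx. The project ID and region in the connection name below are placeholders:
config {
  type: "operations",
  hasOutput: true,
  schema: "landing"
}

CREATE OR REPLACE EXTERNAL TABLE ${self()}
WITH CONNECTION `your-project-id.southamerica-east1.cloud_resource_connection_southamerica_east1`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://data/schema/table/*.parquet'],
  description = 'table using parquet in GCS'
)
Parquet files carry their own schema, so BigQuery infers the columns without an explicit autodetect option.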
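If you have several external tables to create, the BigQuery DDL can be generated rather than hand-written. The following is a minimal JavaScript sketch under stated assumptions: buildExternalTableDdl and its parameter names are hypothetical, and the table and connection identifiers are placeholders. Only the CREATE EXTERNAL TABLE syntax itself comes from BigQuery.

```javascript
// Hypothetical helper (not a Dataform API): builds a BigQuery
// CREATE EXTERNAL TABLE statement from a plain options object.
function buildExternalTableDdl({ table, connection, format, uris, description }) {
  // Quote a value as a SQL string literal, escaping backslashes and single quotes.
  const quote = (s) => `'${String(s).replace(/\\/g, "\\\\").replace(/'/g, "\\'")}'`;

  const options = [
    `format = ${quote(format)}`,
    `uris = [${uris.map(quote).join(", ")}]`,
  ];
  if (description) {
    options.push(`description = ${quote(description)}`);
  }

  return [
    `CREATE OR REPLACE EXTERNAL TABLE \`${table}\``,
    `WITH CONNECTION \`${connection}\``,
    `OPTIONS (`,
    `  ${options.join(",\n  ")}`,
    `)`,
  ].join("\n");
}

// Example call using the values from the question (project ID is a placeholder).
const ddl = buildExternalTableDdl({
  table: "your-project-id.landing.my_table",
  connection: "your-project-id.southamerica-east1.cloud_resource_connection_southamerica_east1",
  format: "PARQUET",
  uris: ["gs://data/schema/table/*.parquet"],
  description: "table using parquet in GCS",
});
console.log(ddl);
```

In a Dataform project, a function like this would typically sit in a file under includes/ and be exported with module.exports, so an operations file can call it instead of repeating the statement.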
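You don't need to define a graph yourself. Dataform compiles every file under definitions/ into its dependency graph. Other actions can depend on the external table with ${ref("my_table")}, provided the action that creates it declares an output (hasOutput: true in an operations config).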
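Finally, execute the workflow. With the open-source Dataform CLI, the command is:
dataform run
In Dataform on Google Cloud, start a workflow execution from your workspace instead.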
With these steps, you can efficiently create BigQuery tables from your GCS parquet data using Dataform.
I hope this helps.