
Is it possible to make a partitioned table with Datastream?

Hello!

I've created a stream from MySQL into BigQuery, and the destination table is quite large. Is it possible to enable automatic partitioning, or to set up partitioning for the destination table in some way? By default it isn't partitioned.

Solved
1 ACCEPTED SOLUTION

Hi @Stanislav ,

Yes, you can create a partitioned table in BigQuery to serve as the destination for a Datastream from MySQL. Datastream itself does not handle the partitioning; you must set up the partitioned table in BigQuery first.

To create a partitioned table in BigQuery, use the following SQL command:

CREATE TABLE dataset.partitioned_table (
  id INT64,
  name STRING,
  timestamp TIMESTAMP
)
PARTITION BY DATE(timestamp);

After creating the partitioned table, configure your Datastream to write to this table. When setting up the Datastream, specify the BigQuery dataset and the partitioned table as the destination. Datastream will then stream changes into this table, and BigQuery will automatically partition the data based on the timestamp column.
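Once the stream is running, you can check that rows are actually landing in partitions by querying the dataset's `INFORMATION_SCHEMA.PARTITIONS` view (the dataset and table names below follow the illustrative example above):

```sql
-- List the destination table's partitions and how many rows each holds.
SELECT partition_id, total_rows
FROM dataset.INFORMATION_SCHEMA.PARTITIONS
WHERE table_name = 'partitioned_table'
ORDER BY partition_id;
```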

Keep in mind that the partitioning is managed by BigQuery, not Datastream. Datastream is responsible for delivering the data to the table, and BigQuery takes care of organizing it into partitions.


5 REPLIES


When I'm creating a stream there is no option to choose a table; the only option is to add a prefix to the dataset. Moreover, Datastream adds some meta columns and creates the destination structure with type mapping from the source. That's why your advice isn't very helpful.

(screenshot attachment: Stanislav_0-1699433504124.png)

I apologize for any confusion. You are correct in your understanding of how Google Cloud Datastream operates with regard to BigQuery.

When you create a stream in Datastream, there isn't an option to select a specific destination table. Instead, Datastream automatically generates tables within the specified destination dataset in BigQuery. The names of these tables are derived from the stream name and the source table names.

Datastream is engineered to facilitate the streaming of data from a source database (like MySQL) to a destination dataset in BigQuery. It's designed to handle multiple tables within a single dataset, creating a corresponding table in BigQuery for each source table in the stream. This design simplifies the management and querying of your streamed data in BigQuery.

Additionally, Datastream appends meta columns to the destination tables. These columns are essential for tracking the state of the stream and providing metadata about the source data, such as the source's operation type and timestamps. While these columns are part of the table schema, they are typically not used in data analysis queries and can be excluded from views or analysis as needed.
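For example, assuming the stream adds its metadata as a single record column named `datastream_metadata` (the exact column names depend on your Datastream setup, and the dataset and table names here are illustrative), a view can hide it from analysis queries:

```sql
-- Hypothetical view exposing only the business columns of a replicated table;
-- 'datastream_metadata' is the assumed name of the stream's metadata column.
CREATE OR REPLACE VIEW dataset.orders_clean AS
SELECT * EXCEPT (datastream_metadata)
FROM dataset.orders;
```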

Regarding data type mapping, Datastream automatically converts data types from the source database to compatible BigQuery data types. This ensures that the data in BigQuery is stored in a format that is consistent with BigQuery's data type system.

It's important to note that while Datastream handles the initial ingestion of data into BigQuery, it does not manage table partitioning. If you require partitioned tables in BigQuery for better performance and cost management, you would need to create and manage these tables separately. After Datastream streams data into BigQuery, you can use additional data processing tools like Google Dataflow or custom scripts to organize the data into partitioned tables as needed.
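One way to sketch that step, using purely illustrative table and column names, is a `CREATE TABLE ... AS SELECT` that copies the Datastream-created table into a partitioned replacement:

```sql
-- Copy the unpartitioned table Datastream created into a time-partitioned one.
-- 'orders' and 'created_at' are placeholders for your own table and column.
CREATE TABLE dataset.orders_partitioned
PARTITION BY DATE(created_at)
AS SELECT * FROM dataset.orders;
```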

Thank you for your reply.

But there are a few points where the flow described above breaks down.

I don't know the names and types of the meta columns Datastream needs in the target table; I can only find them out after the stream is created. Then I need to stop the stream, find the DDL of the table Datastream created, and only after that drop the table and recreate a partitioned one with the same DDL.

I'd argue this process is not straightforward at all. Instead, there should be an option to choose which column to partition by, because Datastream knows the schema of the source table in advance.

Hi @Stanislav ,

I understand your concerns about the process of manually creating a partitioned table after Datastream has created the initial table. The current workflow is indeed not as straightforward as it could be, and it would be convenient to have an option to specify the partitioning column directly when creating the stream.

The lack of this option is likely due to the flexibility that Datastream aims to provide in terms of handling different source table schemas. By automatically generating the table schema and adding meta columns, Datastream ensures that it can handle a wide range of source tables without requiring any prior knowledge of their specific structure.

However, this approach can make it more challenging for users to customize the table structure, especially when it comes to partitioning. As you mentioned, it requires additional steps of stopping the stream, identifying the meta columns, and recreating the table with the desired partitioning.

I would encourage you to submit feedback to Google Cloud through the Google Cloud Public Issue Tracker to suggest the addition of an option to specify the partitioning column during stream creation. This feedback can help raise awareness of this need among the development team and potentially lead to an improvement in the Datastream user experience.

In the meantime, you can continue to follow the current workflow: let Datastream create the initial table, identify the meta columns it added, and then manually recreate the table as a partitioned one. Although this process is not ideal, it ensures that your data is partitioned for better performance and cost management.
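A hedged sketch of that workflow, with illustrative names and the stream paused first: pull the generated table's DDL (meta columns included) from `INFORMATION_SCHEMA`, recreate the table from that DDL with a `PARTITION BY` clause added, then swap it into place:

```sql
-- 1. Fetch the DDL of the table Datastream generated, so the meta columns
--    and their types can be copied into a partitioned CREATE TABLE statement.
SELECT ddl
FROM dataset.INFORMATION_SCHEMA.TABLES
WHERE table_name = 'orders';

-- 2. Once a partitioned copy (here 'orders_partitioned') has been created and
--    backfilled from the original, swap names so the stream writes into it.
DROP TABLE dataset.orders;
ALTER TABLE dataset.orders_partitioned RENAME TO orders;
```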