Hi Team,
I found an interesting case with Datastream. I currently have Datastream replicating to BigQuery (via Cloud Storage, Pub/Sub, and Dataflow) at production scale. The streams have been running well for about a month, but when I went to check data integrity, I found one issue.
It seems that Datastream converts every TIMESTAMP column from the source MySQL. When I downloaded and opened the Avro file on GCS, the TIMESTAMP values had already been converted to UTC, but the DATETIME values had not.
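For context, this matches how MySQL itself treats the two types: TIMESTAMP is stored internally in UTC and rendered in the session time zone, while DATETIME is stored literally, so a UTC-normalizing reader shifts only TIMESTAMP values. A minimal BigQuery sketch of the shift, assuming a hypothetical source session time zone of UTC+8 for illustration:

```sql
-- Hypothetical illustration, assuming the source session time zone is UTC+8.
-- A TIMESTAMP value is stored in UTC, so a UTC reader sees it shifted back 8 hours;
-- a DATETIME value is stored literally and comes through unchanged.
SELECT
  DATETIME(TIMESTAMP('2023-01-01 08:00:00', '+8'), 'UTC') AS timestamp_as_seen,  -- 2023-01-01 00:00:00
  DATETIME '2023-01-01 08:00:00' AS datetime_as_seen;                            -- 2023-01-01 08:00:00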
So the issue is like this:
MySQL                    BigQuery
TIMESTAMP column  --->  DATETIME column with a different offset (converted to UTC+0)
DATETIME column   --->  DATETIME column with the same offset as the source DB
Is there any way to retain the timestamp column as is? Of course, we can create a view in BigQuery to mitigate this issue, but it would be better to know the root cause on the Datastream side.
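The view-based mitigation could be sketched roughly as below. The dataset, table, column, and time zone names here are assumptions for illustration only; substitute your own:

```sql
-- Hypothetical view: converts the UTC-normalized DATETIME back to the source
-- DB's local time. `my_dataset.ext_tbl`, `date_update`, and the '+8' offset
-- are placeholders.
CREATE OR REPLACE VIEW `my_dataset.ext_tbl_local` AS
SELECT
  * EXCEPT (date_update),
  -- Reinterpret the stored value as UTC, then render it in the source time zone.
  DATETIME(TIMESTAMP(date_update, 'UTC'), '+8') AS date_update
FROM
  `my_dataset.ext_tbl`;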
@farizi I am using a (MySQL => Cloud SQL) Datastream with external tables, and the TIMESTAMP and DATETIME columns have the same time value. The only problem is that it appends the string "UTC" at the end (which is mostly a BigQuery issue).
If at your end you are getting this from the Datastream side itself, then you can file a bug request here:
https://issuetracker.google.com/issues?q=status:open%20componentid:1150595&s=created_time:desc
Are there any parameters that you set on Datastream before starting it? Did you modify any schema, or did you run it as is?
@farizi Yes, but on the external table. enable_logical_types=true does help in changing some formats:
CREATE EXTERNAL TABLE `ext_tbl`
WITH PARTITION COLUMNS
WITH CONNECTION ``
OPTIONS(
  enable_logical_types=true,
  format="AVRO",
  hive_partition_uri_prefix="",
  require_hive_partition_filter=true,
  uris=["gs://bucketname/folder/folder/*"]
);
In Datastream, I don't think there is any parameter that can do such things. Can you show one sample, i.e., the data in the source DB and what you get in the Avro file?
Sorry for the wait. Here are two screenshots: the first is from the DB, and the second is from the Datastream Avro file. Look at the date_create and date_update columns, and notice that in the second picture the date_update column has an 8-hour delta from the date_create column. That is the issue: the date_update column has TIMESTAMP type in the DB, and by the time it arrives in GCS, the value has already changed.
@farizi are you sure that both rows correspond to the same CDC event? There is no specific pattern observed here; it looks like you are comparing two different rows. For a row with a single id (primary key) in MySQL, you can have multiple rows in GCS, considering the CDC rows related to the same id.
Regarding your use case: are you replicating from source MySQL to BQ, or are you dumping the Datastream output into BQ?
Hello, we are using Dataflow's Datastream to BigQuery template to replicate the MySQL DB to BigQuery. So far there haven't been any issues, except for this timestamp issue.