
Datastream fails to read latin1 encoded tables

Hello,

I'm using Datastream to unload binlog files to Google Cloud Storage. Unfortunately, my source database is encoded in latin1:

 

> SHOW VARIABLES LIKE 'character_set_database';

> latin1

> SHOW VARIABLES LIKE 'collation_database';

> latin1_swedish_ci

I started the stream using a "mysql-source-config" that explicitly declares the column collation:

 

{
    "includeObjects": {
        "mysqlDatabases": [
            {
                "database": "my_db",
                "mysqlTables": [
                    {
                        "table": "my_table",
                        "mysqlColumns": [
                            {
                                "column": "id",
                                "dataType": "int"
                            },
                            {
                                "column": "some_text_col",
                                "dataType": "varchar",
                                "primaryKey": false,
                                "collation": "latin1_swedish_ci"
                            }
                        ]
                    }
                ]
            }
        ]
    },
    "excludeObjects": {}
}

 

The stream fails to read the content of the "some_text_col" column with the following error:

Discarded 2 unsupported events with reason code: MYSQL_DECODE_ERROR. Latest discarded event details: Discarded an event from my_db.my_table: Event Parsing Error: Failed to parse event: === UpdateRowsEvent === Date: 2024-07-03T13:26:08 Log position: 17343280 Event size: 839 Read bytes: 161. Successfully parsed rows: []., caused by: Row Parsing Error: Failed to parse row of table xxx ... [skipping because the full schema is written] 

, caused by:\n Column Parsing Error: Failed to parse bytes:0x312ee382b9e382abe382a4e383a9e383b3e383aae38383e382b8e58f82e58aa0e381aee3819fe38281e38081e4bba5e4b88be381aee98081e8bf8ee38292e3818ae9a198e38184e887b4e38197e381bee38199e380820a382f313728e59c9f2931373a3130e4b887e5baa7e9b9bfe6b2a2e58fa3e9a785e28692e3839be38386e383ab0a382f313828e697a52931373a3035e3839be38386e383abe28692e4b887e5baa7e9b9bfe6b2a2e58fa3e9a7850a322ee383ace382a4e38388e38381e382a7e38383e382afe382a2e382a6e3838831373a3030e381abe381a6e3818ae9a198e38184e887b4e38197e381bee38199e380820a332ee7a681e78599e381abe381a6e3818ae9a198e38184e887b4e38197e381bee38199e380823c62723e44696e6e65722a2030204a5059 as value of column {'type': 252, 'name': 'Comment', 'collation_name': 'latin1_swedish_ci', 'character_set_name': 'latin1', 'comment': '', 'unsigned': False, 'zerofill': False, 'type_is_bool': False, 'is_primary': False, 'fixed_binary_length': None, 'length_size': 2}., caused by:\n 'charmap' codec can't decode byte 0x8f in position 27: character maps to <undefined>",
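Incidentally, the discarded bytes look like valid UTF-8 that happens to be stored in a column declared as latin1. A quick check in Python on the first 32 bytes of the hex dump above (interpreting the "charmap" codec in the error as a cp1252-style decoder, which is my assumption about the parser's internals):

```python
# First 32 bytes of the hex dump from the error message above.
raw = bytes.fromhex(
    "312ee382b9e382abe382a4e383a9e383b3e383aae38383e382b8e58f82e58aa0"
)

# Decoding as UTF-8 succeeds: the payload is Japanese text.
print(raw.decode("utf-8"))  # 1.スカイランリッジ参加

# Decoding with a cp1252-style "charmap" codec fails on 0x8f,
# exactly as reported ("can't decode byte 0x8f in position 27").
try:
    raw.decode("cp1252")
except UnicodeDecodeError as exc:
    print(exc.start, hex(raw[exc.start]))  # 27 0x8f
```

So the column seems to contain UTF-8 bytes behind a latin1 declaration, which would explain why a strict latin1/cp1252 decoder chokes on it.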
 
Datastream appears to simply ignore the collation parameter provided in the configuration.
Since the error usually occurs on the same columns, I also tried explicitly excluding those columns as a dirty fix.

Unfortunately, Datastream also ignores the exclude-column parameter: it keeps trying to read the whole row (including the columns I purposely excluded) and keeps failing, so the entire binlog event ends up being discarded.
 
Is there a way I'm not aware of to make sure Datastream honors the parameters it is given?
 
Or, at the very least, to replace a column value with NULL when it fails to decode?
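In case it helps others, here is the kind of downstream workaround I'm considering for the files already landed in GCS (only a sketch, not a Datastream feature): decode tolerantly and fall back to NULL.

```python
from typing import Optional

def safe_decode(raw: bytes) -> Optional[str]:
    """Try UTF-8 first (what the data actually looks like), then cp1252
    (MySQL-style latin1); return None so the value becomes NULL downstream."""
    for codec in ("utf-8", "cp1252"):
        try:
            return raw.decode(codec)
        except UnicodeDecodeError:
            continue
    return None
```

Note that cp1252 only rejects five byte values, so most non-UTF-8 input will still "decode" (possibly as mojibake) rather than become NULL.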
 
Thanks in advance!

 

1 ACCEPTED SOLUTION

Hello,

Thank you for contacting Google Cloud Community!

Datastream currently doesn't fully support parsing binlog events with latin1 encoding, specifically when encountering characters outside the standard ASCII range. This leads to parsing errors and discarded events.

Unfortunately, there's no perfect solution at this moment, but you can consider:

Upgrade Source Database Encoding:

  • If possible, the ideal solution is to upgrade your source MySQL database character set and collation to a more widely supported encoding such as UTF-8 (utf8mb4 in MySQL).
  • This will ensure compatibility with Datastream and other tools in the long run.
  • Upgrading the database encoding might require some planning and potential downtime, so consider a maintenance window for this change.
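As a sketch of what that conversion might look like (table and column names are placeholders; verify against your own schema first): if the column's bytes are actually UTF-8 that was merely declared latin1, as the discarded bytes in the question suggest, a direct CONVERT TO CHARACTER SET would transcode and corrupt them. The usual safe route is a two-step change through a binary type, which reinterprets the bytes without transcoding:

```python
def latin1_to_utf8mb4(table: str, column: str, length: int = 255) -> list[str]:
    """Generate the two-step ALTERs for a latin1 VARCHAR column whose bytes
    are really UTF-8. Passing through VARBINARY preserves the raw bytes."""
    return [
        f"ALTER TABLE {table} MODIFY {column} VARBINARY({length});",
        f"ALTER TABLE {table} MODIFY {column} VARCHAR({length}) "
        f"CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;",
    ]

for stmt in latin1_to_utf8mb4("my_table", "some_text_col"):
    print(stmt)
```

Take a backup and test on a copy before running anything like this against production.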

Consider raising a feature request with Google Cloud. While there's no guarantee of immediate implementation, expressing user demand can help prioritize future support for latin1 encoding in Datastream.

Regards,
Jai Ade

 



 

Thanks for answering; we'll consider moving to a supported database encoding.

Cheers
