Datastream from AWS RDS MariaDB to BigQuery

Hi everyone,

I am in the process of setting up Datastream to transfer data from AWS RDS (MariaDB engine) to BigQuery. I am using the documentation at https://cloud.google.com/datastream/docs/configure-your-source-mysql-database for guidance. However, I have come across three source-side configurations that I am not sure about:

net_read_timeout = 3600
net_write_timeout = 3600
wait_timeout = 86400

The values for these configurations are much higher than the default value for MariaDB, and the net_write_timeout is causing an "Incompatible Parameters" error for the RDS replica. Are they safe? Can anyone tell me if these configurations are necessary for Datastream to function correctly? I would really appreciate any advice or insight you may offer.

Thank you,

Solved
1 ACCEPTED SOLUTION

The three source-side configurations you have mentioned are related to how the MariaDB (or MySQL) database system behaves, not specifically how Datastream uses them. Datastream recommends specific values for these configurations to ensure smooth communication with the source database, but the definitions themselves are from the perspective of the database.

  • net_read_timeout: How long the server waits for more data on a client connection (such as the one Datastream opens) before aborting the read.
  • net_write_timeout: How long the server waits for a block being written to a client connection to complete before aborting the write.
  • wait_timeout: How long the server waits for activity on a non-interactive connection before closing it.

The default values for these configurations in MariaDB are:

  • net_read_timeout: 30 seconds
  • net_write_timeout: 60 seconds
  • wait_timeout: 28800 seconds (8 hours)

Datastream recommends the following values for these configurations:

  • net_read_timeout: 3600 seconds (1 hour)
  • net_write_timeout: 3600 seconds (1 hour)
  • wait_timeout: 86400 seconds (24 hours)
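As a quick sanity check, you can compare your source database's current settings against these recommendations before connecting Datastream. This is a sketch using standard MySQL/MariaDB commands; note that on RDS the values themselves must be changed through a DB parameter group, not with SET GLOBAL:

```sql
-- Inspect the three timeout values Datastream cares about (seconds)
SHOW GLOBAL VARIABLES
WHERE Variable_name IN ('net_read_timeout', 'net_write_timeout', 'wait_timeout');

-- On a self-managed MySQL/MariaDB server you could then set them directly:
SET GLOBAL net_read_timeout  = 3600;
SET GLOBAL net_write_timeout = 3600;
SET GLOBAL wait_timeout      = 86400;
```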

Datastream recommends these higher values because it streams data from the source database to BigQuery in real time. This means that Datastream needs to be able to handle long-running queries and network outages without losing data.

The net_write_timeout setting is causing an "Incompatible Parameters" error for your RDS replica because the replica's parameter group still carries the default of 60 seconds. On RDS you cannot change this with SET GLOBAL; the value has to be raised in a custom DB parameter group attached to the replica. If you cannot raise it there, you can run with a lower value, but only do so if you are confident that Datastream's long-running operations will not hit the timeout.

Are the Datastream recommended values for these configurations safe?

Yes, the Datastream-recommended values for these configurations are safe. Datastream has been tested with these values and found to be stable and reliable under typical use cases and data loads.

Are these configurations necessary for Datastream to function correctly?

No, these configurations are not necessary for Datastream to function correctly. However, using the Datastream recommended values for these configurations can help to improve the performance and reliability of Datastream.

Recommendations

Based on Datastream's documentation, using the recommended values for the net_read_timeout, net_write_timeout, and wait_timeout configurations can help ensure that Datastream is able to stream data from your AWS RDS database to BigQuery in a reliable and efficient manner.

If you cannot raise net_write_timeout on your RDS replica, you can run with a lower value, but only do so if you are confident that it will not interrupt Datastream's long-running reads and writes.

Additionally, the longer timeouts help Datastream to handle long-running queries and delays in capturing changes. This ensures that Datastream does not prematurely disconnect or lose data.


5 REPLIES

Thank you, @ms4446, for your thorough response. Your answer is incredibly helpful and reassuring.
May I suggest that this be included in the documentation? https://cloud.google.com/datastream/docs/configure-your-source-mysql-database
I think it will help a lot of users like me.

Also, for MariaDB, a Google engineer in London informed me that I could follow the MySQL doc. I later found that the parameter log_bin_use_v1_row_events does not apply to MariaDB; will it cause any issues if I ignore it? Is there a configuration on the Datastream side that I need to tweak?

I ran into an extra flag to be set in MySQL: max_execution_time. I had set it to 60 seconds, but it should be much higher to avoid errors in Datastream.

In addition to the mentioned flag, for us it was also necessary to adjust max_execution_time in our MySQL instance. 

Adjusting the max_execution_time parameter in your MySQL instance is indeed another important step when setting up replication or continuous data transfer tasks like those handled by Datastream. This setting caps the execution time of statements on the server, which prevents long-running queries from consuming excessive resources or running indefinitely.

  • The max_execution_time parameter sets a limit on the time a query can execute before the server terminates it. This is particularly useful in web applications to ensure that poorly written queries do not impact database performance. However, in the context of data replication or migration tasks, this limit can inadvertently interrupt necessary long-running queries.

  • By default, MySQL does not impose a limit on query execution time (max_execution_time is 0, meaning no limit), and the limit applies only to read-only SELECT statements. However, for performance or security reasons, it may have been lowered in some environments.

When using Datastream, it's essential to ensure that the max_execution_time is configured to accommodate the potentially long-running queries involved in data replication, especially when dealing with large datasets or complex transactions. Setting this too low could result in premature query termination, leading to errors or incomplete data replication.
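Before changing anything, it can help to confirm what the server is currently enforcing. A minimal check, using the standard MySQL variable name:

```sql
-- Show the current global and session limits (milliseconds; 0 = no limit)
SHOW GLOBAL VARIABLES LIKE 'max_execution_time';
SHOW SESSION VARIABLES LIKE 'max_execution_time';
```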

  • Recommended Setting: There's no one-size-fits-all value for max_execution_time as it heavily depends on your specific workload, data size, and complexity. However, for Datastream purposes, it's generally recommended to set this value high enough to allow for the uninterrupted execution of replication tasks. In some cases, it might even be advisable to disable this limit temporarily by setting it to 0 (no limit) during the initial data loading or when performing a full refresh of the data.

  • How to Adjust: You can adjust max_execution_time for your MySQL instance globally or per session. For a global change (affecting all new connections), execute the following SQL command:

    SET GLOBAL max_execution_time = 60000; -- Sets the limit to 60,000 milliseconds (60 seconds)

    To remove the limit entirely, set it to 0:

    SET GLOBAL max_execution_time = 0;

    Remember, these changes apply only to connections opened after the setting is changed; existing connections keep their current session value until they reconnect.
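If you would rather not relax the limit server-wide, MySQL also supports narrower scopes: a per-session override, or a per-statement optimizer hint for one-off long-running queries. A sketch (the `orders` table here is a hypothetical example):

```sql
-- Raise the limit for the current session only (milliseconds; 1 hour)
SET SESSION max_execution_time = 3600000;

-- Or override it for a single SELECT via an optimizer hint
-- (0 disables the limit for this statement only)
SELECT /*+ MAX_EXECUTION_TIME(0) */ COUNT(*) FROM orders;
```

This keeps the restrictive global limit in place for ordinary application traffic while letting specific replication or migration queries run to completion.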

Considerations

  • Impact on Performance: While increasing or removing the max_execution_time limit can facilitate uninterrupted data replication, be mindful of the potential impact on your database's performance. Long-running queries, especially if not related to the replication process, can consume significant resources.

  • Monitoring and Adjustment: Monitor your database's performance and adjust the max_execution_time as necessary based on the observed impact and replication needs. It might be beneficial to have more restrictive limits during normal operations and only relax them during intensive data transfer periods.

Adjusting the max_execution_time is an important consideration when configuring your MySQL instance for use with Datastream. Ensuring this setting is appropriately configured will help avoid unnecessary errors and ensure a smooth data replication process.