
Datastream not pulling change data from RDS Postgresql

I followed the on-screen instructions to connect Datastream from AWS RDS PostgreSQL to BigQuery.
Enabled logical replication, added 

CREATE PUBLICATION [MY_PUBLICATION] FOR ALL TABLES;
SELECT PG_CREATE_LOGICAL_REPLICATION_SLOT('[MY_SLOT_NAME]', 'pgoutput');
 
and 
 
CREATE USER [MY_USER] WITH ENCRYPTED PASSWORD '[MY_PASSWORD]';
GRANT RDS_REPLICATION TO [MY_USER];
GRANT SELECT ON ALL TABLES IN SCHEMA [MY_SCHEMA] TO [MY_USER];
GRANT USAGE ON SCHEMA [MY_SCHEMA] TO [MY_USER];
ALTER DEFAULT PRIVILEGES IN SCHEMA [MY_SCHEMA]
GRANT SELECT ON TABLES TO [MY_USER];
 
Followed the above steps.
I am able to pull the data by running a backfill, but change data (inserts/updates) is not showing up in BigQuery.
I am seeing this error in the logs:
[screenshot: kabilanravi_0-1694079589708.png]

It doesn't show exactly what the error is. Has anyone experienced this issue? I need help solving it.

 


If you are experiencing problems with logical replication between AWS RDS PostgreSQL and Google Cloud Datastream, you can try the following troubleshooting steps:

  1. Check the status of the logical replication slot.
SELECT * FROM pg_replication_slots;

The output of this command should show the status of all the logical replication slots on the database. The slot that you created for Datastream should be in the active state.

  2. Check the permissions on the logical replication slot.

The user that Datastream is using to connect to the database must have the replication role. On Amazon RDS you cannot grant the REPLICATION attribute directly; instead, grant the rds_replication role:

 
GRANT rds_replication TO [MY_USER];
  3. Check the permissions on the tables and schema that you are replicating.

The user that Datastream is using to connect to the database must have the SELECT privilege on all of the tables and the USAGE privilege on the schema that you are replicating. If either is missing, grant it with the following commands:

 
GRANT SELECT ON ALL TABLES IN SCHEMA [MY_SCHEMA] TO [MY_USER];
GRANT USAGE ON SCHEMA [MY_SCHEMA] TO [MY_USER];
  4. Check the logs for the Datastream service.

The logs may contain more information about the error that you are seeing. You can view the logs by going to the Cloud Console and navigating to Datastream > Connections. Click on the connection that you are using to replicate data from AWS RDS PostgreSQL to BigQuery, and then click on the Logs tab.
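
If it helps, checks 1 to 3 above can be rolled into a single verification query using the PostgreSQL catalog and privilege-inquiry functions. This is only a sketch: [MY_SLOT_NAME], [MY_USER], and [MY_SCHEMA] are the placeholders from this thread, and some_table is a hypothetical table name you would replace with one of your replicated tables.

```sql
-- Sketch: verify slot state, role membership, and privileges in one pass.
-- Run as an admin user on the RDS instance.
SELECT
  (SELECT active FROM pg_replication_slots
   WHERE slot_name = '[MY_SLOT_NAME]')                      AS slot_active,
  pg_has_role('[MY_USER]', 'rds_replication', 'member')     AS has_replication_role,
  has_schema_privilege('[MY_USER]', '[MY_SCHEMA]', 'USAGE') AS has_schema_usage,
  has_table_privilege('[MY_USER]',
                      '[MY_SCHEMA].some_table', 'SELECT')   AS can_select;
```

All four columns should come back true for CDC to work.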

Additional tips:

  • Make sure that the AWS RDS PostgreSQL database is running the latest version of the PostgreSQL software.
  • Make sure that the AWS RDS PostgreSQL database is not overloaded.
  • If you are still seeing the error after checking all of the above, try restarting the AWS RDS PostgreSQL database and the Datastream stream. However, keep in mind that this can be disruptive, so treat it as a last resort, especially in production environments.

Thanks ms446 for the quick reply. Here are my findings so far:

1. Output of logical replication slot status - Active

I am using bq_slot

[screenshot: kabilanravi_0-1694088395780.png]

2. Already granted replication to the user

3. Did both grants already.

4. The Logs tab is not available in Datastream.

[screenshot: kabilanravi_1-1694088574969.png]

Please help me debug this issue further.

Thank you for sharing your findings. Based on what you've provided:

  1. Logical Replication Slot Status: The slot is active, which is a good sign. This means that the replication slot is ready to stream changes.

  2. User Permissions: You've granted the necessary replication permissions to the user, and you've also granted the required permissions on the tables and schema. This is also in order.

  3. Logs in Datastream: You can navigate to the Logging section in the Google Cloud Console and filter by the Datastream service to see if there are any relevant logs there.
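
For reference, a Logs Explorer filter along these lines should surface only the error-level Datastream events. The resource type matches the entry format Datastream writes; add a resource.labels.stream_id clause to narrow it to one stream.

```
resource.type="datastream.googleapis.com/Stream"
severity>=ERROR
```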

Given the information you've provided, here are some additional steps to consider:

  1. WAL Retention: Ensure that the Write-Ahead Logging (WAL) retention settings on your AWS RDS PostgreSQL instance are set to retain logs for a sufficient duration. If WAL files are being cleaned up too quickly, it might affect replication.

  2. Datastream Configuration: Double-check the configuration of your Datastream connection. Ensure that the connection details, such as the hostname, port, user credentials, and database name, are correct. Also, ensure that the tables you want to replicate are correctly specified in the Datastream configuration.

  3. Network Connectivity: Ensure that there are no network issues between AWS RDS PostgreSQL and Google Cloud Datastream. Check security groups, VPC settings, and any firewall rules that might be blocking the connection.

  4. Datastream Health: Navigate to the Datastream dashboard in the Google Cloud Console. Check the health and status of your streams. If there are any issues, they might be highlighted there.
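
On the WAL side (point 1), these catalog queries give a quick view of the relevant server settings and of how far behind the slot's consumer is. A sketch: pg_current_wal_lsn and pg_wal_lsn_diff are available from PostgreSQL 10 onward.

```sql
-- Replication-related server settings
SELECT name, setting
FROM pg_settings
WHERE name IN ('wal_level', 'max_replication_slots',
               'max_wal_senders', 'wal_sender_timeout');

-- Lag of each replication slot's consumer, in bytes
SELECT slot_name, active,
       pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes
FROM pg_replication_slots;
```

A lag that keeps growing while the slot is active suggests changes are being produced but not consumed.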

Lastly, if you can provide any additional error messages, descriptions, or details from your investigations, it might help in narrowing down the issue further.

Let me share the steps I took. I followed the instructions shown on screen while creating the Datastream stream.

1. While creating the source, I chose PostgreSQL to BigQuery and followed the settings mentioned here.

[screenshot: Screenshot 2023-09-08 at 11.22.03 AM.png]

2. I just set rds.logical_replication to 1 and restarted the DB. The rest are default configurations.

[screenshot: kabilanravi_0-1694152483440.png]

3. Below are the WAL parameter values

[screenshot: kabilanravi_1-1694152517292.png]

4. Publication and slot were created

5. In the Datastream user setup, I used an existing user.

[screenshot: kabilanravi_2-1694152589405.png]

So I did only step 2.

6. I used SSH tunneling to connect to the RDS instance, which is inside a VPC, and whitelisted the IPs in the security group.

7. I selected only 2 tables from the schema when setting up the stream.

After making all the changes, I ran the test and it passed. For the first sync, it pulled all the necessary data from Postgres to BigQuery, so backfill works.

The problem is that when I change values in a Postgres table, the change is not picked up by the stream.

Below is the full log information.

[screenshot: kabilanravi_0-1694153264275.png]

{
  "insertId": "ca2ba7c6-d49e-4d6e-b511-72f7b2316c4a",
  "jsonPayload": {
    "object_name": "200_dimensions",
    "message": "CDC failed parsing with error type: UNKNOWN_ERROR, will retry...",
    "event_code": "PROCESSOR_PARSE_FAILED",
    "context": "CDC",
    "read_method": ""
  },
  "resource": {
    "type": "datastream.googleapis.com/Stream",
    "labels": {
      "location": "us-east4",
      "stream_id": "aws-prod-db-stream",
      "resource_container": "projects/1061065129435"
    }
  },
  "timestamp": "2023-09-08T06:00:47.288560Z",
  "severity": "ERROR",
  "logName": "projects/composed-strata-301915/logs/datastream.googleapis.com%2Fstream_activity",
  "receiveTimestamp": "2023-09-08T06:00:47.408295609Z"
}

Pasting the error JSON as well.

The logs don't show the actual error message; they just say unknown error, retry...

It's hard to pin down the problem from this log alone. Is there any other place where I can see a detailed error message?
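
In the meantime, if you export the stream_activity entries from Logs Explorer as JSON, a small script can flatten each entry into the fields that matter. This is only a sketch using the Python standard library; the sample entry mirrors the one pasted earlier in the thread.

```python
import json

def summarize_entry(entry: dict) -> dict:
    """Flatten one Datastream stream_activity log entry into key fields."""
    payload = entry.get("jsonPayload", {})
    labels = entry.get("resource", {}).get("labels", {})
    return {
        "time": entry.get("timestamp"),
        "severity": entry.get("severity"),
        "stream": labels.get("stream_id"),
        "object": payload.get("object_name"),
        "event_code": payload.get("event_code"),
        "message": payload.get("message"),
    }

# Sample entry modeled on the JSON pasted in this thread.
sample = json.loads("""
{
  "jsonPayload": {
    "object_name": "200_dimensions",
    "message": "CDC failed parsing with error type: UNKNOWN_ERROR, will retry...",
    "event_code": "PROCESSOR_PARSE_FAILED",
    "context": "CDC"
  },
  "resource": {
    "type": "datastream.googleapis.com/Stream",
    "labels": {"stream_id": "aws-prod-db-stream"}
  },
  "timestamp": "2023-09-08T06:00:47.288560Z",
  "severity": "ERROR"
}
""")

print(summarize_entry(sample))
```

Running this over a full export makes it easier to see whether all failures share the same object_name or event_code.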

If you are experiencing problems with Datastream, consider the following troubleshooting steps:

  1. Schema Verification:

    • Check the schema of the table in BigQuery to ensure it matches the schema in the source database.
    • Ensure that data types are compatible between the source and BigQuery. Some data types in the source might not have a direct equivalent in BigQuery, potentially causing issues.
    • Datastream is designed to handle schema changes gracefully, but verifying the schema is always a good practice.
  2. Backfill:

    • Try running a backfill for the table. This will reprocess all the data and might resolve certain issues.
    • Note: Backfilling can be time-consuming and could incur additional costs. It might not address problems related to real-time change data capture (CDC).
  3. Stream Operations:

    • Pause and resume the stream, as this can sometimes resolve transient issues.
    • As a last resort, consider recreating the stream. This action will delete all the data that has already been processed.
  4. Debug Logging:

    • Enable debug logging for Datastream to generate more detailed logs that can help identify the issue.
    • To enable debug logging, navigate to Datastream > Streams in the Cloud Console. Click on the problematic stream, then the Edit button. In the Settings tab, scroll to the Logging section and select the Debug level. Save your changes.
    • View and filter these logs in the Google Cloud Console's Logs Explorer for insights.

Additional Tips:

  • Ensure the source database runs the latest version of its software.
  • Monitor the source database's performance to ensure it's not overloaded.
  • If using SSH tunneling to connect to the source database, verify the tunnel's functionality and ensure necessary ports are open without firewall rules blocking the connection.
  • Set up monitoring and alerts for Datastream in Google Cloud Monitoring for proactive issue identification.
  • Regularly consult the official Google Cloud Datastream documentation for updates on best practices, troubleshooting tips, and new features.
  • Ensure stable network connectivity between the source database and Google Cloud to avoid data sync disruptions.

Thanks for the great support. I will check again. 

One last question: is JSONB supported for the CDC stream?

Yes, JSONB is supported in Datastream. Datastream can replicate JSONB data from a PostgreSQL source database to a BigQuery destination table. However, there are a few limitations and considerations to be aware of:

  1. Datastream can only replicate JSONB data from PostgreSQL source databases. It cannot replicate JSONB data from other source databases, such as MySQL or Oracle.
  2. Datastream can only replicate JSONB data to BigQuery destination tables. It cannot replicate JSONB data to other destinations.
  3. Datastream cannot replicate nested JSONB data.
  4. Datastream will create a BigQuery schema for the destination table that is compatible with the JSONB data.
  5. Datastream will convert the JSONB data to a BigQuery STRUCT type.
  6. Datastream will not replicate any JSONB data that is not valid JSON.
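
To check which of your streamed tables actually carry JSONB columns, an information_schema query works. A sketch, where [MY_SCHEMA] is the placeholder used earlier in the thread:

```sql
-- List JSONB columns in the replicated schema
SELECT table_name, column_name
FROM information_schema.columns
WHERE data_type = 'jsonb'
  AND table_schema = '[MY_SCHEMA]'
ORDER BY table_name, column_name;
```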

If you are using Datastream to replicate JSONB data, you should also be aware of the following:

  • It's important to test the replication process thoroughly before using Datastream to replicate JSONB data in a production environment. This helps ensure that the data is replicated correctly and that the BigQuery destination table is properly structured to receive it.

Sounds good. Let me try with other tables as well.