"unable to connect to source database server" in database migration job over VPN

Hi. My database migration project is to import from an AWS RDS PostgreSQL cluster to AlloyDB. I've connected the AWS and GCP VPCs with a VPN well enough that I can reach both the AWS and GCP instances using psql (or telnet hostname 5432, etc.) from Compute Engine VM instances in any subnet in the two VPCs.

Some more detail on the VPN: it is an HA VPN with dynamic routing, as documented at https://cloud.google.com/database-migration/docs/postgresql-to-alloydb/configure-connectivity-vpns#d... . The firewall / security group rules were also adjusted to allow the PostgreSQL port, and they do provide TCP accessibility, at least to VMs started in the normal user subnets in either VPC.
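(For reference, the per-subnet reachability check was essentially the following, shown here as a minimal Python sketch; the hostname is a placeholder for my actual RDS endpoint.)

    # Minimal TCP reachability probe, equivalent to "telnet <host> 5432".
    # Run from a VM in each subnet of either VPC.
    import socket

    HOST = "my-rds-instance.xxxxxxxx.us-east-1.rds.amazonaws.com"  # placeholder endpoint
    PORT = 5432

    try:
        with socket.create_connection((HOST, PORT), timeout=5):
            print(f"TCP connection to {HOST}:{PORT} succeeded")
    except OSError as exc:
        print(f"TCP connection to {HOST}:{PORT} failed: {exc}")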
 
Unfortunately, when I create the Database Migration job, selecting the "VPC peering" connectivity type, which seems to be the relevant one for the kind of network connectivity I've enabled, I encounter the following error:
 
    generic::unknown: unable to connect to source database server: failed to connect to the source database "postgres": connectWithTimeoutAndRetry timeout with error: dial tcp (CORRECT_IP_ADDRESS_OF_MY_AWS_RDS_INSTANCE):5432: connect: connection timed out
 
Does anyone know a good way to diagnose this further, or what the solution might be?
 
Akira
ACCEPTED SOLUTION

I've resolved the issue - the root cause was an AWS security group rule. It needs to be expanded to also allow the IP range(s) of the Google "Private Services Access" allocation in your GCP VPC. The route tables in AWS and GCP were being updated automatically and correctly by my VPN configuration, but the security group is a separate matter.

For interested readers I'll describe how I diagnosed this.

I looked in the AWS RDS Postgres instance's log and found that there were no log messages at all when I attempted to start the DMS job. I could make a connection with a deliberately bad password (successful connections are silent at default log verbosity) at, say, 5:42, then attempt the DMS job start, wait until it fails at, say, 5:45, then make another bad connection attempt from a normal VM at, say, 5:48, and the only things logged were the 5:42 and 5:48 failures. So at that point I was fairly certain that the TCP traffic from GCP's DMS was being blocked.
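(Roughly, the deliberate bad-password probe was something like the following psycopg2 sketch; the endpoint, user, and password are placeholders, and the only goal is to land a timestamped authentication failure in the RDS log.)

    # Deliberately fail authentication so the attempt shows up in the RDS PostgreSQL log,
    # giving timestamped "brackets" around the DMS job attempt.
    import datetime
    import psycopg2

    HOST = "my-rds-instance.xxxxxxxx.us-east-1.rds.amazonaws.com"  # placeholder endpoint

    print("probe at", datetime.datetime.now().isoformat())
    try:
        psycopg2.connect(host=HOST, port=5432, dbname="postgres",
                         user="postgres", password="deliberately-wrong")
    except psycopg2.OperationalError as exc:
        # Expected: an authentication failure, which RDS logs even at default verbosity.
        print("expected failure:", exc)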

Then I realized that an AWS Reachability Analyzer analysis was the other half of the network route investigation that needed to be done. The GCP Connectivity Test only checks the route as far as the VPN tunnel on the GCP end. The AWS Reachability Analyzer analysis is likewise partial: it only checks what happens within the AWS side of the VPN.

What it found, but only once I had included destination port = 5432 and source IP = the private IP address of the AlloyDB instance as optional packet headers, was that the security group was blocking that traffic. The group already permitted port 5432 traffic from the normal GCP subnet ranges, but the GCP Private Services Access CIDR, the one for "servicenetworking-googleapis-com" that the AlloyDB instances live in, is a different range.

Once I added an extra ingress rule to the AWS security group, the connections started getting through.
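(For anyone who manages security groups from code, the extra rule amounted to roughly the boto3 sketch below; the region, security group ID, and Private Services Access CIDR are placeholders for the real values.)

    # Allow PostgreSQL (5432) ingress from the GCP Private Services Access range
    # that the AlloyDB instances (and the DMS connection) originate from.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")  # placeholder region
    ec2.authorize_security_group_ingress(
        GroupId="sg-0123456789abcdef0",  # placeholder: the RDS instance's security group
        IpPermissions=[{
            "IpProtocol": "tcp",
            "FromPort": 5432,
            "ToPort": 5432,
            "IpRanges": [{
                "CidrIp": "10.123.0.0/16",  # placeholder: your Private Services Access CIDR
                "Description": "GCP Private Services Access range (AlloyDB / DMS)",
            }],
        }],
    )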

A note about the IP address of the Database Migration Service

The DMS connection, as seen in the AWS RDS Postgres log, came from an IP address within the CIDR of the "Private Services Access" range that the AlloyDB instances are in. A new IP address, but close to them: xx.xx.xx.9 instead of xx.xx.xx.2 in my case.

A note about SSL mode being on or off in the Database Migration connection profiles

With some further testing, just out of curiosity, I found that the DMS connections get through TCP-wise whether or not the DMS connection profile uses encryption. With AWS RDS Postgres you won't be fully successful with "None" as the encryption option (the server will reject the connection), but you will at least see the rejection message in the AWS RDS Postgres log.

Using SSL in a DMS connection profile requires uploading or pasting the CA certificate. I found that the PEM file for my region, available at https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/UsingWithRDS.SSL.html#UsingWithRDS.SSL.Region... , works for this.
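(As a side check, the same CA bundle can be verified against the RDS endpoint outside of DMS with something like the psycopg2 sketch below; the endpoint, credentials, and bundle path are placeholders.)

    # Verify that the downloaded regional RDS CA bundle validates the server certificate,
    # i.e. the same certificate chain an SSL-enabled DMS connection profile would rely on.
    import psycopg2

    conn = psycopg2.connect(
        host="my-rds-instance.xxxxxxxx.us-east-1.rds.amazonaws.com",  # placeholder endpoint
        port=5432,
        dbname="postgres",
        user="postgres",                     # placeholder user
        password="real-password-here",       # placeholder password
        sslmode="verify-full",               # require TLS and verify the server cert + hostname
        sslrootcert="us-east-1-bundle.pem",  # placeholder path to the regional CA bundle
    )
    print(conn.get_dsn_parameters())
    conn.close()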

7 REPLIES

I understand you're encountering a connection timeout error during your AWS RDS PostgreSQL to AlloyDB migration using the Database Migration Service (DMS) with a VPC peering setup. This can indeed be challenging; here are some steps for troubleshooting the connection timeout:

1. Network Accessibility:

  • Firewall & Security Groups:
    • Verify that both the AWS RDS security group and the GCP firewall rules allow inbound connections on port 5432 (PostgreSQL) from the relevant IP ranges, including the GCP VPC and any DMS-specific IPs (a sketch for listing the current security group rules follows this list).
  • Network Connectivity:
    • Confirm the VPN tunnel is active and routes are propagated correctly between VPCs.
    • Use traceroute to test connectivity from a GCP VM to the RDS instance. Remember, ping might be disabled on AWS RDS.
  • DNS Resolution:
    • Ensure DNS resolution is functioning properly in both VPCs. Verify the RDS hostname resolves to the correct IP address within the GCP VPC.
    • Consider using the IP address directly in the DMS setup to avoid any DNS issues.

2. DMS Specific Checks:

  • DMS Connectivity Test:
    • Utilize the DMS connectivity test feature to verify if DMS can connect to your source database. This will help discover if the issue is network-related or database-specific.
  • Service Account Permissions:
    • Confirm the service account used by DMS has the necessary permissions for accessing AlloyDB instances. Verify the relevant permissions, similar to the cloudsql.instances.connect and cloudsql.instances.get permissions used for Cloud SQL.
  • DMS Version:
    • Ensure you're using the latest version of DMS. Updates or patches might have addressed known issues with VPC peering connections.

3. Additional Considerations:

  • Routing:
    • While static routing can be an alternative, stick to dynamic routing unless confirmed as the cause.
  • GCP Support:
    • If the above steps don't resolve the error, contact Google Cloud Platform support for troubleshooting assistance.
  • Other Factors:
    • Review any VPC peering limitations or known issues mentioned in GCP documentation.
    • Double-check the database user credentials provided to DMS for accuracy and necessary privileges.
    • Consider network latency or bandwidth limitations that might impair the connection.
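
As one concrete way to do the security-group half of the first check above, you can list the current rules from code, for example with the boto3 sketch below (the region and security group ID are placeholders):

    # List the ingress rules of the RDS instance's security group to confirm which
    # source CIDRs are currently allowed to reach port 5432.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")  # placeholder region
    resp = ec2.describe_security_groups(GroupIds=["sg-0123456789abcdef0"])  # placeholder SG ID

    for perm in resp["SecurityGroups"][0]["IpPermissions"]:
        if perm.get("FromPort") == 5432:
            for ip_range in perm.get("IpRanges", []):
                print("5432 allowed from", ip_range["CidrIp"], ip_range.get("Description", ""))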

Hello. Thanks for the help.

To cover the parts in "1. Network Accessibility": I could run telnet, traceroute, etc. without issue, though that was already implied by my being able to use the psql client. DNS is also working well; even resolving the hostnames assigned within AWS just works first time.

But, as a general point, this is all from a VM in the normal subnet of the VPC, not the 'private service' IP range the AlloyDB instances are in.

Q. Where does the Database Migration Service run from? Is it part of the DB daemon, like it would be in normal Postgres? Or does it run in its own VM or K8s container? If so, where is that, and can I run these network tests from there?

For "2 DMS Specific Checks":

Regarding the DMS version: I've only just created the DMS job, so presumably it is on a recommended version.

> Confirm the service account used by DMS has the necessary permissions for accessing AlloyDB instances.

No extra service account has been configured. Is this applicable?

A DMS connection profile with correct credentials, namely username & password, was created to connect to the AWS RDS Postgres source DB of course.

Update: I remember a point here: the encryption mode "None" has been used in this connection profile. I don't have to choose the network encryption mode when using the psql client, so it didn't seem necessary, but it's a coin toss whether SSL is on or off by default with each separate client these days.

It's a pity we can't just test connectivity from the DMS connection profile page. "Test" -> "Select Destination AlloyDB or Cloud SQL instance for network context" -> "Go".

> Utilize the DMS connectivity test feature to verify if DMS can connect to your source database. This will help discover if the issue is network-related or database-specific.

This was a new thing for me to try 👍. And it gave me a way to check using the IP range of the AlloyDB instance rather than the normal subnet 👍. Unfortunately nothing interesting was discovered. The test result is "Reachable". In the detail view it shows steps from "Non-google Network" (actually the private services range for the Google AlloyDB instance), to "Dynamic route", to the VPN tunnel.

For the "3. Additional Considerations" section:

I have followed, and will continue to follow, the advice to stick with dynamic routing.

> Review any VPC peering limitations or known issues mentioned in GCP documentation.

I haven't found any in the reading so far.

> Double-check the database user credentials provided to DMS for accuracy and necessary privileges.


The credentials are the same ones used in the successful command-line tests from the VMs.

> Consider network latency or bandwidth limitations that might impair the connection.

There's been none observed so far.

It's good to hear that you've successfully tested network accessibility and DNS resolution from a VM in the normal subnet of your VPC, and that your Database Migration Service (DMS) job is newly created, so it should be on a current version. Addressing your specific questions and concerns:

Where Does the Database Migration Service Run From?

  1. DMS Execution Context:

    • DMS typically runs as a managed service in the cloud, not directly within your database instance or on a specific VM or Kubernetes container that you can access. It's managed by GCP and runs in its own secure, managed environment.
  2. Network Testing from DMS Context:

    • Since DMS is a managed service, you generally can't perform network tests (like telnet or traceroute) directly from its execution context. However, the "Reachable" status from the DMS connectivity test indicates that DMS can communicate with your AWS RDS instance, which is a positive sign.

DMS Specific Checks:

  1. Service Account Configuration:

    • If you haven't configured a specific service account for DMS, it's using default permissions. In most cases, this should be sufficient, but it's always good to verify that the default service account has the necessary roles and permissions for the migration task.
  2. Encryption Mode:

    • The choice of "None" for encryption mode in the DMS connection profile might be significant. AWS RDS instances often require SSL for connections. You might want to try configuring the DMS connection profile to use SSL encryption to match the RDS instance's configuration.
  3. Testing Connectivity from DMS:

    • While direct connectivity testing from the DMS interface would be ideal, currently, the connectivity test feature is the closest available tool. The fact that it shows "Reachable" is a good indicator, but it doesn't necessarily test the full range of database interactions that a migration would entail.

Additional Considerations:

  1. Stick with Dynamic Routing:

    • Continuing with dynamic routing is a wise choice, especially since it seems to be functioning correctly based on your tests.
  2. VPC Peering Limitations:

    • If you haven't found any relevant limitations or known issues in the GCP documentation, it's likely that the issue isn't with VPC peering itself.
  3. Database User Credentials:

    • Since the credentials work when tested from VMs, they're likely correct. However, ensure that the user has the necessary privileges for the migration process, which might require more permissions than standard operations.
  4. Network Latency or Bandwidth:

    • No observed latency or bandwidth issues is a good sign. However, keep an eye on this during the actual migration process, as the data transfer demands will be higher.

Next Steps:

Given your thorough testing and the results you've observed, here are some additional steps you might consider:

  1. SSL Encryption:

    • Revisit the encryption settings in your DMS connection profile. Try enabling SSL if your AWS RDS instance is configured to use it.
  2. Detailed Logs and Monitoring:

    • If possible, enable detailed logging for the migration process in DMS and monitor these logs for any specific errors or warnings that occur during the migration attempt.
  3. GCP Support:

    • Given the complexity of your setup and the fact that basic connectivity tests are passing, reaching out to GCP support might provide more insights. They can access more detailed logs and system information that could pinpoint the issue.
  4. Incremental Troubleshooting:

    • If feasible, try a smaller, more controlled migration (e.g., with a subset of your data or a test database) to see if the issue persists in a less complex scenario. This might help isolate the problem.
  5. Review RDS Instance Settings:

    • Double-check the settings of your AWS RDS instance, particularly around network and security configurations, to ensure there's nothing blocking or interfering with the migration process (a sketch for pulling these settings follows this list).
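
If it helps, those RDS settings can also be pulled programmatically, for example with the boto3 sketch below (the region and DB instance identifier are placeholders):

    # Fetch the RDS instance description to review its network-related settings:
    # attached security groups, subnet group / VPC, and whether it is publicly accessible.
    import boto3

    rds = boto3.client("rds", region_name="us-east-1")  # placeholder region
    db = rds.describe_db_instances(DBInstanceIdentifier="my-rds-instance")["DBInstances"][0]

    print("Endpoint:            ", db["Endpoint"]["Address"], db["Endpoint"]["Port"])
    print("Publicly accessible: ", db["PubliclyAccessible"])
    print("Security groups:     ", [sg["VpcSecurityGroupId"] for sg in db["VpcSecurityGroups"]])
    print("Subnet group / VPC:  ", db["DBSubnetGroup"]["DBSubnetGroupName"], db["DBSubnetGroup"]["VpcId"])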

I'd like to double-check something just as one step by itself.

The last two replies mentioned specifically a "DMS Connectivity Test". I used the Connectivity Test which exists in the Network Intelligence service. Is there a specific Database Migration Service connectivity test that is different?

Sorry for the confusion. To clarify, DMS in Google Cloud does not have a separate, standalone connectivity test feature specific to DMS itself. The Connectivity Test you used from the Network Intelligence service in Google Cloud is the appropriate tool for testing network connectivity and configurations, including those relevant to DMS.

The Network Intelligence Center's Connectivity Test is designed to diagnose network issues across Google Cloud VPCs and on-premises networks. It helps you understand network configurations and connectivity for services running in the cloud, which is relevant for your use case with DMS.

In the context of DMS, when I referred to a "DMS Connectivity Test," it was meant to suggest using available tools within GCP (like the Network Intelligence Center's Connectivity Test) to ensure that the network path from your DMS setup to the source (AWS RDS) and destination (AlloyDB) databases is correctly configured and not encountering any blockages.

Since you've already used the Connectivity Test from the Network Intelligence service and it shows "Reachable," it indicates that the network path is correctly set up for the DMS to communicate with your AWS RDS instance. The issue with the DMS job might lie elsewhere, possibly in the configuration of the DMS job itself, database settings, or permissions.

If you continue to face issues with the DMS job, I recommend:

  1. Reviewing the DMS job configuration for any potential misconfigurations.
  2. Ensuring that the database user credentials and permissions are correctly set up for the migration task.
  3. Considering the use of SSL encryption in the DMS connection profile if your AWS RDS instance requires it.
  4. Consulting GCP support for more detailed insights, especially since the basic network connectivity seems to be functioning correctly.


Thank you for sharing your solution and insights!