Using cloud SQL (postgres) server as SSH host to connect from Datastream

Hi! I am trying to connect to my Cloud SQL PostgreSQL database from Datastream (to get data to BigQuery). I am using public IP, but since it's set up to require trusted client certificates, my understanding is that I need to use Forward-SSH tunnel as connectivity method from Datastream (please correct me if I'm wrong on this). 

According to this guide, it should be possible to use the database server to terminate the SSH tunnel. It does not specify in detail how to do it, only to "allow network traffic to reach the tunnel server or database host via SSH, which is generally on TCP port 22". 

I am not able to connect, and getting the error 
"We can't connect to the data source. Verify that the Forward SSH tunnel configuration is configured correctly in the connection profile and that the port is open on the SSH tunnel server."

I created a connectivity test on the database confirming that port 22 is not reachable. 

How can I get this to work? Is there any setting up that needs to be done on the database side? How do I allow traffic on port 22? Or should I use a different port? 

Appreciate any help on this! 

0 3 1,031
3 REPLIES 3

To connect to your Cloud SQL PostgreSQL database from Datastream (to get data to BigQuery) using a Forward-SSH tunnel as the connectivity method, you will need to:

  1. Create an SSH Tunnel Server:

    • This server will be used to terminate the SSH tunnel. If you're using Cloud SQL, you typically need an intermediary server, such as a Compute Engine VM in the same VPC as your Cloud SQL instance, to act as the SSH tunnel server. Alternatively, you can use a self-hosted server.
  2. Configure the SSH Tunnel Server to Allow Traffic on Port 22:

    • If you are using a self-hosted SSH tunnel server:
      • Ensure the SSH server is running.
      • If the default SSH port has been changed, edit the SSH configuration file (usually /etc/ssh/sshd_config) and ensure it has the line Port 22. Restart the SSH server afterward.
    • If you're using an intermediary Compute Engine VM in Google Cloud, ensure that the VM's firewall and the VPC firewall allow incoming traffic on port 22.
  3. Create a Datastream Connection Profile for the PostgreSQL Database:

    • Go to the Datastream console.
    • Click on Connections > Create profile.
    • Select PostgreSQL as the source database type.
    • Enter a name for the connection profile.
    • Select Forward-SSH tunnel as the connectivity method.
    • Enter the IP address and port of the SSH tunnel server.
    • Provide the necessary SSH authentication details, such as the SSH username and private key.
    • Click Create.
  4. Create a Datastream Stream and Start Replication:

    • Go to the Datastream console.
    • Click on Streams > Create stream.
    • Select the connection profile you created in the previous step.
    • Choose BigQuery as the destination data store.
    • Select the tables you want to replicate.
    • Click Create. Datastream will start replicating data from the PostgreSQL database to BigQuery.
  5. Troubleshooting:

    • If you encounter the error "We can't connect to the data source. Verify that the Forward SSH tunnel configuration is configured correctly in the connection profile and that the port is open on the SSH tunnel server," consider the following:
      • Ensure the SSH tunnel server is running and port 22 is open.
      • Verify the correct IP address and port for the SSH tunnel server in the Datastream connection profile.
      • Confirm that port 22 is open in both your cloud provider's firewall (e.g., Google Cloud Platform's VPC firewall) and on the server's OS if it has a firewall enabled (e.g., ufw or firewalld on Linux).
      • Check server logs, such as /var/log/auth.log on Linux, for any SSH connection attempts and potential issues.

By following these steps, you should be able to establish a successful SSH tunnel from Datastream to your Cloud SQL PostgreSQL database.

I understand that as a no to whether it is possible to use the database server (cloud SQL) as SSH termination. 

Turns out it is not possible for Datastream to connect to the database via public IP at all, even through SSH, as long as it requires trusted client certificates. 

For the record, I ended up going with federated queries from Big Query instead of Datastream, since my only remaining option was to use a reverse proxy through private IP / VPC, which was too much overhead for being worth it.

I understand the challenges you've faced in trying to establish a secure and reliable connection to your Cloud SQL PostgreSQL database using Datastream. It's indeed true that Cloud SQL instances do not support direct SSH access for terminating SSH tunnels, and the requirement for trusted client certificates adds a layer of security that complicates the use of public IP connections in this context.

Federated queries in BigQuery offer a good alternative for accessing external data sources like Cloud SQL. They allow you to run queries directly on external databases without moving data into BigQuery, which can be a more straightforward solution when dealing with secure connections that require client certificates.

Using federated queries can simplify the architecture by removing the need for an SSH tunnel and a reverse proxy setup, which would indeed add complexity and overhead. It's a solution that leverages BigQuery's built-in capabilities to query external data sources, thus streamlining the process.