Inexpensive way to run a large number of remote ssh commands?

Hi there!  We have a step in our workflow that requires sending a large number (say 5000+) of simple ssh commands to an external server, using ssh keys to authenticate -- ideally this would be triggered from a pub/sub.  I was wondering what the cheapest way to accomplish this in GCP is.  Can pub/sub directly send an ssh command to an external server?  Or would we need some sort of Cloud Run service that does nothing but run ssh over and over again?

Are there any examples of this out in the world?  Thanks!


I think you are on the right path with Cloud Run.  Pub/Sub is a communications medium for moving data from point A to point B.  To execute logic (such as sending an SSH command), you will want the Pub/Sub message to trigger logic ... and that logic will perform the work.  You can tune your Cloud Run usage ... for example, you might be able to declare that Cloud Run can process some large number of concurrent SSH requests ... so it wouldn't be one Cloud Run instance per SSH request.

Thanks!  This is helpful.  The only thing I'm somewhat struggling with is the "right" way to create and use a GCP-side ssh key that the container could use to perform "passwordless" ssh commands.  Do you happen to have a simple example (e.g. a Dockerfile) that does an ssh connection without embedding the ssh keys within the container, which I assume is a big security no-no?

What is your high level design?  Are you thinking ONE pub/sub message would trigger a LARGE number of SSH requests?  Or is it one pub/sub message triggers one SSH request?   Where are the private keys for each of the back-end SSH servers currently being stored?

Maybe take a few minutes and describe the overall picture of what you are hoping to achieve and what constraints and existing architecture/components are in place.

So right now we're going one-to-one -- one pub/sub triggers one ssh, but we're going to be sending thousands of ssh commands (don't worry, we've talked to our sysadmins about this so they don't think we're doing an attack).  What we are trying to do is batch-processing type work where part of the processing needs to be done on our in-house HPC.  My working theory is this can easily be handled by a small-resource Cloud Run instance (ssh commands are so low-resource that I'm guessing we can send a few hundred or a few thousand per minute from a single-CPU / low-RAM system).  Each of those ssh commands will (ultimately) trigger a qsub (SLURM) within our in-house HPC.

We think that as long as we can figure out a system that can 1) create an ssh keypair between GCP and our in-house HPC, 2) have the Cloud Run instance use that keypair to ssh, and 3) have the pub/sub simply trigger an ssh user@hpc [somecommand] [some variable passed from pub/sub], everything else should be totally doable.  Our "hello world" approach would be demonstrating that pub/sub + Cloud Run can create a file on our remote system using a pub/sub variable.
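For what it's worth, here is a rough sketch of what that "hello world" Cloud Run service might look like in Python -- the server name, key path, and remote command are placeholders, not a tested implementation:

import base64
import os
import subprocess

from flask import Flask, request

app = Flask(__name__)

SSH_SERVER = os.environ["SSH_SERVER"]                       # e.g. "user@hpc.example.org" (placeholder)
SSH_KEY = os.environ.get("SSH_KEY", "/etc/secrets/id_rsa")  # private key mounted via Secret Manager

@app.route("/", methods=["POST"])
def handle_pubsub():
    # Pub/Sub push messages arrive as JSON with a base64-encoded "data" field.
    envelope = request.get_json()
    payload = base64.b64decode(envelope["message"]["data"]).decode("utf-8")

    # "Hello world": create a file on the remote system named after the payload.
    result = subprocess.run(
        ["ssh", "-i", SSH_KEY, SSH_SERVER, "touch", payload],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return result.stderr, 500
    return "", 204

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))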

Hi kolban! I work under jgreenberg and am in charge of putting this workflow together. As jgreenberg mentioned, do you know if there is a way to let Cloud Run use an ssh keypair to connect to a remote machine that is outside the scope of GCP? The only thing I can think of would be using the Secret Manager to mount keys to the Cloud Run instance, but this approach has been proving complicated. I keep getting a "Host key verification failed" message any time I try to ssh into a remote machine from a Cloud Run instance. Is this the right approach?

Howdy Adrian.  I am imagining some "Compute" running in Google Cloud.  Whether this is Cloud Run, Cloud Functions, Compute Engines or GKE ... I don't think that matters too much for this discussion.  If I'm sensing correctly, the goal is to form an SSH connection from Google Compute to a target SSH server that exists on premises.  Let's then look at this from a "unit" perspective.  Let's ask "What is needed to allow ONE connection to happen?"  I feel we would need:

  • The identity of the target server (IP or DNS)
  • Network connectivity from the Google Compute to the target server
  • The private key for the target server that would allow us authorization

Do all three of these work so far?  If no, then we need to dig deeper.

If yes, then it seems that what we need is some form of secure storage that contains a data mapping of target server identity to private key.  Then, when the compute wishes to make the connection to the target server, it has the two "data items" it needs to achieve that.  Whether we use Google Secret Manager or a database that is locked down or a GCS bucket that is locked down or something else ... becomes a matter of taste.
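For example, if that mapping lived in Secret Manager (say, one secret per target server), fetching the private key at request time might look roughly like this in Python -- the project and secret names here are placeholders:

from google.cloud import secretmanager

def fetch_private_key(project_id: str, secret_id: str) -> str:
    # Hypothetical helper: each target server's private key is stored as its own
    # secret, and we read the latest version when we need to open a connection.
    client = secretmanager.SecretManagerServiceClient()
    name = f"projects/{project_id}/secrets/{secret_id}/versions/latest"
    response = client.access_secret_version(request={"name": name})
    return response.payload.data.decode("UTF-8")

# e.g. key = fetch_private_key("my-project", "hpc-ssh-private-key")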

You said that the solution was getting complicated and that you are hitting a specific error.  Is this something we want to drill down into?

Thanks for the response! The three points mentioned work so far. As for the last point regarding the error message, I think I want to take a step back and focus on the three points you mentioned. I guess my main question would be what's the best way to make the private key? Would I need to make it in GCP and then copy it to our target server? Or vice-versa? If yes to the former, how would I do this?

I believe it doesn't matter where the key is generated as long as the individual requesting the SSH connection is in possession of the private key and the SSH server knows about the public key.   It feels like we could use ssh-keygen to create a public/private key pair, and then you could distribute the same public key to ALL your back-end SSH servers and keep the single private key safe.  I had assumed that each of the SSH servers already had their own public/private key pairs and your puzzle was how to get access to each server's private key ... however, if that isn't the story and you are looking to access the servers and they don't have key pairs today, then you'd likely create one pair and distribute the public key to all the back ends.

Thanks again for the response! In our case, we have just one SSH server we want to access via Cloud Run. So if I understand this correctly, I can generate a key pair on, say, my personal local machine, then copy the private key over to a Cloud Run instance (via the Secret Manager) and copy the public key over to my SSH server. Then I can use Cloud Run to SSH into my server in question because Cloud Run would have the private key mounted and the SSH server would have the opposing public key. If this sounds right to you, then I am still receiving that "Host key verification failed" error message in my Cloud Run service when doing it this way. I guess there is probably something I am messing up on. I did the following:

1. I generated a key pair using ssh-keygen on my laptop.

2. I copied the private key generated from step 1 into my Secret Manager in GCP so I could securely mount the private key in my Cloud Run containers.

3. I copied the public key over to my SSH server using ssh-copy-id.

4. I built my Cloud Run service, mounting my private key to my containers via the Secret Manager, and referenced the key in my command. It looks something like this:

ssh -i /etc/secrets/id_rsa $SSH_SERVER

I get the "Host key verification failed" error once I run this command. I know for sure that the private key is mounted in the container as I checked the /etc/secrets folder for the private key file before running the command. I'm going to look into this error online to see if this is just a general SSH error or if it is GCP specific.

I've never done it myself so we will be learning together here.  What I'd do is break the puzzle into pieces.  The first part: let's check that, if we are in possession of the private part of the SSH key, we can SSH into the SSH server.  I hear you say that we think the public key is correctly present in the target SSH server.  Great!!!   Where is the SSH server located?  Is it on-premises and internet accessible or is it in a private Google Cloud compute engine?   What I suggest we do is spin up a Google Cloud compute engine (Linux, Debian would be my choice) or a Google Cloud Shell environment.  Let's copy the private key into that environment and then try and run:

ssh -i <file> <Server>

Let's validate that, outside of a Cloud Run environment, we can connect to our target.  If yes, then we move on.  If no, then we pause and see what we need to do to fix this problem.  Report back when you have tested 🙂

The SSH server is located a few miles from our office.

Okay, so I created a Compute Engine instance, copied my private key over to it and ran the command you mentioned and it worked! Was able to SSH from my Compute Engine instance to my server a few miles away from me fine and dandy 👍

Perfect (and "fine and dandy" is a GREAT phrase).

So ... now we have the notion that, in principle, we should be able to make the same SSH call from inside a Cloud Run environment.   If I were sitting in your seat, the next thing I'd try is to run, inside a Compute Engine, the Docker image that you plan to use for Cloud Run.  Here is my thinking ....

We can spin up a Compute Engine and say "Please run this docker image for me".  The compute engine will start AND will contain the docker runtime AND will launch our docker image creating a container instance.  We can now SSH into the Compute Engine and then attach a shell to the running docker container.  Now we should have a prompt INSIDE the container.  Now we can do some focused debugging.  We can try an:

ssh -i <file> <Server>

against the target SSH server and see if it works.  If it does, onwards ... if it doesn't again we pause and take stock.

Re-reading the thread, I see the statement that you ran:

ssh -i /etc/secrets/id_rsa $SSH_SERVER

Can you elaborate on this?  I normally think of a Cloud Run environment as a Docker container running code that is listening on a port that is triggered by an incoming request.  The command you listed above is a shell command.  Can you describe:

  • What is in your Docker image?
  • How are you invoking SSH?
  • Can we test the value of $SSH_SERVER?
  • How are you testing that /etc/secrets/id_rsa contains what you expect?

This also looks useful:

https://askubuntu.com/questions/45679/ssh-connection-problem-with-host-key-verification-failed-error

I think we want to look inside your Docker image and see what we have in ~/.ssh/known_hosts

Gotcha 👍

I'll try running my Docker image in a Compute Engine instance and try the command there.

As for the command I ran in my Cloud Run service, I should have elaborated more on this. So the Cloud Run service is running some Python code, but uses the subprocess module to run shell commands. So I pass my shell command string ("ssh -i /etc/secrets/id_rsa $SSH_SERVER") into my subprocess function. So to formally answer your questions:

  • What is in your Docker image?
    • A Python script that uses the subprocess module to run a shell command that SSHs into my remote server. It also has the private key mounted via the Secret Manager.
    • EDIT: I forgot to mention that this Cloud Run service is triggered via Pub/Sub, so I upload a file into a specific bucket which triggers this Cloud Run service. Nothing is done with this file except that the file name and the bucket name where the file is stored are passed into my Cloud Run service. I wanted to do a simple test that writes the name of the file to a text file on the remote server I'm trying to SSH into. I'm only using Pub/Sub for testing purposes here to make sure I can pass messages over to my remote server.
  • How are you invoking SSH?
    • Previously mentioned, but I just create a string of my exact SSH command in Python and pass that string into the subprocess.run() method to run shell commands (there's a rough sketch of this right after this list).
  • Can we test the value of $SSH_SERVER?
    • I think if I understand this question correctly, you want to be sure that the SSH server is correctly substituted into my string? I print the string before running the subprocess.run() method to make sure it all looks correct, and it does. $SSH_SERVER is an environment variable I pass into my Cloud Run service (again via the Secret Manager). I then store the environment variable as a variable in my Python code and concatenate that variable into my SSH string.
  • How are you testing that /etc/secrets/id_rsa contains what you expect?
    • In my code, before I use the subprocess.run() method, I have a line of code that simply prints all files listed in a given directory. This is just to ensure that the Secret Manager is actually mounting my private key into my containers.
    • os.listdir("/etc/secrets")
      • The code above is what I used and it returns back ['id_rsa'], implying that the private key is stored in this directory.
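For reference, a stripped-down sketch of how that part of the code is put together (the remote command here is just a placeholder):

import os
import subprocess

ssh_server = os.environ["SSH_SERVER"]    # injected into the Cloud Run service as an env var
cmd = f"ssh -i /etc/secrets/id_rsa {ssh_server} 'touch /tmp/hello.txt'"
print(cmd)                               # sanity-check the final command string

# shell=True because the command is built as a single string
result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
print(result.returncode, result.stderr)  # "Host key verification failed" shows up in stderr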

I'll also look into that askubuntu question you linked and see what's in ~/.ssh/known_hosts. Thanks again! I'll report back soon!

Another update. I pulled the Docker image that I created for Cloud Run into a Compute Engine instance and got the Host Key error there too. I'm gonna look into the known_hosts thing further now.

 

EDIT: It looks like the ~/.ssh folder does not even exist in my container. Changing directories to ~/.ssh results in a "No such file or directory" error message.

EDIT #2: Adding the known_hosts file to the ~/.ssh folder in my container allowed me to SSH into my remote server! I'm going to use the Secret Manager to mount the known_hosts file into my Cloud Run service and try again. This will be the last edit until my next message.

So I was able to fix the "Host key verification failed" error, but now I got something new.

Permissions 0444 for '/etc/secrets/id_rsa' are too open.
It is required that your private key files are NOT accessible by others.
This private key will be ignored.
Load key "/etc/secrets/id_rsa": bad permissions
Permission denied, please try again.

Looking this error up, some say it is as simple as doing a chmod command to change the permissions of my private key. Link to an example here:

https://stackoverflow.com/questions/29933918/ssh-key-permissions-0644-for-id-rsa-pub-are-too-open-on...

This did not work for me, however. I am still getting the same error.

 

EDIT: Do you think it may have something to do with loading the private key in via the Secret Manager?

Just an update. So the Secret Manager had nothing to do with it. Copying the private key into my container did not fix the problem. I think my last response was a bit unclear as to what my errors are. They are the following:

Failed to add the host to the list of known hosts (/root/.ssh/known_hosts)

Permissions 0444 for '/etc/secrets/id_rsa' are too open. It is required that your private key files are NOT accessible by others. This private key will be ignored. Load key "/etc/secrets/id_rsa": bad permissions. 

Seems like I'm getting two different errors. Looking it up here

https://stackoverflow.com/questions/17668283/failed-to-add-the-host-to-the-list-of-know-hosts

I needed to change the permissions of a couple of folders and my key file, as well as add the StrictHostKeyChecking=no flag to my ssh command. Doing this still results in the errors above when running it in Cloud Run. However, building this container and running it on my own local machine instead of in Cloud Run actually works fine: I'm able to SSH into my server. At this point, I'm just trying everything I can think of. Will hopefully report back.

 

Finally, after much head banging I got it to work (thanks to jgreenberg for this suggestion). There is no need to copy the known_hosts file over to the container. The

ssh -o StrictHostKeyChecking=no

option bypasses that check. I just needed to copy the private key to a separate directory outside of /etc/secrets and then change its permissions using chmod 600. This let me SSH into my remote server.
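Putting the pieces together, here is a rough sketch of the final working call in Python (the paths and remote command are illustrative; the UserKnownHostsFile option is just an optional extra to silence the known_hosts warning):

import os
import shutil
import subprocess

ssh_server = os.environ["SSH_SERVER"]

# The key mounted from Secret Manager showed up as 0444, which ssh rejects,
# so copy it somewhere writable and tighten the permissions first.
key_path = "/tmp/id_rsa"
shutil.copy("/etc/secrets/id_rsa", key_path)
os.chmod(key_path, 0o600)

subprocess.run(
    ["ssh", "-i", key_path,
     "-o", "StrictHostKeyChecking=no",             # skip host key verification
     "-o", "UserKnownHostsFile=/tmp/known_hosts",  # optional: a writable known_hosts path
     ssh_server, "touch", "/tmp/hello-from-cloud-run.txt"],
    check=True,
)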