
Setting up Monte Carlo on a Google cluster

I have a Monte Carlo code (physical sciences) running on Linux. On my 8-CPU desktop I can run 8 jobs in parallel. (Perhaps it is more accurate to say that the overall problem is split into 8 jobs and the results are combined when all jobs complete).

I have also set up a Google Cloud c2d-112 instance and can run 112 jobs in parallel.

I have also created a container that runs successfully.

However, I am struggling with how to set up a cluster to run (let's say) 500 jobs in parallel. The new Batch option seems interesting but I'm not convinced it works for my case.

Any general guidance would be appreciated; there seems to be a steep learning curve.

Thanks,  Carl R



I have also set up a Google Cloud c2d-112 instance and can run 112 jobs in parallel.

I have also created a container that runs successfully.

Which Google Cloud product did you use for each of those?

Dataproc and Apache Spark provide infrastructure and capacity that you can use to run Monte Carlo simulations written in Java, Python, or Scala; see the Dataproc documentation.
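If you wanted to go that route, the basic flow is only a few commands. A rough sketch (cluster name, region, and script name are placeholders, and it assumes the simulation is rewritten for Spark, which may not fit your case):

# Create a small Dataproc cluster, submit a PySpark job, then tear the cluster down.
gcloud dataproc clusters create mc-cluster --region=us-central1 --num-workers=4
gcloud dataproc jobs submit pyspark monte_carlo.py --cluster=mc-cluster --region=us-central1
gcloud dataproc clusters delete mc-cluster --region=us-central1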

So, for the c2d-112, I created a VM instance of that machine type, then uploaded and compiled my code. Then I run a script with 112 lines of the basic form:

code_exec <switches> input.file_0001.inp > /dev/null &
code_exec <switches> input.file_0002.inp > /dev/null &
...
code_exec <switches> input.file_0112.inp > /dev/null &

The 112 jobs run in parallel and each prints out a file on completion that is used for further analysis.

(The code is written in Fortran but I don't think that makes any difference).
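Incidentally, the same 112-line launcher can be written as a short loop with a final wait; this is just a sketch, with <switches> standing in for my actual options:

#!/bin/bash
# Launch all 112 runs in the background, one per input file,
# then wait for every run to finish before combining results.
for i in $(seq -f "%04g" 1 112); do
    code_exec <switches> "input.file_${i}.inp" > /dev/null &
done
wait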

So the container I've created and run on my local machine was just to demonstrate that I can use it in the same way.

I have not yet understood how I can set up a cluster using my container and input scripts.

I want (say) 5 c2d-112 machines, each with my container and each being presented with a script of 112 lines.

Do I use Kubernetes or Batch? Or is it not possible and I need some other way?

Thanks,

Carl

 


To reach 500 parallel jobs on Compute Engine, you will need a CPU quota increase.

This can be requested through a support case with a valid business justification.
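Before filing the case, you can check where you currently stand, for example (region chosen arbitrarily):

# Print the quotas (limit and current usage) for a region; look for the
# CPU-related entries such as CPUS and the C2D family quota.
gcloud compute regions describe us-central1 --format="yaml(quotas)"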


I want (say) 5 c2d-112 machines, each with my container and each being presented with a script of 112 lines.

Do I use Kubernetes or Batch? Or is it not possible and I need some other way?


Kubernetes tends to be a little more complex, but it is certainly possible: assuming you have enough quota for those 5 machines, the cluster can provision 5 nodes of that machine type and run the 112 jobs on each of them.

Again, it's a little more complex because you would need to assign resources to the deployment itself, among other things. But is it possible? Yes, it is. Does it make architectural sense? Yes. Is it a lot harder than provisioning 5 VMs and running the script on them? Yup, it is.

Also, one thing to keep in mind: regardless of whether you choose Compute Engine or Kubernetes, both draw from the same compute quota. So, say you have 1 VM with 112 cores and a 2-node cluster using that same machine type; in total the quota usage would be 336 CPUs.
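To make that concrete, here is a rough sketch of the Kubernetes route: a 5-node cluster of that machine type plus an indexed Job that fans out the ~500 runs. The names, zone, and image path are placeholders, and it assumes your container can run the executable from a shell command:

# 1. Create a GKE cluster with 5 c2d-standard-112 nodes (this needs the 560-CPU quota).
gcloud container clusters create mc-cluster \
    --zone=us-central1-a \
    --machine-type=c2d-standard-112 \
    --num-nodes=5

# 2. Run the simulation as an indexed Job: 500 tasks, up to 500 at a time,
#    each requesting one CPU so they spread across the 5 nodes.
cat <<'EOF' | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: monte-carlo
spec:
  completions: 500
  parallelism: 500
  completionMode: Indexed
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: mc
        image: us-docker.pkg.dev/MY_PROJECT/mc/code_exec:latest   # placeholder image
        command: ["/bin/sh", "-c",
                  "code_exec input.file_$(printf '%04d' $((JOB_COMPLETION_INDEX+1))).inp"]
        resources:
          requests:
            cpu: "1"
EOF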

Thanks for your quick response. Yes, I am aware of the quota issue. I had to apply for quota for the 112 CPUs.

For the business I'm working with, sometimes 8 cores on a desktop is fine. In other cases, the c2d-112 machines are attractive. However, there are some problems where one might like 1000 cores to make the total compute time manageable.

So I'm trying to understand the scaling issue. The documentation suggests it is fairly easy to set up and tear down a cluster.

Suppose I learn how to set up a cluster with a few nodes with a few cores each. Is it then easy to transfer that configuration information to many nodes and many cores? 

Is the new "Batch" tool relevant to my problem or do I focus on Kubernetes?

I know there is a vast amount of information about Google clusters, but I have struggled with how to proceed for my problem. Basically, I have a containerized application that needs to run on each node, and each node needs to be able to accept a script from a supervisor level. (It would seem the storage bucket concept is well suited to collecting the output files from the calculations.)
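For example, I picture each job (or the node-level script) pushing its output into a bucket when it finishes, along the lines of (bucket and file names invented):

# Copy one finished run's output file into a results bucket for later analysis.
gsutil cp output_0001.dat gs://my-mc-results/run_001/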

Any general guidelines much appreciated.

Thanks again,

Carl

Suppose I learn how to set up a cluster with a few nodes with a few cores each. Is it then easy to transfer that configuration information to many nodes and many cores?

Yes, it is.
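For example, once a small GKE cluster is working, growing it is a single command (cluster name and zone are placeholders):

# Resize the existing cluster from a few nodes to 10 nodes of the same machine type.
gcloud container clusters resize mc-cluster --zone=us-central1-a --num-nodes=10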

Is the new "Batch" tool relevant to my problem or do I focus on Kubernetes?

As for Batch, it could work, but right now Batch is fairly new and has a lot of limitations, some of which we haven't even documented yet, so it's up to you to experiment with it.
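If you do want to experiment, a Batch job is just a JSON config submitted with gcloud. A rough sketch, with placeholder project, image, and region (and field names that may shift while the product is new):

# job.json: 500 tasks, each running the container against its own input file.
cat > job.json <<'EOF'
{
  "taskGroups": [{
    "taskCount": 500,
    "parallelism": 500,
    "taskSpec": {
      "runnables": [{
        "container": {
          "imageUri": "us-docker.pkg.dev/MY_PROJECT/mc/code_exec:latest",
          "entrypoint": "/bin/sh",
          "commands": ["-c",
            "code_exec input.file_$(printf '%04d' $((BATCH_TASK_INDEX+1))).inp"]
        }
      }],
      "computeResource": { "cpuMilli": 1000 }
    }
  }],
  "allocationPolicy": {
    "instances": [{ "policy": { "machineType": "c2d-standard-112" } }]
  },
  "logsPolicy": { "destination": "CLOUD_LOGGING" }
}
EOF

gcloud batch jobs submit mc-run --location=us-central1 --config=job.json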

I personally would go with the Kubernetes option, and if that turns out to be too complex, I would fall back to plain VMs and run the jobs as a script.

Check out https://github.com/SchedMD/slurm-gcp.

There's also a Marketplace version that will have you up and running in no time. If your code is fault tolerant, you can use Spot instances to save money.
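With slurm-gcp, the 500 runs also map naturally onto a job array; a sketch of the submission script, with <switches> again standing in for the real options:

#!/bin/bash
#SBATCH --array=1-500
#SBATCH --cpus-per-task=1
# Each array task picks its own input file from its task index.
code_exec <switches> "$(printf 'input.file_%04d.inp' "$SLURM_ARRAY_TASK_ID")" > /dev/null

You would submit it with sbatch, and Slurm queues the 500 tasks across whatever nodes the cluster brings up (including Spot nodes, if enabled).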