Hello, I have a perfectly parallel task that I want to move to the cloud. It involves a function that takes around an hour to complete, and I would like to run this function around 1000 times in parallel.
Does anyone have any suggestions for how to approach this problem? I have tried using Ray clusters to no avail -- which Google Cloud product can accomplish this most easily?
Again, each job is agnostic of the others; they do not need to know anything besides the fact that a job has left the queue.
Any tutorials or documentation I should read would be greatly appreciated!
What does that function normally do? Is there any involvement of Spark?
Approach 1
If Spark is involved, then we can use a Dataproc cluster or Dataproc Serverless to deploy your job, get it done, and destroy the cluster. Here Spark itself handles the distributed computing, provided you give it a decent cluster.
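A minimal PySpark sketch of that pattern, assuming the hour-long work is wrapped in a `train_one(job_id)` function (the function name and job count are hypothetical placeholders, not from this thread):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parallel-training").getOrCreate()
sc = spark.sparkContext

def train_one(job_id):
    # placeholder for the ~1 hour training function
    return job_id

# Distribute 1000 independent tasks across the Dataproc cluster.
# numSlices=1000 gives each task its own partition so Spark can
# schedule them in parallel across the executors.
results = sc.parallelize(range(1000), numSlices=1000).map(train_one).collect()
```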
Approach 2
We can also use a multi-core Compute Engine instance and run your task on it. But here you have to use a multithreading or multiprocessing library and handle the parallelism yourself. The compute is handled by Compute Engine, but the distribution is done by you.
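If you go the single-VM route, a minimal sketch using Python's standard `multiprocessing` library (again, `train_one` and the job count are placeholders):

```python
import os
from multiprocessing import Pool

def train_one(job_id):
    # placeholder for the ~1 hour training function
    return job_id

if __name__ == "__main__":
    # Fan the 1000 independent jobs out across the cores of a single VM.
    # imap_unordered yields results as each job finishes, regardless of order.
    with Pool(processes=os.cpu_count()) as pool:
        for finished in pool.imap_unordered(train_one, range(1000)):
            print(f"job {finished} finished")
```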
BTW, what exactly does your function do, and where exactly does it spend most of its time? Without actually knowing your use case, it's difficult to provide a good/optimized solution.
No involvement of Spark, just Python. The function does some ML training on a given dataset. I would prefer to distribute the tasks rather than use a multiprocessing library.
I think you need to check this out => https://cloud.google.com/ai-platform/training/docs/overview
1) Here you can use a Compute Engine instance with a GPU so that the training is distributed by the ML libraries that you use.
2) You can also check out Colab Pro notebooks
3) https://cloud.google.com/ai-platform/training/docs/overview#distributed_training_structure
https://cloud.google.com/ai-platform/docs/technical-overview
https://towardsdatascience.com/how-to-train-machine-learning-models-in-the-cloud-using-cloud-ml-engi...
Cool, thanks for those resources. Digging into Vertex AI and AI Platform, do you know if it's possible to use my own proprietary tuning logic? For example, if I have a custom package built on top of sklearn, can I run that in its own container, or do I need to use Google's frameworks?
For anyone else trying to build a lot of models at once with custom frameworks: I got this to work using custom containers and custom jobs on Vertex AI.
You just need to dockerize your scripts and push them to Artifact Registry; then you can go about running the jobs mostly as you would with AutoML.
This helped me: https://cloud.google.com/vertex-ai/docs/training/create-custom-container
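For reference, a minimal sketch of submitting such jobs with the `google-cloud-aiplatform` Python SDK, assuming the training image has already been built and pushed to Artifact Registry. The project ID, region, bucket, image URI, machine type, and the `--task-index` argument are all hypothetical placeholders:

```python
from google.cloud import aiplatform

# Hypothetical project, region, and staging bucket; replace with your own.
aiplatform.init(
    project="my-project",
    location="us-central1",
    staging_bucket="gs://my-staging-bucket",
)

for i in range(1000):  # one independent training job per task
    job = aiplatform.CustomJob(
        display_name=f"custom-train-{i}",
        worker_pool_specs=[
            {
                "machine_spec": {"machine_type": "n1-standard-4"},
                "replica_count": 1,
                "container_spec": {
                    # Image built from your Dockerfile and pushed to Artifact Registry.
                    "image_uri": "us-central1-docker.pkg.dev/my-project/my-repo/trainer:latest",
                    # Hypothetical flag your training script would parse to pick its dataset.
                    "args": [f"--task-index={i}"],
                },
            }
        ],
    )
    job.run(sync=False)  # submit without blocking so the jobs run in parallel
```

Note that how many of the 1000 jobs actually run concurrently depends on the CPU/GPU quota of your project in the chosen region.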