
TensorFlow Distributed ParameterServer Setup

Hi,

Earlier I asked on the TensorFlow forum but didn't get a practical answer. I read the TensorFlow documentation and set up a simple

distribution = tf.distribute.MultiWorkerMirroredStrategy()

The cluster spec is:

cluster_spec = { "worker":["127.0.0.1:9901",
                           "127.0.0.1:9902"]
               }
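
For completeness, here is roughly how I wired it together. Both workers run the same script, differing only in the TF_CONFIG task index; the MNIST model below is just a placeholder:

import json
import os

import tensorflow as tf

cluster_spec = {"worker": ["127.0.0.1:9901", "127.0.0.1:9902"]}

# Each worker sets TF_CONFIG before creating the strategy; the strategy
# reads the cluster layout and this process's own role from it.
os.environ["TF_CONFIG"] = json.dumps(
    {"cluster": cluster_spec,
     "task": {"type": "worker", "index": 0}}  # index 1 on the second worker
)

distribution = tf.distribute.MultiWorkerMirroredStrategy()

# Variables must be created inside the strategy scope so they are
# mirrored and kept in sync across the workers.
with distribution.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
model.fit(x_train / 255.0, y_train, epochs=1, batch_size=64)

Training only starts once both worker processes are up, since the collective ops wait for every member of the cluster.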

It produced the expected results when I trained on the MNIST dataset. I documented what worked at https://branetheory.org/2022/05/25/distributed-training-using-tensorflow-federated/

But I have never understood how to set up a truly distributed parameter server. I think it isn't fully documented because it involves setting up compute VMs, GPUs, etc.
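
From the docs, my mental model of the setup looks roughly like the sketch below: a minimal local version where the chief and ps addresses (ports 9900/9903) and the toy model are my own placeholders, and each role runs in its own process. Whether this is the right shape for a real multi-VM setup is exactly what I am unsure about:

import json
import os

import tensorflow as tf

cluster_spec = {
    "chief": ["127.0.0.1:9900"],
    "worker": ["127.0.0.1:9901", "127.0.0.1:9902"],
    "ps": ["127.0.0.1:9903"],
}

def start_server(job_name, task_index):
    # Run this in each worker/ps process; it blocks and serves the chief.
    tf.distribute.Server(
        tf.train.ClusterSpec(cluster_spec),
        job_name=job_name,
        task_index=task_index,
        protocol="grpc",
    ).join()

def run_chief():
    # Run this in the chief process; it coordinates training.
    os.environ["TF_CONFIG"] = json.dumps(
        {"cluster": cluster_spec, "task": {"type": "chief", "index": 0}}
    )
    resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
    strategy = tf.distribute.experimental.ParameterServerStrategy(resolver)

    with strategy.scope():  # variables get placed on the ps task(s)
        model = tf.keras.Sequential([
            tf.keras.layers.Flatten(input_shape=(28, 28)),
            tf.keras.layers.Dense(10),
        ])
        model.compile(
            optimizer="adam",
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        )

    def dataset_fn(input_context):
        (x, y), _ = tf.keras.datasets.mnist.load_data()
        ds = tf.data.Dataset.from_tensor_slices((x / 255.0, y))
        return ds.shuffle(1024).repeat().batch(64)

    # With ParameterServerStrategy, fit() takes a DatasetCreator and an
    # explicit steps_per_epoch because the dataset repeats forever.
    model.fit(
        tf.keras.utils.experimental.DatasetCreator(dataset_fn),
        epochs=1,
        steps_per_epoch=100,
    )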

When I read the paper "Monolith: Real Time Recommendation System With Collisionless Embedding Table", this came up again. The diagram from the paper is attached below.

Can anyone point me to instructions for setting this up and executing a simple training task? I am mainly interested in the TensorFlow distributed setup.

Later, I may also set up Kafka and Flink as described in the paper, for learning.

[Attachment: Screenshot 2023-06-27 123004.png — the Monolith architecture diagram]

Thanks,

Mohan
