
TensorFlow Distributed ParameterServer Setup

Hi,

Earlier I asked on the TensorFlow forum but didn't get a practical answer. I read the TensorFlow documentation and set up a simple

distribution = tf.distribute.MultiWorkerMirroredStrategy()

The cluster spec is:

cluster_spec = { "worker":["127.0.0.1:9901",
                           "127.0.0.1:9902"]
               }
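
For completeness, here is roughly how I wired it together. Both workers run the same script, differing only in the TF_CONFIG task index; the MNIST model below is just a placeholder:

import json
import os

import tensorflow as tf

cluster_spec = {"worker": ["127.0.0.1:9901", "127.0.0.1:9902"]}

# Each worker sets TF_CONFIG before creating the strategy; the strategy
# reads the cluster layout and this process's own role from it.
os.environ["TF_CONFIG"] = json.dumps(
    {"cluster": cluster_spec,
     "task": {"type": "worker", "index": 0}}  # index 1 on the second worker
)

distribution = tf.distribute.MultiWorkerMirroredStrategy()

# Variables must be created inside the strategy scope so they are
# mirrored and kept in sync across the workers.
with distribution.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
model.fit(x_train / 255.0, y_train, epochs=1, batch_size=64)

Training only starts once both worker processes are up, since the collective ops wait for every member of the cluster.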

It produced the expected results when I trained on the MNIST dataset. I documented what worked at https://branetheory.org/2022/05/25/distributed-training-using-tensorflow-federated/

But I have never understood how to set up a truly distributed parameter server. I think it isn't fully documented because it involves setting up compute VMs, GPUs, etc.
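
From the docs, my mental model of the setup looks roughly like the sketch below: a minimal local version where the chief and ps addresses (ports 9900/9903) and the toy model are my own placeholders, and each role runs in its own process. Whether this is the right shape for a real multi-VM setup is exactly what I am unsure about:

import json
import os

import tensorflow as tf

cluster_spec = {
    "chief": ["127.0.0.1:9900"],
    "worker": ["127.0.0.1:9901", "127.0.0.1:9902"],
    "ps": ["127.0.0.1:9903"],
}

def start_server(job_name, task_index):
    # Run this in each worker/ps process; it blocks and serves the chief.
    tf.distribute.Server(
        tf.train.ClusterSpec(cluster_spec),
        job_name=job_name,
        task_index=task_index,
        protocol="grpc",
    ).join()

def run_chief():
    # Run this in the chief process; it coordinates training.
    os.environ["TF_CONFIG"] = json.dumps(
        {"cluster": cluster_spec, "task": {"type": "chief", "index": 0}}
    )
    resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
    strategy = tf.distribute.experimental.ParameterServerStrategy(resolver)

    with strategy.scope():  # variables get placed on the ps task(s)
        model = tf.keras.Sequential([
            tf.keras.layers.Flatten(input_shape=(28, 28)),
            tf.keras.layers.Dense(10),
        ])
        model.compile(
            optimizer="adam",
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        )

    def dataset_fn(input_context):
        (x, y), _ = tf.keras.datasets.mnist.load_data()
        ds = tf.data.Dataset.from_tensor_slices((x / 255.0, y))
        return ds.shuffle(1024).repeat().batch(64)

    # With ParameterServerStrategy, fit() takes a DatasetCreator and an
    # explicit steps_per_epoch because the dataset repeats forever.
    model.fit(
        tf.keras.utils.experimental.DatasetCreator(dataset_fn),
        epochs=1,
        steps_per_epoch=100,
    )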

When I read the paper "Monolith: Real Time Recommendation System With Collisionless Embedding Table", this came up again. The diagram from the paper is attached below.

Can anyone point me to instructions for setting this up and executing a simple training task? I am mainly interested in the TensorFlow distributed setup.

Later, I may also set up Kafka and Flink as described in the paper, for learning.

[Attachment: Screenshot 2023-06-27 123004.png — the Monolith architecture diagram]

Thanks,

Mohan
