Hi,
Earlier I asked on the TensorFlow forum but didn't get a practical answer. I read the TensorFlow documentation and set up a simple strategy:
distribution = tf.distribute.MultiWorkerMirroredStrategy()
The cluster spec is:
cluster_spec = { "worker":["127.0.0.1:9901", "127.0.0.1:9902"] }
It produced appropriate results when I trained on the MNIST dataset. I documented what worked at https://branetheory.org/2022/05/25/distributed-training-using-tensorflow-federated/
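For anyone who finds this later, the working two-worker setup was roughly the sketch below. The script name and the --index flag are just my own conventions for picking the task index; you run the same script once per worker (e.g. python train.py --index 0 and python train.py --index 1 in two terminals):

# train.py - minimal sketch of the two-worker MNIST run that worked for me
import argparse
import json
import os

import tensorflow as tf

parser = argparse.ArgumentParser()
parser.add_argument("--index", type=int, required=True)
args = parser.parse_args()

# Each worker process reads the shared cluster spec from TF_CONFIG;
# it must be set before the strategy is created.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["127.0.0.1:9901", "127.0.0.1:9902"]},
    "task": {"type": "worker", "index": args.index},
})

strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Variables created inside the scope are mirrored across both workers.
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
model.fit(x_train / 255.0, y_train, epochs=1, batch_size=64)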
But I never understood how to use a truly distributed parameter server (tf.distribute.ParameterServerStrategy). I think it isn't documented in detail because it involves setting up compute VMs, GPUs, etc.
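For reference, my current (possibly wrong) understanding of the moving parts is sketched below, loosely following the TensorFlow parameter server training tutorial. The ports and the chief/worker/ps split are placeholders I made up:

# Hedged sketch of what I think a distributed parameter-server setup
# looks like; addresses and ports below are placeholders.
import json
import os

import tensorflow as tf

cluster = {
    "worker": ["127.0.0.1:9901", "127.0.0.1:9902"],
    "ps": ["127.0.0.1:9903"],
    "chief": ["127.0.0.1:9904"],
}

# On each worker and ps machine: start a plain TF server for that task
# and block, changing job_name/task_index per machine, e.g.:
#
#   server = tf.distribute.Server(
#       tf.train.ClusterSpec(cluster),
#       job_name="ps", task_index=0, protocol="grpc")
#   server.join()

# On the chief/coordinator: point TF_CONFIG at the same cluster and
# create the strategy; model variables get placed on the ps task(s).
os.environ["TF_CONFIG"] = json.dumps(
    {"cluster": cluster, "task": {"type": "chief", "index": 0}})

resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
strategy = tf.distribute.ParameterServerStrategy(resolver)

with strategy.scope():
    # Variables created here live on the parameter server(s).
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])

Is this the right shape, and is Model.fit or a custom loop with a ClusterCoordinator the recommended way to drive training from the chief?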
When I read the paper "Monolith: Real Time Recommendation System With Collisionless Embedding Table", this came up again. This is the diagram:
Can anyone point me to instructions for setting this up and executing a simple training task? I am mainly interested in the TensorFlow distributed setup.
I may set up Kafka and Flink as described in the paper later, for learning.
Thanks,
Mohan