DATAFLOW: what is the difference between jsonPayload.portability_worker_id and jsonPayload.worker in...

davidregalado25 · 06-08-2023 09:35 PM

Hi

I wanted to know how many threads a specific job was using. When exporting the job logs, I saw the fields jsonPayload.worker and jsonPayload.portability_worker_id, and I would like to know the difference between them.

Also, I saw the same value for all the logs in the field jsonPayload.thread. So, it is safe to say that only 1 thread was in use, right?

More context:

machine type: n1-highcpu-96
source: mongodb
target: elasticsearch
data volume: 1.7M
Flex template with custom logic

--
Best regards
David Regalado

ms4446

The jsonPayload.worker and jsonPayload.portability_worker_id fields are both related to the worker processes that handle the execution of your data processing tasks in Dataflow.

The jsonPayload.worker field is associated with the worker processes that execute the tasks in the pipeline. Dataflow launches one worker container (also known as an SDK worker) per core of your machine. Each of these worker processes has an unbounded thread pool for processing bundles of data.

The jsonPayload.portability_worker_id likely refers to the worker processes within the context of Apache Beam's portability framework. The portability framework introduces well-defined, language-neutral data structures and protocols between the SDK and runner, allowing different languages and runners to work together more uniformly. The portability worker could be part of this interop layer, but I couldn't find a specific definition or role for this field.

See the following link for more details: https://beam.apache.org/roadmap/portability/
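One way to see how the two fields relate in your own job is to group the exported log entries by jsonPayload.worker and list the jsonPayload.portability_worker_id values that appear under each. The sketch below is only an illustration: it assumes the logs were exported as newline-delimited JSON with a jsonPayload object, and the sample values and the helper name sdk_ids_by_worker are made up for the example.

```python
import json
from collections import defaultdict

# Hypothetical sample entries; real exports carry many more fields,
# and the worker / ID values below are invented for illustration.
SAMPLE_LOG_LINES = [
    '{"jsonPayload": {"worker": "harness-a", "portability_worker_id": "sdk-0-0"}}',
    '{"jsonPayload": {"worker": "harness-a", "portability_worker_id": "sdk-0-1"}}',
    '{"jsonPayload": {"worker": "harness-a", "portability_worker_id": "sdk-0-0"}}',
    '{"jsonPayload": {"worker": "harness-b", "portability_worker_id": "sdk-0-0"}}',
]


def sdk_ids_by_worker(lines):
    """Map each jsonPayload.worker value to the set of
    jsonPayload.portability_worker_id values logged alongside it."""
    mapping = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the export
        payload = json.loads(line).get("jsonPayload", {})
        worker = payload.get("worker")
        sdk_id = payload.get("portability_worker_id")
        if worker is not None and sdk_id is not None:
            mapping[worker].add(sdk_id)
    return dict(mapping)


if __name__ == "__main__":
    for worker, ids in sorted(sdk_ids_by_worker(SAMPLE_LOG_LINES).items()):
        print(worker, len(ids), sorted(ids))
```

If one jsonPayload.worker value maps to many jsonPayload.portability_worker_id values in your real logs, that would be consistent with several SDK processes running on the same worker.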

Regarding the jsonPayload.thread field: if it consistently has the same value across all logs, that does suggest that only one thread was writing those log entries.
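Before concluding that, it may be worth checking the thread value per process rather than across the whole export, since thread names or IDs are only unique within a single process and the same value can show up in several of them. A minimal sketch, again assuming a newline-delimited JSON export with a jsonPayload object; the sample values and the helper names are hypothetical.

```python
import json
from collections import Counter

# Hypothetical sample entries: the same thread value appears under two
# different portability_worker_id values on one worker.
SAMPLE_LOG_LINES = [
    '{"jsonPayload": {"worker": "harness-a", "portability_worker_id": "sdk-0-0", "thread": "Thread-12"}}',
    '{"jsonPayload": {"worker": "harness-a", "portability_worker_id": "sdk-0-0", "thread": "Thread-12"}}',
    '{"jsonPayload": {"worker": "harness-a", "portability_worker_id": "sdk-0-1", "thread": "Thread-12"}}',
]


def count_thread_keys(lines):
    """Count log entries per (worker, portability_worker_id, thread) triple."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the export
        payload = json.loads(line).get("jsonPayload", {})
        thread = payload.get("thread")
        if thread is None:
            continue  # entry carries no thread information
        key = (payload.get("worker"), payload.get("portability_worker_id"), thread)
        counts[key] += 1
    return counts


def distinct_thread_values(lines):
    """Return the set of raw jsonPayload.thread values, ignoring which process logged them."""
    return {thread for (_, _, thread) in count_thread_keys(lines)}


if __name__ == "__main__":
    counts = count_thread_keys(SAMPLE_LOG_LINES)
    print("distinct thread values:", len(distinct_thread_values(SAMPLE_LOG_LINES)))
    print("distinct (worker, sdk id, thread) triples:", len(counts))
```

In the sample data there is a single thread value but two distinct triples, so a single value in jsonPayload.thread does not by itself rule out several processes each logging from a thread with the same name. Threads that never emit a log entry would not show up in either count.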
