Hi
I wanted to know how many threads were in use by a specific job. When exporting the job logs, I saw these fields, and I want to know what the difference between them is.
Also, I saw the same value in the field jsonPayload.thread for all the logs. So is it safe to say that only one thread was in use?
More context:
The jsonPayload.worker and jsonPayload.portability_worker_id fields are both related to the worker processes that handle the execution of your data processing tasks in Dataflow.
The jsonPayload.worker field is associated with the worker processes that handle the execution of tasks in the pipeline. Dataflow launches one worker container (also known as an SDK worker) per core of your machine. Each of these worker processes has an unbounded thread pool for processing the bundles of data.
The jsonPayload.portability_worker_id field likely refers to the worker processes within the context of Apache Beam's portability framework. The portability framework introduces well-defined, language-neutral data structures and protocols between the SDK and runner, allowing different languages and runners to work together more uniformly. The portability worker could be part of this interop layer, but I couldn't find a specific definition or role for this field.
See the following link for more details: https://beam.apache.org/roadmap/portability/
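Regarding the jsonPayload.thread field: if it consistently has the same value across all logs, that does suggest only one thread is being used. Keep in mind that this only reflects threads that actually wrote log entries, so a thread that never logged anything would not show up.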
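To check this across the whole export instead of by eye, you could count the distinct jsonPayload.thread values per jsonPayload.worker. The snippet below is only a rough sketch: it assumes the logs were exported as either a JSON array or one JSON object per line, and the helper names (load_entries, threads_per_worker) and the sample values are made up for illustration, not part of any Dataflow or Cloud Logging API.

```python
import json
from collections import defaultdict


def load_entries(path):
    """Read an export that is either a JSON array or one JSON object per line.

    The file layout is an assumption; adjust it to match your export.
    """
    with open(path, encoding="utf-8") as handle:
        text = handle.read().strip()
    if text.startswith("["):
        return json.loads(text)
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def threads_per_worker(entries):
    """Map each jsonPayload.worker value to the set of jsonPayload.thread values seen."""
    seen = defaultdict(set)
    for entry in entries:
        payload = entry.get("jsonPayload") or {}
        thread = payload.get("thread")
        if thread is None:
            continue  # entry has no thread field (e.g. a textPayload entry)
        seen[payload.get("worker", "<unknown>")].add(thread)
    return dict(seen)


# Hypothetical sample entries, shaped like the fields discussed above.
sample = [
    {"jsonPayload": {"worker": "worker-a", "thread": "101"}},
    {"jsonPayload": {"worker": "worker-a", "thread": "101"}},
    {"jsonPayload": {"worker": "worker-b", "thread": "202"}},
    {"textPayload": "entry without a jsonPayload"},
]

for worker, threads in sorted(threads_per_worker(sample).items()):
    print(worker, len(threads), sorted(threads))
```

For a real export you would call threads_per_worker(load_entries("your_export.json")), where the file name is a placeholder. If every worker maps to a single thread value, that supports the one-thread reading, at least for the threads that logged something.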