
DATAFLOW: what is the difference between jsonPayload.portability_worker_id and jsonPayload.worker in

Hi

I wanted to know how many threads were in use by a specific job. When exporting the job logs, I saw these fields, and I would like to know the difference between them.

[Screenshot: Screen Shot 2023-06-08 at 23.26.05.png]

Also, I saw the same value for all the logs in the field jsonPayload.thread. So, it is safe to say that only 1 thread was in use, right?

More context:

  • machine type: n1-highcpu-96
  • source: mongodb
  • target: elasticsearch
  • data volume: 1.7M
  • Flex template with custom logic

--
Best regards
David Regalado

1 REPLY

The jsonPayload.worker and jsonPayload.portability_worker_id fields are both related to the worker processes that handle the execution of your data processing tasks in Dataflow.

The jsonPayload.worker field is associated with the worker processes that handle the execution of tasks in the pipeline. For Python pipelines, Dataflow by default launches one worker container (also known as an SDK worker) per core of your machine. Each of these worker processes has an unbounded thread pool for processing bundles of data.

The jsonPayload.portability_worker_id likely refers to the worker processes within the context of Apache Beam's portability framework. The portability framework introduces well-defined, language-neutral data structures and protocols between the SDK and runner, allowing different languages and runners to work together more uniformly. The portability worker could be part of this interop layer, but I couldn't find a specific definition or role for this field.

See the following link for more details: https://beam.apache.org/roadmap/portability/

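Regarding the jsonPayload.thread field, if it consistently has the same value across all logs, it does seem to suggest that only one thread is being used. Keep in mind that the logs only show threads that actually wrote a log entry, so this is an indication rather than proof.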
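
To check this against the exported logs rather than by eye, a small script can count the distinct values of each field. The following is only a minimal sketch: it assumes the export is newline-delimited JSON with one log entry per line and a jsonPayload object in each entry. The function name count_field_values and the sample values are made up for illustration.

```python
import json
from collections import Counter

# Field names taken from the question; everything else here is hypothetical.
FIELDS = ("worker", "portability_worker_id", "thread")


def count_field_values(lines):
    """Count how often each value appears for the fields of interest."""
    counts = {field: Counter() for field in FIELDS}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the export
        # Entries without a jsonPayload (e.g. textPayload only) are ignored.
        payload = json.loads(line).get("jsonPayload") or {}
        for field in FIELDS:
            value = payload.get(field)
            if value is not None:
                counts[field][value] += 1
    return counts


# Made-up sample entries standing in for an exported log file.
sample = [
    '{"jsonPayload": {"worker": "w-1", "portability_worker_id": "sdk-0", "thread": "101"}}',
    '{"jsonPayload": {"worker": "w-1", "portability_worker_id": "sdk-1", "thread": "101"}}',
    '{"textPayload": "no json payload here"}',
]

counts = count_field_values(sample)
for field in FIELDS:
    print(field, "distinct values:", len(counts[field]))
```

For a real export, pass an open file object instead of the sample list. Note that the same thread value can show up under several portability_worker_id values, so it may be more informative to count distinct (portability_worker_id, thread) pairs than thread values alone.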