From March 30th to April 14th, we saw a noticeable increase in system lag across all of our Dataflow jobs. This was surprising because the jobs were deployed in February and have not been modified since, and we observed no substantial change in the volume of incoming data during this period.
The unexpected lag has raised concerns about its potential impact on our data if it recurs, so we would like to investigate the issue and understand its root cause.
As a starting point, I would appreciate it if you could guide me on how best to approach this issue. In particular:
What strategies or tools should I use to debug this issue? Is there a way to view or analyze system performance for the dates above?
Were there any known GCP issues during this time frame that could have affected the performance of Dataflow jobs?
Any insights or pointers that can help us mitigate such issues in the future would be greatly appreciated.