Hello folks,
I have a vertex pipeline with custom components that runs daily. My pipeline has several components. Some components depends on others to finish, and some do not. There is a node in my pipeline that has no dependencie (I mean it does not rely on another component's output). Lets name it "text extraction".
The text extraction component usually takes less than 3 minutes to finish. Since it does not depend on any another component, it is usually lunched at the beginning on my pipeline. This what happens most of the time.
However, sometimes, text extraction is lunched at the end of the pipeline (and I can't see why) and runs infinitely. I have a watching application that cancels the pipeline when it exceeds the TTL of the pipeline (20 hours). So the situation is that I have a component that is supposed to start at the beginning of the pipeline and takes 3 minutes, and sometimes its starts at the end and is canceled after 8-ish hours (20h -8h pipeline)
Nota bene : The node is logging `Job is running.` in loop
Thanks for you help
EDIT :The text extraction component uses multi processing. Here is a code snippet of the used function
def series_parallel_apply(data: pd.Series, func, **args):
try:
assert isinstance(data, pd.Series)
except AssertionError:
raise (Exception("Input data should be a pandas series"))
cores = mp.cpu_count() - 1
data_split = np.array_split(data, cores)
pool = mp.Pool(cores)
data_return = pd.concat(
pool.map(partial(apply_series, func=func, **args), data_split),
ignore_index=False,
)
pool.close()
pool.join()
return data_return