Pipeline run failed: The replica workerpool0-0 exited with a non-zero status of 13.

I'm trying to fine-tune the chat-bison@002 model with my own JSONL dataset, and I'm receiving this error: "The replica workerpool0-0 exited with a non-zero status of 13". I think the problem is related to roles and permissions.

When I fine-tune text-bison@002, I don't receive this error, because while setting up that tuning job I'm able to choose which service account to use. With chat-bison@002, I'm not.

Can anyone help with it? I'm stuck.

@lsolatorio @hadi224 @mohsin-m @norahul1020 @ys04092003 @aytech @javatwo @xavidop @nimrah-waqar @Qasim_Taleb 

 

 

Screenshot 2024-02-28 at 12.22.59.png

 

7 REPLIES

Hey @vovaparkhomchuk, I guess the logs will help us get more details about the issue you are facing. TBH I barely work with AI models and pipelines, so I have a very limited idea about this. Still, if you have descriptive logs, I can dig deeper into it.

Thank you for the quick response. Just in case, I've already fixed the issue. It was related to the incorrect format of my dataset: I was trying to tune the chat model using a dataset formatted for the text model. What is strange to me is that the error didn't give any information about the real cause of the run failure, and there is no useful information in the logs.
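For anyone hitting the same mismatch, here is a rough Python sketch of converting a text-model dataset into a chat-model one. It assumes the text dataset uses `input_text` / `output_text` fields and that the chat dataset expects a `messages` list of `author` / `content` pairs with an optional `context` string. Those field names are my recollection of the Vertex AI tuning docs, not something confirmed in this thread, so please check them against the current documentation. The helper names (`text_record_to_chat`, `looks_like_chat_record`, `convert_jsonl`) are made up for illustration.

```python
import json


def text_record_to_chat(record, context=None):
    """Turn one text-model record into a single-turn chat-model record.

    Assumed input shape:  {"input_text": "...", "output_text": "..."}
    Assumed output shape: {"messages": [{"author": ..., "content": ...}, ...]}
    """
    chat = {
        "messages": [
            {"author": "user", "content": record["input_text"]},
            {"author": "assistant", "content": record["output_text"]},
        ]
    }
    if context:
        # Optional instruction text for the conversation (assumed field name).
        chat["context"] = context
    return chat


def looks_like_chat_record(record):
    """Cheap local sanity check to run before uploading the dataset."""
    messages = record.get("messages")
    if not isinstance(messages, list) or len(messages) < 2:
        return False
    return all(
        isinstance(m, dict)
        and m.get("author") in {"user", "assistant"}
        and isinstance(m.get("content"), str)
        for m in messages
    )


def convert_jsonl(lines, context=None):
    """Convert an iterable of JSONL lines, skipping blank lines."""
    converted = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        chat = text_record_to_chat(json.loads(line), context=context)
        converted.append(json.dumps(chat))
    return converted
```

Running `looks_like_chat_record` over every line locally is a quick way to catch a wrongly shaped dataset before starting a tuning job, since the pipeline error itself does not point at the data.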

Temporarily grant broader permissions if it is safe to do so, assign roles like storage.objectViewer or storage.objectAdmin, and examine the training logs.

Thanks for the quick response. I thought the problem was related to roles and permissions, but the problem was in the incorrect format of my dataset.

What are the logs saying?

Thanks for the quick response. Actually, in my opinion, nothing useful. The problem was in the incorrect format of my dataset. 

I'm currently facing a similar issue. How did you correctly format your dataset?