How to evaluate Generative AI finetuned model

samy255 · 12-13-2023 03:07 AM

HI Team,

I have successfuly fine tuned generative LLM with my own labeled training data with 7oo data points, i prepeared and upload training data as per schema: {input_text, output_text}, but now i am unable to see the evaluation. when i try to create evaluation i always get error that my jsonl file doesnot have prompt and ground Truth.

My question is when input schema is {input_text, output_text} then why would make a test jsonl data file witth prompt and ground Truth.

at one place i read that : evaluation is not supported in generative LLM models.

please advise or guide. I really want to see my model performance.

samy255

please guide @xavidop

xavidop

sorry, what?

samy255

I have successfuly fine tuned generative LLM with my own labeled training data with 7oo data points, i prepeared and upload training data as per schema: {input_text, output_text}, but now i am unable to see the evaluation. when i try to create evaluation i always get error that my jsonl file doesnot have prompt and ground Truth.

My question is when input schema is {input_text, output_text} then why would make a test jsonl data file witth prompt and ground Truth.

at one place i read that : evaluation is not supported in generative LLM models.

please advise or guide. I really want to see my model performance.

xavidop

why are you copy-pasting the same? xD please explain it more

Poala_Tenorio

It seems like you're encountering challenges in evaluating the performance of your fine-tuned generative language model, and there might be a bit of confusion regarding the evaluation process.

When you're training a language model using a supervised approach, where the schema includes both input and corresponding output text, the evaluation generally involves assessing the model's performance on a separate dataset. This evaluation dataset would typically contain input-output pairs (as you mentioned) to test how well your model generates the desired output given specific inputs.

However, some platforms or frameworks might require a specific format for evaluation purposes. If you're encountering errors stating that your JSONL file doesn't have a "prompt" and "ground truth," it could be due to the expected format by the evaluation tool or platform.

Regarding the statement you read about evaluation not being supported in generative language models, that might not be entirely accurate. While some models might not have a dedicated evaluation metric or method due to their nature (like models trained via unsupervised methods), it's generally possible and important to evaluate the performance of supervised fine-tuned models.

You can try to format your evaluation data. If the platform expects a JSONL file with a "prompt" (input_text) and "ground truth" (output_text), you might need to modify your dataset accordingly for evaluation purposes.

Also, please provide more details about the platform or tool you're using for evaluation and the community might be able to give more specific guidance.

samy255

Thanks heaps for a detailed reply.

Task: text classification

Model =vertex ai Large Language model (supervised fine tuning)->using google console vertexAi-> language->Tuning.

training data= jsonl file = Format : input_text , Output_text(according to required schema)

evaluation data=json file =Format: input_text ,Output_text(according to required schema)

Scenario 1: when i do tuning through google console, i put both training and evaluation data but unable to trace where the evaluation stores after successful training. (checked all the files in dedicated bucket)

Scenario 2: And when I do tuning through google colab python code, i do not provide evaluation file while training. but after training when i separately try to create evaluation through Model Registry->create Evaluation , this tool does ask for eval file having ('prompt', 'ground_truth'). [even via python code as well get same issue].

Now my question is two fold.

1- For scenerio 1: google console gave the schema for (input_text ,Output_text), so i did it as is ,just unable to locate the evaluation results.

2- For scenerio 2: if i use the evaluation tool format ('prompt', 'ground_truth') where as i trained the model on (input_text, output_text), will it provide me reliable evaluation?

I am so confuse how to project my results , where as my model are working excellent when i deploy and test them. I am a research student so I have to project F1, Accuracy, precision , recall and confusion metrics. Please guide.