I am experiencing an issue using Google Prompt Optimizer (GPO) for a binary classification task. The task is to classify input text into two predefined labels, each consisting of one or two words (e.g., “System Engineering” and “Others”).
The initial system instruction is: "You are a system engineer in the automotive domain, and your task is to allocate the stakeholder requirements to organizational functions that shall realize the requirement.\nDetermine if it falls under the category of 'SystemEngineering' or 'Others'.Provide your answer as 'SystemEngineering' or 'Others'".
When I pass prompts and input texts directly to the Gemini 1.5 Pro model, the outputs are as expected: only the predicted labels are returned. However, when I use GPO for the same task, the responses often become verbose, generating long explanations instead of the desired single labels.
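For reference, my direct calls that return label-only outputs look roughly like this. This is a minimal sketch assuming the `google-generativeai` SDK; the generation-config values (temperature, token cap, stop sequence) are from my own setup, not anything GPO produces:

```python
# Minimal sketch of the direct-call setup that returns label-only outputs.
# Assumes the google-generativeai SDK; config values are illustrative.

SYSTEM_INSTRUCTION = (
    "You are a system engineer in the automotive domain, and your task is to "
    "allocate the stakeholder requirements to organizational functions that "
    "shall realize the requirement.\n"
    "Determine if it falls under the category of 'SystemEngineering' or "
    "'Others'. Provide your answer as 'SystemEngineering' or 'Others'"
)

# Keep the output short and deterministic for classification:
GENERATION_CONFIG = {
    "temperature": 0.0,        # deterministic label choice
    "max_output_tokens": 8,    # hard cap: room for a one- or two-word label
    "stop_sequences": ["\n"],  # cut off any trailing explanation
}

def classify(text: str) -> str:
    """Send one stakeholder requirement to the model and return the label."""
    import google.generativeai as genai  # pip install google-generativeai
    model = genai.GenerativeModel(
        "gemini-1.5-pro",
        system_instruction=SYSTEM_INSTRUCTION,
    )
    response = model.generate_content(text, generation_config=GENERATION_CONFIG)
    return response.text.strip()
```

With this setup the model reliably returns just `SystemEngineering` or `Others`; the question is why the same prompts behave differently inside GPO.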
Example of an unexpected response from GPO:
As you can see, the response in the example above correctly labels the item as SystemEngineering, but because it includes a redundant explanation, the exact_match/score is set to 0.
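As a scoring workaround, I currently strip the explanation and keep only the label before the exact-match comparison. This is a small helper I wrote myself (the label names match my task; it is not part of GPO):

```python
import re

LABELS = ("SystemEngineering", "Others")

def extract_label(response: str, labels=LABELS) -> str:
    """Return the first known label mentioned in a (possibly verbose) response.

    Falls back to the raw stripped text when no label is found, so a truly
    wrong answer still fails the exact-match comparison afterwards.
    """
    for label in labels:
        if re.search(rf"\b{re.escape(label)}\b", response):
            return label
    return response.strip()

# A verbose response like the one GPO records still maps to the right label:
verbose = ("The requirement concerns allocating stakeholder requirements "
           "to organizational functions, so the answer is SystemEngineering.")
print(extract_label(verbose))  # SystemEngineering
```

This rescues the exact-match metric, but it does not fix the underlying problem that GPO optimizes against the verbose responses.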
Interestingly, when the prompts that GPO generates are used directly with the LLM, they do produce just the raw labels. Despite this, GPO records long responses during optimization, resulting in low accuracy and poor optimization outcomes.
For more context, here is the average number of words generated per iteration in GPO (ideally it should be only 2, i.e. the orange line). We reach this value only at the 19th iteration, and for some other datasets it is never reached:
Key questions:
- Why does GPO produce verbose responses during optimization when the same prompts yield only the raw labels in direct calls to the model?
- How can I constrain the responses GPO records to the bare labels so that exact_match scores them correctly?