
Issue with Google Prompt Optimizer (GPO) for Classification Task

I am experiencing an issue using Google Prompt Optimizer (GPO) for a binary classification task. The task is to classify input text into two predefined labels, each consisting of one or two words (e.g., “System Engineering” and “Others”).

The initial system instruction is: "You are a system engineer in the automotive domain, and your task is to allocate the stakeholder requirements to organizational functions that shall realize the requirement.\nDetermine if it falls under the category of 'SystemEngineering' or 'Others'.Provide your answer as 'SystemEngineering' or 'Others'".
When I pass prompts and input texts directly to the Gemini 1.5 Pro model, the outputs are as expected—only the predicted labels are returned. However, when I use GPO for the same task, the responses often become verbose, generating long outputs instead of the desired single labels.
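For reference, the direct calls were constrained along these lines. This is a minimal sketch, not my exact setup: the prompt wording and the helper name `build_prompt` are illustrative, while `temperature` and `max_output_tokens` are standard Gemini generation-config fields:

```python
# Allowed class labels for the binary classification task.
ALLOWED_LABELS = ["SystemEngineering", "Others"]

# Generation settings that keep direct Gemini calls terse:
# a near-zero temperature and a tight token budget leave no
# room for explanatory text around the label.
generation_config = {
    "temperature": 0.0,
    "max_output_tokens": 5,
}

def build_prompt(item: str) -> str:
    """Combine the system instruction with the item to classify (illustrative wording)."""
    return (
        "You are a system engineer in the automotive domain. "
        f"Answer with exactly one of {ALLOWED_LABELS} and nothing else.\n\n"
        f"Requirement: {item}"
    )
```

With this configuration the model has no budget for the multi-paragraph justifications shown below, which is why the direct calls return only the label.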

Example of an unexpected response from GPO:

  • item: 'REPT shall be sized to avoid fire in case of cooling failure.\n Grade: A\n CSR: S\n Note: Cooling failure can mean empty cooling circuit.'
  • target: 'SystemEngineering'
  • response: "**SystemEngineering** \n\nThis requirement directly impacts the design and safety of the entire REPT system, a core focus of system engineering.  Here's why:\n\n* **System-Level Impact:**  Sizing REPT to prevent fire from cooling failure affects multiple components and their interactions (e.g., REPT size, cooling system capacity, materials, safety mechanisms).\n* **Cross-Functional Collaboration:** Addressing this requirement will likely involve electrical engineers, thermal management experts, safety engineers, and potentially others. System engineers play a crucial role in coordinating these efforts. \n* **Risk Management:**  Fire hazards are a major safety concern. System engineers are responsible for identifying and mitigating risks like this during the design phase. \n"
  • exact_match/score: 0.0
  • exact_match/confidence: 1.0

As the example shows, the response labels the item correctly as SystemEngineering, but because it also contains a redundant explanation, exact_match/score is 0.
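As a stopgap, verbose responses like the one above can be normalized back to a bare label before exact-match scoring. This is a workaround sketch, not a GPO feature; the function name `extract_label` is mine, and the label list comes from the task definition:

```python
import re

# The two class labels from the classification task.
ALLOWED_LABELS = ["SystemEngineering", "Others"]

def extract_label(response: str):
    """Return the first allowed label mentioned in a (possibly verbose) response,
    or None if no label is found."""
    # Strip markdown emphasis so '**SystemEngineering**' still matches.
    cleaned = re.sub(r"[*_`]", "", response)
    for label in ALLOWED_LABELS:
        if re.search(rf"\b{re.escape(label)}\b", cleaned):
            return label
    return None
```

Applying this before scoring would rescue examples like the one above, but it requires a custom evaluation step rather than GPO's built-in exact_match, which is part of why I am asking whether GPO itself can enforce the output format.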

Interestingly, when the prompts that GPO generates are used directly with the LLM, they produce only the raw labels. Despite this, GPO records the long responses during optimization, resulting in low accuracy and poor learning outcomes.

For additional context, here is the average number of words generated per iteration in GPO (ideally it should be 2, i.e., the orange line). We reach this value only at the 19th iteration, and for some other datasets it is never reached:

SayyorYusupov_2-1737813641009.png
Key questions:

  1. Is this behavior expected? Does GPO perhaps process prompts through an intermediate step that causes longer responses?
  2. Are there parameters or configurations I may have missed to ensure GPO generates the right outputs for classification tasks?
  3. Is there a way to enforce specific constraints, such as limiting token count or explicitly formatting the output, within GPO? I’ve reviewed the GPO documentation but couldn’t find details about setting maximum token limits or specifying class labels. Any guidance on configuring GPO for this type of task would be greatly appreciated.
