
Safety Settings not being applied correctly when using the gemini-1.5-pro-002 model via API

With the Gemini API model set to gemini-1.5-pro-002, I have all safety settings set to BLOCK_ONLY_HIGH. When using the -001 model, a message containing an f-bomb, a c-bomb and a threat to kill stops with a SAFETY finish reason and HIGH ratings for HARASSMENT and DANGEROUS_CONTENT. With -002 and the same code and message, all returned safety ratings are LOW apart from DANGEROUS_CONTENT, which is NEGLIGIBLE. Has the API changed in some way? I know safety settings are no longer applied by default, but as far as I know I have been explicitly setting them correctly for -001.


Hi @ChrisMICDUP,

Welcome to Google Cloud Community!

The discrepancies in safety ratings between gemini-1.5-pro-001 and gemini-1.5-pro-002 are likely due to changes in the underlying safety filters. The configurable safety filters are not versioned independently of the model. This means that when a new model version is released, the safety filters may be updated, even if the previous model version remains available.

Because the safety filter behavior can change between model versions, this may lead to inconsistencies: even if you are using the same BLOCK_ONLY_HIGH safety setting, the actual filtering applied by the model might differ due to changes in the safety filter itself.

In addition, regarding the content safety filtering levels:

The Gemini API categorizes the probability level of content being unsafe as HIGH, MEDIUM, LOW, or NEGLIGIBLE. 

The Gemini API blocks content based on the probability of content being unsafe and not the severity. This is important to consider because some content can have a low probability of being unsafe even though the severity of harm could still be high.
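To make that concrete, here is a minimal sketch (in the style of the Node.js SDK code later in this thread) of how you might log the per-category probability returned with a candidate; the value shown in the comment is hypothetical, and result is assumed to be the value resolved by a generateContent() call:

const candidate = result.response.candidates[0];
// Each entry looks roughly like { category: "HARM_CATEGORY_HARASSMENT", probability: "HIGH" };
// the configured thresholds act on these probability strings, not on severity.
for (const rating of candidate.safetyRatings ?? []) {
    console.log(`${rating.category}: ${rating.probability}`);
}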


Furthermore, you may also consider checking your code and settings:

API Request: Double-check how you are setting safety settings in your API request. Ensure you are using the correct parameters and that you are specifically setting BLOCK_ONLY_HIGH for the safety settings parameter; the exact format varies depending on the API library you're using (see the sketch after this list).

Model Selection: Verify that you are consistently using the correct model ID in your requests. A typo could lead to using a different model with different safety settings.

You may also try testing with different messages that you know should trigger HIGH ratings. See if the safety ratings are consistent with what you'd expect on both the -001 and -002 models.
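For example, a minimal sketch with the Node.js @google/generative-ai SDK that explicitly sets BLOCK_ONLY_HIGH on every category might look like the following (assuming, for illustration, that the API key is supplied via a GEMINI_API_KEY environment variable; adjust to your own setup):

const { GoogleGenerativeAI, HarmCategory, HarmBlockThreshold } = require("@google/generative-ai");

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);

// Explicitly request blocking only when the probability of harm is HIGH.
const safetySettings = [
    { category: HarmCategory.HARM_CATEGORY_HARASSMENT, threshold: HarmBlockThreshold.BLOCK_ONLY_HIGH },
    { category: HarmCategory.HARM_CATEGORY_HATE_SPEECH, threshold: HarmBlockThreshold.BLOCK_ONLY_HIGH },
    { category: HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT, threshold: HarmBlockThreshold.BLOCK_ONLY_HIGH },
    { category: HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT, threshold: HarmBlockThreshold.BLOCK_ONLY_HIGH },
];

const model = genAI.getGenerativeModel({
    model: "gemini-1.5-pro-002",
    safetySettings: safetySettings,
});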

Lastly, you may also check the Google Cloud documentation on Gemini safety settings for further guidance.

I hope the above information is helpful.

I appreciate your response; however, I am now unable to get consistent results from different messages. I have never seen a HIGH rating no matter how foul (threats to kill, obscenities) I make them. I have all categories set to BLOCK_LOW_AND_ABOVE and still see what I would consider unsafe comments passing. I don't feel comfortable trying to come up with anything worse.

Fortunately I am also using the Perspective API to determine toxicity, and it seems I will have to take the hit of an additional REST call until the Gemini safety settings start becoming useful.

Here's my test code BTW:

const { GoogleGenerativeAI, HarmCategory, HarmBlockThreshold } = require("@google/generative-ai");

const apiKey = process.env.GEMINI_API_KEY; // assumption: the key is supplied via an environment variable
let genAI = new GoogleGenerativeAI(apiKey);

const safetySettings = [
    // The idea is that the Toxicity AI should block all these first. However, we have seen the odd f-bomb and threat slip through
    {
        category: HarmCategory.HARM_CATEGORY_HARASSMENT,
        threshold: HarmBlockThreshold.BLOCK_LOW_AND_ABOVE,
    },
    {
        category: HarmCategory.HARM_CATEGORY_HATE_SPEECH,
        threshold: HarmBlockThreshold.BLOCK_LOW_AND_ABOVE,
    },
    {
        category: HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT,
        threshold: HarmBlockThreshold.BLOCK_LOW_AND_ABOVE,
    },
    {
        category: HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,
        threshold: HarmBlockThreshold.BLOCK_LOW_AND_ABOVE,
    },
];

//const geminiModel = "gemini-1.5-pro-001";
const geminiModel = "gemini-1.5-pro-002";

let model = genAI.getGenerativeModel({
    model: geminiModel,
    safetySettings: safetySettings,
});

const generationConfig = {
    temperature: 0.0, // Adjust for more factual or creative response (0.0 - 2.0)
};

model.generateContent({
    contents: [{role: "user", parts: [{text: "No old white guys should be allowed in swimming pools"}]}],
    generationConfig: generationConfig})
    .then((result) => {
        console.log(`Model: ${geminiModel}\nFinish Reason: ${result.response.candidates[0].finishReason}`);
        try {
            console.log(result.response.text());
        }
        catch (e) {
            console.log(result.response.candidates[0].safetyRatings);
        }
    })
    .catch((error) => {
        console.error(error);
    });

 

FYI, these are some results from my production system, which uses the Perspective API and a Gemini prompt containing the phrase "Return a score between 0 and 100 based on how engaging and relevant the question is. If the question is severely toxic the score will be -5."

For my go-to toxic phrase, the Perspective API always returns a 98.8% chance that the question is toxic; however, on one run Gemini passed it as NEGLIGIBLE on all ratings.

Another example, "old white guys shouldn't use a public swimming pool", scores 50.2% from Perspective and, hilariously enough, was once blocked by Gemini for SAFETY reasons with HARM_CATEGORY_HATE_SPEECH = MEDIUM and HARM_CATEGORY_HARASSMENT = HIGH.
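For reference, here is a minimal sketch of the kind of Perspective API toxicity check described above (illustrative only; it assumes a Perspective API key in a PERSPECTIVE_API_KEY environment variable and Node 18+ for the built-in fetch):

const PERSPECTIVE_URL =
    `https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze?key=${process.env.PERSPECTIVE_API_KEY}`;

async function toxicityScore(text) {
    // Request only the TOXICITY attribute for the comment.
    const response = await fetch(PERSPECTIVE_URL, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
            comment: { text: text },
            languages: ["en"],
            requestedAttributes: { TOXICITY: {} },
        }),
    });
    const data = await response.json();
    // summaryScore.value is a probability between 0 and 1; convert to a percentage.
    return data.attributeScores.TOXICITY.summaryScore.value * 100;
}

toxicityScore("No old white guys should be allowed in swimming pools")
    .then((score) => console.log(`Perspective toxicity: ${score.toFixed(1)}%`))
    .catch((error) => console.error(error));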