Re: Data store with gemini-2.0-flash-001 returns incomplete answers

cpetrenciuc · 04-24-2025 01:22 AM

Hello,
I am building a Conversational Agent (Playbook based) for a Romanian insurance company. I am using an unstructured documents data store as a TOOL for one of the agent's playbooks.

The data store has the following custom prompt:

Given the conversation between a Human and an AI assistant and a list of sources, write a final answer for the AI assistant.
Follow these guidelines:
  + Provide helpful and informative responses to user queries regarding Groupama activities, insurance policies, registering claims and other related topics.
  + Offer clear and concise explanations, unless the user specifically requests more detailed information.
  + The answer MUST be based only on the sources and not introduce any additional information.
  + All numbers, like price, date, time or phone numbers must appear exactly as they are in the sources.
  + You should be prepared to provide more technical explanations if requested and offer support with finding specific information quickly.
  + When asked to provide details about an insurance policy, make sure to include in the response the risks covered by the insurance, any optional clauses, and the procedure for registering claims.
  + You must adjust your responses to be appropriate for the general public.
  + The answer MUST be given in the language of the query.
  + Use a combination of bullet points and numbered lists when presenting multiple pieces of information.
  + Include images and videos relevant to the human's query in the response.
  + Include hyperlinks relevant to the human's query in the response.
  + Provide concise and well-structured responses.
  + Don't try to make up an answer: If the answer cannot be found in the sources, then you answer with NOT_ENOUGH_INFORMATION.

You will be given a few examples before you begin.

Example 1:
Sources:
[1] <product or service> Info Page
Yes, <company> offers <product or service> in various options or variations.

Human: Do you sell <product or service>?
AI: Yes, <company> sells <product or service>.

Example 2:
Sources:
[1] Andrea - Wikipedia
Andrea is a given name which is common worldwide for both males and females.

Human: How is the weather?
AI: NOT_ENOUGH_INFORMATION


Begin! Let's work this out step by step to be sure we have the right answer.

Sources:
$sources

$conversation
Human: $original-query
AI:

I have uploaded this PDF file to my data store, which uses the layout parser: https://drive.google.com/file/d/1xraTqwc_EvxwnSfpO_aDf-MLU0-XWhPQ/view?usp=sharing. It is a CASCO insurance policy for companies that have fleets of vehicles. It has different coverages and clauses depending on the weight of the insured vehicles (below or above 7.5 tons).

When I query this data store with queries like "toate informatiile despre CASCO flote" (all information about CASCO fleets), I get an answer that covers only the first part of the PDF and stops at the phrase "Here are the details for vehicles over 7.5 tons:". These details never come. I have to make an additional query, such as "give me the rest of the information" or the more specific "details of CASCO insurance for vehicles over 7.5 tons", to get the second half of the PDF.

It seems that, for the initial query, the data store returns only the chunk corresponding to the first half of the PDF and no other chunks.

How can I solve this problem? Thank you.

ruthseki

Hi @cpetrenciuc,

Welcome to Google Cloud Community!

I understand that the data store is chunking the PDF and returning only the first relevant chunk for broad queries like "toate informatiile despre CASCO flote" (all information about CASCO fleets). This is probably happening because the chunking splits the document before the details for vehicles over 7.5 tons.

To address this, try the following steps:

1. Increase Chunk Size: Increase the "Chunk size limit" in the document chunking section. Larger chunks increase the chance that a single chunk will contain all the necessary information. Note: Larger chunk sizes can impact the performance of the data store. Keep an eye on latency. Experiment with increasing the size to 1000, 1500, or 2000 tokens.
2. Explore Alternative Parsers: Consider other parsers that might be better suited to your document's format. The Layout Parser might not be optimal for this specific document structure.
3. Confirm "Include Ancestor Headings": Verify that "Include ancestor headings in chunks" is enabled. This is good, as it provides context to each chunk.
4. Refine the Prompt: Review your custom prompt and add a specific instruction to prioritize comprehensive answers: "When the query asks for 'all information' or a broad overview, ensure the AI assistant provides a complete response that covers all relevant aspects, even if it spans multiple sections."
5. Test and Iterate: After making each change, test it thoroughly with the original query ("toate informatiile despre CASCO flote") and variations. This is an iterative process, and you may need to adjust the settings multiple times to find the right balance.
6. Consider Document Structure: If possible, restructure your policy document to simplify chunking. For example, consolidate information for vehicles under and over 7.5 tons into a single section with clear headings.

In summary, by adjusting the chunk size, exploring parser options, refining your prompt, and potentially restructuring the document, you can improve the completeness and quality of the responses.
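
One way to test each change is to look at the raw chunks the data store retrieves for the broad query, separately from the answer the model writes. The sketch below builds such a request for the Discovery Engine REST API. It assumes the `v1` `servingConfigs/default_search:search` method accepts `contentSearchSpec.searchResultMode` set to `CHUNKS` and the `chunkSpec` fields for adjacent chunks; please check this against the API version you use. `search_url`, `search_body`, and `run_search` are hypothetical helper names, and the project, location, and data store IDs are placeholders you pass in.

```shell
#!/bin/bash
# Sketch: inspect which chunks a data store returns for a query.
# Assumption: the data store lives under default_collection and exposes
# the default_search serving config.

# Build the search endpoint URL for a project, location and data store.
search_url() {
  local project="$1" location="$2" data_store="$3"
  local host="discoveryengine.googleapis.com"
  # Multi-region locations such as eu use a location-prefixed host.
  if [ "$location" != "global" ]; then
    host="${location}-${host}"
  fi
  printf 'https://%s/v1/projects/%s/locations/%s/collections/default_collection/dataStores/%s/servingConfigs/default_search:search' \
    "$host" "$project" "$location" "$data_store"
}

# Print the JSON request body asking for chunk-level results,
# including one neighbouring chunk on each side of every hit.
search_body() {
  local query="$1"
  case "$query" in
    *\"*|*\\*)
      # Keep the sketch simple: no JSON escaping is attempted.
      echo "query must not contain quotes or backslashes" >&2
      return 1
      ;;
  esac
  printf '{"query": "%s", "pageSize": 10, "contentSearchSpec": {"searchResultMode": "CHUNKS", "chunkSpec": {"numPreviousChunks": 1, "numNextChunks": 1}}}' \
    "$query"
}

# Send the request with the caller's gcloud credentials.
run_search() {
  local project="$1" location="$2" data_store="$3" query="$4"
  local body
  body="$(search_body "$query")" || return 1
  curl -sS -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "x-goog-user-project: ${project}" \
    -H "Content-Type: application/json" \
    "$(search_url "$project" "$location" "$data_store")" \
    -d "$body"
}
```

For example, `run_search my-project eu my-data-store "toate informatiile despre CASCO flote"` (with your own IDs) should list the retrieved chunks. If chunks from the section on vehicles over 7.5 tons appear in the results, retrieval is probably fine and the answer is being cut off during generation; if they are missing, the chunking or ranking is the more likely cause. The agent's data store tool may rank results differently from this direct call, so treat the output as a hint rather than proof.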

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.

cpetrenciuc

Hi @ruthseki,

Unfortunately, with the layout parser the chunk size cannot be increased above 500 tokens.

I can confirm that I have enabled the Include Ancestor Headings option.

Thank you,

Cosmin Petrenciuc

cpetrenciuc

Hi @ruthseki,

In my previous reply, I told you I cannot create a data store with a chunk size greater than 500. Here is the proof for this statement.

I tried creating the data store through a REST API call. I have this bash script:

#!/bin/bash
# Try to create a data store that uses layout-based chunking
# with a chunk size of 1000 tokens.

PROJECT_ID="cdc-sentiment-analysis-tests"
DATA_STORE_ID="groupama-unstructured-docs-chunksize1000"

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "x-goog-user-project: ${PROJECT_ID}" \
     -H "Content-Type: application/json" \
     "https://eu-discoveryengine.googleapis.com/v1/projects/${PROJECT_ID}/locations/eu/collections/default_collection/dataStores?dataStoreId=${DATA_STORE_ID}" \
     -d '{
           "displayName": "Groupama Unstructured Documents 1000 tokens chunk size",
           "industryVertical": "GENERIC",
           "solutionTypes": ["SOLUTION_TYPE_CHAT"],
           "contentConfig": "CONTENT_REQUIRED",
           "documentProcessingConfig": {
             "chunkingConfig": {
               "layoutBasedChunkingConfig": { "chunkSize": 1000, "includeAncestorHeadings": true }
             },
             "defaultParsingConfig": {
               "layoutParsingConfig": { "enableTableAnnotation": true, "enableImageAnnotation": true }
             }
           }
         }'

When I execute it I get this error:

{
  "error": {
    "code": 400,
    "message": "\"documentProcessingConfig.chunkingConfig.layoutBasedChunkingConfig.chunkSize\" must be between 100 and 500, inclusive.",
    "status": "INVALID_ARGUMENT"
  }
}
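
The error message states the accepted range, so the value can be checked locally before the request is sent. The sketch below is one way to do that. `valid_chunk_size` and `chunking_config` are hypothetical helper names, and the 100 to 500 bounds are taken only from the error message above; they may differ for other chunking modes or API versions.

```shell
#!/bin/bash
# Bounds taken from the INVALID_ARGUMENT message above (assumption: they
# apply to layoutBasedChunkingConfig in the v1 API used in this thread).
MIN_CHUNK_SIZE=100
MAX_CHUNK_SIZE=500

# Return 0 if the value is a plain integer inside the accepted range.
valid_chunk_size() {
  local size="$1"
  case "$size" in
    ''|0*|*[!0-9]*) return 1 ;;   # reject empty, zero-padded or non-numeric input
  esac
  [ "$size" -ge "$MIN_CHUNK_SIZE" ] && [ "$size" -le "$MAX_CHUNK_SIZE" ]
}

# Print the chunkingConfig JSON fragment, or fail before any API call.
chunking_config() {
  local size="$1"
  if ! valid_chunk_size "$size"; then
    echo "chunkSize must be between ${MIN_CHUNK_SIZE} and ${MAX_CHUNK_SIZE}, got: ${size}" >&2
    return 1
  fi
  printf '{"layoutBasedChunkingConfig": {"chunkSize": %s, "includeAncestorHeadings": true}}' "$size"
}
```

With these helpers, `chunking_config 500` prints a fragment that can replace the `chunkingConfig` value in the request body, while `chunking_config 1000` fails locally with a message instead of producing a 400 response from the API.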

cpetrenciuc

I have managed to obtain some improvement by changing my custom data store prompt like this:

Given the conversation between a Human and an AI assistant and a list of sources, write a final answer for the AI assistant.
Follow these guidelines:
  + Provide helpful and informative responses to user queries regarding Groupama activities, insurance policies, registering claims and other related topics.
  + Offer clear and concise explanations, unless the user specifically requests more detailed information.
  + The answer MUST be based only on the sources and not introduce any additional information.
  + All numbers, like price, date, time or phone numbers must appear exactly as they are in the sources.
  + You should be prepared to provide more technical explanations if requested and offer support with finding specific information quickly.
  + When asked to provide details about an insurance policy, make sure to include in the response the risks covered by the insurance, any optional clauses, and the procedure for registering claims.
  + When asked to provide all the information about a subject, provide a complete response that covers all relevant aspects, even if it spans multiple sections.
  + You must adjust your responses to be appropriate for the general public.
  + The answer MUST be given in the language of the query.
  + Use a combination of bullet points and numbered lists when presenting multiple pieces of information.
  + Include images and videos relevant to the human's query in the response.
  + Include hyperlinks relevant to the human's query in the response.
  + Provide concise and well-structured responses.
  + Don't try to make up an answer: If the answer cannot be found in the sources, then you answer with NOT_ENOUGH_INFORMATION.

You will be given a few examples before you begin.

Example 1:
Sources:
[1] <product or service> Info Page
Yes, <company> offers <product or service> in various options or variations.

Human: Do you sell <product or service>?
AI: Yes, <company> sells <product or service>.

Example 2:
Sources:
[1] Andrea - Wikipedia
Andrea is a given name which is common worldwide for both males and females.

Human: How is the weather?
AI: NOT_ENOUGH_INFORMATION


Begin! Let's work this out step by step to be sure we have the right answer.

Sources:
$sources

$conversation
Human: $original-query
AI:

I have added this instruction:
+ When asked to provide all the information about a subject, provide a complete response that covers all relevant aspects, even if it spans multiple sections.

I have also modified my Playbook instructions by adding the following:

    - Step 6.4. If there is additional information available, ask the Human if they want you to provide the remainder of the information.
        - Step 6.4.1. If the Human says they want you to continue, then use ${TOOL:unstructured-doc} with fallback `NOT_ENOUGH_INFORMATION` to search for the remaining information relevant for Human's query.
...
        - Step 6.4.6. If there is additional information available, ask the Human if they want you to provide the remainder of the information.
            - Step 6.4.6.1. If the Human says they want you to continue, then proceed to Step 6.4.1.
    - Step 6.5. Ask the human if they have other questions or requests. Match the formality level determined from the human's initial question. If the human used formal language, maintain a formal tone. If the human used informal language, maintain an informal tone. You will be given interaction examples to help you match the formality level.

It is not a perfect solution, because Gemini still stops mid-phrase or mid-word, or displays broken tables, but at least the chatbot user now has the option of asking for a continuation of the answer.
