Missing PageNumber Value in Vertex AI Search Unstructured Search Results

Hello,

I'm currently working with an unstructured (Import with Metadata.JSONL) search application using Vertex AI Search. I've noticed that the PageNumber value is missing from my search results. In my previous experience with similar search implementations, this value is typically present to indicate which page of the document the search result comes from.

I would appreciate clarification on:

  1. Under what conditions should the PageNumber value be present in the search results?
  2. Are there specific scenarios or document types where PageNumber information might not be available?

Has anyone else encountered similar behavior or could explain the expected behavior regarding PageNumber values in Vertex AI Search?

Thank you in advance for any insights.

Search Code Ref:

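# Assumed setup, not shown in the original post; the API version (v1) is a guess.
# user_query and serving_config are defined elsewhere.
from google.cloud import discoveryengine_v1 as discoveryengine

client = discoveryengine.SearchServiceClient()

response = client.search(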
    request=discoveryengine.SearchRequest(
        query=user_query,
        # filter=f"category: ANY(\"{filter}\")",
        page_size=10,
        serving_config=serving_config,
        content_search_spec=discoveryengine.SearchRequest.ContentSearchSpec(
            extractive_content_spec=discoveryengine.SearchRequest.ContentSearchSpec.ExtractiveContentSpec(
                max_extractive_segment_count=4,
                return_extractive_segment_score=True,
                num_previous_segments=1,
                num_next_segments=1,
            )
        ),
        query_expansion_spec=discoveryengine.SearchRequest.QueryExpansionSpec(
            condition=discoveryengine.SearchRequest.QueryExpansionSpec.Condition.AUTO,
            pin_unexpanded_results=True,
        ),
        spell_correction_spec=discoveryengine.SearchRequest.SpellCorrectionSpec(
            mode=discoveryengine.SearchRequest.SpellCorrectionSpec.Mode.AUTO
        )
    )
)
print(response)

SearchPager Ref:

SearchPager<results {
  id: "doc-164"
  document {
    name: "projects/REPLACE VALUE/locations/global/collections/default_collection/dataStores/REPLACE VALUE_1733723908768/branches/0/documents/doc-164"
    id: "doc-164"
    struct_data {
      fields {
        key: "title"
        value {
          string_value: "REPLACE VALUE.pdf"
        }
      }
      fields {
        key: "start_year"
        value {
          string_value: "2020"
        }
      }
      fields {
        key: "model"
        value {
          list_value {
            values {
              string_value: "VENUE"
            }
            values {
              string_value: "ALL"
            }
          }
        }
      }
      fields {
        key: "end_year"
        value {
          string_value: "Now"
        }
      }
      fields {
        key: "category"
        value {
          list_value {
            values {
              string_value: "QX1.6"
            }
            values {
              string_value: "ALL"
            }
          }
        }
      }
    }
    derived_struct_data {
      fields {
        key: "link"
        value {
          string_value: "gs://REPLACE VALUE.pdf"
        }
      }
      fields {
        key: "extractive_segments"
        value {
          list_value {
            values {
              struct_value {
                fields {
                  key: "relevanceScore"
                  value {
                    number_value: 0.83465969562530518
                  }
                }
                fields {
                  key: "id"
                  value {
                    string_value: "c1"
                  }
                }
                fields {
                  key: "content"
                  value {
                    string_value: "REPLACE VALUE"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

Hi @MaxChen1126,

Welcome to Google Cloud Community!

It sounds like you're working on an interesting project with Vertex AI Search, and I can understand why you'd be concerned about the missing PageNumber value in your search results. The PageNumber can be crucial for understanding the context of a search result, especially when dealing with large documents or datasets. To answer your questions:

1. Under what conditions should the PageNumber value be present in the search results? 

In general, PageNumber should appear in search results when:

  • Document Segmentation: If you're working with large or multi-page documents (e.g., PDFs, long articles, books), the document might be segmented into logical "pages" or chunks for more efficient retrieval. The PageNumber is often added as metadata to the document chunks during the indexing process.
  • Metadata Inclusion: When documents are imported into Vertex AI Search, the PageNumber is usually included if the original document (like a PDF or other paginated text) contains metadata about its page structure. If structured formats such as JSONL are used during ingestion, the page number can be extracted as part of the document’s metadata and indexed. You can check this document to learn how to create and configure indexes in Vertex AI, making sure that your document metadata, such as PageNumber, is indexed properly.
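For reference, here is a minimal sketch of what one line of the metadata JSONL might look like for the document in the question. It assumes the import format uses an `id`, a `structData` object, and a `content` object with `mimeType` and `uri`; please verify those names against the current import documentation. The helper `make_metadata_line` and the bucket path are hypothetical, and the metadata values are copied from the response in the post.

```python
import json


def make_metadata_line(doc_id, gcs_uri, struct_data, mime_type="application/pdf"):
    """Return one JSONL line describing a document and its metadata.

    The key names (id, structData, content, mimeType, uri) are an
    assumption about the import format, not taken from the post.
    """
    record = {
        "id": doc_id,
        "structData": struct_data,
        "content": {"mimeType": mime_type, "uri": gcs_uri},
    }
    # json.dumps emits a single line, which is what JSONL requires.
    return json.dumps(record, ensure_ascii=False)


line = make_metadata_line(
    "doc-164",
    "gs://example-bucket/manual.pdf",  # hypothetical path
    {
        "title": "manual.pdf",
        "start_year": "2020",
        "end_year": "Now",
        "model": ["VENUE", "ALL"],
        "category": ["QX1.6", "ALL"],
    },
)
print(line)
```

Note that a record like this describes the whole document, not individual pages.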

2. Are there specific scenarios or document types where PageNumber information might not be available?

Yes, there are some cases where the PageNumber may not be available or may not be automatically included in the search results:

  • Unstructured Documents: If the document is not paginated (e.g., plain text files or web pages), or if the system cannot infer a "page" structure, there may not be any PageNumber available. In these cases, the document may be indexed as a single chunk of text without any reference to pages.
  • Incorrect Metadata Parsing: If the metadata extraction during the ingestion process fails or is incomplete (e.g., the document metadata was not properly included or formatted in the input file), the PageNumber might not be indexed. This is often the case when documents don't have clear, standardized metadata or if the extraction tools miss certain elements.
  • Non-paginated Document Types: For documents such as images, videos, or audio files, there is no "page number" to extract. Similarly, some types of structured documents (e.g., databases or spreadsheets) may not have page-based structures.
  • Search Index Configuration: Depending on how the search index is configured in Vertex AI Search, certain metadata fields (like PageNumber) might not be indexed or returned by default. In some configurations, metadata extraction needs to be explicitly defined or customized during the ingestion process.

3. Has anyone else encountered similar behavior or could explain the expected behavior regarding PageNumber values in Vertex AI Search?

Yes, this is a known issue that some users have encountered when working with large or segmented documents in Vertex AI Search. The issue can often be traced back to how the documents are structured and how metadata, including PageNumber, is handled during ingestion. Here are the common causes:

  • Documents not having clear or consistent page breaks.
  • Metadata not being included in the indexing process.
  • The system not correctly interpreting paginated content during document ingestion.

To troubleshoot and resolve the issue, make sure you’re requesting the PageNumber field in your search query's return parameters when querying the search index. Some fields might be excluded by default, so you may need to ask for PageNumber explicitly in the response. You can check this document to learn more about the concept of indexing.
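As a quick check, a small sketch like the one below can show whether any extractive segment in a result carries a page number at all. It works on a plain dict version of a search result. The key name `pageNumber` is an assumption (the response in the question only shows `relevanceScore`, `id`, and `content`), and the helper `segment_pages` is hypothetical.

```python
def segment_pages(result: dict) -> list:
    """Return (segment id, page number or None) for each extractive segment."""
    derived = result.get("document", {}).get("derived_struct_data", {})
    segments = derived.get("extractive_segments", [])
    # "pageNumber" is an assumed key; None means the segment does not have it.
    return [(seg.get("id"), seg.get("pageNumber")) for seg in segments]


# Sample shaped like the response in the question (content elided).
sample_result = {
    "id": "doc-164",
    "document": {
        "id": "doc-164",
        "derived_struct_data": {
            "link": "gs://REPLACE VALUE.pdf",
            "extractive_segments": [
                {
                    "relevanceScore": 0.83465969562530518,
                    "id": "c1",
                    "content": "REPLACE VALUE",
                }
            ],
        },
    },
}

print(segment_pages(sample_result))  # [('c1', None)]
```

With the Python client, each result would first need converting to a dict (for example with the proto-plus `to_dict` class method on the result's message type, assuming it is available in your client version). If every segment comes back with `None`, the field is absent from the response itself rather than dropped by your own parsing.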

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.