
Vertex search Configuration - Stemming / Boost filters

Hi,

I'm trying to set up Vertex AI Search (not the chatbot) for our documentation site, as Sitesearch/PSE is being shut down. A few questions:

- How do I prevent certain words from being stemmed? For example, yugabyted is automatically converted to YugabyteDB, and I don't want this to happen. (Putting the term in quotes, as "yugabyted", doesn't work.)

- For Boost/Bury, how do I create a filter where the URL path matches a pattern?

- How do I deep-link to specific headers within a page? How do I identify whether a hit is on a header or not?

Thanks in advance.


@ruthseki - any suggestions here? How do we get the right support for this? We are an enterprise customer.

Hi @premyb,

Welcome to Google Cloud Community!

Here are some possible approaches that might help with your Vertex AI Search configuration:

Preventing Stemming for Specific Words

While Vertex AI Search does not directly support custom stemming rules, you might consider the following approaches to handle specific words:

  • Create a list of words to exclude from stemming.
  • Preprocess your text data: before uploading your documentation to Vertex AI Search, use a custom tokenizer or text-processing library to handle these words.

Boost/Bury Filters Based on URL Patterns 

  • To create a filter where the URL path matches a pattern, you can use filter parameters in Vertex AI Search.
  • You can use boostSpec or servingControls (Preview) to boost or bury results based on your URL path filter.

Deeplinking to Specific Headers:

  • Vertex AI Search itself doesn't provide direct support for deep-linking to headers or identifying them within search results. The steps below might help you deep-link to specific headers:
  • To deeplink to specific headers within a page, ensure your documents are structured with identifiable headers. You can use HTML tags or specific markers in your documents. 
  • When you preprocess your documentation data, extract the header information (e.g., H1, H2, H3 tags) and store it alongside the content. 
  • Store this header information in separate fields (e.g., h1_text, h2_text) within your documents when indexing them in Vertex AI Search.
  • To identify whether it is a hit on a header, you can analyze the search results and check if the hit corresponds to a header tag or marker within your document structure.
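
The header-extraction step above could be sketched roughly as follows, using only Python's standard html.parser module. The names HeadingCollector and extract_headings are hypothetical, and the sketch assumes you have the raw HTML of each page available before indexing; it only collects headings that carry an id attribute, since those are the ones a URL fragment can point to.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingCollector(HTMLParser):
    """Collect tag, id and text for every heading that has an id attribute."""

    def __init__(self):
        super().__init__()
        self.headings = []    # headings that have been fully read
        self._current = None  # heading currently being read, or None

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            heading_id = dict(attrs).get("id")
            if heading_id:
                self._current = {"tag": tag, "id": heading_id, "text": ""}

    def handle_data(self, data):
        # Accumulate text, including text inside nested inline tags.
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if self._current is not None and tag == self._current["tag"]:
            # Collapse runs of whitespace before storing the heading.
            self._current["text"] = " ".join(self._current["text"].split())
            self.headings.append(self._current)
            self._current = None


def extract_headings(html):
    """Return a list of {"tag", "id", "text"} dicts for the given HTML."""
    parser = HeadingCollector()
    parser.feed(html)
    parser.close()
    return parser.headings
```

The resulting list could then be stored in whatever fields you choose (for example the h1_text and h2_text fields mentioned above) alongside the page content.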

For more information about Vertex AI Search you can read through this documentation.

I hope the above information is helpful.

 

Thanks @MJane.
Preventing Stemming/Spell Suggestion for Specific Words:
Preprocess your text data - how do I do this when using the crawler?
I'm thinking of adding 

"spellCorrectionSpec":{"mode":"AUTO"}

for most queries and have a list of query exceptions and just for those cases set 

"spellCorrectionSpec":{"mode":"SUGGESTION_ONLY"}
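
A minimal sketch of that idea, assuming the request body is built in application code before it is sent to the search endpoint. The function name build_search_body and the exception list are hypothetical; the mode values are the ones I understand the search API's spell correction enum to use (AUTO and SUGGESTION_ONLY), so please check them against the API reference.

```python
# Hypothetical list of terms that should not be auto-corrected.
NO_AUTOCORRECT_TERMS = {"yugabyted"}


def build_search_body(query, exceptions=NO_AUTOCORRECT_TERMS):
    """Build a search request body, disabling auto-correction for listed terms."""
    # Compare case-insensitively and ignore surrounding double quotes.
    words = {word.strip('"').lower() for word in query.split()}
    mode = "SUGGESTION_ONLY" if words & exceptions else "AUTO"
    return {"query": query, "spellCorrectionSpec": {"mode": mode}}
```

For example, build_search_body("yugabyted start") would request suggestions only, while build_search_body("create table") would keep automatic correction on.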

 

Boost/Bury Filters Based on URL Patterns:
I have the filter set as

siteSearch : "https://docs.yugabyte.com/preview/yedis/*"

and set the boost/bury score to -1, but that does not seem to work correctly: for some queries, the first result is still from that same path. I don't understand why.
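
For reference, this is roughly how the same setting would look if passed per request through boostSpec instead of the console. It is only a sketch: build_boost_spec is a hypothetical helper, the condition string is copied from my filter above, and I have not confirmed that this condition syntax is accepted for a website data store. As far as I understand the documentation, a boost of -1 is a strong demotion rather than an exclusion, so highly relevant pages may still appear near the top.

```python
def build_boost_spec(condition, boost):
    """Return a boostSpec fragment with a single condition.

    boost must lie in [-1.0, 1.0]: negative values bury, positive values boost.
    """
    if not -1.0 <= boost <= 1.0:
        raise ValueError("boost must be between -1.0 and 1.0")
    return {
        "boostSpec": {
            "conditionBoostSpecs": [
                {"condition": condition, "boost": boost}
            ]
        }
    }
```

The returned fragment can be merged into the search request body, for example with body.update(build_boost_spec(condition, -1.0)).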

Deeplinking to Specific Headers:

- Again, how can I preprocess data when using the crawler? The data is just HTML (crawled by the Vertex crawler), and the headers are correctly defined with proper IDs. Still, I'm unable to identify from the JSON search response whether a hit is on a header. What additional parameters do I have to pass in the request to get this information?

To prevent stemming in Vertex AI Search, try custom synonym rules, as there's no direct way to stop it. For Boost/Bury, use filter expressions like "url_path LIKE '/docs/%'" to match URL patterns. To deeplink specific headers, ensure headers have unique IDs in your HTML (e.g., <h2 id="section1">Header</h2>), and Vertex AI Search will index them as searchable entities, enabling links to specific sections in search results.

@shaikhsharmeen4, the headers have unique IDs and are correctly marked up and indexed.

 

<h2 id="section1">SomeText</h2>

 

But for a search on "sometext", I'm unable to identify whether the hit was on the header, so that in the result listing I can change the URL to url_path#section1 and the page scrolls to the header/anchor when the result is clicked. How do I do this?
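
To make the desired behaviour concrete, here is a rough sketch of what I want to end up with. It does the matching on my side rather than in Vertex AI Search, and it assumes I already have, for each result URL, a list of (anchor id, header text) pairs, for example generated when the docs site is built. The function name deep_link and that per-page list are hypothetical; nothing in the search response is used here except the result URL.

```python
def deep_link(url, query, headings):
    """Append the anchor of the first header whose text contains the query.

    headings is an iterable of (anchor_id, heading_text) pairs for the page
    at url. The URL is returned unchanged if no header matches.
    """
    # Normalise case and whitespace so "sometext" matches "SomeText".
    needle = " ".join(query.lower().split())
    if not needle:
        return url
    for anchor_id, text in headings:
        if needle in " ".join(text.lower().split()):
            base = url.split("#", 1)[0]  # drop any existing fragment
            return f"{base}#{anchor_id}"
    return url
```

With headings [("section1", "SomeText")], a query of "sometext" would turn url_path into url_path#section1. What I am missing is a way to get the same signal from the search response itself.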