
Discovery Engine Datastore with advanced indexing for the website

I want to create a conversational agent that uses the Datastore tool to retrieve my website content. I set up a Discovery Engine Datastore with advanced indexing for the website. The indexing process took a long time, but the agent is now able to respond based on the website content.

Can I schedule the advanced indexing to run automatically?

Also, when I delete articles from the website, I noticed that the number of Datastore documents does not decrease. Why is that? How can I make sure the deleted articles are removed from the Datastore?


Hi @yasmine ,

  1. Scheduled Indexing:
    Discovery Engine doesn’t support built-in scheduled re-indexing. To automate it:

    • Set up a Cloud Scheduler job that calls a Cloud Function (or the Vertex AI Agent API) to trigger indexing via the API on a regular schedule.

    • You’ll need to re-supply the updated site or feed.

  2. Deleted Articles Still in Datastore:
    This happens because Discovery Engine doesn’t auto-remove deleted web content. It retains previously crawled documents unless:

    • The document is explicitly marked for deletion via the API

    • Or the crawl source (e.g. sitemap) no longer contains the URL and reindexing is triggered

Solution:

  • Maintain an up-to-date sitemap or feed

  • Trigger a full re-crawl via API or UI after deletions

  • Or use the RemoveDocument API for precise control
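
As a rough illustration of the "trigger via API" step, here is a minimal sketch of the HTTP request a scheduled Cloud Function could send. It assumes the manual refresh is done through the `recrawlUris` method on the data store's `siteSearchEngine` resource, that the data store lives in the `global` location under `default_collection`, and that an OAuth access token is already available. The project ID, data store ID, and token values are hypothetical placeholders, and the function only builds the request without sending it.

```python
import json
import urllib.request

# Assumption: global endpoint; regional data stores may use a different host.
API_ROOT = "https://discoveryengine.googleapis.com/v1"


def build_recrawl_request(project, location, data_store, uris, access_token):
    """Build (but do not send) a recrawlUris POST request for a website data store."""
    # Assumption: the data store sits in the default collection.
    site_search_engine = (
        f"projects/{project}/locations/{location}"
        f"/collections/default_collection/dataStores/{data_store}/siteSearchEngine"
    )
    body = json.dumps({"uris": list(uris)}).encode("utf-8")
    return urllib.request.Request(
        url=f"{API_ROOT}/{site_search_engine}:recrawlUris",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )
```

Inside the function that Cloud Scheduler invokes, the request would then be sent with `urllib.request.urlopen(build_recrawl_request(...))`, passing the URLs that were added, changed, or deleted since the last run.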

Hi @yasmine,

Welcome to Google Cloud Community!

In addition to @a_aleinikov’s insight, here are a few more details:

  • Can I schedule the advanced indexing to run automatically?

    Advanced website indexing does support automatic refresh. Once a data store is created, it generates an initial index, then continuously indexes new pages and recrawls existing ones, and it regularly refreshes itself if it hits 50 queries in 30 days. If you need more immediate or more specific scheduling, another option is to use Cloud Scheduler to trigger Cloud Run and execute the indexing via the API.

  • Also, when I delete articles from the website, I noticed that the number of data store documents does not decrease. Why is that? How can I make sure the deleted articles are removed from the data store?

    When you delete articles from your website, the number of data store documents may not decrease immediately; there may be a delay of 6 to 24 hours before removals are fully reflected. For handling deleted articles, the documentation says the following:

    When a page is deleted, Google recommends that you manually refresh the deleted URLs. When your website data store is crawled during either an automatic or a manual refresh, if a web page responds with a 4xx client error code or 5xx server error code, the unresponsive web page is removed from the index.
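
To know which deleted URLs to refresh manually, one option is to compare the sitemap from before and after the deletions. The sketch below assumes your site publishes a standard sitemaps.org XML sitemap and that you keep a copy of the previous version; the function names are illustrative only and not part of any Google API.

```python
import xml.etree.ElementTree as ET

# Standard namespace used by sitemaps.org <urlset> documents.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(sitemap_xml):
    """Return the set of <loc> URLs listed in a sitemap XML string."""
    root = ET.fromstring(sitemap_xml)
    return {
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    }


def removed_urls(previous_xml, current_xml):
    """Return URLs present in the previous sitemap but missing from the current one."""
    return sorted(sitemap_urls(previous_xml) - sitemap_urls(current_xml))
```

The resulting list is the set of pages that should now return a 4xx response, so these are the URLs to submit for a manual refresh.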

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.