
How to create a datastore only from a website XML sitemap file

Issue: From our Search Console, the crawler is re-crawling too often and crawling pages outside the listed pattern (www.mycompany.support/*)

Question: How do you create a datastore that is indexed only from the sitemap and re-indexed only by a manual trigger?


Hi @Ian_Stack,

Welcome to Google Cloud Community!

Your issue and your question touch on different aspects of Google's indexing process.

Issue: Excessive Crawling Outside Designated Path (www.mycompany.support/*)

The problem isn't about what is indexed, but how often and where Google crawls. There's no single solution to completely stop Google from exploring beyond your intended path (www.mycompany.support/*), but you can significantly reduce unwanted crawling by:

  • Tightening robots.txt: Precisely define which pages are disallowed (see the example after this list).
  • Refining your sitemap: Only include pages within www.mycompany.support/*.
  • Cleaning up internal and external links: Remove or redirect links pointing to unauthorized areas.
  • Managing URL parameters: Properly configure parameter handling in Google Search Console.
  • Controlling content updates: Reduce the frequency of updates outside the allowed path.
  • Monitoring Google Search Console: Track crawling activity to pinpoint problem areas.
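
For a concrete starting point, here is a minimal sketch of the first two items. The hostnames, paths, and sitemap URL below are hypothetical placeholders based on the pattern you mentioned; adjust them to your own structure.

    # robots.txt served at https://www.mycompany.support/robots.txt
    # Allow this host and point crawlers at its sitemap.
    User-agent: *
    Allow: /
    Sitemap: https://www.mycompany.support/sitemap.xml

    # robots.txt on any other host you do NOT want crawled
    # (e.g. a hypothetical https://docs.mycompany.support/)
    User-agent: *
    Disallow: /

The sitemap itself should list only pages under www.mycompany.support/*, one <url> entry per page you want indexed:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>https://www.mycompany.support/articles/getting-started</loc>
      </url>
    </urlset>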

This is a process of iterative refinement; perfect control is unlikely, but significant improvement is achievable.

Question: Creating a Datastore Indexed Only from Sitemap and Manual Triggers

You can't create a datastore that is indexed only from a sitemap and re-indexed only via a manual trigger. That would require complete control over indexing, which is not fully achievable: Google's crawler has autonomy and can discover pages through means other than your sitemap.

The closest you can get is by:

  • Using a sitemap exclusively: List only the desired pages.
  • Minimizing discoverability: Use robots.txt to block unwanted pages and avoid external links to them.
  • Manual sitemap resubmission: Trigger re-crawls using Google Search Console (a scripted sketch follows this list).
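
For the resubmission step, here is a minimal sketch in Python, assuming the Search Console API via google-api-python-client and a service account that has been granted access to the property. The credentials file name and both URLs are placeholders.

    # Ask Google to re-fetch the sitemap for a verified Search Console property.
    from google.oauth2 import service_account
    from googleapiclient.discovery import build

    SCOPES = ["https://www.googleapis.com/auth/webmasters"]

    creds = service_account.Credentials.from_service_account_file(
        "service-account.json", scopes=SCOPES  # placeholder key file
    )
    search_console = build("searchconsole", "v1", credentials=creds)

    # Resubmitting asks Google to re-fetch the sitemap; it does not
    # guarantee an immediate re-crawl of every listed URL.
    # For a domain property, use siteUrl="sc-domain:mycompany.support" instead.
    search_console.sitemaps().submit(
        siteUrl="https://www.mycompany.support/",
        feedpath="https://www.mycompany.support/sitemap.xml",
    ).execute()

The same action is available manually in the Search Console UI under Sitemaps.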

Even with these methods, Google may still crawl some pages outside your control, though the likelihood is significantly reduced. The emphasis is on reducing automatic crawling, not eliminating it entirely.

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.

 

Thank you for this, however this is all very theoretical. Can you show me exactly how to create a datastore only with a sitemap? Like which datastore do I select? What do I put in the include and exclude? Is there an example I can look at?