
Dialogflow CX: Website Datastore Unpredictable Behavior

I am developing a Dialogflow CX app connected to a website datastore. 

I have the app in two different projects (one for testing and one for production) and I have the same configuration for both (i.e. sites to include and sites to exclude). 

Neither has been automatically refreshed in the last 14 days; the second (production) datastore was created more recently and has more content than the first.

Website Datastore Test: 

Documents usage / project quota limit: 15440 / 200000
Data size: 1.24 GB

Website Datastore Prod: 
Documents usage / project quota limit: 15440 / 200000
Data size: 4.99 GB

I need to know that the two datastores are at least close to the same. I can re-index the test datastore now, but going forward, how do I know that they will re-index the same way and contain the same website data? How can I manage and monitor this?

I have Advanced Website indexing turned on. 

Thank you in advance
 
3 REPLIES

The datastore size continues to go down (now 1.13 GB), and when I ran the manual refresh curl command, it had no effect on the index.

Hi @malam,

The discrepancy between your test and production Dialogflow CX website datastores, despite seemingly identical configurations, highlights a potential issue with the website data itself or the indexing process. While both show no recent automatic refreshes, the production datastore's significantly larger size (4.99 GB vs 1.24 GB) strongly suggests different data being indexed. Let's address your concerns:

1. Why are the Datastores Different?

The most likely reasons for the difference are:

  • Different Website Content: Although you believe the configurations are identical, there might be subtle differences in the URLs included or excluded, robots.txt settings, or even dynamic content on your website that is treated differently depending on the environment (e.g., test vs. production). A thorough comparison of the sites to include and sites to exclude lists is crucial; double-check for typos or unintended differences. If you use sitemaps, also look for differences in the sitemap files used to specify what to index.
  • Timing of Content Updates: If you updated your website's content after creating the production datastore, that newer content would naturally be included in the production datastore but absent from the older test datastore.
  • Indexing Errors: There's always a possibility of errors in the indexing process itself. The data size difference suggests a significant divergence, pointing towards this being a substantial factor.
  • Crawling Issues: The website crawler might be encountering unexpected problems accessing some parts of your production website, preventing complete indexing. Check your website's server logs for 404 errors or other HTTP status codes that indicate crawl failures; a quick client-side spot check of the included URLs (see the sketch after this list) can also surface obvious problems.
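
To make that spot check concrete, here is a minimal sketch that fetches a handful of representative URLs and reports anything that does not return HTTP 200. The URLs are placeholders for pages from your "Sites to include" lists, and a plain HTTP client will not behave exactly like the crawler, so treat this only as a first filter:

```python
import requests

# Placeholder URLs - replace with representative pages from your
# "Sites to include" configuration in both projects.
URLS_TO_CHECK = [
    "https://www.example.com/",
    "https://www.example.com/docs/getting-started",
]


def spot_check(urls):
    """Report URLs that do not return HTTP 200 to a plain GET request."""
    for url in urls:
        try:
            resp = requests.get(url, timeout=10, allow_redirects=True)
            if resp.status_code != 200:
                print(f"{url} -> HTTP {resp.status_code}")
        except requests.RequestException as exc:
            print(f"{url} -> request failed: {exc}")


if __name__ == "__main__":
    spot_check(URLS_TO_CHECK)
```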

2. Ensuring Consistent Re-indexing:

To ensure both datastores stay consistent, follow these steps:

  • Verify Configuration: Do a line-by-line comparison of the "Sites to include" and "Sites to exclude" settings in both the test and production Dialogflow CX projects. This is crucial for identifying even minor discrepancies.
  • Force a Full Re-index (Manually): While you've re-indexed the test datastore, you must manually trigger a re-index of the production datastore to establish a known baseline. Allow sufficient time for the complete indexing process.
  • Monitor Indexing Progress: Track the indexing process in both projects. Look at the "Refreshed pages" graph and watch for any errors reported in the console. A full re-index should significantly increase the number of refreshed pages. A consistent increase in the number of indexed pages over time indicates healthy indexing.
  • Regular Re-indexing Schedule: Implement a scheduled re-indexing process, possibly using a script or Cloud Functions (see the sketch after this list), to regularly refresh both datastores. The frequency should align with how often your website content changes; daily or weekly could be suitable.
  • Use a Sitemap (Recommended): Submitting a well-structured sitemap to your website datastore dramatically improves reliability and reduces the risk of the crawler missing parts of your website. This ensures complete and consistent indexing.
  • Error Logging and Alerting: Implement monitoring and alerting to receive notifications if indexing fails or if there are significant differences between the sizes of the two datastores. This will help you catch problems early. Consider setting up alerts based on datastore size or the number of indexed pages.
  • Test Regularly: Before deploying updates to your production website, thoroughly test the indexing process in your testing environment. This will prevent unexpected surprises when your website and datastore are updated.
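
As a starting point for the scheduled refresh mentioned above, the sketch below is a small Cloud Functions-style handler that asks Vertex AI Search to recrawl a fixed list of key URLs. It is illustrative only: it assumes the v1alpha siteSearchEngine:recrawlUris REST method, and the project ID, datastore ID, and URLs are placeholders you would replace with your own values.

```python
import google.auth
from google.auth.transport.requests import AuthorizedSession

# Placeholders - replace with your own values.
PROJECT_ID = "my-project"
DATA_STORE_ID = "my-website-datastore"
URIS_TO_RECRAWL = [
    "https://www.example.com/",
    "https://www.example.com/pricing",
]

RECRAWL_ENDPOINT = (
    "https://discoveryengine.googleapis.com/v1alpha/"
    f"projects/{PROJECT_ID}/locations/global/collections/default_collection/"
    f"dataStores/{DATA_STORE_ID}/siteSearchEngine:recrawlUris"
)


def refresh_datastore(request=None):
    """Entry point for an HTTP-triggered Cloud Function (e.g. hit by Cloud Scheduler)."""
    credentials, _ = google.auth.default(
        scopes=["https://www.googleapis.com/auth/cloud-platform"]
    )
    session = AuthorizedSession(credentials)

    # Ask the site search engine to recrawl the listed URIs; the call returns
    # a long-running operation whose name can be polled later.
    resp = session.post(RECRAWL_ENDPOINT, json={"uris": URIS_TO_RECRAWL})
    resp.raise_for_status()
    operation_name = resp.json().get("name", "")
    print(f"Started recrawl operation: {operation_name}")
    return operation_name, 200


if __name__ == "__main__":
    refresh_datastore()
```

Running the same job on the same schedule in both projects keeps the test and production datastores comparable, and Cloud Scheduler can drive the cadence (daily, weekly, etc.) to match how often your content changes.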

3. Monitoring and Management:

  • Regular Datastore Size Comparisons: Set up a regular automated process to compare the data sizes (or document counts) of both datastores, and generate an alert if the difference exceeds a predetermined threshold (a rough sketch follows this list).
  • Document Usage Monitoring: Keep an eye on the "Documents usage" metric. If it approaches the quota limit, you'll need to increase your quota or optimize your website content for indexing (e.g., removing unnecessary pages).
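
For that comparison, one rough approach is to count documents in each datastore via the Discovery Engine documents.list method and flag drift above a threshold. The sketch below assumes the default collection and branch; the project and datastore IDs are placeholders, and document count is only a proxy for the data size shown in the console.

```python
import google.auth
from google.auth.transport.requests import AuthorizedSession

# Placeholders - replace with your own project and datastore IDs.
DATASTORES = {
    "test": ("my-test-project", "my-test-website-datastore"),
    "prod": ("my-prod-project", "my-prod-website-datastore"),
}
DRIFT_THRESHOLD = 0.05  # flag if counts differ by more than 5%


def count_documents(session, project_id, data_store_id):
    """Page through documents.list and return the total number of indexed documents."""
    url = (
        "https://discoveryengine.googleapis.com/v1/"
        f"projects/{project_id}/locations/global/collections/default_collection/"
        f"dataStores/{data_store_id}/branches/default_branch/documents"
    )
    total, page_token = 0, None
    while True:
        params = {"pageSize": 1000}
        if page_token:
            params["pageToken"] = page_token
        resp = session.get(url, params=params)
        resp.raise_for_status()
        payload = resp.json()
        total += len(payload.get("documents", []))
        page_token = payload.get("nextPageToken")
        if not page_token:
            return total


def main():
    credentials, _ = google.auth.default(
        scopes=["https://www.googleapis.com/auth/cloud-platform"]
    )
    session = AuthorizedSession(credentials)
    counts = {
        name: count_documents(session, project_id, ds_id)
        for name, (project_id, ds_id) in DATASTORES.items()
    }
    print(counts)

    low, high = min(counts.values()), max(counts.values())
    if high and (high - low) / high > DRIFT_THRESHOLD:
        # Replace this print with your alerting mechanism of choice
        # (e.g. write a custom metric or send a notification).
        print("ALERT: document counts have drifted between test and prod")


if __name__ == "__main__":
    main()
```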

The indexing process isn't instant. Be patient and give it sufficient time to complete. If issues persist after following these steps, contact Google Cloud Support for assistance. They can investigate potential problems with the indexing process itself.

I hope the above information is helpful.

Ruth, thank you for your response. 
1. I ensured that the configuration of sites was identical
2. I had to create a new datastore in order to avoid production issues
3. There is an issue with the datastore shrinking, and this behavior continues (the 4.99 GB is now 2.84 GB, and the change is not shown on the activity graph). There is also a cost to re-indexing my website weekly, which I did not expect when starting my project on Dialogflow.
4. I will look into the sitemap, thank you. 

5. Another critical concern: I tried the curl command for manual refresh, along with the manual recrawl URIs feature in the console, and neither worked (even though the operations were reported as "successful"). Documents were not added to the index and nothing showed up on the Activity tab. This is extremely worrying.
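
For anyone checking the same thing, the long-running operation returned by the recrawlUris call can be fetched directly so that any error or response details are visible, rather than relying on the console status alone. This is only a generic sketch assuming the v1alpha operations endpoint; the operation name is a placeholder for whatever the recrawl call returned.

```python
import json

import google.auth
from google.auth.transport.requests import AuthorizedSession

# Placeholder - the full operation name returned by the
# siteSearchEngine:recrawlUris call.
OPERATION_NAME = "OPERATION_NAME_RETURNED_BY_RECRAWL_URIS"


def get_operation(name):
    """Fetch the raw long-running operation resource from the Discovery Engine API."""
    credentials, _ = google.auth.default(
        scopes=["https://www.googleapis.com/auth/cloud-platform"]
    )
    session = AuthorizedSession(credentials)
    resp = session.get(f"https://discoveryengine.googleapis.com/v1alpha/{name}")
    resp.raise_for_status()
    return resp.json()


if __name__ == "__main__":
    # Print the whole resource: "done", any "error", and the "response"
    # body, which should show more detail than the console status alone.
    print(json.dumps(get_operation(OPERATION_NAME), indent=2))
```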