Best practice for training on very large documents

I am creating a chatbot using Dialogflow CX. A few of the documents in my data store are very large, running to a few hundred pages; some are in PDF format and some are in HTML format.

I am wondering how DF handles such big documents, and whether I can help it perform better.

Does anyone know how DF breaks down such documents? Is it by page, paragraph, chapter, subchapter, or something else?

What if I instead break the large document down into several smaller documents, e.g. one per chapter? Would that improve the bot?


Hi @amirdolev!

The data store that Dialogflow CX uses does not care about the size of each document. What it does is the following: per document, it creates different chunks, calls an embedding model on them, and stores the data in a vector database. So at the end, what you have is basically vectors and chunks that Dialogflow CX will retrieve.
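
Conceptually, the ingestion side looks something like the sketch below (Python, using the Vertex AI SDK's text embedding model as an example). The project ID, model name, file name, and the split_into_chunks helper are all placeholders; the actual chunking inside the data store is managed for you and is not configurable from Dialogflow CX.

```python
# Conceptual sketch only: the real chunking and storage are handled internally
# by the data store; the project, model name, and chunker below are placeholders.
import vertexai
from vertexai.language_models import TextEmbeddingModel

vertexai.init(project="my-project", location="us-central1")  # placeholder project
model = TextEmbeddingModel.from_pretrained("textembedding-gecko@003")  # example model


def split_into_chunks(document_text: str) -> list[str]:
    # Hypothetical chunker; the data store's own segmentation is not exposed.
    return [p.strip() for p in document_text.split("\n\n") if p.strip()]


vector_db = []  # stand-in for the managed vector database
for chunk in split_into_chunks(open("manual.txt").read()):
    vector = model.get_embeddings([chunk])[0].values  # list of floats
    vector_db.append((vector, chunk))  # Dialogflow CX later retrieves by vector similarity
```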

One important tip here: create a data store with data that is related. If you have data/documents that are not related to each other, create different data stores and call them from different parts of your DF CX flows.
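
For example, if your bot covers two unrelated topics, you might keep one data store per topic and point the relevant flow or page at the matching one. A rough sketch with hypothetical data store IDs; check the exact serving-config path against what your console shows:

```python
# Hypothetical mapping from a DF CX flow/topic to its own data store.
DATA_STORES = {
    "billing": "billing-docs-datastore",    # placeholder IDs
    "hardware": "hardware-docs-datastore",
}


def serving_config_for(topic: str, project: str = "my-project") -> str:
    # Resource path used when querying a Vertex AI Search data store;
    # copy the exact IDs from your data store's details page.
    data_store_id = DATA_STORES[topic]
    return (
        f"projects/{project}/locations/global/collections/default_collection/"
        f"dataStores/{data_store_id}/servingConfigs/default_search"
    )
```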

I hope I answered your questions!

Xavi

Hi @xavidop.

Please elaborate on chunks. Chunks of what: words, sentences, paragraphs, ...? What is the size of a chunk? How is the segmentation into chunks determined, and does it take into account the structure of the document?

In my case, the document describes the functionality of a product. The product has many features, each described in a chapter. Would you recommend splitting it into a document per feature or a datastore per feature? If so, why?

Thanks,

Amir

Hi, in your case it could be beneficial to create a data store per feature/chapter. This assumes that you are going to have pages and flows for each feature.

A "chunk" typically refers to a segment or piece of data. In the context of embeddings, it can represent a group of words, a phrase, or even an entire document. For example, in natural language processing, a chunk could be a sequence of words like "New York City" or "machine learning algorithms", or something bigger; it depends on the embedding model.
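
As a purely illustrative example of chunking (the data store's own segmentation strategy is not exposed), a common generic approach is fixed-size word windows with a little overlap:

```python
def word_window_chunks(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Illustrative chunker: overlapping fixed-size word windows.

    The sizes are arbitrary placeholders; they do not reflect what the
    Dialogflow CX data store actually uses internally.
    """
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, max(len(words) - overlap, 1), step):
        chunks.append(" ".join(words[start:start + size]))
    return chunks
```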


The chunks are then turned into vectors by an embedding model. An embedding model is a type of machine learning model that learns to represent objects, such as words, phrases, or documents, as vectors in a high-dimensional space. These vectors, known as embeddings, capture semantic relationships and similarities between the objects. The goal is to place similar objects close to each other in the embedding space.
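
"Close to each other" can be made concrete with cosine similarity between two embedding vectors. A minimal sketch, assuming the same example embedding model as above and that vertexai.init() has already been called; related phrases should score noticeably higher than unrelated ones:

```python
import numpy as np
from vertexai.language_models import TextEmbeddingModel

model = TextEmbeddingModel.from_pretrained("textembedding-gecko@003")  # example model


def cosine(a, b) -> float:
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


vecs = [e.values for e in model.get_embeddings(
    ["reset the router", "restart the access point", "chocolate cake recipe"])]

print(cosine(vecs[0], vecs[1]))  # related phrases -> higher similarity
print(cosine(vecs[0], vecs[2]))  # unrelated phrases -> lower similarity
```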

All of this is managed by Vertex AI Search & Conversation, meaning that you only have to create the data stores and it will create and store the chunks (and their vectors) for you.
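
If you want to see what those managed chunks return for a given question, you can also query a data store directly with the Discovery Engine (Vertex AI Search) client; a sketch with placeholder project and data store IDs:

```python
from google.cloud import discoveryengine_v1 as discoveryengine

# Placeholder resource name; copy the exact path from your data store's details page.
serving_config = (
    "projects/my-project/locations/global/collections/default_collection/"
    "dataStores/my-datastore-id/servingConfigs/default_search"
)

client = discoveryengine.SearchServiceClient()
response = client.search(
    discoveryengine.SearchRequest(
        serving_config=serving_config,
        query="How do I enable feature A?",
        page_size=5,
    )
)

for result in response:  # iterates over the matching documents/chunks
    print(result.document.id)
```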

Hi @xavidop.

Thank you for your explanation, but there is still one point that is not clear to me. Is the structure of a document taken into account?

For example, let's assume a document with two sections, A and B, each describing a different feature and titled "Feature A" and "Feature B" respectively. Section A includes the terms "A1" and "A2", and section B includes the terms "B1" and "B2". Will the model learn that the vectors of "Feature A", "A1", and "A2" are closer to each other than to the vectors of "Feature B", "B1", and "B2"? Or, since all six vectors were generated from the same document, will they all be more or less equally close to each other?

Regards,

Amir

The document that contains the data does not itself matter at the time of vector creation; there will be two different, separate vectors for those sections.
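
If you want to sanity-check that yourself, a small experiment along the lines of the earlier similarity sketch (same example embedding model, hypothetical section text) shows that a question about feature A lands closer to section A's chunk than to section B's, regardless of both sections coming from one document:

```python
import numpy as np
from vertexai.language_models import TextEmbeddingModel

model = TextEmbeddingModel.from_pretrained("textembedding-gecko@003")  # example model


def cosine(a, b) -> float:
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Two sections of the *same* document, embedded as independent chunks.
section_a = "Feature A lets you configure the A1 and A2 thresholds for alerts."
section_b = "Feature B exports the B1 and B2 reports as CSV files."
query = "How do I change the A1 threshold?"

vec_a, vec_b, vec_q = (e.values for e in model.get_embeddings([section_a, section_b, query]))

# The comparison is chunk-by-chunk; the shared source document plays no role.
print("query vs section A:", cosine(vec_q, vec_a))
print("query vs section B:", cosine(vec_q, vec_b))
```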