I have created an R script that uses the text-embedding-004 model to get text embeddings from media articles. The objective is to use the embeddings to cluster / classify the articles, and the script that accesses the API and gets the embeddings works fine.
The problem I have is that many articles are longer than the 2,048 token limit. Right now I am truncating the text to prevent 400 errors, but I wonder if there is a way to ensure that all information is retained.
If one were to split a long article into, say, three equally sized portions, then get the embeddings for each portion and calculate the average value for every dimension, would this accurately portray the content of the article in its totality?
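To make the idea concrete, here is roughly what I have in mind. `get_embedding()` below is just a stand-in for my existing API call (text-embedding-004 returns 768-dimensional vectors), and the word-based split is only a rough proxy for the 2,048-token limit:

```r
# Stand-in for the real text-embedding-004 call; replace with the existing
# API wrapper. It just has to return one numeric vector per text.
get_embedding <- function(text) rnorm(768)

# Placeholder article text so the snippet runs on its own.
article_text <- paste(rep("Lorem ipsum dolor sit amet", 400), collapse = " ")

# Split an article into n roughly equal word-based portions.
split_article <- function(text, n = 3) {
  words <- unlist(strsplit(text, "\\s+"))
  idx   <- cut(seq_along(words), breaks = n, labels = FALSE)
  vapply(split(words, idx), paste, character(1), collapse = " ")
}

chunks     <- split_article(article_text, n = 3)
embeddings <- t(vapply(chunks, get_embedding, numeric(768)))  # one row per chunk
avg_vec    <- colMeans(embeddings)  # element-wise mean across the three chunks
```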
I have a suspicion that such a process will result in nonsense, but I don't know enough about text embeddings to be sure.
Thanks in advance!
Hi @AngusM_RSA,
Welcome to Google Cloud Community!
Text embeddings capture the semantic meaning of a text, with each embedding vector representing the nuances, structure, and word relationships of the entire text. When a text is divided into chunks, each chunk’s embedding reflects its individual context, but may not capture how it connects to the rest of the text. Averaging can be an issue because it collapses every chunk into a single point: a topic that dominates one chunk gets diluted by unrelated material in the others, and the ordering and flow of the article is lost entirely.

Here are a few strategies that may work better than just averaging embeddings (see the sketch after this list):

- Overlapping ("sliding window") chunking: let consecutive chunks share a margin of text so that no sentence sits on a hard boundary. This overlap ensures continuity between chunks.
- Max-pooling: take the maximum value per dimension across the chunk embeddings instead of the mean, which keeps strong signals from individual chunks from being averaged away.
- Concatenation: join the chunk embeddings into one longer vector to preserve the unique contributions of each chunk.
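Here is a minimal R sketch of overlapping chunking together with the pooling alternatives above. It assumes a hypothetical `get_embedding()` wrapper around your existing text-embedding-004 call, and it counts words as a rough stand-in for tokens, so adjust `chunk_size` and `overlap` to stay under the 2,048-token limit:

```r
get_embedding <- function(text) rnorm(768)  # stand-in for the real API wrapper
article_text  <- paste(rep("Lorem ipsum dolor sit amet", 400), collapse = " ")

# Sliding-window chunking: consecutive chunks share `overlap` words so that
# no passage sits on a hard boundary.
overlap_chunks <- function(text, chunk_size = 1200, overlap = 200) {
  words  <- unlist(strsplit(text, "\\s+"))
  step   <- chunk_size - overlap
  starts <- seq(1, max(1, length(words) - overlap), by = step)
  vapply(starts, function(s) {
    paste(words[s:min(s + chunk_size - 1, length(words))], collapse = " ")
  }, character(1))
}

chunks <- overlap_chunks(article_text)
emb    <- t(vapply(chunks, get_embedding, numeric(768)))  # rows = chunks

mean_pooled  <- colMeans(emb)        # the simple average, for comparison
max_pooled   <- apply(emb, 2, max)   # max-pooling: strongest signal per dimension
concatenated <- as.vector(t(emb))    # concatenation: length = n_chunks * 768
# Note: concatenated vectors differ in length across articles, which most
# clustering algorithms cannot use directly; pooled vectors stay fixed-length.
```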
Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.
Thanks for the comprehensive answer. As this is an automated and unsupervised process, I will go with what I see as the simplest solution: overlapped chunking. But I am certain that I will use your other suggestions for more complex / ad hoc texts.