
Text embedding on text exceeding limit - is dimension average valid

I have created an R script that uses the text-embedding-004 model to get text embeddings from media articles. The objective is to use the embeddings to cluster / classify the articles, and the script that accesses the API and gets the embeddings works fine.

The problem I have is that many articles are longer than the 2,048-token limit. Right now I am truncating the text to prevent 400 errors, but I wonder if there is a way to ensure that all information is retained.

Suppose one were to split up a long article into, say, three equally sized portions, get the embeddings for each portion, and calculate the average value for every dimension. Would this accurately portray the content of the article in its totality?
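Roughly, this is what I mean in R (get_embedding() below is just a stand-in for my existing API call, which returns the embedding as a numeric vector, and article_text holds the full article text):

    # Split the article into n roughly equal word groups and embed each group.
    split_into_n <- function(text, n = 3) {
      words  <- strsplit(text, "\\s+")[[1]]
      groups <- cut(seq_along(words), breaks = n, labels = FALSE)
      vapply(seq_len(n),
             function(i) paste(words[groups == i], collapse = " "),
             character(1))
    }

    chunks     <- split_into_n(article_text, n = 3)
    embeddings <- lapply(chunks, get_embedding)                 # one vector per chunk
    avg_vec    <- Reduce(`+`, embeddings) / length(embeddings)  # dimension-wise average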

I have a suspicion that such a process will result in nonsense, but I don't know enough about text embeddings to be sure.

Thanks in advance!

ACCEPTED SOLUTION

Hi @AngusM_RSA,

Welcome to Google Cloud Community!

Text embeddings capture the semantic meaning of a text, with each embedding vector representing the nuances, structure, and word relationships of the entire text. When a text is divided into chunks, each chunk’s embedding reflects its individual context, but may not capture how it connects to the rest of the text. Here are the reasons why averaging can be an issue:

  1. Loss of Context: Each embedding captures meaning from a specific part of the text. When you split the article into smaller portions and average the embeddings, you lose information about how different parts of the article relate to each other.
  2. Hierarchical Relationships: Articles often have a hierarchical structure: sections, paragraphs, and sentences that interact in complex ways. By averaging embeddings, you might flatten this structure, leading to a loss of important hierarchical and contextual information.
  3. Semantic Shift: The meaning of the article as a whole might differ from the sum of its parts. Averaging embeddings might obscure the full scope of the article's central message, especially when different sections deal with distinct topics or perspectives.

Here are a few strategies that may work better than just averaging embeddings:

  1. Chunking with Overlap: Rather than simply dividing the article into non-overlapping sections, consider splitting it into overlapping chunks. This approach helps maintain context between sections. For example:
  • Chunk 1: First 1,000 tokens
  • Chunk 2: Last 500 tokens of Chunk 1 + next 1,000 tokens
  • Chunk 3: Last 500 tokens of Chunk 2 + next 1,000 tokens

This overlap ensures continuity between chunks. Instead of averaging the embeddings for each chunk, explore alternative methods like max-pooling or concatenating the embeddings to preserve the unique contributions of each chunk (a rough R sketch of this appears after the list below).

  2. Using Summarization: Use a separate summarization model (or even a simple rule-based method) to create a shorter version of the article that fits within the token limit, then embed the summary. This keeps the main points without losing context and is especially helpful if the original article is long or repetitive. For more information, you can check this blog post.
  3. Sliding Window: A sliding window approach analyzes different sections of the article with some overlap and then combines the resulting embeddings (by averaging, max-pooling, or other statistical methods) into your representation. It is similar to chunking with overlap but can be applied more systematically, with different window sizes and overlaps, to maintain better continuity between sections.
  4. Hierarchical Embeddings: Some models and techniques are built to handle hierarchical information. For instance, you can generate embeddings at various levels (e.g., sentences, paragraphs, sections) and combine them using weighted averages or other aggregation methods. This approach helps capture both local and global meaning. You can also check this GitHub repository as a baseline for your troubleshooting.
  5. Embedding All Chunks and Clustering Individually: Instead of averaging the embeddings, you can generate embeddings for each chunk and cluster the chunks themselves. After clustering, you can use methods like document embeddings (e.g., doc2vec) or feed the clustered embeddings into a downstream model to create a representation for the entire article (see the second sketch below).
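To make the chunking-with-overlap idea (strategy 1) concrete, here is a rough R sketch rather than a definitive implementation. It assumes an article_text string and a get_embedding() helper that wraps your existing text-embedding-004 call and returns one numeric vector per input; word counts are used as a crude proxy for tokens, so keep chunk_size comfortably below the 2,048-token limit. Varying chunk_size and overlap also gives you the sliding-window behavior from strategy 3.

    # Split the text into overlapping word windows (word counts roughly proxy tokens).
    chunk_with_overlap <- function(text, chunk_size = 1000, overlap = 500) {
      words  <- strsplit(text, "\\s+")[[1]]
      step   <- chunk_size - overlap
      starts <- seq(1, max(1, length(words) - overlap), by = step)
      lapply(starts, function(s) {
        paste(words[s:min(s + chunk_size - 1, length(words))], collapse = " ")
      })
    }

    chunks  <- chunk_with_overlap(article_text)
    emb     <- lapply(chunks, get_embedding)   # one embedding vector per chunk
    emb_mat <- do.call(rbind, emb)             # chunks x dimensions matrix

    max_pooled   <- apply(emb_mat, 2, max)     # max-pooling per dimension
    concatenated <- as.vector(t(emb_mat))      # concatenation of all chunk vectors

Max-pooling keeps the strongest signal in each dimension across chunks, while concatenation preserves every chunk but multiplies the dimensionality, so it is only practical if every article yields the same number of chunks (or you pad or truncate to a fixed count).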
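And here is a minimal sketch of strategy 5, clustering the chunk embeddings themselves instead of averaging them. It reuses the assumed get_embedding() and chunk_with_overlap() helpers from the sketch above, plus an articles character vector holding one article per element; the "bag of clusters" rollup at the end is just one simple way to turn the clustered chunks back into a per-article vector, not the only option.

    # Embed every chunk of every article, cluster the chunks with k-means, then
    # describe each article by the share of its chunks falling into each cluster.
    all_chunks <- lapply(articles, chunk_with_overlap)            # list of chunk lists
    article_id <- rep(seq_along(all_chunks), lengths(all_chunks))

    chunk_emb  <- do.call(rbind, lapply(unlist(all_chunks), get_embedding))

    k   <- 20                                 # tune k (elbow / silhouette) in practice
    fit <- kmeans(chunk_emb, centers = k)     # kmeans() ships with base R's stats package

    # "Bag of clusters": one k-length vector of cluster proportions per article
    article_repr <- t(sapply(split(fit$cluster, article_id), function(cl) {
      tabulate(cl, nbins = k) / length(cl)
    }))

Each article then becomes a fixed-length vector regardless of how many chunks it produced, which slots straight into your existing clustering / classification step.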

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.


Thanks for the comprehensive answer. As this is an automated and unsupervised process, I will go with what I see as the simplest solution: overlapped chunking. But I am certain that I will use your other suggestions for more complex / ad hoc texts.