
LLMs vs. embeddings for Contextual Relevance Assessment

yogendray

In many real-world applications, we encounter tasks that sit at the intersection of semantic understanding and contextual judgment. One such common scenario is determining if a set of keywords truly aligns with a brand, its products, and its messaging (like a tagline). This isn't just about whether the words are synonyms; it's about relevance in a specific context. Does "eco-friendly" align well with a fast-fashion brand's new line, even if the tagline mentions "sustainable materials"? Does "high-performance" fit a budget-friendly product?

We call this challenge Contextual Relevance Assessment. It's a nuanced problem where both advanced Large Language Models (LLMs) and specialized embedding models present viable solutions. Getting this assessment wrong can have significant consequences. Imagine an e-commerce platform: if a user searches for "running shoes" and the system incorrectly flags hiking boots as highly relevant (a False Positive), the user gets irrelevant results, leading to frustration and potentially lost sales. Therefore, while overall accuracy is important, achieving high Precision (minimizing these False Positives) is often the paramount goal.

But which AI approach best delivers this precision under practical constraints? To explore this, we benchmarked Google's Gemini 1.5 Flash LLM against Google's text-embedding-004 and the open-source all-MiniLM-L6-v2 embedding models on Vertex AI.

Vertex AI is Google Cloud's unified machine learning platform, providing tools and infrastructure to build, train, deploy, and manage ML models efficiently at scale. The key takeaway here is that our comparison was conducted within a realistic, enterprise-grade environment often used for deploying such AI solutions.

The contenders: LLMs vs. embeddings

  • LLMs (Gemini 1.5 Flash): Excel at interpreting language with deep contextual understanding. Can potentially discern subtle relevance cues zero-shot. Hypothesized to achieve high precision by better understanding why something might not be relevant.
  • Embedding models (text-embedding-004, all-MiniLM-L6-v2): Represent text as numerical vectors for relevance judgment via distance or classifiers/verifiers. Promise significant efficiency gains (speed, cost) at inference. Success often depends on task formulation and fine tuning.

Dataset context & task

Our benchmark utilized a sample from the public Crowdflower "Search Results Relevance" dataset. This dataset contains search queries, product titles, product descriptions, and human-assigned relevance scores.

Example data rows:

| Search query | Product title | Product description | Score |
| --- | --- | --- | --- |
| tv | Crosley Newport 60-Inch Low Profile TV Stand with Two 60-Inch Audio Piers | Complete your entertainment space with the versatile Newport TV Stand and 2 Audio Piers. The classic cherry hand-rubbed, multi-step finish highlights the beauty of this media unit. The 2 audio piers have been designed to offer spacious storage for your media utilities. The raised panel doors open to give access to adjustable storage shelves. This TV stand with cord management system ensures tangle free arrangement of wires. It accommodates up to a 60" flat panel TV. The solid hardwood and veneer construction of the stand pier makes it durable. The adjustable levelers at the base legs offer comfortable placement. | 0 |
| galaxy note 3 | INSTEN Plain Checker Hard Plastic Slim Snap-on Phone Case Cover For Samsung Galaxy Note 3 | This is an INSTANT snap-on case for Samsung Galaxy Note 3. Not compatible with: Samsung Galaxy note, note ii, note iii neo. | 0 |
| duffle bag | McBrine P2705-BK 25 Inch Duffle Bag With Circular Top Opening- Black | Traditional gym/weekend travel bag. The duffle can be easily carried by its double handles or adjustable removable shoulder strap. With a wide opening main section and a number of exterior pockets making it easy to pack | 1 |
| sweater dress | Pink Angel Baby Girl 12M White Woven Short Sleeve Sweater Dress | About this item\n \nCustomer reviews\n \nItem recommendations\n \nPolicies\nAbout this item\nA stylish dress, perfect for any special occasion, from La Piccola Danza. This sleeveless white dress is COVERED in sequins. The sash at the waist ties in the front. white is fully lined. | 1 |
| snow boots | Sorel Tofino Boots | Step out in boots with runway style and winter smarts. High-traction soles, waterproof nylon and fleece lining. By Sorel®. Imported. Whole sizes 6 to 11. | 1 |

For our experiments, we normalized the original score into binary relevance (0 for irrelevant/marginally relevant, 1 for relevant/perfectly relevant). The core task for the models was: given the query, product_title, and product_description, predict this binary relevance score, with a focus on achieving high precision for the relevant (1) class.
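For illustration, here is a minimal sketch of that binarization step. It assumes the public dataset's 1-4 median_relevance rating and the column names from the original Kaggle CSV (both assumptions; adapt to your copy of the data):

```python
import pandas as pd

# Assumed columns from the public Crowdflower CSV: "query", "product_title",
# "product_description", and "median_relevance" (a 1-4 human rating).
df = pd.read_csv("train.csv")

# Assumed mapping: ratings 1-2 (irrelevant / marginally relevant) -> 0,
# ratings 3-4 (relevant / perfectly relevant) -> 1.
df["relevance"] = (df["median_relevance"] >= 3).astype(int)
```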

Input formatting:

  • For the LLM (Gemini 1.5 Flash), the product_title and product_description were combined into a structured "Product Details" block, and this block along with the search_query was presented within a specific prompt instructing the model to assess similarity and return 0 or 1.
  • For the embedding models, embeddings were generated separately for the combined product information (product_title + product_description) and the search_query. These embeddings were then used for comparison based on the chosen task (Similarity, Verification, Classification). A minimal sketch of both input paths follows after this list.
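To make the two input formats concrete, here is a minimal sketch using the Vertex AI Python SDK. The prompt wording, project ID, and function names are illustrative assumptions, not the exact benchmark configuration:

```python
import numpy as np
import vertexai
from vertexai.generative_models import GenerationConfig, GenerativeModel
from vertexai.language_models import TextEmbeddingInput, TextEmbeddingModel

vertexai.init(project="your-gcp-project", location="us-central1")  # placeholder project


def llm_relevance(search_query: str, product_title: str, product_description: str) -> str:
    """Ask Gemini 1.5 Flash for a binary relevance judgment (illustrative prompt)."""
    model = GenerativeModel("gemini-1.5-flash-002")
    prompt = (
        "Assess whether the product below is relevant to the search query.\n"
        f"Search query: {search_query}\n\n"
        "Product Details:\n"
        f"Title: {product_title}\n"
        f"Description: {product_description}\n\n"
        "Answer with a single digit: 1 if relevant, 0 if not."
    )
    config = GenerationConfig(temperature=0.0, top_p=0.95)
    return model.generate_content(prompt, generation_config=config).text.strip()


def embedding_relevance(search_query: str, product_title: str, product_description: str,
                        threshold: float = 0.6) -> int:
    """Score relevance via cosine similarity of text-embedding-004 vectors."""
    model = TextEmbeddingModel.from_pretrained("text-embedding-004")
    product_text = f"{product_title}. {product_description}"
    embs = model.get_embeddings([
        TextEmbeddingInput(text=search_query, task_type="FACT_VERIFICATION"),
        TextEmbeddingInput(text=product_text, task_type="FACT_VERIFICATION"),
    ])
    q, p = (np.array(e.values) for e in embs)
    cosine = float(q @ p / (np.linalg.norm(q) * np.linalg.norm(p)))
    return int(cosine >= threshold)
```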

We used 6704 samples for fine-tuning where applicable, and 1727 samples for testing all configurations.

Results & discussion: Unpacking the performance

Before looking at the specific results, let's clarify what the performance numbers in the table tell us, especially since our goal is high precision:

  • Accuracy: Overall percentage of correct predictions (both relevant and irrelevant). A basic measure of correctness.
  • Precision: This is crucial for our goal. Of all the items the model said were relevant, what percentage actually were relevant? High precision means fewer False Positives – minimizing user frustration from irrelevant results.
  • Recall: Of all the items that truly were relevant, what percentage did the model successfully identify? High recall means finding most of the good matches.
  • F1 Score: A single metric that balances Precision and Recall (their harmonic mean). A high F1 score indicates a good balance between finding relevant items and not flagging irrelevant ones.

How to read the table: Look for the models achieving the highest Precision and F1 score, as these best meet our objective of accurately identifying relevant items while strictly controlling for False Positives. (A short sketch of how these metrics are computed follows below.)
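As a quick illustration, these metrics can be computed with scikit-learn; the labels and predictions below are made up purely for demonstration:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# y_true: human relevance labels; y_pred: model predictions (illustrative values only)
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]

print("Accuracy :", accuracy_score(y_true, y_pred))   # overall fraction correct
print("Precision:", precision_score(y_true, y_pred))  # of predicted 1s, how many were truly relevant
print("Recall   :", recall_score(y_true, y_pred))     # of truly relevant items, how many were found
print("F1 score :", f1_score(y_true, y_pred))         # harmonic mean of precision and recall
```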

Summary Performance Table:

We evaluated various configurations. Here's a summary of the key results:

| Approach | Model | Notes | Training samples | Test samples | Accuracy (%) | Precision (%) | Recall (%) | F1 score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLM | 1.5-flash-002 | Foundational: Temperature = 1.0, Top P = 0.95 | 0 | 1727 | 84.25 | 89.8 | 89.9 | 0.90 |
| LLM | 1.5-flash-002 | Foundational: Temperature = 0, Top P = 0.95 | 0 | 1727 | 84.08 | 89.74 | 89.81 | 0.90 |
| LLM | 1.5-flash-002 | Finetuned: Temperature = 1.0, Top P = 0.95 | 6704 | 1727 | 87.15 | 93.3 | 90.48 | 0.92 |
| LLM | 1.5-flash-002 | Finetuned: Temperature = 0, Top P = 0.95 | 6704 | 1727 | 89.29 | 94.79 | 91.72 | 0.93 |
| Embedding | text-embedding-004 | Foundational: Threshold = 0.6, Task: Classification | 0 | 1727 | 80.8 | 83.7 | 93.6 | 0.88 |
| Embedding | text-embedding-004 | Foundational: Threshold = 0.6, Task: FACT_VERIFICATION | 0 | 1727 | 80.6 | 88.3 | 86.5 | 0.87 |
| Embedding | text-embedding-004 | Finetuned: Threshold = 0.6, Task: FACT_VERIFICATION | 6704 | 1727 | 81.3 | 88 | 87.9 | 0.88 |
| Embedding | all-MiniLM-L6-v2 | Out-of-box: Threshold = 0.6, Task: Classification | 0 | 1727 | 48.35 | 91.2 | 37.2 | 0.53 |
| Embedding | all-MiniLM-L6-v2 | Finetuned: Threshold = 0.6, Task: Classification, Loss: Contrastive Loss | 6704 | 1727 | 83.84 | 83.98 | 97.92 | 0.90 |
| Embedding | all-MiniLM-L6-v2 | Finetuned: Threshold = 0.75, Task: Classification, Loss: Contrastive Loss | 6704 | 1727 | 85.2 | 88.2 | 93.5 | 0.91 |


Thanks to Tanya Warrier for conducting many of the experiments detailed above.

1. Peak performance & the LLM advantage:

  • As shown in the table, the Finetuned Gemini 1.5 Flash (Temp=0) achieved the highest Precision (94.8%) and F1-score (0.93), making it the most accurate and reliable model in this benchmark for avoiding False Positives.
  • Setting Temperature=0 (deterministic output) consistently yielded better precision for the LLM compared to Temperature=1.0, reinforcing its suitability for fact-based relevance tasks.
  • Why the Edge? LLMs seem better equipped to grasp complex, contextual relationships. They move beyond simple similarity to understand implied meaning or user intent. For instance, an LLM might better distinguish that a query for "galaxy note 3" (the phone) is contextually different from a product listing for a "Case Cover For Samsung Galaxy Note 3" (an accessory), assigning relevance correctly (Score 0), while a simpler model might be misled by the strong keyword overlap. (This example reflects patterns seen in the data).

2. The crucial role of finetuning:

  • Finetuning markedly improved performance, demonstrating its value in adapting models to the specific contextual nuances of this task.
    • Gemini 1.5 Flash (Temp=0): Finetuning significantly boosted performance across the board, lifting Precision from a solid 89.7% (foundational) to the benchmark peak of 94.8%, and Accuracy from 84.1% to 89.3%.
    • all-MiniLM-L6-v2: Finetuning was transformative for this open-source embedding model. Consider its foundational performance using Classification with a threshold of 0.6: it achieved a misleadingly high Precision of 91.2%, but at the cost of extremely poor overall performance, with Accuracy at only 48.35% and an F1 score of 0.53, indicating it failed to identify most relevant items. In contrast, the finetuned model (threshold 0.75) achieved a strong, balanced result: 88.2% Precision, 85.2% Accuracy, and a 0.91 F1 score. While its precision is slightly lower than the inflated foundational figure, the finetuned model is vastly more useful because it correctly classifies far more relevant pairs. (A minimal finetuning sketch follows after this list.)
    • Implication: Adapting models via finetuning is often essential for unlocking high, reliable performance in contextual relevance tasks, especially for smaller, open-source embedding models which might otherwise struggle or produce misleading metrics with naive thresholding.
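As referenced above, here is a minimal sentence-transformers sketch of finetuning all-MiniLM-L6-v2 with Contrastive Loss on (query, product) pairs. The hyperparameters and output path are illustrative defaults, not the exact training configuration used in the benchmark:

```python
from sentence_transformers import InputExample, SentenceTransformer, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("all-MiniLM-L6-v2")

# Each training pair combines the search query with the product text,
# labelled 1 for relevant and 0 for irrelevant (the binarized dataset labels).
train_examples = [
    InputExample(texts=["duffle bag", "McBrine P2705-BK 25 Inch Duffle Bag ..."], label=1),
    InputExample(texts=["tv", "Crosley Newport 60-Inch Low Profile TV Stand ..."], label=0),
    # ... the remaining training pairs
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
# Contrastive Loss pulls relevant pairs together and pushes irrelevant pairs apart.
train_loss = losses.ContrastiveLoss(model=model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,            # illustrative; tune for your data
    warmup_steps=100,
)
model.save("minilm-relevance-finetuned")
```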

3. Embedding models: Efficiency meets performance (with caveats):

  • While the fine-tuned LLM led, embedding models proved highly capable and offered significant efficiency advantages (lower cost/latency at inference).
  • Text-embedding-004: Configuration is king
    • This model achieved excellent foundational performance (88.3% Precision, 0.87 F1) using the FACT_VERIFICATION task (Threshold = 0.6). This highlights the importance of choosing the right task formulation out of the box.
    • Finetuning surprise: Interestingly, finetuning this model did not improve its performance on our dataset (Precision remained ~88%). For robust foundational models like this, optimizing the task type and threshold may matter more than finetuning on moderately sized datasets.
  • All-MiniLM-L6-v2: Finetuning necessity & success
    • This open-source model required finetuning to become competitive.
    • Once finetuned (with Contrastive Loss, the Classification task, and Threshold = 0.75), it achieved 88.2% Precision and 0.91 F1, rivaling the foundational text-embedding-004 and offering a potentially highly efficient solution.
  • Embedding task formulation matters:
    • As seen with text-embedding-004, the choice of task (SEMANTIC_SIMILARITY, FACT_VERIFICATION, Classification) dramatically impacts performance. Understanding how these tasks frame the relevance question is key:
      • SEMANTIC_SIMILARITY: How close in meaning? (Often too broad.)
      • FACT_VERIFICATION: Does text A support or refute text B? (A good fit here.)
      • Classification: Learn a boundary between relevant/irrelevant pairs. (Effective, especially when finetuned.)
    • Careful selection of task type and threshold tuning is vital for optimizing embedding models; a minimal threshold-sweep sketch follows after this list.
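To illustrate that last point, here is a minimal threshold-sweep sketch. It assumes you already have cosine similarities between query and product embeddings plus binary labels for a held-out set; the function name and threshold grid are illustrative:

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score


def sweep_thresholds(cosine_scores, labels, thresholds=np.arange(0.50, 0.90, 0.05)):
    """Report Precision and F1 at each candidate threshold so the trade-off is explicit."""
    scores = np.asarray(cosine_scores)
    results = []
    for t in thresholds:
        preds = (scores >= t).astype(int)  # predict "relevant" when similarity clears the threshold
        results.append({
            "threshold": round(float(t), 2),
            "precision": precision_score(labels, preds, zero_division=0),
            "f1": f1_score(labels, preds, zero_division=0),
        })
    return results

# Example usage with toy values (illustrative only):
# for row in sweep_thresholds([0.81, 0.55, 0.73, 0.42], [1, 0, 1, 0]):
#     print(row)
```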

Making the choice: Precision, performance & practicality

  • Choose the LLM (specifically Finetuned Gemini 1.5 Flash) if:
    • Peak performance is required: You need the absolute highest precision (~95% in our tests) and overall accuracy (~89%). Finetuning Gemini 1.5 Flash (Temp=0) demonstrably provided the best results.
    • Contextual nuance is key: The task involves subtle distinctions (like differentiating a product from its accessory despite keyword overlap) where deeper language understanding provides a clear advantage.
    • Leveraging GCP Finetuning: You are using Google Cloud Platform. A standout benefit is that inference costs for a finetuned Gemini model are the same as the base model. This removes the typical cost barrier associated with using more powerful, customized models, making the fine-tuned Gemini 1.5 Flash an exceptionally attractive option for achieving top-tier performance without incurring higher operational expenses post-training.
  • Consider an embedding model if:
    • "Good enough" precision is acceptable (~88%) AND efficiency is critical: While not reaching the LLM's peak, embedding models offer a compelling balance. If slightly lower precision is acceptable, their potential for lower latency and computational footprint might be decisive, especially at massive scale outside of the GCP finetuning cost advantage.
    • Option A (Google Cloud native, foundational): text-embedding-004 using the FACT_VERIFICATION task (Threshold = 0.6) provides remarkably strong foundational precision (~88%) with minimal setup, ideal if finetuning resources are limited or initial results are sufficient.
    • Option B (open source, finetuned): all-MiniLM-L6-v2 becomes a strong contender after finetuning (~88% Precision, 0.91 F1), offering a highly efficient open-source path if you have the data and capability to finetune it effectively.

Conclusion: Finetuned Gemini leads, embeddings offer efficient alternatives

For the task of assessing contextual relevance between product information and search queries, our findings point to Finetuned Gemini 1.5 Flash as the clear winner in terms of performance, delivering superior precision and overall accuracy. The ability to achieve this state-of-the-art result is further amplified by a significant practical advantage on Google Cloud Platform: the inference cost for the finetuned model remains the same as the base model. This unique benefit makes leveraging the power of a customized LLM exceptionally compelling.

While the fine-tuned LLM sets the benchmark, well-configured embedding models remain valuable tools. Google's text-embedding-004 demonstrated impressive foundational performance when appropriately configured (using FACT_VERIFICATION), and open-source models like all-MiniLM-L6-v2 become highly competitive and efficient after finetuning.

Ultimately, while embeddings offer efficient alternatives, the combination of top-tier performance and the advantageous finetuning cost structure on GCP positions Finetuned Gemini 1.5 Flash as the premier choice for demanding contextual relevance assessment tasks where precision and deep understanding are paramount. The decision hinges on specific project requirements, but the path to achieving best-in-class results often leads through tailored LLMs.
