
VAIS:Retail - Catalog Ingestion best practices guide for optimized performance


Introduction

Vertex AI Search for retail data ingestion: the key to optimized search performance

Vertex AI Search for retail’s (VAIS:Retail) data ingestion pipeline encompasses both product catalog and user event data. This data stream provides the foundation for robust model training and continuous evaluation through feedback mechanisms. Accurate and complete data ingestion is not just a prerequisite, it's an ongoing process essential for maintaining the adaptability of the underlying models. This, in turn, directly influences the quality and relevance of search results, offering significant returns on investment.

Consider these data ingestion best practices when architecting your retail search solution to maximize efficiency and effectiveness. Data ingestion spans two main areas: the product catalog and user events. This blog discusses only the product catalog.

Product Catalog ingestion

1. Bulk import, real-time streaming, or both?

VAIS:Retail offers two primary methods for catalog ingestion: bulk import and real-time streaming. This dual approach accommodates the diverse architectural needs of various customer backends. There's no requirement to exclusively choose one method; a hybrid ingestion mode can be employed, leveraging both bulk import and streaming updates based on specific requirements.

Bulk imports are ideal when dealing with large-scale additions, deletions or updates to thousands of products at once. In contrast, real-time streaming excels when continuous updates are needed for a relatively smaller volume of products. The choice between these methods hinges on the nature of your product catalog, the frequency of updates, and the overall architecture of your backend systems.

Leveraging BigQuery for efficient bulk import

VAIS:Retail's bulk import functionality supports three distinct data sources: BigQuery, Google Cloud Storage, and inline data. For extensive catalogs, inline imports may not be the most scalable option due to size limitations, thus reserving their use for minor updates or experimental testing.

While Google Cloud Storage offers a viable alternative, it necessitates adherence to specific formats (e.g., JSON Lines) and file restrictions. Users are responsible for managing bucket structures, file chunking, and other aspects of the import process. Furthermore, directly editing the catalog within Google Cloud Storage can be cumbersome, and while potentially cost-effective, it lacks the flexibility of other methods.

BigQuery emerges as a compelling choice for numerous reasons. It facilitates easy modification of catalog data, enables the specification of partition dates during import, and allows for efficient data transformation through SQL queries. This empowers users to prepare their data seamlessly before ingestion.
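As an illustration, here is a minimal sketch of a bulk import from BigQuery using the retail_v2 Python client. The project, dataset, and table names are hypothetical, and the table is assumed to already follow the Retail product schema:

from google.cloud import retail_v2

def import_catalog_from_bigquery() -> None:
    client = retail_v2.ProductServiceClient()

    request = retail_v2.ImportProductsRequest(
        # Default catalog, branch 0, of a hypothetical project.
        parent=(
            "projects/my-project/locations/global"
            "/catalogs/default_catalog/branches/0"
        ),
        input_config=retail_v2.ProductInputConfig(
            big_query_source=retail_v2.BigQuerySource(
                project_id="my-project",
                dataset_id="retail_dataset",
                table_id="products",
                # "product" indicates the table already uses the Retail schema.
                data_schema="product",
            )
        ),
        # INCREMENTAL creates or updates products without deleting others.
        reconciliation_mode=(
            retail_v2.ImportProductsRequest.ReconciliationMode.INCREMENTAL
        ),
    )

    # import_products is a long-running operation; wait for it to finish.
    operation = client.import_products(request=request)
    print(operation.result())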

Hybrid approach to bulk imports and real-time streaming

For scenarios involving a high volume of product catalog updates (thousands of product changes, additions, or deletions) within a short timeframe and at regular intervals, a combined approach of bulk imports and real-time streaming can be highly effective. Stage the updates in BigQuery or Google Cloud Storage, then perform incremental bulk imports at regular intervals, such as every hour or two. This method efficiently manages large-scale updates while minimizing disruptions.

For smaller, less frequent updates, or those requiring immediate reflection in the catalog, leverage the real-time streaming API. In the hybrid approach, real-time streaming can fill the gaps between bulk imports, ensuring your catalog remains current. This strategy strikes a balance between making individual REST API calls (for patching products) and performing bulk changes, optimizing both efficiency and responsiveness in your VAIS:Retail catalog management.
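For the streaming side of the hybrid approach, a single product can be patched with the update API. The following is a minimal sketch, assuming a hypothetical project and product ID, that changes only the price of an existing product:

from google.cloud import retail_v2
from google.protobuf import field_mask_pb2

def patch_product_price(product_id: str, new_price: float) -> None:
    client = retail_v2.ProductServiceClient()

    product = retail_v2.Product(
        name=(
            "projects/my-project/locations/global/catalogs/default_catalog"
            f"/branches/0/products/{product_id}"
        ),
        price_info=retail_v2.PriceInfo(price=new_price, currency_code="USD"),
    )

    request = retail_v2.UpdateProductRequest(
        product=product,
        # Touch only price_info; leave the rest of the product untouched.
        update_mask=field_mask_pb2.FieldMask(paths=["price_info"]),
        # Fail, rather than create, if the product does not exist yet.
        allow_missing=False,
    )
    client.update_product(request=request)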

2. Branching strategies for catalog management

To ensure a seamless user experience and consistent search results, it is strongly recommended to maintain a unified catalog within a single branch rather than having disparate catalogs across multiple branches. This practice streamlines catalog updates and reduces the risk of inconsistencies during branch switching.

Consider these common branching strategies for effective catalog management:

1. Single Branch Updates: Designate a "live" branch as the default and continuously update it as catalog changes occur. For bulk updates, leverage import functionality during periods of low traffic to minimize disruptions. Utilize streaming APIs for smaller, incremental updates or batch them into larger chunks for regular imports.

2. Branch Switching:

There are a couple of choices to manage different branches:

  1. Use branches for staging and verification
    1. Some retailers opt for a branch switching approach, where the catalog is updated within a non-live branch and then made the default (live) branch when ready for production. This enables preparation of the next day's catalog in advance. Updates can be made via bulk import or streaming to the non-live branch, ensuring a seamless transition during low-traffic times. (A sketch of promoting a staged branch follows this list.)
    2. The choice between these strategies depends on your specific requirements, update frequency, and infrastructure setup. Regardless of the chosen strategy, maintaining a unified catalog within a single branch is crucial for optimal performance and consistent search results in VAIS:Retail.
  2. Use branches for backups
    1. A single live branch focuses on continuous ingestion and processing of product updates to keep the VAIS:Retail index up to date in near real time.
    2. Another branch focuses on creating a daily snapshot of the transformed data in Retail Search, acting as a robust fallback mechanism in case of data corruption or issues with branch 0.
    3. A third branch focuses on creating a weekly snapshot of the transformed data. This way, the customer has a day-old backup and a week-old backup in different branches.
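The branch switch itself is a single API call. Here is a minimal sketch of promoting a staged branch to live with the retail_v2 Python client; the catalog path and branch ID are hypothetical:

from google.cloud import retail_v2

def switch_live_branch(branch_id: str) -> None:
    client = retail_v2.CatalogServiceClient()

    client.set_default_branch(
        request=retail_v2.SetDefaultBranchRequest(
            catalog=(
                "projects/my-project/locations/global/catalogs/default_catalog"
            ),
            branch_id=branch_id,
            note="Promoting staged catalog to live",  # optional audit note
        )
    )

# For example, during a low-traffic window:
# switch_live_branch("1")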

3. Inventory updates in VAIS:Retail

Real-time streaming option
  • For dynamic data such as inventory information (price, availability) and store-level details (fulfillment status, store-specific pricing, etc.), real-time streaming is the sole option within VAIS:Retail.
  • This distinction arises due to the high-frequency nature of inventory fluctuations compared to the relatively static product catalog data. Product availability, for instance, can change multiple times daily, while descriptions or attributes remain relatively constant.
  • The frequency of store-level updates further amplifies with the number of retail locations.
Asynchronous updates
  • To accommodate this rapid pace of change, VAIS:Retail employs asynchronous inventory updates via APIs that return a job ID.
  • The update process is not considered complete until the job status is polled and confirmed, potentially introducing a minor delay ranging from seconds to minutes.
Out-of-order updates
  • A notable feature of this system is the ability to update inventory information before the corresponding product is ingested into the catalog. This addresses the common scenario where inventory and product data pipelines operate independently within retailers, sometimes leading to inventory information becoming available before the product catalog is updated. When updating inventory, use the allowMissing option to handle this out-of-order arrival of inventory relative to the product (see the sketch after this list).
  • By allowing inventory updates to precede catalog ingestion, VAIS:Retail accommodates these pipeline discrepancies, ensuring accurate inventory data is available even for newly introduced products.
  • However, it's important to note that inventory information for a product is retained for 24 hours and will be purged if a matching product is not ingested within that window. This mechanism ensures data consistency and prevents outdated inventory information from persisting in the system.
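A minimal sketch of such an inventory update follows, using set_inventory with allow_missing so the update is retained (for up to 24 hours) even if the product has not been ingested yet. The project and product ID are hypothetical:

from google.cloud import retail_v2
from google.protobuf import field_mask_pb2

def update_inventory(product_id: str, price: float) -> None:
    client = retail_v2.ProductServiceClient()

    inventory = retail_v2.Product(
        name=(
            "projects/my-project/locations/global/catalogs/default_catalog"
            f"/branches/0/products/{product_id}"
        ),
        price_info=retail_v2.PriceInfo(price=price, currency_code="USD"),
        availability=retail_v2.Product.Availability.IN_STOCK,
    )

    request = retail_v2.SetInventoryRequest(
        inventory=inventory,
        # Update only the inventory fields named here.
        set_mask=field_mask_pb2.FieldMask(paths=["price_info", "availability"]),
        # Tolerate the product arriving after this inventory update.
        allow_missing=True,
    )

    # set_inventory is asynchronous: it returns a long-running operation
    # whose status must be polled before the update is considered complete.
    operation = client.set_inventory(request=request)
    operation.result()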

4. Product catalog pre-checks for robust A/B testing in VAIS:Retail

Ensuring consistent catalog update parity
  • In preparation for an A/B test within VAIS:Retail, maintaining strict parity between the legacy (control) catalog and the VAIS:Retail (test) catalog is crucial. Any imbalances between the two can negatively impact the A/B test, leading to skewed observations and potentially invalid results. For instance, inconsistencies in product availability, pricing, or even minor attribute discrepancies can introduce unintended biases into the test data.
  • To mitigate this risk, it's imperative to design a parallel update process for both the control and test catalogs, avoiding sequential updates whenever feasible. The goal is to maximize the time during which both catalogs are in sync. Serial updates, on the other hand, can introduce delays in one lane or the other. These delays can result in temporary catalog mismatches, where a product may be in stock in one catalog but not the other, or where a newly added product appears in one catalog sooner than the other. Such disparities can significantly influence user behavior, clicks, and purchases, ultimately leading to an unfair comparison and inaccurate A/B test outcomes.
  • By prioritizing parallel updates and striving for consistent catalog parity, retailers can ensure a level playing field for A/B testing within VAIS:Retail. This approach enables unbiased and fair analysis of the test results, leading to more reliable insights and informed decision-making.
Achieving catalog data parity
  • The depth and accuracy of a retail search model's product comprehension hinges on the richness and quality of its underlying product catalog information. In essence, the more comprehensive the product data within the catalog, the better equipped the model is to understand and classify products effectively.
  • Therefore, in preparation for A/B testing, it's imperative to ensure that the product data uploaded to both the legacy (control) catalog and the VAIS:Retail (test) catalog are identical. Any discrepancies in product information between these two environments can significantly bias the A/B test results.
  • For instance, if the legacy search engine benefits from a richer or more extensive catalog compared to VAIS:Retail, this creates an unfair advantage. Missing information in the VAIS:Retail catalog could be critical for product understanding and classification, potentially leading to inaccurate search results and misleading performance comparisons. Detecting such disparities can be challenging with external tools and often requires meticulous manual inspection of both catalogs.
  • By diligently ensuring that both catalogs contain the same product data with the same level of detail, retailers can create a level playing field for A/B testing in VAIS:Retail. This approach fosters a fair and unbiased comparison of the two search engines, facilitating accurate evaluation of their respective performance and capabilities.

5. Disaster recovery planning for ensuring resiliency and business continuity

In the realm of backend engineering, a comprehensive Disaster Recovery (DR) plan is paramount, particularly for retail-focused subsystems like VAIS:Retail. This plan should address the potential failure of catalog and user event ingestion pipelines, regardless of the underlying cause. In such scenarios, the ability to swiftly restore the catalog to a functional state is critical.

Leveraging BigQuery for data staging offers a distinct advantage in disaster recovery. If the current catalog or user event data within VAIS:Retail is not significantly different from the most recent snapshot stored in BigQuery, a simple import API call can initiate a rapid restoration. This approach minimizes downtime and ensures the search functionality remains operational.
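For instance, if daily snapshots are kept in a date-partitioned BigQuery table, a restore can target the last known-good day. The following is a minimal sketch, assuming a hypothetical snapshot table; FULL reconciliation replaces the branch contents with the snapshot:

from google.cloud import retail_v2
from google.type import date_pb2

def restore_catalog_from_snapshot(year: int, month: int, day: int) -> None:
    client = retail_v2.ProductServiceClient()

    request = retail_v2.ImportProductsRequest(
        parent=(
            "projects/my-project/locations/global"
            "/catalogs/default_catalog/branches/0"
        ),
        input_config=retail_v2.ProductInputConfig(
            big_query_source=retail_v2.BigQuerySource(
                project_id="my-project",
                dataset_id="retail_dataset",
                table_id="products_snapshot",
                data_schema="product",
                # Import only the partition for the last known-good day.
                partition_date=date_pb2.Date(year=year, month=month, day=day),
            )
        ),
        # FULL makes the branch match the snapshot exactly.
        reconciliation_mode=(
            retail_v2.ImportProductsRequest.ReconciliationMode.FULL
        ),
    )
    client.import_products(request=request).result()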

Conversely, if BigQuery is not integrated into your data pipeline, alternative mechanisms must be in place to expeditiously reload the catalog from a known good state. These mechanisms might involve backup systems, data replication, or other failover strategies.

By incorporating these disaster recovery considerations into your VAIS:Retail architecture, you can bolster the system's robustness and maintain business continuity even in the face of unexpected disruptions. A well-prepared DR plan ensures that your retail search capabilities remain operational and responsive, minimizing the impact on customer experience and revenue generation.

Active/Active Design or Active/Passive Design

Multiple components used in the architecture can be configured to be multi-regional by default. For example, when leveraging BigQuery for persisting data, it can be configured to be multi-regional so that data redundancy and availability are automatically handled by Google Cloud. Similarly, when using Google Cloud Storage, buckets can be configured to be multi-regional.

Other components like Pub/Sub and the Retail API provide less flexibility for regional configuration and are typically globally available. These components provide global access, with Google Cloud responsible for their availability as per their individual SLAs.

However, there are components in the design that are regional constructs. For example, if Dataflow is leveraged to transform and ingest data into VAIS:Retail, then those components need to be DR capable. Customers can either deploy Dataflow jobs in multiple regions and have all the instances work actively, or have only the instances in one region process the input while the other Dataflow instances stay passive.

To achieve a continuously available setup, the data ingested from the Pub/Sub topic should carry header information in each message that can be leveraged to filter on a particular Pub/Sub subscription. For example, when sending a product update as a message into the Pub/Sub topic, it can also include an attribute like key region with value us-central1 or us-east1. In an active/active design we do not want the same message to be processed in multiple regions; we instead want a single message to be processed exactly once, in a single region. By leveraging attributes on the message, the subscription for each region can be configured to filter on the region attribute.

In case of a regional failure, the upstream system should simply switch the attribute value being passed into the Pub/Sub message to a region where there isn't any failure. For example, if us-central1 has a failure, then all new messages should only carry key region with value us-east1. This way all messages will be filtered to the us-east1 region. A sketch of this pattern follows.
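Here is a minimal sketch of this pattern with the Pub/Sub Python client; the topic and subscription names are hypothetical:

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()

topic_path = publisher.topic_path("my-project", "product-updates")

# Route this product update to the us-central1 pipeline via an attribute.
publisher.publish(topic_path, b'{"id": "sku-123"}', region="us-central1")

# One-time setup: a subscription that only receives us-central1 traffic.
# (Filters can only be set when the subscription is created.)
subscriber.create_subscription(
    request={
        "name": subscriber.subscription_path(
            "my-project", "updates-us-central1"
        ),
        "topic": topic_path,
        "filter": 'attributes.region = "us-central1"',
    }
)

Failing over then requires no change on the subscriber side: the upstream publisher simply starts stamping messages with the healthy region's value.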

In an active/passive design, only one subscription should be attached, and the rest of the subscriptions should be marked as detached. This way all messages are processed by a single subscription. In case of a regional failure, simply detach that region's subscription and attach another subscription used by Dataflow in a different region.
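Here is a minimal sketch of that failover step, assuming the hypothetical subscription names from the previous sketch. Note that a detached subscription cannot be reattached, so in practice failing back means creating a fresh subscription for that region:

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()

# Stop message delivery to the failed region's Dataflow pipeline.
publisher.detach_subscription(
    request={
        "subscription": "projects/my-project/subscriptions/updates-us-central1"
    }
)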

Resilience and forensics

Leveraging BigQuery in the data ingestion design can provide resiliency as well as the capability for forensics and debugging.

Products and inventory ingested directly with the patch and addLocalInventory APIs are fire-and-forget in nature. This implies that once the data is sent to VAIS:Retail, there isn't any trail of the product or inventory update left.

A customer may want to know why a certain product is not showing up as expected. A staging area built on BigQuery, with a complete history of the data, makes forensics and debugging of those types of questions quite easy.

Reference architecture
  • In this architecture, data ingestion would typically have Raw, Curated, and Consumption stages, all built on BigQuery. The system would move data between the stages using Dataflow and orchestrate and automate all of this using Cloud Workflows.
  • The system would take raw data as-is and time-tag it to maintain history. This data is unchanged, so customers can consider it the source of truth.
  • The system would then transform the data into the curated stage and time-tag it again. This way customers would know when it was transformed and whether anything failed.
  • Finally, the system would create views in the consumption stage on the curated data, using the time tags applied earlier. This way the customer would know exactly which transformed data is supposed to be ingested into VAIS:Retail. (A sketch of such a consumption view follows this list.)
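A minimal sketch of such a consumption view, assuming hypothetical dataset and table names and an ingestedTimestamp time tag on the curated table:

from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# The view always exposes the rows from the latest curated time tag,
# i.e. exactly the data meant for the next VAIS:Retail import.
view = bigquery.Table("my-project.retail_dataset.products_consumption")
view.view_query = """
    SELECT * EXCEPT (ingestedTimestamp)
    FROM `my-project.retail_dataset.products_curated`
    WHERE ingestedTimestamp = (
        SELECT MAX(ingestedTimestamp)
        FROM `my-project.retail_dataset.products_curated`
    )
"""
client.create_table(view, exists_ok=True)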

Branch 0, branch 1, and branch 2 serve as the live, day-old backup, and week-old backup branches, respectively. Data ingested directly into branch 0 gets aggregated and indexed into branch 1 daily and into branch 2 weekly. This way any corruption of data can be easily rolled back, enhancing business continuity and system resilience.

Furthermore, analysis and debugging are possible because the entire history and lineage of the data is maintained in global BigQuery datasets.

6. Catalog ingestion in VAIS:Retail: planning for corner cases and future-proofing your data pipeline

Once the core mechanisms for catalog ingestion in VAIS:Retail are established, a proactive approach involves assessing their resilience against various corner cases. While some of these scenarios might not be immediately relevant to your specific business requirements, factoring them into your backend design can provide invaluable future-proofing.

This preparatory step entails reviewing your data pipeline's ability to handle unexpected or edge-case scenarios, ensuring its robustness and adaptability to evolving demands. By anticipating potential challenges and addressing them proactively, you can mitigate future disruptions and maintain the seamless flow of product data into your retail search system.

To achieve this, the Dataflow logic should be built such that it does the following (a validation sketch appears after this list):

  1. Validates each item of the raw data against a proper schema. The contract of the raw data should be determined upfront, and every data element should always be matched against the contract. In case of validation failure, the raw data element should be time-tagged and persisted in the BigQuery failed-raw tables with the actual errors, for forensics.
    Examples of such failures could be:
    1. An attribute that is not part of the contract suddenly appears in the raw data element
    2. A mandatory attribute is not present in the raw data element, etc.
  2. Validates each item of the raw data for transformation into the VAIS:Retail format. There are some mandatory fields required by VAIS:Retail for product ingestion. Every element of the raw data should be checked again to confirm it can be successfully transformed into the VAIS:Retail schema format. In case of transformation failure, the raw data element should be time-tagged and persisted in the BigQuery failed-curated tables with the actual error messages that can assist with forensics.
    Examples of such failures could be:
    1. An attribute like price cannot be formatted into a number because the raw data element contains it as an alphanumeric string
    2. The name of the product is completely missing, etc.
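The following is a minimal sketch of both validation steps as an Apache Beam DoFn, of the kind a Dataflow pipeline might run. The contract, field names, and output tags are hypothetical; failed elements are time-tagged and routed to dead-letter outputs destined for the failed-raw and failed-curated BigQuery tables:

import json
import time

import apache_beam as beam

REQUIRED_FIELDS = {"id", "title", "price"}  # hypothetical raw-data contract

class ValidateAndTransform(beam.DoFn):
    def process(self, raw: bytes):
        now = time.time()
        # Step 1: validate the element against the raw-data contract.
        try:
            item = json.loads(raw)
            missing = REQUIRED_FIELDS - item.keys()
            if missing:
                raise ValueError(f"missing mandatory attributes: {sorted(missing)}")
        except Exception as e:
            yield beam.pvalue.TaggedOutput(
                "failed_raw",
                {
                    "ingestedTimestamp": now,
                    "payloadString": raw.decode("utf-8", errors="replace"),
                    "errorMessage": str(e),
                },
            )
            return
        # Step 2: transform the element into the VAIS:Retail product format.
        try:
            yield {
                "id": str(item["id"]),
                "title": item["title"],
                "priceInfo": {
                    "price": float(item["price"]),
                    "currencyCode": "USD",
                },
            }
        except Exception as e:
            yield beam.pvalue.TaggedOutput(
                "failed_curated",
                {
                    "ingestedTimestamp": now,
                    "payloadString": json.dumps(item),
                    "errorMessage": str(e),
                },
            )

# In the pipeline, split the tagged outputs and write each dead-letter
# PCollection to its BigQuery failure table:
# results = messages | beam.ParDo(ValidateAndTransform()).with_outputs(
#     "failed_raw", "failed_curated", main="products")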

The following shows a sample BigQuery table schema to persist all failures for debugging:

[
    {
      "mode": "REQUIRED",
      "name": "ingestedTimestamp",
      "type": "TIMESTAMP"
    },
    {
      "mode": "REQUIRED",
      "name": "payloadString",
      "type": "STRING"
    },
    {
      "mode": "REQUIRED",
      "name": "payloadBytes",
      "type": "BYTES"
    },
    {
      "fields": [
        {
          "mode": "NULLABLE",
          "name": "key",
          "type": "STRING"
        },
        {
          "mode": "NULLABLE",
          "name": "value",
          "type": "STRING"
        }
      ],
      "mode": "REPEATED",
      "name": "attributes",
      "type": "RECORD"
    },
    {
      "mode": "NULLABLE",
      "name": "errorMessage",
      "type": "STRING"
    },
    {
      "mode": "NULLABLE",
      "name": "stacktrace",
      "type": "STRING"
    }
  ]  
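For illustration, a failure row matching this schema could be persisted from Python as follows; the table name and payload are hypothetical, and BYTES columns are passed base64-encoded:

from google.cloud import bigquery

client = bigquery.Client(project="my-project")

errors = client.insert_rows_json(
    "my-project.retail_dataset.failed_raw",
    [
        {
            "ingestedTimestamp": "2024-01-01T00:00:00Z",
            "payloadString": '{"id": "sku-123", "price": "abc"}',
            "payloadBytes": "eyJpZCI6ICJza3UtMTIzIn0=",  # base64 of the raw bytes
            "attributes": [{"key": "region", "value": "us-central1"}],
            "errorMessage": "price cannot be parsed as a number",
            "stacktrace": None,
        }
    ],
)
assert not errors, errors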

Stress testing and scalability: preparing for high-volume events and growth

High-traffic events (BFCM):

High-traffic events like Black Friday and Cyber Monday (BFCM) pose a significant challenge to data ingestion pipelines. The surge in inventory updates (stock levels, prices, etc.) and potential changes to product attributes demand robust infrastructure. It's crucial to assess whether your ingestion system can handle this increased load. Simulated load testing, replicating peak BFCM traffic patterns, is highly recommended to identify bottlenecks and ensure smooth operation during these critical periods.

Flash sales:

Flash sales introduce a unique challenge due to their short duration and rapid inventory fluctuations. Ensuring real-time inventory synchronization is paramount to prevent discrepancies between search results and actual availability. Failure to do so can lead to negative customer experiences, such as popular products appearing as "in stock" when they're actually sold out, or vice versa. Additionally, price changes during flash sales can significantly impact product ranking, highlighting the need for accurate and timely price updates in the search index.

Catalog expansion:

Business growth or product line expansions can result in a dramatic increase (e.g., 5x or 10x) in the number of products within your catalog. Your ingestion architecture must be scalable to accommodate this growth seamlessly. This may necessitate revisiting the entire ETL (Extract, Transform, Load) pipeline, particularly if new data sources or product information formats are introduced.

By proactively addressing these potential scenarios, you can ensure that your VAIS:Retail ingestion pipeline remains robust, scalable, and responsive, even in the face of sudden traffic spikes, flash sales, or significant catalog growth. This proactive approach safeguards the accuracy and reliability of your search results, contributing to a positive user experience and driving business success.

The data ingestion pipeline’s performance should be evaluated, and a baseline should be formed for the following metrics:

  1. How long does it take to publish and ingest the entire catalog and inventory data? This may be required on an ad hoc basis during BFCM, when prices can change significantly across the entire catalog.
  2. How long does a single product update take to be reflected?
  3. What is the highest rate of product and inventory updates that the system can churn through?
Bottlenecks:
  • Evaluate whether the pipelines are able to scale up and down correctly.
  • Determine if the maximum ceiling for the number of instances is too high or too low.
  • Determine if the system is getting rate limited by VAIS:Retail by checking for HTTP code 429 (a backoff sketch follows this list).
  • Confirm whether certain API quotas need to be increased to reduce rate limiting.
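For the 429 case, here is a minimal sketch of backing off instead of failing, using google.api_core's retry helper around a product update (the retry parameters shown are illustrative):

from google.api_core import exceptions, retry
from google.cloud import retail_v2

client = retail_v2.ProductServiceClient()

retry_on_429 = retry.Retry(
    predicate=retry.if_exception_type(exceptions.TooManyRequests),
    initial=1.0,     # first backoff, in seconds
    maximum=60.0,    # cap each backoff at one minute
    multiplier=2.0,  # exponential growth between attempts
    deadline=300.0,  # give up after five minutes overall
)

def resilient_update(request: retail_v2.UpdateProductRequest) -> retail_v2.Product:
    # The retry policy is applied per call, transparently re-invoking on 429.
    return client.update_product(request=request, retry=retry_on_429)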