Vertex AI Search for retail’s (VAIS:Retail) data ingestion pipeline encompasses both product catalog and user event data. This data stream provides the foundation for robust model training and continuous evaluation through feedback mechanisms. Accurate and complete data ingestion is not just a prerequisite; it is an ongoing process essential for maintaining the adaptability of the underlying models. This, in turn, directly influences the quality and relevance of search results, offering significant returns on investment.
Consider these data ingestion best practices when architecting your retail search solution to maximize efficiency and effectiveness. Data ingestion covers two main areas: the product catalog and user events. This blog discusses only the product catalog.
VAIS:Retail offers two primary methods for catalog ingestion: bulk import and real-time streaming. This dual approach accommodates the diverse architectural needs of various customer backends. There's no requirement to exclusively choose one method; a hybrid ingestion mode can be employed, leveraging both bulk import and streaming updates based on specific requirements.
Bulk imports are ideal when dealing with large-scale additions, deletions or updates to thousands of products at once. In contrast, real-time streaming excels when continuous updates are needed for a relatively smaller volume of products. The choice between these methods hinges on the nature of your product catalog, the frequency of updates, and the overall architecture of your backend systems.
VAIS:Retail's bulk import functionality supports three distinct data sources: BigQuery, Google Cloud Storage, and inline data. For extensive catalogs, inline imports may not be the most scalable option due to size limitations, thus reserving their use for minor updates or experimental testing.
While Google Cloud Storage offers a viable alternative, it necessitates adherence to specific formats (e.g., JSON Lines) and file restrictions. Users are responsible for managing bucket structures, file chunking, and other aspects of the import process. Furthermore, directly editing the catalog within Google Cloud Storage can be cumbersome, and while potentially cost-effective, it lacks the flexibility of other methods.
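For illustration, each line of a Cloud Storage import file is one JSON representation of a product. A minimal record might look like the following sketch (field names follow the Retail Product schema; the ID, URI, and values are hypothetical placeholders):

```json
{"id": "SKU-12345", "title": "Example cotton t-shirt", "categories": ["Apparel > Shirts"], "uri": "https://www.example.com/products/SKU-12345", "priceInfo": {"currencyCode": "USD", "price": 19.99, "originalPrice": 24.99}, "availability": "IN_STOCK"}
```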
BigQuery emerges as a compelling choice for numerous reasons. It facilitates easy modification of catalog data, enables the specification of partition dates during import, and allows for efficient data transformation through SQL queries. This empowers users to prepare their data seamlessly before ingestion.
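As a rough sketch of what a BigQuery-backed bulk import can look like with the google-cloud-retail Python client library (the project, catalog, dataset, and table names below are placeholders):

```python
from google.cloud import retail_v2

# Placeholder project, catalog, and branch names.
PARENT = "projects/my-project/locations/global/catalogs/default_catalog/branches/0"

client = retail_v2.ProductServiceClient()

request = retail_v2.ImportProductsRequest(
    parent=PARENT,
    input_config=retail_v2.ProductInputConfig(
        big_query_source=retail_v2.BigQuerySource(
            project_id="my-project",
            dataset_id="retail_staging",
            table_id="products",
            data_schema="product",  # table rows follow the Retail Product schema
        )
    ),
    # INCREMENTAL upserts products without touching the rest of the catalog.
    reconciliation_mode=retail_v2.ImportProductsRequest.ReconciliationMode.INCREMENTAL,
)

# import_products returns a long-running operation; wait for it to complete.
operation = client.import_products(request=request)
print(operation.result())
```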
For scenarios involving a high volume of product catalog updates (thousands of product changes, additions, or deletions) within a short timeframe and at regular intervals, a combined approach of bulk imports and real-time streaming can be highly effective. Stage the updates in BigQuery or Google Cloud Storage, then perform incremental bulk imports at regular intervals, such as every hour or two. This method efficiently manages large-scale updates while minimizing disruptions.
For smaller, less frequent updates, or those requiring immediate reflection in the catalog, leverage the real-time streaming API. In the hybrid approach, real-time streaming can fill the gaps between bulk imports, ensuring your catalog remains current. This strategy strikes a balance between making individual REST API calls (for patching products) and performing bulk changes, optimizing both efficiency and responsiveness in your VAIS:Retail catalog management.
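For the streaming path, a single product can be patched with an update call along these lines (the product name and field values are hypothetical):

```python
from google.cloud import retail_v2
from google.protobuf import field_mask_pb2

client = retail_v2.ProductServiceClient()

# Placeholder product resource name.
product_name = (
    "projects/my-project/locations/global/catalogs/default_catalog"
    "/branches/0/products/SKU-12345"
)

product = retail_v2.Product(
    name=product_name,
    title="Example cotton t-shirt",
    price_info=retail_v2.PriceInfo(price=19.99, currency_code="USD"),
)

# Patch only the fields named in the update mask; allow_missing upserts the
# product if it does not exist yet.
request = retail_v2.UpdateProductRequest(
    product=product,
    update_mask=field_mask_pb2.FieldMask(paths=["title", "price_info"]),
    allow_missing=True,
)
print(client.update_product(request=request).name)
```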
To ensure a seamless user experience and consistent search results, it is strongly recommended to maintain a unified catalog within a single branch rather than having disparate catalogs across multiple branches. This practice streamlines catalog updates and reduces the risk of inconsistencies during branch switching.
Consider these common branching strategies for effective catalog management:
1. Single Branch Updates: Designate a "live" branch as the default and continuously update it as catalog changes occur. For bulk updates, leverage import functionality during periods of low traffic to minimize disruptions. Utilize streaming APIs for smaller, incremental updates or batch them into larger chunks for regular imports.
2. Branch Switching: There are a couple of options for managing different branches. One is to import the updated catalog into a non-live branch and switch it to become the default ("live") branch once the import has been verified. Another is to keep older copies of the catalog in secondary branches that can be promoted if the live branch becomes corrupted.
In the realm of backend engineering, a comprehensive Disaster Recovery (DR) plan is paramount, particularly for retail-focused subsystems like VAIS:Retail. This plan should address the potential failure of catalog and user event ingestion pipelines, regardless of the underlying cause. In such scenarios, the ability to swiftly restore the catalog to a functional state is critical.
Leveraging BigQuery for data staging offers a distinct advantage in disaster recovery. If the current catalog or user event data within VAIS:Retail is not significantly different from the most recent snapshot stored in BigQuery, a simple import API call can initiate a rapid restoration. This approach minimizes downtime and ensures the search functionality remains operational.
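Assuming daily snapshots are kept in a time-partitioned BigQuery table, a restoration could be kicked off with an import call like the following sketch (resource names and the snapshot date are placeholders; the reconciliation mode can be chosen depending on whether deleted products also need to be reconciled):

```python
from google.cloud import retail_v2
from google.type import date_pb2

client = retail_v2.ProductServiceClient()

request = retail_v2.ImportProductsRequest(
    parent="projects/my-project/locations/global/catalogs/default_catalog/branches/0",
    input_config=retail_v2.ProductInputConfig(
        big_query_source=retail_v2.BigQuerySource(
            project_id="my-project",
            dataset_id="retail_staging",
            table_id="products_snapshot",
            data_schema="product",
            # Restore from the last known-good daily partition.
            partition_date=date_pb2.Date(year=2024, month=11, day=28),
        )
    ),
)

# The long-running operation rebuilds the catalog from the snapshot.
client.import_products(request=request).result()
```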
Conversely, if BigQuery is not integrated into your data pipeline, alternative mechanisms must be in place to expeditiously reload the catalog from a known good state. These mechanisms might involve backup systems, data replication, or other failover strategies.
By incorporating these disaster recovery considerations into your VAIS:Retail architecture, you can bolster the system's robustness and maintain business continuity even in the face of unexpected disruptions. A well-prepared DR plan ensures that your retail search capabilities remain operational and responsive, minimizing the impact on customer experience and revenue generation.
Multiple components used in the architecture can be configured to be multi-regional by default. For example, when leveraging BigQuery for persisting data, the dataset can be configured as multi-regional so that data redundancy and availability are handled automatically by Google Cloud. Similarly, when using Google Cloud Storage, buckets can be configured as multi-regional.
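For example, multi-region locations can be chosen when the staging resources are created; a short sketch with hypothetical project and resource names:

```python
from google.cloud import bigquery, storage

# Placeholder project and resource names.
bq_client = bigquery.Client(project="my-project")
dataset = bigquery.Dataset("my-project.retail_staging")
dataset.location = "US"  # "US" is a multi-region location
bq_client.create_dataset(dataset, exists_ok=True)

storage_client = storage.Client(project="my-project")
# Buckets created in the "US" multi-region get built-in geo-redundancy.
storage_client.create_bucket("my-retail-import-bucket", location="US")
```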
Other components, like Pub/Sub and the Retail API, provide less flexibility for regional configuration and are typically globally available. These components provide global access, with Google Cloud being responsible for their availability as per their individual SLAs.
However, there are components in the design that are regional constructs. For example, if Dataflow is leveraged to transform and ingest data into VAIS:Retail, then these components need to be DR capable. Customers can either deploy Dataflow jobs in multiple regions and have all instances work actively, or have instances in only one region process the input while the other Dataflow instances stay passive.
To achieve a continuously available setup, each message published to the Pub/Sub topic should carry attribute information that can be used to filter a particular Pub/Sub subscription. For example, when sending a product update as a message to the Pub/Sub topic, it can also include an attribute such as region=us-central1 or region=us-east1. In an active/active design we do not want the same message to be processed in multiple regions; we want each message to be processed exactly once, in a single region. By leveraging attributes on the message, the subscription for each region can be configured with a filter on the region attribute.
In case of a regional failure, the upstream system simply switches the attribute value set on Pub/Sub messages to the healthy region. For example, if us-central1 has a failure, all new messages should carry region=us-east1, so that every message is filtered to the us-east1 subscription.
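A sketch of this pattern with the Pub/Sub client library (the topic, subscription, and project names are hypothetical); note that a subscription's filter can only be set at creation time:

```python
from google.cloud import pubsub_v1

PROJECT = "my-project"  # placeholder project and resource names
TOPIC = f"projects/{PROJECT}/topics/product-updates"

# The us-central1 subscription only receives messages whose "region" attribute
# matches, so each product update is processed by exactly one regional pipeline.
subscriber = pubsub_v1.SubscriberClient()
subscriber.create_subscription(
    request={
        "name": f"projects/{PROJECT}/subscriptions/product-updates-us-central1",
        "topic": TOPIC,
        "filter": 'attributes.region = "us-central1"',
    }
)

# The upstream publisher stamps each message with the target region; on a
# regional failure it simply starts publishing with the surviving region instead.
publisher = pubsub_v1.PublisherClient()
publisher.publish(TOPIC, b'{"id": "SKU-12345"}', region="us-central1").result()
```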
In an active/passive design, only one subscription should be attached and the remaining subscriptions should be marked as detached, so that all messages are processed by a single subscription. In case of a regional failure, simply detach that subscription and attach the subscription used by Dataflow in another region.
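For the active/passive variant, failover can be performed by detaching the active region's subscription and (re)creating the passive region's subscription. Detached subscriptions cannot be re-attached, so the failed region's subscription is recreated once that region recovers. A sketch with hypothetical names:

```python
from google.cloud import pubsub_v1

PROJECT = "my-project"  # placeholder names
publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()

# Stop delivery to the failed region by detaching its subscription from the topic.
publisher.detach_subscription(
    request={
        "subscription": f"projects/{PROJECT}/subscriptions/product-updates-us-central1"
    }
)

# Bring the passive region online by (re)creating the subscription its Dataflow
# pipeline reads from.
subscriber.create_subscription(
    request={
        "name": f"projects/{PROJECT}/subscriptions/product-updates-us-east1",
        "topic": f"projects/{PROJECT}/topics/product-updates",
    }
)
```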
Leveraging BigQuery in the data ingestion design provides resiliency as well as the capability for forensics and debugging.
Products and inventory ingested directly with the patch and addLocalInventory APIs are fire-and-forget in nature: once the data is sent to VAIS:Retail, no trail of the product or inventory update remains.
A customer may want to know why a certain product is not showing up as expected. A staging area built on BigQuery that retains a complete history of the data makes forensics and debugging of such questions straightforward.
Branch 0, Branch 1, and Branch 2 serve as the live branch, a day-old backup, and a week-old backup, respectively. Data ingested directly into Branch 0 is aggregated and indexed into Branch 1 daily and into Branch 2 weekly. This way, any corruption of data can be easily rolled back, enhancing business continuity and system resilience.
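Rolling back in this setup amounts to pointing the default branch at a backup branch, for example (the catalog name is a placeholder):

```python
from google.cloud import retail_v2

client = retail_v2.CatalogServiceClient()

# Placeholder catalog name. Point the default ("live") branch at the day-old
# backup branch when branch 0 is found to contain corrupted data.
client.set_default_branch(
    request=retail_v2.SetDefaultBranchRequest(
        catalog="projects/my-project/locations/global/catalogs/default_catalog",
        branch_id="1",
        note="Rollback to day-old backup after corrupted import",
    )
)
```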
Furthermore, analysis and debugging are straightforward because the entire history and lineage of the data is maintained in global BigQuery datasets.
Once the core mechanisms for catalog ingestion in VAIS:Retail are established, a proactive approach involves assessing their resilience against various corner cases. While some of these scenarios might not be immediately relevant to your specific business requirements, factoring them into your backend design can provide invaluable future-proofing.
This preparatory step entails reviewing your data pipeline's ability to handle unexpected or edge-case scenarios, ensuring its robustness and adaptability to evolving demands. By anticipating potential challenges and addressing them proactively, you can mitigate future disruptions and maintain the seamless flow of product data into your retail search system.
To achieve this, the Dataflow logic should be built such that it validates incoming records and, instead of dropping records that fail transformation or ingestion, writes them to a dead-letter table for later inspection and replay.
The following shows a sample BigQuery table schema for persisting all failures for debugging:
[ { "mode": "REQUIRED", "name": "ingestedTimestamp", "type": "TIMESTAMP" }, { "mode": "REQUIRED", "name": "payloadString", "type": "STRING" }, { "mode": "REQUIRED", "name": "payloadBytes", "type": "BYTES" }, { "fields": [ { "mode": "NULLABLE", "name": "key", "type": "STRING" }, { "mode": "NULLABLE", "name": "value", "type": "STRING" } ], "mode": "REPEATED", "name": "attributes", "type": "RECORD" }, { "mode": "NULLABLE", "name": "errorMessage", "type": "STRING" }, { "mode": "NULLABLE", "name": "stacktrace", "type": "STRING" } ]
High-traffic events like Black Friday and Cyber Monday (BFCM) pose a significant challenge to data ingestion pipelines. The surge in inventory updates (stock levels, prices, etc.) and potential changes to product attributes demand robust infrastructure. It's crucial to assess whether your ingestion system can handle this increased load. Simulated load testing, replicating peak BFCM traffic patterns, is highly recommended to identify bottlenecks and ensure smooth operation during these critical periods.
Flash sales introduce a unique challenge due to their short duration and rapid inventory fluctuations. Ensuring real-time inventory synchronization is paramount to prevent discrepancies between search results and actual availability. Failure to do so can lead to negative customer experiences, such as popular products appearing as "in stock" when they're actually sold out, or vice versa. Additionally, price changes during flash sales can significantly impact product ranking, highlighting the need for accurate and timely price updates in the search index.
Business growth or product line expansions can result in a dramatic increase (e.g., 5x or 10x) in the number of products within your catalog. Your ingestion architecture must be scalable to accommodate this growth seamlessly. This may necessitate revisiting the entire ETL (Extract, Transform, Load) pipeline, particularly if new data sources or product information formats are introduced.
By proactively addressing these potential scenarios, you can ensure that your VAIS:Retail ingestion pipeline remains robust, scalable, and responsive, even in the face of sudden traffic spikes, flash sales, or significant catalog growth. This proactive approach safeguards the accuracy and reliability of your search results, contributing to a positive user experience and driving business success.
The data ingestion pipeline’s performance should be evaluated and a baseline established for the following metrics: