Hi @tararelan,
Welcome to Google Cloud Community!
There are several possible reasons for a discrepancy between the number of user events you try to import and the number actually ingested into Vertex AI.
Here's how you can approach this:
- Data Validation and Filtering:
- Incorrect Data Format: Vertex AI expects user events in a specific format (for example, newline-delimited JSON for Cloud Storage imports). Verify that your user event data adheres to the required format, since malformed records are skipped during import; see the validation sketch after this list.
- Data Filtering: Check if there are any filters applied during ingestion. Vertex AI allows for filtering based on specific criteria. Ensure these filters are not accidentally excluding a significant portion of your data.
- Schema Mismatch: Ensure that the schema you're using to ingest the data aligns perfectly with the data structure of your user events. A mismatch can lead to data being interpreted incorrectly or discarded.
- Duplicate Entries: If your data source contains duplicate events, Vertex AI might de-duplicate them during ingestion, so the ingested count will be lower than the raw count. The duplicate-counting sketch after this list can help you estimate how many events this accounts for.
- Connection and Bandwidth:
- Network Issues: Check for any network connectivity problems or bandwidth limitations that might hinder data transfer to Vertex AI.
- Rate Limiting: Vertex AI enforces quotas and rate limits on data ingestion. Ensure you're not exceeding these limits. If you're attempting to ingest a large volume of data quickly, you might need to adjust your ingestion strategy or use a batching approach (see the batched-import sketch after this list).
- Ingestion Process:
- Data Pipeline Errors: If you're using a data pipeline to ingest the data, investigate any errors or exceptions within the pipeline that could be silently dropping events before they ever reach Vertex AI.
- Ingestion Mode: Vertex AI supports various ingestion modes. Ensure that you're using the appropriate mode for your data volume and needs (e.g., batch, stream).
- Vertex AI Configuration:
- Data Retention Policy: Check whether the data retention policy on your Vertex AI resources is shorter than your expected data lifetime; events older than the retention window may already have been deleted.
- Data Sampling: If you're sampling data during ingestion, make sure the sampling rate isn't so low that it significantly reduces the number of events ingested.
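
To make a few of these checks concrete, here is a minimal pre-import validation sketch. It assumes newline-delimited JSON and a Retail-style user event schema where `eventType`, `visitorId`, and `eventTime` are required; the file name and field names are placeholders, so adjust them to whatever schema your Vertex AI surface actually expects.

```python
import json
from pathlib import Path

# Required fields under the Retail-style user event schema (an assumption;
# adjust to the schema your Vertex AI surface expects).
REQUIRED_FIELDS = ("eventType", "visitorId", "eventTime")

def validate_user_events(path: str) -> None:
    """Report lines that would likely be skipped during import."""
    total, bad = 0, 0
    for lineno, line in enumerate(Path(path).read_text().splitlines(), start=1):
        if not line.strip():
            continue
        total += 1
        try:
            event = json.loads(line)
        except json.JSONDecodeError as exc:
            bad += 1
            print(f"line {lineno}: invalid JSON ({exc})")
            continue
        missing = [f for f in REQUIRED_FIELDS if not event.get(f)]
        if missing:
            bad += 1
            print(f"line {lineno}: missing required field(s): {missing}")
    print(f"{total} events checked, {bad} likely to be rejected")

validate_user_events("user_events.json")  # placeholder path
```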
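To estimate how much de-duplication might explain, this sketch counts events that share the same (visitorId, eventType, eventTime) triple. Whether Vertex AI keys duplicates exactly this way is an assumption, but the count gives you an upper bound to compare against your ingestion gap.

```python
import json
from collections import Counter
from pathlib import Path

def count_duplicates(path: str) -> None:
    """Count events sharing the same (visitorId, eventType, eventTime) key."""
    keys = Counter()
    for line in Path(path).read_text().splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        key = (event.get("visitorId"), event.get("eventType"), event.get("eventTime"))
        keys[key] += 1
    dupes = sum(n - 1 for n in keys.values() if n > 1)
    print(f"{dupes} events are duplicates of an earlier event")

count_duplicates("user_events.json")  # placeholder path
```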
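And if quota pressure looks likely, a batching approach along these lines can help. This sketch assumes you're importing through the Retail API's user events endpoint with the google-cloud-retail Python client; PROJECT_ID, the batch size, and the pause are placeholders to tune against your own quotas, and if you're on a different Vertex AI surface the client calls will differ.

```python
import time

from google.cloud import retail_v2

client = retail_v2.UserEventServiceClient()
parent = "projects/PROJECT_ID/locations/global/catalogs/default_catalog"

def import_in_batches(events, batch_size=1000, pause_s=1.0):
    """Import a list of retail_v2.UserEvent objects in batches,
    pausing between calls to stay under ingestion quotas."""
    for start in range(0, len(events), batch_size):
        batch = events[start:start + batch_size]
        request = retail_v2.ImportUserEventsRequest(
            parent=parent,
            input_config=retail_v2.UserEventInputConfig(
                user_event_inline_source=retail_v2.UserEventInlineSource(
                    user_events=batch
                )
            ),
        )
        operation = client.import_user_events(request=request)
        result = operation.result()  # wait for the long-running operation
        meta = operation.metadata
        print(f"offset {start}: {meta.success_count} ok, {meta.failure_count} failed")
        for status in result.error_samples:  # a few sample rejection reasons
            print(f"  sample error: {status.message}")
        time.sleep(pause_s)  # crude pacing; tune to your quota
```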
To understand the specific reasons for the difference, you can:
- Review Logs: Check Vertex AI logs for error messages or warnings related to data ingestion; for imports, you can also capture per-record errors directly (see the sketch after this list).
- Verify Schema: Ensure that your data conforms to the expected schema in your dataset.
- Monitor Data Pipeline: Check the performance of your data pipeline and identify any bottlenecks or delays.
- Analyze Data Quality: Inspect your data for any potential issues that might cause filtering or rejection.
- Use the Vertex AI UI: The Vertex AI UI provides insights into data ingestion status and can help identify potential issues.
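
Assuming you're importing from Cloud Storage through the Retail API's user events endpoint, one particularly direct way to get at the reasons is to set errors_config on the import request, so that every rejected record, together with why it was rejected, is written to a GCS prefix you can inspect afterwards. A sketch, with PROJECT_ID and the bucket paths as placeholders:

```python
from google.cloud import retail_v2

client = retail_v2.UserEventServiceClient()

request = retail_v2.ImportUserEventsRequest(
    parent="projects/PROJECT_ID/locations/global/catalogs/default_catalog",
    input_config=retail_v2.UserEventInputConfig(
        gcs_source=retail_v2.GcsSource(
            input_uris=["gs://YOUR_BUCKET/user_events.json"],
            data_schema="user_event",
        )
    ),
    # Rejected records, with the reason each was rejected, are written
    # here as JSON files you can inspect after the import finishes.
    errors_config=retail_v2.ImportErrorsConfig(
        gcs_prefix="gs://YOUR_BUCKET/import_errors/"
    ),
)

operation = client.import_user_events(request=request)
result = operation.result()

meta = operation.metadata
print(f"{meta.success_count} events imported, {meta.failure_count} rejected")
for status in result.error_samples:
    print(f"sample error: {status.message}")
```

Comparing success_count and failure_count from the operation metadata against your source count should tell you whether the gap comes from rejected records or from something upstream in your pipeline.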
I hope the above information is helpful.