Vertex AI was unable to import data into dataset

  1. On the Datasets page, I create a new dataset (text classification -- multi-label).
  2. I choose "Upload import files from your computer", selecting my training data *.csv file (263 MB).
  3. Then, I wait a while: maybe 30 minutes to an hour.
  4. Half the time, I get an email like this:

"Vertex AI was unable to import data into dataset "[dataset]"

Hello Vertex AI Customer,

Due to an error, Vertex AI was unable to import data into dataset "[dataset]".

Additional Details: Operation State: Failed with errors

Resource Name: [resource]

Error Messages: Internal error occurred. Please retry in a few minutes. If you still experience errors, contact Vertex AI.

To view the error on Cloud Console, go back to your dataset using [link]

Sincerely, The Google Cloud AI Team

Following the link takes me back to the dataset import page, where it says "Unable to import data due to errors". Clicking "Details" shows the same error message as the email. It won't let me browse the dataset or train a new model, because the import failed.

What puzzles me is that with nearly identical datasets on the very same day, the import sometimes succeeds (fully or partially) and sometimes fails, which leads me to believe the file size, CSV formatting, and character encoding must be acceptable. Even so, I have tried:

  • Ensuring that the CSV formatting is correct, with quotes and escapes where necessary.
  • Scrubbing all characters outside the ASCII printable range (32-126) and trimming input.
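For reference, the scrubbing step is logically equivalent to the following (a Python sketch of the rule described above, not my actual C# code):

```python
import re

def scrub(text: str) -> str:
    """Replace any run of characters outside the ASCII printable
    range (32-126) with a single space, then trim leading and
    trailing whitespace. A sketch of the sanitization described
    above, not the actual implementation."""
    cleaned = re.sub(r"[^\x20-\x7e]+", " ", text)
    return cleaned.strip()

print(scrub("hello\nworld"))   # hello world
```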

I don't know what the problem is because the error message does not say anything meaningful. Any help is appreciated.


It is indeed vague, but this is commonly due to CSV formatting. I see that you have already checked your CSV; you can view the documentation for accepted CSV formats for AutoML (for your AutoML training use case) here. I would suggest filing a support case if the problem persists.

Also, as another reference, there is a community post with an answer to a similar inquiry here.

Thank you for the response. Just to be sure, I checked my CSV file against the formatting and data requirements for importing a multi-label text classification dataset into AutoML:

  • 217,032/1,000,000 training documents
  • 828/5,000 unique category labels
  • Each label applies to at least 10 documents.
  • Each document has at least one label.
  • The CSV file is 257 MB.
  • All sequences of characters within the text that fall outside the ASCII range '!' to '~' are replaced with a single space, so there are no newlines or special characters. Leading and trailing spaces are trimmed.
  • I am using C#'s CsvWriter class:
    • There is no header row.
    • The first column is always enclosed in double quotes, regardless of whether or not it contains a comma. The labels are integers and are never in quotes.
    • All double quotes within the text are escaped as two double quotes (" becomes "") as is standard in most CSV implementations.
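To illustrate the quoting rules above, each row is written roughly like this (a Python sketch of the layout I described, with a hypothetical `format_row` helper; the real file is produced by the C# writer):

```python
def format_row(text: str, labels: list[int]) -> str:
    """Sketch of the row layout described above: the text column is
    always double-quoted, embedded double quotes are doubled, and the
    integer labels follow unquoted. An assumed illustration, not the
    actual C# code."""
    quoted = '"' + text.replace('"', '""') + '"'
    return ",".join([quoted] + [str(label) for label in labels])

print(format_row('He said "hi", twice', [3, 17]))
# "He said ""hi"", twice",3,17
```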

My intuition is that the dataset may be too large. However, datasets of comparable or larger size have imported, at least partially, in the past: for example, a 263 MB file with 220,863 documents partially imported. Because some documents were lost, I had to manually remove labels on the dataset page that no longer applied to at least 10 documents before I could train a new model, but that is acceptable to me. That import produced many error messages like this (where IMPORTFILE is the CSV dataset import file):

Error: Unable to get storage client in 10 retries for element: for: gs://IMPORTFILE line X

As far as I can tell, there is nothing special about the lines where it fails. Perhaps something similar is happening behind the scenes when a dataset fails to import entirely. It looks like some kind of internal network error. I am still open to any ideas.