In Vertex AI I am updating an image dataset, thus:
from google.cloud import aiplatform
import_schema_uri = aiplatform.schema.dataset.ioformat.image.single_label_classification
dataset_id = "my_ds_id"
ds = aiplatform.ImageDataset(dataset_id)
ds.import_data(gcs_source=DATASET_PATH, import_schema_uri=import_schema_uri)
The images are uploaded to the dataset, but their labels are ignored and they are classed as Unlabeled. What am I doing wrong? TIA!
PS: the labels are in a CSV, like:
gs://path/to/file/barnacles.jpg,label1
which worked fine for the dataset creation.
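In case it helps, here is how I sanity-checked the CSV locally for format problems (stray headers, whitespace, extra columns). This is just a minimal sketch; labels.csv is a local copy of my import file, and the name is only an example:

import csv
from collections import Counter

# Every row should be exactly "gs://...,label" with no header
# and no stray whitespace around the label.
label_counts = Counter()
with open("labels.csv", newline="") as f:  # local copy of the import CSV
    for row in csv.reader(f):
        assert len(row) == 2, f"unexpected column count: {row}"
        uri, label = row
        assert uri.startswith("gs://"), f"not a GCS URI: {uri}"
        assert label == label.strip(), f"label has whitespace: {label!r}"
        label_counts[label] += 1

print(label_counts)  # e.g. Counter({'label1': 42, ...})

The file passes these checks, so the format itself looks fine.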
You could check this sample code to import data for image classification with a single label:
from google.cloud import aiplatform


def import_data_image_classification_single_label_sample(
    project: str,
    dataset_id: str,
    gcs_source_uri: str,
    location: str = "us-central1",
    api_endpoint: str = "us-central1-aiplatform.googleapis.com",
    timeout: int = 1800,
):
    # The AI Platform services require regional API endpoints.
    client_options = {"api_endpoint": api_endpoint}
    # Initialize client that will be used to create and send requests.
    # This client only needs to be created once, and can be reused for multiple requests.
    client = aiplatform.gapic.DatasetServiceClient(client_options=client_options)
    import_configs = [
        {
            "gcs_source": {"uris": [gcs_source_uri]},
            "import_schema_uri": "gs://google-cloud-aiplatform/schema/dataset/ioformat/image_classification_single_label_io_format_1.0.0.yaml",
        }
    ]
    name = client.dataset_path(project=project, location=location, dataset=dataset_id)
    response = client.import_data(name=name, import_configs=import_configs)
    print("Long running operation:", response.operation.name)
    import_data_response = response.result(timeout=timeout)
    print("import_data_response:", import_data_response)
Thanks, but exactly the same result.
From this TensorFlow blog post:
In addition to image files, we've provided a CSV file (all_data.csv) containing the image URIs and labels. We randomly split this data into two files, train_set.csv and eval_set.csv, with 90% data for training and 10% for eval, respectively.
gs://cloud-ml-data/img/flower_photos/dandelion/17388674711_6dca8a2e8b_n.jpg,dandelion
gs://cloud-ml-data/img/flower_photos/sunflowers/9555824387_32b151e9b0_m.jpg,sunflowers
gs://cloud-ml-data/img/flower_photos/daisy/14523675369_97c31d0b5b.jpg,daisy
gs://cloud-ml-data/img/flower_photos/roses/512578026_f6e6f2ad26.jpg,roses
gs://cloud-ml-data/img/flower_photos/tulips/497305666_b5d4348826_n.jpg,tulips
We also need a text file containing all the labels (dict.txt), which is used to sequentially map labels to internally used IDs. In this case, daisy would become ID 0 and tulips would become 4. If the label isn't in the file, it will be ignored from preprocessing and training.
daisy
dandelion
roses
sunflowers
tulips
Therefore, you need to create the dict.txt file, which contains all the labels used, as shown above.
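If it helps, here is a minimal sketch for generating dict.txt from the import CSV, assuming the two-column uri,label format shown above (file names are examples):

import csv

# Collect the distinct labels from the import CSV and write them
# one per line, sorted, so the label-to-ID mapping is stable
# (alphabetical order matches the blog's example: daisy=0 ... tulips=4).
labels = set()
with open("all_data.csv", newline="") as f:
    for uri, label in csv.reader(f):  # assumes exactly two columns per row
        labels.add(label)

with open("dict.txt", "w") as f:
    f.write("\n".join(sorted(labels)) + "\n")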
Thanks, but that is six years old and not a Vertex AI dataset.
Could you please raise a private thread in the issue tracker (referencing this question, as stated in the template) with the project ID, the job ID, and a sample of your input CSV data? (We don't need the entire file or any PII.)