Large Ingestion

All API calls in Nucleus have a 1-minute timeout. To support larger ingestions, the best option is to upload as large a batch as possible using our asynchronous API endpoints, which spin up distributed batch jobs to ingest your data.

  1. First, upload your images to a location accessible by Scale. If you are using non-public cloud storage for your images, make sure your data is accessible to Scale by following this guide.

You can quickly check that your images are readable to Scale by sending a single dataset item and printing the response. If you get no errors, proceed to the next step.

import nucleus

client = nucleus.NucleusClient(YOUR_SCALE_API_KEY)
dataset = client.create_dataset("TestDataset", is_scene=False)

accessible_url = "s3://your_example_bucket/your_example_key"
dataset_item = nucleus.DatasetItem(image_location=accessible_url, reference_id="test_image_id", metadata={})

# append() expects a list of DatasetItems
print(dataset.append([dataset_item]))

  2. Convert all of your data into the format expected by Nucleus. There are two classes of data to convert:

    1. Images and their associated metadata, which become DatasetItems
    2. Groundtruth and model predictions, which are almost identical, except that predictions have an additional confidence field.
      1. For more details on all annotation formats see Annotations
      2. For more details on all prediction formats see Predictions
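For instance, suppose your labels live in plain dicts (a hypothetical in-house format; the field names below are assumptions). A small converter can map each record to the keyword arguments that `nucleus.BoxAnnotation` and `nucleus.BoxPrediction` accept, with `confidence` as the one field unique to predictions. A minimal sketch under those assumptions:

```python
def to_box_kwargs(record, with_confidence=False):
    """Map a hypothetical raw label dict to BoxAnnotation/BoxPrediction kwargs."""
    x, y, width, height = record["bbox"]
    kwargs = {
        "label": record["class"],
        "x": x,
        "y": y,
        "width": width,
        "height": height,
        # Must match the reference_id of the corresponding DatasetItem.
        "reference_id": record["image_id"],
        "metadata": record.get("metadata", {}),
    }
    if with_confidence:
        # Predictions carry a confidence score; annotations do not.
        kwargs["confidence"] = record["score"]
    return kwargs

raw = {"class": "car", "bbox": [10, 20, 50, 40], "image_id": "test_image_id", "score": 0.92}
annotation_kwargs = to_box_kwargs(raw)                        # pass to nucleus.BoxAnnotation(**...)
prediction_kwargs = to_box_kwargs(raw, with_confidence=True)  # pass to nucleus.BoxPrediction(**...)
```

Whatever your raw format, the important invariant is that each annotation's or prediction's `reference_id` matches the `reference_id` of the DatasetItem it belongs to.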

  3. Make the following API calls to ingest all of your data. All of these endpoints return AsyncJob objects.
# Setup
import nucleus
client = nucleus.NucleusClient(YOUR_SCALE_API_KEY)
dataset = client.create_dataset("My New Dataset", is_scene=False)


# 1) Ingest images
dataset_item_ingest_job = dataset.append(dataset_items, asynchronous=True)
# Groundtruth and model predictions must be added AFTER item ingestion completes:
dataset_item_ingest_job.sleep_until_complete()

# 2) Ingest groundtruth
groundtruth_ingest_job = dataset.annotate(annotations, asynchronous=True)

# 3) Ingest model predictions
model = client.add_model(name="My First Model", reference_id="My-CNN")
model_prediction_ingest_job = dataset.upload_predictions(model, predictions, asynchronous=True)

groundtruth_ingest_job.sleep_until_complete()
model_prediction_ingest_job.sleep_until_complete()

# Only after ingestion completes, kick off comparison and metric calculation.
dataset.calculate_evaluation_metrics(model)
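`sleep_until_complete()` blocks until the job finishes. For very large jobs you may prefer to poll on your own schedule. Below is a minimal polling sketch; it takes any zero-argument status callable, so the exact `AsyncJob.status()` call and the shape of its return value are assumptions you should verify against the SDK:

```python
import time

def wait_for_job(get_status, poll_seconds=30, max_polls=120):
    """Poll a status callable until it reports a terminal state.

    get_status: zero-argument callable returning a status string,
                e.g. lambda: job.status()["status"] (verify against the SDK).
    """
    for _ in range(max_polls):
        status = get_status()
        if status in ("Completed", "Errored"):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("job did not reach a terminal state in time")

# Hypothetical usage with one of the jobs above:
# final = wait_for_job(lambda: dataset_item_ingest_job.status()["status"])
```

This keeps the waiting logic in your own code, where you can add logging or alerting between polls.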