Generating Synthetic Images with GPT-4 (DALL-E)
Note about AI labeling

Models used for labeling are trained on vast collections of manually labeled objects. They usually work great for common objects, but for uncommon objects, or objects that are brand new, e.g. newly launched commercial products, these models fall short. In these cases Cosmos Transfer offers a novel architecture, as label data from a 3D scene can be reused. This is due to the highly controllable nature of Cosmos Transfer, where selected aspects of the input signal are respected when a new variation of an Omniverse-created video sequence is generated, and this can be steered in a number of ways thanks to multimodal control signal input. In short, a new video clip is generated, for instance with a different background, but objects of interest stay at the same screen-space position as in the input clip. Bounding boxes therefore remain valid for both clips. Cosmos Transfer is not covered in this article, but it is an important feature to consider when choosing among the different Cosmos models.

Another option is video generation, such as OpenAI Sora, Google Veo 3 or any other high-performing video generator. Text, image and video input can prompt generation of video sequences. The advantage of generating video clips over still images is that we capture the objects of interest from many different angles with minimal effort. For object detection, these sequences can then be split into individual still frames before labeling. Generating training images with these models has the disadvantage of being hard to control: objects of interest tend to morph out of shape, and strict camera control is hard to achieve. Generating training images of novel objects, say a new commercial product, is also currently hard to accomplish, as these models have been trained for generalization, from movie stars eating pasta to presidents promoting unconventional vacation resorts. Some comparisons:
Feature | Manual | Sora/Veo | Omniverse Replicator | Cosmos Predict | Cosmos Transfer | Notes |
---|---|---|---|---|---|---|
Installation | N/A | N/A | Medium | Medium** | Medium** | **Requires GPU farm |
Initial domain customization | High - field work | Low | Medium | Low* | Medium* | *Cosmos needs post-training if out of foundation context, high effort |
Iteration effort | High | High* | Medium | Low | Low | *API not generally available June 2025 |
Variation effort | High | High* | Medium | Low | Low | *API not generally available June 2025, seed not accessible, negative prompt not available in Sora |
Photorealism | High | High | Medium | Medium* | Medium* | *14B models |
Suitability for novel objects | High | Low | Medium | Low | Medium | |
Automatic labeling quality | Medium | Medium | High* | Medium | High* | *Perfect labeling from Replicator semantic tagging |
A first person view from the perspective of a FPV quadcopter as it is flying over a harbor port. The FPV drone is slowly moving across the harbor, 20 meters above the ground, looking down at cargo containers, trucks and workers on the docks. The weather is cloudy, it is snowing lightly, the ground is covered in a 1 cm layer of snow. Wheel tracks are visible in the snow. Photorealistic

The API automatically returns a refined prompt; we can feed this back into the model (or an LLM) to iterate on scene detail:
In a breathtaking aerial journey, we soar above a bustling harbor, captured through the lens of a cutting-edge first-person view (FPV) quadcopter. The camera glides smoothly at a steady 20 meters above the ground, revealing a sprawling landscape of vibrant cargo containers stacked neatly on the docks, their bright colors contrasting against the deep blue of the water. Below, a lively scene unfolds as workers in high-visibility vests coordinate the loading and unloading of containers, while trucks and forklifts crisscross the area, their movements a testament to the port’s dynamic energy. The sun bathes the scene in a golden-hour glow, casting long shadows that dance across the ground, while the clear sky enhances the vivid hues of the containers and the shimmering water. This cinematic experience, enhanced by dynamic color grading and a steady, immersive perspective, invites viewers to feel the thrill of flight and the rhythm of the port’s industrious life.

Another slight prompt change can dramatically change the output:
View directly underneath a first-person quadcopter as it is flying over a harbor port. The view is slowly moving across the harbor, 10 meters above the ground, looking straight down at cargo containers, trucks and workers on the docks. The weather is cloudy, it is raining heavily, the ground is wet. Photorealistic

We can easily combine environmental factors that would require a lot of work if this were created as a 3D scene in Omniverse.
View directly underneath a first-person quadcopter as it is flying over a harbor port. The view is slowly moving across the harbor, 10 meters above the ground, looking straight down at cargo containers, trucks and workers on the docks. The weather is clear, it is midnight and dark. The ground and objects are illuminated by a full moon and powerful site flood lights. It is raining heavily, the ground is wet. Photorealistic

Extract stills
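The clips can be split into stills with ffmpeg's `fps` filter. A minimal helper that builds the extraction command; the output pattern and sampling rate here are our own choices, not taken from the article:

```python
def ffmpeg_extract_cmd(video_path: str, out_dir: str, fps: int = 4) -> list[str]:
    """Build an ffmpeg command that samples `fps` frames per second
    from the clip and writes each sampled frame as a numbered JPEG."""
    return [
        "ffmpeg", "-i", video_path,
        "-vf", f"fps={fps}",          # sample at the given rate
        "-qscale:v", "2",             # high JPEG quality
        f"{out_dir}/frame_%05d.jpg",  # numbered output files
    ]
```

JPEG output conveniently matches what the SAM2 video predictor expects later in the pipeline.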
AI Labeling using OWL-ViT
Model | Required GPU VRAM |
---|---|
Cosmos-Predict2-2B-Text2Image | 26.02 GB |
Cosmos-Predict2-14B-Text2Image | 48.93 GB |
Cosmos-Predict2-2B-Video2World | 32.54 GB |
Cosmos-Predict2-14B-Video2World | 56.38 GB |
The Hugging Face access token needs Read permission. Log in with:

`huggingface-cli login`
Models | Link | Download Command |
---|---|---|
Cosmos-Predict2-2B-Text2Image | 🤗 Huggingface | python -m scripts.download_checkpoints --model_types text2image --model_sizes 2B |
Cosmos-Predict2-2B-Video2World | 🤗 Huggingface | python -m scripts.download_checkpoints --model_types video2world --model_sizes 2B |
--negative_prompt="${NEGATIVE_PROMPT}"
Parking lot Text2Image-2B
Warehouse docking Text2Image-2B
Drainage pipe Text2Image-2B
Bikes Text2Image-2B
e-scooter Text2Image-2B
`batch_can_factory.json`:

Save the batch definition as `batch_can_factory_template.json`, then run:

`python run_can_factory_batches.py --n 3`
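We do not show the internals of `run_can_factory_batches.py` here, but the core of such a batch runner can be sketched as below. The template fields (`prompt`, `seed`) are assumptions for illustration, not the script's actual schema:

```python
import copy

def expand_batches(template: dict, n: int) -> list[dict]:
    """Create n copies of the batch template, each with a distinct seed,
    so repeated runs produce varied outputs (field names are assumed)."""
    batches = []
    for i in range(n):
        spec = copy.deepcopy(template)                # keep the template intact
        spec["seed"] = template.get("seed", 0) + i    # one unique seed per batch
        batches.append(spec)
    return batches
```

Each resulting spec can then be written to disk and passed to the generation command.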
Batch output
Notes on prompting for realism

Diffusion models like the ones Cosmos Predict uses are currently achieving incredible image fidelity. When generating images for training object detection models intended to run on constrained hardware, or any type of hardware for that matter, best results are achieved by generating images of a quality that most closely resembles what the target device itself produces. In theory this should be achievable by prompting for the target camera's characteristics, or by adding a `--negative_prompt`. Experiments suggest that low-fidelity camera-noise prompts have limited impact on Cosmos Predict or Sora outputs, likely because the training corpus emphasises high-quality footage. With the Cosmos models, however, we have the option to fine-tune to our needs. Also note that adding “Photorealistic” is recommended in the documentation, but showed no significant effect in our testing.
Batch output
Note: the SAM2 video predictor expects JPEG files as input. Clone the repo.
Labeled images with segmentation masks and bounding boxes
Two parameters control filtering: `--box_threshold` and `--text_threshold`. Both are cosine-similarity cut-offs in the range 0-1, but they act at different stages of the pipeline:
Stage | What is scored? | Kept when score ≥ threshold | Typical default |
---|---|---|---|
box_threshold | The highest similarity between an image region (bounding box) and any token in the prompt | The whole box | 0.25–0.30 (case‑study configs) |
text_threshold | The similarity between that region and each individual token | The token becomes the label for that box | 0.25 (Hugging Face default) |
Tokens that score below `text_threshold` are dropped, so they never appear in the final phrase.
Raising both thresholds gives cleaner output, avoiding label fragments such as `##lift` or `canperson` that arise from the tokenization of `can`, `person` and `forklift`, but you may miss faint objects.
A good starting point is `box_threshold 0.3`, `text_threshold 0.25`, as recommended in the Grounded-SAM demo scripts.
In short, `--box_threshold` decides whether a region is worth keeping at all, while `--text_threshold` decides which words (if any) are attached to that region. Tune them together to balance missed objects against noisy labels.
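The two-stage logic can be illustrated with a toy filter over mock region-token similarity scores; the data structures below are illustrative only, not the detector's actual API:

```python
def filter_detections(regions, box_threshold=0.3, text_threshold=0.25):
    """regions: list of (box, {token: similarity}) pairs.
    Stage 1: keep a box only if its best token score >= box_threshold.
    Stage 2: label the kept box with every token scoring >= text_threshold."""
    kept = []
    for box, scores in regions:
        if max(scores.values()) < box_threshold:
            continue  # whole region discarded
        labels = [tok for tok, s in scores.items() if s >= text_threshold]
        kept.append((box, labels))
    return kept
```

With the defaults, a region whose best token scores 0.22 is discarded outright, while a kept region only carries the tokens that individually clear 0.25.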
To be able to upload label data to Edge Impulse Studio we need to convert the output to one of the supported formats, in this case Pascal VOC XML.
Conversion program: json_to_pascal_voc.py
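The essence of the conversion is emitting one annotation XML per image. A minimal sketch using the standard library; the helper name and box tuple layout are our own, the element names follow the Pascal VOC format:

```python
import xml.etree.ElementTree as ET

def to_pascal_voc(filename, width, height, boxes):
    """boxes: list of (label, xmin, ymin, xmax, ymax) in pixels."""
    ann = ET.Element("annotation")
    ET.SubElement(ann, "filename").text = filename
    size = ET.SubElement(ann, "size")
    for tag, val in (("width", width), ("height", height), ("depth", 3)):
        ET.SubElement(size, tag).text = str(val)
    for label, xmin, ymin, xmax, ymax in boxes:
        obj = ET.SubElement(ann, "object")
        ET.SubElement(obj, "name").text = label
        bb = ET.SubElement(obj, "bndbox")
        for tag, val in (("xmin", xmin), ("ymin", ymin),
                         ("xmax", xmax), ("ymax", ymax)):
            ET.SubElement(bb, tag).text = str(val)
    return ET.tostring(ann, encoding="unicode")
```

One such file per image, named after the image, is what Edge Impulse Studio's Pascal VOC uploader expects.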
Classification results
Check available GPU memory with `nvidia-smi`; for the 14B models we need something like:
Lambda Cloud
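`nvidia-smi` can also be queried in machine-readable form, which is handy for checking whether a node has enough free VRAM before launching a job; the parsing helper below is our own:

```python
import subprocess

# --query-gpu / --format are standard nvidia-smi options
NVIDIA_SMI_CMD = [
    "nvidia-smi",
    "--query-gpu=memory.free",        # one value per GPU
    "--format=csv,noheader,nounits",  # plain numbers, in MiB
]

def parse_free_vram(csv_text: str) -> list[int]:
    """Parse the query output above into free MiB per GPU."""
    return [int(line) for line in csv_text.splitlines() if line.strip()]

def free_vram_mib() -> list[int]:
    out = subprocess.run(NVIDIA_SMI_CMD, capture_output=True,
                         text=True, check=True).stdout
    return parse_free_vram(out)
```

For reference, the table above puts the 14B Video2World model at roughly 56 GB, so a single consumer GPU will not do.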
NVIDIA Cosmos-Predict2-14B-Video2World demo 00092
Vary `--seed` to create different variants.
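Sweeping seeds is easy to script. In the sketch below the entry point and flag names are assumptions based on the flags mentioned in this article, not a verified CLI:

```python
def seed_variant_cmds(prompt: str, seeds, model_size: str = "14B") -> list[list[str]]:
    """Build one generation command per seed; script path and
    flag names are assumed, only --seed is taken from the article."""
    return [
        ["python", "-m", "examples.video2world",
         "--model_size", model_size,
         "--prompt", prompt,
         "--seed", str(s)]
        for s in seeds
    ]
```

Each command can then be run sequentially, or fanned out across GPUs.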
Resulting videos:
For this use case the highest rate of success was achieved by generating images with the 2B-parameter Text2Image model and then using the 14B-parameter Video2World model to generate video clips. With 14B Text2Image and 14B Video2World the videos had higher fidelity in their details, but in 3 out of 4 attempts some sort of physical anomaly made the video look strange. Out of 40 attempts, the following compilation shows the only consistent results. For object detection, however, most of the results would still be usable, as we extract all frames individually and do not care about weird movement.
NVIDIA Cosmos-Predict2-14B-Text2Image demo