Disney Hot Air Balloon - ML Object Detection
- bennyp3333
- Jun 2
This project established a pipeline from a 3D model to a working object detection model capable of estimating 3D position in AR, built with machine learning and synthetic data generation.
Synthetic Data Generation
To generate training data, I created a custom Blender Python script. The script allows any 3D model to be placed in the scene, then automates the rendering of thousands of images by randomizing several parameters:
Camera angle
HDRI lighting environment
Other scene/camera variables

For each render, the script exports both the image and the corresponding bounding box data in a JSON file.
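To make the workflow concrete, here is a minimal sketch of this kind of randomization loop in Blender's Python API. The object and camera names, HDRI paths, and the compute_bbox helper (sketched further down) are placeholder assumptions, not the actual script.

```python
import bpy
import json
import math
import random

scene = bpy.context.scene
cam = bpy.data.objects["Camera"]      # assumed camera name; a Track To constraint keeps it aimed at the target
target = bpy.data.objects["Target"]   # assumed name of the imported 3D model

# Paths to HDRIs downloaded via the Poly Haven plugin (illustrative)
hdri_paths = ["/path/to/hdris/studio_small_08_8k.exr", "/path/to/hdris/kloppenheim_06_8k.exr"]
num_images = 1000
labels = []

for i in range(num_images):
    # Randomize camera position on a ring around the object
    angle = random.uniform(0.0, 2.0 * math.pi)
    radius = random.uniform(2.0, 6.0)
    cam.location = (radius * math.cos(angle), radius * math.sin(angle), random.uniform(0.5, 3.0))

    # Randomize HDRI lighting by swapping the world's environment texture
    env_node = scene.world.node_tree.nodes["Environment Texture"]   # assumes a node with this default name
    env_node.image = bpy.data.images.load(random.choice(hdri_paths), check_existing=True)

    # Render the frame
    scene.render.filepath = f"//renders/{i:05d}.png"
    bpy.ops.render.render(write_still=True)

    # Record the image name plus the 2D bounding box of the target in this frame
    labels.append({"image": f"{i:05d}.png", "bbox": compute_bbox(scene, cam, target)})

with open(bpy.path.abspath("//labels.json"), "w") as f:
    json.dump(labels, f, indent=2)
```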
To ensure lighting diversity, I used a plugin from Poly Haven, which gives access to over 800 high-quality HDRIs. This helped maintain variability even across 10,000+ images, reducing the risk of overfitting.
The most time-consuming parts of the process are loading each 8K HDRI into RAM and denoising each render. Despite using GPU rendering, denoising is handled by the CPU, so the script ends up more CPU-bound than GPU-bound.
I used this tool to generate:
~30,000 images of a Red Bull can for initial testing
~15,000 images of a Disney hot air balloon once that model was finalized
Debugging Note:
The bounding boxes exported from Blender initially had issues: they didn't always align correctly with the object. After some trial and error, this was resolved.
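One common way to get a tight, correctly aligned box is to project the evaluated mesh vertices into camera view space and take their min/max. The sketch below shows that approach and could serve as the compute_bbox helper assumed in the earlier snippet; it isn't necessarily the exact fix used here.

```python
import bpy
from bpy_extras.object_utils import world_to_camera_view

def compute_bbox(scene, cam, obj):
    """Project all mesh vertices into normalized camera space and take the min/max extent.

    Using the actual vertices (with the object's world matrix applied) avoids the
    misalignment that coarser approximations, such as the 8-corner object bounding box,
    can introduce.
    """
    depsgraph = bpy.context.evaluated_depsgraph_get()
    eval_obj = obj.evaluated_get(depsgraph)   # include modifiers in the evaluated mesh
    mesh = eval_obj.to_mesh()

    xs, ys = [], []
    for v in mesh.vertices:
        co_world = eval_obj.matrix_world @ v.co
        co_ndc = world_to_camera_view(scene, cam, co_world)  # (x, y, depth), x/y in 0..1
        xs.append(co_ndc.x)
        ys.append(co_ndc.y)
    eval_obj.to_mesh_clear()

    res_x = scene.render.resolution_x
    res_y = scene.render.resolution_y
    # Camera-view y runs bottom-up; flip it so the box is in image coordinates
    x_min, x_max = min(xs) * res_x, max(xs) * res_x
    y_min, y_max = (1 - max(ys)) * res_y, (1 - min(ys)) * res_y
    return {"x": x_min, "y": y_min, "w": x_max - x_min, "h": y_max - y_min}
```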
ML Training
I began by testing object detection using Python notebooks provided by Snapchat, which support:
MobileNet
YOLOv8
Results were mixed; accuracy and robustness weren't consistent across the models.
To improve performance, I transitioned to Roboflow, which offered:
A faster training process
Built-in data augmentation (rotation, brightness, zoom, etc.)
This led to noticeably better results.
I believe Roboflow also uses YOLO under the hood, but the key advantage came from its augmentation pipeline, which increased dataset variability and robustness. For example, while my Blender script only tilts the camera up to 20° on the roll axis, Roboflow’s augmentations provided much wider angle diversity, improving detection range.
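Roboflow configures these augmentations in its UI, but the effect is comparable to an offline pass like the albumentations sketch below. The transform parameters and file path are illustrative, not Roboflow's actual settings.

```python
import albumentations as A
import cv2

# Illustrative augmentation pipeline; Roboflow applies similar transforms during training
transform = A.Compose(
    [
        A.Rotate(limit=45, p=0.7),              # far wider than the 20-degree roll in the Blender script
        A.RandomBrightnessContrast(p=0.5),
        A.RandomScale(scale_limit=0.3, p=0.5),  # simulate zooming in/out
        A.HorizontalFlip(p=0.5),
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

image = cv2.imread("renders/00000.png")         # illustrative path
augmented = transform(image=image, bboxes=[[0.5, 0.5, 0.3, 0.4]], class_labels=[0])
aug_image, aug_boxes = augmented["image"], augmented["bboxes"]
```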

3D Position Inference
The ML model only outputs a 2D bounding box of the object in the camera view. While useful, this isn't sufficient for placing graphics in 3D space (especially for aligning objects to the correct world position and orientation).
Fortunately, for this project, we were detecting a hot air balloon model that is radially symmetrical and always upright. This meant we could make the following assumptions:
The balloon has a fixed world orientation (upright).
The yaw axis is irrelevant due to symmetry.
With its known real-world height, we can use its screen height to estimate distance from the camera.
Step-by-step 3D Position Estimation
Determine Height of the Object Within the Bounding Box
Instead of simply using the height of the bounding box in screen space, we improve accuracy by projecting the world up vector (from Snapchat's world tracking) into screen space and measuring where it intersects the bounding box. This gives a more accurate reading of the object’s vertical extent, even if the phone is tilted.
The world up vector is projected into 2D.
A line is cast through the bounding box along that vector.
The intersection points with the bounding box give us a more accurate measurement of how tall the object appears in the image.
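A minimal sketch of that measurement, assuming the bounding box is given in pixel coordinates and the world-up vector has already been projected to a 2D screen direction:

```python
import numpy as np

def apparent_height_px(bbox, up_dir_2d):
    """Measure how tall the object appears along the projected world-up direction.

    bbox: (x_min, y_min, x_max, y_max) in pixels.
    up_dir_2d: world up vector projected into screen space (2D direction).
    Casts a line through the box center along up_dir_2d and returns the length of
    the segment clipped to the box (a standard line-box intersection).
    """
    x_min, y_min, x_max, y_max = bbox
    center = np.array([(x_min + x_max) / 2.0, (y_min + y_max) / 2.0])
    d = np.asarray(up_dir_2d, dtype=float)
    d = d / np.linalg.norm(d)

    # Parametric line p(t) = center + t * d; clip t against each pair of box edges
    t_min, t_max = -np.inf, np.inf
    for axis, (lo, hi) in enumerate([(x_min, x_max), (y_min, y_max)]):
        if abs(d[axis]) < 1e-9:
            continue
        t0 = (lo - center[axis]) / d[axis]
        t1 = (hi - center[axis]) / d[axis]
        t_min = max(t_min, min(t0, t1))
        t_max = min(t_max, max(t0, t1))
    return t_max - t_min   # length in pixels of the up-aligned chord through the box
```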
Estimate Depth Using Real-World Height
Given the estimated screen-space height and the known physical height of the balloon, we can calculate how far the object is from the camera using simple projection math. This gives us the Z (depth) position.
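This is the standard pinhole-camera relationship; a small sketch, assuming the focal length is known in pixels from the camera intrinsics:

```python
def estimate_depth(real_height_m, apparent_height_px, focal_length_px):
    """Pinhole projection: apparent_height_px = focal_length_px * real_height_m / depth.

    Solving for depth gives the distance from the camera along its forward axis.
    focal_length_px is the camera's focal length expressed in pixels.
    """
    return focal_length_px * real_height_m / apparent_height_px

# Example: a 1.5 m tall balloon spanning 300 px with a 1400 px focal length
# sits roughly 1400 * 1.5 / 300 = 7 m from the camera.
```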
Calculate Full 3D Position (X, Y, Z)
Combining the bounding box center (in screen space) with the depth, and using the device’s camera intrinsics, we back-project to find the full 3D world coordinates of the balloon.
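A sketch of that back-projection, assuming intrinsics in the usual fx, fy, cx, cy form; converting the resulting camera-space point into world space then uses the camera pose from world tracking:

```python
import numpy as np

def back_project(u, v, depth, fx, fy, cx, cy):
    """Back-project a pixel (u, v) at a known depth into camera-space coordinates.

    fx, fy are the focal lengths in pixels and (cx, cy) is the principal point.
    The result is in the camera's coordinate frame; the camera pose from the
    device's world tracking transforms it into world coordinates.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])
```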
Fallback Orientation from Balloon Parts
World tracking doesn't always initialize reliably in Snapchat. As a fallback, I trained the ML model to detect the balloon fabric and the basket separately. The vector from the basket to the balloon top gives us an inferred "up" direction, which allows us to orient the object without relying on world tracking.
This method only works because the object's orientation is constrained and predictable; it wouldn't generalize to arbitrary objects.
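As an illustration, the screen-space "up" estimate can be as simple as the normalized vector between the centers of the two detections (the box format here is an assumption):

```python
import numpy as np

def inferred_up_2d(balloon_box, basket_box):
    """Infer the balloon's screen-space up direction from two separate detections.

    balloon_box / basket_box: (x_min, y_min, x_max, y_max) in pixels, image y-axis
    pointing down. The direction from the basket center toward the balloon-fabric
    center approximates the object's up axis on screen.
    """
    def center(box):
        x_min, y_min, x_max, y_max = box
        return np.array([(x_min + x_max) / 2.0, (y_min + y_max) / 2.0])

    d = center(balloon_box) - center(basket_box)
    return d / np.linalg.norm(d)
```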
Takeaways
I have built out a working pipeline:
3D model → synthetic image dataset → ML training → real-time detection with 3D inference
Standard object detection only gives 2D bounding boxes, which aren't very useful on their own for AR. We worked around this by exploiting the balloon's predictable shape and orientation.
The synthetic data generation script is reusable for other models; it is the most general and transferable piece of the project.
The rest of the pipeline is currently specialized for simple, symmetrical objects but could evolve into more robust tracking with the right model and data.