A robot that can’t see can’t navigate. Simple as that. But picking the right computer vision model for a robot that needs to move through the real world is harder than it looks. The model that wins on a benchmark leaderboard might be too slow for real-time control, and the one that’s fast enough might not recognize half the objects your robot will encounter.
I’ve been building robot navigation systems for the last three years, and here are the models that actually earn their keep in 2026.
Detection Models: Seeing What’s There
YOLOv8 and YOLO-NAS — The Reliable Workhorses
YOLOv8 is still the baseline for real-time object detection on robots, and for good reason. It’s mature, well-documented, and runs on practically any hardware. Ultralytics did a great job with the training pipeline and export tooling — going from dataset to ONNX export takes me about an afternoon.
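If you haven’t used it, the whole loop looks roughly like this — a minimal sketch where the dataset YAML and model size are placeholders for your own project:

```python
from ultralytics import YOLO

# Start from a pretrained checkpoint; yolov8n is the nano variant,
# pick s/m/l depending on your compute budget.
model = YOLO("yolov8n.pt")

# Fine-tune on a custom dataset described by a YOLO-format data.yaml
# (the file name here is a placeholder for your own dataset).
model.train(data="warehouse_pallets.yaml", epochs=100, imgsz=640)

# Export to ONNX for deployment. TensorRT consumes the ONNX file,
# or you can export format="engine" directly on the target Jetson.
model.export(format="onnx", imgsz=640)
```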
But YOLO-NAS is where I point new projects today. Deci AI’s architecture search found a better trade-off between speed and accuracy. On a Jetson Orin Nano with TensorRT INT8, YOLO-NAS delivers roughly 5% higher mAP than YOLOv8 at the same inference speed. That doesn’t sound huge, but in robotics it means your robot detects a chair 1 meter earlier — enough time to plan a smooth avoidance path.
Here’s where they differ in practice:
- YOLOv8 — Better for custom training from scratch. More tools and community fine-tuned models available. If you need to train on a specific object set (say, warehouse pallets), YOLOv8’s training ecosystem is more mature.
- YOLO-NAS — Better for off-the-shelf deployment. The pre-trained COCO weights are more accurate, and the architecture is more efficient at inference time. I use this for general navigation where the robot needs to recognize common obstacles: people, furniture, vehicles, doors.
For size-constrained robots, YOLO-NAS runs at 90+ FPS on Jetson Orin Nano at 640×640 with INT8 quantization — that’s faster than most cameras can deliver frames. The bottleneck becomes the camera pipeline, not the model.
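For a quick sanity check before you bother with TensorRT, the pretrained COCO weights are a few lines away via super-gradients. This is just a desktop-side sketch (and assumes the library’s predict helper), not the INT8 Jetson deployment path:

```python
from super_gradients.training import models

# Load YOLO-NAS small with the COCO-pretrained weights.
model = models.get("yolo_nas_s", pretrained_weights="coco").eval()

# Quick prediction on a local image (path is a placeholder);
# conf is the confidence threshold for reported boxes.
results = model.predict("hallway.jpg", conf=0.4)
results.show()
```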
Detic — The Open-Vocabulary Superpower
Here’s a problem every robot navigator runs into: your training data didn’t include that weird-shaped object. Maybe it’s a stroller in a hallway, a fallen sign, or a piece of construction equipment. YOLO will happily ignore it because it’s not in its 80 COCO classes.
Detic (Detecting Twenty-thousand Classes) solves this by using CLIP text embeddings as the classifier head. The detection backbone proposes regions, and the CLIP embeddings match them against whatever vocabulary you supply at inference time. In practice, this means Detic can detect essentially anything you can name; it’s not limited to a fixed class list.
The trade-off: speed. Detic runs at about 15-20 FPS on a Jetson Orin NX. That’s too slow for primary obstacle detection in a fast-moving robot, but perfect for a secondary “what is that thing?” pass. My typical setup: YOLO-NAS handles primary detection at 60 FPS, and every 5th frame goes through Detic for open-vocabulary classification of whatever YOLO flagged as “unknown.”
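The routing logic is nothing fancy. Here’s a sketch of the pattern; `yolo_detect`, `detic_classify`, and the “unknown”/low-confidence test are placeholders for whatever your own wrappers expose:

```python
# Sketch of the primary/secondary detector loop. yolo_detect and
# detic_classify are placeholders for your own model wrappers, and the
# "unknown"/low-score test stands in for however you flag uncertain boxes.
DETIC_EVERY_N = 5  # open-vocabulary pass on every 5th frame

def process_frame(frame, frame_idx, yolo_detect, detic_classify):
    detections = yolo_detect(frame)  # fast pass, every frame

    if frame_idx % DETIC_EVERY_N == 0:
        for det in detections:
            if det.label == "unknown" or det.score < 0.4:
                # Crop the flagged region and ask Detic what it is.
                crop = frame[det.y1:det.y2, det.x1:det.x2]
                det.label = detic_classify(crop)

    return detections
```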
For a research robot exploring novel environments (construction sites, disaster areas, forests), Detic is indispensable. You don’t know what objects exist until the robot encounters them.
Depth Estimation: Understanding the 3D World
Depth Anything — Monocular Depth That Actually Works
Monocular depth estimation used to be unreliable for navigation. Depth Anything V2 changed that. It’s trained on a massive dataset of labeled and unlabeled images, and the zero-shot generalization is genuinely impressive.
I tested it in a cluttered lab with transparent objects (a nightmare for most depth models) and it produced usable depth maps where stereo cameras failed. The model handles glass, reflections, and low texture surfaces much better than depth estimation models from even two years ago.
On a Jetson Orin NX, Depth Anything V2 runs at 30+ FPS at 512×512 resolution with TensorRT FP16. That’s fast enough to use as the primary depth source for a mobile robot, eliminating the need for a dedicated depth sensor like an Intel RealSense.
The key insight for navigation: you don’t need perfect metric depth. You need relative depth — “there’s an obstacle closer than 2 meters” — and Depth Anything provides that reliably. Pair it with a simple occupancy grid, and you have a functional navigation pipeline with just a single RGB camera.
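A minimal sketch of that idea, assuming the model gives you a relative inverse-depth map (larger value means closer) and that you’ve calibrated a rough “near” threshold for your own camera:

```python
import numpy as np

def near_obstacle_mask(inv_depth: np.ndarray, near_thresh: float) -> np.ndarray:
    """Flag pixels that look closer than a calibrated 'near' threshold.

    inv_depth: HxW relative inverse-depth map (larger = closer). The
    threshold must be calibrated per camera because the output is
    relative, not metric.
    """
    return inv_depth > near_thresh

def blocked_headings(mask: np.ndarray, min_fraction: float = 0.05) -> np.ndarray:
    """Collapse the mask into a per-column 'is this bearing blocked' vector.

    Each image column maps to a heading; a column counts as blocked when
    enough of its lower half (where the floor and nearby obstacles live)
    is flagged as near.
    """
    lower = mask[mask.shape[0] // 2:, :]
    return lower.mean(axis=0) > min_fraction
```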
Visual Feature Models: Where Am I?
DINOv2 — Features That Generalize Everywhere
DINOv2 from Meta is my go-to for visual localization and place recognition. It’s a self-supervised vision transformer that produces features so robust they work across dramatic viewpoint changes, lighting conditions, and seasonal variations.
The practical use case: your robot navigates a building in July. In January, the lighting is completely different, snow is visible through the windows, and holiday decorations have changed how every room looks. Classical descriptors like ORB break down here, and even learned local features like SuperPoint struggle. DINOv2 features are robust enough that place recognition still works.
I use DINOv2-S (small) for speed. It produces 384-dimensional feature vectors that work well with cosine similarity for loop closure detection in SLAM. On a Jetson Orin Nano, DINOv2-S processes a 224×224 image in about 15 ms — fast enough to run every few frames as a localization check.
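Getting those descriptors is a few lines with torch.hub. The resize and normalization below are the standard ImageNet preprocessing, and the 0.8 similarity threshold is just a starting point to tune, not a recommendation:

```python
import torch
import torch.nn.functional as F
from PIL import Image
from torchvision import transforms

# Load the small DINOv2 backbone (384-dim global descriptor) from torch.hub.
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def embed(path: str) -> torch.Tensor:
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return model(img)  # [1, 384] class-token descriptor

# Loop-closure check: cosine similarity against a stored keyframe descriptor.
sim = F.cosine_similarity(embed("current.jpg"), embed("keyframe.jpg")).item()
is_loop_candidate = sim > 0.8  # tuning parameter, not a universal constant
```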
The downside: DINOv2 is a vision transformer, and transformers are hungry for VRAM. DINOv2-S needs about 1 GB for inference. DINOv2-B needs 3 GB. Plan accordingly.
SLAM-Specific Models
DROID-SLAM — Deep Visual Odometry
DROID-SLAM replaced traditional geometric SLAM for my robots last year. It uses a recurrent iterative update that produces remarkably smooth trajectory estimates from monocular video. The dense optical flow formulation handles rapid motion and feature-poor environments much better than ORB-SLAM3.
The trade-off: GPU memory. DROID-SLAM needs ~2.5 GB VRAM for 512×512 input on a Jetson Orin. It’s worth it for the quality, but it means you can’t run it alongside a heavy detection model on an Orin Nano. I use it on the Orin NX where I have 16 GB to work with.
ORB-SLAM3 with Neural Features
An ORB-SLAM3 pipeline with learned features (SuperPoint swapped in for ORB) is my fallback when GPU memory is tight. SuperPoint features are more repeatable than ORB under illumination changes, and everything after feature extraction runs on the CPU. The GPU only handles SuperPoint inference (about 5 ms per frame).
On a Jetson Orin Nano, ORB-SLAM3 with SuperPoint features runs at 30+ FPS with about 500 MB VRAM for the feature extractor. The SLAM backend runs entirely on CPU. This is my go-to for smaller robots where every MB of GPU memory matters.
Embedding Models vs Detection Models
Here’s a distinction that took me too long to learn: detection models (YOLO, Detic) tell you what is where. Embedding models (DINOv2, CLIP) tell you where you are or how similar things are.
For navigation, you need both, but they serve different purposes:
- Detection models run on every frame for real-time obstacle avoidance. Speed over breadth.
- Embedding models run periodically for localization and loop closure. Robustness over speed.
Combining them efficiently means designing a pipeline where they share the GPU without starving each other. On a single GPU, I run YOLO-NAS every frame (20 ms), DINOv2 every 10th frame (15 ms every ~300 ms), and Depth Anything every 3rd frame (40 ms every ~100 ms). Running each engine in its own CUDA stream (TensorRT lets you bind an execution context to a stream) keeps them from blocking one another.
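Conceptually, the scheduling looks like the sketch below. I’ve written it with PyTorch CUDA streams and placeholder model objects to keep it readable; the real pipeline uses TensorRT execution contexts bound to streams, but the pattern is the same:

```python
import torch

# Scheduling pattern, illustrated with PyTorch CUDA streams; detector,
# depth_model, and feature_model are placeholders for the real engines.
det_stream, depth_stream, feat_stream = (torch.cuda.Stream() for _ in range(3))

def step(frame_idx, frame, detector, depth_model, feature_model):
    results = {}
    with torch.cuda.stream(det_stream):
        results["detections"] = detector(frame)        # every frame
    if frame_idx % 3 == 0:
        with torch.cuda.stream(depth_stream):
            results["depth"] = depth_model(frame)      # every 3rd frame
    if frame_idx % 10 == 0:
        with torch.cuda.stream(feat_stream):
            results["features"] = feature_model(frame) # every 10th frame
    torch.cuda.synchronize()  # join all streams before consuming outputs
    return results
```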
Inference Speed vs Accuracy Tradeoffs
Here’s my priority ordering for robot navigation models, starting with the non-negotiables:
- Latency: Must process fast enough for real-time control. For a wheeled robot at 1 m/s, you need at least 10 FPS detection. For a drone, 30+ FPS.
- Reliability: False negatives (missing an obstacle) are worse than false positives. A model that misses a person is dangerous.
- Robustness: Must work across lighting, weather, and viewpoint changes that your robot will actually encounter.
- Accuracy: Matters only insofar as it changes navigation decisions. A 75% mAP model that runs at 60 FPS is more useful than an 85% mAP model that runs at 15 FPS.
This is why YOLO-NAS wins most of my projects over fancier models like Detic or Mask R-CNN. It nails the first three priorities and delivers “good enough” accuracy.
Deployment on Edge Hardware
Practical setup recommendations based on what I’ve learned through painful trial and error:
| Hardware | Detection | Depth | Features | SLAM |
|---|---|---|---|---|
| Jetson Orin NX (16 GB) | YOLO-NAS | Depth Anything V2 | DINOv2-S | DROID-SLAM |
| Jetson Orin Nano (8 GB) | YOLO-NAS | MiDaS 3.1 (INT8) | DINOv2-S (every 20th frame) | ORB-SLAM3 + SuperPoint |
| Raspberry Pi 5 + Coral | MobileNetV4 (on TPU) | Depth Anything Tiny | N/A (use grid-based) | ORB-SLAM3 |
| Raspberry Pi 5 (CPU only) | N/A (use ultrasonic + IR sensors) | N/A | N/A | Simple lidar-based |
The CPU-only Pi 5 entry is realistic: for truly low-cost robots, you trade CV models for classical sensors. A $40 RPLidar, $10 of ultrasonic sensors, and a simple lidar-based SLAM running on the CPU give you functional navigation. Not as elegant as deep learning, but it works reliably and uses 5 watts total.
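The reactive layer on top of that lidar can be embarrassingly simple. A sketch, with the scan format and both thresholds as placeholders to tune:

```python
# Reactive layer for the lidar-only setup. scan is a list of
# (bearing_rad, range_m) pairs; both thresholds are placeholders to tune.
STOP_DIST = 0.5      # stop if anything ahead is closer than this (meters)
FORWARD_CONE = 0.35  # +/- ~20 degrees counts as "ahead"

def forward_blocked(scan):
    return any(r < STOP_DIST for bearing, r in scan if abs(bearing) < FORWARD_CONE)

def most_open_heading(scan):
    # Steer toward the bearing with the most free space.
    bearing, _ = max(scan, key=lambda br: br[1])
    return bearing
```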
My current research robot runs a Jetson Orin NX with YOLO-NAS + Depth Anything V2 + DINOv2-S, all pipelined through TensorRT. It navigates university hallways autonomously, handles open spaces and corridors, and re-localizes when picked up and placed elsewhere. That’s a $600 compute module running a navigation stack that would have required a workstation three years ago.
The field is moving fast. But the models I’ve covered here — YOLO-NAS, Detic, Depth Anything, DINOv2, and the SLAM frameworks — represent the practical frontier for robot navigation in 2026. They’re available now, they work on actual hardware, and they’ll handle most real-world environments your robot will encounter.
