VisionID

Real-time multi-object detection and tracking with YOLOv8 and DeepSort.

April 15, 2024 · 6 min read

Problem

Tracking multiple objects across video frames requires balancing detection accuracy with tracking continuity. YOLO on its own detects objects independently in each frame, so it carries no object identity from one frame to the next, while naive IoU-based tracking loses tracks when objects are occluded.

Solution

VisionID combines YOLOv8 for per-frame detection with DeepSort for appearance-based re-identification, and exposes the pipeline through a FastAPI interface that supports both batch video and live stream inputs.

Architecture

The pipeline runs in two stages:

Detection: YOLOv8n processes each frame and returns bounding boxes with class scores.
Tracking: DeepSort assigns consistent IDs by matching detections against existing tracks using Kalman filter prediction and cosine similarity on appearance embeddings.
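
To make the appearance-matching step concrete, here is a minimal sketch of associating tracks with detections by cosine distance. It is not the project's code or DeepSort's implementation: DeepSort combines the appearance cost with Kalman-based gating and solves the assignment with the Hungarian algorithm, whereas this sketch swaps in a simple greedy matcher for brevity. The function names and the 0.2 distance cutoff are illustrative assumptions.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # a zero embedding carries no appearance information
    return 1.0 - dot / (norm_a * norm_b)


def associate(track_embeddings, detection_embeddings, max_distance=0.2):
    """Greedily match tracks to detections in order of ascending distance.

    track_embeddings: dict mapping track_id -> embedding (hypothetical layout)
    detection_embeddings: list of embeddings, indexed by detection
    max_distance: assumed cutoff; pairs farther apart are never matched
    Returns (matches, unmatched_track_ids, unmatched_detection_indices).
    """
    candidates = []
    for track_id, t_emb in track_embeddings.items():
        for det_idx, d_emb in enumerate(detection_embeddings):
            dist = cosine_distance(t_emb, d_emb)
            if dist <= max_distance:
                candidates.append((dist, track_id, det_idx))
    candidates.sort(key=lambda c: c[0])

    matches = {}
    used_detections = set()
    for _, track_id, det_idx in candidates:
        if track_id in matches or det_idx in used_detections:
            continue  # each track and each detection is used at most once
        matches[track_id] = det_idx
        used_detections.add(det_idx)

    unmatched_tracks = [t for t in track_embeddings if t not in matches]
    unmatched_dets = [
        i for i in range(len(detection_embeddings)) if i not in used_detections
    ]
    return matches, unmatched_tracks, unmatched_dets
```

In a tracker built along these lines, unmatched detections would typically start new tentative tracks, and unmatched tracks would age until they are dropped.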

API Design

Two endpoints:

POST /track/video: processes an uploaded video and returns the tracking results as JSON.
GET /track/stream: Server-Sent Events (SSE) endpoint that streams live tracking data.
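
As an illustration of what the stream endpoint could emit, the sketch below serializes per-frame tracking results in the SSE wire format (an optional `event:` line, a `data:` line, and a blank line to end the message). The payload fields (`frame`, `tracks`, `id`, `bbox`) and the `tracks` event name are assumptions, not the project's actual schema.

```python
import json


def format_sse(payload, event=None):
    """Serialize one Server-Sent Events message with a JSON data field."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # Compact JSON stays on a single line, so one data: field is enough.
    lines.append(f"data: {json.dumps(payload, separators=(',', ':'))}")
    return "\n".join(lines) + "\n\n"  # blank line terminates the message


def tracking_events(frames):
    """Yield one SSE message per frame of tracking results.

    frames: iterable of per-frame track lists (hypothetical schema).
    """
    for frame_index, tracks in enumerate(frames):
        yield format_sse({"frame": frame_index, "tracks": tracks}, event="tracks")
```

In FastAPI, a generator like this could plausibly be returned through `StreamingResponse` with `media_type="text/event-stream"`; how VisionID wires it up is not shown here.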

Tuning Notes

A confidence threshold of 0.45 and an NMS IoU threshold of 0.5 gave the best balance on the traffic camera test set. A track is confirmed only after 3 consecutive detections.
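
The sketch below shows how these three settings could act on detections and tracks. It is a simplified illustration, not the project's code: in practice YOLOv8 applies the confidence and NMS thresholds internally and DeepSort handles confirmation. The NMS here is class-agnostic, the `Track` class is hypothetical, and a miss before confirmation resets the counter, whereas DeepSort deletes a tentative track on a miss.

```python
CONF_THRESHOLD = 0.45  # confidence threshold from the tuning notes
NMS_IOU = 0.5          # NMS IoU threshold from the tuning notes
N_INIT = 3             # consecutive detections needed to confirm a track


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_detections(detections):
    """Drop low-confidence boxes, then apply greedy class-agnostic NMS.

    detections: list of (box, score) pairs.
    """
    strong = [d for d in detections if d[1] >= CONF_THRESHOLD]
    kept = []
    for box, score in sorted(strong, key=lambda d: d[1], reverse=True):
        # Keep a box only if it does not overlap a higher-scoring kept box.
        if all(iou(box, kept_box) <= NMS_IOU for kept_box, _ in kept):
            kept.append((box, score))
    return kept


class Track:
    """Hypothetical track that is confirmed after N_INIT hits in a row."""

    def __init__(self):
        self.hits = 0
        self.confirmed = False

    def update(self, matched):
        if matched:
            self.hits += 1
            if self.hits >= N_INIT:
                self.confirmed = True
        elif not self.confirmed:
            self.hits = 0  # a miss before confirmation restarts the count
```

Requiring several consecutive hits before confirmation keeps one-frame false positives from ever receiving a visible track ID, at the cost of a short delay before a new object is reported.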