Ultralytics YOLO26: A Unified Real-Time Vision Model Family

June 23, 2026 #AI

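Ultralytics has announced YOLO26, a family of real-time vision models.

YOLO26 combines coordinated architecture and training advances designed to overcome the limitations of earlier YOLO detectors.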

YOLO26, announced by Ultralytics in June 2026, is a unified model family for real-time vision. Its new design addresses known weaknesses of earlier YOLO models and applies to a range of vision tasks.

Limitations of earlier YOLO models

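Most earlier YOLO detectors depend on non-maximum suppression (NMS) at inference time, and their detection heads are heavy because of Distribution Focal Loss (DFL). They also require long training schedules and can leave the smallest objects without any positive label assignment.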
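To make the NMS dependency concrete, here is a minimal sketch of greedy non-maximum suppression in plain Python. It illustrates the general post-processing step, not the Ultralytics implementation; the function names, the box format `(x1, y1, x2, y2)`, and the 0.5 IoU threshold are assumptions chosen for the example.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes: List[Box], scores: List[float],
               iou_thresh: float = 0.5) -> List[int]:
    """Return indices of kept boxes, highest score first."""
    # Visit candidates in descending score order.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep


# Two near-duplicate detections of one object plus one separate object.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores)
print(kept)
```

The loop compares every candidate against all boxes kept so far, so its cost grows with the number of raw predictions, and the threshold is an extra hyperparameter to tune at deployment time. An end-to-end detector avoids this step by emitting one prediction per object directly.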

What is new in YOLO26

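YOLO26 uses a dual-head design to provide native NMS-free, end-to-end inference. It removes DFL entirely, which yields a lighter head with an unconstrained regression range. Its training pipeline combines MuSGD, a hybrid Muon-SGD optimizer adapted from large language model training; Progressive Loss, which shifts supervision toward the inference-time head; and STAL, a label assignment strategy that guarantees positive coverage for small objects.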
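The link between DFL, head size, and regression range can be sketched as follows. In a DFL-style head, each box edge is predicted as a distribution over a fixed number of bins and decoded as the expected bin index, so the decoded distance can never exceed the last bin. The bin count of 16 matches earlier Ultralytics releases such as YOLOv8; `direct_decode` is a hypothetical stand-in for a head that regresses one scalar per edge, since the abstract does not spell out the YOLO26 head itself.

```python
import math
from typing import List

REG_MAX = 16  # bins per box edge in a DFL-style head (value used by YOLOv8)


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dfl_decode(logits: List[float]) -> float:
    """Expected bin index under softmax: edge distance in stride units."""
    probs = softmax(logits)
    return sum(i * p for i, p in enumerate(probs))


def direct_decode(raw: float) -> float:
    """Hypothetical direct regression: one scalar per edge, clamped at zero."""
    return max(0.0, raw)


# Box-regression output channels per anchor point: 4 edges per box.
dfl_channels = 4 * REG_MAX  # one distribution per edge
direct_channels = 4         # one scalar per edge

# Even with all probability mass on the last bin, DFL is capped at REG_MAX - 1.
peaked = [0.0] * (REG_MAX - 1) + [50.0]
capped = dfl_decode(peaked)
uncapped = direct_decode(40.0)
print(dfl_channels, direct_channels, capped, uncapped)
```

Under these assumptions the DFL head spends 64 channels on box regression where direct regression needs 4, and its decoded distance saturates just below 16 stride units, while the direct variant has no such ceiling.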

Application to multiple tasks

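Beyond detection, YOLO26 introduces task-specific head and loss designs for instance segmentation, pose estimation, and oriented detection, which the authors report produce consistent gains across tasks and scales. The family spans five scales (n/s/m/l/x) and also covers classification in the same pipeline. An open-vocabulary extension, YOLOE-26, supports text-prompted, visual-prompted, and prompt-free inference.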

Summary

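YOLO26 is a new model family aimed at more efficient and more accurate real-time vision. According to the abstract, it reaches 40.9-57.5 mAP on COCO at 1.7-11.8 ms T4 TensorRT latency across its five scales, advancing the accuracy-latency Pareto front over prior real-time detectors.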

Opening of the original text (English, first three paragraphs only)

Abstract: Real-time vision demands models that are accurate, efficient, and simple to deploy across diverse hardware. The YOLO family has become widely deployed for this reason, yet most YOLO detectors still rely on non-maximum suppression at inference, carry heavy detection heads due to Distribution Focal Loss, require long training schedules, and can leave the smallest objects without positive label assignments. We present Ultralytics YOLO26, a unified real-time vision model family that addresses these limitations through coordinated architecture and training advances. YOLO26 uses a dual-head design for native NMS-free end-to-end inference and removes DFL entirely, yielding a lighter head with unconstrained regression range. Its training pipeline combines MuSGD, a hybrid Muon-SGD optimizer adapted from large language model training; Progressive Loss, which shifts supervision toward the inference-time head; and STAL, a label assignment strategy that guarantees positive coverage for small objects. Beyond detection, YOLO26 introduces task-specific head and loss designs for instance segmentation, pose estimation, and oriented detection, producing consistent gains across tasks and scales. The family spans five scales (n/s/m/l/x) and supports detection, instance segmentation, pose estimation, classification, and oriented detection in a single pipeline, with an open-vocabulary extension, YOLOE-26, for text-, visual-, and prompt-free inference. Across all scales, YOLO26 achieves 40.9-57.5 mAP on COCO at 1.7-11.8 ms T4 TensorRT latency, advancing the accuracy-latency Pareto front over prior real-time detectors, while YOLOE-26x reaches 40.6 AP on LVIS minival under text prompting. Code and models are available at this https URL.

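Note: out of respect for copyright, the quotation is limited to the first three paragraphs. Please see the original article for the rest.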

Read the original article ↗
