SubQ 1.1 Small: A Compact but High-Performing Long-Context Retrieval Model

June 16, 2026 #AI

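SubQ 1.1 Small, a compact but capable long-context retrieval model, has been announced.

Thanks to the efficiency of SSA, at 1M tokens it needs 64.5x less compute than dense attention and runs 56x faster than FlashAttention-2, and it scores 99.12% on RULER at 128K. Expected applications include financial analysis, legal work, and software engineering.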

On June 16, 2026, the company behind SubQ published the model card for its AI model "SubQ 1.1 Small". To address the cost of processing large documents and codebases, the model uses a new technique called SSA (Subquadratic Sparse Attention).

Technical background

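Until now, AI models have faced a sharp rise in compute when processing large documents. The cause is the attention mechanism, whose compute grows quadratically with context length rather than linearly. SubQ adopts SSA to remove this constraint.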
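To make the scaling argument concrete, here is a minimal sketch that counts the query-key pairs scored in one dense attention pass. The function name and the chosen context lengths are illustrative assumptions, not taken from the SubQ post, and the count ignores head dimension, layer count, and constant factors.

```python
def dense_attention_pairs(context_len: int) -> int:
    """Query-key pairs scored by one dense (non-causal) attention pass."""
    return context_len * context_len


for n in (128_000, 1_000_000, 12_000_000):
    print(f"{n:>12,} tokens -> {dense_attention_pairs(n):.3e} pairs")

# Growing the context 12x (1M -> 12M) multiplies the pair count by 12 * 12.
growth = dense_attention_pairs(12_000_000) / dense_attention_pairs(1_000_000)
print(f"pair growth from 1M to 12M: {growth:.0f}x")
```

The pair count is what makes long contexts expensive for dense attention: context length enters squared, so a modest increase in length produces a much larger increase in work.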

Performance and efficiency

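At a context of 1M tokens, SubQ 1.1 Small requires 64.5x less compute than dense attention and runs 56x faster than FlashAttention-2 on a single attention layer. It also reports strong benchmark results: near-perfect needle-in-a-haystack retrieval up to 12M tokens and 99.12% on RULER at 128K.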
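The original post, quoted further down, also says attention is compressed to 0.13% of relationships at 12M tokens. As a rough sketch of what that fraction means in pair counts, the snippet below converts a kept fraction into a number of scored pairs and a pair-count reduction. The helper names are hypothetical, and this simple ratio is not claimed to reproduce the post's measured figures (64.5x at 1M tokens, "up to nearly 1,000x"), which describe compute rather than raw pair counts.

```python
KEPT_FRACTION = 0.13 / 100  # "0.13% of relationships", as quoted from the post


def kept_pairs(context_len: int, kept_fraction: float = KEPT_FRACTION) -> float:
    """Query-key pairs left if only kept_fraction of all pairs is scored."""
    return context_len * context_len * kept_fraction


def pair_reduction(kept_fraction: float = KEPT_FRACTION) -> float:
    """How many times fewer pairs are scored than in dense attention."""
    return 1.0 / kept_fraction


n = 12_000_000
print(f"dense pairs at 12M: {n * n:.3e}")
print(f"kept pairs at 12M:  {kept_pairs(n):.3e}")
print(f"pair reduction:     {pair_reduction():.0f}x")
```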

Applications and outlook

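Use is expected in fields that must process large volumes of material, such as financial analysis, legal and contract work, and software engineering. According to the post, a broader lineup of models with contexts ranging from 2M to 12M tokens is planned for later in the year.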

Summary

The release of SubQ 1.1 Small marks notable progress in large-document processing, and its wider rollout will be worth watching.

Excerpt from the original article (English)

Date: June 16, 2026

The hardest enterprise AI problems share a common shape. They require reasoning over complete artifacts: entire codebases, document collections, contracts, financial filings.

For years, the industry worked around this problem by building retrieval pipelines, chunking strategies, and agentic scaffolding. These are useful tools, but ultimately workarounds for context limitations of the model architecture. The underlying constraint was attention: compute that scales quadratically with context length, making direct reasoning over large artifacts prohibitively expensive.

SubQ is built to remove that constraint. Today we're releasing the model card for SubQ 1.1 Small, the second iteration of our Subquadratic Sparse Attention (SSA) model, at the smallest size. We are in the process of deploying SubQ 1.1 Small with select design partners and plan to deploy a broader lineup of models ranging from 2M to 12M tokens later in the year.

Key Features

- Near-perfect long-context retrieval up to 12M tokens on the needle-in-a-haystack test, with up to nearly 1,000x attention compute reduction.
- A balance of long-context optimization and general reasoning ability, with strong performance retained across knowledge, coding, and non-coding enterprise agent benchmarks.
- At 1M tokens, SubQ 1.1 Small requires 64.5x less compute than dense attention and runs 56x faster than FlashAttention-2.

These results reflect the scaling advantage that SSA's efficiency gains make possible.

Benchmarks

SubQ 1.1 Small was evaluated across five axes, covering long-context retrieval, context-length generalization, knowledge, coding, and long-horizon agentic tasks.

Long-Context Retrieval & Generalization

We selected Needle-In-A-Haystack (NIAH) and Nvidia's RULER test because together they test whether the model can find a single fact buried deep in a large context, and whether it can connect the dots across that context.

NIAH is the precision test. It places one retrievable fact at a controlled depth within a long context and asks the model to return it exactly. SubQ 1.1 Small scores near-perfect at 1M, 2M, 6M, and 12M tokens. The model was trained predominantly at 1M tokens yet the retrieval held near perfectly at 12x that length, despite compressing attention to just 0.13% of relationships. This generalization is a direct consequence of SSA routing attention based on content relevance rather than fixed positional patterns.

RULER is the capability test. Its 13 tasks go beyond single-fact lookup to cover multi-hop variable tracing, frequency extraction, and aggregation across the full context using the kind of reasoning complete-artifact workloads actually require. SubQ 1.1 Small scores 99.12% at 128K.

Multi-task retrieval, RULER (128K): 99.12% at 128K
Single-fact retrieval, Needle-in-a-haystack (1M–12M): 100% at 1M, 100% at 2M, 98% at 6M, 98% at 12M

General Knowledge & Reasoning

SubQ 1.1 Small balances long-context optimization with general reasoning ability without compromise. GPQA Diamond at 85.4% sits just below mid-tier frontier models and well above the smaller tier. LiveCodeBench at 89.7% pass@4 is close to the absolute frontier. AutomationBench Finance at 13% places SubQ 1.1 Small close to the strongest models on that benchmark, ahead of mid-tier and smaller baselines. Absolute scores remain low across all models on this benchmark.

| Benchmark | SubQ 1.1 Small | GPT-5.5 | Opus 4.8 | Sonnet 4.6 | GPT-5.4-mini | GPT-5.4-nano | Haiku 4.5 |
|---|---|---|---|---|---|---|---|
| Graduate-level science: GPQA Diamond · pass@1 | 85.4 | 93.2 | 92 | 87.5 | 87.5 | 81.7 | 67.2 |
| Agentic finance: AutomationBench | 13% | 18% | 16% | 8% | 0% | n/r | 3% |
| Competitive programming: LiveCodeBench v6 · pass@4 | 89.7 | 92 | 92.2 | 88.9 | 78.6 | 78.2 | 69.7 |

n/r = result not reported by the model provider

Efficiency

SSA replaces the O(n²) dense attention pass with a learned sparse formulation that scales linearly with context length. SSA's advantage over dense attention grows as context length increases. At 1M tokens, SubQ requires 64.5x less compute than dense attention and runs 56x faster than FlashAttention-2 on a single attention layer. In practice, this drastically changes the economics of long-context training and inference.

A full breakdown of the mechanism and how it compares to FlashAttention, DeepSeek sparse attention, and recurrent architectures is in the Technical Report.

[Figure: SubQ uses 64.5x less compute than dense attention, and is 56x faster than FlashAttention-2 at 1M-token context]

Training

We started with an existing open-weight frontier model, replaced dense attention with SSA, and built long-context capability through staged context extension (262K, 512K, 1M, 2M) followed by roughly one trillion tokens of continued pretraining on naturally long artifacts: books, documents, and repository-scale code.

The strongest lever we found for improving long-context retrieval was long-context continued pretraining, made possible by the efficiency of the SSA algorithm. The 12M generalization result reflects both factors: SSA's selection criterion is independent of absolute position, and the capability to use that generalization reliably develops through training on long data.

Additionally, we ran more than one hundred experiments across six to seven model generations to get the balance of capabilities between long- and short-context tasks right. That kind of iteration is only possible because SSA enabled our team to run multi-million-token experiments as a standard procedure rather than a rare event, making the research loop more efficient.

Use Cases

SubQ is designed for workloads that require reasoning over information distributed across the artifact without fragmentation. Here are just a few of the use cases from our initial research:

- Financial analysis and due diligence. Filings, earnings reports, contracts, and internal records are only meaningful in combination. SubQ reasons across the full collection rather than summarizing each document in isolation.
- Legal and contract work. A contract may define a term on page 2, qualify it on page 12, and carve out an exception on page 46. Retrieval finds the sentence but loses the relationships. SubQ holds the whole document and reasons across it directly.
- Software engineering. Codebases distribute logic across files, modules, and dependencies in ways that short-context models can't hold at once. SubQ loads an entire repository into a single context window, enabling architecture-level reasoning, cross-file refactoring, and dependency tracing in one pass. We believe there will be significant value for long-context models in planning, review, and long-horizon memory within coding.

What's Next

We'll be kicking off with the first cohort of design partners in the next few weeks, with broader rollout through the quarter and general model releases by end of year.
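The needle-in-a-haystack setup described in the Benchmarks part of the excerpt (one fact placed at a controlled depth, returned exactly) can be sketched roughly as follows. This is an illustrative harness under stated assumptions, not the evaluation code behind the reported scores: the filler text, the needle sentence, the `build_haystack` and `score` helpers, and measuring length in words rather than tokens are all hypothetical, and the model call itself is left out.

```python
def build_haystack(filler: str, needle: str, total_words: int, depth: float) -> str:
    """Repeat filler words up to total_words and insert needle at a relative depth (0.0-1.0)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    filler_words = filler.split()
    words = [filler_words[i % len(filler_words)] for i in range(total_words)]
    position = int(depth * total_words)
    return " ".join(words[:position] + [needle] + words[position:])


def score(model_answer: str, expected: str) -> bool:
    """Exact-match scoring: the expected fact must appear in the answer."""
    return expected in model_answer


# Hypothetical needle and question; a real run would send `prompt` to the model.
needle = "The vault code is 7319."
context = build_haystack("the quick brown fox jumps over the lazy dog", needle, 1000, 0.5)
prompt = context + "\n\nWhat is the vault code?"
```

Sweeping `depth` from 0.0 to 1.0 and `total_words` across several context lengths gives the usual grid of retrieval results, one cell per (length, depth) pair.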

Note: Out of respect for copyright, only the opening portion of the original is quoted. Please see the original article for the rest.

Read the original article ↗

Hacker News comments

Machine translation. Original HN thread ↗

EDM115 June 16, 2026
https://subq.ai/docs/subq-1-1-small-model-card.pdf
giancarlostoro June 16, 2026
This one's interesting, and I think the next frontier for LLMs should really just be, how can we get something like Opus 4.6 to cost drastically less, for the same output? I say 4.6 because from 4.6 onwards it's been pretty darn good, at least for me, always feels like every model upgrade someone hates it, heck even 4.5 was fine.
cmogni1 June 16, 2026
I don’t understand why this lab is allergic to providing details on what they actually made, especially when Chinese labs are more than willing to share architectural specs/code/kernels (eg NSA/FSA, RAMBa, HISA, DSA LightningIndexer, etc). I don’t doubt that they’ve done something here, but the lack of details makes me default not trust this, particularly when this is the second time that they’ve released a “technical report” that just waxes poetic about the concept.
embedding-shape June 16, 2026
> SubQ 1.1 Small scores near-perfect at 1M, 2M, 6M, and 12M tokens. The model was trained predominantly at 1M tokens yet the retrieval held near perfectly at 12x that length, despite compressing attention to just 0.13% of relationships. This generalization is a direct consequence of SSA routing attention based on content relevance rather than fixed positional patterns.
If the results persists from 1M to 12M, why not 24M or 48M? Sounds almost too good to be true.
With back of the napkin math from inside my head, that'd be like 0.5/1 million LOC, depending on language/code density, could just fold the entire codebase into one prompt if it's a small one, that'd be neat :)
wxw June 16, 2026
> SSA replaces the O(n²) dense attention pass with a learned sparse formulation that scales linearly with context length.
> At 1M tokens, SubQ 1.1 Small requires 64.5x less compute than dense attention and runs 56x faster than FlashAttention-2.
Awesome stuff. Solving context at the model architecture layer rather than trying to bolt on extra memory is the right direction IMO.
satyarohith June 16, 2026
It's been all talk and no action ever since their first announcement.
samber June 16, 2026
According to Subquadratic, Needle in a Haystack is strong up to 12m tokens, but RULER has not been tested above 128k tokens ??
kristjansson June 16, 2026
It's easy[1] to promise, it's hard to deliver. I hope the best for them.
[1]: https://magic.dev/blog/ltm-1 (note the date)
bthornbury June 16, 2026
we need some better standard long-context benchmarks.
needle in a haystack is not good for this, yes it proves the model can attend to its context, but in its usual form, somewhat trivializes the query-key relationship.
something like long-form Q&A would be more ideal. Like reading a book and answering questions that require synthesizing information derived from either the whole thing or disparate portions of it. Like describing an entire character arc in a 1000 page novel with examples and evidential moments.
mark_l_watson June 17, 2026
Interesting idea but until I get my grubby little fingers in it, to try it - difficult to have an opinion.
I am hopefully expectant that we will see all sorts of optimizations in the next few years that will enable even more local model use and slash commercial API costs. I get excited by the results when I enjoy one or two short coding sessions a week with Claude Opus but it is even more exciting to get a major task done and see that I only used $0.05 for DeepSeek v4 Flash or perhaps $0.15 for DeepSeek v4 Pro. It was exciting in even a different way when I two shotted a complete TypeScript/Tauri app using gemma-12b-qat with little-coder on a cheap laptop a few days ago.