Jamba: AI21's SSM-Transformer Hybrid Model

Jamba: the first production-grade SSM-Transformer hybrid architecture, with a 256K context window

Source

SOURCE · AI21 JAMBA · SSM-Transformer hybrid · MoE · 256K context · 2024

Jamba: AI21’s SSM-Transformer Hybrid

AI21 (2024-03-28): the first production-grade Mamba-Transformer hybrid, open-sourced under Apache 2.0

Jamba interleaves SSM (Mamba) layers with Transformer attention layers and adds MoE layers, aiming to improve throughput, memory efficiency, and quality together. Key ratio: one Transformer attention layer in every eight, with the other seven being Mamba layers, a ratio AI21 reports choosing after tuning the layer mix. It marks a milestone in the move of SSM hybrid architectures from research to production.

Architecture parameters

Attention/Mamba ratio: 1/8 Transformer + 7/8 Mamba (per 8 layers)

Parameter count: 52B total, 12B active at inference (MoE sparse activation)

Context length: 256K context window; a single 80GB GPU holds 140K context

Throughput: 3× that of Mixtral 8x7B on long-context workloads
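
To make the ratio concrete, here is a minimal Python sketch of a layer schedule with one attention layer per block of eight. The function name `layer_schedule`, the 32-layer depth, and the position of the attention layer inside each block are illustrative assumptions; the source gives only the 1-in-8 ratio and the 52B/12B parameter counts.

```python
def layer_schedule(n_layers: int, period: int = 8, attn_offset: int = 0) -> list[str]:
    """Label each layer 'attention' or 'mamba', with one attention layer per `period`."""
    if not 0 <= attn_offset < period:
        raise ValueError("attn_offset must lie in [0, period)")
    return [
        "attention" if i % period == attn_offset else "mamba"
        for i in range(n_layers)
    ]


# 32 layers is an illustrative depth, not a figure from the source.
schedule = layer_schedule(n_layers=32)
attention_share = schedule.count("attention") / len(schedule)

# Share of parameters active per token, from the 12B / 52B figures above.
active_share = 12 / 52

print(schedule[:8])
print(f"attention share: {attention_share:.3f}")
print(f"active parameter share: {active_share:.3f}")
```

Changing `attn_offset` moves the attention layer within each block without changing the 1-in-8 share.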

What it means for agent engineering

Longer effective context

A 256K window reduces compaction pressure: long-running agents can keep more history in context before they have to summarize or drop it

Higher throughput, lower cost

Mamba layers avoid attention's quadratic cost in sequence length, so each pass through the agent loop gets cheaper on long contexts

Early validation of Mamba-3's prediction

Cartesia’s “hybrid architectures will dominate” call gets its early production proof
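
The throughput point can be illustrated with a toy cost model: token mixing in an attention layer scales with the square of the sequence length, while a Mamba layer scales linearly. The sketch below counts abstract cost units only. It ignores constants, hidden size, and the MoE layers, and the 32-layer depth and the name `mixing_cost` are assumptions, so the numbers show a trend and not measured speedups.

```python
def mixing_cost(seq_len: int, n_layers: int = 32, attn_every: int = 8) -> dict[str, float]:
    """Toy token-mixing cost for a pure Transformer vs. a 1-in-`attn_every` hybrid."""
    n_attn = n_layers // attn_every        # attention layers in the hybrid
    n_mamba = n_layers - n_attn            # remaining layers are Mamba
    pure = n_layers * seq_len ** 2         # every layer pays the quadratic cost
    hybrid = n_attn * seq_len ** 2 + n_mamba * seq_len
    return {"pure": pure, "hybrid": hybrid, "hybrid_share": hybrid / pure}


for n in (1_000, 16_000, 256_000):
    print(n, round(mixing_cost(n)["hybrid_share"], 4))
```

In this model the hybrid's share of the pure-Transformer cost approaches 1/8 as the sequence grows, because the remaining attention layers dominate.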

→ SSM Hybrid Architecture · Context Management · Long-Running Agents

AI21 Blog (2024-03-28)

Source: sources/ai21-jamba.md
URL: https://www.ai21.com/blog/announcing-jamba/
Author: AI21 Editorial Team
Published: 2024-03-28

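Summary

AI21 released Jamba, the first production-grade Mamba-Transformer hybrid model. It combines SSM (Mamba) layers with Transformer attention layers and MoE (mixture-of-experts) layers to optimize throughput, memory efficiency, and quality together.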

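Architecture innovations

Block-layer structure: in every 8 layers, 1 is a Transformer attention layer and the rest are Mamba layers
MoE integration: 52B total parameters with only 12B active at inference, making it more efficient in active parameters than a pure Transformer of the same size
Long context: 256K context window; a single 80GB GPU can hold 140K context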
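
One way to see why the 1-in-8 attention ratio helps long contexts fit in memory: only attention layers keep a key-value cache that grows with context length. The sketch below does that arithmetic under assumed dimensions (32 layers in total, 8 KV heads, head size 128, 16-bit values). None of these dimensions come from the source, the helper `kv_cache_bytes` is hypothetical, and the fixed-size state of the Mamba layers is ignored.

```python
def kv_cache_bytes(n_attn_layers: int, seq_len: int, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2: one tensor for keys and one for values in each attention layer.
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


CONTEXT = 256 * 1024   # 256K tokens
TOTAL_LAYERS = 32      # assumed depth, not from the source
GIB = 1024 ** 3

pure_transformer = kv_cache_bytes(TOTAL_LAYERS, CONTEXT)   # every layer is attention
hybrid = kv_cache_bytes(TOTAL_LAYERS // 8, CONTEXT)        # 1 attention layer in 8

print(f"pure Transformer: {pure_transformer / GIB:.0f} GiB")
print(f"1-in-8 hybrid:    {hybrid / GIB:.0f} GiB")
```

Under these assumptions the cache shrinks by the same factor of 8 as the number of attention layers, whatever the context length.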

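Performance highlights

Throughput is 3× that of Mixtral 8x7B in long-context scenarios
Matches or exceeds the state of the art among models of similar size on several benchmarks
Open-sourced under Apache 2.0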

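Relation to other architecture sources

Jamba is an early validation of the Mamba-3 paper's judgment that "hybrid architectures outperform pure models." Mamba-3 goes further and predicts that hybrid architectures will become mainstream.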

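Implications for agent engineering

The long context and high throughput of the hybrid SSM-Transformer architecture directly benefit long-running agents and context management: a longer effective context means less need for compaction, and higher throughput means lower running cost.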
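
The link between window size and compaction can be sketched with a small simulation. It assumes a hypothetical agent that adds a fixed number of tokens per step and, when the next step would overflow the window, compacts its history down to a fixed summary size. The function `count_compactions` and all the numbers are illustrative assumptions, not measurements of any real agent.

```python
def count_compactions(total_steps: int, tokens_per_step: int,
                      window: int, keep_after_compaction: int) -> int:
    """Count how often a fixed-rate agent must compact its context."""
    if keep_after_compaction + tokens_per_step > window:
        raise ValueError("window too small for one step after compaction")
    used = 0
    compactions = 0
    for _ in range(total_steps):
        if used + tokens_per_step > window:
            # The next step would overflow: shrink history to a summary.
            used = keep_after_compaction
            compactions += 1
        used += tokens_per_step
    return compactions


small = count_compactions(500, 2_000, window=32_000, keep_after_compaction=8_000)
large = count_compactions(500, 2_000, window=256_000, keep_after_compaction=8_000)
print(f"32K window:  {small} compactions")
print(f"256K window: {large} compactions")
```

With these made-up numbers the 32K window compacts 41 times over 500 steps and the 256K window 3 times; real agents add tokens unevenly, so only the direction of the effect carries over.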

References

sources/ai21-jamba.md