Nemotron-3-Super-120B: Hybrid Mamba+MoE Model with 504K Token Retrieval

Nemotron-3-Super-120B achieved perfect needle-in-a-haystack retrieval at a 504K-token context length using a hybrid architecture that combines Mamba state-space layers with a mixture-of-experts (MoE) design. The result shows sustained long-context retrieval from a model that does not rely solely on standard transformer attention.

This supports hybrid state-space architectures as viable for production long-context workloads. It suggests that scaling attention-only transformers is not the only path to extended context windows, which could lower the compute floor for long-context inference and open other avenues for optimization.

For builders, this widens the set of viable architectures beyond transformer variants and may reduce per-token inference costs for long-context applications. Operators can evaluate Mamba-based hybrids as candidate replacements in document retrieval, code analysis, and extended reasoning tasks where transformer inference costs currently constrain deployment. Because state-space layers carry a fixed-size recurrent state rather than a key-value cache that grows with context length, moving from attention-only to hybrid models may require re-evaluating VRAM requirements and throughput characteristics in existing serving stacks.
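
For operators who want to run their own retrieval check before swapping models, the following is a minimal sketch of a needle-in-a-haystack test. It is not the harness behind the reported result. The filler text, the needle sentence, and the function names (build_haystack, run_needle_test, stub_generate) are all hypothetical, and the haystack is sized in sentences rather than tokens.

```python
from typing import Callable, Sequence

# Illustrative constants; none of these come from the reported evaluation.
FILLER = "The quick brown fox jumps over the lazy dog."
NEEDLE = "The secret passcode is 7421."
QUESTION = "What is the secret passcode?"
EXPECTED = "7421"


def build_haystack(n_sentences: int, depth: float) -> str:
    """Return n_sentences of filler with NEEDLE inserted at a relative depth in [0, 1]."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    sentences = [FILLER] * n_sentences
    sentences.insert(round(depth * n_sentences), NEEDLE)
    return " ".join(sentences)


def run_needle_test(
    generate: Callable[[str], str],
    n_sentences: int,
    depths: Sequence[float],
) -> float:
    """Return the fraction of depths at which the model's answer contains EXPECTED."""
    hits = 0
    for depth in depths:
        prompt = build_haystack(n_sentences, depth) + "\n\n" + QUESTION
        if EXPECTED in generate(prompt):
            hits += 1
    return hits / len(depths)


def stub_generate(prompt: str) -> str:
    """Stand-in for a real model call: recovers the passcode by string search."""
    marker = "passcode is "
    start = prompt.find(marker)
    if start == -1:
        return ""
    begin = start + len(marker)
    return prompt[begin:begin + len(EXPECTED)]


if __name__ == "__main__":
    score = run_needle_test(stub_generate, 1000, [0.0, 0.25, 0.5, 0.75, 1.0])
    print(f"retrieval accuracy: {score:.2f}")
```

To test a real model, replace stub_generate with a call to the serving endpoint under evaluation and size the haystack with that model's tokenizer so the prompt reaches the target context length. Sweeping both depth and length shows where retrieval degrades, which a single pass/fail score does not.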