According to 1M AI News monitoring, Meituan's Longmao team has open-sourced LongCat-Next, a native multimodal model built on an MoE architecture with 3 billion activated parameters. It unifies five capabilities—text, visual understanding, image generation, speech understanding, and speech synthesis—within a single autoregressive framework. Both the model and its accompanying tokenizers are released under the MIT license, with weights available on HuggingFace.
LongCat-Next's core design is the DiNA (Discrete Native Autoregressive) paradigm: paired tokenizers and decoders are designed for each modality, converting visual and audio signals into discrete tokens that share the same embedding space as text, so that all tasks are completed through unified next-token prediction. The key component on the visual side, dNaViT (Discrete Native Resolution Vision Transformer), extracts image features into "visual words" and supports dynamic tokenization and decoding. It maintains strong image-generation quality even at a 28× compression ratio, and excels particularly at text rendering.
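The shared-embedding-space idea behind DiNA can be sketched as a single token id space in which each modality's discrete codes are offset past the text vocabulary, so one autoregressive model predicts any modality as the next token. This is a minimal toy illustration; the vocabulary sizes, offsets, and helper names below are illustrative assumptions, not LongCat-Next's actual configuration.

```python
# Toy sketch of a unified discrete-token vocabulary (all sizes are
# illustrative assumptions, not LongCat-Next's real configuration).

TEXT_VOCAB = 32_000      # ordinary text tokens occupy ids [0, 32000)
VISUAL_CODEBOOK = 8_192  # discrete "visual words" from an image tokenizer
AUDIO_CODEBOOK = 4_096   # discrete codes from a speech tokenizer

# Offset each modality's codebook into one shared id space.
VISUAL_OFFSET = TEXT_VOCAB
AUDIO_OFFSET = TEXT_VOCAB + VISUAL_CODEBOOK

def visual_token(code: int) -> int:
    """Map an image-tokenizer codebook index into the unified vocabulary."""
    assert 0 <= code < VISUAL_CODEBOOK
    return VISUAL_OFFSET + code

def audio_token(code: int) -> int:
    """Map a speech-tokenizer codebook index into the unified vocabulary."""
    assert 0 <= code < AUDIO_CODEBOOK
    return AUDIO_OFFSET + code

# A mixed prompt: text ids, then discrete visual words, then more text.
# Because every id lives in the same space, plain next-token prediction
# covers understanding and generation for all modalities at once.
sequence = [17, 512, 9] + [visual_token(c) for c in (5, 300, 8191)] + [42]
print(sequence)
```

In this framing, paired decoders (the counterpart of each tokenizer) would map predicted visual or audio ids back into pixels or waveforms, which is what lets a single next-token objective serve both understanding and generation.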
LongCat-Next was benchmarked against models with a similar activated-parameter count (A3B).
In cross-model comparisons of unified understanding-and-generation models, LongCat-Next's MMMU score of 70.6 surpasses the second-place NEO-unify (68.9) and significantly exceeds earlier unified-model approaches such as BAGEL (55.3) and Ovis-U1 (51.1). Its SWE-Bench score of 43.0 and its results on the Tau2 series of tool-invocation benchmarks also demonstrate that this unified multimodal architecture does not sacrifice pure-text or agent capabilities.