Beyond Tokens: Large Concept Models for Multilingual, Scalable AI
The paper introduces Large Concept Models (LCMs), a novel generative AI architecture that operates on higher-level semantic units, termed “concepts,” instead of individual tokens, aiming to mimic human-like hierarchical reasoning. Leveraging the SONAR embedding space, which covers text in 200 languages and also accepts speech as an input modality, LCMs perform autoregressive sentence-level prediction and show strong zero-shot generalization across languages. The study explores three approaches to predicting the next embedding: Base-LCM (regression-based), diffusion-based models, and quantized models, with the diffusion variants delivering the best coherence and diversity. By operating at the sentence level, LCMs attend over far shorter sequences than token-level models, reducing computational complexity while enabling scalable, language-agnostic reasoning. Despite challenges such as reliance on a frozen, pre-trained embedding space and slightly lower fluency than token-level models, LCMs mark a significant step toward more abstract, human-like AI reasoning and broader multilingual accessibility.
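To make the core idea concrete, the sketch below shows a Base-LCM-style setup in plain PyTorch: sentences are mapped to fixed-size “concept” embeddings by a frozen encoder (stubbed here with random vectors standing in for SONAR), and a small causal transformer regresses the next concept embedding with an MSE objective. The encoder stub, the BaseLCM class, the embedding dimension, and all hyperparameters are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

CONCEPT_DIM = 1024  # SONAR embeddings are fixed-size; 1024 is an illustrative choice


def embed_sentences(sentences):
    """Stub for a frozen multilingual sentence encoder such as SONAR.

    Returns one fixed-size "concept" vector per sentence; random values
    stand in for real embeddings so the sketch runs end to end.
    """
    torch.manual_seed(0)
    return torch.randn(len(sentences), CONCEPT_DIM)


class BaseLCM(nn.Module):
    """Causal transformer over concept vectors that regresses the next embedding."""

    def __init__(self, dim=CONCEPT_DIM, num_layers=2, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(dim, dim)  # maps hidden state to the predicted next concept

    def forward(self, concepts):
        # Causal mask so position t only attends to concepts 1..t
        seq_len = concepts.size(1)
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        hidden = self.backbone(concepts, mask=mask)
        return self.head(hidden)


# Toy next-concept training step: predict sentence t+1's embedding from sentences 1..t.
sentences = ["The cat sat on the mat.", "It began to purr.", "Soon it fell asleep."]
concepts = embed_sentences(sentences).unsqueeze(0)          # shape (1, T, D)

model = BaseLCM()
predicted = model(concepts[:, :-1])                          # predictions for positions 2..T
loss = nn.functional.mse_loss(predicted, concepts[:, 1:])    # Base-LCM-style regression loss
print(f"next-concept MSE loss: {loss.item():.4f}")
```

The diffusion-based and quantized variants described in the paper keep this same sequence of concept embeddings but swap the plain MSE regression head for a denoising objective or discrete-code prediction, respectively.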
Key Contributions
1. Concept-Level Reasoning: The paper introduces the “Large Concept Model” (LCM), which operates at a higher level of semantic abstraction, predicting concepts rather than tokens. This contrasts with current LLMs, which process language token by token.