Large Concept Models (LCM): Beyond Tokens, Multilingual, Scalable AI
The paper introduces Large Concept Models (LCMs), a novel generative AI architecture that operates on higher-level semantic units, termed "concepts," instead of individual tokens, with the aim of mimicking human-like hierarchical reasoning. Leveraging the SONAR embedding space, which supports multilingual and multimodal inputs across 200 languages, LCMs perform autoregressive sentence-level prediction and achieve strong zero-shot generalization. The study explores three approaches to predicting the next embedding: Base-LCM (regression-based), diffusion-based models, and quantized models, with the diffusion variants delivering the best coherence and diversity. Because an LCM operates on sentences rather than tokens, it processes far shorter sequences, reducing computational complexity while enabling scalable, language-agnostic reasoning. Despite challenges such as reliance on a frozen pre-trained embedding space and slightly lower fluency than token-level models, LCMs mark a significant step toward more abstract, human-like AI reasoning and multilingual accessibility.

Key Contributions
1. Concept-Level Reasoning: The paper introduces a “Large Concept Model” (LCM) that operates at a higher semantic abstraction level (concepts) rather than tokens. This contrasts with current LLMs that process language token-by-token.
2. SONAR Embedding Space: The LCM leverages SONAR, a sentence embedding space supporting 200 languages, enabling it to reason in a language- and modality-agnostic manner.
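The core idea behind language-agnostic reasoning can be illustrated with a toy sketch. This is not the real SONAR API; the sentences and vectors below are hypothetical stand-ins for what a SONAR-like encoder produces, chosen to show that translations of the same sentence land near each other in the shared space, so a model reasoning over vectors never sees which language the input was in.

```python
# Toy illustration (NOT the real SONAR API): a shared sentence-embedding
# space where translations map to nearly the same vector.

# Hypothetical "concept" vectors for three sentences.
CONCEPT_SPACE = {
    "The cat sat on the mat.": [0.91, 0.10, 0.33],           # English
    "Le chat est assis sur le tapis.": [0.90, 0.11, 0.32],   # French translation
    "Stocks fell sharply today.": [0.05, 0.88, 0.41],        # unrelated sentence
}

def encode(sentence: str) -> list[float]:
    """Stand-in for a SONAR-like encoder: sentence -> embedding."""
    return CONCEPT_SPACE[sentence]

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return dot / (norm_a * norm_b)

en = encode("The cat sat on the mat.")
fr = encode("Le chat est assis sur le tapis.")
other = encode("Stocks fell sharply today.")

# Translations land close together; unrelated sentences do not.
print(round(cosine(en, fr), 3))     # high similarity
print(round(cosine(en, other), 3))  # low similarity
```

Any downstream model that consumes these vectors is automatically multilingual: adding a language only requires extending the encoder, not retraining the reasoning model.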
3. Autoregressive Sentence Prediction: LCM models the next-sentence prediction task in the SONAR space through various methods, including:
- Base-LCM: directly regresses the next sentence embedding with an MSE loss in embedding space.
- Diffusion-based LCMs: iteratively denoise a noisy embedding, conditioned on the preceding sentence embeddings, following a noise schedule.
- Quantized LCMs: discretize embeddings with residual vector quantization and predict the resulting discrete units.
4. Scaling and Performance: Models were trained with up to 7 billion parameters and over 2.7T tokens. Experimental results demonstrate strong zero-shot generalization across tasks and languages.
5. Open-source Commitment: The training code and SONAR encoders/decoders are openly available to facilitate further research.
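The Base-LCM objective from point 3 above can be sketched in miniature. This is a deliberate simplification, not the paper's Transformer: a linear model, hypothetical 2-d "sentence embeddings," and hand-written SGD, just to show the shape of the task: given the embedding of sentence t, predict the embedding of sentence t+1 under an MSE loss.

```python
# Minimal sketch of the Base-LCM training objective (a toy simplification,
# not the paper's architecture): next-sentence-embedding regression.

def mse(pred, target):
    """Mean-squared error between two vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def predict(W, x):
    """Linear stand-in for the decoder-only Transformer: y = W @ x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

# Toy "document": a sequence of 2-d sentence embeddings.
doc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

W = [[0.0, 0.0], [0.0, 0.0]]  # model parameters
lr = 0.1

for _ in range(200):  # plain SGD on the next-embedding regression loss
    for x, y in zip(doc, doc[1:]):
        pred = predict(W, x)
        # d(MSE)/dW[i][j] = 2 * (pred[i] - y[i]) * x[j] / dim
        for i in range(2):
            for j in range(2):
                W[i][j] -= lr * 2 * (pred[i] - y[i]) * x[j] / 2

loss = sum(mse(predict(W, x), y) for x, y in zip(doc, doc[1:]))
print(round(loss, 6))  # near zero after training
```

The diffusion and quantized variants replace this single regression step with, respectively, an iterative denoising process over the target embedding and classification over discrete residual-quantization codes, but the autoregressive sentence-level framing stays the same.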