The Mamba Effect: State Space Models Taking on Transformers
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Table of Contents
- Large Language Models, Transformers, and the Fundamental Bottleneck
- Mamba Dissection: A Top-Down Approach
  - Linear-Time Decoding
  - State Space Model Foundation
  - Selective State Spaces
- Mamba Empirical Performance
  - Mamba is Faster than Transformers
  - Mamba Scales Linearly up to a Million Tokens
  - State-of-the-art Performance
  - Rivaling Transformers in Language Modeling
  - Ablation Studies
- Final Thoughts
- Appendix
  - Explanation of the SSM Discretization Formula
  - SSM CNN-RNN View Equivalence
  - Mamba Relation to Gated RNN
Large Language Models, Transformers, and the Fundamental Bottleneck
Large Language Models (LLMs) are pretrained on massive datasets with the ambition of reaching Artificial General Intelligence (AGI). As an unwritten rule, the Transformer [9] architecture is the backbone of LLMs, thanks to the rich representations its attention layers capture. These layers give the model direct access to every past input at any point during processing. However, this capability comes at a cost: O(L²) computational complexity, where L is the number of timesteps (tokens) the Transformer must process.
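To make the bottleneck concrete, here is a minimal sketch of single-head scaled dot-product attention in PyTorch; the function name, tensor shapes, and sizes are illustrative assumptions, not taken from the paper. The point is the (L, L) score matrix: both its computation and its storage grow quadratically with sequence length.

```python
import torch


def single_head_attention(q, k, v):
    # q, k, v: (L, d) tensors for one attention head.
    d = q.shape[-1]
    # Every query attends to every key: an (L, L) score matrix.
    # This matrix is the O(L^2) bottleneck in both compute and memory.
    scores = (q @ k.T) / (d ** 0.5)
    weights = torch.softmax(scores, dim=-1)
    return weights @ v  # (L, d)


# Illustrative sizes: doubling L quadruples the size of the score matrix.
L, d = 1024, 64
q, k, v = (torch.randn(L, d) for _ in range(3))
out = single_head_attention(q, k, v)
print(out.shape)  # torch.Size([1024, 64])
```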