Cheap Large Language Models via Eliminating Matrix Multiplications
Scalable MatMul-free Language Modeling
Table of Contents
Why Are Large Language Models Expensive?
Background: BitLinear Layer
MatMul-free Architectures
MatMul-free Channel Mixer
MatMul-free Token Mixer
Important Training Tricks
Empirical Performance
Appendix
- Time Complexity of Parallel Matrix Multiplication
Why Are Large Language Models Expensive?
Large Language Models (LLMs) have many layers. For instance, the largest GPT-3 model, with 175 billion parameters, uses 96 attention layers, each with 96 heads of dimension 128. Each attention layer is paired with a feed-forward block, and both rely heavily on matrix multiplication (MatMul) operations, so MatMul dominates the overall computational cost.
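To make this concrete, here is a rough back-of-the-envelope FLOP count for a single Transformer block. This is a sketch under assumed GPT-3-like dimensions (the 4x feed-forward expansion and the 2048-token context are assumptions, not taken from the paper); the point is simply that every dominant term is a MatMul.

```python
# Rough per-token MatMul FLOP count for one Transformer block,
# using GPT-3-175B-like dimensions (assumed for illustration).
d_model = 12288          # 96 heads * 128 dims per head
d_ffn   = 4 * d_model    # usual 4x feed-forward expansion (assumption)
seq_len = 2048           # context length, used only for the attention-score term

# Attention projections (Q, K, V, output): four [d_model x d_model] MatMuls.
attn_proj_flops = 4 * 2 * d_model * d_model
# Attention scores and weighted sum: two MatMuls over the sequence per token.
attn_score_flops = 2 * 2 * seq_len * d_model
# Feed-forward block: two MatMuls, [d_model x d_ffn] and [d_ffn x d_model].
ffn_flops = 2 * 2 * d_model * d_ffn

total = attn_proj_flops + attn_score_flops + ffn_flops
print(f"per-token MatMul FLOPs in one block: {total / 1e9:.2f} GFLOPs")
print(f"  attention projections: {attn_proj_flops / total:.1%}")
print(f"  attention scores/sum:  {attn_score_flops / total:.1%}")
print(f"  feed-forward:          {ffn_flops / total:.1%}")
```

Running this sketch shows that essentially all of the block's compute sits inside MatMuls, which is why removing them is attractive.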
To recap what MatMul is, consider the example below: