Cheap Large Language Models via Eliminating Matrix Multiplications


Table of Contents

  • Why Are Large Language Models Expensive?

  • Background: BitLinear Layer

  • MatMul-free Architectures

    • MatMul-free Channel Mixer

    • MatMul-free Token Mixer

    • Important Training Tricks

  • Empirical Performance

  • Appendix

    • Time Complexity of Parallel Matrix Multiplication

(Figure: an LLM cuts out MatMul)

Why Are Large Language Models Expensive?

Large Language Models (LLMs) have many layers. For instance, the largest GPT-3 model, with 175 billion parameters, uses 96 attention layers, each with 96 heads of dimension 128. Every attention layer is paired with a feed-forward block, and both rely almost entirely on matrix multiplication (MatMul), so MatMul dominates the overall computational cost.
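As a rough back-of-the-envelope sketch of where that cost goes, the snippet below counts the MatMul multiply-add operations per token in a single GPT-3-scale transformer block. The 4x feed-forward expansion and the sequence length of 2048 are standard assumptions, not values taken from this article.

```python
# Per-token MatMul FLOP count for one GPT-3-scale transformer block.
# d_model = 96 heads * 128 dims; the 4x FF expansion and seq_len are assumptions.
d_model = 96 * 128          # 12288
d_ff = 4 * d_model          # 49152
seq_len = 2048              # context length (assumed)

# Attention projections: Q, K, V, and output, each a d_model x d_model MatMul.
attn_proj_flops = 4 * 2 * d_model * d_model

# Attention scores and weighted value sum: two (seq_len x d_model) MatMuls per token.
attn_score_flops = 2 * 2 * seq_len * d_model

# Feed-forward block: two MatMuls, d_model -> d_ff -> d_model.
ff_flops = 2 * 2 * d_model * d_ff

total = attn_proj_flops + attn_score_flops + ff_flops
print(f"MatMul FLOPs per token per block: {total / 1e9:.2f} G")
print(f"Share from projections + feed-forward: {(attn_proj_flops + ff_flops) / total:.1%}")
```

Even in this rough count, the dense-MatMul terms (projections and feed-forward) account for well over 90% of the per-block work.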

To recap what MatMul is, consider the example below:
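Here is a minimal NumPy sketch; the specific matrices are arbitrary placeholders. Each output entry is the dot product of a row of A with a column of B, which is why MatMul costs on the order of n^3 multiply-adds for n x n inputs.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])          # shape (2, 3)
B = np.array([[7,  8],
              [9, 10],
              [11, 12]])           # shape (3, 2)

C = A @ B                          # equivalent to np.matmul(A, B), shape (2, 2)
print(C)                           # [[ 58  64]
                                   #  [139 154]]
```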
