Dobi-SVD: Differentiable SVD for LLM Compression and Some New Perspectives
Large language models (LLMs) have sparked a new wave of AI applications; however, their substantial computational costs and memory demands pose significant challenges to democratizing access to LLMs for a broader audience. Singular Value Decomposition (SVD), a technique studied for decades, offers a hardware-independent and flexibly tunable solution for LLM compression. In this paper, we present new directions using SVD: we first theoretically and experimentally analyze the optimality of directly truncating activations, and then identify three key issues in SVD-based LLM compression: (1) How can we determine the optimal truncation position for each layer in LLMs? (2) How can we efficiently update the weight matrices based on truncated activations? (3) How can we address the inherent 'injection' nature that results in the information loss of the SVD? We propose a new paradigm for SVD-based LLM compression, Dobi-SVD, to tackle these three issues. First, we propose a differentiable truncation mechanism, along with gradient-robust backpropagation, enabling the model to adaptively find the optimal truncation positions. Next, we use the Eckart-Young-Mirsky theorem to derive a theoretically optimal weight update formula. Lastly, by observing and leveraging the quantization-friendly nature of matrices after SVD, we reconstruct the mapping between truncation positions and memory requirements, establishing a bijection from truncation positions to memory. Experimental results show that with a 40% parameter-compression rate, our method achieves a perplexity of 9.07 on the Wikitext2 dataset with the compressed LLaMA-7B model, a 78.7% improvement over the state-of-the-art SVD-based LLM compression method. We emphasize that Dobi-SVD is the first to achieve such high-ratio LLM compression while maintaining competitive performance.
We also extend Dobi-SVD to vision-language models (VLMs) and vision-language-action models (VLAs), highlighting its generalizability and practical value. We hope that the inference speedup (up to 12.4x on 12GB NVIDIA Titan Xp GPUs and 3x on 80GB A100 GPUs for LLMs, and 1.2x and 1.17x on 80GB A100 GPUs for VLMs and VLAs, respectively) will bring significant benefits to the broader community, including areas such as multi-modal learning and robotics.
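For readers less familiar with the building blocks named above, the following is a minimal NumPy sketch of plain rank-k weight truncation, the baseline that SVD-based compression starts from. It is not the Dobi-SVD procedure itself, which truncates activations and learns the truncation positions. The function name `truncate_svd`, the matrix shape, and the rank `k` are illustrative assumptions. The sketch checks the Eckart-Young-Mirsky property that the Frobenius error of the best rank-k approximation equals the norm of the discarded singular values, and it shows why storing two factors only saves parameters when k is below mn/(m+n).

```python
import numpy as np

def truncate_svd(W, k):
    """Return factors A (m x k), B (k x n) of the rank-k truncation of W,
    plus the full vector of singular values (hypothetical helper)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :k] * s[:k]   # scale the kept left singular vectors
    B = Vt[:k, :]          # kept right singular vectors
    return A, B, s

# Illustrative random "weight" matrix; shape and rank are assumptions.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 48))
k = 16

A, B, s = truncate_svd(W, k)

# Eckart-Young-Mirsky: the rank-k truncation is the best rank-k
# approximation, and its Frobenius error is the norm of the dropped
# singular values.
err = np.linalg.norm(W - A @ B, ord="fro")
expected = np.sqrt(np.sum(s[k:] ** 2))

# Storage: the dense matrix needs m*n parameters, the factors k*(m+n).
m, n = W.shape
dense_params = m * n
factored_params = k * (m + n)
break_even = m * n / (m + n)   # ranks above this no longer save memory

print(f"error={err:.4f}, expected={expected:.4f}")
print(f"dense={dense_params}, factored={factored_params}, break-even rank={break_even:.2f}")
```

In this sketch any rank above the break-even value costs more parameters than the dense matrix, so only part of the range of truncation positions corresponds to actual memory savings. This is one way to read the mismatch between truncation positions and memory that the abstract's third question refers to.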