Semantic Scholar, 2025

TransMLA: Multi-Head Latent Attention Is All You Need

Fanxu Meng, Pingzhi Tang, Xiaojuan Tang, Zengwei Yao, Xing Sun, Muhan Zhang

Foundation Models · Large Language Models · Optimization · Popular and Landmark Papers

In this paper, we present TransMLA, a framework that seamlessly converts any GQA-based pre-trained model into an MLA-based model. The converted models are directly compatible with DeepSeek's codebase, allowing them to fully leverage the DeepSeek-specific optimizations available in inference frameworks such as vLLM and SGLang. By compressing 93% of the KV cache in LLaMA-2-7B, TransMLA achieves a 10.6x inference speedup at an 8K context length while preserving meaningful output quality. Moreover, the converted model requires only 6 billion tokens of fine-tuning to recover performance on par with the original across multiple benchmarks. TransMLA thus offers a practical path for migrating GQA-based models to the MLA architecture; when combined with DeepSeek's advanced features, such as FP8 quantization and Multi-Token Prediction, even greater inference acceleration can be achieved.
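For intuition, below is a minimal, hedged sketch (not the authors' released code) of the low-rank idea behind a GQA-to-MLA conversion: because GQA shares each key/value head across a group of query heads, the effective K/V projection carries redundant structure, so it can be factored with a truncated SVD into a down-projection that produces a small latent (the only thing that needs to be cached) and an up-projection that reconstructs per-head keys or values at attention time. All names, shapes, and the latent dimension here are illustrative assumptions, not values from the paper.

```python
# Hedged sketch of MLA-style KV-cache compression via low-rank factorization.
# Shapes and the latent_dim value are illustrative assumptions.
import torch

def factorize_kv_projection(w_kv: torch.Tensor, latent_dim: int):
    """Factor a (d_model, kv_out) projection into down/up matrices.

    w_kv:       original key (or value) projection weight, shape (d_model, kv_out).
    latent_dim: rank r of the shared latent cached in place of full K/V.

    Returns (w_down, w_up) with w_down @ w_up ~= w_kv; only the latent
    c = x @ w_down (shape (..., latent_dim)) is stored in the KV cache.
    """
    u, s, vh = torch.linalg.svd(w_kv, full_matrices=False)
    r = min(latent_dim, s.numel())
    w_down = u[:, :r] * s[:r]   # (d_model, r): maps hidden states to the cached latent
    w_up = vh[:r, :]            # (r, kv_out): reconstructs keys/values from the latent
    return w_down, w_up

# Toy usage with a LLaMA-2-7B-like hidden size, caching a small latent
# instead of the full keys.
d_model, kv_out = 4096, 4096
w_k = torch.randn(d_model, kv_out) / d_model**0.5
w_down, w_up = factorize_kv_projection(w_k, latent_dim=512)

x = torch.randn(2, 16, d_model)   # (batch, seq, d_model)
k_full = x @ w_k                  # what a standard cache would store
latent = x @ w_down               # what an MLA-style cache stores (much smaller)
k_approx = latent @ w_up          # keys reconstructed from the latent at attention time
print(latent.shape, k_approx.shape)
```

The actual TransMLA conversion involves more than this single factorization (for example, handling the grouped-head structure and positional embeddings), and the brief fine-tuning reported above recovers any quality lost in the compression.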

22 citations · 2 influential citations
