Introduction to Kimi K2: Advanced Mixture-of-Experts Model
Meet Kimi K2, a state-of-the-art open-source Mixture-of-Experts (MoE) model developed by Moonshot AI. Kimi K2 is designed for advanced reasoning, complex mathematics, and sophisticated code generation, making it ideal for agentic workflows and multi-step operations beyond simple question answering.
Moonshot AI offers two open-source versions of Kimi K2:
- Kimi K2 Base: The foundational model for custom fine-tuning.
- Kimi K2 Instruct: An instruction-tuned conversational model, post-trained (including reinforcement learning) for superior dialogue and agentic use.
For more details, visit the official Kimi K2 release page.
Why Mixture-of-Experts Architecture Matters
The Mixture-of-Experts (MoE) architecture in Kimi K2 delivers significant efficiency gains by activating only a small subset of experts for each token (8 of 384 routed experts), so only a fraction of the total parameters participate in any forward pass. The trade-off is deployment complexity: expert routing adds communication, and naive placement leads to uneven GPU utilization. Deploying at scale lets the serving stack batch and distribute expert work intelligently, keeping hardware busy and lowering inference cost per token.
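To make the sparse-activation idea concrete, here is a minimal top-k gating sketch in PyTorch. The expert count (384) and top-k (8) follow Kimi K2's published configuration, but the function itself is a generic illustration of MoE routing, not the model's actual router code.
import torch
NUM_EXPERTS = 384   # routed experts in Kimi K2
TOP_K = 8           # experts activated per token
def route_tokens(router_logits: torch.Tensor):
    """Generic top-k MoE gating: pick TOP_K experts per token and
    normalize their gate weights.
    In: [num_tokens, NUM_EXPERTS]; out: two [num_tokens, TOP_K] tensors."""
    topk_logits, topk_ids = torch.topk(router_logits, TOP_K, dim=-1)
    gate_weights = torch.softmax(topk_logits, dim=-1)
    return topk_ids, gate_weights
# Example: 16 tokens, each touching only 8 of the 384 experts.
logits = torch.randn(16, NUM_EXPERTS)
expert_ids, weights = route_tokens(logits)
print(expert_ids.shape, weights.sum(dim=-1))  # torch.Size([16, 8]), each row sums to ~1.0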
Key Challenges in Kimi K2 Deployment
Deploying a trillion-parameter Mixture-of-Experts model like Kimi K2 requires addressing several technical hurdles:
- Model Size: The weights alone exceed any single GPU's memory, so inference must be distributed across many GPUs (a rough sizing check follows this list).
- Routing Complexity: The router must pick the top experts for every token from a pool of 384, which adds scheduling and communication overhead.
- Load Balancing: Uneven expert activation creates hot spots; a few overloaded GPUs can stall the whole batch while others sit idle.
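To put the first point in numbers, here is a rough, weights-only sizing check. The assumptions are loose: roughly one trillion total parameters, one byte per parameter for FP8 storage, and 141 GB of HBM per H200; KV cache, activations, and runtime overhead are ignored.
# Back-of-the-envelope weight-memory estimate for a ~1T-parameter MoE.
TOTAL_PARAMS = 1.0e12          # ~1 trillion parameters (order of magnitude)
BYTES_PER_PARAM = 1            # FP8 storage, ignoring scales and overhead
H200_HBM_GB = 141              # HBM capacity of a single H200
weights_gb = TOTAL_PARAMS * BYTES_PER_PARAM / 1e9
print(f"~{weights_gb:.0f} GB of weights vs {H200_HBM_GB} GB per GPU "
      f"-> at least {weights_gb / H200_HBM_GB:.0f}x one GPU, before KV cache")
# ~1000 GB of weights vs 141 GB per GPU -> at least 7x one GPU, before KV cache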
Scalable Kimi K2 Deployment with OME and SGLang
To overcome these challenges, the team deployed Kimi K2 across 128 H200 GPUs using OME (Open Model Engine) and SGLang. Together, these tools streamline distributed inference, parallelism, and runtime orchestration.
OME and SGLang: The Deployment Toolkit
In May 2025, the post "Deploying DeepSeek R1 with PD Decoupling and Large-Scale Expert Parallelism" introduced two techniques that carry over directly to Kimi K2:
- Prefill-Decode (PD) Decoupling: Separates compute-intensive prefill from memory-intensive decode for independent scaling.
- Large-Scale Expert Parallelism (EP): Distributes expert weights across the GPU cluster to overcome memory limits.
The OME blog post highlights model-driven deployment, a strategy that becomes essential when scaling from DeepSeek R1's 256 routed experts to Kimi K2's 384.
How to Deploy Kimi K2 with OME
OME simplifies deployment with a declarative, production-ready framework that abstracts parallelization, sharding, and runtime tuning.
Installation:
helm install ome oci://ghcr.io/sgl-project/ome/ome-chart --version 0.1.0 -n ome-system --create-namespace
For step-by-step instructions, see the OME Installation Guide.
Register Kimi K2 Models:
apiVersion: "sglang.ai/v1"
kind: "ClusterBaseModel"
metadata:
  name: "kimi-k2-instruct"
spec:
  source:
    huggingface:
      repo: "moonshotai/Kimi-K2-Instruct"
---
apiVersion: "sglang.ai/v1"
kind: "ClusterBaseModel"
metadata:
  name: "kimi-k2-base"
spec:
  source:
    huggingface:
      repo: "moonshotai/Kimi-K2-Base"
Tip: Customize the YAML for local storage if needed. OME will pull from Hugging Face by default, apply optimized parallelism, and verify file hashes.
Launch Inference Service:
kubectl apply -f - <<EOF
apiVersion: "sglang.ai/v1"
kind: "Serving"
metadata:
  name: "kimi-k2-instruct"
spec:
  baseModel: "kimi-k2-instruct"
  runtime: "sglang"
EOF
OME manages model downloading, runtime orchestration, and service endpoint creation for scalable, production-grade inference.
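Once the service is up, it can be queried like any SGLang server, which exposes an OpenAI-compatible HTTP API. The snippet below is a minimal sketch; the endpoint URL and model name are placeholders that depend on how OME exposes the service in your cluster.
import requests
# Placeholder: substitute the actual service endpoint created by OME.
BASE_URL = "http://kimi-k2-instruct.default.svc.cluster.local:8000"
resp = requests.post(
    f"{BASE_URL}/v1/chat/completions",
    json={
        "model": "kimi-k2-instruct",   # placeholder model name
        "messages": [{"role": "user",
                      "content": "Summarize expert parallelism in one sentence."}],
        "max_tokens": 128,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])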
System Design: Prefill-Decode Decoupling and Expert Parallelism
The deployment architecture combines Prefill-Decode (PD) decoupling and large-scale expert parallelism (EP), managed by OME. This declarative system handles GPU topology, distributed configurations, and runtime tuning.
Key Components
- Prefill-Decode Decoupling: Splits inference into two independently scalable services:
  - Prefill Service: Handles compute-intensive input processing and KV cache generation.
  - Decode Service: Handles memory-bound autoregressive output token generation.
- Expert Parallelism (EP=32): The 384 experts are distributed across 32 GPUs (12 experts per GPU); see the placement sketch after this list.
- Tensor Parallelism (TP=4): Within each group, non-expert layers are sharded with 4-way tensor parallelism.
- SGLang Router: Coordinates request routing, state management, and load balancing across the 128-GPU cluster.
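The arithmetic behind these numbers is easy to verify. The sketch below is pure bookkeeping that tiles 384 experts across 32 expert-parallel ranks and pairs them with 4-way tensor parallelism to reach 128 GPUs; the real placement is decided by SGLang and OME and may differ in detail.
NUM_EXPERTS, EP_SIZE, TP_SIZE, TOTAL_GPUS = 384, 32, 4, 128
assert EP_SIZE * TP_SIZE == TOTAL_GPUS          # 32 expert-parallel ranks x 4-way TP
experts_per_rank = NUM_EXPERTS // EP_SIZE       # 384 / 32 = 12 experts per EP rank
# Illustrative block-wise placement: EP rank r holds experts [12r, 12r + 11].
placement = {rank: list(range(rank * experts_per_rank, (rank + 1) * experts_per_rank))
             for rank in range(EP_SIZE)}
print(experts_per_rank, placement[0], placement[31][-1])   # 12 [0, ..., 11] 383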
Prefill-Decode (PD) Decoupling Explained
- Prefill Service: Optimized for compute; processes input sequences and builds the KV cache.
- Decode Service: Optimized for memory bandwidth; generates output tokens sequentially from that cache (a simplified sketch of the two phases follows).
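The split is easiest to see in code. Below is a deliberately simplified, framework-free sketch of the two phases: prefill runs once over the whole prompt and builds the KV cache, while decode reuses that cache and emits one token per step. The ToyModel stand-in is purely illustrative and has nothing to do with SGLang's implementation.
import random
class ToyModel:
    """Stand-in for a real model: forward() returns the updated KV cache
    (modeled here as a plain token list) and a fake next-token id."""
    def forward(self, tokens, kv_cache=None):
        kv_cache = (kv_cache or []) + list(tokens)  # prefill fills the cache; decode appends to it
        return kv_cache, random.randrange(32000)    # pretend we sampled from the logits
def generate(prompt_tokens, max_new_tokens, model):
    # Prefill: one compute-bound pass over the whole prompt builds the KV cache.
    kv_cache, next_token = model.forward(prompt_tokens, kv_cache=None)
    output = []
    # Decode: memory-bound loop that re-reads the growing cache, one token per step.
    for _ in range(max_new_tokens):
        output.append(next_token)
        kv_cache, next_token = model.forward([next_token], kv_cache=kv_cache)
    return output
print(generate(list(range(10)), max_new_tokens=5, model=ToyModel()))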
Large-Scale Expert Parallelism
- Expert Parallelism (EP=32): Efficiently manages 384 experts across GPUs.
- Tensor Parallelism (TP=4): Shards non-expert layers for further efficiency.
- Dynamic Load Balancing: The SGLang Router keeps computational load even across expert-parallel ranks and prevents bottlenecks (a quick imbalance check is sketched below).
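A simple way to see why balancing matters is to count how many token-to-expert assignments each expert-parallel rank receives in a batch. The sketch below assumes the block-wise placement from the earlier sketch and uses random routing decisions; it only measures imbalance and is unrelated to the router's actual balancing logic.
import torch
NUM_EXPERTS, EP_SIZE, TOP_K = 384, 32, 8
experts_per_rank = NUM_EXPERTS // EP_SIZE
# Random routing decisions for a batch of 4096 tokens (illustrative only).
expert_ids = torch.randint(NUM_EXPERTS, (4096, TOP_K))
rank_ids = expert_ids // experts_per_rank            # which EP rank each assignment lands on
load = torch.bincount(rank_ids.flatten(), minlength=EP_SIZE)
# Uniform routing stays near max/mean = 1; skewed routing pushes it up,
# and the busiest rank then gates the whole step.
print(f"max/mean load across ranks: {load.max().item() / load.float().mean().item():.2f}")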
Performance Benchmarks for Kimi K2 Deployment
On 128 H200 GPUs in a 1P1D configuration (one prefill group of 4 nodes and one decode group of 12 nodes), Kimi K2 achieved:
- Throughput: 4800 tokens/second
- Latency: Time to First Token (TTFT) under 1 second
- Concurrency: 480 concurrent requests
Optimal resource ratios depend on workload. For this benchmark, scaling decode nodes maximized KV cache capacity, supporting a batch size of 480.
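As a rough sanity check on what these figures mean per user, assuming the reported throughput is aggregate output-token throughput across the full batch:
throughput_tok_s = 4800      # aggregate output tokens per second (reported)
concurrency = 480            # concurrent requests / decode batch size (reported)
per_request = throughput_tok_s / concurrency
print(f"~{per_request:.0f} output tokens/s per request at full concurrency")  # ~10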
Key Results:
- High Throughput: Enabled by PD decoupling and expert parallelism.
- Low Latency: TTFT under one second due to optimized routing.
- High Concurrency: 480 simultaneous requests, suitable for production.
Future work will focus on long-context, agentic scenarios, leveraging Kimi K2's support for input lengths up to 50,000 tokens.
Conclusion: Production-Grade Kimi K2 Deployment
By integrating OME, SGLang, Prefill-Decode decoupling, and large-scale expert parallelism, the deployment of the trillion-parameter Kimi K2 model on 128 H200 GPUs achieved:
- High Throughput: 4800 tokens/second
- Low Latency: TTFT under 1 second
- High Concurrency: 480 concurrent requests
This was made possible through collaboration with Moonshot AI, SGLang, and NVIDIA DGX Cloud. Organizations can now leverage SGLang and OME to deploy Kimi K2 at scale for advanced AI workloads.
For more on distributed inference and MoE deployment, see our articles on Scaling Large Language Models and Expert Parallelism Best Practices.