LLM Internals
Dive deep into the core concepts, architectures, and inner workings of Large Language Models.
Latest Updates
How to Add Special Tokens to LLMs Safely
Learn how to add special tokens to LLMs during fine-tuning without causing catastrophic forgetting. Our guide covers smart initialization and PEFT/LoRA.
Multi-head Latent Attention (MLA) Explained
Learn about Multi-head Latent Attention (MLA) and how it improves on Multi-Query Attention (MQA). Discover Matrix Absorption and its impact on performance.
Alibaba's Qwen3-Next: A Deep Dive into its MoE Architecture
Explore the architecture of Alibaba's Qwen3-Next, a powerful large language model. Learn about its Mixture of Experts (MoE) design and performance.
Build a Llama-Style MoE Model From Scratch (Part 2)
Learn how to train a language model with this PyTorch training loop guide. Explore text generation, the AdamW optimizer, and Mixture of Experts models.
Model Foundations
What Are LLMs? A Guide to Generative AI
Discover what Large Language Models (LLMs) are and how they power Generative AI. This in-depth guide covers the Transformer architecture, prompt engineering, and more.
What Is a Transformer Model? An In-Depth Guide
Explore the Transformer model and its architecture. Learn about attention mechanisms and how Transformers power modern AI and Large Language Models (LLMs).
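For readers who want a concrete anchor before diving in, here is a minimal sketch of a pre-norm Transformer decoder block in PyTorch. The module sizes are illustrative and the block is deliberately simplified (no KV caching, no rotary embeddings); it is not the code from any specific model.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Minimal pre-norm Transformer decoder block (illustrative dimensions)."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x, attn_mask=None):
        # Self-attention sublayer with a residual connection
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + attn_out
        # Position-wise feed-forward sublayer with a residual connection
        x = x + self.mlp(self.norm2(x))
        return x

block = DecoderBlock()
print(block(torch.randn(2, 10, 512)).shape)  # torch.Size([2, 10, 512])
```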
Decoding Strategies for Large Language Models Explained
At the core of every large language model (LLM) is a sophisticated process for generating text. Instead of selecting words at random, the model...
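The excerpt above only hints at the mechanics. As a rough sketch of two common strategies applied to a vector of next-token logits (the logits here are random stand-ins, not the output of a real model):

```python
import torch
import torch.nn.functional as F

def greedy_step(logits):
    """Greedy decoding: always pick the single most likely next token."""
    return int(torch.argmax(logits, dim=-1))

def sample_step(logits, temperature=0.8, top_k=50):
    """Temperature-scaled top-k sampling: reshape the distribution with the
    temperature, then sample only among the k most likely tokens."""
    logits = logits / temperature
    topk_vals, topk_idx = torch.topk(logits, min(top_k, logits.numel()))
    probs = F.softmax(topk_vals, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)
    return int(topk_idx[choice])

# Toy next-token logits over a 10-token vocabulary (illustrative only)
logits = torch.randn(10)
print(greedy_step(logits), sample_step(logits))
```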
Architectures & Mechanisms
From DeepSeek-V3 to Kimi K2: Eight Modern Large Language Model Architecture Designs
This article dissects the architectural evolution of modern large language models in 2025, moving beyond benchmarks to analyze the core design choices of flagship open-source models. We explore key innovations like DeepSeek-V3's Multi-Head Latent Attention (MLA) and Mixture of Experts (MoE), OLMo 2's unique normalization strategies, Gemma 3's use of sliding window attention, and Llama 4's take on MoE. By focusing on these architectural blueprints, we gain a clearer understanding of the engineering priorities shaping the future of LLMs.
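To make one of the mechanisms above concrete, here is a small sketch of a causal sliding-window attention mask of the kind used by models such as Gemma 3. The window size is arbitrary and the code is illustrative, not any model's implementation.

```python
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask: query position i may attend to key positions j
    with i - window < j <= i (causal, limited to a local window)."""
    i = torch.arange(seq_len).unsqueeze(1)  # query positions
    j = torch.arange(seq_len).unsqueeze(0)  # key positions
    return (j <= i) & (j > i - window)

print(sliding_window_causal_mask(seq_len=6, window=3).int())
```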
Multi-head Latent Attention (MLA) Explained
Learn about Multi-head Latent Attention (MLA) and how it improves on Multi-Query Attention (MQA). Discover Matrix Absorption and its impact on performance.
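As a rough sketch of the core idea behind MLA, low-rank compression of the KV cache, the simplified module below squeezes each hidden state into a small latent vector and up-projects keys and values from it. It omits RoPE decoupling and matrix absorption, and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SimplifiedLatentKV(nn.Module):
    """Low-rank KV compression in the spirit of MLA (greatly simplified).
    Only the small latent `c` needs to be cached per token."""
    def __init__(self, d_model=512, n_heads=8, d_head=64, d_latent=128):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_head
        self.down = nn.Linear(d_model, d_latent, bias=False)           # compress
        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand to keys
        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand to values

    def forward(self, x):                      # x: (batch, seq, d_model)
        c = self.down(x)                       # (batch, seq, d_latent) -- the cached part
        b, t, _ = x.shape
        k = self.up_k(c).view(b, t, self.n_heads, self.d_head)
        v = self.up_v(c).view(b, t, self.n_heads, self.d_head)
        return c, k, v

c, k, v = SimplifiedLatentKV()(torch.randn(2, 10, 512))
print(c.shape, k.shape)  # torch.Size([2, 10, 128]) torch.Size([2, 10, 8, 64])
```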
How Linear Layers Power Multi-Head Attention in Transformers
Discover how linear layers enable multi-head attention in Transformers, powering advanced NLP models with parallel processing and rich representations.
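A minimal sketch of the point this article makes: one linear projection each for Q, K, and V produces all heads in a single matrix multiply, and a reshape splits the result into per-head slices. Dimensions are illustrative.

```python
import torch
import torch.nn as nn

d_model, n_heads = 512, 8
d_head = d_model // n_heads

# One linear layer per projection yields Q, K, V for *all* heads at once.
w_q = nn.Linear(d_model, d_model, bias=False)
w_k = nn.Linear(d_model, d_model, bias=False)
w_v = nn.Linear(d_model, d_model, bias=False)

x = torch.randn(2, 10, d_model)                            # (batch, seq, d_model)
q = w_q(x).view(2, 10, n_heads, d_head).transpose(1, 2)    # (batch, heads, seq, d_head)
k = w_k(x).view(2, 10, n_heads, d_head).transpose(1, 2)
v = w_v(x).view(2, 10, n_heads, d_head).transpose(1, 2)

scores = q @ k.transpose(-2, -1) / d_head ** 0.5            # (batch, heads, seq, seq)
out = torch.softmax(scores, dim=-1) @ v                     # per-head attention output
print(out.shape)  # torch.Size([2, 8, 10, 64])
```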
Optimization & Training
Build a Llama-Style MoE Model From Scratch (Part 2)
Learn how to train a language model with this PyTorch training loop guide. Explore text generation, the AdamW optimizer, and Mixture of Experts models.
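For orientation, a minimal sketch of the kind of PyTorch training loop the article builds around AdamW and next-token prediction. The model and data below are toy stand-ins, not the article's actual code.

```python
import torch
import torch.nn as nn

# Stand-in language model: embedding table followed by a linear head.
vocab_size, d_model = 100, 32
model = nn.Sequential(nn.Embedding(vocab_size, d_model), nn.Linear(d_model, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    tokens = torch.randint(0, vocab_size, (8, 16))     # fake batch of token ids
    inputs, targets = tokens[:, :-1], tokens[:, 1:]    # shift by one: next-token targets
    logits = model(inputs)                             # (batch, seq-1, vocab)
    loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```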
How to Add Special Tokens to LLMs Safely
Learn how to add special tokens to LLMs during fine-tuning without causing catastrophic forgetting. Our guide covers smart initialization and PEFT/LoRA.
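A minimal sketch of the safe-initialization idea using Hugging Face transformers: add the token, resize the embeddings, and set the new rows to the mean of the existing embeddings rather than leaving them randomly initialized. The model name and token string below are examples, not the article's choices.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # example model; any causal LM works the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

num_added = tokenizer.add_special_tokens({"additional_special_tokens": ["<|tool_call|>"]})
model.resize_token_embeddings(len(tokenizer))

# Initialize new embedding rows to the mean of existing rows, keeping the
# model's output distribution close to its pre-training state.
if num_added:
    with torch.no_grad():
        emb = model.get_input_embeddings().weight
        emb[-num_added:] = emb[:-num_added].mean(dim=0)
```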
Qwen3 QK-Norm: Improved On-Device AI Stability
The Qwen3 model family, Alibaba's latest large language model release, introduces a significant upgrade for on-device AI: the adoption...
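As a sketch of what QK-Norm means in practice: normalize the query and key vectors per head before taking their dot product, which bounds the scale of the attention logits. This is an illustrative simplification (no learned scale), not Qwen3's exact implementation.

```python
import torch

def rms_norm(x, eps=1e-6):
    """Plain RMSNorm over the last dimension (real implementations add a learned scale)."""
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

d_head = 64
q = torch.randn(2, 8, 10, d_head)   # (batch, heads, seq, head_dim)
k = torch.randn(2, 8, 10, d_head)

# QK-Norm: normalizing queries and keys keeps attention logits in a bounded
# range, which helps training stability and low-precision inference.
scores = rms_norm(q) @ rms_norm(k).transpose(-2, -1) / d_head ** 0.5
print(scores.shape)  # torch.Size([2, 8, 10, 10])
```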
What is Knowledge Distillation in AI?
Learn how knowledge distillation and model temperature work to train smaller, more efficient AI models. A key technique for LLM model compression.
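A minimal sketch of the temperature-scaled distillation loss the article describes: soften teacher and student logits with a temperature T, compare them with KL divergence, and blend that with the ordinary cross-entropy on the hard labels. All tensors below are toy stand-ins.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend soft-target KL loss (scaled by T^2) with hard-label cross-entropy."""
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# Toy example: batch of 4 items over a 10-class output space.
student = torch.randn(4, 10)
teacher = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student, teacher, labels))
```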
Case Studies & Implementations
Build a Llama-Style MoE Model From Scratch (Part 1)
Learn how to build a Llama-style MoE language model from scratch. This guide covers the Mixture of Experts architecture, tokenization, and model setup.
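A rough sketch of the Mixture of Experts idea at the heart of the series: a router scores the experts for each token, the top-k experts process that token, and their outputs are combined with the renormalized router weights. Names and sizes are illustrative, not the article's code.

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Token-level top-k routing over small feed-forward experts (illustrative)."""
    def __init__(self, d_model=64, n_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                          # x: (tokens, d_model)
        scores = self.router(x)                    # (tokens, n_experts)
        weights, idx = torch.topk(scores, self.top_k, dim=-1)
        weights = torch.softmax(weights, dim=-1)   # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e           # tokens sent to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

print(TinyMoE()(torch.randn(5, 64)).shape)  # torch.Size([5, 64])
```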
Alibaba's Qwen3-Next: A Deep Dive into its MoE Architecture
Explore the architecture of Alibaba's Qwen3-Next, a powerful large language model. Learn about its Mixture of Experts (MoE) design and performance.
Curated Resources
Attention Is All You Need (Original Paper)
The foundational Transformer paper introducing scaled dot-product attention and multi-head mechanisms.
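For reference, the paper's central formula for scaled dot-product attention over queries Q, keys K, and values V with key dimension d_k:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V
$$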
Transformer Circuits Interpretability
Anthropic’s deep-dive into how Transformer attention heads implement reasoning and steering behaviour.
bbycroft LLM Visualization
Interactive visual exploration of residual streams, attention patterns, and internal representations in GPT models.