LLM Internals
Dive deep into the core concepts, architectures, and inner workings of Large Language Models.
Latest Updates
How to Add Special Tokens to LLMs Safely
Learn how to add special tokens to LLMs during fine-tuning without causing catastrophic forgetting. Our guide covers smart initialization and PEFT/LoRA.
Multi-head Latent Attention (MLA) Explained
Learn about Multi-head Latent Attention (MLA) and how it improves on Multi-Query Attention (MQA). Discover Matrix Absorption and its impact on performance.
Alibaba's Qwen3-Next: A Deep Dive into its MoE Architecture
Explore the architecture of Alibaba's Qwen3-Next, a powerful large language model. Learn about its Mixture of Experts (MoE) design and performance.
Build a Llama-Style MoE Model From Scratch (Part 2)
Learn how to train a language model with this PyTorch training loop guide. Explore text generation, the AdamW optimizer, and Mixture of Experts models.
Model Foundations
What Are LLMs? A Guide to Generative AI
Discover what Large Language Models (LLMs) are and how they power Generative AI. This in-depth guide covers the Transformer architecture, prompt engineering, and more.
What Is a Transformer Model? An In-Depth Guide
Explore the Transformer model and its architecture. Learn about attention mechanisms and how Transformers power modern AI and Large Language Models (LLMs).
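For readers who want a concrete anchor before diving in, here is a minimal sketch of a pre-norm Transformer decoder block in PyTorch. The module sizes are illustrative and the block is deliberately simplified (no KV caching, no rotary embeddings); it is not the code from any specific model.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Minimal pre-norm Transformer decoder block (illustrative dimensions)."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x, attn_mask=None):
        # Self-attention sublayer with a residual connection
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + attn_out
        # Position-wise feed-forward sublayer with a residual connection
        x = x + self.mlp(self.norm2(x))
        return x

block = DecoderBlock()
print(block(torch.randn(2, 10, 512)).shape)  # torch.Size([2, 10, 512])
```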
Decoding Strategies for Large Language Models Explained
At the core of every large language model (LLM) is a sophisticated process for generating text. Instead of selecting words at random, the model...
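The excerpt above only hints at the mechanics. As a rough sketch of two common strategies applied to a vector of next-token logits (the logits here are random stand-ins, not the output of a real model):

```python
import torch
import torch.nn.functional as F

def greedy_step(logits):
    """Greedy decoding: always pick the single most likely next token."""
    return int(torch.argmax(logits, dim=-1))

def sample_step(logits, temperature=0.8, top_k=50):
    """Temperature-scaled top-k sampling: reshape the distribution with the
    temperature, then sample only among the k most likely tokens."""
    logits = logits / temperature
    topk_vals, topk_idx = torch.topk(logits, min(top_k, logits.numel()))
    probs = F.softmax(topk_vals, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)
    return int(topk_idx[choice])

# Toy next-token logits over a 10-token vocabulary (illustrative only)
logits = torch.randn(10)
print(greedy_step(logits), sample_step(logits))
```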
Architectures & Mechanisms
From DeepSeek-V3 to Kimi K2: Eight Modern Large Language Model Architecture Designs
This article dissects the architectural evolution of modern large language models in 2025, moving beyond benchmarks to analyze the core design choices of flagship open-source models. We explore key innovations like DeepSeek-V3's Multi-Head Latent Attention (MLA) and Mixture of Experts (MoE), OLMo 2's unique normalization strategies, Gemma 3's use of sliding window attention, and Llama 4's take on MoE. By focusing on these architectural blueprints, we gain a clearer understanding of the engineering priorities shaping the future of LLMs.
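To make one of the mechanisms above concrete, here is a small sketch of a causal sliding-window attention mask of the kind used by models such as Gemma 3. The window size is arbitrary and the code is illustrative, not any model's implementation.

```python
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask: query position i may attend to key positions j
    with i - window < j <= i (causal, limited to a local window)."""
    i = torch.arange(seq_len).unsqueeze(1)  # query positions
    j = torch.arange(seq_len).unsqueeze(0)  # key positions
    return (j <= i) & (j > i - window)

print(sliding_window_causal_mask(seq_len=6, window=3).int())
```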
Multi-head Latent Attention (MLA) Explained
Learn about Multi-head Latent Attention (MLA) and how it improves on Multi-Query Attention (MQA). Discover Matrix Absorption and its impact on performance.
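As a rough sketch of the core idea behind MLA, low-rank compression of the KV cache, the simplified module below squeezes each hidden state into a small latent vector and up-projects keys and values from it. It omits RoPE decoupling and matrix absorption, and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SimplifiedLatentKV(nn.Module):
    """Low-rank KV compression in the spirit of MLA (greatly simplified).
    Only the small latent `c` needs to be cached per token."""
    def __init__(self, d_model=512, n_heads=8, d_head=64, d_latent=128):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_head
        self.down = nn.Linear(d_model, d_latent, bias=False)           # compress
        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand to keys
        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand to values

    def forward(self, x):                      # x: (batch, seq, d_model)
        c = self.down(x)                       # (batch, seq, d_latent) -- the cached part
        b, t, _ = x.shape
        k = self.up_k(c).view(b, t, self.n_heads, self.d_head)
        v = self.up_v(c).view(b, t, self.n_heads, self.d_head)
        return c, k, v

c, k, v = SimplifiedLatentKV()(torch.randn(2, 10, 512))
print(c.shape, k.shape)  # torch.Size([2, 10, 128]) torch.Size([2, 10, 8, 64])
```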
How Linear Layers Power Multi-Head Attention in Transformers
Discover how linear layers enable multi-head attention in Transformers, powering advanced NLP models with parallel processing and rich representations.
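A minimal sketch of the point this article makes: one linear projection each for Q, K, and V produces all heads in a single matrix multiply, and a reshape splits the result into per-head slices. Dimensions are illustrative.

```python
import torch
import torch.nn as nn

d_model, n_heads = 512, 8
d_head = d_model // n_heads

# One linear layer per projection yields Q, K, V for *all* heads at once.
w_q = nn.Linear(d_model, d_model, bias=False)
w_k = nn.Linear(d_model, d_model, bias=False)
w_v = nn.Linear(d_model, d_model, bias=False)

x = torch.randn(2, 10, d_model)                            # (batch, seq, d_model)
q = w_q(x).view(2, 10, n_heads, d_head).transpose(1, 2)    # (batch, heads, seq, d_head)
k = w_k(x).view(2, 10, n_heads, d_head).transpose(1, 2)
v = w_v(x).view(2, 10, n_heads, d_head).transpose(1, 2)

scores = q @ k.transpose(-2, -1) / d_head ** 0.5            # (batch, heads, seq, seq)
out = torch.softmax(scores, dim=-1) @ v                     # per-head attention output
print(out.shape)  # torch.Size([2, 8, 10, 64])
```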
Optimization & Training
Build a Llama-Style MoE Model From Scratch (Part 2)
Learn how to train a language model with this PyTorch training loop guide. Explore text generation, the AdamW optimizer, and Mixture of Experts models.
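For orientation, a minimal sketch of the kind of PyTorch training loop the article builds around AdamW and next-token prediction. The model and data below are toy stand-ins, not the article's actual code.

```python
import torch
import torch.nn as nn

# Stand-in language model: embedding table followed by a linear head.
vocab_size, d_model = 100, 32
model = nn.Sequential(nn.Embedding(vocab_size, d_model), nn.Linear(d_model, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    tokens = torch.randint(0, vocab_size, (8, 16))     # fake batch of token ids
    inputs, targets = tokens[:, :-1], tokens[:, 1:]    # shift by one: next-token targets
    logits = model(inputs)                             # (batch, seq-1, vocab)
    loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```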
How to Add Special Tokens to LLMs Safely
Learn how to add special tokens to LLMs during fine-tuning without causing catastrophic forgetting. Our guide covers smart initialization and PEFT/LoRA.
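A minimal sketch of the safe-initialization idea using Hugging Face transformers: add the token, resize the embeddings, and set the new rows to the mean of the existing embeddings rather than leaving them randomly initialized. The model name and token string below are examples, not the article's choices.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # example model; any causal LM works the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

num_added = tokenizer.add_special_tokens({"additional_special_tokens": ["<|tool_call|>"]})
model.resize_token_embeddings(len(tokenizer))

# Initialize new embedding rows to the mean of existing rows, keeping the
# model's output distribution close to its pre-training state.
if num_added:
    with torch.no_grad():
        emb = model.get_input_embeddings().weight
        emb[-num_added:] = emb[:-num_added].mean(dim=0)
```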
Qwen3 QK-Norm: Improved On-Device AI Stability
The Qwen3 model family, Alibaba's latest large language model release, introduces a significant upgrade for on-device AI: the adoption...
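As a sketch of what QK-Norm means in practice: normalize the query and key vectors per head before taking their dot product, which bounds the scale of the attention logits. This is an illustrative simplification (no learned scale), not Qwen3's exact implementation.

```python
import torch

def rms_norm(x, eps=1e-6):
    """Plain RMSNorm over the last dimension (real implementations add a learned scale)."""
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

d_head = 64
q = torch.randn(2, 8, 10, d_head)   # (batch, heads, seq, head_dim)
k = torch.randn(2, 8, 10, d_head)

# QK-Norm: normalizing queries and keys keeps attention logits in a bounded
# range, which helps training stability and low-precision inference.
scores = rms_norm(q) @ rms_norm(k).transpose(-2, -1) / d_head ** 0.5
print(scores.shape)  # torch.Size([2, 8, 10, 10])
```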
What is Knowledge Distillation in AI?
Learn how knowledge distillation and model temperature work to train smaller, more efficient AI models. A key technique for LLM model compression.
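A minimal sketch of the temperature-scaled distillation loss the article describes: soften teacher and student logits with a temperature T, compare them with KL divergence, and blend that with the ordinary cross-entropy on the hard labels. All tensors below are toy stand-ins.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend soft-target KL loss (scaled by T^2) with hard-label cross-entropy."""
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# Toy example: batch of 4 items over a 10-class output space.
student = torch.randn(4, 10)
teacher = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student, teacher, labels))
```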
Case Studies & Implementations
Build a Llama-Style MoE Model From Scratch (Part 1)
Learn how to build a Llama-style MoE language model from scratch. This guide covers the Mixture of Experts architecture, tokenization, and model setup.
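A rough sketch of the Mixture of Experts idea at the heart of the series: a router scores the experts for each token, the top-k experts process that token, and their outputs are combined with the renormalized router weights. Names and sizes are illustrative, not the article's code.

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Token-level top-k routing over small feed-forward experts (illustrative)."""
    def __init__(self, d_model=64, n_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                          # x: (tokens, d_model)
        scores = self.router(x)                    # (tokens, n_experts)
        weights, idx = torch.topk(scores, self.top_k, dim=-1)
        weights = torch.softmax(weights, dim=-1)   # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e           # tokens sent to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

print(TinyMoE()(torch.randn(5, 64)).shape)  # torch.Size([5, 64])
```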
Alibaba's Qwen3-Next: A Deep Dive into its MoE Architecture
Explore the architecture of Alibaba's Qwen3-Next, a powerful large language model. Learn about its Mixture of Experts (MoE) design and performance.
Curated Resources
Attention Is All You Need (Original Paper)
The foundational Transformer paper introducing scaled dot-product attention and multi-head mechanisms.
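For reference, the paper's central formula for scaled dot-product attention over queries Q, keys K, and values V with key dimension d_k:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V
$$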
Transformer Circuits Interpretability
Anthropic’s deep-dive into how Transformer attention heads implement reasoning and steering behaviour.
bbycroft LLM Visualization
Interactive visual exploration of residual streams, attention patterns, and internal representations in GPT models.