AI Infrastructure: The Real Engine Behind AI Agents
Struggling with AI projects? The problem isn't your models, it's your AI infrastructure. Learn why data silos & lag hold you back and how to build a better f...
Pingxingjilu
Dive deep into the world of Artificial Intelligence with our curated collection of articles, covering the latest breakthroughs and insights from leading researchers and engineers.
Learn to install and use LLaMA Factory to fine-tune hundreds of LLMs on your local machine. This guide covers CUDA setup, installation, and WebUI usage.
Number in the Moutain
A deep dive into the Transformer architecture, the engine behind modern LLMs. Understand self-attention, encoders, decoders, and how they work together.
Alex Carter
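To make the self-attention step concrete, here is a minimal PyTorch sketch: queries, keys, and values are linear projections of the input, and each output token is an attention-weighted sum of the values. The shapes and random weights are illustrative, not taken from the article.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence x of shape (seq_len, d_model)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v          # project inputs to queries, keys, values
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k**0.5  # similarity of every token to every other token
    weights = F.softmax(scores, dim=-1)          # normalize to attention weights per query
    return weights @ v                           # weighted sum of values

seq_len, d_model = 4, 8
x = torch.randn(seq_len, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)    # torch.Size([4, 8])
```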
A comprehensive guide to using Ollama for running large language models like Llama 3 and Mistral on your local machine. Learn installation, commands, and how to create custom models.
Jordan Lee
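Once Ollama is installed and a model is pulled, it serves a local REST API; a minimal sketch of calling it from Python, assuming the default port and an already-pulled `llama3` model:

```python
import requests

# Assumes a local Ollama server (default port 11434) with the "llama3" model pulled.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": "Why is the sky blue?", "stream": False},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])  # the model's full completion
```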
Discover what Large Language Models (LLMs) are and how they power Generative AI. Learn about pre-training, fine-tuning, and the Transformer architecture.
hhmy27
Explore the shift to separated architectures for RL post-training of LLMs. Learn how systems like AsyncFlow & TransferQueue solve data orchestration challenges.
Little Boji
Explore LLM inference optimization on H800 SuperPods. Learn how a disaggregated architecture with SGLang tackles the prefill bottleneck to boost throughput.
yiakwy
Monitoring **PyTorch GPU memory usage** during model training can be perplexing. To demystify this, we'll dive into the **PyTorch memory snapshot** tool, a powerful utility for detailed **GPU memory** analysis.
Panda
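A minimal sketch of the snapshot workflow: record allocation history around a training step, then dump it for the visualizer. These hooks are underscore-prefixed (unstable) PyTorch APIs, and the model here is a stand-in, not the article's example.

```python
import torch

# Underscore-prefixed, so potentially subject to change across PyTorch versions.
torch.cuda.memory._record_memory_history(max_entries=100_000)

# A stand-in for a training step: allocations here are captured with stack traces.
model = torch.nn.Linear(4096, 4096).cuda()
out = model(torch.randn(64, 4096, device="cuda"))
out.sum().backward()

# Dump, then inspect by dragging the file onto https://pytorch.org/memory_viz
torch.cuda.memory._dump_snapshot("snapshot.pickle")
torch.cuda.memory._record_memory_history(enabled=None)  # stop recording
```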
Discover a critical flaw in Supervised Fine-Tuning (SFT) that limits LLM performance. Learn how a simple learning rate tweak unifies SFT and DPO for a 25% gain.
Noll
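For context on the DPO side of that comparison, here is a minimal sketch of the standard DPO objective on precomputed sequence log-probabilities; `beta` and the numbers are illustrative, and this is not the article's proposed unification.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO: the implicit reward is the policy-to-reference log-ratio per response."""
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()

# Made-up sequence log-probs from a policy and its frozen reference model.
logp_c, logp_r = torch.tensor([-12.0]), torch.tensor([-15.0])
ref_c, ref_r = torch.tensor([-13.0]), torch.tensor([-14.5])
print(dpo_loss(logp_c, logp_r, ref_c, ref_r))
```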
Unpack the powerful workflow behind GraphRAG. Learn how it transforms data into a network of nodes and edges, uses intelligent graph traversal for searching, and applies advanced metrics and metadata filters to deliver highly relevant, contextualized answers.
Mi
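The nodes-and-edges workflow can be sketched with `networkx`: entities become nodes, relations become attributed edges, and retrieval is a traversal with a metadata filter. The entity names, relations, and weight threshold below are toy assumptions, not GraphRAG's actual schema.

```python
import networkx as nx

# Toy knowledge graph: nodes are entities, edges carry relation metadata.
g = nx.Graph()
g.add_node("GraphRAG", kind="technique")
g.add_node("knowledge graph", kind="data structure")
g.add_node("LLM", kind="model")
g.add_edge("GraphRAG", "knowledge graph", relation="retrieves from", weight=0.9)
g.add_edge("GraphRAG", "LLM", relation="augments", weight=0.8)

def expand(graph, seed, min_weight=0.5):
    """Traverse from a seed entity, keeping only edges above a relevance threshold."""
    return [
        (seed, nbr, data["relation"])
        for nbr, data in graph[seed].items()
        if data["weight"] >= min_weight
    ]

print(expand(g, "GraphRAG"))
```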
This article delves into the core challenges of GPU performance, analyzing the differences between compute-bound and memory-bound operations and highlighting the issue of underutilized memory bandwidth. It further proposes strategies to maximize throughput and looks ahead to the collaborative future of CPUs and GPUs, as well as the evolution of GPU architecture, offering a first-principles perspective on understanding and optimizing GPU performance.
xiaodong gong
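The compute-bound versus memory-bound distinction comes down to arithmetic intensity (FLOPs per byte moved) relative to the hardware's ridge point; a back-of-the-envelope sketch with assumed, not vendor-quoted, peak numbers:

```python
# Roofline-style check: an op is memory-bound when its arithmetic intensity
# falls below the hardware's compute/bandwidth ratio (the ridge point).
def arithmetic_intensity_matmul(m, n, k, bytes_per_elem=2):
    flops = 2 * m * n * k                                   # multiply-accumulate count
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)  # read A, B; write C
    return flops / bytes_moved

# Illustrative hardware numbers (assumptions, not a specific GPU's spec sheet):
peak_flops = 300e12        # 300 TFLOP/s
peak_bandwidth = 2e12      # 2 TB/s
ridge_point = peak_flops / peak_bandwidth  # ~150 FLOPs/byte

for shape in [(8, 8, 8), (4096, 4096, 4096)]:
    ai = arithmetic_intensity_matmul(*shape)
    bound = "compute-bound" if ai > ridge_point else "memory-bound"
    print(f"matmul {shape}: {ai:.1f} FLOPs/byte -> {bound}")
```

Small matmuls sit far below the ridge point, which is why batching and operator fusion matter so much in practice.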
Traditional reinforcement learning models struggle with real-time applications due to "AI lag." Two ICLR 2025 papers from Mila introduce groundbreaking solutions to tackle inaction and delay regret, enabling large AI models to operate in high-frequency, dynamic environments without compromising speed or intelligence.
Noll
The rise of Agentic AI places unprecedented demands on our infrastructure. This article explores the emerging software and hardware requirements, from specialized runtimes and memory services to zero-trust security models, dissecting AWS's new Bedrock AgentCore platform and discussing the future of AI infrastructure.
Noll
This article dissects the architectural evolution of modern large language models in 2025, moving beyond benchmarks to analyze the core design choices of flagship open-source models. We explore key innovations like DeepSeek-V3's Multi-Head Latent Attention (MLA) and Mixture of Experts (MoE), OLMo 2's unique normalization strategies, Gemma 3's use of sliding window attention, and Llama 4's take on MoE. By focusing on these architectural blueprints, we gain a clearer understanding of the engineering priorities shaping the future of LLMs.
Noll
Learn how to deploy Kimi K2, a state-of-the-art Mixture-of-Experts (MoE) model, on a massive 128 H200 GPU cluster. This guide covers the key challenges and solutions using OME and SGLang for scalable, high-performance inference, achieving 4800 tokens/second with low latency.
Noll
Learn how to select the best ldmatrix operation in CUTLASS CuTe for high-performance GPU matrix multiplication. Optimize data movement and performance.
Noll
Unlock the full potential of your CUDA kernels by mastering memory coalescing with TiledCopy. This article dives deep into optimizing data transfers from Global to Shared Memory on NVIDIA GPUs, covering cp.async, row-major vs. column-major layouts, and cache line alignment to maximize memory bandwidth and accelerate your deep learning workloads.
Noll
Fine-Tuning Qwen3 with Unsloth: Step-by-Step Guide. Qwen3, the latest generation of large language models, is redefining AI with advanced reasoning, instruction following, and robust multilingual support.
Noll
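A minimal sketch of the Unsloth loading step for QLoRA fine-tuning; the model id and hyperparameters below are illustrative assumptions, not the guide's exact recipe.

```python
from unsloth import FastLanguageModel

# Load a quantized Qwen3 checkpoint (model id assumed for illustration).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-8B",
    max_seq_length=2048,
    load_in_4bit=True,   # 4-bit quantization to fit consumer GPUs
)

# Attach LoRA adapters; rank and target modules are illustrative choices.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=16,
)
```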
Baidu ERNIE 4.5: Advancements in Multimodal Large Language Models. Baidu's ERNIE 4.5 marks a major leap in artificial intelligence, especially in the development of **multimodal large language models**.
Noll
MemOS: Persistent Memory for LLMs & Next-Gen AI Agents. SFT transforms LLMs from base models to chat assistants; a step-by-step guide to the SFT workflow, datasets, and best practices.
Noll
Discover how linear layers enable multi-head attention in Transformers, powering advanced NLP models with parallel processing and rich representations.
Noll
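The trick that teaser refers to: one fused linear layer computes every head's projection at once, and a reshape splits the result into per-head subspaces. A minimal PyTorch sketch with illustrative dimensions:

```python
import torch
import torch.nn as nn

batch, seq_len, d_model, n_heads = 2, 10, 512, 8
head_dim = d_model // n_heads

x = torch.randn(batch, seq_len, d_model)
w_qkv = nn.Linear(d_model, 3 * d_model)  # fused Q, K, V projection in one layer

qkv = w_qkv(x)                  # (batch, seq, 3*d_model)
q, k, v = qkv.chunk(3, dim=-1)  # split into Q, K, V
# Reshape splits d_model into n_heads independent subspaces, no extra matmuls needed.
q = q.view(batch, seq_len, n_heads, head_dim).transpose(1, 2)  # (batch, heads, seq, head_dim)
print(q.shape)                  # torch.Size([2, 8, 10, 64])
```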
Explore single-controller vs multi-controller in veRL, inspired by Google's Pathways, and learn their impact on distributed reinforcement learning systems.
Noll
The field of artificial intelligence has seen rapid advancements in reinforcement learning for reasoning, particularly within large language models (LLMs). This article reviews influential research s...
Noll
Discover how xAI's Grok 4 sets new AI benchmarks, outperforms rivals, and introduces multi-agent systems in the race for next-gen artificial intelligence.
Noll
Qwen3 Training Pipeline: Pre-training, Reinforcement Learning, and Model Distillation. To build a robust foundation, Qwen3 pre-training begins with a comprehensive three-stage ...
Noll
LLM API Market 2024: Key Trends and Model Leaderboard. As we reach the midpoint of 2024, the competitive landscape for large language models (LLMs) is shifting rapidly. The so-called "LLM Wars" ar...
Noll
Discover the technical challenges and solutions in training a 671B parameter LLM with Reinforcement Learning, covering frameworks, memory, and efficiency.
Noll
Discover how traditional infrastructure skills translate to AI infrastructure. Learn key concepts, differences, and engineering fundamentals for LLM systems.
Noll
With its impressive performance and elegant architecture, **SGLang** is rapidly establishing itself in the competitive world of **large language model (LLM) inference**. Could it be the next PyTorch,...
Noll
Why Direct Reinforcement Learning on Base Language Models is the Next Frontier. Direct reinforcement learning (RL) on base language models is emerging as a transformative approach in LLM optimization...
Noll
Learn Andrew Ng
Noll
Reinforcement learning for LLMs (large language models) is revolutionizing the field of artificial intelligence by enabling models to learn beyond the constraints of supervised learning. This article...
Noll
Decoding Strategies for Large Language Models (LLMs). At the core of every large language model (LLM) is a sophisticated process for generating text. Instead of selecting words at random, the model...
Noll
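The common strategies fit in a few lines: greedy decoding takes the argmax, while temperature and top-k sampling trade determinism for diversity. A minimal sketch over one step's vocabulary logits; the vocabulary size and hyperparameters are illustrative.

```python
import torch

def sample_next_token(logits, temperature=0.8, top_k=50):
    """Pick the next token id from raw logits using temperature + top-k sampling."""
    if temperature == 0:                      # greedy decoding
        return int(logits.argmax())
    logits = logits / temperature             # <1 sharpens, >1 flattens the distribution
    topk_vals, topk_idx = logits.topk(top_k)  # keep only the k most likely tokens
    probs = torch.softmax(topk_vals, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)
    return int(topk_idx[choice])

vocab_logits = torch.randn(32_000)  # fake logits over a 32k-token vocabulary
print(sample_next_token(vocab_logits))
```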
Kimi Researcher: Advancing AI Agents with End-to-End Reinforcement Learning. Kimi Researcher is the flagship product of the Kimi Agent initiative, designed to revolutionize research automation through end-to-end reinforcement learning.
Noll
Qwen3 Model Family: QK-Norm and Enhanced Attention Mechanism. The Qwen3 model family, Alibaba's latest large language model release, introduces a significant upgrade for on-device AI: the adoption of QK-Norm in its attention mechanism.
Noll
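A simplified sketch of the QK-Norm idea: normalize queries and keys (here with a plain RMSNorm) before the dot product, which bounds the attention logits and stabilizes training. Real implementations typically add learnable scales, omitted here for brevity.

```python
import torch
import torch.nn.functional as F

def rms_norm(x, eps=1e-6):
    """RMS normalization over the last dimension (no learnable scale in this sketch)."""
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

seq, head_dim = 6, 64
q, k, v = (torch.randn(seq, head_dim) for _ in range(3))

q, k = rms_norm(q), rms_norm(k)         # the QK-Norm step, before the dot product
scores = (q @ k.T) / head_dim**0.5
out = F.softmax(scores, dim=-1) @ v
print(out.shape)                        # torch.Size([6, 64])
```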