Baidu ERNIE 4.5: Advancements in Multimodal Large Language Models

Baidu's ERNIE 4.5 marks a major leap in artificial intelligence, especially in the development of multimodal large language models. As part of its latest release, Baidu has open-sourced ten cutting-edge multimodal models, featuring both sparse Mixture-of-Experts (MoE) and dense architectures. These models are engineered to deliver superior performance across tasks involving text, images, and other modalities.

Innovative Heterogeneous Model Architecture

At the core of ERNIE 4.5 is a heterogeneous model structure. This architecture fuses knowledge from different modalities—such as text and images—using a shared parameter mechanism, while also providing each modality with dedicated capacity. This design enhances multimodal understanding and can improve results even on pure text tasks.
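
To make the idea concrete, here is a minimal PyTorch sketch of one way shared experts and modality-dedicated experts could coexist in a single MoE layer. The module names, top-1 routing, and layer sizes are illustrative assumptions, not the actual ERNIE 4.5 implementation.

```python
import torch
import torch.nn as nn

class HeterogeneousMoELayer(nn.Module):
    """Sketch only: shared experts process every token, while text and vision
    tokens are additionally routed to modality-dedicated expert pools."""

    def __init__(self, d_model=512, n_shared=1, n_per_modality=4):
        super().__init__()
        def make_expert():
            return nn.Sequential(nn.Linear(d_model, 4 * d_model),
                                 nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.shared = nn.ModuleList(make_expert() for _ in range(n_shared))
        self.text_experts = nn.ModuleList(make_expert() for _ in range(n_per_modality))
        self.vision_experts = nn.ModuleList(make_expert() for _ in range(n_per_modality))
        self.text_router = nn.Linear(d_model, n_per_modality)
        self.vision_router = nn.Linear(d_model, n_per_modality)

    def forward(self, x, is_vision):
        # x: (num_tokens, d_model); is_vision: (num_tokens,) boolean mask
        out = torch.zeros_like(x)
        for expert in self.shared:                     # shared capacity for all tokens
            out = out + expert(x)
        for mask, experts, router in ((~is_vision, self.text_experts, self.text_router),
                                      (is_vision, self.vision_experts, self.vision_router)):
            if mask.any():
                h = x[mask]
                top1 = router(h).argmax(dim=-1)        # top-1 routing for brevity
                routed = torch.stack([experts[int(i)](t) for t, i in zip(h, top1)])
                out[mask] = out[mask] + routed
        return out

layer = HeterogeneousMoELayer()
tokens = torch.randn(6, 512)
is_vision = torch.tensor([False, False, True, True, False, True])
print(layer(tokens, is_vision).shape)  # torch.Size([6, 512])
```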

Specialized Post-Training and Fine-Tuning

Rather than focusing only on pre-training, ERNIE 4.5 emphasizes the post-training phase, where models are refined for specialized tasks through advanced fine-tuning techniques.

Fine-Tuning for Language and Multimodal Models

  • Large Language Models (LLMs): Optimized for general language comprehension and generation.
  • Multimodal Large Models (MLMs): Tailored for visual-language understanding.

MLMs support two distinct modes:

  • Thinking mode: For step-by-step reasoning (e.g., Chain-of-Thought).
  • Non-thinking mode: For direct answers.
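
As a rough illustration of the difference between the two modes, the sketch below toggles a "thinking" instruction at prompt-construction time. The template markers and flag name are invented for illustration; the real ERNIE 4.5 chat template may differ.

```python
def build_prompt(question: str, thinking: bool = True) -> str:
    """Illustrative only: contrast a step-by-step request with a direct one."""
    if thinking:
        system = ("Reason step by step inside <think>...</think> tags, "
                  "then state the final answer.")
    else:
        system = "Answer directly and concisely, without intermediate reasoning."
    return f"System: {system}\nUser: {question}\nAssistant:"

print(build_prompt("What is 17 * 23?", thinking=True))
print(build_prompt("What is 17 * 23?", thinking=False))
```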

Each model undergoes a multi-stage process:

  1. Supervised Fine-Tuning (SFT)
  2. Advanced preference optimization (e.g., Direct Preference Optimization (DPO), Unified Preference Optimization (UPO))
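
For reference, the preference-optimization stage builds on the standard DPO objective, sketched below on summed sequence log-probabilities. UPO extends this idea, but its exact formulation is not reproduced here.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss: push the policy to prefer the chosen response over
    the rejected one, relative to a frozen reference model."""
    chosen = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen - rejected).mean()
```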

Supervised Fine-Tuning (SFT) for Language Models

The SFT process is illustrated below:

Diagram of the Supervised Fine-Tuning (SFT) process for language models

The SFT dataset spans ten domains, including science, mathematics, coding, logic, creative writing, and multilingual tasks. Data is categorized into reasoning (including Chain-of-Thought) and non-reasoning formats.

To boost creative problem-solving, multiple valid responses are provided for some reasoning queries, encouraging exploration during reinforcement learning (RL). Approximately 2.3 million samples were used, with each model trained for two epochs on average.
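
A hypothetical pair of SFT records, one in reasoning format with an explicit chain of thought and one in non-reasoning format, might look as follows. The field names and tag syntax are assumptions for illustration, not Baidu's actual schema.

```python
# Hypothetical SFT records; the schema is illustrative, not Baidu's actual format.
reasoning_sample = {
    "domain": "mathematics",
    "prompt": "A train covers 180 km in 2.5 hours. What is its average speed?",
    "response": "<think>speed = distance / time = 180 / 2.5 = 72</think>\n72 km/h",
}
non_reasoning_sample = {
    "domain": "creative_writing",
    "prompt": "Write a one-sentence description of autumn.",
    "response": "Autumn is the season when the trees let go in gold and rust.",
}
```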

Reinforcement Learning (RL) for Language Models

For reasoning tasks, rule-based verifiers ensure correctness. To handle cases where rules don't generalize, three supplementary mechanisms are applied.
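
A rule-based verifier for math-style answers can be as simple as extracting the final number and comparing it with the ground truth. The toy function below is a sketch of that idea, not ERNIE's actual verifier.

```python
import re
from fractions import Fraction

def math_reward(model_output: str, ground_truth: str) -> float:
    """Toy rule-based verifier: compare the last number in the output with
    the reference answer and return a binary reward."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?(?:/\d+)?", model_output)
    if not numbers:
        return 0.0
    try:
        return float(Fraction(numbers[-1]) == Fraction(ground_truth))
    except (ValueError, ZeroDivisionError):
        return 0.0

print(math_reward("speed = 180 / 2.5 = 72, so the answer is 72 km/h", "72"))  # 1.0
```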

For non-reasoning tasks, the model undergoes progressive RL using Proximal Policy Optimization (PPO):

  1. Logic Foundation: Training on logic-heavy corpora for abstract reasoning.
  2. Abstract Enhancement: Training on mathematics and programming data to improve code generation and abstract skills.
  3. Generalization: Integrating previous skills for broad task performance.
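
The per-update objective at each of these stages is the familiar clipped PPO surrogate; a minimal sketch (sequence-level, for brevity) is shown below.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped PPO surrogate: limit how far the updated policy can move from
    the policy that generated the rollouts."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```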

Beyond the standard PPO loss, Unified Preference Optimization (UPO), a DPO-style preference objective, is applied. It comes in online-UPO and offline-UPO variants, depending on how the preference pairs are generated. This hybrid approach helps prevent reward hacking and stabilizes training.

The RL input dataset is filtered to remove samples with little learning value and those with uniform reward signals, retaining only discriminative data. Data is stratified by subject, and rewards are normalized per subset to prevent domain interference.
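
One simple way to implement per-subset reward normalization is a z-score within each subject; the exact normalization ERNIE 4.5 uses is not specified, so treat this as an assumption.

```python
import numpy as np

def normalize_rewards_by_subject(rewards, subjects):
    """Normalize rewards within each subject subset so that no single domain
    dominates the RL signal. Subsets with identical rewards get zero signal,
    mirroring the removal of non-discriminative samples."""
    rewards = np.asarray(rewards, dtype=float)
    subjects = np.asarray(subjects)
    out = np.zeros_like(rewards)
    for subject in np.unique(subjects):
        m = subjects == subject
        std = rewards[m].std()
        out[m] = 0.0 if std == 0 else (rewards[m] - rewards[m].mean()) / std
    return out

print(normalize_rewards_by_subject([1, 0, 1, 5, 7],
                                   ["math", "math", "math", "code", "code"]))
```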

The process is visualized here:

Visualization of the Reinforcement Learning (RL) process for language models

Post-Training for Multimodal Models

High-quality, human-annotated image-caption pairs are rare. To address this, the team synthesized descriptions for real STEM images: a vision-language model (VLM) generates a caption, and a text-only reasoning model then attempts to solve the problem using only that caption. Only captions that allow the text model to reach the correct answer are retained.
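
The filtering step can be summarized in a few lines: a caption is kept only if a text-only model can recover the correct answer from it. The function below is a schematic; `vlm_caption` and `text_solver` are placeholders for the actual models, not real APIs.

```python
def keep_caption(image, question, reference_answer, vlm_caption, text_solver):
    """Schematic of the caption-synthesis filter.
    vlm_caption: callable image -> textual description (the VLM)
    text_solver: callable prompt -> answer (the text-only reasoner)"""
    caption = vlm_caption(image)                                   # 1. describe the image
    predicted = text_solver(f"{caption}\n\nQuestion: {question}")  # 2. solve from text alone
    return caption if predicted == reference_answer else None     # 3. keep only useful captions
```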

A three-stage progressive training framework is used:

  1. Text-only Reasoning Fine-Tuning: The model is fine-tuned on math, science, and code datasets, showing emergent reasoning behaviors.
  2. Vision-Related Reasoning Data: Rejection sampling generates data for chart analysis and creative writing, enriching the SFT dataset.
  3. Fusion of Reasoning Modes: "Thinking" and "non-thinking" modes are fused by joint training with a special <think>\n\n</think> tag (masked during backpropagation), and by transferring multimodal expert parameters (inspired by DeepSeek-V2).
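
The masking trick in step 3 amounts to excluding the empty think span from the loss. A sketch using the usual -100 ignore index for cross-entropy is shown below; the token ids are placeholders.

```python
import torch

IGNORE_INDEX = -100  # ignored by PyTorch cross-entropy

def mask_think_span(labels, input_ids, think_open_id, think_close_id):
    """Zero out the gradient contribution of the <think>\n\n</think> span by
    setting its labels to the ignore index. Token ids are illustrative."""
    labels = labels.clone()
    inside = False
    for i, tok in enumerate(input_ids.tolist()):
        if tok == think_open_id:
            inside = True
        if inside:
            labels[i] = IGNORE_INDEX
        if tok == think_close_id:
            inside = False
    return labels
```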

Reward Mechanisms for Multimodal Reinforcement Learning

Inspired by verifier-based RL methods such as RLVR (reinforcement learning with verifiable rewards), custom reward mechanisms are designed for each multimodal task:

Visual STEM

  • Uses open-source datasets with ground-truth answers.
  • Rewards are based on solution correctness.
  • Multiple-choice questions are reformulated as open-ended for deeper reasoning.
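
The multiple-choice reformulation is straightforward: drop the options and keep the correct answer as the reference for the verifier. A toy example:

```python
def to_open_ended(mcq):
    """Turn a multiple-choice item into an open-ended one so the model must
    derive the answer rather than eliminate options."""
    return {"question": mcq["question"],
            "answer": mcq["options"][mcq["correct"]]}

item = {"question": "What is the slope of the line y = 3x + 1?",
        "options": {"A": "1", "B": "3", "C": "4"}, "correct": "B"}
print(to_open_ended(item))  # {'question': '...', 'answer': '3'}
```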

Visual Puzzles

  • A dataset of 10,000 visual puzzles is used.
  • A two-LLM verification system checks for internal consistency and final solution correctness.
  • Responses are accepted only if both verifiers approve.
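
The acceptance rule is a simple conjunction of the two judges. In the sketch below, `consistency_judge` and `answer_judge` stand in for the two verifier LLMs and are not real APIs.

```python
def accept_puzzle_response(puzzle, response, consistency_judge, answer_judge) -> bool:
    """Accept a rollout only if the reasoning is internally consistent AND the
    final answer is judged correct."""
    return bool(consistency_judge(puzzle, response)) and bool(answer_judge(puzzle, response))
```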

UI2Code

  • For HTML generation from UI images, a custom verifier assesses visual fidelity between the reference image and generated code.
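
A crude stand-in for such a verifier is a pixel-level similarity between the reference screenshot and a rendering of the generated HTML (the rendering itself, e.g. via a headless browser, is assumed to happen upstream). The real verifier is certainly more sophisticated; this is only a sketch of the idea.

```python
import numpy as np

def visual_fidelity_reward(reference_pixels, rendered_pixels) -> float:
    """Toy fidelity score in [0, 1]: 1 minus the mean absolute pixel error.
    Both inputs are HxWxC uint8 arrays of the same size."""
    a = np.asarray(reference_pixels, dtype=float) / 255.0
    b = np.asarray(rendered_pixels, dtype=float) / 255.0
    if a.shape != b.shape:
        return 0.0
    return 1.0 - float(np.abs(a - b).mean())
```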

During RL training, the final reward combines the score of a pre-trained Bradley-Terry (BT) reward model with the verifier-based rewards. Training uses Group Relative Policy Optimization (GRPO), enhanced with Direct Advantage Policy Optimization (DAPO) techniques.
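
Schematically, the blended reward and the group-relative advantage used by GRPO look like the following; the 50/50 weighting is an assumption, not the published recipe.

```python
import numpy as np

def blended_reward(bt_score, verifier_score, weight=0.5):
    """Combine the Bradley-Terry model score with the verifier reward
    (the weighting here is an assumption)."""
    return weight * bt_score + (1.0 - weight) * verifier_score

def grpo_advantages(group_rewards, eps=1e-6):
    """GRPO: advantages are computed relative to the group of rollouts for the
    same prompt, so no learned value function is needed."""
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

print(grpo_advantages([blended_reward(0.8, 1.0),
                       blended_reward(0.2, 0.0),
                       blended_reward(0.5, 1.0)]))
```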

Model Specifications

Pre-trained Models

Specifications of the pre-trained ERNIE 4.5 models

Post-trained 300B-A47B Model

Specifications of the post-trained 300B-A47B ERNIE 4.5 model

Post-trained 21B-A3B Model

Specifications of the post-trained 21B-A3B ERNIE 4.5 model

Post-trained Multimodal Models (with "Thinking" Support)

Specifications of the post-trained multimodal models with 'Thinking' support

Further specifications of the post-trained multimodal models

Conclusion

Baidu ERNIE 4.5 showcases rapid progress in multimodal large language models, blending innovative architectures, advanced fine-tuning, and robust reinforcement learning. By open-sourcing these models and sharing detailed methodologies, Baidu is advancing the AI research community. As multimodal AI evolves, ERNIE 4.5's approaches are set to influence the next generation of intelligent systems.
