Baidu ERNIE 4.5: Advancements in Multimodal Large Language Models
Baidu's ERNIE 4.5 marks a major leap in artificial intelligence, especially in the development of multimodal large language models (LLMs). As part of its latest release, Baidu has open-sourced ten cutting-edge multimodal models, featuring both sparse Mixture-of-Experts (MoE) and dense architectures. These models are engineered to deliver superior performance across tasks involving text, images, and other modalities.
Innovative Heterogeneous Model Architecture
At the core of ERNIE 4.5 is a heterogeneous model structure. This architecture fuses knowledge from different modalities—such as text and images—using a shared parameter mechanism, while also providing each modality with dedicated capacity. This design enhances multimodal understanding and can improve results even on pure text tasks.
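To make the idea concrete, here is a minimal PyTorch sketch of a modality-aware MoE layer in which shared experts process every token while text and vision tokens are additionally routed to their own expert pools. The class name, expert counts, and top-1 routing are illustrative assumptions, not ERNIE 4.5's actual implementation.

```python
import torch
import torch.nn as nn

class ModalityAwareMoELayer(nn.Module):
    """Toy sketch of a heterogeneous MoE layer: shared experts see every token
    and carry cross-modal knowledge, while each modality also routes to its own
    dedicated expert pool. Sizes and routing are illustrative only."""

    def __init__(self, d_model=512, n_shared=2, n_per_modality=4, d_ff=2048):
        super().__init__()
        def make_expert():
            return nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                 nn.Linear(d_ff, d_model))
        self.shared = nn.ModuleList(make_expert() for _ in range(n_shared))
        self.modality_experts = nn.ModuleDict({
            m: nn.ModuleList(make_expert() for _ in range(n_per_modality))
            for m in ("text", "vision")
        })
        self.routers = nn.ModuleDict({
            m: nn.Linear(d_model, n_per_modality) for m in ("text", "vision")
        })

    def forward(self, x, modality: str):
        # Shared experts: averaged contribution, applied regardless of modality.
        out = sum(expert(x) for expert in self.shared) / len(self.shared)
        # Modality-specific experts: top-1 routing within the modality's pool.
        gates = torch.softmax(self.routers[modality](x), dim=-1)
        top_gate, top_idx = gates.max(dim=-1)
        picked = torch.zeros_like(x)
        for i, expert in enumerate(self.modality_experts[modality]):
            mask = (top_idx == i).unsqueeze(-1)       # tokens routed to expert i
            picked = torch.where(mask, expert(x), picked)
        return out + top_gate.unsqueeze(-1) * picked

x = torch.randn(2, 16, 512)                   # [batch, seq_len, d_model]
layer = ModalityAwareMoELayer()
print(layer(x, "text").shape)                 # torch.Size([2, 16, 512])
```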
Specialized Post-Training and Fine-Tuning
Rather than focusing only on pre-training, ERNIE 4.5 emphasizes the post-training phase, where models are refined for specialized tasks through advanced fine-tuning techniques.
Fine-Tuning for Language and Multimodal Models
- Large Language Models (LLMs): Optimized for general language comprehension and generation.
- Multimodal Large Models (MLMs): Tailored for visual-language understanding.
MLMs support two distinct modes:
- Thinking mode: For step-by-step reasoning (e.g., Chain-of-Thought).
- Non-thinking mode: For direct answers.
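A minimal sketch of how the two modes might be exposed at inference time is shown below; the `ernie_generate` callable, its `enable_thinking` flag, and the `<think>...</think>` trace format are hypothetical stand-ins rather than a documented ERNIE API.

```python
def answer(question: str, ernie_generate, thinking: bool) -> str:
    """Illustrative wrapper around a generation call. `ernie_generate` and its
    `enable_thinking` flag are hypothetical; the <think>...</think> trace
    format is also an assumption."""
    output = ernie_generate(question, enable_thinking=thinking)
    if thinking and "</think>" in output:
        # In thinking mode, strip the chain-of-thought and keep the final answer.
        return output.split("</think>", 1)[1].strip()
    return output.strip()
```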
Each model undergoes a multi-stage process:
- Supervised Fine-Tuning (SFT)
- Advanced preference optimization (e.g., Direct Preference Optimization (DPO), Unified Preference Optimization (UPO))
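UPO builds on DPO-style preference learning; since its exact formulation is not spelled out here, the sketch below shows the standard DPO loss over batched sequence log-probabilities as a reference point.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective: push the policy to prefer the chosen response
    over the rejected one, measured relative to a frozen reference model."""
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy batch of three preference pairs (sequence-level log-probabilities).
loss = dpo_loss(
    policy_chosen_logps=torch.tensor([-12.3, -8.1, -15.0]),
    policy_rejected_logps=torch.tensor([-14.0, -9.5, -14.2]),
    ref_chosen_logps=torch.tensor([-12.8, -8.4, -14.9]),
    ref_rejected_logps=torch.tensor([-13.9, -9.1, -14.5]),
)
print(loss.item())
```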
Supervised Fine-Tuning (SFT) for Language Models
The SFT dataset spans ten domains, including science, mathematics, coding, logic, creative writing, and multilingual tasks. Data is categorized into reasoning (including Chain-of-Thought) and non-reasoning formats.
To boost creative problem-solving, multiple valid responses are provided for some reasoning queries, encouraging exploration during reinforcement learning (RL). Approximately 2.3 million samples were used, with each model trained for two epochs on average.
Reinforcement Learning (RL) for Language Models
For reasoning tasks, rule-based verifiers ensure correctness. To handle cases where rules don't generalize, three supplementary mechanisms are applied.
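As a concrete (and deliberately simplified) illustration of rule-based verification, not ERNIE's actual verifier, a minimal math answer checker might look like the following; real verifiers additionally handle LaTeX, units, intervals, and symbolic equivalence.

```python
from math import isclose

def rule_based_math_reward(model_answer: str, reference: str) -> float:
    """Return 1.0 if the model's final answer matches the reference, else 0.0.
    Tries exact string match after normalization, then numeric comparison."""
    norm = lambda s: s.strip().lower().replace(" ", "").rstrip(".")
    if norm(model_answer) == norm(reference):
        return 1.0
    try:
        return 1.0 if isclose(float(model_answer), float(reference), rel_tol=1e-4) else 0.0
    except ValueError:
        return 0.0

print(rule_based_math_reward("3.14159 ", "3.14159"))  # 1.0
print(rule_based_math_reward("42", "43"))             # 0.0
```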
The language model then undergoes progressive RL using Proximal Policy Optimization (PPO), building capability in three stages (a staged-schedule sketch follows the list):
- Logic Foundation: Training on logic-heavy corpora for abstract reasoning.
- Abstract Enhancement: Training on mathematics and programming data to improve code generation and abstract reasoning skills.
- Generalization: Integrating previous skills for broad task performance.
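A staged schedule of this kind could be expressed as a simple curriculum; the dataset names, step counts, and `ppo_trainer` interface below are placeholders for illustration, not ERNIE 4.5's actual recipe.

```python
from dataclasses import dataclass

@dataclass
class RLStage:
    name: str
    datasets: list   # corpus identifiers, placeholders only
    steps: int

# Illustrative three-stage curriculum mirroring the stages described above.
CURRICULUM = [
    RLStage("logic_foundation", ["logic_corpus"], steps=2_000),
    RLStage("abstract_enhancement", ["math_corpus", "code_corpus"], steps=3_000),
    RLStage("generalization",
            ["logic_corpus", "math_corpus", "code_corpus", "general_tasks"],
            steps=5_000),
]

def run_progressive_rl(ppo_trainer, curriculum=CURRICULUM):
    """Run PPO stage by stage; each stage warm-starts from the previous policy.
    `ppo_trainer` is a hypothetical object exposing set_datasets/train."""
    for stage in curriculum:
        ppo_trainer.set_datasets(stage.datasets)
        ppo_trainer.train(num_steps=stage.steps)
```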
Beyond the standard PPO loss, Unified Preference Optimization (UPO), a DPO-style preference objective, is also applied; it comes in online-UPO and offline-UPO variants, depending on how the preference pairs are generated. This hybrid objective helps prevent reward hacking and stabilizes training.
The RL input dataset is filtered to remove samples with little learning value and those with uniform reward signals, retaining only discriminative data. Data is stratified by subject, and rewards are normalized per subset to prevent domain interference.
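The sketch below shows one plausible way to implement the two data-hygiene steps just described: dropping prompts whose rollouts all receive identical rewards, and standardizing rewards within each domain. The data layout is an assumption made for illustration.

```python
import numpy as np
from collections import defaultdict

def filter_uninformative(prompt_groups):
    """Drop prompts whose sampled rollouts all received the same reward:
    uniform rewards yield zero advantage and carry no learning signal."""
    return [g for g in prompt_groups if len({round(r, 6) for r in g["rewards"]}) > 1]

def normalize_rewards_per_domain(samples):
    """Standardize rewards within each domain so that easy or generously scored
    domains do not dominate the policy gradient. Each sample is a dict with
    'domain' and 'reward' keys (an assumed layout)."""
    by_domain = defaultdict(list)
    for s in samples:
        by_domain[s["domain"]].append(s["reward"])
    stats = {d: (np.mean(r), np.std(r) + 1e-6) for d, r in by_domain.items()}
    for s in samples:
        mean, std = stats[s["domain"]]
        s["norm_reward"] = (s["reward"] - mean) / std
    return samples
```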
Post-Training for Multimodal Models
High-quality, human-annotated image-caption pairs are scarce. To address this, the team synthesized descriptions for real STEM images: a vision-language model (VLM) generates a caption, and a text-only reasoning model then attempts to solve the problem using only that caption. Only captions that allow the text-only model to reach the correct answer are retained (a sketch of this filtering step follows).
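A minimal sketch of this caption-filtering loop, with `vlm_caption` and `text_reasoner_solve` as hypothetical stand-ins for the captioning and reasoning models:

```python
def synthesize_stem_captions(problems, vlm_caption, text_reasoner_solve):
    """Keep only captions that are 'effective': a text-only reasoning model must
    be able to solve the problem from the caption alone, without the image.
    Each problem is assumed to be a dict with 'image', 'question', 'answer'."""
    kept = []
    for p in problems:
        caption = vlm_caption(p["image"])                # VLM-generated description
        predicted = text_reasoner_solve(question=p["question"], caption=caption)
        if predicted.strip() == p["answer"].strip():     # caption was sufficient
            kept.append({**p, "caption": caption})
    return kept
```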
A three-stage progressive training framework is used:
- Text-only Reasoning Fine-Tuning: The model is fine-tuned on math, science, and code datasets, showing emergent reasoning behaviors.
- Vision-Related Reasoning Data: Rejection sampling generates data for chart analysis and creative writing, enriching the SFT dataset.
- Fusion of Reasoning Modes: "Thinking" and "non-thinking" modes are fused through joint training with a special `<think>\n\n</think>` tag (masked during backpropagation), and by transferring multimodal expert parameters (inspired by DeepSeek-V2); see the loss-masking sketch after this list.
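The masking of the empty think tag could be implemented roughly as follows, assuming standard causal-LM label conventions (label -100 is ignored by the cross-entropy loss); the helper name and tag handling are illustrative.

```python
import torch

EMPTY_THINK_TAG = "<think>\n\n</think>\n"

def build_non_thinking_example(tokenizer, prompt: str, response: str):
    """Sketch of loss masking for mode fusion: non-thinking responses are
    prefixed with an empty think tag, and both the prompt and the tag tokens
    get label -100 so they contribute nothing to the loss."""
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    tag_ids = tokenizer(EMPTY_THINK_TAG, add_special_tokens=False)["input_ids"]
    resp_ids = tokenizer(response, add_special_tokens=False)["input_ids"]

    input_ids = prompt_ids + tag_ids + resp_ids
    labels = [-100] * (len(prompt_ids) + len(tag_ids)) + resp_ids
    return torch.tensor(input_ids), torch.tensor(labels)
```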
Reward Mechanisms for Multimodal Reinforcement Learning
Inspired by verifier-based RL methods such as RLVR (reinforcement learning with verifiable rewards), custom reward mechanisms are designed for each multimodal task:
Visual STEM
- Uses open-source datasets with ground-truth answers.
- Rewards are based on solution correctness.
- Multiple-choice questions are reformulated as open-ended for deeper reasoning.
Visual Puzzles
- A dataset of 10,000 visual puzzles is used.
- A two-LLM verification system checks for internal consistency and final solution correctness.
- Responses are accepted only if both verifiers approve.
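A sketch of this dual-verifier acceptance rule, with `consistency_llm` and `correctness_llm` as hypothetical judge models and hand-written judge prompts:

```python
def ask_yes_no(judge_llm, prompt: str) -> bool:
    """`judge_llm` is a hypothetical callable returning the judge's text reply."""
    return judge_llm(prompt).strip().lower().startswith("yes")

def puzzle_reward(puzzle, response, consistency_llm, correctness_llm) -> float:
    """A response earns reward only if BOTH verifiers approve: the reasoning is
    internally consistent and the final answer matches the reference solution."""
    consistent = ask_yes_no(
        consistency_llm,
        f"Puzzle: {puzzle['statement']}\nSolution attempt: {response}\n"
        "Is the reasoning internally consistent? Answer yes or no.",
    )
    correct = ask_yes_no(
        correctness_llm,
        f"Reference answer: {puzzle['answer']}\nCandidate solution: {response}\n"
        "Does the final answer match the reference? Answer yes or no.",
    )
    return 1.0 if (consistent and correct) else 0.0
```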
UI2Code
- For HTML generation from UI images, a custom verifier renders the generated code and scores its visual fidelity against the reference image.
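One plausible way to score visual fidelity is to render the generated HTML headlessly and compare the screenshot against the reference image; the use of Playwright and SSIM below is an assumption about how such a verifier could work, not the documented setup.

```python
import io
import numpy as np
from PIL import Image
from playwright.sync_api import sync_playwright
from skimage.metrics import structural_similarity as ssim

def render_html(html: str, width=1024, height=768) -> np.ndarray:
    """Render generated HTML in a headless browser and return a grayscale screenshot."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": width, "height": height})
        page.set_content(html)
        png = page.screenshot()
        browser.close()
    return np.array(Image.open(io.BytesIO(png)).convert("L"))

def ui2code_reward(reference_png_path: str, generated_html: str) -> float:
    """Reward = structural similarity between the reference UI screenshot and the
    rendering of the generated HTML (1.0 means visually identical)."""
    ref = np.array(Image.open(reference_png_path).convert("L"))
    gen = render_html(generated_html, width=ref.shape[1], height=ref.shape[0])
    if gen.shape != ref.shape:
        gen = np.array(Image.fromarray(gen).resize(ref.shape[::-1]))
    return float(ssim(ref, gen))
```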
During RL training, the final reward combines the output of a pre-trained Bradley-Terry (BT) reward model with the verifier-based rewards. Training uses Group Relative Policy Optimization (GRPO), enhanced with Direct Advantage Policy Optimization (DAPO) techniques.
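A sketch of how the blended reward and GRPO's group-relative advantages fit together; the blending weight `alpha` is an assumption, while the within-group standardization is the standard GRPO formulation.

```python
import numpy as np

def combined_reward(bt_score: float, verifier_reward: float, alpha: float = 0.5) -> float:
    """Blend a learned Bradley-Terry preference score with the task-specific
    verifier reward; the weight alpha is an illustrative assumption."""
    return alpha * bt_score + (1.0 - alpha) * verifier_reward

def grpo_advantages(group_rewards):
    """Group Relative Policy Optimization advantage: rewards of the rollouts
    sampled for the same prompt are standardized within the group, removing
    the need for a separate value network."""
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-6)

# Four rollouts for one prompt: two verified correct, two incorrect.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))
```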
Model Specifications
The open-source release covers four groups of checkpoints:
- Pre-trained models
- The post-trained 300B-A47B model
- The post-trained 21B-A3B model
- Post-trained multimodal models with "thinking" support
Conclusion
Baidu ERNIE 4.5 showcases rapid progress in multimodal large language models, blending an innovative architecture, advanced fine-tuning, and robust reinforcement learning. By open-sourcing these models and sharing detailed methodologies, Baidu is giving the AI research community a strong foundation to build on. As multimodal AI evolves, ERNIE 4.5's approaches are set to influence the next generation of intelligent systems.