Lesson 5: Optimization, the Art of Mountain Climbing

Series: Mathematics in AI/ML & Deep Learning

Topics: optimization gradient-descent adam learning-rate PPO

🎯 Why Does Optimization Matter?

Training a model = finding the parameters $W$ that minimize the loss $L(W)$.

This is a pure optimization problem, but with two major challenges:

A parameter space with billions of dimensions: GPT-4 is reported (unofficially) to have ~1.8 trillion parameters
Non-convexity: countless local minima, saddle points, and flat regions

1. Convex vs Non-convex

Convex (linear regression, logistic regression):
Loss │
     │    ╲              ╱
     │      ╲          ╱
     │        ╲      ╱
     │          ╲  ╱
     │            ★ ← global minimum (unique)
     └─────────────────── W

Non-convex (neural network):
Loss │
     │  ╱╲      ╱╲
     │ ╱  ╲    ╱  ╲    ╱
     │╱    ╲  ╱    ╲  ╱
     │      ╲╱  ★   ╲╱ ← local minima
     │       ↑        ← saddle point (∇L=0 but not a minimum)
     └─────────────────── W

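Good news: in deep learning, local minima of large networks usually have a loss close to that of the global minimum. Saddle points, where the gradient vanishes but the point is not a minimum, tend to be the bigger problem.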
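To make the saddle-point idea concrete, here is a minimal sketch using the toy function $f(x, y) = x^2 - y^2$ (a hypothetical example chosen for illustration, not a real loss). The gradient is zero at the origin, but the Hessian has one positive and one negative eigenvalue, so the point is a minimum along $x$ and a maximum along $y$.

```python
import numpy as np

def grad_f(p):
    """Gradient of the toy function f(x, y) = x^2 - y^2."""
    x, y = p
    return np.array([2.0 * x, -2.0 * y])

# The Hessian of f is constant for this quadratic
hessian = np.array([[2.0, 0.0],
                    [0.0, -2.0]])

origin = np.array([0.0, 0.0])
grad_at_origin = grad_f(origin)
eigenvalues = np.linalg.eigvalsh(hessian)  # sorted in ascending order

is_critical = bool(np.allclose(grad_at_origin, 0.0))
# Mixed-sign eigenvalues at a critical point mean a saddle, not a minimum
is_saddle = is_critical and eigenvalues.min() < 0 < eigenvalues.max()

print("gradient at origin:", grad_at_origin)
print("Hessian eigenvalues:", eigenvalues)
print("saddle point:", is_saddle)
```

Plain gradient descent slows down near such a point because the gradient is tiny there, even though a descent direction (here, along $y$) still exists.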

2. Gradient Descent and Its Variants

Batch GD vs SGD vs Mini-batch

import numpy as np

def compute_loss_and_grad(W, X_batch, y_batch):
    """MSE loss and its gradient with respect to W."""
    pred = X_batch @ W
    error = pred - y_batch
    loss = np.mean(error ** 2)
    grad = 2 * X_batch.T @ error / len(y_batch)
    return loss, grad

# Synthetic data
N, d = 10000, 50
X = np.random.randn(N, d)
W_true = np.random.randn(d)
y = X @ W_true + 0.1 * np.random.randn(N)

W = np.zeros(d)
lr = 0.01

# Batch GD: the whole dataset every step → slow per step, stable
# loss, grad = compute_loss_and_grad(W, X, y)

# SGD: 1 sample per step → cheap per step, noisy
# loss, grad = compute_loss_and_grad(W, X[[i]], y[[i]])

# Mini-batch: batch_size samples per step → a balance between the two
batch_size = 64
for step in range(100):
    idx = np.random.choice(N, batch_size, replace=False)
    loss, grad = compute_loss_and_grad(W, X[idx], y[idx])
    W -= lr * grad
    
    if step % 20 == 0:
        full_loss, _ = compute_loss_and_grad(W, X, y)
        print(f"Step {step:3d}: loss={full_loss:.4f}")
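Why a mini-batch works at all: its gradient is a noisy but unbiased estimate of the full-batch gradient. The sketch below checks this on a small synthetic regression problem. The helper `mse_grad` and the sizes are assumptions made for this illustration. It averages the gradients of equal-size mini-batches that together cover the dataset and compares the result with the full gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, batch_size = 1000, 5, 50
X = rng.standard_normal((N, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(N)
W = np.zeros(d)

def mse_grad(W, X_batch, y_batch):
    """Gradient of the mean squared error on one batch."""
    error = X_batch @ W - y_batch
    return 2 * X_batch.T @ error / len(y_batch)

full_grad = mse_grad(W, X, y)

# Shuffle once, then split into equal-size mini-batches covering every sample
perm = rng.permutation(N)
batch_grads = np.array(
    [mse_grad(W, X[idx], y[idx]) for idx in perm.reshape(-1, batch_size)]
)

mean_of_batches = batch_grads.mean(axis=0)                      # matches the full gradient
single_batch_error = np.linalg.norm(batch_grads[0] - full_grad) # the noise of one batch

print("max |mean of batch grads - full grad|:", np.abs(mean_of_batches - full_grad).max())
print("error of a single mini-batch gradient:", single_batch_error)
```

Each individual mini-batch gradient is off by some noise, but the noise averages out, which is why many cheap noisy steps can replace a few expensive exact ones.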

3. Momentum: “A Ball Rolling Downhill”

Momentum accumulates velocity along directions where the gradient is consistent, which damps oscillation. With $g_t = \nabla L(W_{t-1})$:

$v_t = \beta \cdot v_{t-1} + (1-\beta) \cdot g_t$
$W_t = W_{t-1} - \eta \cdot v_t$

Plain SGD (oscillates):        SGD + Momentum (smoother):

Loss│   ↗↘↗↘↗↘                 Loss│   ↗↘
    │  ↗      ↘                    │  ↗     ↘
    │ ↗         ↘↗↘                │ ↗         ↘
    │               ★              │               ★
    └──────────────────            └──────────────────

import numpy as np

def sgd_momentum(grad_fn, w_init, lr=0.01, beta=0.9, n_steps=100):
    w = w_init.copy()
    v = np.zeros_like(w)
    
    for t in range(n_steps):
        g = grad_fn(w)
        v = beta * v + (1 - beta) * g     # update velocity
        w = w - lr * v                     # update params
    
    return w
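A small experiment can illustrate the damping effect. The sketch below applies the same update rule as `sgd_momentum` to a toy ill-conditioned quadratic (curvatures 1 and 100, values chosen for this illustration) and counts how often the steep coordinate changes sign along the trajectory. Setting `beta=0` reduces the rule to plain gradient descent.

```python
import numpy as np

curvature = np.array([1.0, 100.0])  # one flat direction, one steep direction

def grad_fn(w):
    """Gradient of f(w) = 0.5 * sum(curvature * w**2)."""
    return curvature * w

def run(w_init, lr, beta, n_steps):
    """Momentum update in the exponential-moving-average form used above."""
    w = w_init.copy()
    v = np.zeros_like(w)
    path = [w.copy()]
    for _ in range(n_steps):
        g = grad_fn(w)
        v = beta * v + (1 - beta) * g
        w = w - lr * v
        path.append(w.copy())
    return np.array(path)

def sign_flips(values):
    """Number of times consecutive values switch sign (a crude oscillation measure)."""
    signs = np.sign(values)
    return int(np.sum(signs[1:] != signs[:-1]))

w0 = np.array([1.0, 1.0])
path_gd = run(w0, lr=0.018, beta=0.0, n_steps=100)   # plain gradient descent
path_mom = run(w0, lr=0.018, beta=0.9, n_steps=100)  # with momentum

flips_gd = sign_flips(path_gd[:, 1])
flips_mom = sign_flips(path_mom[:, 1])
print("sign flips along the steep direction, plain GD:", flips_gd)
print("sign flips along the steep direction, momentum:", flips_mom)
```

With this learning rate, plain gradient descent overshoots the steep direction on every step, while the averaged velocity changes direction far less often.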

4. Adam: The Default Optimizer for Transformers

Adam = Adaptive Moment Estimation = Momentum + RMSProp

$m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t \quad \text{(1st moment: mean)}$
$v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2 \quad \text{(2nd moment: uncentered variance)}$
$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t} \quad \text{(bias correction)}$
$W_t = W_{t-1} - \frac{\eta}{\sqrt{\hat{v}_t}+\epsilon}\hat{m}_t$

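Intuition: each parameter gets its own effective learning rate $\eta / (\sqrt{\hat{v}_t}+\epsilon)$, which adapts to that parameter's gradient history.

Parameters with large gradient magnitudes → smaller effective lr
Parameters with small gradient magnitudes → larger effective lr (gradients that keep flipping sign still produce small steps, because $\hat{m}_t$ averages toward zero)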
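The update equations translate almost line by line into NumPy. The following is a minimal sketch, not a replacement for `torch.optim.Adam`. The function name `adam`, the toy quadratic, and its curvatures (1 and 1000) are assumptions made for this illustration. It shows the per-parameter scaling: both coordinates follow nearly the same trajectory even though their gradients differ by a factor of 1000.

```python
import numpy as np

def adam(grad_fn, w_init, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=1000):
    """Plain Adam following the four update equations above."""
    w = w_init.astype(float).copy()
    m = np.zeros_like(w)  # 1st moment estimate
    v = np.zeros_like(w)  # 2nd moment estimate
    for t in range(1, n_steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)  # bias correction
        v_hat = v / (1 - beta2**t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

curvature = np.array([1.0, 1000.0])  # toy quadratic: f(w) = 0.5 * sum(curvature * w**2)
grad_fn = lambda w: curvature * w
w_init = np.array([1.0, 1.0])

# After one step, each coordinate has moved by about lr, whatever its gradient scale
w_one_step = adam(grad_fn, w_init, lr=0.1, n_steps=1)

# Over many steps, both coordinates shrink toward the minimum at 0 in lockstep
w_final = adam(grad_fn, w_init, lr=0.01, n_steps=500)

print("after 1 step:", w_one_step)
print("after 500 steps:", w_final)
```

Plain gradient descent would need a learning rate below 0.002 to stay stable on the steep coordinate here, which would make progress on the flat coordinate very slow.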

import torch
import torch.nn as nn

# Compare optimizers
model_sgd  = nn.Linear(100, 10)
model_adam = nn.Linear(100, 10)

opt_sgd  = torch.optim.SGD(model_sgd.parameters(),  lr=0.01, momentum=0.9)
opt_adam = torch.optim.AdamW(model_adam.parameters(), lr=1e-3, weight_decay=1e-2)
# AdamW = Adam + decoupled weight decay (usually preferred over Adam with L2 regularization)

# LR schedule: optimizers are often paired with a scheduler
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    opt_adam, T_max=100, eta_min=1e-6
)

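# Training loop (dummy data, only to make the loop runnable)
X, y = torch.randn(32, 100), torch.randn(32, 10)
for epoch in range(100):
    opt_adam.zero_grad()                                 # clear old gradients
    loss = nn.functional.mse_loss(model_adam(X), y)      # forward pass
    loss.backward()
    opt_adam.step()
    scheduler.step()   # decay lr along a cosine curve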

Which one to use when?

Optimizer	Use when	Typical lr
SGD + momentum	ResNet, CNNs for vision	0.1
Adam	Transformers, NLP, mixed architectures	1e-3
AdamW	Fine-tuning LLMs	5e-5 to 1e-4
Lion	Experimental, memory-efficient	1e-4

5. Learning Rate: Often the Most Important Hyperparameter

lr too small:     ●──────────────────── (never gets there)
lr good:          ●────────★           (converges nicely)
lr too large:     ● ↗↘↗↘↗↘ (oscillates)
lr far too large: ● → ∞               (diverges)
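The four regimes can be reproduced on the simplest possible problem. The sketch below runs gradient descent on the toy function $f(w) = w^2$. The function and the four learning rates are chosen for this illustration only. Each step multiplies $w$ by $(1 - 2 \cdot lr)$, so the behavior depends entirely on that factor.

```python
def gradient_descent(lr, w0=1.0, n_steps=50):
    """Gradient descent on f(w) = w^2, whose gradient is 2w."""
    w = w0
    history = [w]
    for _ in range(n_steps):
        grad = 2.0 * w
        w = w - lr * grad
        history.append(w)
    return history

settings = [
    ("too small", 0.001),  # factor  0.998: barely moves
    ("good", 0.1),         # factor  0.8:   converges quickly
    ("too large", 0.95),   # factor -0.9:   flips sign every step, shrinks slowly
    ("diverging", 1.1),    # factor -1.2:   grows without bound
]
runs = {name: gradient_descent(lr) for name, lr in settings}

for name, history in runs.items():
    print(f"{name:>10}: final |w| = {abs(history[-1]):.3e}")
```

Real loss surfaces are not quadratic, but locally the same logic applies: the largest stable learning rate is set by the steepest curvature direction.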

import torch.nn.functional as F

# Learning Rate Finder: search for a good lr
def lr_finder(model, optimizer, train_loader, start_lr=1e-7, end_lr=10, num_iter=100):
    """
    Increase lr on a log scale from start_lr to end_lr,
    then plot loss against lr to find a good value.
    """
    lrs, losses = [], []
    lr = start_lr
    multiplier = (end_lr / start_lr) ** (1 / num_iter)
    
    for i, (X, y) in enumerate(train_loader):
        if i >= num_iter:
            break
        
        # Set lr
        for g in optimizer.param_groups:
            g['lr'] = lr
        
        # Forward + backward
        pred = model(X)
        loss = F.cross_entropy(pred, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        lrs.append(lr)
        losses.append(loss.item())
        
        lr *= multiplier
        if loss.item() > 4 * min(losses):   # diverging → stop
            break
    
    # Plot losses against lrs (log x-axis): pick the lr where the loss falls fastest,
    # or roughly 1/10 of the lr at which the loss bottoms out
    return lrs, losses

# Learning rate schedule from the original Transformer paper:
# linear warm-up followed by inverse-square-root decay
def lr_schedule(step, d_model=512, warmup_steps=4000):
    """From the paper "Attention Is All You Need"."""
    step = max(step, 1)  # avoid 0 ** -0.5 at step 0
    return d_model**(-0.5) * min(step**(-0.5), step * warmup_steps**(-1.5))
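Many Transformer training recipes pair linear warm-up with cosine decay rather than the inverse-square-root decay in the formula above. A minimal sketch of that variant follows. The function name `warmup_cosine_lr` and all default values are assumptions for illustration, not taken from a specific paper or library.

```python
import math

def warmup_cosine_lr(step, peak_lr=1e-3, min_lr=1e-5,
                     warmup_steps=1000, total_steps=10000):
    """Linear warm-up to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Linear ramp: reaches peak_lr on the last warm-up step
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, capped at 1 after total_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 down to 0
    return min_lr + (peak_lr - min_lr) * cosine

for s in (0, 999, 1000, 5500, 10000):
    print(f"step {s:>5}: lr = {warmup_cosine_lr(s):.6f}")
```

In PyTorch, a function like this could be plugged into `torch.optim.lr_scheduler.LambdaLR` by returning the ratio to the optimizer's base lr instead of the absolute value.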

6. Gradient Clipping: Avoiding Exploding Gradients

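import torch
import torch.nn as nn

# Toy model and data (placeholders so the loop runs)
model = nn.Linear(100, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
X, y = torch.randn(32, 100), torch.randn(32, 10)

# Training loop with gradient clipping
max_grad_norm = 1.0

for step in range(300):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(X), y)
    loss.backward()

    # Clip BEFORE the update; the return value is the total norm before clipping
    grad_norm = torch.nn.utils.clip_grad_norm_(
        model.parameters(), max_grad_norm
    )

    # Log the gradient norm to monitor training
    if step % 100 == 0:
        print(f"Grad norm: {grad_norm:.3f}")
        # If grad_norm is regularly >> 1 → clip more aggressively or reduce lr

    optimizer.step()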
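What norm-based clipping does can be sketched in a few lines of NumPy: compute one global norm over all gradient tensors, then rescale every tensor by the same factor if that norm exceeds the threshold. The helper `clip_by_global_norm` is a hypothetical name for this illustration. To my understanding it mirrors the behavior of `clip_grad_norm_`, but treat it as a sketch rather than the library's implementation.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    total_norm = float(np.sqrt(sum(np.sum(g**2) for g in grads)))
    # Scale is 1 (no change) when the norm is already within the limit
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

# Two "parameter tensors" whose joint norm is sqrt(9 + 16 + 144) = 13
grads = [np.array([3.0, 4.0]), np.array([[12.0]])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = float(np.sqrt(sum(np.sum(g**2) for g in clipped)))

print("norm before clipping:", norm_before)
print("norm after clipping: ", norm_after)
```

Because every tensor is scaled by the same factor, the direction of the overall update is preserved and only its length changes.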

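7. Constrained Optimization: PPO in RLHF

In RL, we want to maximize the expected reward $r$ without letting the policy change too much in a single update. This can be written as a KL-constrained (trust-region) problem:

\[\max_\theta \mathbb{E}[r] \quad \text{subject to} \quad D_{KL}(\pi_\theta \| \pi_{\theta_{old}}) \leq \delta\]

PPO replaces the hard constraint with a clipped objective. Here $r_t(\theta) = \pi_\theta(a_t|s_t) / \pi_{\theta_{old}}(a_t|s_t)$ is the probability ratio between the new and old policy (not the reward), and $\hat{A}_t$ is the estimated advantage:

\[L^{CLIP} = \mathbb{E}\left[\min\left(r_t(\theta)\hat{A}_t,\ \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right]\]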

import torch

def ppo_loss(old_log_prob, new_log_prob, advantage, epsilon=0.2):
    """
    PPO clipped objective, returned as a loss to minimize.
    old_log_prob: log π_old(a|s)
    new_log_prob: log π_θ(a|s)
    advantage: estimated advantage Â
    """
    ratio = torch.exp(new_log_prob - old_log_prob)  # π_θ / π_old
    
    # Clipped ratio: keep the policy ratio within [1-ε, 1+ε]
    clipped_ratio = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)
    
    # Take the min for a pessimistic estimate; negate because optimizers minimize
    loss = -torch.min(ratio * advantage, clipped_ratio * advantage).mean()
    return loss
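It helps to evaluate the clipped term by hand for the four sign combinations. The sketch below uses plain Python scalars. The helper `ppo_term` and the sample ratios are made up for this illustration. It shows that clipping only removes the incentive to move the ratio further in the direction the advantage already favors.

```python
import math

def ppo_term(ratio, advantage, epsilon=0.2):
    """Per-sample clipped surrogate: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    clipped_ratio = max(1 - epsilon, min(ratio, 1 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)

cases = {
    "good action, ratio already high": (1.5, 1.0),   # capped at 1.2 * A
    "good action, ratio too low":      (0.5, 1.0),   # unclipped: 0.5 * A
    "bad action, ratio too high":      (1.5, -1.0),  # unclipped: 1.5 * A
    "bad action, ratio already low":   (0.5, -1.0),  # capped at 0.8 * A
}
results = {name: ppo_term(r, a) for name, (r, a) in cases.items()}

for name, value in results.items():
    print(f"{name:<34} -> {value:+.2f}")
```

In the two capped cases the objective no longer depends on the ratio, so the gradient is zero and the policy gets no reward for drifting further from the old one. In the two unclipped cases the policy has moved the wrong way, and the full term is kept so the update can correct it.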

8. Common Mistakes

Mistake	Symptom	Fix
lr too large	Loss rises or becomes NaN	Reduce lr 10x, use an lr finder
lr too small	Loss decreases very slowly	Increase lr, use warm-up
Missing zero_grad	Erratic loss that keeps rising (gradients accumulate across steps)	Add `optimizer.zero_grad()`
Exploding gradients	Loss suddenly becomes NaN	Gradient clipping, reduce lr
Overfitting	Low train loss, high val loss	Weight decay, dropout, early stopping

9. Checklist

Explain how Adam differs from SGD
Recognize the signs of a learning rate that is too large or too small
Implement gradient clipping
Choose a suitable optimizer for each type of model
Explain why PPO uses a clipped objective
