Lesson 8: Numerical Methods - When Mathematics Meets the Computer

Series: Mathematics in AI/ML & Deep Learning

Topics: numerical-stability floating-point mixed-precision FP16 BF16

🎯 Why Do Numerical Methods Matter?

Computers cannot store most real numbers exactly. $\pi$, $\sqrt{2}$, and even $0.1$ are rounded to the nearest representable value.

Deep Learning repeats the same arithmetic billions of times, so small rounding errors can accumulate into outright failures:

Loss turns into NaN in the middle of training
Softmax returns NaN because exp() overflows to inf
Gradients overflow or underflow because of the limited floating-point range
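
As a small illustration of how rounding errors accumulate, the sketch below adds 0.1 ten times in a plain loop and compares the result with `math.fsum`, which compensates for the bits lost at each step. It uses only the Python standard library, and the variable names are illustrative.

```python
import math

# Add 0.1 ten times with ordinary float addition.
# Every step rounds to the nearest double, and the errors pile up.
naive_total = 0.0
for _ in range(10):
    naive_total += 0.1

# math.fsum tracks the rounding error of the partial sums.
accurate_total = math.fsum([0.1] * 10)

print(naive_total == 1.0)      # False
print(accurate_total == 1.0)   # True
```

Ten additions already lose the last bit. A training run performs vastly more operations, usually at lower precision.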

1. Floating Point: How Do Computers Store Numbers?

IEEE 754 Float32 (FP32):
┌──┬─────────┬──────────────────────┐
│S │ Exponent│      Mantissa        │
│1 │   8 bit │       23 bit         │
└──┴─────────┴──────────────────────┘
   Sign    Range        Precision

Float16 (FP16): 1 + 5 + 10 bits  ← smaller and faster, but overflows easily
BFloat16 (BF16): 1 + 8 + 7 bits  ← same range as FP32, lower precision
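
To make the FP32 layout concrete, here is a minimal sketch that splits a number into its three bit fields and rebuilds it. The helper names `fp32_fields` and `fp32_value` are hypothetical, and the rebuild formula covers only normal values (not zero, subnormals, inf, or NaN).

```python
import struct

def fp32_fields(value):
    """Split a number, stored as FP32, into (sign, exponent, mantissa) bit fields."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF       # 23 bits, the leading 1 is implicit
    return sign, exponent, mantissa

def fp32_value(sign, exponent, mantissa):
    """Rebuild a normal FP32 value from its fields."""
    return (-1) ** sign * (1 + mantissa / 2**23) * 2.0 ** (exponent - 127)

print(fp32_fields(1.0))              # (0, 127, 0)
print(fp32_fields(-2.5))             # (1, 128, 2097152)
print(fp32_value(1, 128, 2097152))   # -2.5
```

The exponent field sets the range and the mantissa field sets the precision, which is why FP16 (5 exponent bits) overflows early while BF16 (7 mantissa bits) is coarse.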

import numpy as np
import torch

# Floating-point rounding error in action
print(0.1 + 0.2)           # 0.30000000000000004, NOT 0.3!
print(0.1 + 0.2 == 0.3)    # False ← this is why floats should not be compared with ==

# FP32 vs FP16 range
fp32_max = torch.finfo(torch.float32).max
fp16_max = torch.finfo(torch.float16).max
bf16_max = torch.finfo(torch.bfloat16).max

print(f"FP32 max: {fp32_max:.2e}")   # 3.40e+38
print(f"FP16 max: {fp16_max:.2e}")   # 6.55e+04  ← very small!
print(f"BF16 max: {bf16_max:.2e}")   # 3.39e+38

# FP16 rounding and overflow demonstration
x = torch.tensor(65000.0, dtype=torch.float16)
print(f"65000 in FP16: {x}")          # 64992.0: rounded, because FP16 values are 32 apart in this range
y = torch.tensor(65600.0, dtype=torch.float16)
print(f"65600 in FP16: {y}")          # inf ← OVERFLOW! (largest finite FP16 value is 65504)
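
The `0.1 + 0.2` result above can be examined a little further. The sketch below, using only the standard library, prints the exact value stored for the literal `0.1` and shows a tolerance-based comparison as an alternative to `==`. The tolerance `1e-9` is an arbitrary choice for illustration.

```python
import math
from decimal import Decimal

# Decimal(float) reveals the exact binary value stored for the literal 0.1.
stored = Decimal(0.1)
print(stored)   # 0.1000000000000000055511151231257827021181583404541015625

total = 0.1 + 0.2
print(total == 0.3)                             # False
print(math.isclose(total, 0.3, rel_tol=1e-9))   # True
```

For tensors, `torch.allclose` plays the same role as `math.isclose`.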

2. Mixed Precision Training: FP16/BF16 + FP32

Why Mixed Precision?

FP16/BF16: can run 2-8x faster and uses less VRAM
But: small gradients underflow to zero in FP16

Solution: loss scaling, automated by a gradient scaler

Forward:    FP16/BF16  ← fast compute
Scale up:   loss × 2^k before backward  ← keeps small gradients from underflowing
Backward:   FP16/BF16  ← scaled gradients
Scale down: gradients ÷ 2^k before the optimizer step
Update:     FP32 weights  ← high precision
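
The effect of loss scaling can be sketched without a GPU. The example below emulates FP16 rounding with the standard library's `struct` format `'e'` (IEEE half precision). The gradient value `1e-8` and the scale `2^16` are hypothetical numbers chosen for illustration. Scaling the loss scales every gradient by the same factor, so the sketch scales the gradient directly.

```python
import struct

def to_fp16(value):
    """Round a Python float to the nearest FP16 value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

grad = 1e-8          # hypothetical tiny gradient
scale = 2.0 ** 16    # hypothetical loss scale 2^k with k = 16

# Without scaling, the gradient is below the smallest FP16 subnormal (~6e-8).
unscaled = to_fp16(grad)

# With scaling, the value is representable in FP16.
# Unscaling happens in higher precision (Python floats are FP64).
recovered = to_fp16(grad * scale) / scale

print(unscaled)    # 0.0
print(recovered)   # close to 1e-8
```

The unscaled gradient is lost entirely, while the scaled one survives with a small relative rounding error.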

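import torch
from torch import nn

# MyModel and dataloader stand in for your own model and data pipeline
model = MyModel().cuda()
criterion = nn.CrossEntropyLoss()   # example loss; use whatever your task needs
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# Manages loss scaling automatically (PyTorch >= 2.3; older versions: torch.cuda.amp.GradScaler())
scaler = torch.amp.GradScaler("cuda")

for batch in dataloader:
    x, y = batch
    x, y = x.cuda(), y.cuda()
    optimizer.zero_grad()

    # autocast: runs eligible ops in FP16 and keeps precision-sensitive ops in FP32.
    # With dtype=torch.bfloat16 the scaler is usually unnecessary, since BF16 has the FP32 range.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        output = model(x)
        loss = criterion(output, y)

    # Scale the loss before backward (prevents gradient underflow)
    scaler.scale(loss).backward()

    # Unscale the gradients so that clipping sees their true magnitude
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

    # Update (the step is skipped if any gradient contains NaN/inf)
    scaler.step(optimizer)
    scaler.update()   # adjusts the scale factor automatically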

# FP16 vs BF16: which one, and when?
# FP16: older GPUs (V100); finer precision within its narrow range
# BF16: newer GPUs (A100, H100); more stable, same range as FP32 → DEFAULT
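
The range versus precision trade-off between FP16 and BF16 can be sketched with the standard library alone. One assumption to note: `to_bf16_truncated` emulates BF16 by dropping the low 16 bits of the FP32 encoding, whereas real hardware typically rounds to nearest. Both helper names are hypothetical.

```python
import struct

def to_fp16(value):
    """Round to FP16; values beyond the FP16 range become inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", value))[0]
    except OverflowError:
        return float("inf") if value > 0 else float("-inf")

def to_bf16_truncated(value):
    """Emulate BF16 by keeping the top 16 bits of the FP32 encoding.
    Assumption: truncation instead of round-to-nearest."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

# Range: 70000 overflows FP16 but fits in BF16 (coarsely).
print(to_fp16(70000.0))             # inf
print(to_bf16_truncated(70000.0))   # 69632.0

# Precision: 1 + 2^-8 survives in FP16 (10 mantissa bits) but not in BF16 (7 bits).
x = 1.0 + 2.0 ** -8
print(to_fp16(x))             # 1.00390625
print(to_bf16_truncated(x))   # 1.0
```

BF16 gives up fine detail to keep the FP32 range, which is usually the better trade for training.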

3. Numerical Stability: The Essential Tricks

3.1 Numerically Stable Softmax

import numpy as np
import torch

# ❌ Naive softmax: overflows with large logits
def softmax_naive(x):
    return np.exp(x) / np.exp(x).sum()

# ✅ Stable softmax: subtract the max first
def softmax_stable(x):
    x = x - x.max()               # shift → output is unchanged
    exp_x = np.exp(x)
    return exp_x / exp_x.sum()

logits = np.array([1000., 1001., 1002.])
print("Naive :", softmax_naive(logits))   # [nan, nan, nan] (inf / inf)
print("Stable:", softmax_stable(logits))  # [0.090, 0.245, 0.665] ✓

# PyTorch handles this internally:
x = torch.tensor([1000., 1001., 1002.])
print(torch.softmax(x, dim=0))   # [0.090, 0.245, 0.665] ✓
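
The claim that the shift leaves the output unchanged can be checked in one line. For any constant $c$ (the code above uses $c = \max_j x_j$):

```latex
\mathrm{softmax}(x - c)_i
  = \frac{e^{x_i - c}}{\sum_j e^{x_j - c}}
  = \frac{e^{-c}\, e^{x_i}}{e^{-c} \sum_j e^{x_j}}
  = \frac{e^{x_i}}{\sum_j e^{x_j}}
  = \mathrm{softmax}(x)_i
```

With $c = \max_j x_j$ every exponent is at most 0, so each $e^{x_i - c}$ lies in $(0, 1]$ and the denominator is at least 1. Neither overflow nor division by zero can occur.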

3.2 Numerically Stable Cross-Entropy

# ❌ Naive: log(softmax(x)), where exp overflows for large logits
def cross_entropy_naive(logits, label):
    probs = np.exp(logits) / np.sum(np.exp(logits))
    return -np.log(probs[label])

# ✅ Log-softmax trick: log(softmax(x)) = x - log(Σexp(x))
def cross_entropy_stable(logits, label):
    # log-sum-exp: avoids taking exp of large values
    m = logits.max()
    log_sum = m + np.log(np.sum(np.exp(logits - m)))
    return -(logits[label] - log_sum)

# PyTorch: F.cross_entropy uses log-softmax internally
import torch.nn.functional as F
logits = torch.tensor([[2.0, 1.0, 0.1]])
label  = torch.tensor([0])
loss = F.cross_entropy(logits, label)   # stable by default ✓
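
As a dependency-free check of the log-sum-exp trick, the sketch below computes the cross-entropy for logits that break the naive formula. It uses the standard library's `math` module, where `math.exp` raises `OverflowError` rather than returning `inf` as NumPy does. The function names are illustrative.

```python
import math

def log_sum_exp(values):
    """log(sum(exp(v))) computed without exponentiating large numbers."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def cross_entropy(logits, label):
    """Cross-entropy of one example: logsumexp(logits) - logits[label]."""
    return log_sum_exp(logits) - logits[label]

logits = [1000.0, 1001.0, 1002.0]
loss = cross_entropy(logits, 2)
print(round(loss, 4))   # 0.4076

# The naive route fails before it even reaches the division.
try:
    math.exp(1000.0)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True
print(naive_overflowed)   # True
```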

3.3 Epsilon: Avoiding Division by Zero

import torch
import torch.nn.functional as F

# Stable Batch Normalization
def batch_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(dim=0)
    var  = x.var(dim=0, unbiased=False)   # batch norm normalizes with the biased variance
    x_norm = (x - mean) / torch.sqrt(var + eps)   # ← eps prevents division by zero when var ≈ 0
    return gamma * x_norm + beta

# Attention scaling: keeps softmax out of its saturated, vanishing-gradient regime
def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / (d_k ** 0.5)   # ← divide by √d_k so dot products do not grow with d_k
    weights = F.softmax(scores, dim=-1)
    return weights @ V
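
The factor $\sqrt{d_k}$ in `attention` follows from a short variance argument. As a modeling assumption, take the components $q_i$ and $k_i$ to be independent with zero mean and unit variance:

```latex
\mathrm{Var}(q \cdot k)
  = \sum_{i=1}^{d_k} \mathrm{Var}(q_i k_i)
  = \sum_{i=1}^{d_k} \mathbb{E}[q_i^2]\, \mathbb{E}[k_i^2]
  = d_k
\qquad\Longrightarrow\qquad
\mathrm{Var}\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = 1
```

Without the scaling, the scores have standard deviation $\sqrt{d_k}$, so for large $d_k$ the softmax concentrates on one entry and its gradients shrink toward zero.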

4. Condition Number: When a Matrix Is "Sensitive"

The condition number $\kappa(A) = \frac{\sigma_{max}}{\sigma_{min}}$ measures how sensitive the solution of $Ax = b$ is to small perturbations in the input.

import numpy as np

# Ill-conditioned matrix: high condition number
A_bad = np.array([[1., 1.],
                   [1., 1.0001]])   # nearly singular!

kappa = np.linalg.cond(A_bad)
print(f"Condition number (bad): {kappa:.2e}")   # ~4e4, very high!

# Well-conditioned matrix
A_good = np.eye(2)
print(f"Condition number (good): {np.linalg.cond(A_good):.2f}")   # 1.00

# Solving a system with an ill-conditioned matrix → the answer is only as reliable as b
b = np.array([2., 2.0001])
try:
    x = np.linalg.solve(A_bad, b)
    print(f"Solution: {x}")   # close to [1, 1] here, but a tiny change in b moves it a lot
except np.linalg.LinAlgError:
    print("Singular matrix!")   # not raised for a matrix that is merely ill-conditioned
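
The sensitivity that the condition number measures becomes visible when the right-hand side is perturbed slightly. The sketch below uses the same matrix and a hypothetical helper `solve_2x2` (Cramer's rule, pure Python) so that it runs without NumPy. The perturbation of 0.0001 is an arbitrary illustrative choice.

```python
def solve_2x2(A, b):
    """Solve a 2x2 linear system with Cramer's rule (demo helper)."""
    (a11, a12), (a21, a22) = A
    det = a11 * a22 - a12 * a21
    x1 = (b[0] * a22 - a12 * b[1]) / det
    x2 = (a11 * b[1] - a21 * b[0]) / det
    return x1, x2

A_bad = [[1.0, 1.0],
         [1.0, 1.0001]]

x_original = solve_2x2(A_bad, [2.0, 2.0001])    # approximately (1, 1)
x_perturbed = solve_2x2(A_bad, [2.0, 2.0002])   # approximately (0, 2)

print(x_original)
print(x_perturbed)
```

Changing one entry of `b` by 0.0001 moves the solution from about (1, 1) to about (0, 2). The solver itself is accurate in both cases. The problem is that any noise already present in `b` is amplified by a factor on the order of $\kappa(A)$.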

5. Cholesky Decomposition: Gaussian Processes

A Gaussian Process needs to apply the inverse of its covariance matrix. Use a Cholesky factorization instead of forming the full inverse:

$K = LL^T \quad \text{(Cholesky)}$
$K^{-1}b = L^{-T}(L^{-1}b) \quad \text{(solve two triangular systems)}$

import torch

def gp_predict(X_train, y_train, X_test, kernel_fn, noise=1e-4):
    """
    Gaussian Process mean prediction using a Cholesky decomposition
    """
    n = X_train.shape[0]

    # Covariance matrix
    K = kernel_fn(X_train, X_train)
    # Jitter on the diagonal keeps K positive definite (out-of-place, so the kernel output is not mutated)
    K = K + noise * torch.eye(n, dtype=K.dtype, device=K.device)

    # Cholesky: K = L @ L.T
    L = torch.linalg.cholesky(K)

    # Solve K⁻¹y via Cholesky (more stable than inverting K directly)
    alpha = torch.cholesky_solve(y_train.unsqueeze(-1), L)

    # Predict
    K_star = kernel_fn(X_test, X_train)
    mu = K_star @ alpha

    return mu.squeeze()
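
To show what the two triangular solves do, here is a pure-Python sketch of the same idea on a 2x2 matrix. The functions `cholesky` and `solve_with_cholesky` are illustrative helpers, not the PyTorch routines used above, and they are written for clarity rather than speed.

```python
import math

def cholesky(K):
    """Return lower-triangular L with K = L @ L.T, or raise if K is not positive definite."""
    n = len(K)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                d = K[i][i] - s
                if d <= 0.0:
                    raise ValueError("matrix is not positive definite")
                L[i][j] = math.sqrt(d)
            else:
                L[i][j] = (K[i][j] - s) / L[j][j]
    return L

def solve_with_cholesky(L, b):
    """Solve K x = b given K = L @ L.T: forward substitution, then back substitution."""
    n = len(b)
    z = [0.0] * n
    for i in range(n):                 # L z = b
        z[i] = (b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):       # L.T x = z
        x[i] = (z[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

K = [[4.0, 2.0],
     [2.0, 3.0]]
b = [2.0, 1.0]
x = solve_with_cholesky(cholesky(K), b)
print(x)   # approximately [0.5, 0.0]
```

Calling `cholesky` on a singular matrix such as `[[1.0, 1.0], [1.0, 1.0]]` raises the `ValueError`, and adding a small jitter to the diagonal makes the factorization succeed again.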

6. Debugging Numerical Issues

import torch
import torch.nn.functional as F

def check_numerics(tensor, name="tensor"):
    """Utility: check for NaN/Inf during training"""
    has_nan = torch.isnan(tensor).any().item()
    has_inf = torch.isinf(tensor).any().item()

    if has_nan or has_inf:
        print(f"⚠️  {name}: NaN={has_nan}, Inf={has_inf}")
        print(f"   min={tensor.min():.4f}, max={tensor.max():.4f}")
        return False
    return True

# Usage inside a training loop:
def training_step(model, batch):
    x, y = batch
    logits = model(x)

    check_numerics(logits, "logits")

    loss = F.cross_entropy(logits, y)

    if not check_numerics(loss, "loss"):
        return None   # skip this batch

    loss.backward()

    # Check the gradients
    for name, param in model.named_parameters():
        if param.grad is not None:
            check_numerics(param.grad, f"grad/{name}")

    return loss

# PyTorch anomaly detection (debug mode, slower); model, x, y come from your training loop:
with torch.autograd.detect_anomaly():
    loss = F.cross_entropy(model(x), y)
    loss.backward()   # raises with a stack trace pointing at the first op that produced NaN

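7. Common Errors

Error	Cause	Fix
Loss → NaN	FP16 overflow, learning rate too high	GradScaler, lower the learning rate, use BF16
Softmax → NaN or all zeros	Logits too large or too negative in a naive implementation	Stable softmax, check normalization
Cholesky failed	Matrix is not positive definite	Add jitter `+ eps * I`
Results differ between CPU and GPU	Floating-point addition is not associative, so a different reduction order gives slightly different results	Compare with a tolerance (`torch.allclose`); `torch.use_deterministic_algorithms(True)` for run-to-run reproducibility on one device
Catastrophic cancellation	Subtracting two nearly equal numbers	Reformulate, e.g. in log-space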
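
The catastrophic cancellation row can be illustrated with a classic case: computing $e^x - 1$ for tiny $x$. The sketch below compares the naive subtraction with the standard library's `math.expm1`, which is one example of such a reformulation. The reference value comes from the first two Taylor terms, which is an approximation that is accurate far beyond double precision at this $x$.

```python
import math

x = 1e-10
reference = x + x * x / 2        # Taylor series; later terms are negligible here

naive = math.exp(x) - 1.0        # subtracts two nearly equal numbers
stable = math.expm1(x)           # computes exp(x) - 1 without the subtraction

naive_error = abs(naive - reference) / reference
stable_error = abs(stable - reference) / reference

print(naive_error > 1e-9)      # True
print(stable_error < 1e-12)    # True
```

The naive version keeps only about 7 correct digits, because rounding `exp(x)` to a double near 1 discards most of the information about `x`.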

8. Checklist

Explain why 0.1 + 0.2 ≠ 0.3 on a computer
Know the differences between FP16, BF16, and FP32 and when to use each
Implement a stable softmax and explain why it is needed
Use GradScaler for mixed precision training
Debug a NaN loss with check_numerics and anomaly detection

🎓 End of the Series

You have completed all 8 lessons on Mathematics in AI/ML:

#	Field	Core applications
1	Calculus	Backpropagation, Gradient Descent
2	Linear Algebra	Attention, Embedding, LoRA
3	Probability	MLE, VAE, Diffusion
4	Statistics	A/B testing, Batch Norm
5	Optimization	Adam, PPO, LR scheduling
6	Information Theory	Cross-entropy, KL, Perplexity
7	Discrete Mathematics	GNN, BPE, Tree-of-Thought
8	Numerical Methods	Mixed Precision, Stable Numerics

Next step: implement a mini Transformer from scratch in NumPy. It is the best test of whether you truly understand all 8 areas.
