Lesson 6: Information Theory (Measuring Uncertainty)

Series: Mathematics in AI/ML & Deep Learning

Topics: entropy cross-entropy KL-divergence VAE GAN perplexity

🎯 Why does Information Theory matter?

The cross-entropy loss you use every day? That is Information Theory.
KL divergence in a VAE? Information Theory.
Perplexity for evaluating GPT? Information Theory.

Claude Shannon (1948) built a framework for measuring the amount of information, and that framework became the backbone of modern AI.

1. Entropy: Measuring Uncertainty

\[H(X) = -\sum_x p(x) \log p(x)\]

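Intuition: high entropy = more surprise = more bits needed to encode an outcome. With $\log_2$ the unit is bits; with the natural log (used in the code below) the unit is nats.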

import numpy as np

def entropy(probs):
    """Shannon entropy in nats (natural log)."""
    probs = np.asarray(probs, dtype=float)
    # Convention: 0·log(0) = 0, so zero-probability outcomes are dropped
    nz = probs[probs > 0]
    # Written as sum p·log(1/p) so a certain outcome prints 0.0000, not -0.0000
    return np.sum(nz * np.log(1.0 / nz))

# Uniform distribution: maximum entropy (most uncertain)
uniform = [0.25, 0.25, 0.25, 0.25]
print(f"H(uniform) = {entropy(uniform):.4f}")   # 1.3863 nats

# Certain distribution: minimum entropy
certain = [1.0, 0.0, 0.0, 0.0]
print(f"H(certain) = {entropy(certain):.4f}")   # 0.0000

# Slightly skewed distribution:
skewed = [0.7, 0.1, 0.1, 0.1]
print(f"H(skewed)  = {entropy(skewed):.4f}")    # 0.9404

# Conclusion: H(uniform) > H(skewed) > H(certain)
# A good model produces low-entropy (confident) output concentrated on the correct class
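
The coin toss is the usual way to make the intuition concrete. The sketch below is an illustration rather than part of the original code: it measures entropy in bits (log base 2), and the helper name `entropy_bits` and the coin probabilities are chosen here purely as examples.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; outcomes with p = 0 contribute nothing."""
    # Summing -p*log2(p) term by term avoids evaluating log2(0)
    return sum(-p * math.log2(p) for p in probs if p > 0)

fair_coin = entropy_bits([0.5, 0.5])     # hardest to predict: one full bit per toss
biased_coin = entropy_bits([0.9, 0.1])   # mostly heads: less than half a bit
two_headed = entropy_bits([1.0, 0.0])    # outcome known in advance: zero bits

print(f"fair   : {fair_coin:.4f} bits")
print(f"biased : {biased_coin:.4f} bits")
print(f"certain: {two_headed:.4f} bits")
```

A fair coin needs one bit per toss, and any bias lowers the entropy because the outcome becomes easier to guess.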

2. Cross-Entropy: The Loss Function Found Everywhere

\[H(p, q) = -\sum_x p(x) \log q(x)\]

$p$: the true distribution (ground-truth labels)
$q$: the model's predicted distribution (softmax output)

Cross-entropy = entropy of $p$ + KL divergence from $p$ to $q$:

\[H(p, q) = H(p) + D_{KL}(p \| q)\]

Because $H(p)$ is fixed (it does not depend on the model), minimizing cross-entropy is the same as minimizing the KL divergence: the model's predictions move as close as possible to the ground truth.
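
The decomposition can be checked numerically. The following is a minimal sketch using only the standard library; the distributions `p` and `q` are hypothetical values picked for illustration, and the helper names are not from the original article.

```python
import math

def entropy(p):
    """H(p) in nats, skipping zero-probability outcomes."""
    return sum(pi * math.log(1.0 / pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum p(x) log q(x)."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """D_KL(p || q) = sum p(x) log(p(x) / q(x))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]   # hypothetical "true" distribution
q = [0.5, 0.3, 0.2]   # hypothetical model prediction

lhs = cross_entropy(p, q)
rhs = entropy(p) + kl(p, q)
print(f"H(p, q)          = {lhs:.6f}")
print(f"H(p) + KL(p||q)  = {rhs:.6f}")

# With a one-hot label, H(p) = 0, so cross-entropy reduces to -log q(correct class)
one_hot = [1.0, 0.0, 0.0]
ce_one_hot = cross_entropy(one_hot, q)
print(f"one-hot CE       = {ce_one_hot:.6f}  (-log 0.5 = {-math.log(0.5):.6f})")
```

The one-hot case is why classification loss is often described as the negative log-likelihood of the correct class.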

import numpy as np
import torch
import torch.nn.functional as F

# Multi-class classification
logits = torch.tensor([[3.0, 1.0, 0.2]])   # raw model output
label  = torch.tensor([0])                  # class 0 is the correct one

# PyTorch cross-entropy (applies log-softmax internally)
loss = F.cross_entropy(logits, label)
print(f"CE loss: {loss.item():.4f}")        # 0.1791

# Computing it by hand to see what happens:
probs   = F.softmax(logits, dim=-1)         # [0.8360, 0.1131, 0.0508]
p_true  = probs[0, label].item()            # 0.8360
manual  = -np.log(p_true)
print(f"Manual : {manual:.4f}")             # 0.1791 ✓ (matches F.cross_entropy)

# Binary cross-entropy (sigmoid output)
logit_bin = torch.tensor([2.0])
label_bin = torch.tensor([1.0])
bce = F.binary_cross_entropy_with_logits(logit_bin, label_bin)
print(f"BCE    : {bce.item():.4f}")         # 0.1269

3. KL Divergence: Measuring the Distance Between Distributions

\[D_{KL}(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)}\]

Key properties:

$D_{KL}(p \| q) \geq 0$: always non-negative
$D_{KL}(p \| q) = 0$ if and only if $p = q$
Not symmetric: in general $D_{KL}(p \| q) \neq D_{KL}(q \| p)$, so it is not a true distance metric

import torch
import torch.nn.functional as F

def kl_divergence(p, q, eps=1e-8):
    """KL(p||q), where p is the "true" and q the "approximate" distribution."""
    # Clamp to avoid log(0)
    p = torch.clamp(p, min=eps)
    q = torch.clamp(q, min=eps)
    return (p * (p.log() - q.log())).sum()

# Example: p = true distribution, q = model prediction
p = torch.tensor([0.4, 0.4, 0.2])   # ground truth
q = torch.tensor([0.3, 0.4, 0.3])   # model output

print(f"KL(p||q) = {kl_divergence(p, q):.4f}")   # 0.0340
print(f"KL(q||p) = {kl_divergence(q, p):.4f}")   # 0.0353
# Close here, but the two directions can differ a lot for other distributions!

# PyTorch built-in: input must be log-probabilities, target is p
kl = F.kl_div(q.log(), p, reduction='sum')
print(f"PyTorch KL: {kl:.4f}")                   # 0.0340, equals KL(p||q)
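
To see the asymmetry more clearly than in the nearly symmetric example above, here is a small sketch with a sharply peaked distribution against a uniform one. The two distributions and the names `peaked` and `flat` are hypothetical, chosen only to make the gap visible.

```python
import math

def kl(p, q):
    """D_KL(p || q) in nats; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

peaked = [0.98, 0.01, 0.01]       # hypothetical confident distribution
flat = [1 / 3, 1 / 3, 1 / 3]      # uniform distribution

kl_peaked_flat = kl(peaked, flat)   # cost of modelling a peaked truth with a flat guess
kl_flat_peaked = kl(flat, peaked)   # cost of modelling a flat truth with a peaked guess

print(f"KL(peaked || flat) = {kl_peaked_flat:.4f}")
print(f"KL(flat || peaked) = {kl_flat_peaked:.4f}")
```

The second direction is about twice as large here because the peaked distribution assigns very small probability to outcomes that the flat one treats as common, and those terms are heavily penalized.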

KL Divergence in a VAE

import torch
import torch.nn.functional as F

def vae_loss(recon_x, x, mu, logvar):
    """
    ELBO = reconstruction term - KL
    Maximizing the ELBO = minimizing vae_loss
    """
    # Reconstruction loss: -log P(x|z), which for a fixed-variance Gaussian
    # decoder equals the squared error up to scale and an additive constant
    recon_loss = F.mse_loss(recon_x, x, reduction='sum')
    
    # KL(N(μ,σ²) || N(0,1)) has a closed form, with logvar = log σ²:
    # = -0.5 Σ(1 + log σ² - μ² - σ²)
    kl_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    
    return recon_loss + kl_loss
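
The closed-form KL term can be sanity-checked against direct numerical integration for a single latent dimension. This is only a sketch under stated assumptions: the values of `mu` and `logvar` are hypothetical encoder outputs, and the integration range and step count are arbitrary choices that happen to be wide and fine enough for these values.

```python
import math

def kl_closed_form(mu, logvar):
    """KL(N(mu, sigma^2) || N(0, 1)) for one dimension, with logvar = log(sigma^2)."""
    return -0.5 * (1.0 + logvar - mu ** 2 - math.exp(logvar))

def normal_logpdf(z, mean, var):
    """Log density of a univariate normal distribution."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (z - mean) ** 2 / var)

def kl_numeric(mu, logvar, lo=-12.0, hi=12.0, steps=24000):
    """Midpoint-rule approximation of the integral of q(z) * (log q(z) - log p(z))."""
    var = math.exp(logvar)
    dz = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        z = lo + (i + 0.5) * dz
        log_q = normal_logpdf(z, mu, var)    # encoder distribution N(mu, sigma^2)
        log_p = normal_logpdf(z, 0.0, 1.0)   # standard normal prior
        total += math.exp(log_q) * (log_q - log_p) * dz
    return total

mu, logvar = 0.8, -0.5   # hypothetical encoder outputs for one latent dimension
closed = kl_closed_form(mu, logvar)
numeric = kl_numeric(mu, logvar)
print(f"closed form: {closed:.6f}")
print(f"numeric    : {numeric:.6f}")
```

The two values should agree to several decimal places, and the closed form is exactly zero when `mu = 0` and `logvar = 0`, that is, when the encoder output already matches the prior.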

4. Mutual Information: Which Features Matter?

\[I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)\]

Intuition: how much does knowing $Y$ reduce our uncertainty about $X$?

from sklearn.feature_selection import mutual_info_classif
import numpy as np

np.random.seed(0)  # fixed seed so the scores are reproducible

# Feature selection with Mutual Information
X = np.random.randn(1000, 10)
X[:, 0] = X[:, 0] * 2           # feature 0 carries the signal
X[:, 1] = np.random.randn(1000) # feature 1 is pure random noise
y = (X[:, 0] > 0).astype(int)   # label depends only on feature 0

mi_scores = mutual_info_classif(X, y, random_state=0)
print("Mutual Information scores:")
for i, score in enumerate(mi_scores):
    print(f"  Feature {i}: {score:.4f}")
# Feature 0: roughly 0.6-0.7  ← highest, as expected (the true value is H(y) ≈ ln 2 ≈ 0.69 nats)
# Feature 1: ~0.0             ← close to 0, as expected
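
For discrete variables the definition can be evaluated exactly from a joint probability table. The sketch below uses a hypothetical 2×2 joint distribution (not from the original article) and checks that $H(X) - H(X|Y)$ agrees with the equivalent direct form $\sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}$.

```python
import math

# Hypothetical joint distribution p(x, y) over two binary variables:
# rows index x, columns index y
joint = [[0.4, 0.1],
         [0.1, 0.4]]

p_x = [sum(row) for row in joint]         # marginal of X
p_y = [sum(col) for col in zip(*joint)]   # marginal of Y

def entropy(probs):
    """Shannon entropy in nats, skipping zero-probability outcomes."""
    return sum(p * math.log(1.0 / p) for p in probs if p > 0)

# H(X|Y) = sum_y p(y) * H(X | Y = y)
h_x_given_y = sum(
    p_y[j] * entropy([joint[i][j] / p_y[j] for i in range(2)])
    for j in range(2)
)

mi_from_entropies = entropy(p_x) - h_x_given_y
mi_direct = sum(
    joint[i][j] * math.log(joint[i][j] / (p_x[i] * p_y[j]))
    for i in range(2) for j in range(2) if joint[i][j] > 0
)

print(f"H(X) - H(X|Y) = {mi_from_entropies:.4f} nats")
print(f"direct sum    = {mi_direct:.4f} nats")
```

If the table were replaced by an independent joint (every entry 0.25), both expressions would give zero, since knowing $Y$ would then tell us nothing about $X$.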

5. Perplexity: Evaluating Language Models

\[\text{Perplexity} = e^{H(p,q)} = e^{-\frac{1}{N}\sum_{i=1}^N \log P(w_i | w_1, ..., w_{i-1})}\]

Intuition: a perplexity of 100 means the model is as “confused” as if it had to choose uniformly among 100 words at every step.
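
That intuition follows directly from the formula: if a model assigns probability 1/100 to every observed token, the average negative log-likelihood is $\log 100$ and the perplexity is exactly 100. A minimal sketch, with a hypothetical helper `perplexity` and made-up token probabilities:

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability assigned to the observed tokens."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# A model that spreads its belief uniformly over 100 candidate words
uniform_over_100 = [1 / 100] * 20
# A model that gives every observed token probability 0.5
coin_flip_model = [0.5] * 20

ppl_uniform = perplexity(uniform_over_100)
ppl_coin = perplexity(coin_flip_model)
print(f"uniform over 100 words: {ppl_uniform:.2f}")
print(f"probability 0.5 each  : {ppl_coin:.2f}")
```

In the same way, a perplexity of 2 corresponds to a model that is effectively tossing a fair coin at each step.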

import torch
import numpy as np

def compute_perplexity(model, tokenizer, text):
    """Compute an LLM's perplexity on a piece of text.

    Assumes a Hugging Face-style causal LM and tokenizer.
    """
    tokens = tokenizer.encode(text, return_tensors='pt')
    
    with torch.no_grad():
        outputs = model(tokens, labels=tokens)
        # outputs.loss = cross-entropy averaged over the tokens
        avg_nll = outputs.loss.item()
    
    perplexity = np.exp(avg_nll)
    return perplexity

# Manual illustration:
# Suppose the model predicts these log-probabilities for each token
log_probs = [-0.5, -1.2, -0.8, -2.1, -0.6]  # log P(wᵢ|w₁..wᵢ₋₁)
avg_nll = -np.mean(log_probs)
perplexity = np.exp(avg_nll)
print(f"Perplexity: {perplexity:.2f}")  # → e^1.04 ≈ 2.83

Model	Perplexity (WikiText-103, approximate)	Interpretation
GPT-2 Large	~18	Good
GPT-3	~8	Very good
Random baseline (uniform over a ~50,000-token vocabulary)	~50,000	Terrible

6. Log-Sum-Exp Trick: Avoiding Overflow

import numpy as np

# ❌ Naive softmax → overflows for large values
def softmax_naive(x):
    return np.exp(x) / np.sum(np.exp(x))

x = np.array([1000.0, 1001.0, 1002.0])
print(softmax_naive(x))   # [nan nan nan]: exp overflows to inf, and inf/inf = nan

# ✅ Numerically stable softmax (max subtraction, the same idea as log-sum-exp)
def softmax_stable(x):
    x = x - np.max(x)   # subtract the max: result unchanged, overflow avoided
    return np.exp(x) / np.sum(np.exp(x))

print(softmax_stable(x))  # [0.0900, 0.2447, 0.6652] ✓

# Log-sum-exp: log(sum(exp(x))) computed without overflow
def log_sum_exp(x):
    c = np.max(x)
    return c + np.log(np.sum(np.exp(x - c)))

# Applied to cross-entropy: -log softmax(logits)[label]
def stable_cross_entropy(logits, label):
    log_sum = log_sum_exp(logits)
    return -(logits[label] - log_sum)
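
A quick way to convince yourself the trick works is to shift the logits by a large constant: the loss should stay the same, while a naive implementation fails. The sketch below restates the same computation using only the standard library so it runs on its own; the function `naive_cross_entropy` and the test logits are illustrative assumptions.

```python
import math

def log_sum_exp(xs):
    """log(sum(exp(x))) computed after subtracting the max for stability."""
    c = max(xs)
    return c + math.log(sum(math.exp(x - c) for x in xs))

def stable_cross_entropy(logits, label):
    """-log softmax(logits)[label], written as logsumexp minus the correct logit."""
    return log_sum_exp(logits) - logits[label]

def naive_cross_entropy(logits, label):
    """Same quantity computed directly; math.exp overflows for large logits."""
    exps = [math.exp(x) for x in logits]
    return -math.log(exps[label] / sum(exps))

small = [3.0, 1.0, 0.2]
large = [1003.0, 1001.0, 1000.2]   # the same logits shifted by 1000

loss_small = stable_cross_entropy(small, 0)
loss_large = stable_cross_entropy(large, 0)
print(f"stable, small logits: {loss_small:.4f}")
print(f"stable, large logits: {loss_large:.4f}")

try:
    naive_cross_entropy(large, 0)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True
print(f"naive version overflowed: {naive_overflowed}")
```

Softmax is invariant to adding a constant to every logit, which is why subtracting the maximum changes nothing mathematically while keeping every exponent at or below zero.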

7. Checklist

Explain entropy using a coin-toss example
Know that cross-entropy loss = negative log-likelihood
Remember that KL divergence is asymmetric and never negative
Understand what perplexity measures and that lower is better
Be able to implement the log-sum-exp trick
