Lesson 4: Statistics (Separating Signal from Noise)

Series: Mathematics in AI/ML & Deep Learning

Topics: statistics hypothesis-testing bias-variance batch-norm A/B-testing

🎯 Why does statistics matter?

Real-world data is noisy. The job of statistics is to separate the real signal from random noise.

Model A accuracy = 87.3%, Model B = 87.5% → is B really better, or just lucky?
Does the new feature increase conversion, or is it just random fluctuation?
Batch Normalization uses mean/std: computed from what, and when?

1. Descriptive Statistics: describing the data

import numpy as np
from scipy import stats

# Data: inference latency (ms) for 1000 requests
latency = np.random.lognormal(mean=3, sigma=0.5, size=1000)

mean   = np.mean(latency)          # mean: sensitive to outliers
median = np.median(latency)        # median: more robust
std    = np.std(latency)           # standard deviation (population form, ddof=0)
var    = np.var(latency)           # variance = std²
p95    = np.percentile(latency, 95)  # P95 latency: the metric that matters in practice

print(f"Mean   : {mean:.1f} ms")
print(f"Median : {median:.1f} ms")
print(f"Std    : {std:.1f} ms")
print(f"P95    : {p95:.1f} ms")

# When mean >> median, the data is right-skewed (a few large outliers)
# → describe it with median + IQR instead of mean + std
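The advice to prefer median + IQR on skewed data can be checked directly. The following is a minimal sketch, assuming the same kind of lognormal latency data as above; the seed and the 10,000 ms outlier are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
latency = rng.lognormal(mean=3, sigma=0.5, size=1000)

# Quartiles and the interquartile range (spread of the middle 50% of requests)
q1, median, q3 = np.percentile(latency, [25, 50, 75])
iqr = q3 - q1

# Add one extreme outlier and see which statistic moves more
with_outlier = np.append(latency, 10_000.0)
mean_shift = np.mean(with_outlier) - np.mean(latency)
median_shift = np.median(with_outlier) - np.median(latency)

print(f"Median: {median:.1f} ms, IQR: {iqr:.1f} ms")
print(f"Mean shift from one outlier  : {mean_shift:.2f} ms")
print(f"Median shift from one outlier: {median_shift:.2f} ms")
```

A single bad request moves the mean by several milliseconds while the median barely changes, which is why the median and IQR are the safer summary here.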

Covariance & Correlation: are two features related?

\[\text{Cov}(X, Y) = \mathbb{E}[(X - \mu_X)(Y - \mu_Y)]\] \[\rho_{XY} = \frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y} \in [-1, 1]\]
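To connect the two formulas to code, here is a small sketch that computes them by hand and compares the result with `np.corrcoef`. The data (a noisy linear relation with slope 2) is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(scale=0.5, size=500)  # y depends linearly on x, plus noise

# Cov(X, Y) = E[(X - mu_X)(Y - mu_Y)], estimated with sample means
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

# rho = Cov(X, Y) / (sigma_X * sigma_Y); np.std uses ddof=0, consistent with np.mean above
rho = cov_xy / (x.std() * y.std())

print(f"Cov(X, Y)         = {cov_xy:.3f}")
print(f"rho (by hand)     = {rho:.3f}")
print(f"rho (np.corrcoef) = {np.corrcoef(x, y)[0, 1]:.3f}")
```

Covariance depends on the units of X and Y, whereas dividing by both standard deviations makes the correlation unit-free and bounded.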

# Correlation matrix for feature selection
import numpy as np

# 4 features, 100 samples
X = np.random.randn(100, 4)
X[:, 2] = X[:, 0] * 2 + 0.1 * np.random.randn(100)  # feature 2 ~ feature 0

corr = np.corrcoef(X.T)  # corrcoef expects one variable per row, hence the transpose
print("Correlation matrix:")
print(corr.round(2))
# feature 0 and feature 2 will have a correlation of ~1.0
# → drop one of the two (multicollinearity)
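Reading a correlation matrix by eye stops scaling after a handful of features. The sketch below lists the pairs above a cutoff; `highly_correlated_pairs` and the 0.95 threshold are hypothetical choices for illustration, not a standard API.

```python
import numpy as np

def highly_correlated_pairs(X, threshold=0.95):
    """Return (i, j, r) for feature pairs whose |r| exceeds the threshold."""
    corr = np.corrcoef(X, rowvar=False)             # columns are features
    i_idx, j_idx = np.triu_indices_from(corr, k=1)  # upper triangle, diagonal excluded
    return [(int(i), int(j), float(corr[i, j]))
            for i, j in zip(i_idx, j_idx)
            if abs(corr[i, j]) > threshold]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[:, 2] = X[:, 0] * 2 + 0.1 * rng.normal(size=100)  # feature 2 is nearly a copy of feature 0

for i, j, r in highly_correlated_pairs(X):
    print(f"feature {i} and feature {j}: r = {r:.3f}")
```

Which member of a flagged pair to drop is a modeling decision; the sketch only surfaces the candidates.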

2. Hypothesis Testing: telling real effects from noise

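The standard framework

H₀ (null hypothesis): there is no effect
H₁ (alternative):     there is an effect

p-value = P(a result at least as extreme as the one observed | H₀ is true)

If p-value < α (usually 0.05):
    → Reject H₀ → the result is "statistically significant"
Otherwise:
    → Fail to reject H₀ (this is not proof that H₀ is true)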
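The definition of the p-value can be made concrete with a permutation test: if H₀ is true, the group labels are interchangeable, so shuffling them shows how often chance alone produces a difference as large as the observed one. This is a minimal sketch on synthetic data; `permutation_p_value` is a hypothetical helper, not a library function.

```python
import numpy as np

def permutation_p_value(a, b, n_permutations=10_000, seed=0):
    """Two-sided p-value for the difference in means under H0: same distribution."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(pooled)  # under H0 the labels are exchangeable
        diff = abs(shuffled[:len(a)].mean() - shuffled[len(a):].mean())
        if diff >= observed:
            count += 1
    # +1 in numerator and denominator counts the observed labeling itself
    return (count + 1) / (n_permutations + 1)

rng = np.random.default_rng(1)
control = rng.normal(0.0, 1.0, 50)
treated_same = rng.normal(0.0, 1.0, 50)   # H0 is true by construction
treated_shift = rng.normal(1.0, 1.0, 50)  # H0 is false: mean shifted by one std

print(f"No real effect: p = {permutation_p_value(control, treated_same):.4f}")
print(f"Real effect   : p = {permutation_p_value(control, treated_shift):.4f}")
```

The p-value is the fraction of shuffles that look at least as extreme as the real data, which is exactly the conditional probability in the framework above.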

t-test: comparing two groups

from scipy import stats
import numpy as np

# Compare the accuracy of Model A vs Model B on 30 test sets
np.random.seed(42)
acc_A = np.random.normal(0.873, 0.015, 30)   # Model A
acc_B = np.random.normal(0.881, 0.015, 30)   # Model B

# ttest_ind treats the two samples as independent with equal variances;
# pass equal_var=False for Welch's t-test when the variances may differ
t_stat, p_value = stats.ttest_ind(acc_A, acc_B)
print(f"t-statistic : {t_stat:.3f}")
print(f"p-value     : {p_value:.4f}")
print(f"Significant : {'Yes' if p_value < 0.05 else 'No (maybe noise!)'}")

# If, say, p_value = 0.15 → not statistically significant
# → We cannot conclude that B is better than A, even though its mean is higher!
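When both models are scored on the same 30 test sets, the scores come in pairs, and a paired test is usually the better fit. The sketch below assumes each test set has its own difficulty that affects both models equally; the noise levels are invented for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical setup: a per-test-set difficulty term shared by both models
difficulty = rng.normal(0.0, 0.015, 30)
acc_A = 0.873 + difficulty + rng.normal(0.0, 0.004, 30)
acc_B = 0.881 + difficulty + rng.normal(0.0, 0.004, 30)

_, p_welch = stats.ttest_ind(acc_A, acc_B, equal_var=False)  # ignores the pairing
_, p_paired = stats.ttest_rel(acc_A, acc_B)                  # tests the per-set differences

print(f"Welch (unpaired) p-value: {p_welch:.4f}")
print(f"Paired p-value          : {p_paired:.6f}")
```

The paired test subtracts out the shared difficulty, so the same mean difference stands out against much less noise. If the two models were evaluated on different test sets, the unpaired test would be the appropriate one.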

A/B testing for model deployment

import numpy as np
from scipy.stats import norm

def ab_test_sample_size(baseline_rate, min_detectable_effect,
                         alpha=0.05, power=0.8):
    """
    Compute the sample size per group needed for an A/B test
    baseline_rate: current metric (e.g., 0.05 = 5% conversion)
    min_detectable_effect: smallest absolute uplift to detect
                           (e.g., 0.01 = +1 percentage point)
    """
    p1 = baseline_rate
    p2 = baseline_rate + min_detectable_effect
    
    z_alpha = norm.ppf(1 - alpha/2)   # 1.96 for α=0.05 (two-sided)
    z_beta  = norm.ppf(power)          # 0.84 for power=0.8
    
    p_bar = (p1 + p2) / 2
    
    n = (z_alpha * np.sqrt(2 * p_bar * (1 - p_bar)) +
         z_beta  * np.sqrt(p1*(1-p1) + p2*(1-p2))) ** 2 / (p2 - p1) ** 2
    
    return int(np.ceil(n))

n = ab_test_sample_size(baseline_rate=0.05, min_detectable_effect=0.005)
print(f"Need {n:,} users/group to detect a +0.5 pp uplift")  # roughly 31,000
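Once the experiment has collected the planned sample, the conversion counts still have to be tested. A common choice is a two-proportion z-test, sketched below under the normal approximation; `two_proportion_z_test` is a hypothetical helper and the counts are made up.

```python
import numpy as np
from scipy.stats import norm

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for H0: both groups share the same conversion rate."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)                  # pooled rate under H0
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * norm.sf(abs(z))                             # two-sided tail probability
    return z, p_value

# Made-up counts for illustration: 5.0% vs 5.5% conversion, 30,000 users per group
z, p_value = two_proportion_z_test(conv_a=1500, n_a=30_000, conv_b=1650, n_b=30_000)
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```

Fixing the sample size in advance and testing once at the end keeps the false-positive rate at α; checking repeatedly and stopping at the first p < 0.05 does not.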

3. Bias-Variance Trade-off

\[\text{MSE} = \underbrace{\text{Bias}^2}_{\text{Underfitting}} + \underbrace{\text{Variance}}_{\text{Overfitting}} + \text{Irreducible Noise}\]
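In a simulation where the true function is known, the first two terms of the decomposition can be estimated by refitting the same model on many independent training sets. This is a sketch under stated assumptions: the target is sin(2πx), noise std is 0.3, and `bias_variance` is a hypothetical helper built on `np.polyfit`.

```python
import numpy as np

def bias_variance(degree, n_datasets=200, n_train=30, noise_std=0.3, seed=0):
    """Estimate bias^2 and variance of a polynomial fit by resampling training sets."""
    rng = np.random.default_rng(seed)
    x_test = np.linspace(0.05, 0.95, 50)
    f_test = np.sin(2 * np.pi * x_test)        # true function, known only in simulation
    preds = np.empty((n_datasets, x_test.size))
    for i in range(n_datasets):
        x = rng.uniform(0, 1, n_train)
        y = np.sin(2 * np.pi * x) + noise_std * rng.normal(size=n_train)
        coefs = np.polyfit(x, y, degree)       # least-squares polynomial fit
        preds[i] = np.polyval(coefs, x_test)
    # Bias^2: squared gap between the average prediction and the truth
    bias_sq = np.mean((preds.mean(axis=0) - f_test) ** 2)
    # Variance: how much predictions move from one training set to another
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

for degree in [1, 3, 9]:
    b2, var = bias_variance(degree)
    print(f"degree {degree}: bias^2 = {b2:.4f}, variance = {var:.4f}")
```

Bias falls and variance rises as the degree grows. The irreducible noise term is `noise_std ** 2` regardless of the model.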

         High Bias          Balanced           High Variance
         (Underfitting)                        (Overfitting)
         
Train:   │ ●●●●●●          │ ●●●●●●           │ ●●●●●●
Error:   │ High            │ Low              │ Very Low
         │                 │                  │
Val:     │ ●●●●●●          │ ●●●●●●           │ ●●●●●●●●●●●
Error:   │ High            │ Low              │ Very High
         │                 │                  │
Fix:     More complex      ✓ Good!            Regularization,
         model, features                      more data, dropout

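# Example: polynomial regression with different degrees
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

np.random.seed(42)
X = np.linspace(0, 1, 100).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + 0.3 * np.random.randn(100)

# X is sorted, so shuffle the folds; otherwise each validation fold is a
# contiguous slice of [0, 1] that the model never saw during training
cv = KFold(n_splits=5, shuffle=True, random_state=42)

for degree in [1, 4, 15]:
    poly = PolynomialFeatures(degree)
    X_poly = poly.fit_transform(X)
    
    # Cross-validation: estimates the generalization error
    scores = cross_val_score(Ridge(), X_poly, y, cv=cv,
                              scoring='neg_mean_squared_error')
    mse_cv = -scores.mean()
    print(f"Degree {degree:2d}: CV MSE = {mse_cv:.4f}")

# Reading the output (exact values depend on the seed, the fold split, and Ridge's alpha):
# Degree  1: a straight line cannot follow a sine wave → high bias (underfitting)
# Degree  4: flexible enough to bend with the curve → lower bias than degree 1
# Degree 15: the most flexible → prone to high variance (overfitting);
#            Ridge's default alpha=1.0 damps this, so lower alpha to see it grow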

4. Batch Normalization: statistics in deep learning

Batch Norm normalizes activations with the mean and std of the current batch, which helps keep training stable:

\[\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}\] \[y_i = \gamma \hat{x}_i + \beta \quad \text{(learned scale \& shift)}\]
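The batch statistics in the formula are not spelled out in the text. Under the usual convention, for one feature over a mini-batch of m values x_1, ..., x_m (the symbol m is notation introduced here), they would be:

\[\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i \qquad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2\]

Note that the variance divides by m rather than m - 1, and that ε sits inside the square root to keep the denominator away from zero.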

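import torch
import torch.nn as nn

# Batch Norm in PyTorch
bn = nn.BatchNorm1d(num_features=64)

x = torch.randn(32, 64)   # (batch=32, features=64)
out = bn(x)               # a new module starts in training mode → uses batch statistics

# Check: after BN, mean ~ 0 and std ~ 1 (before scale/shift)
with torch.no_grad():
    # BN normalizes with the biased variance and puts eps inside the square root
    var = x.var(dim=0, unbiased=False)
    x_norm = (x - x.mean(dim=0)) / torch.sqrt(var + 1e-5)
    print(f"Manual BN - mean: {x_norm.mean():.4f}, std: {x_norm.std():.4f}")
    # → mean: ~0.0000, std: ~1.0000 ✓
    # gamma=1 and beta=0 at initialization, so the module output should match
    print(torch.allclose(out, x_norm, atol=1e-5))

# Train vs eval: BN uses batch statistics in training mode
# and running statistics (an EMA) in eval mode
bn.train()   # uses the batch mean/std
bn.eval()    # uses the stored running_mean, running_var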
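The train/eval distinction is easier to see with the bookkeeping written out. The class below is a simplified NumPy sketch, not PyTorch's implementation: `BatchNorm1dSketch` is a hypothetical name, gamma and beta are fixed rather than learned, and the momentum of 0.1 and the unbiased variance in the running update are assumptions chosen to mirror PyTorch's documented defaults.

```python
import numpy as np

class BatchNorm1dSketch:
    """Simplified batch norm over (batch, features) inputs."""

    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        self.eps, self.momentum = eps, momentum
        self.gamma = np.ones(num_features)   # scale (fixed here, learned in practice)
        self.beta = np.zeros(num_features)   # shift (fixed here, learned in practice)
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.training = True

    def __call__(self, x):
        if self.training:
            mean = x.mean(axis=0)
            var = x.var(axis=0)                      # biased variance, used to normalize
            n = x.shape[0]
            unbiased_var = var * n / (n - 1)         # stored in the running estimate
            m = self.momentum
            # Exponential moving average of the batch statistics
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * unbiased_var
        else:
            mean, var = self.running_mean, self.running_var
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

rng = np.random.default_rng(0)
bn = BatchNorm1dSketch(num_features=4)
for _ in range(200):                                 # "training": running stats converge
    bn(rng.normal(loc=5.0, scale=2.0, size=(32, 4)))

bn.training = False                                  # eval mode: no batch statistics needed
single = np.full((1, 4), 5.0)                        # one sample sitting at the data mean
print(bn(single))
```

Eval mode works on a batch of one because it never computes batch statistics. Forgetting to switch modes at inference time makes predictions depend on whatever else happens to be in the batch.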

5. Bootstrap & Cross-Validation

from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
import numpy as np

# K-Fold Cross-Validation: estimates model performance
def kfold_cv(model, X, y, k=5):
    X, y = np.asarray(X), np.asarray(y)   # positional indexing below needs arrays
    kf = StratifiedKFold(n_splits=k, shuffle=True, random_state=42)
    scores = []
    
    for fold, (train_idx, val_idx) in enumerate(kf.split(X, y)):
        X_train, X_val = X[train_idx], X[val_idx]
        y_train, y_val = y[train_idx], y[val_idx]
        
        model.fit(X_train, y_train)
        score = accuracy_score(y_val, model.predict(X_val))
        scores.append(score)
    
    return np.mean(scores), np.std(scores)

# Bootstrap: confidence interval for a metric
def bootstrap_ci(y_true, y_pred, n_bootstrap=1000, ci=0.95, seed=None):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    rng = np.random.default_rng(seed)     # pass a seed for reproducible intervals
    n = len(y_true)
    boot_scores = []
    
    for _ in range(n_bootstrap):
        idx = rng.choice(n, n, replace=True)   # resample examples with replacement
        score = accuracy_score(y_true[idx], y_pred[idx])
        boot_scores.append(score)
    
    # Percentile interval: the middle `ci` fraction of the bootstrap scores
    lower = np.percentile(boot_scores, (1 - ci) / 2 * 100)
    upper = np.percentile(boot_scores, (1 + ci) / 2 * 100)
    return lower, upper
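The same percentile bootstrap can answer the question from the introduction: is 87.5% really better than 87.3%? The sketch below resamples the same test examples for both models and builds an interval for the accuracy difference. `paired_bootstrap_diff` is a hypothetical helper, and the labels and predictions are synthetic.

```python
import numpy as np

def paired_bootstrap_diff(y_true, pred_a, pred_b, n_bootstrap=2000, ci=0.95, seed=0):
    """Percentile CI for accuracy(B) - accuracy(A) on a shared test set."""
    rng = np.random.default_rng(seed)
    y_true, pred_a, pred_b = map(np.asarray, (y_true, pred_a, pred_b))
    correct_a = (pred_a == y_true).astype(float)
    correct_b = (pred_b == y_true).astype(float)
    n = len(y_true)
    diffs = np.empty(n_bootstrap)
    for i in range(n_bootstrap):
        idx = rng.integers(0, n, size=n)   # same resampled examples for both models
        diffs[i] = correct_b[idx].mean() - correct_a[idx].mean()
    lower = np.percentile(diffs, (1 - ci) / 2 * 100)
    upper = np.percentile(diffs, (1 + ci) / 2 * 100)
    return lower, upper

# Synthetic data for illustration: two models with nearly identical true accuracy
rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=2000)
pred_a = np.where(rng.random(2000) < 0.873, y_true, 1 - y_true)
pred_b = np.where(rng.random(2000) < 0.875, y_true, 1 - y_true)

lower, upper = paired_bootstrap_diff(y_true, pred_a, pred_b)
print(f"95% CI for accuracy(B) - accuracy(A): [{lower:+.4f}, {upper:+.4f}]")
```

If the interval contains 0, the test set is too small to separate the two models, however the point estimates compare.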

6. Common mistakes

Mistake	Problem	Fix
P-hacking	Running tests repeatedly until p < 0.05	Apply the Bonferroni correction: α' = α/n_tests
Treating correlation as causation	A feature that correlates with the outcome does not necessarily cause it	Use an RCT or causal inference
Not checking normality	The t-test assumes normally distributed data	Use the Mann-Whitney U test for non-normal data
Using the mean on skewed data	Outliers pull the mean	Use the median and report P95
Not reporting a CI	A point estimate is not enough	Always include a confidence interval
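The P-hacking row can be illustrated with a small simulation: run 20 comparisons in which H₀ is true by construction, then count how many pass each threshold. This is a sketch on synthetic data; the group size of 50 and the seed are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n_tests = 0.05, 20

# 20 comparisons where both groups come from the same distribution (no real effect)
p_values = np.array([
    stats.ttest_ind(rng.normal(size=50), rng.normal(size=50)).pvalue
    for _ in range(n_tests)
])

alpha_bonf = alpha / n_tests                    # alpha' = alpha / n_tests
family_wise = 1 - (1 - alpha) ** n_tests        # chance of at least one false positive

print(f"P(at least one false positive), uncorrected: {family_wise:.2f}")
print(f"Rejections at alpha = {alpha}: {(p_values < alpha).sum()}")
print(f"Rejections at alpha' = {alpha_bonf:.4f}: {(p_values < alpha_bonf).sum()}")
```

With 20 independent tests at α = 0.05, a spurious "significant" result is more likely than not. Bonferroni restores the family-wise rate at the cost of power, so it is conservative when the tests are many or correlated.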

7. Checklist

Can compute mean, std, median, and P95, and know when to use each
Understand what a p-value is and why p < 0.05 does not mean "true"
Can compute the sample size before running an A/B test
Can explain the bias-variance trade-off with a concrete example
Know which statistics BN uses in training vs eval

🔗 Series

← Lesson 3: Probability
→ Lesson 5: Optimization (The Art of Hill Climbing)