Lesson 2: Linear Algebra, the Language of Data

Series: Mathematics in AI/ML & Deep Learning

Topics: linear-algebra matrix SVD attention embedding

🎯 Why does linear algebra matter?

Almost everything in AI is numbers arranged in multi-dimensional arrays:

A 256×256 RGB image → tensor (256, 256, 3)
The sentence “Hello world” → embedding matrix (2, 512), assuming 2 tokens and 512-dimensional embeddings
A batch of 32 sentences, each 128 tokens, with 768-dimensional embeddings → tensor (32, 128, 768)

Linear algebra is the toolkit for manipulating these structures efficiently.


1. Basic objects

Scalar, Vector, Matrix, Tensor

Scalar    Vector      Matrix       Tensor
  5       [1,2,3]    [[1,2],      [[[...]]]
                      [3,4]]      (3D+)
  0D        1D          2D          nD

import numpy as np
import torch

# Scalar: a single number
s = 3.14

# Vector: 1D array
v = np.array([1.0, 2.0, 3.0])          # shape: (3,)

# Matrix: 2D array
M = np.array([[1, 2, 3],
              [4, 5, 6]])               # shape: (2, 3)

# Tensor: nD array (ubiquitous in DL)
T = torch.randn(32, 128, 768)           # batch × seq_len × d_model
print(f"Tensor shape: {T.shape}")       # torch.Size([32, 128, 768])
print(f"Total elements: {T.numel():,}") # 3,145,728

Matrix multiplication: the heart of neural networks


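Every linear layer in DL is a matrix multiplication plus a bias:

\[\mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b}\]

Here $\mathbf{x}$ is a single column vector. With a batch stored as rows, as in the code that follows, the same layer is written $\mathbf{Z} = \mathbf{X}\mathbf{W} + \mathbf{b}$ with $\mathbf{W}$ of shape (in, out).

Shape rule: (m × k) @ (k × n) → (m × n). The inner dimensions must match.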

Input x       Weight W        Output z
(batch, in) × (in, out)   = (batch, out)
(32, 784)   × (784, 256)  = (32, 256)
     k    ← k must match →

# Linear layer = matmul + bias
batch, d_in, d_out = 32, 784, 256

x = np.random.randn(batch, d_in)
W = np.random.randn(d_in, d_out) * 0.01
b = np.zeros(d_out)

z = x @ W + b                  # shape: (32, 256)
print(f"Output shape: {z.shape}")

# PyTorch equivalent:
# nn.Linear(784, 256) manages W and b itself (it stores W with shape (256, 784))
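
The formula is written for one column vector, while the code multiplies a whole batch of row vectors. The sketch below checks, on small made-up shapes, that the two conventions give the same numbers once the weight matrix is transposed. The variable names (`W_col`, `z_rows`, `z_cols`) are illustrative and not part of the lesson's code.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, d_in, d_out = 4, 5, 3

x = rng.standard_normal((batch, d_in))   # one sample per row
W = rng.standard_normal((d_in, d_out))   # row convention: (in, out)
b = rng.standard_normal(d_out)

# Row convention, as in the NumPy snippet: Z = X W + b
z_rows = x @ W + b                        # (batch, d_out)

# Column convention, as in the formula z = W x + b, one sample at a time
W_col = W.T                               # (out, in)
z_cols = np.stack([W_col @ x_i + b for x_i in x])

print(z_rows.shape)                       # (4, 3)
print(np.allclose(z_rows, z_cols))        # True
```

Which convention a library uses only changes where the transpose sits, so it is worth checking the stored weight shape before copying weights between frameworks.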

2. Norms: measuring the “size” of a vector

L1, L2 Norm

\[\|\mathbf{v}\|_1 = \sum_i |v_i| \qquad \|\mathbf{v}\|_2 = \sqrt{\sum_i v_i^2}\]

v = np.array([3.0, -4.0, 0.0])

l1 = np.sum(np.abs(v))                  # = 7.0
l2 = np.sqrt(np.sum(v ** 2))            # = 5.0
l2_np = np.linalg.norm(v)               # = 5.0 (more convenient)

print(f"L1 norm: {l1}")  # 7.0
print(f"L2 norm: {l2}")  # 5.0

In AI

Norm	Application
L2	Weight decay / L2 regularization: `loss += λ·‖W‖²`
L1	Lasso regularization: encourages sparse weights
Frobenius	Measures the “size” of a matrix: `‖M‖_F = √(Σ m²ᵢⱼ)`
Gradient norm	Gradient clipping: if `‖∇‖ > threshold` → scale down
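
The last row of the table can be made concrete. Below is a minimal sketch of gradient clipping by global L2 norm, assuming the gradients are given as a plain list of NumPy arrays. The helper `clip_by_global_norm` is a hypothetical name written for this example, not a library function.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients by one shared factor so their joint L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))   # never scale up
    return [g * scale for g in grads], total_norm

# Joint norm = sqrt(3² + 4² + 12²) = 13
grads = [np.array([3.0, 4.0]), np.array([[12.0]])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))

print(f"before: {norm_before:.1f}, after: {norm_after:.4f}")   # before: 13.0, after: 1.0000
```

Because every array is multiplied by the same factor, the direction of the overall gradient is unchanged; only its length is capped.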

3. Eigenvalues & Eigenvectors

Intuition

Multiplying a vector by a matrix $A$ usually both rotates and scales it. For an eigenvector $\mathbf{v}$, the multiplication only scales it, without rotating it:

\[A\mathbf{v} = \lambda \mathbf{v}\]

$\lambda$ is the eigenvalue, the scaling factor (a negative $\lambda$ flips the vector along the same line). $\mathbf{v}$ is the eigenvector, the invariant direction.

Most vectors:            Eigenvector:
    A                        A
  ↗ → ↗ (rotate + scale)   → → → (scale only, no rotation)
  v    Av                   v     λv
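
The picture can be checked numerically. The sketch below uses a small symmetric matrix chosen for this example (so its eigenvalues are real) and verifies $A\mathbf{v} = \lambda\mathbf{v}$ for each eigenpair, then shows that a generic vector does change direction.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])                       # symmetric example matrix

# eigh is meant for symmetric matrices; eigenvalues come back in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(A)    # columns of eigenvectors are the v's
print(eigenvalues)                               # approximately [1. 3.]

for lam, v in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(A @ v, lam * v)           # A v = λ v: scaled, not rotated

# A generic vector is rotated as well as scaled
u = np.array([1.0, 0.0])
Au = A @ u                                       # [2, 1]
cos_angle = (u @ Au) / (np.linalg.norm(u) * np.linalg.norm(Au))
print(f"cos(angle between u and Au) = {cos_angle:.3f}")   # 0.894, so the direction changed
```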

Application: dimensionality reduction with PCA

from sklearn.decomposition import PCA
import numpy as np

# 100 data points, 50 dimensions
X = np.random.randn(100, 50)

# Conceptually, PCA is the eigendecomposition of the covariance matrix
# (scikit-learn computes it through an SVD of the centered data)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(f"Original shape : {X.shape}")      # (100, 50)
print(f"Reduced shape  : {X_2d.shape}")   # (100, 2)
print(f"Variance kept  : {pca.explained_variance_ratio_.sum():.1%}")

# Behind the scenes:
cov = np.cov(X.T)                         # covariance matrix (50×50)
# eigh is the right routine for a symmetric matrix: real eigenvalues, ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)
# PCA = project the centered data onto the eigenvectors with the largest eigenvalues
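
To make the link between PCA and SVD concrete, here is a sketch that computes the principal components two ways, through the covariance eigendecomposition and through an SVD of the centered data, and checks that they agree. The data is synthetic and the variable names are illustrative; it avoids scikit-learn so that every step is visible.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5)) @ rng.standard_normal((5, 5))   # correlated features
Xc = X - X.mean(axis=0)                      # PCA works on centered data
n = len(Xc)

# Route 1: eigendecomposition of the covariance matrix
cov = Xc.T @ Xc / (n - 1)
eigvals, eigvecs = np.linalg.eigh(cov)       # ascending order
order = np.argsort(eigvals)[::-1]            # re-sort: largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Route 2: SVD of the centered data matrix
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_vars = S ** 2 / (n - 1)                  # singular values → variances

k = 2
proj_eig = Xc @ eigvecs[:, :k]               # (100, 2)
proj_svd = Xc @ Vt[:k].T                     # (100, 2)

print(np.allclose(eigvals, svd_vars))                    # True: same variances
print(np.allclose(np.abs(proj_eig), np.abs(proj_svd)))   # True: same projection up to sign
```

The absolute values are compared because an eigenvector is only defined up to sign, so the two routes may flip an axis.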

4. SVD: arguably the most powerful matrix decomposition

Formula

Every real matrix $M$ can be factored as:

\[M = U \Sigma V^T\]

$U$: orthogonal matrix (left singular vectors), the “output directions”
$\Sigma$: diagonal matrix (singular values), the “importance” of each direction
$V^T$: orthogonal matrix (right singular vectors), the “input directions”

M (m×n) = U (m×m) × Σ (m×n) × Vᵀ (n×n)

Σ = diag(σ₁, σ₂, ..., σₖ)   with σ₁ ≥ σ₂ ≥ ... ≥ σₖ ≥ 0 and k = min(m, n)
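
The properties listed above are easy to verify on a small random matrix. The sketch below checks the reconstruction, the orthogonality of $U$ and $V$, the ordering of the singular values, and one fact that makes truncation useful: the Frobenius error of a rank-k truncation equals the norm of the discarded singular values. The 6×4 size is an arbitrary choice for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 4))

U, S, Vt = np.linalg.svd(M, full_matrices=True)    # U: (6,6), S: (4,), Vt: (4,4)

Sigma = np.zeros(M.shape)                          # (6,4) rectangular "diagonal" matrix
Sigma[:len(S), :len(S)] = np.diag(S)

print(np.allclose(U @ Sigma @ Vt, M))              # True: M = U Σ Vᵀ
print(np.allclose(U.T @ U, np.eye(6)))             # True: U is orthogonal
print(np.allclose(Vt @ Vt.T, np.eye(4)))           # True: V is orthogonal
print(bool(np.all(np.diff(S) <= 0) and np.all(S >= 0)))   # True: sorted, non-negative

# Rank-k truncation: error = sqrt(sum of the squared discarded singular values)
k = 2
M_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]
err = np.linalg.norm(M - M_k)                      # Frobenius norm for 2-D input
print(np.isclose(err, np.sqrt(np.sum(S[k:] ** 2))))   # True
```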

Application: image compression + LoRA fine-tuning

import numpy as np

# === Image compression with SVD ===
# Simulate a 256×256 grayscale image (random noise, used here only for the shapes)
img = np.random.randint(0, 256, (256, 256)).astype(float)

U, sigma, Vt = np.linalg.svd(img, full_matrices=False)

def reconstruct(U, sigma, Vt, k):
    """Keep the first k singular values (the most important ones)."""
    return U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]

for k in [5, 20, 50, 100]:
    img_k = reconstruct(U, sigma, Vt, k)
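    compression = k * (256 + 256 + 1) / (256 * 256)   # values stored / original pixels
    print(f"k={k:3d}: compression ratio={compression:.1%}")

# k=  5: compression ratio=3.9%   ← heavy compression, low quality
# k= 20: compression ratio=15.7%
# k= 50: compression ratio=39.1%
# k=100: compression ratio=78.3%  ← little saving left
# Note: a real photo is close to low-rank, so a moderate k looks close to the original;
# random noise has no such structure and reconstructs poorly at any small k.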

# === LoRA (Low-Rank Adaptation) builds on the same low-rank idea ===
# Instead of fine-tuning W (d×d = d² params):
# fine-tune ΔW = A @ B  with A:(d×r), B:(r×d), r << d
# r=8: only 8*(d+d) = 16d params instead of d² params!

d, r = 768, 8
A = np.random.randn(d, r) * 0.01   # (768, 8)
B = np.zeros((r, d))               # (8, 768), initialized to 0 so ΔW starts at 0
delta_W = A @ B                     # (768, 768) — low-rank update

params_full = d * d
params_lora = d * r + r * d
print(f"\nLoRA: {params_lora:,} vs Full: {params_full:,} params")
print(f"Reduction: {params_lora/params_full:.1%}")
# LoRA: 12,288 vs Full: 589,824 params
# Reduction: 2.1%  ← about 98% fewer parameters to train!
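
The parameter count is only half of the story: the low-rank update also never has to be materialized as a full d×d matrix. The sketch below, which keeps this lesson's `A @ B` naming, checks that applying the two skinny factors one after the other gives the same output as merging them into the weight. The frozen weight `W` holds random stand-in values, and `B` is non-zero here only so the comparison is not trivial.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, d, r = 4, 768, 8

x = rng.standard_normal((batch, d))
W = rng.standard_normal((d, d)) * 0.02     # frozen "pretrained" weight (stand-in values)
A = rng.standard_normal((d, r)) * 0.01     # trainable factor, (768, 8)
B = rng.standard_normal((r, d)) * 0.01     # trainable factor, (8, 768)

# Merged path: builds the full (d, d) update first
y_merged = x @ (W + A @ B)

# Low-rank path: two skinny matmuls, the (d, d) update is never formed
y_lora = x @ W + (x @ A) @ B

print(np.allclose(y_merged, y_lora))       # True
print(np.linalg.matrix_rank(A @ B))        # 8: the update has rank at most r
```

After training, the two factors can be merged into `W` once, so inference costs the same as the original layer.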

5. Multi-head Attention: almost pure linear algebra

This is the core computation of the Transformer. Apart from the softmax, it consists entirely of matrix multiplications:

\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]

Q: (seq_len, d_k)   Query: "what am I looking for?"
K: (seq_len, d_k)   Key:   "what information do I hold?"
V: (seq_len, d_v)   Value: "the actual information"

QKᵀ: (seq_len, seq_len)  — attention scores

import torch
import torch.nn.functional as F
import math

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Standard attention (Vaswani et al., 2017)
    Q, K, V: (batch, heads, seq_len, d_k)
    """
    d_k = Q.shape[-1]
    
    # 1. Compute attention scores
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    # shape: (batch, heads, seq_len, seq_len)
    
    # 2. Mask (optional, e.g. for the decoder)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    
    # 3. Softmax → attention weights
    weights = F.softmax(scores, dim=-1)
    # shape: (batch, heads, seq_len, seq_len)
    
    # 4. Weighted sum of values
    output = weights @ V
    # shape: (batch, heads, seq_len, d_v)
    
    return output, weights

# Demo: 2 sentences, 4 heads, 8 tokens, d_k=64
batch, heads, seq, d_k = 2, 4, 8, 64
Q = torch.randn(batch, heads, seq, d_k)
K = torch.randn(batch, heads, seq, d_k)
V = torch.randn(batch, heads, seq, d_k)

out, attn = scaled_dot_product_attention(Q, K, V)
print(f"Output shape       : {out.shape}")   # (2, 4, 8, 64)
print(f"Attention weights  : {attn.shape}")  # (2, 4, 8, 8)
print(f"Each row sums to 1 : {attn[0,0,0].sum():.4f}")  # 1.0000 ✓
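
The demo above never exercises the `mask` argument. The sketch below re-implements the same computation in NumPy for a single head (a simplification made for this example) and applies a causal mask, using the same convention as the function above: positions where the mask is 0 are set to negative infinity before the softmax.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)          # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]
    scores = Q @ np.swapaxes(K, -2, -1) / np.sqrt(d_k)
    if mask is not None:
        scores = np.where(mask == 0, -np.inf, scores)   # blocked positions get weight 0
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
seq, d_k = 5, 16
Q, K, V = (rng.standard_normal((seq, d_k)) for _ in range(3))

causal_mask = np.tril(np.ones((seq, seq)))           # 1 = may attend, 0 = blocked
out, w = attention(Q, K, V, causal_mask)

print(np.allclose(np.triu(w, k=1), 0.0))             # True: no weight on future tokens
print(np.allclose(w.sum(axis=-1), 1.0))              # True: each row is still a distribution
print(np.allclose(out[0], V[0]))                     # True: token 0 can only see itself
```

The lower-triangular mask keeps the diagonal, so every row has at least one allowed position and the softmax never divides by zero.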

6. Embedding: turning tokens into vectors

"cat" → lookup index 2847 → row 2847 of E → [0.2, -0.8, 0.1, ..., 0.5]
                                              └─ 512-dimensional vector ─┘

import torch
import torch.nn as nn

# Embedding = a matrix E of shape (vocab_size, d_model)
vocab_size, d_model = 50000, 512
embedding = nn.Embedding(vocab_size, d_model)

# Token IDs for the sentence "Hello world"
token_ids = torch.tensor([15496, 995])  # assume these are the GPT-2 IDs

# Lookup = indexing into the matrix E
embeds = embedding(token_ids)
print(f"Token IDs shape : {token_ids.shape}")  # (2,)
print(f"Embedding shape : {embeds.shape}")     # (2, 512)

# Why does an embedding work?
# It is simply: embeds = E[token_ids]  (matrix row indexing)
# → The backward pass updates only the rows of E that belong to tokens that appeared
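
Row indexing is itself a matrix multiplication in disguise: selecting row i of E is the same as multiplying a one-hot vector by E. The sketch below checks this on a toy vocabulary (the sizes and token IDs are made up) and shows why only the rows of tokens that appeared receive a gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 10, 4
E = rng.standard_normal((vocab_size, d_model))

token_ids = np.array([3, 7, 3])                  # token 3 appears twice

lookup = E[token_ids]                            # (3, 4): plain row indexing

one_hot = np.zeros((len(token_ids), vocab_size))
one_hot[np.arange(len(token_ids)), token_ids] = 1.0
via_matmul = one_hot @ E                         # (3, 10) @ (10, 4) → (3, 4)

print(np.allclose(lookup, via_matmul))           # True

# Gradient of sum(lookup) with respect to E is one_hot.T @ ones:
# non-zero only in the rows of tokens that appeared
grad_E = one_hot.T @ np.ones((len(token_ids), d_model))
print(np.nonzero(grad_E.sum(axis=1))[0])         # [3 7]
```

Frameworks use indexing rather than the one-hot product because it avoids building a (tokens × vocab_size) matrix, but the result is the same.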

7. Common Errors & Shape Debugging

import torch

# ❌ Common error 1: missing or misplaced transpose
A = torch.randn(3, 4)
B = torch.randn(3, 4)

# A @ B → ERROR! (3,4) × (3,4) is not valid
# A @ B.T → OK!  (3,4) × (4,3) = (3,3) ✓

# ❌ Common error 2: implicit broadcasting hiding a bug
x = torch.randn(32, 10)  # (batch=32, classes=10)
b = torch.randn(10)      # bias

# x + b → (32, 10) + (10,) → broadcast → (32, 10) ✓ (correct)
# But if b = torch.randn(32) →
# x + b → RuntimeError (trailing dims 10 and 32 do not match); use b[:, None] to add per row
# The dangerous case is the silent one: (32, 1) + (32,) → (32, 32), no error raised

# ✅ Good habit: always assert shapes
def matmul_safe(A, B):
    assert A.shape[-1] == B.shape[-2], \
        f"Shape mismatch: {A.shape} @ {B.shape}"
    return A @ B

# ✅ Debug shapes step by step
def forward_debug(x):
    print(f"Input:    {x.shape}")
    
    W1 = torch.randn(x.shape[-1], 128)
    z = x @ W1
    print(f"After W1: {z.shape}")
    
    return z
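
A broadcasting mistake is most harmful when it raises no error at all. The sketch below shows a common instance in NumPy (PyTorch follows the same broadcasting rules): subtracting a `(3,)` target from a `(3, 1)` prediction silently produces a `(3, 3)` matrix, and the resulting loss is wrong. The arrays are toy values chosen so that the correct loss is exactly zero.

```python
import numpy as np

preds = np.array([[1.0], [2.0], [3.0]])      # (3, 1): model output with a trailing axis
targets = np.array([1.0, 2.0, 3.0])          # (3,)

diff_wrong = preds - targets                  # broadcasts to (3, 3) without any error
diff_right = preds.squeeze(-1) - targets      # (3,)

print(diff_wrong.shape, diff_right.shape)     # (3, 3) (3,)

mse_wrong = (diff_wrong ** 2).mean()          # compares every prediction with every target
mse_right = (diff_right ** 2).mean()
print(mse_wrong > 0, mse_right == 0)          # True True: perfect predictions, yet non-zero "loss"

# A shape assert right before the subtraction would have caught it
assert diff_right.shape == targets.shape
```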

8. Checklist: “Do I understand linear algebra for AI/ML yet?”

Can explain why the shape of a matmul is (m,k) × (k,n) = (m,n)
Know when to use .T vs .transpose(-2,-1) in PyTorch
Understand what PCA does and how SVD relates to it
Can read Attention code from the shapes of Q, K, V
Can debug a shape error by printing shapes step by step
Can explain how LoRA uses low-rank decomposition
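
For the second item, a short NumPy sketch of the difference, on a small made-up batched array. In NumPy, `.T` reverses every axis, while `np.swapaxes(x, -2, -1)` swaps only the last two and leaves the batch axis alone. PyTorch's `.transpose(-2, -1)` plays the same role as `swapaxes` here, and to my knowledge PyTorch discourages `.T` on tensors that are not 2-D, so treat that part as something to confirm in its documentation.

```python
import numpy as np

x = np.zeros((4, 6, 8))                       # (batch, seq_len, d_model)

print(x.T.shape)                              # (8, 6, 4): every axis reversed
print(np.swapaxes(x, -2, -1).shape)           # (4, 8, 6): batch axis untouched

# Batched attention-style scores need the second form
scores = x @ np.swapaxes(x, -2, -1)           # (4, 6, 8) @ (4, 8, 6) → (4, 6, 6)
print(scores.shape)                           # (4, 6, 6)
```

For a plain 2-D matrix the two forms give the same result, which is why the mistake usually only shows up once a batch dimension is added.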

📚 Resources

Resource	Type	Notes
Immersive Linear Algebra	Interactive	Excellent visuals, free
3Blue1Brown: Essence of Linear Algebra	Video	15 episodes, extremely intuitive
Gilbert Strang: MIT 18.06	Lecture	A classic, free on YouTube
einops	Library	Clearer tensor manipulation

🔗 Series

← Lesson 1: Calculus, the Brain of the Learning Process
→ Lesson 3: Probability, When AI Is Uncertain