Answers & Explanations — Final Exam Deep Learning 2022-2023 (Exam code 112)

PART I — MULTIPLE CHOICE

#	Question	Answer	Explanation
01	Which of the following is FALSE about Deep Learning and Machine Learning algorithms?	A	DL is less interpretable than ML, not the other way around
02	Which of the following is a type of neural network?	A	Autoencoder, Capsule NN, and CNN are all types of neural network
03	Which of the following is FALSE about Neural Networks?	D	All the statements are true, so none of them is FALSE
04	Which of the following is FALSE about step activation function?	A	The step function is non-linear, not linear
05	Which of the following is FALSE about activation functions?	C	Activation functions do not directly reduce overfitting; Dropout and regularization do that
06	Output of step (threshold) activation function ranges from:	C	The step function outputs only 0 or 1
07	Which of the following is FALSE about sigmoid and tanh activation function?	D	It is the other way around: sigmoid → (0, 1); tanh → (-1, 1)
08	Which of the following is FALSE about Dropout?	A	Dropout should not be used at the output layer
09	Which of the following is FALSE about Dropout?	A	The dropout rate is a hyperparameter, not a learnable parameter
10	Which of the following is TRUE about Dropout?	B	Dropout is applied only during training and is turned off at inference (test) time
11	Which of the following is TRUE about local and global minima?	D	A local minimum is sometimes as good as the global minimum in deep learning
12	Which of the following is a way to avoid local minima?	A	All 3 methods (momentum, a higher learning rate, added noise) help escape local minima
13	Which of the following SGD variants is NOT based on adaptive learning?	B	Nesterov is momentum-based, not adaptive-learning-based
14	Which of the following is TRUE about Weight Initialization?	D	Bad weight initialization → the model fails to converge; A and B have the directions swapped (large weights → exploding, small weights → vanishing)
15	Which of the following is TRUE about Momentum?	A	Momentum helps converge faster, keeps updates in a consistent direction, and helps avoid local minima
16	Which of the following is FALSE about Pooling Layer in CNN?	B	Feature extraction is the job of the Conv layer, not the Pooling layer
17	Which of the following is a valid reason for not using fully connected networks for image recognition?	C	Fully connected networks have more parameters, overfit more easily, and are less efficient than CNNs on images
18	Which of the following is FALSE about Padding in CNN?	D	Valid padding (no padding) → the output is smaller than the input, i.e. the dimensions do shrink
19	Which of the following is FALSE about Kernels in CNN?	D	All the statements about kernels are true
20	Which of the following is NOT a hyper-parameter in CNN?	D	Code size is a hyperparameter of an Autoencoder, not of a CNN
21	Which of the following is FALSE about LSTM?	D	LSTM addresses vanishing gradients, not exploding gradients (exploding → gradient clipping)
22	Which of the following is NOT an application of RNN?	C	Image compression is an Autoencoder application, not an RNN application
23	Which of the following is NOT an application of RNN?	A	Anomaly detection usually uses an Autoencoder or a statistical model; RNNs suit time series
24	How many parts can the GAN be divided into?	B	A GAN has 2 parts: the Generator and the Discriminator
25	Which word is used to explain how data is generated using probabilistic models?	A	“Generative” describes how data is generated by a probabilistic model
26	Which of the following is not an example of a generative model?	B	A discriminator model is a discriminative model, not a generative one
27	What is the standard form of YOLO?	D	YOLO = You Only Look Once
28	Which of the following are the components of object recognition system?	E	An object recognition system includes all 4: model database, hypothesizer, feature detector, hypothesis verifier
29	The face recognition system used in:	C	Face recognition is used in both biometric identification and human-computer interaction (HCI)
30	Which of the following is TRUE about types of Vectorization in NLP?	C	The N-gram statement is correct; count vectorization does not account for term weighting (TF-IDF does) → only C is fully correct
31	Which of the following is an application of NLP?	D	Google Assistant, chatbots, and Google Translate are all NLP applications
32	Which of the following is TRUE about NLP?	B	All the statements are true
33	Which of the following techniques can be used to reduce model overfitting?	C	Data augmentation, Dropout, and Batch Norm reduce overfitting; replacing SGD with Adam does not directly reduce overfitting
34	Which of the following is true about dropout?	D	Only (a) is true: dropout induces sparsity. (b) is false: inverted dropout rescales activations during training, not at test time. (c) is false: a high keep probability means less regularization
35	GAN for reptile images — which could be indicators of mode collapse?	A	Mode collapse: the generator produces only one kind of image (komodo) and the loss oscillates
36	Which one of the following is not a pre-processing technique in NLP?	D	Sentiment analysis is a downstream task, not preprocessing
37	GANs for reptile images — which could be indicators of mode collapse?	A	Same as Q35: only (a) and (b) indicate mode collapse
38	Which of the following propositions are TRUE about a CONV layer?	C	(a) is true: the number of weights depends on the input depth; (b) is true: the number of biases equals the number of filters. Stride and padding do not affect the parameter count

PART II — ESSAY QUESTIONS

Q39 — Multi-label Classification

You have been tasked to build a classifier that takes in an image of a movie poster and classifies it into one of four genres: comedy, horror, action, and romance. Your model has 100% accuracy on training set and 96% on validation set. You now decide to expand the model to posters belonging to multiple genres. Propose a way to label new posters where each example can simultaneously belong to multiple classes? To avoid extra work, you retrain with the same architecture (softmax + cross-entropy). Explain why this is problematic?

Labeling scheme: use multi-hot encoding, where each poster gets a binary vector with one 0/1 entry per genre. For example, with the order (comedy, horror, action, romance), [1,0,1,0] = comedy + action.

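Why softmax + cross-entropy is problematic:

Softmax normalizes the outputs into a probability distribution that sums to 1, so the classes compete with each other and the model is pushed toward a single class.
Standard (categorical) cross-entropy assumes exactly one correct class per example, so it is only suited to single-label targets.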

Correct approach: replace them with a per-output sigmoid + binary cross-entropy loss, so each class is scored independently and several labels can be 1 at the same time.
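A minimal sketch of the difference, using only the Python standard library. The logits and the genre order are made-up values for illustration, not outputs of the model described in the question.

```python
import math

GENRES = ["comedy", "horror", "action", "romance"]  # assumed label order


def softmax(logits):
    """Turn logits into one distribution over classes (sums to 1)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(probs, targets, eps=1e-12):
    """Mean of the per-class binary cross-entropy terms."""
    terms = [
        -(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps))
        for p, t in zip(probs, targets)
    ]
    return sum(terms) / len(terms)


logits = [3.0, -2.0, 3.0, -2.0]   # made-up scores for one poster
target = [1, 0, 1, 0]             # multi-hot: comedy + action

softmax_probs = softmax(logits)                # the two true classes split the mass
sigmoid_probs = [sigmoid(z) for z in logits]   # each class is scored independently
predicted = [g for g, p in zip(GENRES, sigmoid_probs) if p >= 0.5]
loss = binary_cross_entropy(sigmoid_probs, target)

print("softmax:", [round(p, 3) for p in softmax_probs])
print("sigmoid:", [round(p, 3) for p in sigmoid_probs])
print("predicted genres:", predicted, "BCE loss:", round(loss, 4))
```

With softmax, two equally likely genres can never both exceed 0.5, while the sigmoid outputs can both be close to 1.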

Q40 — GD vs SGD vs Mini-batch GD

Explain the difference between gradient descent, stochastic gradient descent, and mini-batch gradient descent.

Variant	Gradient computed on	Pros	Cons
GD (Batch)	The entire dataset	Stable, smooth convergence	Slow per update, memory-hungry
SGD	1 sample	Fast updates, noise can help escape local minima	Noisy, strong oscillation
Mini-batch GD	A small batch (e.g., 32–256)	Balances speed and stability	Batch size needs tuning

In practice, mini-batch gradient descent is the de facto standard.
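The three variants differ only in how many samples feed each gradient estimate. The sketch below, under the assumption of a toy one-parameter model y = w * x with noise-free data, uses a single hypothetical function `train_linear` whose `batch_size` argument selects the variant.

```python
import random


def train_linear(data, batch_size, lr=0.1, epochs=200, seed=0):
    """Fit y = w * x by minimizing mean squared error.

    batch_size == len(data) -> batch gradient descent
    batch_size == 1         -> stochastic gradient descent
    anything in between     -> mini-batch gradient descent
    """
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        shuffled = data[:]
        rng.shuffle(shuffled)
        for start in range(0, len(shuffled), batch_size):
            batch = shuffled[start:start + batch_size]
            # Gradient of mean((w*x - y)^2) over the current batch only.
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad
    return w


# Toy data generated from y = 3x (noise-free, so all variants agree).
data = [(x / 10, 3 * x / 10) for x in range(1, 11)]

w_batch = train_linear(data, batch_size=len(data))   # 1 update per epoch
w_sgd = train_linear(data, batch_size=1)             # 10 updates per epoch
w_mini = train_linear(data, batch_size=4)            # 3 updates per epoch

print(w_batch, w_sgd, w_mini)
```

On real, noisy data the SGD trajectory would fluctuate around the minimum instead of settling exactly, which is the trade-off the table describes.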

Q41 — CNN Layer Dimensions & Parameters

Consider the CNN defined by the layers below. Fill in the shape of the output volume and the number of parameters at each layer (format: H × W × C). Assume padding 1, stride 1 unless specified.

Rule: Output size = (Input - Kernel + 2*Padding) / Stride + 1

CONV parameter count: (K × K × C_in + 1) × C_out

Layer	Activation Volume	# Parameters
Input	32 × 32 × 3	0
CONV3-8 (pad=1, stride=1)	32 × 32 × 8	(3×3×3+1)×8 = 224
Leaky ReLU	32 × 32 × 8	0
POOL-2 (stride=2)	16 × 16 × 8	0
BATCHNORM	16 × 16 × 8	2×8 = 16 (γ and β)
CONV3-16 (pad=1, stride=1)	16 × 16 × 16	(3×3×8+1)×16 = 1,168
Leaky ReLU	16 × 16 × 16	0
POOL-2 (stride=2)	8 × 8 × 16	0
FLATTEN	1,024	0
FC-10	10	1,024×10+10 = 10,250
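The table can be re-derived mechanically. This sketch assumes, as the table does, that POOL-2 uses a 2×2 window with stride 2 and no padding, and that BATCHNORM counts only the learnable γ and β per channel. The helper names `conv_out` and `conv_params` are illustrative.

```python
def conv_out(size, kernel, padding, stride):
    """Spatial output size: (size - kernel + 2*padding) // stride + 1."""
    return (size - kernel + 2 * padding) // stride + 1


def conv_params(kernel, c_in, c_out):
    """(K*K*C_in weights + 1 bias) per filter, times C_out filters."""
    return (kernel * kernel * c_in + 1) * c_out


# (layer name, kind, output channels for conv layers)
layers = [
    ("CONV3-8", "conv", 8),
    ("Leaky ReLU", "act", None),
    ("POOL-2", "pool", None),
    ("BATCHNORM", "bn", None),
    ("CONV3-16", "conv", 16),
    ("Leaky ReLU", "act", None),
    ("POOL-2", "pool", None),
]

h, w, c = 32, 32, 3
rows = [("Input", (h, w, c), 0)]
for name, kind, c_out in layers:
    params = 0
    if kind == "conv":      # 3x3 kernel, padding 1, stride 1
        params = conv_params(3, c, c_out)
        h, w, c = conv_out(h, 3, 1, 1), conv_out(w, 3, 1, 1), c_out
    elif kind == "pool":    # 2x2 window, stride 2, no padding (assumed)
        h, w = conv_out(h, 2, 0, 2), conv_out(w, 2, 0, 2)
    elif kind == "bn":      # one gamma and one beta per channel
        params = 2 * c
    rows.append((name, (h, w, c), params))   # "act" changes neither shape nor params

flat = h * w * c
rows.append(("FLATTEN", (flat,), 0))
rows.append(("FC-10", (10,), flat * 10 + 10))   # weights + biases

for name, shape, params in rows:
    print(f"{name:12s} {str(shape):14s} {params}")

total_params = sum(p for _, _, p in rows)
print("total:", total_params)
```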

Q42 — Vanishing Gradient with Sigmoid + SGD

Give a method to fight vanishing gradient in fully-connected neural networks. Assume we are using a network with Sigmoid activations trained using SGD.

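Possible solutions:

Change the activation function: use ReLU/Leaky ReLU instead of sigmoid (sigmoid saturates for large positive or negative inputs → gradient ≈ 0)
Batch Normalization: normalizes the inputs to each layer to roughly zero mean and unit variance → keeps them out of the saturated regions
Better weight initialization: Xavier/Glorot init (suited to sigmoid/tanh) or He init (suited to ReLU) instead of naive random init
Skip connections (ResNet-style): the gradient flows through the shortcut path instead of through many sigmoids in a row
Note: gradient clipping caps the gradient norm and is a remedy for exploding gradients; it does not help with vanishing gradients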
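To see why sigmoid in particular causes the problem, recall that its derivative never exceeds 0.25, so the chain rule multiplies many small factors together. The sketch below is a deliberately simplified illustration: it fixes all weights at 1 and evaluates every layer at the same pre-activation, which real networks do not do.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)   # peaks at 0.25 when z == 0


def relu_grad(z):
    return 1.0 if z > 0 else 0.0


def chained_grad(local_grad, z, depth):
    """Product of `depth` identical local derivatives (weights fixed at 1)."""
    return local_grad(z) ** depth


for depth in (1, 5, 10, 20):
    best_case_sigmoid = chained_grad(sigmoid_grad, 0.0, depth)  # z = 0 is the best case
    active_relu = chained_grad(relu_grad, 1.0, depth)
    print(depth, best_case_sigmoid, active_relu)
```

Even in the best case for sigmoid, the factor after 10 layers is 0.25^10, under one millionth, whereas an active ReLU path passes the gradient through unchanged.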

Q43 — How to Train a Deep Network

How do we train the deep network?

Standard procedure:

Forward pass: compute the output layer by layer
Compute loss: compare the output with the ground truth (cross-entropy, MSE…)
Backward pass (backpropagation): compute the gradient of the loss with respect to each weight using the chain rule
Weight update: apply an optimizer (SGD, Adam…); for plain SGD, W = W - lr × ∇W
Repeat over epochs until convergence

Key factors: learning rate schedule, proper weight initialization, regularization (Dropout, BN), early stopping.
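The loop can be sketched on the smallest possible "network": a single sigmoid neuron trained with full-batch gradient descent on the logical AND function. The dataset, learning rate, and epoch count are assumptions chosen for illustration; a real deep network would rely on a framework's automatic differentiation rather than hand-written gradients.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Toy dataset: the logical AND of two inputs.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

w = [0.0, 0.0]
b = 0.0
lr = 1.0
losses = []

for epoch in range(2000):
    grad_w = [0.0, 0.0]
    grad_b = 0.0
    loss = 0.0
    for (x1, x2), y in data:
        # 1. Forward pass
        p = sigmoid(w[0] * x1 + w[1] * x2 + b)
        # 2. Loss (binary cross-entropy)
        loss += -(y * math.log(p) + (1 - y) * math.log(1 - p))
        # 3. Backward pass: for sigmoid + BCE, dL/dz simplifies to (p - y)
        dz = p - y
        grad_w[0] += dz * x1
        grad_w[1] += dz * x2
        grad_b += dz
    n = len(data)
    # 4. Weight update: W = W - lr * gradient (averaged over the batch)
    w = [w[0] - lr * grad_w[0] / n, w[1] - lr * grad_w[1] / n]
    b -= lr * grad_b / n
    losses.append(loss / n)

predictions = [round(sigmoid(w[0] * x1 + w[1] * x2 + b)) for (x1, x2), _ in data]
print("first loss:", losses[0], "last loss:", losses[-1])
print("predictions:", predictions)
```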

Q44 — Sigmoid vs Tanh

Explain the difference between the sigmoid and tanh activation function.

	Sigmoid	Tanh
Output range	(0, 1)	(-1, 1)
Zero-centered	❌	✅
Vanishing gradient	Yes (saturates at both ends; max derivative 0.25)	Yes, but milder (max derivative 1)
Use	Output layer (binary classification)	Hidden layers (usually better than sigmoid)

Tanh is a rescaled and shifted version of sigmoid: tanh(x) = 2·sigmoid(2x) - 1
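The identity tanh(x) = 2·sigmoid(2x) - 1 is easy to check numerically. The sketch below compares it against `math.tanh` at a few arbitrary sample points.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tanh_via_sigmoid(x):
    """tanh(x) written as a rescaled, shifted sigmoid."""
    return 2.0 * sigmoid(2.0 * x) - 1.0


samples = (-3.0, -1.0, 0.0, 0.5, 2.0)
for x in samples:
    print(f"x={x:+.1f}  sigmoid={sigmoid(x):.4f}  tanh={math.tanh(x):+.4f}  "
          f"2*sigmoid(2x)-1={tanh_via_sigmoid(x):+.4f}")

max_error = max(abs(math.tanh(x) - tanh_via_sigmoid(x)) for x in samples)
```

The printout also shows the zero-centering difference: sigmoid(0) is 0.5, while tanh(0) is 0.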

Q45 — Jacobian Matrix

What is the Jacobian Matrix?

The Jacobian matrix is the matrix of all first-order partial derivatives of a vector-valued function f: ℝⁿ → ℝᵐ. It has m rows and n columns:

J[i,j] = ∂f_i / ∂x_j

In deep learning, the Jacobian appears in backpropagation when computing gradients through layers with multi-dimensional outputs (e.g., a softmax layer).
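For softmax, the Jacobian has the closed form J[i,j] = s_i (δ_ij - s_j), where s = softmax(x). The sketch below checks that formula against a finite-difference approximation of ∂f_i / ∂x_j. The input vector and step size are arbitrary choices.

```python
import math


def softmax(x):
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_jacobian(x):
    """Analytic Jacobian: J[i][j] = s_i * (delta_ij - s_j)."""
    s = softmax(x)
    n = len(s)
    return [[s[i] * ((1.0 if i == j else 0.0) - s[j]) for j in range(n)]
            for i in range(n)]


def numeric_jacobian(f, x, h=1e-6):
    """Central differences: J[i][j] ~ (f_i(x + h*e_j) - f_i(x - h*e_j)) / (2h)."""
    m = len(f(x))
    jac = [[0.0] * len(x) for _ in range(m)]
    for j in range(len(x)):
        plus, minus = x[:], x[:]
        plus[j] += h
        minus[j] -= h
        f_plus, f_minus = f(plus), f(minus)
        for i in range(m):
            jac[i][j] = (f_plus[i] - f_minus[i]) / (2 * h)
    return jac


x = [1.0, 2.0, 0.5]
J_exact = softmax_jacobian(x)
J_approx = numeric_jacobian(softmax, x)
max_diff = max(abs(a - b)
               for row_a, row_b in zip(J_exact, J_approx)
               for a, b in zip(row_a, row_b))
print("max |analytic - numeric| =", max_diff)
```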

Q46 — Generative Adversarial Network (GAN)

Explain Generative Adversarial Network.

A GAN consists of 2 networks trained together, in competition with each other:

Generator (G): takes a noise vector z → produces fake data
Discriminator (D): distinguishes real data from fake data → outputs a probability in [0, 1] that its input is real

Objectives:

D tries to maximize its ability to classify real vs fake correctly
G tries to fool D (minimize D's ability to tell them apart)

This is a minimax game:

min_G max_D V(D, G) = E_{x~p_data}[log D(x)] + E_{z~p_z}[log(1 - D(G(z)))]

Training ideally converges when G generates data that cannot be distinguished from real data (D outputs ≈ 0.5).
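The minimax objective can be estimated from discriminator outputs alone. In this sketch the probabilities are made-up numbers rather than outputs of a trained model, and `gan_value` is an illustrative helper name.

```python
import math


def gan_value(d_real, d_fake):
    """Sample estimate of E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs on real samples x
    d_fake: discriminator outputs on generated samples G(z)
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# Made-up discriminator outputs, for illustration only.
strong_d = gan_value(d_real=[0.95, 0.90, 0.99], d_fake=[0.05, 0.10, 0.02])
fooled_d = gan_value(d_real=[0.5, 0.5, 0.5], d_fake=[0.5, 0.5, 0.5])

print("D separates real from fake:", strong_d)   # close to 0, D's best outcome
print("D outputs 0.5 everywhere:  ", fooled_d)   # equals -log(4)
```

When D can do no better than output 0.5 everywhere, the objective equals -log 4, which is the value at the ideal equilibrium.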

Q47 — LSTM vs RNN

How LSTM differ from the RNN?

	RNN	LSTM
Memory	Short-term only	Long-term + short-term
Vanishing gradient	Severe	Largely mitigated by the cell state
Mechanism	Simple hidden state	3 gates: Input, Forget, Output
Applications	Short sequences	Long sequences (machine translation, speech)

LSTM adds a cell state and gates that control the flow of information, allowing it to remember or forget selectively.
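A single LSTM time step can be sketched with scalars in place of vectors and matrices. All parameter values below are hypothetical, chosen so that the forget gate stays open and the input gate stays shut; this demonstrates how the cell state can carry a value across many steps.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One LSTM time step with scalar input, hidden state, and cell state.

    `p` maps each gate name to hypothetical (w_x, w_h, bias) values.
    """
    def pre(name):
        w_x, w_h, bias = p[name]
        return w_x * x + w_h * h_prev + bias

    f = sigmoid(pre("forget"))         # how much of the old cell state to keep
    i = sigmoid(pre("input"))          # how much new information to write
    o = sigmoid(pre("output"))         # how much of the cell state to expose
    g = math.tanh(pre("candidate"))    # proposed new content

    c = f * c_prev + i * g             # additive cell-state update
    h = o * math.tanh(c)
    return h, c


# Hypothetical parameters: forget gate biased open, input gate biased shut.
params = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 1.0, 0.0),
}

h, c = 0.0, 1.0                        # start with the value 1.0 stored in the cell
for x in [0.3, -0.7, 0.9, 0.1, -0.4] * 10:   # 50 time steps
    h, c = lstm_step(x, h, c, params)

print("cell state after 50 steps:", c)
```

A plain RNN has no such additive path: its hidden state is fully rewritten through a non-linearity at every step.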

Q48 — Same Padding vs Valid Padding

What is the difference between the same padding and valid padding?

	Same Padding	Valid Padding
Adds zeros	Yes (around the border)	No
Output size	= Input size (when stride=1)	< Input size (when kernel > 1)
Preserves edge info	✅	❌
Use when	You want to keep the spatial dimensions	A smaller output is acceptable
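As a quick numeric check of the table, the sketch below computes the output length along one axis for both modes. It assumes an odd kernel size, so that "same" padding is (kernel - 1) // 2 zeros on each side. Frameworks handle even kernels and strides above 1 in their own ways, which this sketch does not cover.

```python
def output_size(n, kernel, stride=1, mode="valid"):
    """Output length along one spatial axis for 'valid' or 'same' padding.

    For 'same', this sketch assumes an odd kernel and pads
    (kernel - 1) // 2 zeros on each side.
    """
    if mode == "same":
        pad = (kernel - 1) // 2
    elif mode == "valid":
        pad = 0
    else:
        raise ValueError(f"unknown padding mode: {mode}")
    return (n - kernel + 2 * pad) // stride + 1


valid_3 = output_size(32, kernel=3, mode="valid")
same_3 = output_size(32, kernel=3, mode="same")
valid_5 = output_size(32, kernel=5, mode="valid")
same_5 = output_size(32, kernel=5, mode="same")

print(valid_3, same_3, valid_5, same_5)
```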

Q49 — IoU (Intersection over Union)

What is IoU?

A metric for evaluating bounding-box quality in object detection:

IoU = Area of Intersection / Area of Union

IoU = 1: the predicted box matches the ground truth exactly
IoU = 0: no overlap
Common threshold: IoU ≥ 0.5 → True Positive
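A minimal sketch of the computation, assuming axis-aligned boxes given as (x1, y1, x2, y2) corner coordinates. The box format and the example coordinates are assumptions, since detection datasets use several conventions.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2), x1 < x2, y1 < y2."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Overlap along each axis is clamped at 0 when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


ground_truth = (0, 0, 10, 10)
prediction = (5, 0, 15, 10)              # shifted right by half a box width
score = iou(ground_truth, prediction)    # intersection 50, union 150
is_true_positive = score >= 0.5

print(score, is_true_positive)
```

A box shifted by half its width already falls below the 0.5 threshold, which shows how strict the metric is.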

Q50 — Word Embedding and the Distance Between Tokens

In NLP, how word embedding techniques help to establish the distance between 2 tokens?

Word embeddings map each token to a real-valued vector in a high-dimensional space (e.g., Word2Vec, GloVe, BERT embeddings). The distance (or similarity) between 2 tokens is then measured with:

Cosine similarity: the cosine of the angle between the 2 vectors → the most common choice in NLP
Euclidean distance: the straight-line (geometric) distance
Dot product: used in attention mechanisms

Tokens with similar meanings → vectors that lie close together in the embedding space.

Example: king - man + woman ≈ queen
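The measures can be sketched with toy vectors. The 3-dimensional embeddings below are invented so that the analogy works out; they are not taken from Word2Vec, GloVe, or any trained model.

```python
import math

# Hypothetical 3-dimensional embeddings, invented for illustration only.
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.2, 0.8],
    "apple": [-0.7, 0.1, 0.1],
}


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1 = same direction)."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# king - man + woman, computed component by component.
analogy = [k - m + wo for k, m, wo in
           zip(embeddings["king"], embeddings["man"], embeddings["woman"])]

# Nearest remaining word by cosine similarity, excluding the query words.
candidates = [word for word in embeddings if word not in ("king", "man", "woman")]
best = max(candidates, key=lambda word: cosine_similarity(analogy, embeddings[word]))

print("king vs queen:", cosine_similarity(embeddings["king"], embeddings["queen"]))
print("king vs apple:", cosine_similarity(embeddings["king"], embeddings["apple"]))
print("king - man + woman is closest to:", best)
```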