Generative Models
Interview Context
Generative models are usually tested at the concept level in interviews: you don't need to derive every step of the ELBO, but you should be able to explain the core ideas behind VAEs, GANs, and diffusion models, their respective trade-offs, and their practical applications in data science (data augmentation, anomaly detection, synthetic data).
What You Should Understand
- Explain the fundamental difference between generative and discriminative models
- Understand the evolution from autoencoder to VAE and the properties of the VAE latent space
- Know the GAN min-max game and its training challenges (mode collapse, instability)
- Understand diffusion models at a high level (forward/reverse process)
- Name practical applications of generative models in data science
Generative vs Discriminative
| Aspect | Discriminative | Generative |
|---|---|---|
| Learns | P(y\|x) (decision boundary) | P(x) or P(x, y) (data distribution) |
| Goal | Classify / predict | Generate new data similar to training |
| Examples | Logistic regression, SVM, neural nets | VAE, GAN, diffusion, GPT |
| Data efficiency | Usually needs less data | Usually needs more data |
| Can generate? | No | Yes — sample from learned distribution |
Why Do Data Scientists Need Generative Models?
Not every data scientist needs to train GANs. But generative models are directly useful in these scenarios: (1) data augmentation for imbalanced data, (2) anomaly detection (learn normal → detect abnormal), (3) synthetic data generation for privacy, (4) understanding LLMs (GPT is a generative model).
Autoencoders
Architecture
An autoencoder learns a compressed representation of data by training to reconstruct its input: x → encoder → bottleneck z → decoder → reconstruction x̂.
Bottleneck: forces the network to learn a compressed representation; only the most important information survives.
Relation to PCA
A linear autoencoder trained with MSE loss learns the same subspace as PCA. A non-linear autoencoder (e.g. with ReLU activations) learns a non-linear manifold: more powerful, but harder to interpret.
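To make the PCA connection concrete, here is a small NumPy sketch that treats PCA itself as a linear autoencoder with tied encoder/decoder weights (the data and dimensions are made up for illustration):

```python
import numpy as np

# PCA viewed as a linear autoencoder with tied weights:
#   encode: z = (x - mean) @ W,  decode: x_hat = z @ W.T + mean,
# where W's columns are the top-k principal directions.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))  # correlated features

mean = X.mean(axis=0)
# Principal directions come from the SVD of the centered data
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)

def pca_reconstruct(X, k):
    W = Vt[:k].T                 # tied encoder/decoder weight, shape (10, k)
    Z = (X - mean) @ W           # "encode" to k dimensions
    return Z @ W.T + mean        # "decode" back to data space

err_2 = np.mean((X - pca_reconstruct(X, 2)) ** 2)
err_8 = np.mean((X - pca_reconstruct(X, 8)) ** 2)
print(err_2, err_8)  # reconstruction error shrinks as the bottleneck widens
```

Widening the bottleneck monotonically reduces reconstruction error, just as adding PCA components does.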
Use Cases
| Application | How |
|---|---|
| Dimensionality reduction | Use encoder output as features |
| Anomaly detection | Train on normal data → high reconstruction error = anomaly |
| Denoising | Train on noisy input → reconstruct clean output |
| Pre-training | Use encoder as feature extractor for downstream tasks |
Autoencoder Limitations
The latent space an autoencoder learns is unstructured: there is no smooth interpolation between regions, and you cannot sample meaningful new data from it. This is exactly the problem the VAE solves.
```python
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)
```
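A minimal training loop for an autoencoder like the one above might look like this (a compact functional version of the same architecture; a dummy random batch stands in for a real dataset):

```python
import torch
import torch.nn as nn

# Compact encoder + decoder matching the Autoencoder class above
model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 32),               # encoder
    nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 784), nn.Sigmoid(), # decoder
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(64, 784)  # dummy batch standing in for real data
losses = []
for step in range(50):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), x)  # reconstruct the input
    loss.backward()
    opt.step()
    losses.append(loss.item())
```

Overfitting a single batch like this is also a standard sanity check: if the reconstruction loss does not fall, something is wired wrong.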
Variational Autoencoder (VAE)
The Key Idea
VAE adds probabilistic structure to the latent space: the encoder outputs a distribution q(z|x) = N(μ(x), σ²(x)), not a single point.
Training not only minimizes reconstruction error, it also pushes the latent distribution toward a prior p(z) = N(0, I).
The Loss Function (ELBO)
| Term | What It Does |
|---|---|
| Reconstruction loss | Ensures the decoder can reconstruct x from z (same as a plain autoencoder) |
| KL divergence | Pushes the encoder's latent distribution toward the standard normal N(0, I) → smooth, continuous latent space |
The KL divergence is the VAE's magic: it gives the latent space structure, which is what makes sampling and interpolation possible.
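For a diagonal Gaussian encoder q(z|x) = N(μ, σ²I) and prior N(0, I), the KL term has a closed form, which is exactly the expression computed in the `vae_loss` function later on this page:

```latex
D_{KL}\left(\mathcal{N}(\mu, \sigma^2 I)\,\|\,\mathcal{N}(0, I)\right)
= -\frac{1}{2} \sum_{j=1}^{d} \left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right)
```

The term is zero exactly when μ = 0 and σ = 1, i.e. when the encoder output matches the prior.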
Reparameterization Trick
Problem: sampling z from N(μ, σ²) is a stochastic operation → cannot backpropagate through it.
Solution: reparameterize z as a deterministic function plus external noise: z = μ + σ ⊙ ε, with ε ~ N(0, I).
Now z is a differentiable function of μ and σ → backpropagation works normally. The randomness is externalized into ε.
Interview Key Insight
"Why can a VAE generate but an autoencoder cannot?" An autoencoder's latent space has no structure, so a random point in latent space does not necessarily decode to meaningful output. The VAE's KL loss forces the latent space toward a smooth Gaussian, so any point sampled from N(0, I) decodes to reasonable output.
VAE vs Autoencoder
| Aspect | Autoencoder | VAE |
|---|---|---|
| Latent space | Unstructured points | Structured Gaussian distribution |
| Can generate? | No (random z → garbage) | Yes (sample z ~ N(0, I) → meaningful output) |
| Loss | Reconstruction only | Reconstruction + KL divergence |
| Interpolation | May produce artifacts | Smooth interpolation in latent space |
| Encoder output | Single vector z | Mean μ + variance σ² |
```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        self.fc_mu = nn.Linear(256, latent_dim)
        self.fc_var = nn.Linear(256, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid(),
        )

    def encode(self, x):
        h = self.encoder(x)
        return self.fc_mu(h), self.fc_var(h)  # μ and log(σ²)

    def reparameterize(self, mu, log_var):
        std = torch.exp(0.5 * log_var)
        eps = torch.randn_like(std)
        return mu + std * eps  # z = μ + σ * ε

    def forward(self, x):
        mu, log_var = self.encode(x)
        z = self.reparameterize(mu, log_var)
        return self.decoder(z), mu, log_var
```
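Once trained, generation is just "sample from the prior, then decode". A sketch (the decoder here is freshly initialized, so outputs are shaped correctly but not meaningful; in practice you would use the trained `vae.decoder`):

```python
import torch
import torch.nn as nn

latent_dim = 32
# Decoder with the same architecture as the VAE above
decoder = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, 784), nn.Sigmoid(),
)

z = torch.randn(16, latent_dim)  # sample from the prior N(0, I)
samples = decoder(z)             # decode to data space
print(samples.shape)             # torch.Size([16, 784])
```

Because of the final Sigmoid, every sample lands in [0, 1], matching pixel-style inputs.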
Generative Adversarial Networks (GANs)
The Min-Max Game
Two networks compete against each other:
- Generator G: takes random noise z → produces fake data G(z)
- Discriminator D: distinguishes real data from fake → outputs D(x), the probability that x is real
| Player | Goal | Training Signal |
|---|---|---|
| Generator | Fool the discriminator (produce realistic fakes) | Maximize D(G(z)) (make the discriminator think fakes are real) |
| Discriminator | Correctly identify real vs fake | Maximize accuracy on both real and fake |
Intuition: the generator is a counterfeiter, the discriminator is the police. Each forces the other to improve; eventually the generator's fakes are indistinguishable from real data.
Training Dynamics
```python
# GAN training loop (simplified; assumes generator, discriminator,
# their optimizers, bce_loss, and a dataloader yielding real_data)
for real_data in dataloader:
    batch_size = real_data.size(0)
    real_labels = torch.ones(batch_size, 1)
    fake_labels = torch.zeros(batch_size, 1)

    # 1. Train Discriminator
    d_optimizer.zero_grad()
    z = torch.randn(batch_size, latent_dim)
    fake = generator(z).detach()  # don't update G in this step
    d_loss = bce_loss(discriminator(real_data), real_labels) + \
             bce_loss(discriminator(fake), fake_labels)
    d_loss.backward()
    d_optimizer.step()

    # 2. Train Generator
    g_optimizer.zero_grad()
    z = torch.randn(batch_size, latent_dim)
    fake = generator(z)
    g_loss = bce_loss(discriminator(fake), real_labels)  # want D to call fakes real
    g_loss.backward()
    g_optimizer.step()
```
GAN Training Challenges
| Challenge | Description | Solutions |
|---|---|---|
| Mode collapse | Generator produces only a few kinds of output (ignores diversity) | Mini-batch discrimination, unrolled GAN, Wasserstein loss |
| Training instability | G and D training gets unbalanced → oscillation, divergence | Spectral normalization, progressive growing, careful lr tuning |
| Vanishing gradient for G | If D becomes too strong, D(G(z)) → 0 and G's gradient vanishes | Maximize log D(G(z)) instead (non-saturating loss) |
| No convergence guarantee | Min-max optimization lacks the convergence theory of plain minimization | In practice: tricks + careful monitoring |
| Evaluation | No single metric (no likelihood as in VAE) | FID score, IS score |
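The vanishing-gradient row can be made concrete: when D confidently rejects a fake, the saturating loss log(1 − D(G(z))) gives the generator almost no gradient, while the non-saturating −log D(G(z)) still does. A tiny autograd check (the logit value is an arbitrary illustration):

```python
import torch

# D's probability that a fake is real, as a function of the generator's logit
logit = torch.tensor(-4.0, requires_grad=True)  # D is confident: sigmoid(-4) ≈ 0.018
p = torch.sigmoid(logit)

# Saturating generator loss: minimize log(1 - D(G(z)))
loss_sat = torch.log(1 - p)
(g_sat,) = torch.autograd.grad(loss_sat, logit, retain_graph=True)

# Non-saturating generator loss: minimize -log D(G(z))
loss_ns = -torch.log(p)
(g_ns,) = torch.autograd.grad(loss_ns, logit)

print(abs(g_sat.item()), abs(g_ns.item()))  # non-saturating gradient is far larger
```

With BCE against "real" labels, the non-saturating loss is exactly what the training loop above uses for the generator.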
Mode Collapse
The GAN's most infamous problem: the generator discovers that one kind of output fools the discriminator, so it produces only that output and loses all diversity. For example, when generating faces, every generated face looks the same. Wasserstein GAN (WGAN) replaces the JS divergence with the earth mover's distance, which greatly alleviates this problem.
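A sketch of the WGAN-GP critic objective, assuming a toy linear `critic` and random stand-in data (the critic outputs an unbounded score with no sigmoid; λ = 10 follows the WGAN-GP paper):

```python
import torch
import torch.nn as nn

critic = nn.Linear(784, 1)     # toy critic: unbounded score, no sigmoid
real = torch.rand(32, 784)
fake = torch.rand(32, 784)

# Wasserstein critic loss: widen the gap between scores on real and fake data
w_loss = critic(fake).mean() - critic(real).mean()

# Gradient penalty (WGAN-GP): push the critic's gradient norm toward 1
# on random interpolates between real and fake samples
alpha = torch.rand(32, 1)
interp = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
grads = torch.autograd.grad(critic(interp).sum(), interp, create_graph=True)[0]
gp = ((grads.norm(2, dim=1) - 1) ** 2).mean()

total = w_loss + 10.0 * gp     # lambda = 10
```

The generator side simply minimizes −critic(G(z)).mean(); there are no labels and no BCE.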
GAN Variants
| Variant | Innovation | Impact |
|---|---|---|
| DCGAN (2015) | Conv layers + architectural guidelines | Made GANs stable for images |
| WGAN (2017) | Wasserstein distance, gradient penalty | Greatly improved mode collapse and training stability |
| Progressive GAN (2018) | Grow resolution progressively (4→8→...→1024) | First high-res face generation |
| StyleGAN (2019-2021) | Style-based generator, adaptive normalization | Photorealistic faces, controllable generation |
| Conditional GAN | G and D conditioned on class label | Generate specific classes (e.g., digit 7) |
Diffusion Models
High-Level Idea
Diffusion models learn to reverse a gradual noise-adding process:
Forward process (fixed, not learned): gradually add Gaussian noise to the data over T steps until it becomes pure noise: q(x_t | x_{t-1}) = N(√(1 − β_t) · x_{t-1}, β_t I).
Reverse process (learned): a neural network learns to denoise step by step: p_θ(x_{t-1} | x_t) = N(μ_θ(x_t, t), Σ_θ(x_t, t)).
Training: the network ε_θ learns to predict the noise ε that was added at each step: L = E_{t, x_0, ε} ‖ε − ε_θ(x_t, t)‖².
Generation: start from pure noise x_T ~ N(0, I) → iteratively denoise → obtain a sample x_0.
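The training objective above can be sketched in a few lines of DDPM-style code (the linear beta schedule, tiny MLP, and crude timestep feature are illustrative simplifications, not any paper's exact setup):

```python
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # noise schedule β_t
alpha_bar = torch.cumprod(1 - betas, dim=0)  # ᾱ_t = Π (1 - β_s)

# Toy noise-prediction network: input is (x_t, timestep feature)
eps_model = nn.Sequential(nn.Linear(10 + 1, 64), nn.ReLU(), nn.Linear(64, 10))

x0 = torch.randn(32, 10)          # a batch of "clean" data
t = torch.randint(0, T, (32,))    # random timestep per sample
eps = torch.randn_like(x0)        # the noise the network must predict

# Forward process in closed form: x_t = √ᾱ_t · x0 + √(1 - ᾱ_t) · ε
ab = alpha_bar[t].unsqueeze(1)
x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps

# Plain MSE on the noise: no adversarial game anywhere
t_feat = (t.float() / T).unsqueeze(1)  # crude timestep embedding
loss = ((eps_model(torch.cat([x_t, t_feat], dim=1)) - eps) ** 2).mean()
```

Note that the closed-form forward process lets training sample any timestep directly; only generation must iterate over all T steps.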
Why Diffusion Models Won
| Aspect | VAE | GAN | Diffusion |
|---|---|---|---|
| Training stability | Stable | Unstable | Very stable |
| Mode coverage | Good | Mode collapse risk | Excellent |
| Sample quality | Blurry | Sharp | Sharp + diverse |
| Likelihood | Has ELBO | No likelihood | Has likelihood |
| Speed | Fast | Fast | Slow (many denoising steps) |
| Controllability | Limited | Conditional GAN | Excellent (classifier-free guidance) |
Diffusion vs GAN: Interview Focus
"Why did diffusion models replace GANs?" Three reasons: (1) more stable training (no adversarial game, just an MSE loss on noise prediction), (2) no mode collapse (they cover the full data distribution), (3) better controllability (text-to-image via classifier-free guidance). The downside: generation is slow (it requires iterating over T denoising steps).
Key Applications
| Application | Model | How |
|---|---|---|
| Text-to-image | Stable Diffusion, DALL-E 2/3, Midjourney | Text prompt → CLIP text encoder → conditioning for diffusion |
| Image editing | InstructPix2Pix, SDEdit | Edit existing images via text instructions |
| Video generation | Sora | Extend diffusion to temporal dimension |
| Audio | AudioLDM | Diffusion in spectrogram space |
| Protein structure | AlphaFold 3 | Diffusion for 3D structure prediction |
Model Comparison
| Model | How It Generates | Pros | Cons | Best For |
|---|---|---|---|---|
| Autoencoder | Decode from learned z | Simple, fast | Cannot generate(unstructured latent space) | Compression, anomaly detection |
| VAE | Sample z ~ N(0,I), decode | Stable training, has likelihood | Blurry outputs | Structured latent space, interpolation |
| GAN | G(noise) → fake data | Sharp outputs | Unstable training, mode collapse | Image generation (legacy) |
| Diffusion | Iterative denoising | Best quality + diversity | Slow generation | SOTA image/video/audio generation |
| Autoregressive (GPT) | Predict next token | Excellent for text | Sequential → slow | Text generation, LLMs |
Real-World Use Cases
Case 1: Credit Card Fraud — Anomaly Detection with an Autoencoder
Train autoencoder on normal transactions only → high reconstruction error = potential fraud:
```python
import numpy as np

# Assume `autoencoder` has already been trained on normal transactions only.
# At inference: compute per-row reconstruction error
reconstructed = autoencoder(X_test)
errors = ((X_test - reconstructed) ** 2).mean(dim=1)

# Calibrate the threshold on reconstruction errors from a normal validation set
threshold = np.percentile(errors_on_validation, 99)
is_fraud = errors > threshold  # high error → anomaly → potential fraud
```
This is the same idea as PCA reconstruction error: normal data lies on a low-dimensional manifold and can be well reconstructed; fraud deviates from the manifold, producing high reconstruction error.
Case 2: Imbalanced Data — Synthetic Minority Oversampling
A GAN or VAE can generate synthetic samples for the minority class. This can beat SMOTE because it captures the non-linear manifold of the data:
```python
# Train a VAE only on fraud transactions (training loop as in the VAE section)
vae = VAE(input_dim=features.shape[1], latent_dim=16)
# ... train vae on X_fraud ...

# Generate synthetic fraud samples with the trained decoder
z = torch.randn(1000, 16)        # sample from the latent prior
synthetic_fraud = vae.decoder(z)

# Augment the training data with synthetic fraud → retrain the classifier
```
Pitfalls of Synthetic Data
The quality of generated data is only as good as the training data. If the minority class has only 50 rows, the distribution the VAE/GAN learns is poor, so the synthetic data may be noisy or unrealistic. You typically need at least a few hundred to a few thousand minority samples before generative augmentation is worthwhile.
Case 3: Privacy-Preserving Synthetic Data
Use a generative model to produce a synthetic dataset in place of the real data, protecting privacy:
| Step | Action |
|---|---|
| 1 | Train VAE / GAN / Diffusion on real data |
| 2 | Generate synthetic dataset of same size |
| 3 | Verify: the synthetic data's statistical properties match the real data |
| 4 | Verify: individual records cannot be reverse-engineered from the synthetic data |
| 5 | Share synthetic data instead of real data |
Privacy-sensitive domains such as healthcare and finance increasingly use synthetic data for research and model development.
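Step 3's statistical-similarity check can start as simply as comparing marginal means and pairwise correlations (a minimal sketch with stand-in arrays; real audits also check higher-order statistics and downstream model performance):

```python
import numpy as np

rng = np.random.default_rng(0)
real = rng.normal(size=(1000, 5))
synthetic = rng.normal(size=(1000, 5))  # stand-in for generated data

# Largest gap in per-feature means
mean_gap = np.abs(real.mean(axis=0) - synthetic.mean(axis=0)).max()
# Largest gap in pairwise correlation structure
corr_gap = np.abs(np.corrcoef(real.T) - np.corrcoef(synthetic.T)).max()
print(mean_gap, corr_gap)  # small gaps = similar marginal/pairwise structure
```

A common complement is a "discriminator test": train a classifier to separate real from synthetic rows; accuracy near 50% suggests the two are hard to tell apart.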
Hands-on: Generative Models in PyTorch
VAE
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=256, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)
        self.fc_var = nn.Linear(hidden_dim, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, input_dim), nn.Sigmoid(),
        )

    def encode(self, x):
        h = self.encoder(x)
        return self.fc_mu(h), self.fc_var(h)

    def reparameterize(self, mu, log_var):
        std = torch.exp(0.5 * log_var)
        eps = torch.randn_like(std)
        return mu + std * eps

    def forward(self, x):
        mu, log_var = self.encode(x)
        z = self.reparameterize(mu, log_var)
        return self.decoder(z), mu, log_var

def vae_loss(recon_x, x, mu, log_var):
    recon = F.binary_cross_entropy(recon_x, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl
```
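Putting the model and loss together, one training loop on dummy data might look like this (a compact functional stand-in for the class above; inputs are scaled to [0, 1] so BCE is a valid reconstruction loss):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal stand-in for the VAE class above (same structure, smaller dims)
enc = nn.Sequential(nn.Linear(784, 64), nn.ReLU())
fc_mu, fc_var = nn.Linear(64, 8), nn.Linear(64, 8)
dec = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 784), nn.Sigmoid())
params = list(enc.parameters()) + list(fc_mu.parameters()) \
       + list(fc_var.parameters()) + list(dec.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

x = torch.rand(64, 784)  # dummy batch in [0, 1]
for step in range(20):
    opt.zero_grad()
    h = enc(x)
    mu, log_var = fc_mu(h), fc_var(h)
    z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)  # reparameterize
    loss = F.binary_cross_entropy(dec(z), x, reduction="sum") \
         - 0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    loss.backward()
    opt.step()
```

The `reduction="sum"` keeps the reconstruction and KL terms on the same scale; in practice a weight on the KL term (β-VAE style) is a common knob.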
Simple GAN
```python
class Generator(nn.Module):
    def __init__(self, latent_dim=100, output_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, 512), nn.ReLU(),
            nn.Linear(512, output_dim), nn.Tanh(),
        )

    def forward(self, z):
        return self.net(z)

class Discriminator(nn.Module):
    def __init__(self, input_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)
```
Interview Signals
What interviewers listen for:
- You can distinguish generative vs discriminative models
- You know why the VAE's KL loss gives the latent space structure
- You can explain the GAN min-max game and the mode collapse problem
- You know why diffusion models replaced GANs
- You can name practical DS applications of generative models (anomaly detection, data augmentation, synthetic data)
Practice
Flashcards
What is the core difference between generative and discriminative models?
Discriminative models learn P(y|x) (a decision boundary) → classify/predict. Generative models learn P(x) or P(x, y) (the data distribution) → can generate new data. Logistic regression, SVM, plain neural nets = discriminative; VAE, GAN, GPT = generative.
Quiz
What is the core difference between a VAE and a plain autoencoder?