Generative Models

Interview Context

Generative models usually come up at the conceptual level in interviews — you don't need to derive every step of the ELBO, but you should be able to explain the core ideas behind VAEs / GANs / diffusion models, their respective trade-offs, and their practical applications in data science (data augmentation, anomaly detection, synthetic data).

What You Should Understand

  • Distinguish the fundamental difference between generative and discriminative models
  • Understand the evolution from autoencoder → VAE and the properties of the VAE latent space
  • Know the GAN min-max game and its training challenges (mode collapse, instability)
  • Understand diffusion models at a high level (forward/reverse process)
  • Name practical applications of generative models in DS

Generative vs Discriminative

| Aspect | Discriminative | Generative |
| --- | --- | --- |
| Learns | $P(y \mid x)$ (decision boundary) | $P(x)$ or $P(x, y)$ (data distribution) |
| Goal | Classify / predict | Generate new data similar to training data |
| Examples | Logistic regression, SVM, neural nets | VAE, GAN, diffusion, GPT |
| Data efficiency | Usually needs less data | Usually needs more data |
| Can generate? | No | Yes — sample from the learned distribution |
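The distinction can be made concrete with a toy sketch (NumPy, hypothetical data): a generative classifier fits one Gaussian per class, which lets it both classify via Bayes' rule *and* sample new points — a purely discriminative boundary offers no analogous sampling operation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy 1-D dataset: two classes with different means (hypothetical data)
x0 = rng.normal(-2.0, 1.0, size=500)  # class 0
x1 = rng.normal(+2.0, 1.0, size=500)  # class 1

# Generative route: model P(x | y) with one Gaussian per class
mu0, s0 = x0.mean(), x0.std()
mu1, s1 = x1.mean(), x1.std()

def gaussian_pdf(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

def classify(x):
    # Bayes' rule with equal priors: pick the class with higher P(x | y)
    return int(gaussian_pdf(x, mu1, s1) > gaussian_pdf(x, mu0, s0))

# Because P(x | y) is modeled explicitly, we can also *sample* new data
new_class1_points = rng.normal(mu1, s1, size=10)
```

A discriminative model (e.g., logistic regression) would learn only the boundary near $x = 0$ and could not produce `new_class1_points`.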

Why Should a DS Understand Generative Models?

Not every DS needs to train GANs. But generative models are directly useful in these scenarios: (1) data augmentation for imbalanced data, (2) anomaly detection (learn normal → detect abnormal), (3) synthetic data generation for privacy, (4) understanding LLMs (GPT is a generative model).

Autoencoders

Architecture

An autoencoder learns a compressed representation of data by training to reconstruct its input:

$$\text{Input } \mathbf{x} \xrightarrow{\text{Encoder } f} \text{Latent code } \mathbf{z} \xrightarrow{\text{Decoder } g} \text{Reconstruction } \hat{\mathbf{x}}$$

$$\min_{\theta} \|\mathbf{x} - g(f(\mathbf{x}))\|^2$$

Bottleneck: $\text{dim}(\mathbf{z}) \ll \text{dim}(\mathbf{x})$ forces the network to learn a compressed representation — only the most important information survives.

Relation to PCA

A linear autoencoder trained with MSE loss recovers the same subspace as PCA (though not necessarily the orthonormal principal-component basis). A non-linear autoencoder (e.g., with ReLU activations) learns a non-linear manifold — more powerful but harder to interpret.
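A quick way to see the PCA connection — a sketch with synthetic data: the top-$k$ SVD reconstruction is exactly the best a $k$-dimensional linear bottleneck can do.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))  # correlated features
Xc = X - X.mean(axis=0)  # center, as PCA does

# PCA via SVD: project onto the top-k right singular vectors, then map back
k = 3
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:k].T   # "encoder": linear projection to k dimensions
X_hat = Z @ Vt[:k]  # "decoder": linear map back to input space

mse = ((Xc - X_hat) ** 2).mean()
# No linear autoencoder with a k-dim bottleneck can achieve lower MSE:
# the optimal linear bottleneck spans the top-k principal subspace
```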

Use Cases

| Application | How |
| --- | --- |
| Dimensionality reduction | Use the encoder output $\mathbf{z}$ as features |
| Anomaly detection | Train on normal data → high reconstruction error = anomaly |
| Denoising | Train on noisy inputs to reconstruct clean outputs |
| Pre-training | Use the encoder as a feature extractor for downstream tasks |

Autoencoder Limitations

The latent space an autoencoder learns is unstructured — there is no smooth interpolation between regions, and you cannot sample meaningful new data from it. This is exactly the problem VAEs solve.

```python
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)
```

Variational Autoencoder (VAE)

The Key Idea

VAE adds probabilistic structure to the latent space — encoder outputs a distribution (mean + variance), not a single point:

$$\text{Encoder: } q_\phi(\mathbf{z} \mid \mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\sigma}^2) \qquad \text{Decoder: } p_\theta(\mathbf{x} \mid \mathbf{z})$$

Training does not only minimize reconstruction error — it also pushes the latent distribution toward a prior $p(\mathbf{z}) = \mathcal{N}(0, I)$.

The Loss Function (ELBO)

$$\mathcal{L} = \underbrace{-E_{q(\mathbf{z}|\mathbf{x})}[\log p(\mathbf{x}|\mathbf{z})]}_{\text{Reconstruction loss}} + \underbrace{D_{KL}(q(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z}))}_{\text{KL divergence}}$$

| Term | What It Does |
| --- | --- |
| Reconstruction loss | Ensures the decoder can rebuild $\mathbf{x}$ from $\mathbf{z}$ (same as an autoencoder) |
| KL divergence | Forces the encoder's latent distribution toward a standard normal → a smooth, continuous latent space |

The KL divergence is the VAE's magic — it gives the latent space structure, which makes sampling and interpolation possible.
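For a Gaussian encoder and a standard normal prior, this KL term has a closed form, $-\tfrac{1}{2}\sum_j (1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2)$ — exactly the term VAE implementations sum up. A small cross-check against `torch.distributions`:

```python
import torch
from torch.distributions import Normal, kl_divergence

mu = torch.tensor([0.5, -1.0])
log_var = torch.tensor([0.2, -0.3])
std = torch.exp(0.5 * log_var)

# Closed-form KL( N(mu, sigma^2) || N(0, I) ), summed over dimensions --
# the exact KL term used in VAE losses
kl_closed = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())

# Cross-check against torch.distributions
kl_ref = kl_divergence(Normal(mu, std), Normal(torch.zeros(2), torch.ones(2))).sum()
```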

Reparameterization Trick

Problem: sampling from $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\sigma}^2)$ is a stochastic operation → you cannot backpropagate through it.

Solution: Reparameterize as deterministic function + external noise:

$$\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(0, I)$$

Now $\mathbf{z}$ is a differentiable function of $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$ → backpropagation works normally. The randomness is externalized into $\boldsymbol{\epsilon}$.
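A minimal check that gradients really do flow through a reparameterized sample:

```python
import torch

mu = torch.tensor([0.0, 1.0], requires_grad=True)
log_var = torch.zeros(2, requires_grad=True)

# z is a deterministic, differentiable function of (mu, log_var);
# all randomness lives in eps
eps = torch.randn(2)
z = mu + torch.exp(0.5 * log_var) * eps

z.sum().backward()
```

Since $\partial z_i / \partial \mu_i = 1$, `mu.grad` comes out as all ones; sampling `z` directly from a distribution would have broken this gradient path.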

Interview Key Insight

"Why can a VAE generate but an autoencoder cannot?" — An autoencoder's latent space has no structure → a random point in latent space does not necessarily decode to meaningful output. The VAE's KL loss forces the latent space toward a smooth Gaussian → any point sampled from N(0, I) decodes to reasonable output.

VAE vs Autoencoder

| Aspect | Autoencoder | VAE |
| --- | --- | --- |
| Latent space | Unstructured points | Structured Gaussian distribution |
| Can generate? | No (random z → garbage) | Yes (sample z ~ N(0, I) → meaningful output) |
| Loss | Reconstruction only | Reconstruction + KL divergence |
| Interpolation | May produce artifacts | Smooth interpolation in latent space |
| Encoder output | Single vector z | Mean μ + variance σ² |

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        self.fc_mu = nn.Linear(256, latent_dim)
        self.fc_var = nn.Linear(256, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid(),
        )

    def encode(self, x):
        h = self.encoder(x)
        return self.fc_mu(h), self.fc_var(h)  # μ and log(σ²)

    def reparameterize(self, mu, log_var):
        std = torch.exp(0.5 * log_var)
        eps = torch.randn_like(std)
        return mu + std * eps  # z = μ + σ * ε

    def forward(self, x):
        mu, log_var = self.encode(x)
        z = self.reparameterize(mu, log_var)
        return self.decoder(z), mu, log_var
```

Generative Adversarial Networks (GANs)

The Min-Max Game

Two networks compete against each other:

  • Generator GG: Takes random noise zp(z)\mathbf{z} \sim p(\mathbf{z}) → produces fake data G(z)G(\mathbf{z})
  • Discriminator DD: Distinguishes real data from fake — outputs probability of being real
$$\min_G \max_D \; E_{\mathbf{x} \sim p_{\text{data}}}[\log D(\mathbf{x})] + E_{\mathbf{z} \sim p(\mathbf{z})}[\log(1 - D(G(\mathbf{z})))]$$

| Player | Goal | Training Signal |
| --- | --- | --- |
| Generator | Fool the discriminator (produce realistic fakes) | Maximize $D(G(z))$ (discriminator thinks fakes are real) |
| Discriminator | Correctly identify real vs fake | Maximize accuracy on both real and fake |

Intuition: the generator is a counterfeiter and the discriminator is the police. They improve each other — eventually the generator's fakes are indistinguishable from real data.

Training Dynamics

```python
# GAN training loop (simplified; generator, discriminator, optimizers,
# bce_loss = nn.BCELoss(), batch_size, latent_dim are assumed to exist)
for real_data in dataloader:
    # 1. Train Discriminator
    real_labels = torch.ones(batch_size, 1)
    fake_labels = torch.zeros(batch_size, 1)

    z = torch.randn(batch_size, latent_dim)
    fake = generator(z).detach()  # don't update G in this step

    d_optimizer.zero_grad()
    d_loss = bce_loss(discriminator(real_data), real_labels) + \
             bce_loss(discriminator(fake), fake_labels)
    d_loss.backward()
    d_optimizer.step()

    # 2. Train Generator (non-saturating loss)
    z = torch.randn(batch_size, latent_dim)
    fake = generator(z)
    g_optimizer.zero_grad()
    g_loss = bce_loss(discriminator(fake), real_labels)  # want D to call fakes real
    g_loss.backward()
    g_optimizer.step()
```

GAN Training Challenges

| Challenge | Description | Solutions |
| --- | --- | --- |
| Mode collapse | Generator produces only a few kinds of output (ignores diversity) | Mini-batch discrimination, unrolled GAN, Wasserstein loss |
| Training instability | Imbalance between G and D training → oscillation, divergence | Spectral normalization, progressive growing, careful lr tuning |
| Vanishing gradient for G | If D is too strong → $\log(1 - D(G(z))) \approx 0$ → G's gradient vanishes | Use $-\log D(G(z))$ instead (non-saturating loss) |
| No convergence guarantee | Min-max optimization lacks the convergence theory of plain minimization | In practice: tricks + careful monitoring |
| Evaluation | No single metric (no likelihood as in a VAE) | FID score, IS score |

Mode Collapse

The GAN's most infamous problem: the generator discovers that one kind of output fools the discriminator → it produces only that output → diversity is lost entirely. For example, when generating faces, every generated face looks the same. Wasserstein GAN (WGAN) replaces the JS divergence with the earth mover's distance, which greatly mitigates this problem.

GAN Variants

| Variant | Innovation | Impact |
| --- | --- | --- |
| DCGAN (2015) | Conv layers + architectural guidelines | Made GANs stable for images |
| WGAN (2017) | Wasserstein distance; gradient penalty (WGAN-GP) | Much better training stability, reduced mode collapse |
| Progressive GAN (2018) | Grow resolution progressively (4→8→...→1024) | First high-res face generation |
| StyleGAN (2019–2021) | Style-based generator, adaptive normalization | Photorealistic faces, controllable generation |
| Conditional GAN | G and D conditioned on class labels | Generate specific classes (e.g., the digit 7) |

Diffusion Models

High-Level Idea

Diffusion models learn to reverse a gradual noise-adding process:

Forward process (fixed, not learned): Gradually add Gaussian noise to data over TT steps until it becomes pure noise:

$$q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1-\beta_t}\,\mathbf{x}_{t-1}, \beta_t \mathbf{I})$$
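A step worth knowing (standard DDPM algebra, not stated above): iterating this one-step transition gives a closed-form marginal. With $\bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)$,

$$q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\big(\mathbf{x}_t;\ \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0,\ (1-\bar{\alpha}_t)\mathbf{I}\big) \quad\Longleftrightarrow\quad \mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon},$$

so any $\mathbf{x}_t$ can be sampled directly from $\mathbf{x}_0$ without simulating the intermediate steps — this is what makes training on a random $t$ cheap.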

Reverse process (learned): A neural network learns to denoise step by step:

$$p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \sigma_t^2 \mathbf{I})$$

Training: Network learns to predict the noise ϵ\boldsymbol{\epsilon} that was added at each step:

$$\mathcal{L} = E_{t, \mathbf{x}_0, \boldsymbol{\epsilon}}[\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\|^2]$$

Generation: start from pure noise $\mathbf{x}_T \sim \mathcal{N}(0, I)$ → iteratively denoise → obtain $\mathbf{x}_0$.
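The forward process can be simulated in one shot via its closed-form marginal — a small sketch with a DDPM-style linear β schedule and toy tensors (no real data):

```python
import torch

torch.manual_seed(0)

T = 1000
betas = torch.linspace(1e-4, 0.02, T)     # DDPM-style linear noise schedule
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)  # \bar{alpha}_t = prod of (1 - beta_s)

def q_sample(x0, t, eps):
    """Sample x_t ~ q(x_t | x_0) directly, without iterating t steps."""
    a = alpha_bar[t]
    return torch.sqrt(a) * x0 + torch.sqrt(1.0 - a) * eps

x0 = torch.randn(4, 8)        # toy "data"
eps = torch.randn_like(x0)

x_early = q_sample(x0, 10, eps)     # early step: still mostly signal
x_late = q_sample(x0, T - 1, eps)   # late step: almost pure noise
```

At $t = T-1$, `alpha_bar` is tiny, so `x_late` is essentially the injected noise — exactly the regime generation starts from.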

Why Diffusion Models Won

| Aspect | VAE | GAN | Diffusion |
| --- | --- | --- | --- |
| Training stability | Stable | Unstable | Very stable |
| Mode coverage | Good | Mode collapse risk | Excellent |
| Sample quality | Blurry | Sharp | Sharp + diverse |
| Likelihood | Has ELBO | No likelihood | Has likelihood bound |
| Speed | Fast | Fast | Slow (many denoising steps) |
| Controllability | Limited | Conditional GAN | Excellent (classifier-free guidance) |

Diffusion vs GAN: Interview Focus

"Why did diffusion models displace GANs?" — Three reasons: (1) more stable training (no adversarial game, just MSE loss on noise prediction), (2) no mode collapse (covers the full data distribution), (3) better controllability (text-to-image via classifier-free guidance). Downside: slow generation (requires iterating over T steps).

Key Applications

| Application | Model | How |
| --- | --- | --- |
| Text-to-image | Stable Diffusion, DALL-E 2/3, Midjourney | Text prompt → CLIP text encoder → conditioning for diffusion |
| Image editing | InstructPix2Pix, SDEdit | Edit existing images via text instructions |
| Video generation | Sora | Extend diffusion to the temporal dimension |
| Audio | AudioLDM | Diffusion in spectrogram space |
| Protein structure | AlphaFold 3 | Diffusion for 3D structure prediction |

Model Comparison

| Model | How It Generates | Pros | Cons | Best For |
| --- | --- | --- | --- | --- |
| Autoencoder | Decode from learned z | Simple, fast | Cannot generate (unstructured latent space) | Compression, anomaly detection |
| VAE | Sample z ~ N(0, I), decode | Stable training, has likelihood | Blurry outputs | Structured latent space, interpolation |
| GAN | G(noise) → fake data | Sharp outputs | Unstable training, mode collapse | Image generation (legacy) |
| Diffusion | Iterative denoising | Best quality + diversity | Slow generation | SOTA image/video/audio generation |
| Autoregressive (GPT) | Predict next token | Excellent for text | Sequential → slow | Text generation, LLMs |

Real-World Use Cases

Case 1: Credit Card Fraud — Anomaly Detection with an Autoencoder

Train autoencoder on normal transactions only → high reconstruction error = potential fraud:

```python
import torch

# Sketch: `autoencoder` is the Autoencoder module above, already trained
# on normal transactions only (training loop omitted)
with torch.no_grad():
    # At inference: compute per-transaction reconstruction error
    reconstructed = autoencoder(X_test)
    errors = ((X_test - reconstructed) ** 2).mean(dim=1)

    # Threshold = 99th percentile of errors on held-out *normal* data
    # (errors_val computed the same way on a validation set)
    threshold = torch.quantile(errors_val, 0.99)

# High error → anomaly → potential fraud
is_fraud = errors > threshold
```

This follows the same logic as PCA reconstruction error — normal data lies on a low-dimensional manifold → it can be well reconstructed. Fraud data deviates from the manifold → high reconstruction error.

Case 2: Imbalanced Data — Synthetic Minority Oversampling

A GAN or VAE can generate synthetic samples for the minority class — better than SMOTE because it captures the non-linear manifold:

```python
# Train a VAE only on fraud transactions (training loop omitted;
# the VAE class above has no .fit method)
vae = VAE(input_dim=features.shape[1], latent_dim=16)
# ... optimize vae_loss on X_fraud ...

# Generate synthetic fraud samples by decoding draws from the prior
z = torch.randn(1000, 16)  # sample z ~ N(0, I)
synthetic_fraud = vae.decoder(z)
# Augment training data with synthetic fraud → retrain the classifier
```

Pitfalls of Synthetic Data

The quality of generated data is proportional to the quality of the training data. If the minority class has only 50 records, the distribution a VAE/GAN learns is not good enough → synthetic data may be noisy or unrealistic. You typically need at least a few hundred to a few thousand minority samples before generative augmentation is worth it.

Case 3: Privacy-Preserving Synthetic Data

Use a generative model to produce a synthetic dataset in place of the real data — protecting privacy:

| Step | Action |
| --- | --- |
| 1 | Train a VAE / GAN / diffusion model on the real data |
| 2 | Generate a synthetic dataset of the same size |
| 3 | Verify: the synthetic data's statistical properties match the real data |
| 4 | Verify: the synthetic data cannot be reverse-engineered into individual records |
| 5 | Share the synthetic data instead of the real data |

Privacy-sensitive fields such as healthcare and finance increasingly use synthetic data for research and model development.
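Step 3's statistical check can be sketched as follows (NumPy, with stand-in arrays in place of the real and generated tables): compare marginal means/stds and the correlation structure.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins: `real` would be the real table, `synthetic` the generated one
real = rng.normal(loc=[0.0, 5.0], scale=[1.0, 2.0], size=(5000, 2))
synthetic = rng.normal(loc=[0.0, 5.0], scale=[1.0, 2.0], size=(5000, 2))

# Compare marginal means/stds and pairwise correlations
mean_gap = np.abs(real.mean(axis=0) - synthetic.mean(axis=0)).max()
std_gap = np.abs(real.std(axis=0) - synthetic.std(axis=0)).max()
corr_gap = np.abs(np.corrcoef(real.T) - np.corrcoef(synthetic.T)).max()

looks_similar = mean_gap < 0.2 and std_gap < 0.2 and corr_gap < 0.2
```

In practice you would add distribution-level tests (e.g., per-column two-sample tests) and a membership-inference check for step 4; the thresholds here are illustrative.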

Hands-on: Generative Models in PyTorch

VAE

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=256, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)
        self.fc_var = nn.Linear(hidden_dim, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, input_dim), nn.Sigmoid(),
        )

    def encode(self, x):
        h = self.encoder(x)
        return self.fc_mu(h), self.fc_var(h)  # μ and log(σ²)

    def reparameterize(self, mu, log_var):
        std = torch.exp(0.5 * log_var)
        eps = torch.randn_like(std)
        return mu + std * eps

    def forward(self, x):
        mu, log_var = self.encode(x)
        z = self.reparameterize(mu, log_var)
        return self.decoder(z), mu, log_var

def vae_loss(recon_x, x, mu, log_var):
    recon = F.binary_cross_entropy(recon_x, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl
```
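One training step for a VAE of this kind can be sketched end-to-end — `TinyVAE` below is a hypothetical, scaled-down stand-in for the class above, kept small so the step is self-contained; the loss is the same reconstruction + KL.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# TinyVAE: a hypothetical, scaled-down stand-in for the VAE above
class TinyVAE(nn.Module):
    def __init__(self, d=20, h=16, z_dim=4):
        super().__init__()
        self.enc = nn.Linear(d, h)
        self.fc_mu = nn.Linear(h, z_dim)
        self.fc_var = nn.Linear(h, z_dim)
        self.dec = nn.Sequential(
            nn.Linear(z_dim, h), nn.ReLU(), nn.Linear(h, d), nn.Sigmoid()
        )

    def forward(self, x):
        h = torch.relu(self.enc(x))
        mu, log_var = self.fc_mu(h), self.fc_var(h)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)  # reparameterize
        return self.dec(z), mu, log_var

vae = TinyVAE()
opt = torch.optim.Adam(vae.parameters(), lr=1e-3)
x = torch.rand(32, 20)  # stand-in batch with values in [0, 1]

# One training step: forward, loss = reconstruction + KL, backprop, update
recon, mu, log_var = vae(x)
recon_term = F.binary_cross_entropy(recon, x, reduction="sum")
kl_term = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
loss = recon_term + kl_term

opt.zero_grad()
loss.backward()
opt.step()
```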

Simple GAN

```python
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, latent_dim=100, output_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, 512), nn.ReLU(),
            nn.Linear(512, output_dim), nn.Tanh(),
        )

    def forward(self, z):
        return self.net(z)

class Discriminator(nn.Module):
    def __init__(self, input_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)
```

Interview Signals

What interviewers listen for:

  • You can distinguish generative vs discriminative models
  • You know why the VAE's KL loss gives the latent space structure
  • You can explain the GAN min-max game and the mode collapse problem
  • You know why diffusion models displaced GANs
  • You can name practical DS applications of generative models (anomaly detection, data augmentation, synthetic data)

Practice

Flashcards


What is the core difference between generative and discriminative models?

Discriminative models learn P(y|x) (the decision boundary) → classify/predict. Generative models learn P(x) or P(x, y) (the data distribution) → can generate new data. Logistic regression, SVM, plain neural nets = discriminative. VAE, GAN, GPT = generative.


Quiz

Question 1/10

What is the most fundamental difference between a VAE and an autoencoder?
