SQL Interview Questions

Interview Context

SQL questions and metric-definition questions usually show up together. The interviewer wants to know whether you can turn raw event tables into trustworthy conclusions. The key is not just correct syntax, but whether you can clearly define what you are computing, who is in the denominator, and what the grain is.

What You Should Understand

  • Be comfortable with joins, window functions, aggregation grain, and deduplication
  • Be able to define exactly how DAU, retention, conversion, and funnel conversion are computed
  • Know what biases arise when event-level, user-level, and session-level metrics get mixed
  • When numbers look off, check data freshness, missing data, duplicates, and the denominator definition first

JOIN Types

The Four Basic Joins

-- INNER JOIN: only matching rows from both tables
SELECT u.user_id, o.order_id
FROM users u
INNER JOIN orders o ON u.user_id = o.user_id;

-- LEFT JOIN: all rows from left table, matching from right (NULL if no match)
SELECT u.user_id, o.order_id
FROM users u
LEFT JOIN orders o ON u.user_id = o.user_id;

-- RIGHT JOIN: all rows from right table, matching from left
-- (rarely used — just swap table order and use LEFT JOIN)

-- FULL OUTER JOIN: all rows from both tables
SELECT u.user_id, o.order_id
FROM users u
FULL OUTER JOIN orders o ON u.user_id = o.user_id;

Join Comparison

Join Type  | Left Table              | Right Table             | Use Case
INNER      | Only matching           | Only matching           | Only rows with a match
LEFT       | All rows                | Matching (NULL if none) | Keep every left-table row (most common)
RIGHT      | Matching (NULL if none) | All rows                | Rarely used; swap table order and use LEFT
FULL OUTER | All rows                | All rows                | Find rows present in both or in only one side
CROSS      | All rows                | All rows                | Cartesian product (rarely used unless intentional)

Join Pitfalls: Row Multiplication

Many-to-many joins are the most common SQL bug in interviews:

-- DANGEROUS: 1 user × 3 orders × 2 logins = 6 rows per user
SELECT u.user_id, o.order_id, l.login_time
FROM users u
JOIN orders o ON u.user_id = o.user_id
JOIN logins l ON u.user_id = l.user_id;

Solution: Aggregate before joining:

-- CORRECT: aggregate first, then join → 1 row per user
SELECT u.user_id, o.order_count, l.login_count
FROM users u
LEFT JOIN (
    SELECT user_id, COUNT(*) AS order_count
    FROM orders GROUP BY user_id
) o ON u.user_id = o.user_id
LEFT JOIN (
    SELECT user_id, COUNT(*) AS login_count
    FROM logins GROUP BY user_id
) l ON u.user_id = l.user_id;

High-Frequency Interview Trap: JOIN Fanout

If you write multiple JOINs without thinking about grain, the interviewer will ask "does this JOIN fan out?". Always work out the grain of each table first, and then the grain of the result after the JOIN.
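The fanout can be made concrete with a small runnable sketch. This uses Python's built-in sqlite3 with made-up table contents (one user, three orders, two logins); the table and column names mirror the examples above but the data is invented:

```python
import sqlite3

# Hypothetical tiny dataset: 1 user, 3 orders, 2 logins.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER);
    CREATE TABLE orders (user_id INTEGER, order_id INTEGER);
    CREATE TABLE logins (user_id INTEGER, login_time TEXT);
    INSERT INTO users VALUES (1);
    INSERT INTO orders VALUES (1, 101), (1, 102), (1, 103);
    INSERT INTO logins VALUES (1, '09:00'), (1, '17:00');
""")

# Naive double JOIN: the grain becomes order x login, so 3 x 2 = 6 rows.
fanout = conn.execute("""
    SELECT COUNT(*) FROM users u
    JOIN orders o ON u.user_id = o.user_id
    JOIN logins l ON u.user_id = l.user_id
""").fetchone()[0]
print(fanout)  # 6

# Aggregate before joining: back to one row per user.
fixed = conn.execute("""
    SELECT COUNT(*) FROM users u
    LEFT JOIN (SELECT user_id, COUNT(*) AS order_count
               FROM orders GROUP BY user_id) o ON u.user_id = o.user_id
    LEFT JOIN (SELECT user_id, COUNT(*) AS login_count
               FROM logins GROUP BY user_id) l ON u.user_id = l.user_id
""").fetchone()[0]
print(fixed)  # 1
```

The counts make the grain shift visible: the naive query multiplies rows, the pre-aggregated version does not.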

Window Functions

Window functions compute values across a set of rows without collapsing them (unlike GROUP BY).

Syntax

function_name() OVER (
    [PARTITION BY column1, column2, ...]
    [ORDER BY column3, column4, ...]
    [ROWS/RANGE BETWEEN ... AND ...]
)

ROW_NUMBER, RANK, DENSE_RANK

-- Get each user's most recent order
SELECT * FROM (
    SELECT user_id, order_id, order_date,
        ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY order_date DESC) AS rn
    FROM orders
) ranked WHERE rn = 1;

Function   | Ties      | Gaps | Scores: 100, 90, 90, 80
ROW_NUMBER | Arbitrary | No   | 1, 2, 3, 4
RANK       | Same rank | Yes  | 1, 2, 2, 4
DENSE_RANK | Same rank | No   | 1, 2, 2, 3

About 80% of window function interview questions need only ROW_NUMBER + PARTITION BY.

Running Totals and Moving Averages

-- Cumulative revenue
SELECT order_date, daily_revenue,
    SUM(daily_revenue) OVER (ORDER BY order_date) AS cumulative_revenue
FROM daily_sales;

-- 7-day moving average
SELECT order_date, daily_revenue,
    AVG(daily_revenue) OVER (
        ORDER BY order_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS moving_avg_7d
FROM daily_sales;

LAG and LEAD

-- Day-over-day change
SELECT order_date, daily_revenue,
    LAG(daily_revenue, 1) OVER (ORDER BY order_date) AS prev_day,
    daily_revenue - LAG(daily_revenue, 1) OVER (ORDER BY order_date) AS dod_change
FROM daily_sales;

-- LEAD: next row's value
-- LAG(x, 2): 2 rows back
-- LAG(x, 1, 0): default value 0 if no previous row

NTILE and Percentiles

-- Divide users into 4 quartiles by spending
SELECT user_id, total_spend,
    NTILE(4) OVER (ORDER BY total_spend) AS spend_quartile
FROM user_stats;

-- Percentile rank
SELECT user_id, total_spend,
    PERCENT_RANK() OVER (ORDER BY total_spend) AS pct_rank
FROM user_stats;

Window Function Interview Essentials

Typical scenarios: each user's most recent row (ROW_NUMBER), top N per category (ROW_NUMBER + PARTITION BY), deduplicating while keeping the latest record (ROW_NUMBER), day-over-day growth (LAG), cumulative sums (SUM OVER), and moving averages (AVG OVER ROWS BETWEEN).

Common Table Expressions (CTEs)

CTEs make complex queries readable by breaking them into logical steps:

WITH daily_metrics AS (
    SELECT DATE(event_time) AS event_date,
        COUNT(DISTINCT user_id) AS dau
    FROM events WHERE event_type = 'page_view'
    GROUP BY DATE(event_time)
),
weekly_avg AS (
    SELECT event_date, dau,
        AVG(dau) OVER (ORDER BY event_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS dau_7d_avg
    FROM daily_metrics
)
SELECT * FROM weekly_avg ORDER BY event_date;

CTE vs Subquery

Aspect        | CTE                                      | Subquery
Readability   | Easier to read (step by step)            | Nesting is hard to read
Reuse         | Can be referenced multiple times         | Must be rewritten for each reference
Debugging     | Each CTE can be run on its own           | Hard to debug
Performance   | Some DBs materialize CTEs                | Usually inlined and optimized
Rule of thumb | More than 2 levels of nesting: use a CTE | Simple one-off: subquery is fine

In interviews, prefer CTEs; they make it easier for the interviewer to follow your reasoning.

Grain Awareness

What Is Grain?

The grain of a table or query = "what does one row represent?" Getting this wrong is the root cause of most SQL bugs.

Table            | Grain        | One Row =
users            | User         | One unique user
orders           | Order        | One order (a user can have many)
order_items     | Order × Item | One item within one order
events           | Event        | One event (many per user per day)
daily_user_stats | User × Day   | One user on one day

Grain Mismatch Examples

-- WRONG: counting events, calling it "users"
SELECT COUNT(*) AS "user_count"  -- Actually event count!
FROM events WHERE event_type = 'purchase';

-- CORRECT: count distinct users
SELECT COUNT(DISTINCT user_id) AS user_count
FROM events WHERE event_type = 'purchase';

-- WRONG: revenue inflated by JOIN fanout
SELECT SUM(o.amount) AS total_revenue
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id;
-- Order with 3 items → amount counted 3 times!

-- CORRECT: aggregate at right grain
SELECT SUM(amount) AS total_revenue FROM orders;

Interview Golden Rule

Before writing any query, ask yourself: "what does one row of the result represent?" If you cannot answer, you have not thought the grain through.
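The revenue-fanout mismatch can be checked numerically with sqlite3 in Python; the orders, items, and amounts below are invented for illustration:

```python
import sqlite3

# Hypothetical data: orders (grain: one order) and order_items (grain: order x item).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, amount REAL);
    CREATE TABLE order_items (order_id INTEGER, item TEXT);
    INSERT INTO orders VALUES (1, 30.0), (2, 10.0);
    INSERT INTO order_items VALUES (1, 'a'), (1, 'b'), (1, 'c'), (2, 'x');
""")

# WRONG grain: the JOIN repeats order 1's amount once per item (3 times).
inflated = conn.execute("""
    SELECT SUM(o.amount) FROM orders o
    JOIN order_items oi ON o.order_id = oi.order_id
""").fetchone()[0]
print(inflated)  # 100.0  (30*3 + 10*1)

# RIGHT grain: sum at the orders grain only.
correct = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(correct)   # 40.0
```

True revenue is 40.0; the fanned-out sum reports 100.0, a 2.5x overstatement from just one three-item order.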

Metric Definition Patterns

DAU (Daily Active Users)

SELECT DATE(event_time) AS dt, COUNT(DISTINCT user_id) AS dau
FROM events
WHERE event_type IN ('page_view', 'click', 'purchase')
GROUP BY DATE(event_time);

Key decision: What counts as "active"? A page view? Any event? Purchases only? The definition must align with the business meaning.

Retention

Day-N retention: of the users who signed up on day 0, what fraction came back on day N?

WITH signups AS (
    SELECT user_id, DATE(created_at) AS signup_date FROM users
),
day7_active AS (
    SELECT DISTINCT user_id, DATE(event_time) AS active_date FROM events
)
SELECT
    s.signup_date,
    COUNT(DISTINCT s.user_id) AS cohort_size,
    COUNT(DISTINCT d7.user_id) AS retained_day7,
    ROUND(COUNT(DISTINCT d7.user_id) * 1.0 / COUNT(DISTINCT s.user_id), 4) AS day7_retention
FROM signups s
LEFT JOIN day7_active d7
    ON s.user_id = d7.user_id
    AND d7.active_date = s.signup_date + INTERVAL '7 days'
GROUP BY s.signup_date
ORDER BY s.signup_date;

Common Retention Variants

Day-N (classic): counted if the user returns on exactly day N. Bounded: returns anywhere between day N and day N+K (more stable). Unbounded: returns at any time on or after day N. In interviews, confirm which definition is wanted before writing SQL.
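The classic vs bounded distinction can be sketched with sqlite3. The cohort below is invented: A returns on day 7 exactly, B on day 8, C never; the retention helper is a hypothetical convenience for this demo:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE signups (user_id TEXT, signup_date TEXT);
    CREATE TABLE activity (user_id TEXT, active_date TEXT);
    INSERT INTO signups VALUES ('A','2024-01-01'),('B','2024-01-01'),('C','2024-01-01');
    INSERT INTO activity VALUES ('A','2024-01-08'),('B','2024-01-09');
""")

def retention(return_window):
    # Share of the cohort with a return visit matching the given condition.
    return conn.execute(f"""
        SELECT COUNT(DISTINCT a.user_id) * 1.0 / COUNT(DISTINCT s.user_id)
        FROM signups s
        LEFT JOIN activity a ON s.user_id = a.user_id AND {return_window}
    """).fetchone()[0]

# Classic day-7: back exactly on signup_date + 7 days -> only A qualifies.
classic = retention("a.active_date = date(s.signup_date, '+7 days')")

# Bounded day-7 (return anywhere in days 7..9): A and B qualify -> more stable.
bounded = retention("a.active_date BETWEEN date(s.signup_date, '+7 days') "
                    "AND date(s.signup_date, '+9 days')")
print(classic, bounded)  # 1/3 vs 2/3
```

The same cohort yields 1/3 under the classic definition and 2/3 under the bounded one, which is exactly why the definition must be pinned down first.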

Conversion Funnel

WITH funnel AS (
    SELECT user_id,
        MAX(CASE WHEN event_type = 'page_view' THEN 1 ELSE 0 END) AS viewed,
        MAX(CASE WHEN event_type = 'add_to_cart' THEN 1 ELSE 0 END) AS added,
        MAX(CASE WHEN event_type = 'purchase' THEN 1 ELSE 0 END) AS purchased
    FROM events WHERE DATE(event_time) = '2024-01-15'
    GROUP BY user_id
)
SELECT
    SUM(viewed) AS viewed, SUM(added) AS added_to_cart, SUM(purchased) AS purchased,
    ROUND(SUM(added) * 1.0 / NULLIF(SUM(viewed), 0), 4) AS view_to_cart,
    ROUND(SUM(purchased) * 1.0 / NULLIF(SUM(added), 0), 4) AS cart_to_purchase
FROM funnel;

NULLIF(x, 0) prevents division by zero; an essential habit in interviews.

Year-over-Year Growth

WITH monthly_revenue AS (
    SELECT DATE_TRUNC('month', order_date) AS month, SUM(amount) AS revenue
    FROM orders GROUP BY DATE_TRUNC('month', order_date)
)
SELECT curr.month, curr.revenue,
    prev.revenue AS prev_year,
    ROUND((curr.revenue - prev.revenue) * 100.0 / NULLIF(prev.revenue, 0), 2) AS yoy_pct
FROM monthly_revenue curr
LEFT JOIN monthly_revenue prev ON curr.month = prev.month + INTERVAL '1 year'
ORDER BY curr.month;

Deduplication Patterns

-- ROW_NUMBER: keep most recent record per user
SELECT * FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY updated_at DESC) AS rn
    FROM user_profiles
) deduped WHERE rn = 1;

-- PostgreSQL shorthand
SELECT DISTINCT ON (user_id) * FROM user_profiles ORDER BY user_id, updated_at DESC;
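A runnable check of the ROW_NUMBER dedup pattern (Python sqlite3; the user_profiles rows are toy data):

```python
import sqlite3

# Hypothetical user_profiles with a duplicate: user 1 has an old and a new record.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user_profiles (user_id INTEGER, email TEXT, updated_at TEXT);
    INSERT INTO user_profiles VALUES
        (1, 'old@x.com', '2024-01-01'),
        (1, 'new@x.com', '2024-03-01'),
        (2, 'b@x.com',   '2024-02-01');
""")

# Keep only the most recent row per user via ROW_NUMBER.
rows = conn.execute("""
    SELECT user_id, email FROM (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY user_id ORDER BY updated_at DESC) AS rn
        FROM user_profiles
    ) WHERE rn = 1
    ORDER BY user_id
""").fetchall()
print(rows)  # [(1, 'new@x.com'), (2, 'b@x.com')]
```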

Advanced Patterns

Self Join: Finding Consecutive Events

-- Users who purchased within 24 hours of signing up
SELECT DISTINCT s.user_id
FROM events s
JOIN events p ON s.user_id = p.user_id
    AND p.event_type = 'purchase'
    AND p.event_time BETWEEN s.event_time AND s.event_time + INTERVAL '24 hours'
WHERE s.event_type = 'signup';

CASE WHEN for Pivoting

-- Revenue by category, pivoted into columns
SELECT DATE(order_date) AS dt,
    SUM(CASE WHEN category = 'electronics' THEN amount ELSE 0 END) AS electronics,
    SUM(CASE WHEN category = 'clothing' THEN amount ELSE 0 END) AS clothing,
    SUM(CASE WHEN category = 'food' THEN amount ELSE 0 END) AS food
FROM orders GROUP BY DATE(order_date);

EXISTS vs IN

-- EXISTS (often faster for large subqueries — stops at first match)
SELECT u.user_id, u.name FROM users u
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.user_id = u.user_id);

-- IN (equivalent but can be slower)
SELECT user_id, name FROM users
WHERE user_id IN (SELECT DISTINCT user_id FROM orders);

Gaps and Islands

-- Find consecutive login streaks per user
WITH daily_logins AS (
    SELECT DISTINCT user_id, DATE(login_time) AS login_date FROM logins
),
streaks AS (
    SELECT user_id, login_date,
        login_date - INTERVAL '1 day' * ROW_NUMBER() OVER (
            PARTITION BY user_id ORDER BY login_date
        ) AS streak_group
    FROM daily_logins
)
SELECT user_id, MIN(login_date) AS streak_start,
    MAX(login_date) AS streak_end,
    COUNT(*) AS streak_length
FROM streaks
GROUP BY user_id, streak_group
HAVING COUNT(*) >= 3  -- streaks of 3+ days
ORDER BY streak_length DESC;
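SQLite (3.25+) supports window functions, so the gaps-and-islands trick can be verified directly. The login dates are made up: one 3-day streak plus an isolated login. Since SQLite lacks date-minus-interval arithmetic, the same idea is expressed with its date() modifier syntax:

```python
import sqlite3

# Hypothetical logins: user 1 logs in Jan 1-3 (streak of 3) and again on Jan 10.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE daily_logins (user_id INTEGER, login_date TEXT);
    INSERT INTO daily_logins VALUES
        (1,'2024-01-01'),(1,'2024-01-02'),(1,'2024-01-03'),(1,'2024-01-10');
""")

# Key idea: (login_date - ROW_NUMBER() days) is constant within a run of
# consecutive dates, so it serves as the streak group key.
rows = conn.execute("""
    WITH streaks AS (
        SELECT user_id, login_date,
               date(login_date, '-' || ROW_NUMBER() OVER (
                   PARTITION BY user_id ORDER BY login_date) || ' days') AS grp
        FROM daily_logins
    )
    SELECT user_id, MIN(login_date), MAX(login_date), COUNT(*) AS streak_length
    FROM streaks
    GROUP BY user_id, grp
    ORDER BY MIN(login_date)
""").fetchall()
print(rows)
# [(1, '2024-01-01', '2024-01-03', 3), (1, '2024-01-10', '2024-01-10', 1)]
```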

Sessionization

-- Define sessions: gap > 30 min = new session
WITH events_with_gap AS (
    SELECT *, LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) AS prev_time
    FROM events
),
session_starts AS (
    SELECT *,
        CASE WHEN prev_time IS NULL
             OR event_time - prev_time > INTERVAL '30 minutes'
             THEN 1 ELSE 0 END AS is_new_session
    FROM events_with_gap
),
sessions AS (
    SELECT *, SUM(is_new_session) OVER (
        PARTITION BY user_id ORDER BY event_time
    ) AS session_id
    FROM session_starts
)
SELECT user_id, session_id,
    MIN(event_time) AS session_start,
    MAX(event_time) AS session_end,
    COUNT(*) AS events_in_session
FROM sessions
GROUP BY user_id, session_id;
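A compact sessionization check in sqlite3, using julianday() to compute the minute gap (the 30-minute threshold and the event stream are illustrative):

```python
import sqlite3

# Hypothetical events for one user; a gap > 30 minutes starts a new session.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (user_id INTEGER, event_time TEXT);
    INSERT INTO events VALUES
        (1,'2024-01-01 09:00'),(1,'2024-01-01 09:10'),
        (1,'2024-01-01 10:00'),(1,'2024-01-01 10:05');
""")

rows = conn.execute("""
    WITH flagged AS (
        SELECT user_id, event_time,
            CASE WHEN LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) IS NULL
                 OR (julianday(event_time) - julianday(
                        LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time)
                     )) * 24 * 60 > 30
                 THEN 1 ELSE 0 END AS is_new_session
        FROM events
    )
    SELECT SUM(is_new_session) OVER (
               PARTITION BY user_id ORDER BY event_time) AS session_id,
           event_time
    FROM flagged
    ORDER BY event_time
""").fetchall()
print(rows)
# 09:00 and 09:10 fall in session 1; the 50-minute gap before 10:00 opens session 2.
```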

Performance in Interviews

Interviews rarely go deep on optimization, but you should know: (1) avoid SELECT *, (2) filter on indexed columns in WHERE, (3) filter before JOINing to reduce data volume, (4) prefer EXISTS over a large IN subquery, (5) avoid applying functions to indexed columns (WHERE DATE(ts) = ... forces a full scan). Mentioning these proactively earns points.

Debugging Weird Numbers

When SQL output looks wrong, check in this order:

Step | Check           | Common Issue
1    | Grain           | JOIN fanout? What does one row represent?
2    | Duplicates      | Need DISTINCT? Duplicate events?
3    | NULLs           | NULLs silently dropped by a JOIN or WHERE?
4    | Date boundaries | Inclusive or exclusive filter? Timezone?
5    | Denominator     | Is "total users" the right population?

-- Quick sanity check template
SELECT
    COUNT(*) AS total_rows,
    COUNT(DISTINCT user_id) AS unique_users,
    MIN(event_time) AS earliest,
    MAX(event_time) AS latest,
    COUNT(*) - COUNT(user_id) AS null_user_count
FROM events
WHERE DATE(event_time) = '2024-01-15';

Real-World Use Cases

Case 1: Credit Card Fraud Detection — Feature Engineering in SQL

-- Create fraud detection features from transaction history
WITH user_stats AS (
    SELECT user_id,
        COUNT(*) AS txn_count_30d,
        AVG(amount) AS avg_amount_30d,
        MAX(amount) AS max_amount_30d,
        COUNT(DISTINCT merchant_id) AS unique_merchants_30d,
        SUM(CASE WHEN is_international = 1 THEN 1 ELSE 0 END) AS intl_txn_count
    FROM transactions
    WHERE txn_date >= CURRENT_DATE - INTERVAL '30 days'
    GROUP BY user_id
)
SELECT t.*, u.avg_amount_30d,
    t.amount / NULLIF(u.avg_amount_30d, 0) AS amount_ratio,  -- unusually high?
    u.txn_count_30d, u.intl_txn_count
FROM transactions t
LEFT JOIN user_stats u ON t.user_id = u.user_id
WHERE t.txn_date = CURRENT_DATE;

amount_ratio = this transaction's amount / the user's average amount over the past 30 days. A very high ratio suggests an anomaly, which serves as a fraud signal.

Case 2: Recommendation Systems — Collaborative Filtering Data

-- Item co-purchase matrix for item-based collaborative filtering
WITH user_purchases AS (
    SELECT DISTINCT user_id, product_id
    FROM orders WHERE order_date >= '2024-01-01'
)
SELECT a.product_id AS item_a, b.product_id AS item_b,
    COUNT(DISTINCT a.user_id) AS co_purchase_count
FROM user_purchases a
JOIN user_purchases b ON a.user_id = b.user_id AND a.product_id < b.product_id
GROUP BY a.product_id, b.product_id
HAVING COUNT(DISTINCT a.user_id) >= 5  -- minimum support
ORDER BY co_purchase_count DESC;

Case 3: Customer Segmentation — RFM Analysis

-- RFM (Recency, Frequency, Monetary) segmentation
WITH rfm AS (
    SELECT user_id,
        CURRENT_DATE - MAX(order_date) AS recency_days,
        COUNT(DISTINCT order_id) AS frequency,
        SUM(amount) AS monetary
    FROM orders
    WHERE order_date >= CURRENT_DATE - INTERVAL '1 year'
    GROUP BY user_id
),
rfm_scored AS (
    SELECT *,
        NTILE(4) OVER (ORDER BY recency_days ASC) AS r_score,   -- low recency = recent = good
        NTILE(4) OVER (ORDER BY frequency DESC) AS f_score,
        NTILE(4) OVER (ORDER BY monetary DESC) AS m_score
    FROM rfm
)
SELECT *,
    CASE
        WHEN r_score >= 3 AND f_score >= 3 THEN 'VIP'
        WHEN r_score >= 3 AND f_score <= 2 THEN 'New/Promising'
        WHEN r_score <= 2 AND f_score >= 3 THEN 'At Risk'
        ELSE 'Dormant'
    END AS segment
FROM rfm_scored;

Interview Signals

What interviewers listen for:

  • You have grain awareness: you know the grain of every table and every metric
  • You define the business logic first, then write the SQL
  • You can explain why the same conversion rate can mean different things across cohorts
  • When numbers look wrong, you have a systematic debugging process
  • You proactively use NULLIF to avoid division by zero and DISTINCT to avoid double counting

Practice

Flashcards


Why must you confirm the grain before computing a metric?

Answering a user-level question directly with event-level data causes double counting or a distorted denominator: one user with 5 events gives COUNT(*) = 5, not 1. A JOIN can also turn a 1-to-1 grain into 1-to-many and inflate counts. Always ask first: "what does one row represent?"


Quiz


What is the most direct SQL approach to find each user's most recent purchase date?
