返回 筑基・数据元府藏真

高级查询技巧:窗口函数与递归查询

博主
大约 14 分钟

高级查询技巧:窗口函数与递归查询

问题引入:复杂的排名需求

-- 需求:查询每个用户最近3笔订单
-- 传统方式:子查询 + JOIN,性能差
SELECT o1.*
FROM orders o1
WHERE (
    SELECT COUNT(*)
    FROM orders o2
    WHERE o2.user_id = o1.user_id
    AND o2.created_at >= o1.created_at
) <= 3;
-- 执行时间:15秒,全表扫描多次

-- 窗口函数方案:简洁高效
SELECT *
FROM (
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC) as rn
    FROM orders
) t
WHERE rn <= 3;
-- 执行时间:500ms,单次扫描

窗口函数让复杂的分析查询变得简单高效,是现代SQL的必备技能。

现象描述:传统方案的痛点

场景1:分组Top N问题

-- 需求:每个班级成绩前3名的学生
-- 传统方案:关联子查询,性能极差
SELECT s1.*
FROM students s1
WHERE (
    SELECT COUNT(DISTINCT score)
    FROM students s2
    WHERE s2.class_id = s1.class_id
    AND s2.score > s1.score
) < 3;
-- 执行时间:30秒(10万学生)

-- 窗口函数方案
SELECT *
FROM (
    SELECT *,
        DENSE_RANK() OVER (PARTITION BY class_id ORDER BY score DESC) as rank
    FROM students
) t
WHERE rank <= 3;
-- 执行时间:200ms

场景2:累计求和

-- 需求:计算每日累计销售额
-- 传统方案:自连接,复杂度O(n²)
SELECT 
    a.date,
    SUM(b.amount) as cumulative_amount
FROM sales a
JOIN sales b ON b.date <= a.date
GROUP BY a.date;
-- 执行时间:10秒(365天数据)

-- 窗口函数方案
SELECT 
    date,
    amount,
    SUM(amount) OVER (ORDER BY date) as cumulative_amount
FROM sales;
-- 执行时间:50ms

场景3:递归查询层级数据

-- 需求:查询部门树形结构
-- 传统方案:多次查询或应用层组装
-- 需要查询多次数据库,代码复杂

-- 递归CTE方案
WITH RECURSIVE dept_tree AS (
    -- 锚点:顶级部门
    SELECT id, name, parent_id, 0 as level
    FROM departments
    WHERE parent_id IS NULL
    
    UNION ALL
    
    -- 递归:查询子部门
    SELECT d.id, d.name, d.parent_id, dt.level + 1
    FROM departments d
    JOIN dept_tree dt ON d.parent_id = dt.id
)
SELECT * FROM dept_tree;
-- 单次查询,简洁高效

原因分析:窗口函数与递归原理

1. 窗口函数原理

执行过程

窗口函数执行流程:

1. 数据分区(PARTITION BY)
   输入数据 -> [分区1] [分区2] [分区3] ...

2. 窗口排序(ORDER BY)
   每个分区内部排序

3. 窗口计算
   在当前行的窗口范围内计算
   
4. 返回结果
   每行都返回计算结果

与传统聚合函数的区别:
- 聚合函数:多行输入,单行输出
- 窗口函数:多行输入,每行都输出(保留原始行)

窗口范围

-- 窗口范围定义
ROWS BETWEEN ... AND ...
RANGE BETWEEN ... AND ...

-- 常用范围
UNBOUNDED PRECEDING  -- 分区第一行
CURRENT ROW          -- 当前行
UNBOUNDED FOLLOWING  -- 分区最后一行

-- 示例:计算移动平均
SELECT 
    date,
    amount,
    AVG(amount) OVER (
        ORDER BY date 
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) as moving_avg_7d
FROM sales;

2. 递归CTE原理

执行过程

递归CTE执行流程:

1. 执行锚点成员(非递归部分)
   -> 产生初始结果集(种子)

2. 执行递归成员
   -> 基于上一次的结果继续查询
   -> 产生新的结果集

3. 重复步骤2
   -> 直到递归成员返回空结果
   
4. 合并所有结果
   -> 返回最终数据集

注意:必须有终止条件,否则无限循环

递归限制

-- MySQL递归限制参数
SHOW VARIABLES LIKE 'cte_max_recursion_depth';
-- 默认1000层

-- 临时增加限制
SET SESSION cte_max_recursion_depth = 10000;

-- 在CTE中设置限制
WITH RECURSIVE tree AS (
    SELECT id, parent_id, 0 as level
    FROM categories
    WHERE id = 1
    
    UNION ALL
    
    SELECT c.id, c.parent_id, t.level + 1
    FROM categories c
    JOIN tree t ON c.parent_id = t.id
    WHERE t.level < 10  -- 终止条件
)
SELECT * FROM tree;

解决方案:窗口函数实战

1. 排名函数

ROW_NUMBER()

-- 特点:连续排名,无重复
-- 用途:精确排名、分页

-- 示例:用户订单排名
SELECT 
    user_id,
    order_id,
    total_amount,
    ROW_NUMBER() OVER (
        PARTITION BY user_id 
        ORDER BY total_amount DESC
    ) as amount_rank
FROM orders;

-- 结果:
-- user_id | order_id | total_amount | amount_rank
-- 1       | 1001     | 1000.00      | 1
-- 1       | 1002     | 800.00       | 2
-- 1       | 1003     | 600.00       | 3
-- 2       | 2001     | 1200.00      | 1
-- 2       | 2002     | 900.00       | 2

RANK() 和 DENSE_RANK()

-- RANK():跳跃排名
-- 1, 2, 2, 4, 5(有并列时跳号)

-- DENSE_RANK():密集排名
-- 1, 2, 2, 3, 4(有并列不跳号)

-- 示例:学生成绩排名
SELECT 
    student_name,
    score,
    RANK() OVER (ORDER BY score DESC) as rank_jump,
    DENSE_RANK() OVER (ORDER BY score DESC) as rank_dense,
    ROW_NUMBER() OVER (ORDER BY score DESC) as rank_row
FROM scores;

-- 结果:
-- student_name | score | rank_jump | rank_dense | rank_row
-- 张三         | 100   | 1         | 1          | 1
-- 李四         | 95    | 2         | 2          | 2
-- 王五         | 95    | 2         | 2          | 3
-- 赵六         | 90    | 4         | 3          | 4

2. 聚合窗口函数

累计计算

-- 累计求和
SELECT 
    date,
    daily_amount,
    SUM(daily_amount) OVER (ORDER BY date) as cumulative_amount,
    SUM(daily_amount) OVER (
        ORDER BY date 
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) as cumulative_explicit
FROM daily_sales;

-- 年度累计(分区累计)
SELECT 
    year,
    month,
    amount,
    SUM(amount) OVER (
        PARTITION BY year 
        ORDER BY month
    ) as year_cumulative
FROM monthly_sales;

移动计算

-- 7天移动平均
SELECT 
    date,
    amount,
    AVG(amount) OVER (
        ORDER BY date 
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) as avg_7d,
    SUM(amount) OVER (
        ORDER BY date 
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) as sum_7d
FROM daily_sales;

-- 前后值比较
SELECT 
    date,
    amount,
    LAG(amount, 1) OVER (ORDER BY date) as prev_day_amount,
    LEAD(amount, 1) OVER (ORDER BY date) as next_day_amount,
    amount - LAG(amount, 1) OVER (ORDER BY date) as day_over_day_change
FROM daily_sales;

3. 取值窗口函数

FIRST_VALUE 和 LAST_VALUE

-- 分组内第一行和最后一行的值
SELECT 
    user_id,
    order_id,
    total_amount,
    FIRST_VALUE(total_amount) OVER (
        PARTITION BY user_id 
        ORDER BY created_at
    ) as first_order_amount,
    LAST_VALUE(total_amount) OVER (
        PARTITION BY user_id 
        ORDER BY created_at
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) as last_order_amount
FROM orders;

LAG 和 LEAD

-- 前后偏移取值
SELECT 
    date,
    amount,
    -- 前一天
    LAG(amount, 1, 0) OVER (ORDER BY date) as prev_amount,
    -- 前两天
    LAG(amount, 2) OVER (ORDER BY date) as prev_2_amount,
    -- 后一天
    LEAD(amount, 1) OVER (ORDER BY date) as next_amount,
    -- 环比增长率
    (amount - LAG(amount, 1) OVER (ORDER BY date)) / 
    LAG(amount, 1) OVER (ORDER BY date) * 100 as growth_rate
FROM daily_sales;

4. 高级窗口函数

NTILE 分桶

-- 将数据分成N个桶
SELECT 
    student_name,
    score,
    NTILE(4) OVER (ORDER BY score DESC) as quartile,
    CASE NTILE(4) OVER (ORDER BY score DESC)
        WHEN 1 THEN '优秀'
        WHEN 2 THEN '良好'
        WHEN 3 THEN '及格'
        ELSE '不及格'
    END as grade_level
FROM scores;

PERCENT_RANK 和 CUME_DIST

-- 百分比排名
SELECT 
    student_name,
    score,
    PERCENT_RANK() OVER (ORDER BY score) as percentile,
    CUME_DIST() OVER (ORDER BY score) as cumulative_distribution,
    ROUND(PERCENT_RANK() OVER (ORDER BY score) * 100, 2) as percentile_rank
FROM scores;

解决方案:递归查询实战

1. 树形结构查询

查询完整树

-- 部门树查询
WITH RECURSIVE dept_tree AS (
    -- 锚点:顶级部门
    SELECT 
        id, 
        name, 
        parent_id, 
        0 as level,
        CAST(name AS CHAR(1000)) as path
    FROM departments
    WHERE parent_id IS NULL
    
    UNION ALL
    
    -- 递归:查询子部门
    SELECT 
        d.id, 
        d.name, 
        d.parent_id, 
        dt.level + 1,
        CONCAT(dt.path, ' > ', d.name)
    FROM departments d
    JOIN dept_tree dt ON d.parent_id = dt.id
)
SELECT 
    id,
    name,
    level,
    path,
    REPEAT('  ', level) || name as tree_display
FROM dept_tree
ORDER BY path;

查询子树

-- 查询某个部门的所有下级
WITH RECURSIVE sub_dept AS (
    -- 锚点:指定部门
    SELECT id, name, parent_id, 0 as level
    FROM departments
    WHERE id = 5
    
    UNION ALL
    
    -- 递归:查询下级
    SELECT d.id, d.name, d.parent_id, sd.level + 1
    FROM departments d
    JOIN sub_dept sd ON d.parent_id = sd.id
)
SELECT * FROM sub_dept;

查询上级路径

-- 查询某个部门的所有上级
WITH RECURSIVE parent_dept AS (
    -- 锚点:指定部门
    SELECT id, name, parent_id, 0 as level
    FROM departments
    WHERE id = 10
    
    UNION ALL
    
    -- 递归:查询上级
    SELECT d.id, d.name, d.parent_id, pd.level + 1
    FROM departments d
    JOIN parent_dept pd ON d.id = pd.parent_id
)
SELECT * FROM parent_dept ORDER BY level DESC;

2. 层级数据处理

生成数字序列

-- 生成1到1000的数字序列
WITH RECURSIVE numbers AS (
    SELECT 1 as n
    UNION ALL
    SELECT n + 1
    FROM numbers
    WHERE n < 1000
)
SELECT * FROM numbers;

-- 生成日期序列
WITH RECURSIVE dates AS (
    SELECT '2023-01-01' as date
    UNION ALL
    SELECT DATE_ADD(date, INTERVAL 1 DAY)
    FROM dates
    WHERE date < '2023-12-31'
)
SELECT * FROM dates;

字符串拆分

-- 将逗号分隔的字符串拆分成多行
WITH RECURSIVE split AS (
    SELECT 
        'apple,banana,orange' as str,
        SUBSTRING_INDEX('apple,banana,orange', ',', 1) as val,
        SUBSTRING('apple,banana,orange', 
            LENGTH(SUBSTRING_INDEX('apple,banana,orange', ',', 1)) + 2
        ) as remainder
    
    UNION ALL
    
    SELECT 
        remainder,
        SUBSTRING_INDEX(remainder, ',', 1),
        CASE 
            WHEN LOCATE(',', remainder) > 0 THEN
                SUBSTRING(remainder, LENGTH(SUBSTRING_INDEX(remainder, ',', 1)) + 2)
            ELSE ''
        END
    FROM split
    WHERE remainder != ''
)
SELECT val FROM split;

3. 图遍历

最短路径查询

-- 查找两个节点之间的路径(简化版)
WITH RECURSIVE paths AS (
    -- 锚点:起点
    SELECT 
        from_node,
        to_node,
        CAST(from_node AS CHAR(1000)) as path,
        1 as hops
    FROM edges
    WHERE from_node = 'A'
    
    UNION ALL
    
    -- 递归:扩展路径
    SELECT 
        p.from_node,
        e.to_node,
        CONCAT(p.path, '->', e.to_node),
        p.hops + 1
    FROM paths p
    JOIN edges e ON p.to_node = e.from_node
    WHERE p.hops < 5  -- 限制路径长度
      AND LOCATE(e.to_node, p.path) = 0  -- 避免循环
)
SELECT * FROM paths WHERE to_node = 'D' ORDER BY hops;

实施步骤:高级查询应用

步骤1:识别适用场景

窗口函数适用场景:
□ 分组Top N查询
□ 累计/移动计算
□ 行间比较(环比、同比)
□ 分组内排序和编号
□ 百分比和分位数计算

递归CTE适用场景:
□ 树形/层级结构查询
□ 图遍历
□ 序列生成
□ 字符串处理
□ 层级聚合

步骤2:编写和优化

-- 1. 先写基础查询
SELECT * FROM orders WHERE created_at > '2023-01-01';

-- 2. 添加窗口函数
SELECT 
    *,
    ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC) as rn
FROM orders 
WHERE created_at > '2023-01-01';

-- 3. 添加过滤条件
SELECT *
FROM (
    SELECT 
        *,
        ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC) as rn
    FROM orders 
    WHERE created_at > '2023-01-01'
) t
WHERE rn <= 3;

-- 4. 性能优化:确保分区列和排序列有索引
CREATE INDEX idx_user_created ON orders(user_id, created_at);

步骤3:性能验证

-- 查看执行计划
EXPLAIN ANALYZE
SELECT *
FROM (
    SELECT 
        *,
        ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC) as rn
    FROM orders
) t
WHERE rn <= 3;

-- 关注指标:
-- - 是否使用了索引
-- - 是否需要临时表
-- - 执行时间

效果验证:优化前后对比

案例1:分组Top N

方案执行时间扫描次数复杂度
子查询方案15s多次全表扫描O(n²)
窗口函数方案500ms单次扫描O(n)
提升97%大幅减少显著优化

案例2:累计计算

方案执行时间内存使用可读性
自连接方案10s
窗口函数方案50ms
提升99.5%降低显著提升

案例3:树形查询

方案查询次数代码复杂度维护性
应用层组装多次
递归CTE方案1次
提升单次查询简化显著提升

经验总结:高级查询最佳实践

✅ 应该做的

  1. 优先使用窗口函数替代子查询
  2. 为分区列和排序列创建索引
  3. 递归查询设置终止条件
  4. 使用CTE提高可读性
  5. 注意窗口函数的执行顺序
  6. 大数据量时测试性能
  7. 理解不同排名函数的区别
  8. 合理使用窗口范围定义

❌ 不应该做的

  1. 忘记处理递归终止条件
  2. 在窗口函数后错误使用WHERE
  3. 忽视索引对窗口函数的影响
  4. 过度使用递归(深度过大)
  5. 混淆ROW_NUMBER和RANK
  6. 忘记窗口函数后的过滤需要子查询
  7. 在递归中忘记去重或循环检测
  8. 忽视窗口函数的内存开销

深度案例分析:用户行为分析

项目背景

某电商平台需要分析用户购买行为:

  • 用户首次购买时间
  • 用户购买频次变化
  • 用户流失预警

解决方案

-- 1. 计算用户首次购买和最近一次购买
WITH user_purchase_stats AS (
    SELECT 
        user_id,
        MIN(created_at) as first_purchase,
        MAX(created_at) as last_purchase,
        COUNT(*) as total_orders,
        SUM(total_amount) as total_amount
    FROM orders
    GROUP BY user_id
),

-- 2. 计算用户购买频次趋势
user_frequency AS (
    SELECT 
        user_id,
        DATE_FORMAT(created_at, '%Y-%m') as month,
        COUNT(*) as monthly_orders,
        LAG(COUNT(*)) OVER (
            PARTITION BY user_id 
            ORDER BY DATE_FORMAT(created_at, '%Y-%m')
        ) as prev_month_orders
    FROM orders
    GROUP BY user_id, DATE_FORMAT(created_at, '%Y-%m')
),

-- 3. 识别流失用户(3个月未购买)
churned_users AS (
    SELECT 
        user_id,
        last_purchase,
        DATEDIFF(NOW(), last_purchase) as days_since_last,
        CASE 
            WHEN DATEDIFF(NOW(), last_purchase) > 90 THEN 'Churned'
            WHEN DATEDIFF(NOW(), last_purchase) > 60 THEN 'At Risk'
            ELSE 'Active'
        END as user_status
    FROM user_purchase_stats
)

-- 最终查询
SELECT 
    c.user_id,
    c.user_status,
    c.days_since_last,
    s.total_orders,
    s.total_amount,
    s.first_purchase
FROM churned_users c
JOIN user_purchase_stats s ON c.user_id = s.user_id
ORDER BY c.days_since_last DESC;

优化效果

指标优化前优化后提升
查询时间30s2s93% ↓
代码行数200行50行75% ↓
可读性显著提升
维护成本显著降低

窗口函数与递归决策树

需求分析
    ↓
需要分组计算?
    ↓ 是
使用窗口函数
    ↓
需要排名?
    ↓ 是
使用ROW_NUMBER/RANK/DENSE_RANK
    ↓
需要累计计算?
    ↓ 是
使用SUM/AVG OVER
    ↓
需要层级查询?
    ↓ 是
使用递归CTE
    ↓
需要树遍历?
    ↓ 是
使用RECURSIVE WITH
    ↓
完成

常见误区与避坑指南

❌ 误区1:窗口函数后直接使用WHERE

问题

-- 错误:窗口函数后直接过滤
SELECT 
    user_id,
    ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at) as rn
FROM orders
WHERE rn <= 3;  -- 错误!rn还未生成

正确做法

-- 正确:使用子查询或CTE
SELECT *
FROM (
    SELECT 
        user_id,
        ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at) as rn
    FROM orders
) t
WHERE rn <= 3;

❌ 误区2:递归没有终止条件

问题:递归无限循环直到耗尽资源。

正确做法

-- 始终设置终止条件
WITH RECURSIVE tree AS (
    SELECT id, parent_id, 0 as level
    FROM categories
    WHERE id = 1
    
    UNION ALL
    
    SELECT c.id, c.parent_id, t.level + 1
    FROM categories c
    JOIN tree t ON c.parent_id = t.id
    WHERE t.level < 10  -- 终止条件
)
SELECT * FROM tree;

❌ 误区3:混淆排名函数

问题

  • ROW_NUMBER():即使值相同也有不同排名
  • RANK():值相同排名相同,但会跳号
  • DENSE_RANK():值相同排名相同,不跳号

正确做法:根据业务需求选择合适的排名函数。

高级查询检查清单

窗口函数检查:
□ 分区列是否正确?
□ 排序列是否正确?
□ 窗口范围是否合适?
□ 是否需要在子查询中过滤?
□ 相关列是否有索引?

递归CTE检查:
□ 锚点成员是否正确?
□ 递归成员是否正确?
□ 是否有终止条件?
□ 是否会无限循环?
□ 递归深度是否合理?

附录

A. 窗口函数速查表

函数用途示例
ROW_NUMBER()连续排名ROW_NUMBER() OVER (ORDER BY score DESC)
RANK()跳跃排名RANK() OVER (ORDER BY score DESC)
DENSE_RANK()密集排名DENSE_RANK() OVER (ORDER BY score DESC)
LAG()前N行取值LAG(amount, 1) OVER (ORDER BY date)
LEAD()后N行取值LEAD(amount, 1) OVER (ORDER BY date)
FIRST_VALUE()第一行值FIRST_VALUE(amount) OVER (ORDER BY date)
LAST_VALUE()最后一行值LAST_VALUE(amount) OVER (ORDER BY date)
SUM() OVER累计求和SUM(amount) OVER (ORDER BY date)
AVG() OVER移动平均AVG(amount) OVER (ROWS 6 PRECEDING)
NTILE()分桶NTILE(4) OVER (ORDER BY score)

B. 递归CTE模板

-- 树形查询模板
WITH RECURSIVE tree AS (
    -- 锚点
    SELECT id, parent_id, 0 as level, name as path
    FROM table
    WHERE parent_id IS NULL
    
    UNION ALL
    
    -- 递归
    SELECT t.id, t.parent_id, tr.level + 1, 
           CONCAT(tr.path, ' > ', t.name)
    FROM table t
    JOIN tree tr ON t.parent_id = tr.id
    WHERE tr.level < 10
)
SELECT * FROM tree;

-- 序列生成模板
WITH RECURSIVE series AS (
    SELECT 1 as n
    UNION ALL
    SELECT n + 1 FROM series WHERE n < 1000
)
SELECT * FROM series;

读者练习

  1. 思考题:分析你项目中的复杂查询,找出可以使用窗口函数优化的场景。

  2. 实践题:使用窗口函数实现以下查询:

-- 查询每个用户连续登录的最大天数
-- 表结构:user_log(user_id, login_date)
  1. 挑战题:使用递归CTE查询组织架构中每个员工的完整汇报路径。

系列上一篇查询优化器揭秘:统计信息与执行计划

系列下一篇事务隔离级别与并发问题深度剖析

知识点测试

读完文章了?来测试一下你对知识点的掌握程度吧!

评论区

使用 GitHub 账号登录后即可发表评论,支持 Markdown 格式。

如果评论系统无法加载,请确保:

  • 您的网络可以访问 GitHub
  • giscus GitHub App 已安装到仓库
  • 仓库已启用 Discussions 功能