SQL题:未完成率较高的50%用户近三个月答卷情况
这是一道牛客网上SQL进阶图库中的一道困难题目,个人花了近两个小时才通过所有用例。之所以想记录下来是因为这道题算是一个很考验基本功的题目,也不乏一些SQL中的技巧。下面我们逐步分析:
描述
现有用户信息表user_info(uid用户ID,nick_name昵称, achievement成就值, level等级, job职业方向, register_time注册时间):
id | uid | nick_name | achievement | level | job | register_time |
---|---|---|---|---|---|---|
1 | 1001 | 牛客1号 | 3200 | 7 | 算法 | 2020-01-01 10:00:00 |
2 | 1002 | 牛客2号 | 2500 | 6 | 算法 | 2020-01-01 10:00:00 |
3 | 1003 | 牛客3号 | 2200 | 5 | 算法 | 2020-01-01 10:00:00 |
试卷信息表examination_info(exam_id试卷ID, tag试卷类别, difficulty试卷难度, duration考试时长, release_time发布时间):
id | exam_id | tag | difficulty | duration | release_time |
---|---|---|---|---|---|
1 | 9001 | SQL | hard | 60 | 2020-01-01 10:00:00 |
2 | 9002 | SQL | hard | 80 | 2020-01-01 10:00:00 |
3 | 9003 | 算法 | hard | 80 | 2020-01-01 10:00:00 |
4 | 9004 | PYTHON | medium | 70 | 2020-01-01 10:00:00 |
试卷作答记录表exam_record(uid用户ID, exam_id试卷ID, start_time开始作答时间, submit_time交卷时间, score得分):
id | uid | exam_id | start_time | submit_time | score |
---|---|---|---|---|---|
1 | 1001 | 9001 | 2020-01-01 09:01:01 | 2020-01-01 09:21:59 | 90 |
15 | 1002 | 9001 | 2020-01-01 18:01:01 | 2020-01-01 18:59:02 | 90 |
13 | 1001 | 9001 | 2020-01-02 10:01:01 | 2020-01-02 10:31:01 | 89 |
2 | 1002 | 9001 | 2020-01-20 10:01:01 | ||
3 | 1002 | 9001 | 2020-02-01 12:11:01 | ||
5 | 1001 | 9001 | 2020-03-01 12:01:01 | ||
6 | 1002 | 9001 | 2020-03-01 12:01:01 | 2020-03-01 12:41:01 | 90 |
4 | 1003 | 9001 | 2020-03-01 19:01:01 | ||
7 | 1002 | 9001 | 2020-05-02 19:01:01 | 2020-05-02 19:32:00 | 90 |
14 | 1001 | 9002 | 2020-01-01 12:11:01 | ||
8 | 1001 | 9002 | 2020-01-02 19:01:01 | 2020-01-02 19:59:01 | 69 |
9 | 1001 | 9002 | 2020-02-02 12:01:01 | 2020-02-02 12:20:01 | 99 |
10 | 1002 | 9002 | 2020-02-02 12:01:01 | ||
11 | 1002 | 9002 | 2020-02-02 12:01:01 | 2020-02-02 12:43:01 | 81 |
12 | 1002 | 9002 | 2020-03-02 12:11:01 | ||
17 | 1001 | 9002 | 2020-05-05 18:01:01 | ||
16 | 1002 | 9003 | 2020-05-06 12:01:01 |
请统计SQL试卷上未完成率较高的50%用户中,6级和7级用户在有试卷作答记录的近三个月中,每个月的答卷数目和完成数目。按用户ID、月份升序排序。
由示例数据结果输出如下:
uid | start_month | total_cnt | complete_cnt |
---|---|---|---|
1002 | 202002 | 3 | 1 |
1002 | 202003 | 2 | 1 |
1002 | 202005 | 2 | 1 |
解释:各个用户对SQL试卷的未完成数、作答总数、未完成率如下:
uid | incomplete_cnt | total_cnt | incomplete_rate |
---|---|---|---|
1001 | 3 | 7 | 0.4286 |
1002 | 4 | 8 | 0.5000 |
1003 | 1 | 1 | 1.0000 |
1001、1002、1003分别排在1.0、0.5、0.0的位置,因此较高的50%用户(排位<=0.5)为1002、1003;
1003不是6级或7级;
有试卷作答记录的近三个月为202005、202003、202002;
这三个月里1002的作答题数分别为3、2、2,完成数目分别为1、1、1。
###解法:
这道题看起来很复杂,需要我们划分多个步骤,进行多次SQL嵌套才能完成。
**步骤一.**首先需要统计各个用户对SQL试卷的未完成数、作答总数、未完成率。其中需要确保试卷是SQL试卷。需要注意的是,这一步需要考虑多增加一列未完成率排名,排名应该使用开窗函数。SQL写法如下:
select exam_record.uid,
sum(case when submit_time is null then 1 else 0 end) incomplete_cnt,
count(1) total_cnt,
round(sum(case when submit_time is null then 1 else 0 end)/(count(1)), 4) incomplete_rate ,
user_info.level,
row_number() over(order by round(sum(case when submit_time is null then 1 else 0 end)/(count(1)), 4)) r
from exam_record
inner join user_info
on user_info.uid = exam_record.uid
inner join examination_info
on exam_record.exam_id = examination_info.exam_id
where examination_info.tag = 'SQL'
group by exam_record.uid
order by incomplete_rate
下一步则根据上一步所得出的数据筛选出哪些用户未完成率排在前50%且是6级或7级用户,加上将上一步SQL所得出的表命名为表a,可写如下sql进行筛选:
select a.uid from a
where r >= (select floor(count(distinct uid)/2) from exam_record) + 1 and (a.level = 6 or a.level = 7)
此时我们就得出了应该被算入最终统计结果的所有用户uid。
**步骤二.**下一步需要考虑统计用户近三个月的总答题数和完成数。此时需要注意的是需要选出近三个月,因而至少需要一次针对不同用户uid和start_month的排序。代码如下:
select exam_record.uid,
date_format(exam_record.start_time,"%Y%m") start_month,
count(1) over(partition by exam_record.uid, date_format(exam_record.start_time,"%Y%m")) total_cnt,
sum(case when exam_record.submit_time is null then 0 else 1 end) over(partition by exam_record.uid , date_format(exam_record.start_time,"%Y%m")) complete_cnt,
dense_rank() over(partition by exam_record.uid order by date_format(exam_record.start_time,'%Y%m') desc) x
from exam_record
上段代码包含了复杂的开窗,其实主要是针对不同用户uid和start_month进行聚合,统计当月的答题总数total_cnt和当月的总完成数complete_cnt。需要注意的是,我们添加了一次排序使用的是dense_rank()进行排序,目的是同时达到筛选前三个月的数据和去重。将上一个SQL所得出的表命名为表t,SQL写法如下:
select t.uid,t.start_month,t.total_cnt, t.complete_cnt
from t
where t.x <= 3
group by t.uid,t.start_month,t.total_cnt, t.complete_cnt
order by t.uid,t.start_month
以上代码很重要,同时达到去重和选取固定行数的目的,是重要的SQL技巧。
**步骤三.**下面我们将以上两个步骤的所有代码结合起来,得出最终的解:
select t.uid,t.start_month,t.total_cnt, t.complete_cnt /*除去下面注释部分所标注的内容都是步骤二所完成查询*/
from (
select exam_record.uid,
date_format(exam_record.start_time,"%Y%m") start_month,
count(1) over(partition by exam_record.uid , date_format(exam_record.start_time,"%Y%m")) total_cnt,
sum(case when exam_record.submit_time is null then 0 else 1 end) over(partition by exam_record.uid , date_format(exam_record.start_time,"%Y%m")) complete_cnt,
dense_rank() over(partition by exam_record.uid order by date_format(exam_record.start_time,'%Y%m') desc) x
from exam_record
where exam_record.uid in ( /*这里对uid的筛选其实主要是从步骤一中得出的结果中筛选*/
select a.uid from
(
select exam_record.uid,
sum(case when submit_time is null then 1 else 0 end) incomplete_cnt,
count(1) total_cnt,
round(sum(case when submit_time is null then 1 else 0 end)/(count(1)), 4) incomplete_rate ,
user_info.level,
row_number() over(order by round(sum(case when submit_time is null then 1 else 0 end)/(count(1)), 4)) r
from exam_record
inner join user_info
on user_info.uid = exam_record.uid
inner join examination_info
on exam_record.exam_id = examination_info.exam_id
where examination_info.tag = 'SQL'
group by exam_record.uid
order by incomplete_rate
) a
where r >= (select floor(count(distinct uid)/2) from exam_record) + 1 and (a.level = 6 or a.level = 7)
)
) t
where t.x <= 3
group by t.uid,t.start_month,t.total_cnt, t.complete_cnt
order by t.uid,t.start_month
比较复杂,详细查看前两步,才能看懂最终结合的逻辑。