总结
为了突出重点,总结就写在最前面了。从拿到需求开始,我们经历了以下步骤来完成工作:
需求分析
设计测试数据集及测试用例
数据清洗
需要实现
测试
其中数据清洗主要是做了两个工作:
实现的判断依据如下:
实现过程中主要使用了下面的函数:
rank
concat_ws
collect_set
split
unix_timestamp
子查询
需求
概述
一般来说,客户会按照A->B->C->D的顺序来访问页面,而且越到后面的页面访问率就越低,比如A是网站首页,B是列表页,C是详情页,D是支付页,根据常识,很容易知道A页面的访问率是最高的而D则是最低的。
现在有一个需求:使用SQL语句来完成以下的工作:
找出乱序访问的用户访问记录,即不按照A->B->C->D的顺序访问的
找出顺序访问的用户访问记录,即严格按照A->B->C->D的顺序访问的
找出顺序的并符合间隔条件的用户访问记录,即严格按照A->B->C->D的顺序访问的,并且访问A后超过2小时再访问B的。
分析
这个需求表述的不是很清楚,我们再来明确一下各种情况应该怎么处理。
首先要完成这种数据漏斗,必须能够将一个用户的每次访问严格的界定开来,即有办法区分每个用户的每次访问都请求了哪些网址,这项工作显然不应该是我们这次的任务,故假定数据集符合该要求。
其次,还有一些特殊的情况的处理需要进一步明确:
对于第1个问题,按照最后一次访问的时间为准;对于第2个问题,则约定其为乱序访问。
数据集及测试用例
我们并没有现成的数据集可以使用。为了说明问题,需要我们自己创建一个数据集用于测试。
表结构设计
为了说明问题方便,在不影响结论的情况下,应该让表结构尽可能的简单。必须的字段包括:
测试数据
用等价类划分法来给一些测试数据。等价类大概划成下面这样:
基于这样的等价类划分,可以给出下面的测试数据集:
页面无缺失,顺序,超过间隔,无重复页面session_id req_url req_times_01 a.html 2017-03-26 08:00:00s_01 b.html 2017-03-26 10:01:00s_01 c.html 2017-03-26 10:03:00s_01 d.html 2017-03-26 10:04:00页面无缺失,顺序,超过间隔,有重复页面session_id req_url req_times_02 a.html 2017-03-26 08:00:00s_02 a.html 2017-03-26 09:00:00s_02 b.html 2017-03-26 11:01:00s_02 c.html 2017-03-26 11:03:00s_02 d.html 2017-03-26 11:04:00页面无缺失,顺序,间隔不足,无重复页面session_id req_url req_times_03 a.html 2017-03-26 08:00:00s_03 b.html 2017-03-26 09:01:00s_03 c.html 2017-03-26 09:03:00s_03 d.html 2017-03-26 09:04:00页面无缺失,顺序,间隔不足,有重复页面session_id req_url req_times_04 a.html 2017-03-26 08:00:00s_04 a.html 2017-03-26 09:00:00s_04 b.html 2017-03-26 10:01:00s_04 c.html 2017-03-26 10:03:00s_04 d.html 2017-03-26 10:04:00页面无缺失,顺序访问完后乱序,超过间隔,有重复页面session_id req_url req_times_05 a.html 2017-03-26 08:00:00s_05 b.html 2017-03-26 10:01:00s_05 c.html 2017-03-26 10:03:00s_05 d.html 2017-03-26 10:04:00s_05 c.html 2017-03-26 11:04:00页面无缺失,顺序访问完后乱序,间隔不足,有重复页面session_id req_url req_times_06 a.html 2017-03-26 09:00:00s_06 b.html 2017-03-26 10:01:00s_06 c.html 2017-03-26 10:03:00s_06 d.html 2017-03-26 10:04:00s_06 c.html 2017-03-26 11:04:00页面无缺失,其他乱序,超过间隔,有重复页面session_id req_url req_times_07 a.html 2017-03-26 07:00:00s_07 b.html 2017-03-26 10:01:00s_07 c.html 2017-03-26 10:03:00s_07 d.html 2017-03-26 10:04:00s_07 c.html 2017-03-26 11:04:00页面无缺失,其他乱序,超过间隔,无重复页面session_id req_url req_times_08 a.html 2017-03-26 07:00:00s_08 c.html 2017-03-26 10:01:00s_08 b.html 2017-03-26 10:03:00s_08 d.html 2017-03-26 10:04:00页面无缺失,其他乱序,间隔不足,有重复页面session_id req_url req_times_09 a.html 2017-03-26 09:00:00s_09 b.html 2017-03-26 10:01:00s_09 c.html 2017-03-26 10:03:00s_09 d.html 2017-03-26 10:04:00s_09 c.html 2017-03-26 11:04:00页面无缺失,其他乱序,间隔不足,无重复页面session_id req_url req_times_10 a.html 2017-03-26 09:00:00s_10 c.html 2017-03-26 10:01:00s_10 b.html 2017-03-26 10:03:00s_10 d.html 2017-03-26 10:04:00页面缺失,其他乱序,超过间隔,有重复页面session_id req_url req_times_11 a.html 2017-03-26 07:00:00s_11 b.html 2017-03-26 10:03:00s_11 d.html 2017-03-26 10:04:00s_11 d.html 2017-03-26 11:04:00页面缺失,其他乱序,超过间隔,无重复页面session_id req_url req_times_12 a.html 2017-03-26 07:00:00s_12 b.html 2017-03-26 10:03:00s_12 d.html 2017-03-26 10:04:00页面缺失,其他乱序,间隔不足,有重复页面session_id req_url req_times_13 a.html 2017-03-26 07:00:00s_13 d.html 2017-03-26 10:04:00s_13 d.html 2017-03-26 11:04:00页面缺失,其他乱序,间隔不足,无重复页面session_id req_url req_times_14 a.html 2017-03-26 07:00:00s_14 d.html 2017-03-26 10:04:00
测试用例
有了数据集,我们就可以给出下面的测试用例:
数据漏斗
期望输出
实际输出
|
|
|
乱序漏斗 | s_05 s_06 s_07 s_08 s_09 s_10 s_11 s_12 s_13 s_14
|
|
顺序漏斗 | s_01 s_02 s_03 s_04 |
|
顺序间隔漏斗 | s_01 s_02 |
|
准备工作
完成了需求分析以及测试数据和测试用例的准备以后,我们就可以开始实现这几个数据漏斗了。
使用下面的命令在Hive中创建一个表,用于存放用户访问数据。
use test_db;create table t_visitlog(session_id string,req_url string,req_time timestamp)row format delimitedfields terminated by ',';
然后创建一个数据文件,visitlog.data,输入下面的内容:
s_01,a.html,2017-03-26 08:00:00s_01,b.html,2017-03-26 10:01:00s_01,c.html,2017-03-26 10:03:00s_01,d.html,2017-03-26 10:04:00s_02,a.html,2017-03-26 08:00:00s_02,a.html,2017-03-26 09:00:00s_02,b.html,2017-03-26 11:01:00s_02,c.html,2017-03-26 11:03:00s_02,d.html,2017-03-26 11:04:00s_03,a.html,2017-03-26 08:00:00s_03,b.html,2017-03-26 09:01:00s_03,c.html,2017-03-26 09:03:00s_03,d.html,2017-03-26 09:04:00s_04,a.html,2017-03-26 08:00:00s_04,a.html,2017-03-26 09:00:00s_04,b.html,2017-03-26 10:01:00s_04,c.html,2017-03-26 10:03:00s_04,d.html,2017-03-26 10:04:00s_05,a.html,2017-03-26 08:00:00s_05,b.html,2017-03-26 10:01:00s_05,c.html,2017-03-26 10:03:00s_05,d.html,2017-03-26 10:04:00s_05,c.html,2017-03-26 11:04:00s_06,a.html,2017-03-26 09:00:00s_06,b.html,2017-03-26 10:01:00s_06,c.html,2017-03-26 10:03:00s_06,d.html,2017-03-26 10:04:00s_06,c.html,2017-03-26 11:04:00s_07,a.html,2017-03-26 07:00:00s_07,b.html,2017-03-26 10:01:00s_07,c.html,2017-03-26 10:03:00s_07,d.html,2017-03-26 10:04:00s_07,c.html,2017-03-26 11:04:00s_08,a.html,2017-03-26 07:00:00s_08,c.html,2017-03-26 10:01:00s_08,b.html,2017-03-26 10:03:00s_08,d.html,2017-03-26 10:04:00s_09,a.html,2017-03-26 09:00:00s_09,b.html,2017-03-26 10:01:00s_09,c.html,2017-03-26 10:03:00s_09,d.html,2017-03-26 10:04:00s_09,c.html,2017-03-26 11:04:00s_10,a.html,2017-03-26 09:00:00s_10,c.html,2017-03-26 10:01:00s_10,b.html,2017-03-26 10:03:00s_10,d.html,2017-03-26 10:04:00s_11,a.html,2017-03-26 07:00:00s_11,b.html,2017-03-26 10:03:00s_11,d.html,2017-03-26 10:04:00s_11,d.html,2017-03-26 11:04:00s_12,a.html,2017-03-26 07:00:00s_12,b.html,2017-03-26 10:03:00s_12,d.html,2017-03-26 10:04:00s_13,a.html,2017-03-26 07:00:00s_13,d.html,2017-03-26 10:04:00s_13,d.html,2017-03-26 11:04:00s_14,a.html,2017-03-26 07:00:00s_14,d.html,2017-03-26 10:04:00
然后使用下面的命令将其上传到hdfs中。
hadoop fs -put visitlog.data /user/hive/warehouse/test_db.db/t_visitlog/
然后使用下面的SQL语句检查是否上传成功:
select * from t_visitlog;
我这边是上传成功的。到这里我们就完成了准备工作。
数据清洗
接下来,我们要将数据进行一些处理,使得后面的工作更容易开展,主要工作包括:
去掉重复的页面记录
使用下面的SQL语句即可实现去掉每一次访问中重复的页面记录,同时只保留每个页面最后一次访问记录。
select session_id,req_url,req_time from( select session_id,req_url,req_time,rank() over(partition by session_id,req_url order by req_time desc) as rank
from( select session_id,req_url,req_time from t_visitlog
distribute by session_id,req_url
sort by session_id,req_time desc
)a
)bwhere rank = 1;
为了粘贴方便,这里就不缩进了。
为了方便后续的操作,我们使用下面的SQL语句将查询结果放到一个新的表t_vlog_norepeat中。
使用下面的命令来查看一下新创建的表的结构:
desc t_vlog_norepeat;
输出结果如下:
session_id string req_url string req_time timestamp
是符合我们要求的。
合并访问记录
使用下面的命令将访问记录保存到t_vlog_merge表中去。
create table t_vlog_merge asselect session_id,concat_ws(',',collect_set(vr)) as vtext
from( select distinct session_id,concat_ws('_',cast(req_time as string),req_url) as vr
from ( select session_id,req_url,req_time from t_vlog_norepeat sort by session_id,req_time asc
)a
) tempgroup by session_id;
结果如下:
s_01 2017-03-26 08:00:00_a.html,2017-03-26 10:01:00_b.html,2017-03-26 10:03:00_c.html,2017-03-26 10:04:00_d.htmls_02 2017-03-26 09:00:00_a.html,2017-03-26 11:01:00_b.html,2017-03-26 11:03:00_c.html,2017-03-26 11:04:00_d.htmls_03 2017-03-26 08:00:00_a.html,2017-03-26 09:01:00_b.html,2017-03-26 09:03:00_c.html,2017-03-26 09:04:00_d.htmls_04 2017-03-26 09:00:00_a.html,2017-03-26 10:01:00_b.html,2017-03-26 10:03:00_c.html,2017-03-26 10:04:00_d.htmls_05 2017-03-26 08:00:00_a.html,2017-03-26 10:01:00_b.html,2017-03-26 10:04:00_d.html,2017-03-26 11:04:00_c.htmls_06 2017-03-26 09:00:00_a.html,2017-03-26 10:01:00_b.html,2017-03-26 10:04:00_d.html,2017-03-26 11:04:00_c.htmls_07 2017-03-26 07:00:00_a.html,2017-03-26 10:01:00_b.html,2017-03-26 10:04:00_d.html,2017-03-26 11:04:00_c.htmls_08 2017-03-26 07:00:00_a.html,2017-03-26 10:01:00_c.html,2017-03-26 10:03:00_b.html,2017-03-26 10:04:00_d.htmls_09 2017-03-26 09:00:00_a.html,2017-03-26 10:01:00_b.html,2017-03-26 10:04:00_d.html,2017-03-26 11:04:00_c.htmls_10 2017-03-26 09:00:00_a.html,2017-03-26 10:01:00_c.html,2017-03-26 10:03:00_b.html,2017-03-26 10:04:00_d.htmls_11 2017-03-26 07:00:00_a.html,2017-03-26 10:03:00_b.html,2017-03-26 11:04:00_d.htmls_12 2017-03-26 07:00:00_a.html,2017-03-26 10:03:00_b.html,2017-03-26 10:04:00_d.htmls_13 2017-03-26 07:00:00_a.html,2017-03-26 11:04:00_d.htmls_14 2017-03-26 07:00:00_a.html,2017-03-26 10:04:00_d.html
实现
实现乱序漏斗
我们来分析一下乱序漏斗的几个特征,第1个是按照”,”分割后长度不为4,第2个就是按照”,”分割后页面顺序不对。
使用下面的代码即可实现乱序漏斗:
select session_id
from( select session_id,vtext,split(vtext,',') as arr from t_vlog_merge
) tempwhere size(arr) != 4 or split(arr[0],'_')[1] != 'a.html'or split(arr[1],'_')[1] != 'b.html' or split(arr[2],'_')[1] != 'c.html'or split(arr[3],'_')[1] != 'd.html';
输出结果为:
s_05s_06
s_07
s_08
s_09
s_10
s_11
s_12
s_13
s_14
实现顺序漏斗
使用下面的代码即可实现顺序漏斗:
select session_id
from( select session_id,vtext,split(vtext,',') as arr from t_vlog_merge
) tempwhere size(arr) == 4 and split(arr[0],'_')[1] == 'a.html'and split(arr[1],'_')[1] == 'b.html' and split(arr[2],'_')[1] == 'c.html'and split(arr[3],'_')[1] == 'd.html';
输出结果:
s_01s_02
s_03
s_04
实现顺序间隔漏斗
使用下面的代码可以实现顺序间隔漏斗(访问A页面2小时以后访问B):
select session_id
from( select session_id,vtext,split(vtext,',') as arr from t_vlog_merge
) tempwhere size(arr) == 4 and split(arr[0],'_')[1] == 'a.html'and split(arr[1],'_')[1] == 'b.html' and split(arr[2],'_')[1] == 'c.html'and split(arr[3],'_')[1] == 'd.html' and unix_timestamp(split(arr[1],'_')[0]) - unix_timestamp(split(arr[0],'_')[0]) > 7200;
输出结果:
s_01s_02
测试
上面已经测试过了,测试结果如下:
数据漏斗
期望输出
实际输出
|
|
|
乱序漏斗 | s_05 s_06 s_07 s_08 s_09 s_10 s_11 s_12 s_13 s_14
| s_05 s_06 s_07 s_08 s_09 s_10 s_11 s_12 s_13 s_14
|
顺序漏斗 | s_01 s_02 s_03 s_04 | s_01 s_02 s_03 s_04 |
顺序间隔漏斗 | s_01 s_02 | s_01 s_02
|
本文作者:mrbcy 来源:voidcn
CIO之家 www.ciozj.com 微信公众号:imciow