本文共 6438 字,大约阅读时间需要 21 分钟。
三个字段的意思:用户名uid,月份month,访问次数countA,2015-01,5A,2015-01,15B,2015-01,5A,2015-01,8B,2015-01,25A,2015-01,5A,2015-02,4A,2015-02,6B,2015-02,10B,2015-02,5A,2015-03,16A,2015-03,22B,2015-03,23B,2015-03,10B,2015-03,11
最后结果展示:
用户 月份 最大访问次数 总访问次数 当月访问次数A 2015-01 33 33 33A 2015-02 33 43 10A 2015-03 38 81 38B 2015-01 30 30 30B 2015-02 30 45 15B 2015-03 44 89 44
建表与导入数据
create table if not exists interview01(userId string,month string,count int) row format delimited fields terminated by ","; load data local inpath "/home/hadoop/hive_data/interview01" into table interview01;
辅助SQL:
先按userid和month分组,求出总次数counts
select userid,month,sum(count)counts from interview01 group by userid,month
最终SQL:
使用窗口函数求截止到当月的最大访问次数,直接从上表里面查
select userid,month,max(counts) over(distribute by userid sort by month)maxcount,sum(counts) over(distribute by userid sort by month)sumcount,countsfrom (select userid,month,sum(count)counts from interview01 group by userid,month)tmp;
建表语句:
CREATE TABLE `course` ( `id` int(11) NOT NULL AUTO_INCREMENT PRIMARY KEY, `sid` int(11) DEFAULT NULL, `course` varchar(255) DEFAULT NULL, `score` int(11) DEFAULT NULL ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
插入数据
字段解释:id, 学号, 课程, 成绩INSERT INTO `course` VALUES (1, 1, 'yuwen', 43);INSERT INTO `course` VALUES (2, 1, 'shuxue', 55);INSERT INTO `course` VALUES (3, 2, 'yuwen', 77);INSERT INTO `course` VALUES (4, 2, 'shuxue', 88);INSERT INTO `course` VALUES (5, 3, 'yuwen', 98);INSERT INTO `course` VALUES (6, 3, 'shuxue', 65);
辅助SQL:
这题可以看做为一到mysql题,因为每个字段都只有一个成绩,所以这里我们要对没有的字段进行补0操作
selectsid,case coursewhen "yuwen" then score else 0 end yuwen,case course when "shuxue" then score else 0 end shuxuefrom course
然后我们从上表中操作语文和数学的最大值,然后作为一个表,直接进行查询
selectsid,shuxue,yuwenfrom(selectsid,sum(yuwen) yuwen,max(shuxue) shuxuefrom(selectsid,case coursewhen "yuwen" then score else 0 end yuwen,case course when "shuxue" then score else 0 end shuxuefrom course) agroup by sid) b where b.shuxue>b.yuwen;
结果
+------+--------+-------+| sid | shuxue | yuwen |+------+--------+-------+| 1 | 55 | 43 || 2 | 88 | 77 |+------+--------+-------+
2014010114201401021620140103172014010410201401050620120106092012010732201201081220120109192012011023200101011620010102122001010310200101041120010105292013010619201301072220130108122013010929201301102320080101052008010216200801033720080104142008010516200701061920070107122007010812200701099920070110232010010114201001021620100103172010010410201001050620150106492015010722201501081220150109992015011023
建表与导入数据
create table if not exists interview03(data string);load data local inpath "/home/hadoop/hive_data/interview03" into table interview03;
辅助SQL:
先求每一年的最大温度
selectsubstr(data,1,4) year,max(cast(substr(data,-2,2) as int))max_tempfrom interview03group by substr(data,1,4)
然后将上表与原表连接,连接条件是年份与气温符合的,然后再找出那天是哪一天
selecta.year year,a.max_temp max_temp,substr(b.data,1,8) dayfrom(selectsubstr(data,1,4) year,max(cast(substr(data,-2,2) as int))max_tempfrom interview03group by substr(data,1,4))a joininterview03 b ona.year=substr(b.data,1,4) and a.max_temp=cast(substr(b.data,-2,2) as int);
结果
year max_temp day2001 29 200101052007 99 200701092008 37 200801032010 17 201001032012 32 201201072013 29 201301092014 17 201401032015 99 20150109
id course 1,a 1,b 1,c 1,e 2,a 2,c 2,d 2,f 3,a 3,b 3,c 3,e
编写Hive的HQL语句来实现以下结果:
表中的1表示选修,表中的0表示未选修id a b c d e f1 1 1 1 0 1 02 1 0 1 1 0 13 1 1 1 0 1 0
建表与导入数据
create table if not exists interview04(id int,course string) row format delimited fields terminated by ","; load data local inpath "/home/hadoop/hive_data/interview04" into table interview04;
首先得到所有备选课程
select collect_set(trim(course))courses from interview04;
然后求每一个学员选修的课程
selectid,collect_set(trim(course)) id_course from interview04group by id;
这里要使用笛卡尔积关联(a表中的每一天数据关联b表所有数据),从原表中查询course字段放入set集合中,然后与上表关联,得到已选课程与备选课程
首先修改笛卡尔积参数
set hive.strict.checks.cartesian.product=false;set hive.mapred.mode=nonstrict;辅助SQL:
selectb.id id,b.id_course id_course,a.courses coursesfrom(select collect_set(trim(course))courses from interview04) a join(selectid,collect_set(trim(course)) id_course from interview04group by id) b;
id id_course courses1 ["a","b","c","e"] ["a","b","c","e","d","f"]2 ["a","c","d","f"] ["a","b","c","e","d","f"]3 ["a","b","c","e"] ["a","b","c","e","d","f"]
最终SQL:
判断a表与b表中元素的包含关系即可
selectid,if(array_contains(id_course,courses[0]),1,0) a,if(array_contains(id_course,courses[1]),1,0) b,if(array_contains(id_course,courses[2]),1,0) c,if(array_contains(id_course,courses[3]),1,0) e,if(array_contains(id_course,courses[4]),1,0) d,if(array_contains(id_course,courses[5]),1,0) ffrom(selectb.id id,b.id_course id_course,a.courses coursesfrom(select collect_set(trim(course))courses from interview04) a join(selectid,collect_set(trim(course)) id_course from interview04group by id) b) c;
最终结果
id a b c e d f1 1 1 1 1 0 02 1 0 1 0 1 13 1 1 1 1 0 0
uid month salea,01,150a,01,200b,01,1000b,01,800c,01,250c,01,220b,01,6000a,02,2000a,02,3000b,02,1000b,02,1500c,02,350c,02,280a,03,350a,03,250
建表与导入数据
create table if not exists interview05(uid string,month string,sale int) row format delimited fields terminated by ",";load data local inpath "/home/hadoop/hive_data/interview05" into table interview05;
最终SQL:
与第一题类似,先求总销售额,再进行开窗算出累计到当月的总销售额
select uid,month,sum(sales) over(distribute by uid sort by month)sumsales,salesfrom (select uid,month,sum(sale)sales from interview05 group by uid,month)tmp;
结果:
uid month sumsales salesa 01 350 350a 02 5350 5000a 03 5950 600b 01 7800 7800b 02 10300 2500c 01 470 470c 02 1100 630
转载地址:http://owgzi.baihongyu.com/