How Hive Implements Queries
Updated by: HHH  Date: 2023-1-7

This article covers how to write queries in Hive. I found it quite practical, so I am sharing it here as a reference.

1. Queries

Official reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select

1.1 Full syntax

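SELECT [ALL | DISTINCT] col1, col2, col3...
-- ALL is the default and returns every row; DISTINCT removes duplicate rows (duplicates are judged on the selected columns)
  FROM table_reference      -- table to query
  [WHERE where_condition]   -- row filter
  [GROUP BY col_list]       -- group by one or more columns
  [HAVING having_condition] -- filter the groups produced by GROUP BY
  [ORDER BY col_list]       -- global sort
  [DISTRIBUTE BY col_list] [SORT BY col_list] -- partition rows across reducers, then sort within each partition
  [CLUSTER BY col_list]     -- partition and sort on the same columns
  [LIMIT number]            -- limit the number of rows returned (useful for paging)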

SQL execution order: from -> join -> where -> group by -> aggregate functions such as count(*) -> having -> select -> order by -> limit
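
The order above explains which clause can see what. The query below is a minimal sketch, assuming the emp table used later in this article (with deptno and sal columns); the thresholds are illustrative only.

```sql
-- WHERE runs before GROUP BY, so it filters individual rows and cannot use aggregates.
-- HAVING runs after GROUP BY, so it filters whole groups.
-- ORDER BY runs after SELECT, so it can refer to the avg_sal alias.
select deptno, avg(sal) as avg_sal
from emp
where sal > 1000            -- row-level filter, applied first
group by deptno
having avg(sal) > 2000      -- group-level filter, applied after aggregation
order by avg_sal desc
limit 3;
```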

1.2 Basic queries

1.2.1 Arithmetic operators

1.2.2 Comparison operators

1.2.3 Logical operators

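1.3 Grouping

1.3.1 group by

GROUP BY is usually used together with aggregate functions: it groups the result by one or more columns and then applies the aggregate to each group.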
select t.deptno, t.job, max(t.sal) max_sal 
from emp t 
group by t.deptno, t.job;

Note: once GROUP BY is used, the SELECT list may contain only columns that appear in the GROUP BY clause, plus aggregate functions.
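
To illustrate the rule in the note, here is a small sketch against the same emp table. The ename column is taken from the join examples later in this article; the aliases are made up for illustration.

```sql
-- Rejected by Hive: ename is neither in GROUP BY nor inside an aggregate.
-- select deptno, ename, max(sal) from emp group by deptno;

-- Accepted: every non-aggregated column is listed in GROUP BY.
select deptno, max(sal) as max_sal, count(*) as emp_cnt
from emp
group by deptno;

-- If the ungrouped values are still needed, aggregate them explicitly,
-- for example with collect_set, which returns the distinct values as an array.
select deptno, collect_set(ename) as enames
from emp
group by deptno;
```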

1.3.2 having

-- How HAVING differs from WHERE:
-- (1) Aggregate functions cannot be used in WHERE, but they can be used in HAVING.
-- (2) HAVING is used only in GROUP BY aggregation statements.
select deptno, avg(sal) avg_sal 
from emp
group by deptno
having avg_sal > 2000;
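
One way to see what HAVING does is to rewrite it as a WHERE filter over an aggregated subquery. This is a sketch of an equivalent form, not a recommendation to prefer it; the subquery alias t is arbitrary.

```sql
-- Aggregate first in a subquery, then filter the aggregated rows with WHERE.
-- This should return the same rows as the HAVING query above.
select t.deptno, t.avg_sal
from (
  select deptno, avg(sal) as avg_sal
  from emp
  group by deptno
) t
where t.avg_sal > 2000;
```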

1.4 join on

1.4.1 Inner join

-- Only rows that have a match for the join condition in both tables are kept
select e.empno, e.ename, d.deptno 
from emp e 
inner join dept d   -- the INNER keyword is optional
on e.deptno = d.deptno;

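1.4.2 Left outer join

-- Every row from the table on the left of the JOIN operator that satisfies the WHERE clause is returned; columns from the right table are NULL where no match exists
select e.*, d.dname, d.loc
from emp e
left join dept d
on e.deptno=d.deptno;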
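
Because unmatched left rows come back with NULL in the right-hand columns, a left join can also be used to find rows that have no match at all. A minimal sketch, assuming the same emp and dept tables:

```sql
-- Employees whose deptno has no matching row in dept.
-- After the left join, d.deptno is NULL exactly for the unmatched emp rows.
select e.empno, e.ename, e.deptno
from emp e
left join dept d
on e.deptno = d.deptno
where d.deptno is null;
```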

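1.4.3 Right outer join

-- Every row from the table on the right of the JOIN operator that satisfies the WHERE clause is returned; columns from the left table are NULL where no match exists
select e.*, d.*
from emp e
right join dept d
on e.deptno=d.deptno;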

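1.4.4 Full outer join

-- Returns every row from both tables that satisfies the WHERE clause

-- Method 1:
select e.*, d.*
from dept d
full join emp e
on d.deptno=e.deptno;

-- Method 2: union of a left join and a right join.
-- UNION (not UNION ALL) is needed here; otherwise each matched row appears twice.
select e.empno, e.ename, d.dname
from dept d
left join emp e
on d.deptno=e.deptno

union

select e.empno, e.ename, d.dname
from dept d
right join emp e
on d.deptno=e.deptno;

-- union: stacks two result sets vertically and removes duplicate rows
-- union all: stacks two result sets vertically and keeps duplicates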
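
The difference between the two operators is easiest to see on a single column. This sketch assumes the emp and dept tables both have a deptno column, as in the joins above.

```sql
-- union all: one output row per input row, so a deptno shared by
-- several employees is repeated.
select deptno from emp
union all
select deptno from dept;

-- union: the combined result is de-duplicated, so each deptno appears once.
select deptno from emp
union
select deptno from dept;
```

UNION ALL skips the de-duplication step, so it is the cheaper choice when duplicates are acceptable or cannot occur.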

1.5 Sorting

1.5.1 order by

-- Global sort; uses a single reducer
-- asc: ascending (default)
-- desc: descending

select  * from emp
order by sal desc;
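
Since ORDER BY funnels everything through one reducer, it is commonly paired with LIMIT when only the top rows are needed. A minimal sketch on the same emp table; the column choice and the limit of 10 are illustrative.

```sql
-- Sort by department ascending, then by salary descending within each department,
-- and return only the first 10 rows of the globally sorted result.
select ename, deptno, sal
from emp
order by deptno asc, sal desc
limit 10;
```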

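1.5.2 sort by & distribute by

-- distribute by (partitioning) and sort by (sorting within each partition)

Partition by department number, then sort by employee number in descending order.
-- Set the number of reducers
set mapreduce.job.reduces=3; -- default is -1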
insert overwrite local directory '/opt/module/hive/datas/distribute-result'
select * from emp 
distribute by deptno sort by empno desc;

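Note:
-- DISTRIBUTE BY takes the hash code of the distribution column modulo the number of reducers; rows with the same remainder go to the same partition.
-- Hive requires DISTRIBUTE BY to be written before SORT BY.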
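
The hash-and-modulo rule can be approximated in a query with Hive's built-in hash and pmod functions. This is a sketch only: the value 3 mirrors the reducer count set above, approx_partition is a made-up alias, and the real partitioner may treat negative hash codes differently from pmod.

```sql
-- For each department number, estimate which of the 3 reducers it would be sent to.
-- Rows with the same approx_partition value end up in the same output file.
select distinct deptno, pmod(hash(deptno), 3) as approx_partition
from emp;
```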

1.5.3 cluster by

-- When DISTRIBUTE BY and SORT BY use the same column, CLUSTER BY can be used instead
select * from emp cluster by deptno;
select * from emp distribute by deptno sort by deptno;

Note:
-- CLUSTER BY combines the behavior of DISTRIBUTE BY and SORT BY, but it always sorts in ascending order; ASC or DESC cannot be specified.
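
When a descending sort on the partitioning column is needed, the explicit two-clause form is the way around this limitation. A minimal sketch:

```sql
-- CLUSTER BY cannot take DESC, so spell out both clauses instead.
select * from emp
distribute by deptno
sort by deptno desc;
```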

1.6 Multi-dimensional aggregation

1.6.1 grouping sets

-- group by a,b,c grouping sets((a,b),c)
-- is equivalent to (group by a,b) union all (group by c), with NULL filling the columns absent from each set

select 
  region,school,class,count(1)
from school
group by region,school,class grouping sets(region,school,class);

+---------+---------+----------+------+
| region  | school  | class    | _c3  |
+---------+---------+----------+------+
| NULL    | NULL    | 三年一班   | 5    |
| NULL    | NULL    | 坦克一班   | 6    |
| NULL    | NULL    | 大数据一班 | 4    |
| NULL    | NULL    | 小学生一班 | 4    |
| NULL    | NULL    | 法师一班   | 4    |
| NULL    | 宝安中学 | NULL      | 4    |
| NULL    | 王者峡谷 | NULL      | 10   |
| NULL    | 黄田小学 | NULL      | 4    |
| NULL    | 龙华小学 | NULL      | 5    |
| 宝安区   | NULL    | NULL     | 8    |
| 王者区   | NULL    | NULL     | 10   |
| 龙华区   | NULL    | NULL     | 5    |
+---------+---------+----------+------+
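
The result above can be reproduced by hand with three separate GROUP BY queries, which shows where the NULLs come from. This sketch assumes region, school, and class are string columns; cnt is an illustrative alias.

```sql
-- One GROUP BY per grouping set; columns not in the set are filled with NULL.
select region, cast(null as string) as school, cast(null as string) as class, count(1) as cnt
from school
group by region
union all
select cast(null as string), school, cast(null as string), count(1)
from school
group by school
union all
select cast(null as string), cast(null as string), class, count(1)
from school
group by class;
```

The grouping sets form expresses the same result in a single statement and is usually the more convenient way to write it.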

1.6.2 with cube

-- group by a,b,c with cube aggregates over every combination of a, b, c and unions the results
-- equivalent to the union of group by: (), a, b, c, ab, ac, bc, abc
select 
    region,class, school,count(1)
from school 
group by region,class, school with cube;
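
Written out with grouping sets, the cube query above expands to all eight combinations, including the empty set for the grand total. A sketch of the equivalent form (cnt is an illustrative alias):

```sql
select region, class, school, count(1) as cnt
from school
group by region, class, school
grouping sets (
  (region, class, school),
  (region, class), (region, school), (class, school),
  (region), (class), (school),
  ()   -- grand total over all rows
);
```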

1.6.3 with rollup

-- group by a,b,c with rollup
-- equivalent to the union of group by: (), a, ab, abc
select 
    region,class, school,count(1)
from school 
group by region,class, school with rollup;
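
Rollup keeps only the hierarchical prefixes of the column list, so its grouping sets expansion is shorter than the cube one. A sketch of the equivalent form (cnt is an illustrative alias):

```sql
select region, class, school, count(1) as cnt
from school
group by region, class, school
grouping sets (
  (region, class, school),
  (region, class),
  (region),
  ()   -- grand total over all rows
);
```

The column order in GROUP BY therefore matters for rollup: it determines which subtotals are produced.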

Thanks for reading! That wraps up "How Hive Implements Queries". I hope the material above is helpful.