
2020-12-03: Presto Distributed SQL Query Engine (kkb notes review)

Author: Internet

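Presto Distributed SQL Query Engine

I. Before Class

  1. JDK requirement: Java 8 Update 151 or higher (8u151+), 64-bit

  2. A working Hadoop cluster

  3. A working Hive installation

II. Topics

  1. Introduction to Presto

  2. Presto architecture

  3. Installing and deploying Presto

  4. Using Presto

III. Goals

  1. Understand Presto

  2. Install and deploy Presto on your own

  3. Use Presto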

 

IV. Key Points

1. What is Presto?

2. Presto architecture

3. Presto characteristics

1. Advantages

2. Disadvantages

3. Presto compared with Impala

 

4. Installing and deploying Presto

Official site: https://prestodb.io/

GitHub address

Presto cluster plan

Hostname    Role
node01      coordinator
node02      worker
node03      worker

1. Install and deploy Presto Server

1. Download the package

https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.237/presto-server-0.237.tar.gz

Then upload the tar.gz package to the /kkb/soft directory on node01.

2. Extract

cd /kkb/soft/
tar -xzvf presto-server-0.237.tar.gz -C /kkb/install/

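3. Configure Java

# Extract the JDK on node01 and copy it to the other nodes
cd /kkb/soft/
tar -xzvf jdk-8u251-linux-x64.tar.gz -C /kkb/install/
cd /kkb/install/
scp -r jdk1.8.0_251/ node02:$PWD
scp -r jdk1.8.0_251/ node03:$PWD
# Create a version-independent symlink for Presto, then edit the launcher
ln -s presto-server-0.237/ presto
vim /kkb/install/presto/bin/launcher

Add the following lines:

PATH=/kkb/install/jdk1.8.0_251/bin:$PATH
java -version

Note: these lines must be placed before exec "$(dirname "$0")/launcher.py" "$@".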
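The manual launcher edit can also be scripted, which helps when the same change has to be repeated on several nodes. The following is only a minimal sketch: `patch_launcher` is a hypothetical helper name, it assumes `awk`, `grep` and `mktemp` are available, and it inserts only the PATH line (not the `java -version` check) before the first line that starts with `exec`.

```shell
#!/usr/bin/env bash
# Sketch: insert a JDK PATH line before the exec line of bin/launcher.
# patch_launcher is a hypothetical helper name, not part of Presto.
patch_launcher() {
  local launcher="$1" jdk_bin="$2"
  # Do nothing if the PATH line is already present (keeps the edit idempotent).
  if grep -qF "PATH=${jdk_bin}:" "$launcher"; then
    return 0
  fi
  local tmp
  tmp="$(mktemp)"
  # Print the PATH line once, just before the first line that starts with "exec ".
  awk -v line="PATH=${jdk_bin}:\$PATH" '
    /^exec / && !done { print line; done = 1 }
    { print }
  ' "$launcher" > "$tmp"
  # Overwrite the content in place so the file keeps its executable bit.
  cat "$tmp" > "$launcher"
  rm -f "$tmp"
}
```

With the paths used in these notes, the call would be `patch_launcher /kkb/install/presto/bin/launcher /kkb/install/jdk1.8.0_251/bin`.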

 

3. Create the required directories

cd /kkb/install
cd presto
mkdir data
mkdir etc

4. Add the JVM configuration file

cd /kkb/install/presto/etc
vim jvm.config

Add the following content:

-server
-Xmx16G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError

5. Configure the data source

Presto supports different data sources, each configured through a catalog; every data source has its own catalog.

cd /kkb/install/presto-server-0.237/etc
mkdir catalog
cd catalog
vim hive.properties

Add the following content:

connector.name=hive-hadoop2
hive.metastore.uri=thrift://node03:9083

6. Distribute Presto to the other nodes

cd /kkb/install/
scp -r presto node02:/kkb/install/
scp -r presto node03:/kkb/install/

7. Configure node.properties

cd /kkb/install/presto/etc
vim node.properties

# On node01
node.environment=production
node.id=ffffffff-ffff-ffff-ffff-fffffffffff1
node.data-dir=/kkb/install/presto/data

# On node02
node.environment=production
node.id=ffffffff-ffff-ffff-ffff-fffffffffff2
node.data-dir=/kkb/install/presto/data

# On node03
node.environment=production
node.id=ffffffff-ffff-ffff-ffff-fffffffffff3
node.data-dir=/kkb/install/presto/data

Notes:

node.environment: the name of the environment; it must be identical on every node of the Presto cluster

node.id: the ID of each Presto node; it must be unique

node.data-dir: the directory where logs and other data are stored
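Because the three node.properties files differ only in node.id, they can be generated instead of typed by hand. A minimal sketch, assuming the same id pattern and paths as above; `node_properties` is a hypothetical helper, and the single-character index only works for up to 15 nodes (1-9, a-f) before the id stops being a valid UUID.

```shell
#!/usr/bin/env bash
# Sketch: print a node.properties body for the node with the given index.
# node_properties is a hypothetical helper; the id pattern follows the notes above.
node_properties() {
  local index="$1"
  local environment="${2:-production}"
  local data_dir="${3:-/kkb/install/presto/data}"
  # node.environment must be the same on every node.
  printf 'node.environment=%s\n' "$environment"
  # node.id must be unique per node, so the index becomes the last character.
  printf 'node.id=ffffffff-ffff-ffff-ffff-fffffffffff%s\n' "$index"
  printf 'node.data-dir=%s\n' "$data_dir"
}

# Example: show what each of the three nodes would receive.
for i in 1 2 3; do
  echo "# node0${i}"
  node_properties "$i"
done
```

If passwordless ssh is set up (an assumption, not something these notes configure), the output could be written remotely with something like `node_properties 2 | ssh node02 'cat > /kkb/install/presto/etc/node.properties'`.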

8. Configure config.properties

On node01 (the coordinator):

cd /kkb/install/presto/etc
vim config.properties

coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8880
query.max-memory=50GB
query.max-memory-per-node=1GB
discovery-server.enabled=true
discovery.uri=http://node01:8880

Notes:

coordinator=true: allows this Presto instance to act as a coordinator

node-scheduler.include-coordinator: whether work may be scheduled on the coordinator

http-server.http.port: Presto uses HTTP for all internal and external communication; this sets the port of the HTTP server

query.max-memory: the upper limit on the total distributed memory a single query may use

query.max-memory-per-node: the upper limit on the memory a single query may use on any one Presto server

discovery-server.enabled: Presto uses the discovery service to find all the nodes in the cluster

discovery.uri: the URI of the discovery service

On node02 and node03 (the workers):

cd /kkb/install/presto/etc
vim config.properties

coordinator=false
http-server.http.port=8880
query.max-memory=50GB
discovery.uri=http://node01:8880
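The coordinator and worker files share most of their settings, so one way to keep them consistent is to generate both from a single function. This is only a sketch under the values used in these notes; `presto_config` is a hypothetical helper name, and the port and discovery URI defaults are taken from the configuration above.

```shell
#!/usr/bin/env bash
# Sketch: print config.properties for a coordinator or a worker.
# presto_config is a hypothetical helper; the values mirror the notes above.
presto_config() {
  local role="$1"
  local port="${2:-8880}"
  local discovery_uri="${3:-http://node01:8880}"
  case "$role" in
    coordinator)
      printf 'coordinator=true\n'
      printf 'node-scheduler.include-coordinator=false\n'
      printf 'http-server.http.port=%s\n' "$port"
      printf 'query.max-memory=50GB\n'
      printf 'query.max-memory-per-node=1GB\n'
      # Only the coordinator runs the discovery service.
      printf 'discovery-server.enabled=true\n'
      ;;
    worker)
      printf 'coordinator=false\n'
      printf 'http-server.http.port=%s\n' "$port"
      printf 'query.max-memory=50GB\n'
      ;;
    *)
      echo "unknown role: $role" >&2
      return 1
      ;;
  esac
  # Every node points at the same discovery URI.
  printf 'discovery.uri=%s\n' "$discovery_uri"
}
```

For example, `presto_config worker > /kkb/install/presto/etc/config.properties` would be run on node02 and node03, and `presto_config coordinator` on node01.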

9. Start the Presto server

# Start the Hive metastore first (on the host named in hive.metastore.uri)
nohup hive --service metastore > /dev/null 2>&1 &

cd /kkb/install/presto
# Run in the foreground, printing logs to the console
bin/launcher run
# Or start Presto in the background
bin/launcher start

Logs are written to /kkb/install/presto/data/var/log.
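After a background start it is useful to confirm that the coordinator is actually accepting requests before connecting a client. The sketch below polls the server's `/v1/info` endpoint and waits until the response contains `"starting":false`. `wait_for_presto` is a hypothetical helper name, `curl` is assumed to be installed, and the check is a plain substring match on the JSON body rather than a real JSON parse.

```shell
#!/usr/bin/env bash
# Sketch: poll the coordinator's /v1/info endpoint until it reports "starting":false.
# wait_for_presto is a hypothetical helper name; it relies on curl being installed.
wait_for_presto() {
  local base_url="$1"
  local attempts="${2:-30}"
  local delay="${3:-2}"
  local i=0 info
  while [ "$i" -lt "$attempts" ]; do
    # -s: silent, -f: treat HTTP error statuses as failures
    if info="$(curl -sf "${base_url}/v1/info")"; then
      case "$info" in
        *'"starting":false'*|*'"starting": false'*) return 0 ;;
      esac
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  # Gave up: the server never reported that startup had finished.
  return 1
}
```

With the port configured above, a call could look like `wait_for_presto http://node01:8880 && echo "Presto is up"`.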

 

2. Install and deploy the Presto command-line interface

1. Download the package

2. Rename the file

cd /kkb/soft
mv presto-cli-0.237-executable.jar prestocli

3. Make it executable

chmod u+x prestocli

4. Start the Presto CLI

./prestocli --help
./prestocli --server node01:8880 --catalog hive --schema default

Note:

In --catalog hive, "hive" is the name of the hive.properties file under etc/catalog, without the extension.

If the jar has not been renamed, it can also be run directly with java:

java -jar presto-cli-0.237-executable.jar --server node01:8880 --catalog hive --schema default

To leave the CLI:

quit
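Besides the interactive prompt, the CLI can run a single statement and exit, which is handy in scripts. A minimal sketch using the CLI's `--execute` and `--output-format` options; `presto_query` and the `PRESTO_CLI` variable are hypothetical names introduced here, and the server, catalog and schema values are the ones used in these notes.

```shell
#!/usr/bin/env bash
# Sketch: run one SQL statement through the Presto CLI and print CSV with a header row.
# presto_query is a hypothetical wrapper; PRESTO_CLI defaults to the renamed jar from step 2.
PRESTO_CLI="${PRESTO_CLI:-./prestocli}"

presto_query() {
  local sql="$1"
  # --execute runs the statement non-interactively and then exits.
  "$PRESTO_CLI" \
    --server node01:8880 \
    --catalog hive \
    --schema default \
    --output-format CSV_HEADER \
    --execute "$sql"
}
```

For example, `presto_query 'SHOW TABLES' > tables.csv` would write the table list to a file, assuming the server is reachable.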

 

5. Try some commands

Working in the Presto CLI is much like working in the Hive CLI. Tables are referenced with a schema prefix, for example:

select * from schema.table limit 5

Or switch to the schema first, then query the table:

use myhive;
select * from score limit 3;

 

3. Install and deploy the Presto web client

1. Download the package

2. Unzip

cd /kkb/soft
unzip -d /kkb/install yanagishima-18.0.zip

# If you see "-bash: unzip: command not found", unzip is not installed; install it, then unzip again
sudo yum -y install unzip zip

cd /kkb/install/yanagishima-18.0

 

3. Edit the configuration file

cd /kkb/install/yanagishima-18.0/conf
vim yanagishima.properties

jetty.port=7080
presto.datasources=kkb-presto
presto.coordinator.server.kkb-presto=http://node01:8880
catalog.kkb-presto=hive
schema.kkb-presto=default
sql.query.engines=presto

 

4. Start yanagishima

# Run from the yanagishima installation directory
[hadoop@node01 yanagishima-18.0]$ pwd
/kkb/install/yanagishima-18.0

# Start in the background, writing output to yanagishima.log
nohup bin/yanagishima-start.sh >yanagishima.log 2>&1 &

# Or start in the foreground
bin/yanagishima-start.sh

5. Presto queries and optimization

1. Presto SQL syntax

SHOW SCHEMAS;

SHOW TABLES;

Syntax: CREATE SCHEMA [ IF NOT EXISTS ] schema_name

CREATE SCHEMA testschema;

Syntax: DROP SCHEMA [ IF EXISTS ] schema_name

drop schema testschema;

Syntax: CREATE TABLE [ IF NOT EXISTS ]
table_name ( column_name data_type [ COMMENT comment ] [, ...] )

create table stu4(id int, name varchar(20));
Syntax:
CREATE TABLE [ IF NOT EXISTS ] table_name [ ( column_alias, ... ) ]
[ COMMENT table_comment ]
[ WITH ( property_name = expression [, ...] ) ]
AS query
[ WITH [ NO ] DATA ]

create table if not exists myhive.stu5 as select id, name from stu1;

Syntax: DELETE FROM table_name [ WHERE condition ]
Note: the Hive connector can only delete whole partitions at a time; it does not support deleting individual rows.

DELETE FROM order_partition where month='2019-03';

DESCRIBE hive.myhive.stu1;

Syntax: ANALYZE table_name

ANALYZE hive.myhive.stu1;
Syntax: PREPARE statement_name FROM statement

prepare my_select1 from select * from score;
execute my_select1;

prepare my_select2 from select * from score where s_score < 90 and s_score > 70;
execute my_select2;

prepare my_select3 from select * from score where s_score < ? and s_score > ?;
execute my_select3 using 90, 70;
Syntax:
EXPLAIN [ ( option [, ...] ) ] statement

where option can be one of:

    FORMAT { TEXT | GRAPHVIZ | JSON }
    TYPE { LOGICAL | DISTRIBUTED | VALIDATE | IO }

Show the logical plan:
explain select s_id, avg(s_score) from score group by s_id;
which is equivalent to
explain (type logical)select s_id, avg(s_score) from score group by s_id;

Show the distributed execution plan:
explain (type distributed)select s_id, avg(s_score) from score group by s_id;

Validate the statement:
explain (type validate)select s_id, avg(s_score) from score group by s_id;

Show the I/O plan in JSON format:
explain (type io, format json)select s_id, avg(s_score) from score group by s_id;
Syntax:
[ WITH with_query [, ...] ]
SELECT [ ALL | DISTINCT ] select_expr [, ...]
[ FROM from_item [, ...] ]
[ WHERE condition ]
[ GROUP BY [ ALL | DISTINCT ] grouping_element [, ...] ]
[ HAVING condition]
[ { UNION | INTERSECT | EXCEPT } [ ALL | DISTINCT ] select ]
[ ORDER BY expression [ ASC | DESC ] [, ...] ]
[ LIMIT [ count | ALL ] ]

from_item:
table_name [ [ AS ] alias [ ( column_alias [, ...] ) ] ]
from_item join_type from_item [ ON join_condition | USING ( join_column [, ...] ) ]

join_type:
[ INNER ] JOIN
LEFT [ OUTER ] JOIN
RIGHT [ OUTER ] JOIN
FULL [ OUTER ] JOIN
CROSS JOIN

grouping_element:
()
expression
GROUPING SETS ( ( column [, ...] ) [, ...] )
CUBE ( column [, ...] )
ROLLUP ( column [, ...] )

Statements:
WITH clause: simplifies nested subqueries
select a, b
from (
select s_id as a, avg(s_score) as b from score group by s_id
) as tbl1;

is equivalent to:
with tbl1 as (select s_id as a, avg(s_score) as b from score group by s_id)
select a, b from tbl1;

WITH also works with several subqueries:
WITH
  t1 AS (SELECT a, MAX(b) AS b FROM x GROUP BY a),
  t2 AS (SELECT a, AVG(d) AS d FROM y GROUP BY a)
SELECT t1.*, t2.*
FROM t1
JOIN t2 ON t1.a = t2.a;

The relations in a WITH clause can be chained:
WITH
  x AS (SELECT a FROM t),
  y AS (SELECT a AS b FROM x),
  z AS (SELECT b AS c FROM y)
SELECT c FROM z;


GROUP BY:
select s_id as a, avg(s_score) as b from score group by s_id;
is equivalent to:
select s_id as a, avg(s_score) as b from score group by 1;
Here 1 refers to the first column of the query output, s_id.

GROUP BY can also use an input column that does not appear in the output:
select count(*) as b from score group by s_id;

2. Storage optimization

3. SQL optimization

Select only the columns you need; avoid SELECT *:

[GOOD]: SELECT s_id, c_id FROM score

[BAD]:  SELECT * FROM score

Filter on the partition column (assuming day is the partition column here):

[GOOD]: SELECT vtime, stu, address FROM tbl where day=20200501

[BAD]:  SELECT * FROM tbl where vtime=20200501

In GROUP BY, list the column with more distinct values first (uid has many distinct values; gender has few):

[GOOD]: SELECT ... GROUP BY uid, gender

[BAD]:  SELECT ... GROUP BY gender, uid

Add LIMIT when using ORDER BY:

[GOOD]: SELECT * FROM tbl ORDER BY time LIMIT 100

[BAD]:  SELECT * FROM tbl ORDER BY time

Use approximate aggregate functions where a small error is acceptable:

select approx_distinct(s_id) from score;
Replace several LIKE conditions with a single regexp_like:

SELECT
...
FROM
access
WHERE
method LIKE '%GET%' OR
method LIKE '%POST%' OR
method LIKE '%PUT%' OR
method LIKE '%DELETE%'

Optimized:
SELECT
...
FROM
access
WHERE
regexp_like(method, 'GET|POST|PUT|DELETE')
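The rewrite relies on the fact that LIKE '%x%' is a substring match and regexp_like only needs the pattern to be contained in the string, so a single alternation selects the same rows as the four OR-ed conditions. The sketch below is a local sanity check of that equivalence with grep instead of Presto; the sample values and the function names are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: compare four substring filters against one regex alternation on sample data.
# The sample values are made up for illustration.
sample() {
  printf '%s\n' 'GET' 'POST' 'PUT' 'DELETE' 'HEAD' 'OPTIONS' 'X-GET-Y'
}

# Counterpart of: LIKE '%GET%' OR LIKE '%POST%' OR LIKE '%PUT%' OR LIKE '%DELETE%'
with_like() {
  sample | grep -F -e 'GET' -e 'POST' -e 'PUT' -e 'DELETE'
}

# Counterpart of: regexp_like(method, 'GET|POST|PUT|DELETE')
with_regex() {
  sample | grep -E 'GET|POST|PUT|DELETE'
}

if [ "$(with_like)" = "$(with_regex)" ]; then
  echo "same rows selected"
fi
```

This only checks that the two filters agree on the sample; it says nothing about the performance difference inside Presto.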
When joining, put the large table on the left:

[GOOD] SELECT ... FROM large_table l join small_table s on l.id = s.id
[BAD] SELECT ... FROM small_table s join large_table l on l.id = s.id

6. Other notes

1. Quoting field names

2. Functions

/* MySQL: a plain string literal is accepted */
SELECT t FROM a WHERE t > '2020-05-01 00:00:00';

/* Presto: the literal needs the timestamp keyword */
SELECT t FROM a WHERE t > timestamp '2020-05-01 00:00:00';

3. INSERT OVERWRITE syntax is not supported

4. PARQUET format

 

V. Extensions, Future Plans, and Industry Trends

 

Note: the above is taken from kkb class notes.

 

Source: https://blog.csdn.net/weixin_45091011/article/details/110585981