
How to use Hive to schedule a Python script on a bastion host

Author: Internet

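At work, when the data set is too large, a local machine cannot run the Python script, so the script has to run on a server (referred to here as the bastion host). How do you call a Python script on the bastion host from Hive? This post summarizes the steps and a few things to watch out for.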

1. First, upload the Python script to the bastion host

2. After uploading, write a shell job in Hive

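# Define paths and file names as variables so they can be reused
file_path="/home/chen_lib"       # base directory on the server
file_name_t="traindatas.csv"     # training data
file_name_y="df2.csv"            # result data set
python_name="Untitled1.py"       # Python script

# Input data: the two environments are not connected, so the test data was placed
# on the server directly instead of being read from a table.
# To export a Hive table to the server, run the command below.
# Note: the Hive CLI prints tab-separated columns, so convert the output to commas
# (or read it with a tab separator) before the script parses it.
# ">" overwrites the file so that reruns do not append duplicate rows and headers.
hive -e "set hive.resultset.use.unique.column.names=false;set hive.cli.print.header=true;
select * from table " > "$file_path/$file_name_t"


# Run the Python script; it writes the result file (df2.csv) on the server
python2.7 "$file_path/$python_name" "$file_path" "$file_path"

# Load the result file from the server into the Hive table.
# ${zdt.format(...)} is assumed to be a date macro expanded by the scheduling platform, not by the shell.
hive -e "LOAD DATA LOCAL INPATH '$file_path/$file_name_y' OVERWRITE INTO TABLE tablename partition (d='${zdt.format("yyyy-MM-dd")}')"


echo "Data import finished!"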
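
The export step in the job above redirects Hive CLI output into a file named `.csv`, but the Hive CLI separates columns with tabs, not commas. The following is a minimal sketch of one way to convert that output, assuming Python 3 and that no field contains a literal tab. The function name `hive_tsv_to_csv` and the sample data are hypothetical and not part of the original job.

```python
import csv
import io


def hive_tsv_to_csv(tsv_text):
    """Convert tab-separated Hive CLI output into comma-separated text."""
    # QUOTE_NONE: treat quote characters in the Hive output as ordinary data
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        if row:  # skip blank lines
            writer.writerow(row)
    return out.getvalue()


# Hypothetical sample: a header line plus one data row
sample = "hotelid\tquantity0_1\tsection0_1\n101\t5\t100-200\n"
converted = hive_tsv_to_csv(sample)
print(converted)
```

Fields that already contain a comma are quoted by `csv.writer`, so a standard CSV reader can still split the row correctly. An alternative is to leave the file tab-separated and tell the reader to use a tab as the separator.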

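3. Points to note when creating the Hive table:

1) The field delimiter must be ',' because the input is a CSV file; otherwise the loaded data will be empty.

2) The table must be stored as textfile. If it is created as ORC, loading fails with:

Caused by: java.io.IOException: Malformed ORC file

The reason: ORC is a columnar storage format, and LOAD DATA only moves the file into the table's directory without converting it, so a local text file cannot be loaded into an ORC table directly. A direct load works only when the source data is already stored as ORC; otherwise the error above appears.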

USE database;
CREATE TABLE tablename(
     hotelid int COMMENT 'field1 comment',
     max_quantity int COMMENT 'field2 comment',
     section_query_min int COMMENT 'field3 comment',
     section_query_max int COMMENT 'field4 comment'
     )
COMMENT 'owner:chen'
PARTITIONED BY (d string COMMENT 'date')
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
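
Hive does not validate a text file when it is loaded. A wrong delimiter or a leftover header line typically shows up only later, as NULL values in the table. The sketch below is one possible pre-load check, assuming Python 3 and the four integer columns of the table above. The function name `find_bad_rows` and the sample strings are hypothetical.

```python
import csv
import io

EXPECTED_COLUMNS = 4  # hotelid, max_quantity, section_query_min, section_query_max


def find_bad_rows(csv_text, expected_columns=EXPECTED_COLUMNS):
    """Return (line_number, reason) pairs for rows that do not fit the table."""
    problems = []
    for line_number, row in enumerate(csv.reader(io.StringIO(csv_text)), start=1):
        if len(row) != expected_columns:
            problems.append(
                (line_number, "expected %d fields, got %d" % (expected_columns, len(row)))
            )
            continue
        for value in row:
            try:
                int(value)
            except ValueError:
                problems.append((line_number, "non-integer value %r" % value))
                break  # report each row at most once
    return problems


good = "101,5,100,200\n102,7,200,300\n"
with_header = "hotelid,max_quantity,section_query_min,section_query_max\n101,5,100,200\n"
wrong_separator = "101\t5\t100\t200\n"

print(find_bad_rows(good))
print(find_bad_rows(with_header))
print(find_bad_rows(wrong_separator))
```

Running such a check on the result file before the LOAD DATA step makes the two most common mistakes (header left in, wrong separator) fail early instead of silently producing empty columns.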

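4. Points to note in the Python script

1) The output path must be the same as the path used in the Hive job.

2) The exported table df2 must be written without column names; otherwise the header line is loaded into the Hive table as well.

3) The exported CSV file must use ',' as the separator. With another separator such as '\', each line ends up in a single column and cannot be read into the Hive table correctly.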


# coding: utf-8

import pandas as pd
import numpy as np

file_path="/home/hotel/chen_lib/"  # must be the same directory that file_path in the shell job points to
file_name_t="traindatas.csv"
file_name_y="df2.csv"

data_ctrip = pd.read_csv(file_path+file_name_t,header = 'infer')

ret1=[]
# Column names of the 20 buckets: quantity0_1 ... quantity19_20 and section0_1 ... section19_20
quantity_cols = ['quantity%d_%d' % (i, i + 1) for i in range(20)]
section_cols = ['section%d_%d' % (i, i + 1) for i in range(20)]

# For each row, find the bucket with the largest quantity
for row in range(data_ctrip.shape[0]):
    quantitylist = [data_ctrip.loc[row, col] for col in quantity_cols]
    sectionlist = [data_ctrip.loc[row, col] for col in section_cols]

    max_quantity = max(quantitylist)  # largest room-night count
    max_quantity_index = quantitylist.index(max_quantity)  # index of that bucket

    section_query = sectionlist[max_quantity_index]  # price range of that bucket

    section_query_min = int(section_query.split("-", 1)[0])  # lower bound of the price range
    section_query_max = int(section_query.split("-", 1)[1])  # upper bound of the price range
    # Append one output row
    ret1.append([data_ctrip.loc[row, 'hotelid'], max_quantity, section_query_min, section_query_max])
df1 = pd.DataFrame(ret1)

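# Write to the path held in the variables (not to a file literally named 'file_path+file_name_y'),
# comma-separated, without index and without header
df1.to_csv(file_path + file_name_y, sep=',', index=False, header=False)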
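
The per-row logic of the script can also be isolated as a plain function, which makes it easy to test locally before uploading anything to the server. This is a minimal sketch, assuming Python 3 and the "min-max" section format implied by the `split("-", 1)` calls in the script. The function name `best_section`, the `buckets` parameter and the example record are hypothetical.

```python
def best_section(row, buckets=20):
    """Return (max_quantity, section_min, section_max) for one record.

    `row` maps column names such as 'quantity0_1' and 'section0_1' to values.
    """
    suffixes = ["%d_%d" % (i, i + 1) for i in range(buckets)]
    quantities = [row["quantity" + s] for s in suffixes]
    sections = [row["section" + s] for s in suffixes]

    max_quantity = max(quantities)
    # list.index returns the first match, so ties go to the lowest bucket
    section = sections[quantities.index(max_quantity)]
    low, high = section.split("-", 1)
    return max_quantity, int(low), int(high)


# Hypothetical record with only three buckets, two of them tied
example = {
    "quantity0_1": 3, "section0_1": "0-100",
    "quantity1_2": 9, "section1_2": "100-200",
    "quantity2_3": 9, "section2_3": "200-300",
}
result = best_section(example, buckets=3)
print(result)
```

The tie-breaking behaviour matches the original loop, which also relies on `list.index` and therefore picks the first bucket with the maximum quantity.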

Those are the steps for running the job with the bastion host and Hive, along with the problems you may run into.

 

 

 

Source: https://blog.csdn.net/Jarry_cm/article/details/93103248