Hadoop Overview (Part 2)
Author: Internet
Uploading Linux logs to HDFS on a schedule with a shell script
As the title suggests, we will use a shell script to push log files up to HDFS.
- Define the upload path and a staging path, and place the log files to be uploaded.
Here I use the nginx error.log configured in the previous section.
# directory holding the log files to upload
/bigdata/logs/upload/log/
# staging directory for files being uploaded
/bigdata/logs/upload/templog/
Put nginx's error.log under the upload log directory, as shown in the figure below.
- Create the shell script in the ~/bin directory
vi uploadFileToHdfs.sh
## paste the following content, then save and exit
#!/bin/bash
#set java env
export JAVA_HOME=/bigdata/install/jdk1.8.0_141
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:$PATH
#set hadoop env
export HADOOP_HOME=/bigdata/install/hadoop-3.1.4
export PATH=${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin:$PATH
# directory holding the log files
log_src_dir=/bigdata/logs/upload/log/
# staging directory for files waiting to be uploaded
log_toupload_dir=/bigdata/logs/upload/templog/
# root path on HDFS for the uploaded logs
date1=`date -d last-day +%Y_%m_%d`
hdfs_root_dir=/data/log/$date1/
# print environment info
echo "envs: hadoop_home: $HADOOP_HOME"
# scan the log directory for files that need uploading
echo "log_src_dir:"$log_src_dir
ls $log_src_dir | while read fileName
do
if [[ "$fileName" == *.log ]]; then
date=`date +%Y_%m_%d_%H_%M_%S`
# print progress
echo "moving $log_src_dir$fileName to $log_toupload_dir"temp_$fileName"$date"
mv $log_src_dir$fileName $log_toupload_dir"temp_$fileName"$date
# append the staged file's path to a manifest file named will.<timestamp>
echo $log_toupload_dir"temp_$fileName"$date >> $log_toupload_dir"will."$date
fi
done
# find the manifest files (will.*) that are not yet being processed
ls $log_toupload_dir | grep will |grep -v "_DOING_" | grep -v "_DONE_" | while read line
do
# print progress
echo "toupload is in file:"$line
# rename the manifest to will.<timestamp>_DOING_ to mark it as in progress
mv $log_toupload_dir$line $log_toupload_dir$line"_DOING_"
# read the _DOING_ manifest line by line; each filePath is one staged file to upload
cat $log_toupload_dir$line"_DOING_" |while read filePath
do
# print progress
echo "putting...$filePath to hdfs path.....$hdfs_root_dir"
hadoop fs -mkdir -p $hdfs_root_dir
hadoop fs -put $filePath $hdfs_root_dir
done
mv $log_toupload_dir$line"_DOING_" $log_toupload_dir$line"_DONE_"
done
- Run the script and check the result
sh uploadFileToHdfs.sh
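The script above only processes whatever is present at the moment you run it; the "scheduled" part of the title is normally handled by cron. A minimal sketch of a crontab entry, assuming the script lives at /home/hadoop/bin as created above (the output log path is an assumption):

```shell
# Hypothetical cron entry: run the uploader at minute 0 of every hour.
# Install it with `crontab -e`; the log path after >> is an assumption.
entry='0 * * * * /bin/bash /home/hadoop/bin/uploadFileToHdfs.sh >> /home/hadoop/logs/upload_cron.log 2>&1'
echo "$entry"
```

Because the script builds the HDFS path with `date -d last-day`, each run files logs under the previous day's date directory.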
Key YARN configuration parameters
YARN is the Hadoop component responsible for resource allocation and scheduling. Which parameters relate to it?
- ResourceManager
yarn.resourcemanager.scheduler.class # scheduler class; Apache YARN defaults to the Capacity Scheduler, CDH defaults to the Fair Scheduler
yarn.resourcemanager.scheduler.client.thread-count # number of threads the ResourceManager uses to handle scheduler requests, default 50
- NodeManager
yarn.nodemanager.resource.detect-hardware-capabilities # let YARN detect the hardware and configure itself, default false
yarn.nodemanager.resource.count-logical-processors-as-cores # count logical processors (hyper-threads) as CPU cores, default false
yarn.nodemanager.resource.pcores-vcores-multiplier # multiplier from physical cores to virtual cores, default 1.0
yarn.nodemanager.resource.memory-mb # memory available to the NodeManager, default 8 GB
yarn.nodemanager.resource.system-reserved-memory-mb # memory the NodeManager reserves for the system; configure only one of these two memory parameters
yarn.nodemanager.resource.cpu-vcores # CPU cores available to the NodeManager, default 8
yarn.nodemanager.pmem-check-enabled # enforce physical-memory limits on containers, default true
yarn.nodemanager.vmem-check-enabled # enforce virtual-memory limits on containers, default true
yarn.nodemanager.vmem-pmem-ratio # ratio of virtual to physical memory, default 2.1
- Container
yarn.scheduler.minimum-allocation-mb # minimum container memory, default 1 GB
yarn.scheduler.maximum-allocation-mb # maximum container memory, default 8 GB
yarn.scheduler.minimum-allocation-vcores # minimum container vcores, default 1
yarn.scheduler.maximum-allocation-vcores # maximum container vcores, default 4
How should we size these in a real deployment?
Example: 3 servers, each with 4 GB of RAM and 4 CPU cores (4 threads). Suppose we run a count job over a 1 GB file:
that gives 1 GB / 128 MB = 8 MapTasks, plus 1 ReduceTask and 1 MrAppMaster, i.e. 10 tasks in total,
an average of about 3 tasks per node, so we distribute the 10 tasks 4/3/3 across the nodes.
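The task arithmetic above can be checked with a couple of lines of shell, assuming the default 128 MB HDFS block size:

```shell
# Container-count estimate for a 1 GB input file with 128 MB blocks.
file_mb=1024
block_mb=128
map_tasks=$(( (file_mb + block_mb - 1) / block_mb ))  # ceiling division
total=$(( map_tasks + 1 + 1 ))  # plus 1 ReduceTask and 1 MrAppMaster
echo "map_tasks=$map_tasks total=$total"
```

This prints map_tasks=8 total=10, matching the 10-task figure used for the 4/3/3 split.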
With that in mind, configure yarn-site.xml using the parameters above:
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop01</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<!-- If vmem/pmem limits are exceeded the container is killed, so disable both memory checks here -->
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
<property>
<name>yarn.nodemanager.pmem-check-enabled</name>
<value>false</value>
</property>
<!-- Scheduler class; the Capacity Scheduler is the default -->
<property>
<name>yarn.resourcemanager.scheduler.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>
<!-- Threads for ResourceManager scheduler requests; 3 nodes * 4 threads = 12 in total, so leave a few for other work -->
<property>
<name>yarn.resourcemanager.scheduler.client.thread-count</name>
<value>8</value>
</property>
<!-- Whether to let YARN detect the hardware automatically -->
<property>
<name>yarn.nodemanager.resource.detect-hardware-capabilities</name>
<value>false</value>
</property>
<!-- Whether to count logical processors as CPU cores, default false -->
<property>
<name>yarn.nodemanager.resource.count-logical-processors-as-cores</name>
<value>false</value>
</property>
<!-- Multiplier from physical cores to virtual cores -->
<property>
<name>yarn.nodemanager.resource.pcores-vcores-multiplier</name>
<value>1.0</value>
</property>
<!-- NodeManager memory, default 8 GB; our servers have only 4 GB, so set 4 GB -->
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>4096</value>
</property>
<!-- NodeManager CPU cores, default 8; our servers have only 4 cores, so set 4 -->
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>4</value>
</property>
<!-- Minimum container memory, default 1 GB -->
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>1024</value>
</property>
<!-- Maximum container memory, default 8 GB; servers have only 4 GB, so set 2 GB -->
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>2048</value>
</property>
<!-- Minimum container vcores, default 1 -->
<property>
<name>yarn.scheduler.minimum-allocation-vcores</name>
<value>1</value>
</property>
<!-- Maximum container vcores, default 4; servers have only 4 cores, so set 2 -->
<property>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>2</value>
</property>
<!-- Ratio of virtual to physical memory, default 2.1 -->
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>2.1</value>
</property>
</configuration>
After these changes, the YARN cluster's resource settings better match our servers' actual hardware.
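yarn-site.xml must be identical on every node. A sketch of pushing it out and restarting YARN, assuming the three hosts are named hadoop01 to hadoop03 as elsewhere in this series:

```shell
# Hypothetical host names; adjust to your cluster.
for host in hadoop02 hadoop03; do
  scp $HADOOP_HOME/etc/hadoop/yarn-site.xml $host:$HADOOP_HOME/etc/hadoop/
done
stop-yarn.sh   # run on the ResourceManager node (hadoop01)
start-yarn.sh
```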
Custom YARN scheduler queues
YARN ships three schedulers: FIFO, the Capacity Scheduler, and the Fair Scheduler.
FIFO scheduling
A single queue; jobs run strictly in first-in, first-out order, so it cannot support multiple concurrent users. See the figure below.
Capacity Scheduler
A multi-user scheduler developed at Yahoo (the default in Apache YARN)
- Characteristics
- Multiple queues: each queue gets a configurable share of resources and schedules its own jobs FIFO
- Capacity guarantees: administrators can set a minimum guaranteed share and an upper limit for each queue
- Elasticity: spare capacity in one queue can be lent temporarily to queues that need it; once the lending queue receives new applications, the borrowed resources are returned to it
- Multi-tenancy: multiple users can share the cluster and run applications concurrently; to stop one user's jobs from monopolizing a queue, the scheduler limits the resources any single user's jobs may occupy
- Resource allocation algorithm
- Queue selection: starting from root, traverse depth-first and prefer the queue with the lowest resource utilization
- Job selection: within a queue, allocate by job priority, then by submission time
- Container selection: allocate by container priority; on ties, apply data locality in this order:
a. task and data on the same node
b. task and data on the same rack
c. task and data on neither the same node nor the same rack
Fair Scheduler
- Characteristics
- Multiple queues: supports many queues and many jobs
- Capacity guarantees: administrators can set a minimum guaranteed share and an upper limit for each queue
- Elasticity: spare capacity in one queue can be lent temporarily to queues that need it; once the lending queue receives new applications, the borrowed resources are returned to it
- Multi-tenancy: multiple users can share the cluster and run applications concurrently; to stop one user's jobs from monopolizing a queue, the scheduler limits the resources any single user's jobs may occupy
- Per-queue allocation policies in the Fair Scheduler
- FIFO policy: if a queue's policy is set to FIFO, the Fair Scheduler behaves like the Capacity Scheduler for that queue
- Fair policy: a resource-sharing scheme based on the max-min fairness algorithm, and the default within each queue. All jobs in a queue share its resources and, over time, receive equal shares: with two applications running in a queue, each gets 1/2 of its resources; with three, each gets 1/3.
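The 1/2 and 1/3 shares above are simply the queue's resources divided evenly among its running applications. A toy illustration with a hypothetical 4096 MB queue:

```shell
# Equal (max-min fair) share of a hypothetical 4096 MB queue,
# assuming every app demands more than its fair share.
queue_mem_mb=4096
for apps in 1 2 3; do
  echo "apps=$apps share_mb=$(( queue_mem_mb / apps ))"
done
```

In the real max-min algorithm, an app that demands less than its share keeps only what it asks for, and the surplus is redistributed among the remaining apps.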
- Differences between the Fair and Capacity Schedulers
- Core scheduling policy
Capacity Scheduler: prefers the queue with the lowest resource utilization
Fair Scheduler: prefers the queue with the largest resource deficit (the gap between its fair share and what it currently holds)
- Per-queue allocation policies
Capacity Scheduler: FIFO, DRF
Fair Scheduler: FIFO, FAIR, DRF
To add a Capacity Scheduler queue named hive, we only need to edit capacity-scheduler.xml as follows:
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<configuration>
<property>
<name>yarn.scheduler.capacity.maximum-applications</name>
<value>10000</value>
<description>
Maximum number of applications that can be pending and running.
</description>
</property>
<property>
<name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
<value>0.1</value>
<description>
Maximum percent of resources in the cluster which can be used to run
application masters i.e. controls number of concurrent running
applications.
</description>
</property>
<property>
<name>yarn.scheduler.capacity.resource-calculator</name>
<value>org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator</value>
<description>
The ResourceCalculator implementation to be used to compare
Resources in the scheduler.
The default i.e. DefaultResourceCalculator only uses Memory while
DominantResourceCalculator uses dominant-resource to compare
multi-dimensional resources such as Memory, CPU etc.
</description>
</property>
<!-- queues under root: default and hive -->
<property>
<name>yarn.scheduler.capacity.root.queues</name>
<value>default,hive</value>
<description>
The queues at the this level (root is the root queue).
</description>
</property>
<!-- default gets 40% of total resources -->
<property>
<name>yarn.scheduler.capacity.root.default.capacity</name>
<value>40</value>
<description>Default queue target capacity.</description>
</property>
<!-- hive gets 60% of total resources -->
<property>
<name>yarn.scheduler.capacity.root.hive.capacity</name>
<value>60</value>
<description>Hive queue target capacity.</description>
</property>
<!-- user limit factor for the default queue -->
<property>
<name>yarn.scheduler.capacity.root.default.user-limit-factor</name>
<value>1</value>
<description>
Default queue user limit a percentage from 0.0 to 1.0.
</description>
</property>
<!-- user limit factor for the hive queue -->
<property>
<name>yarn.scheduler.capacity.root.hive.user-limit-factor</name>
<value>1</value>
<description>
Default queue user limit a percentage from 0.0 to 1.0.
</description>
</property>
<!-- default's maximum capacity is 60% of total resources -->
<property>
<name>yarn.scheduler.capacity.root.default.maximum-capacity</name>
<value>60</value>
<description>
The maximum capacity of the default queue.
</description>
</property>
<!-- hive's maximum capacity is 80% of total resources -->
<property>
<name>yarn.scheduler.capacity.root.hive.maximum-capacity</name>
<value>80</value>
<description>
The maximum capacity of the hive queue.
</description>
</property>
<property>
<name>yarn.scheduler.capacity.root.default.state</name>
<value>RUNNING</value>
<description>
The state of the default queue. State can be one of RUNNING or STOPPED.
</description>
</property>
<property>
<name>yarn.scheduler.capacity.root.hive.state</name>
<value>RUNNING</value>
<description>
The state of the default queue. State can be one of RUNNING or STOPPED.
</description>
</property>
<!-- who may submit jobs to the default queue -->
<property>
<name>yarn.scheduler.capacity.root.default.acl_submit_applications</name>
<value>*</value>
<description>
The ACL of who can submit jobs to the default queue.
</description>
</property>
<!-- who may submit jobs to the hive queue -->
<property>
<name>yarn.scheduler.capacity.root.hive.acl_submit_applications</name>
<value>*</value>
<description>
The ACL of who can submit jobs to the default queue.
</description>
</property>
<property>
<name>yarn.scheduler.capacity.root.default.acl_administer_queue</name>
<value>*</value>
<description>
The ACL of who can administer jobs on the default queue.
</description>
</property>
<property>
<name>yarn.scheduler.capacity.root.hive.acl_administer_queue</name>
<value>*</value>
<description>
The ACL of who can administer jobs on the default queue.
</description>
</property>
<property>
<name>yarn.scheduler.capacity.root.default.acl_application_max_priority</name>
<value>*</value>
<description>
The ACL of who can submit applications with configured priority.
For e.g, [user={name} group={name} max_priority={priority} default_priority={priority}]
</description>
</property>
<property>
<name>yarn.scheduler.capacity.root.hive.acl_application_max_priority</name>
<value>*</value>
<description>
The ACL of who can submit applications with configured priority.
For e.g, [user={name} group={name} max_priority={priority} default_priority={priority}]
</description>
</property>
<!-- maximum lifetime of applications in the default queue; -1 means unlimited -->
<property>
<name>yarn.scheduler.capacity.root.default.maximum-application-lifetime
</name>
<value>-1</value>
<description>
Maximum lifetime of an application which is submitted to a queue
in seconds. Any value less than or equal to zero will be considered as
disabled.
This will be a hard time limit for all applications in this
queue. If positive value is configured then any application submitted
to this queue will be killed after exceeds the configured lifetime.
User can also specify lifetime per application basis in
application submission context. But user lifetime will be
overridden if it exceeds queue maximum lifetime. It is point-in-time
configuration.
Note : Configuring too low value will result in killing application
sooner. This feature is applicable only for leaf queue.
</description>
</property>
<!-- maximum lifetime of applications in the hive queue; -1 means unlimited -->
<property>
<name>yarn.scheduler.capacity.root.hive.maximum-application-lifetime
</name>
<value>-1</value>
<description>
Maximum lifetime of an application which is submitted to a queue
in seconds. Any value less than or equal to zero will be considered as
disabled.
This will be a hard time limit for all applications in this
queue. If positive value is configured then any application submitted
to this queue will be killed after exceeds the configured lifetime.
User can also specify lifetime per application basis in
application submission context. But user lifetime will be
overridden if it exceeds queue maximum lifetime. It is point-in-time
configuration.
Note : Configuring too low value will result in killing application
sooner. This feature is applicable only for leaf queue.
</description>
</property>
<!-- default lifetime of applications in the default queue; -1 means unlimited -->
<property>
<name>yarn.scheduler.capacity.root.default.default-application-lifetime
</name>
<value>-1</value>
<description>
Default lifetime of an application which is submitted to a queue
in seconds. Any value less than or equal to zero will be considered as
disabled.
If the user has not submitted application with lifetime value then this
value will be taken. It is point-in-time configuration.
Note : Default lifetime can't exceed maximum lifetime. This feature is
applicable only for leaf queue.
</description>
</property>
<!-- default lifetime of applications in the hive queue; -1 means unlimited -->
<property>
<name>yarn.scheduler.capacity.root.hive.default-application-lifetime
</name>
<value>-1</value>
<description>
Default lifetime of an application which is submitted to a queue
in seconds. Any value less than or equal to zero will be considered as
disabled.
If the user has not submitted application with lifetime value then this
value will be taken. It is point-in-time configuration.
Note : Default lifetime can't exceed maximum lifetime. This feature is
applicable only for leaf queue.
</description>
</property>
<property>
<name>yarn.scheduler.capacity.node-locality-delay</name>
<value>40</value>
<description>
Number of missed scheduling opportunities after which the CapacityScheduler
attempts to schedule rack-local containers.
When setting this parameter, the size of the cluster should be taken into account.
We use 40 as the default value, which is approximately the number of nodes in one rack.
Note, if this value is -1, the locality constraint in the container request
will be ignored, which disables the delay scheduling.
</description>
</property>
<property>
<name>yarn.scheduler.capacity.rack-locality-additional-delay</name>
<value>-1</value>
<description>
Number of additional missed scheduling opportunities over the node-locality-delay
ones, after which the CapacityScheduler attempts to schedule off-switch containers,
instead of rack-local ones.
Example: with node-locality-delay=40 and rack-locality-delay=20, the scheduler will
attempt rack-local assignments after 40 missed opportunities, and off-switch assignments
after 40+20=60 missed opportunities.
When setting this parameter, the size of the cluster should be taken into account.
We use -1 as the default value, which disables this feature. In this case, the number
of missed opportunities for assigning off-switch containers is calculated based on
the number of containers and unique locations specified in the resource request,
as well as the size of the cluster.
</description>
</property>
<property>
<name>yarn.scheduler.capacity.queue-mappings</name>
<value></value>
<description>
A list of mappings that will be used to assign jobs to queues
The syntax for this list is [u|g]:[name]:[queue_name][,next mapping]*
Typically this list will be used to map users to queues,
for example, u:%user:%user maps all users to queues with the same name
as the user.
</description>
</property>
<property>
<name>yarn.scheduler.capacity.queue-mappings-override.enable</name>
<value>false</value>
<description>
If a queue mapping is present, will it override the value specified
by the user? This can be used by administrators to place jobs in queues
that are different than the one specified by the user.
The default is false.
</description>
</property>
<property>
<name>yarn.scheduler.capacity.per-node-heartbeat.maximum-offswitch-assignments</name>
<value>1</value>
<description>
Controls the number of OFF_SWITCH assignments allowed
during a node's heartbeat. Increasing this value can improve
scheduling rate for OFF_SWITCH containers. Lower values reduce
"clumping" of applications on particular nodes. The default is 1.
Legal values are 1-MAX_INT. This config is refreshable.
</description>
</property>
<property>
<name>yarn.scheduler.capacity.application.fail-fast</name>
<value>false</value>
<description>
Whether RM should fail during recovery if previous applications'
queue is no longer valid.
</description>
</property>
<property>
<name>yarn.scheduler.capacity.workflow-priority-mappings</name>
<value></value>
<description>
A list of mappings that will be used to override application priority.
The syntax for this list is
[workflowId]:[full_queue_name]:[priority][,next mapping]*
where an application submitted (or mapped to) queue "full_queue_name"
and workflowId "workflowId" (as specified in application submission
context) will be given priority "priority".
</description>
</property>
<property>
<name>yarn.scheduler.capacity.workflow-priority-mappings-override.enable</name>
<value>false</value>
<description>
If a priority mapping is present, will it override the value specified
by the user? This can be used by administrators to give applications a
priority that is different than the one specified by the user.
The default is false.
</description>
</property>
</configuration>
Restart YARN and you will see the two queues: default and hive.
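Jobs still land in the default queue unless told otherwise; to target the new hive queue, pass the queue name at submission time. A sketch using the bundled wordcount example (the input/output paths are assumptions):

```shell
# Submit the stock wordcount example to the hive queue.
# Jar path matches the hadoop-3.1.4 install used above; adjust as needed.
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.4.jar \
  wordcount -D mapreduce.job.queuename=hive \
  /wordcount/input /wordcount/output
```

For later edits to capacity-scheduler.xml, `yarn rmadmin -refreshQueues` reloads the queue configuration without a full restart.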
Source: https://www.cnblogs.com/hanease/p/16221037.html