
Big Data Learning Tutorial, SD Edition, Part 9: Flume

2021-12-26 14:58:16 Author: Internet

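Flume is a log collection tool. Since it is a tool, the focus here is on how to use it.

It is a distributed framework for collecting, processing, and aggregating streaming data.

You collect data by writing a collection plan, that is, a configuration file. The available configuration options are listed in the official documentation.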

1. Flume Architecture


Agent: a JVM process

Source: receives data
Channel: buffer
Sink: outputs data

Event: the unit of transfer
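To make these roles concrete, here is a toy model in plain Java. It is only a sketch: ToyAgent and ToyEvent are hypothetical names and are not Flume's API. It assumes an event is a byte body plus string headers, and that the channel is a bounded in-memory queue between the source side and the sink side.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Toy model only: hypothetical classes, NOT Flume's API.
class ToyAgent {

    // An event carries a byte[] body plus string headers.
    static final class ToyEvent {
        final Map<String, String> headers = new HashMap<>();
        final byte[] body;

        ToyEvent(String text) {
            this.body = text.getBytes(StandardCharsets.UTF_8);
        }
    }

    // The channel is a bounded buffer between source and sink.
    private final Queue<ToyEvent> channel = new ArrayDeque<>();
    private final int capacity;

    ToyAgent(int capacity) {
        this.capacity = capacity;
    }

    // Source side: wrap incoming data in an event and put it in the channel.
    // Returns false when the channel is full.
    boolean receive(String line) {
        if (channel.size() >= capacity) {
            return false;
        }
        channel.add(new ToyEvent(line));
        return true;
    }

    // Sink side: take every buffered event out of the channel, in order.
    List<String> drain() {
        List<String> out = new ArrayList<>();
        ToyEvent e;
        while ((e = channel.poll()) != null) {
            out.add(new String(e.body, StandardCharsets.UTF_8));
        }
        return out;
    }

    public static void main(String[] args) {
        ToyAgent agent = new ToyAgent(2);
        agent.receive("hello");
        agent.receive("flume");
        System.out.println(agent.receive("overflow")); // channel already holds 2 events
        System.out.println(agent.drain());
    }
}
```

A real Agent runs the source and the sink on separate threads and wraps channel access in transactions; the toy leaves both out.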

2. Flume Installation

With the Java and Hadoop environment variables configured in advance, just unpack the archive and it is ready to use.

3. Flume Official Example

The official documentation has example configurations for the different sources, channels, and sinks.

# example.conf : port -> console
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start command

bin/flume-ng agent -c conf -f jobs/example.conf -n a1 -Dflume.root.logger=INFO,console

Send data

# yum install -y nc
nc localhost 44444

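4. Flume Examples

4.1 File New Content -> HDFS

Collects content appended to a file and writes it to HDFS. It cannot resume from where it left off after a restart.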

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /data/test.log

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events/%Y%m%d
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 24
a1.sinks.k1.hdfs.roundUnit = hour
a1.sinks.k1.hdfs.fileType = DataStream

a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start

bin/flume-ng agent -c conf -f jobs/log2hdfs.conf -n a1

4.2 Dir New File -> HDFS

Collects new files in a directory and writes them to HDFS. It cannot monitor changes to a file's content.

a1.sources = src-1
a1.sources.src-1.type = spooldir
a1.sources.src-1.channels = c1
a1.sources.src-1.spoolDir = /data/data1
a1.sources.src-1.fileHeader = true

a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000

a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d/%H
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.fileType = DataStream

Start

bin/flume-ng agent -c conf -f jobs/file2hdfs.conf -n a1
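The hdfs.round, hdfs.roundValue, and hdfs.roundUnit settings above round the event timestamp down before the time escapes in hdfs.path are expanded. The following is a minimal sketch of that idea in plain Java, not Flume's implementation: ToyPathRounding is a hypothetical name, and Java's DateTimeFormatter patterns stand in for escapes such as %Y-%m-%d/%H.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Toy model only: hypothetical class, NOT Flume's API.
class ToyPathRounding {

    // Round the timestamp down to a multiple of roundMinutes, then format it.
    static String bucket(LocalDateTime eventTime, int roundMinutes, String pattern) {
        int flooredMinute = (eventTime.getMinute() / roundMinutes) * roundMinutes;
        LocalDateTime rounded = eventTime.withMinute(flooredMinute).withSecond(0).withNano(0);
        return rounded.format(DateTimeFormatter.ofPattern(pattern));
    }

    public static void main(String[] args) {
        LocalDateTime t = LocalDateTime.of(2021, 12, 26, 14, 58, 16);
        // Hour-level path: 10-minute rounding is not visible in the result.
        System.out.println(bucket(t, 10, "'/flume/events/'yyyy-MM-dd/HH"));
        // Minute-level path: 14:58 falls into the 14:50 bucket.
        System.out.println(bucket(t, 10, "'/flume/events/'yyyy-MM-dd/HHmm"));
    }
}
```

Under these assumptions, 10-minute rounding only changes the directory layout when the path also contains a minute-level escape; with a path that stops at the hour, every event in the same hour lands in the same directory either way.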

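4.3 Dir New File And Content -> HDFS

Monitors files under multiple directories, including content appended to them, and writes to HDFS. It can resume from where it left off. Note that log4j renames log files when it rolls them, and a renamed file is uploaded again.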

a1.sources = r1
a1.sources.r1.type = TAILDIR
a1.sources.r1.channels = c1
a1.sources.r1.positionFile = /var/log/flume/taildir_position.json
a1.sources.r1.filegroups = f1 f2
a1.sources.r1.filegroups.f1 = /data/data2/.*file.*
a1.sources.r1.filegroups.f2 = /data/data3/.*log.*
a1.sources.r1.maxBatchCount = 1000

a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000

a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events2/%Y-%m-%d/%H
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.fileType = DataStream

Start

bin/flume-ng agent -c conf -f jobs/dir2hdfs.conf -n a1

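[{"inode":786450,"pos":1501,"file":"/data/data2/file1.txt"}] In the source code, inode and file together identify a file.

To handle renamed files, modify TailFile.java (line 123) and ReliableTaildirEventReader.java (line 256), rebuild, and replace the taildir source jar under Flume's lib directory.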
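Why a rename triggers a re-upload can be sketched in plain Java. This is a simplified model, not the TAILDIR source code: ToyPositionTracker and its requireSamePath flag are hypothetical. With the flag on, a saved position is reused only when both inode and path match; with it off, inode alone is enough, which is the effect the patch described above is after.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: hypothetical class, NOT the TAILDIR source implementation.
class ToyPositionTracker {

    private static final class Entry {
        final String path;
        final long pos;

        Entry(String path, long pos) {
            this.path = path;
            this.pos = pos;
        }
    }

    private final Map<Long, Entry> byInode = new HashMap<>();
    private final boolean requireSamePath;

    ToyPositionTracker(boolean requireSamePath) {
        this.requireSamePath = requireSamePath;
    }

    // Remember how far a file has been read.
    void save(long inode, String path, long pos) {
        byInode.put(inode, new Entry(path, pos));
    }

    // Offset to resume from; 0 means the file is treated as new and read from the start.
    long resumeOffset(long inode, String path) {
        Entry e = byInode.get(inode);
        if (e == null) {
            return 0L;
        }
        if (requireSamePath && !e.path.equals(path)) {
            return 0L; // same inode, different name: read again from the beginning
        }
        return e.pos;
    }

    public static void main(String[] args) {
        ToyPositionTracker strict = new ToyPositionTracker(true);
        strict.save(786450L, "/data/data2/file1.txt", 1501L);
        System.out.println(strict.resumeOffset(786450L, "/data/data2/file1.txt.1"));

        ToyPositionTracker inodeOnly = new ToyPositionTracker(false);
        inodeOnly.save(786450L, "/data/data2/file1.txt", 1501L);
        System.out.println(inodeOnly.resumeOffset(786450L, "/data/data2/file1.txt.1"));
    }
}
```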

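5. Flume Transactions

A Source pushes events into a Channel and a Sink pulls events from a Channel. In both directions the events first go through a temporary buffer.

Source -> Channel: doPut writes into putList. On rollback the putList buffer is simply cleared, so data can be lost, unless the Source keeps a position record and can read the data again.
Channel -> Sink: doTake moves events into takeList. On rollback the taken events are written back to the Channel queue in reverse, so data can be duplicated.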
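The put and take sides can be sketched with two temporary lists around a queue. This is a toy model in plain Java, assuming one transaction at a time and no capacity limits; ToyChannel is a hypothetical class, and only the method comments borrow Flume's doPut, doTake, doCommit, and doRollback names.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy model only: hypothetical class, NOT Flume's MemoryChannel.
class ToyChannel {

    private final Deque<String> queue = new ArrayDeque<>();    // committed events
    private final List<String> putList = new ArrayList<>();    // staged by the source
    private final Deque<String> takeList = new ArrayDeque<>(); // handed to the sink

    // doPut: stage an event; it is not visible to sinks until commit.
    void put(String event) {
        putList.add(event);
    }

    // doTake: hand the oldest committed event to the sink and remember it.
    String take() {
        String event = queue.pollFirst();
        if (event != null) {
            takeList.addLast(event);
        }
        return event;
    }

    // doCommit: staged puts become visible, remembered takes are forgotten.
    void commit() {
        queue.addAll(putList);
        putList.clear();
        takeList.clear();
    }

    // doRollback: taken events go back to the head of the queue in their
    // original order; staged puts are dropped.
    void rollback() {
        while (!takeList.isEmpty()) {
            queue.addFirst(takeList.removeLast());
        }
        putList.clear();
    }

    List<String> snapshot() {
        return new ArrayList<>(queue);
    }

    public static void main(String[] args) {
        ToyChannel channel = new ToyChannel();
        channel.put("a");
        channel.put("b");
        channel.commit();
        channel.take();
        channel.take();
        channel.rollback();   // the sink failed: both events return to the queue
        channel.put("c");
        channel.rollback();   // the source failed: "c" is dropped
        System.out.println(channel.snapshot());
    }
}
```

In the toy, a sink that had already written "a" somewhere before the rollback would write it again on the next take, which is where the duplicates come from.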

6. How a Flume Agent Works

Source: receives data
Source -> Channel Processor: processes events
Channel Processor -> Interceptor: intercepts and filters events
Channel Processor -> Channel Selector: replicating by default; multiplexing is also available
Channel Processor -> Channel n: writes the event to the channel
Channel -> Sink Processor: three types: Default (the default, a single Sink), LoadBalancing (load balancing), Failover (failover)
Sink Processor -> Sink: writes to the Sink
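The difference between the two channel selectors can be sketched as two small functions. This is plain Java with hypothetical names (ToySelectors and its methods), not Flume's selector classes; the defaults parameter is an assumption standing in for a fallback channel list.

```java
import java.util.List;
import java.util.Map;

// Toy model only: hypothetical class, NOT Flume's ChannelSelector API.
class ToySelectors {

    // Replicating: every event goes to every channel.
    static List<String> replicating(List<String> allChannels) {
        return allChannels;
    }

    // Multiplexing: the value of one header decides which channels get the event.
    static List<String> multiplexing(Map<String, String> headers,
                                     String headerName,
                                     Map<String, List<String>> mapping,
                                     List<String> defaults) {
        String value = headers.get(headerName);
        List<String> target = (value == null) ? null : mapping.get(value);
        return (target != null) ? target : defaults;
    }

    public static void main(String[] args) {
        Map<String, List<String>> mapping =
                Map.of("error", List.of("c1"), "normal", List.of("c2"));
        System.out.println(replicating(List.of("c1", "c2")));
        System.out.println(multiplexing(Map.of("type", "error"), "type", mapping, List.of("c2")));
        System.out.println(multiplexing(Map.of(), "type", mapping, List.of("c2")));
    }
}
```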

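7. Flume Topologies

Multiple Flume agents are connected to each other through Avro.

Polling strategy: when a Sink pulls no data, the next Sink takes over.

Simple chaining: Sink -> Source
Replication and multiplexing: multiple Channels -> multiple Sinks
Load balancing and failover: Channel -> multiple Sinks
Aggregation: multiple Sinks -> Source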
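One way to picture the "Channel -> multiple Sinks" case is a round-robin loop that moves on to the next sink when the current one fails. The sketch below is plain Java under that assumption; ToySinkGroup is a hypothetical class, each sink is modelled as a Predicate that returns false on failure, and none of this is Flume's sink processor code.

```java
import java.util.List;
import java.util.function.Predicate;

// Toy model only: hypothetical class, NOT Flume's SinkProcessor API.
class ToySinkGroup {

    // Each sink returns true when it delivered the event, false when it failed.
    private final List<Predicate<String>> sinks;
    private int next = 0;

    ToySinkGroup(List<Predicate<String>> sinks) {
        this.sinks = sinks;
    }

    // Try sinks in round-robin order starting at 'next'.
    // Returns the index of the sink that accepted the event, or -1 if all failed.
    int process(String event) {
        for (int i = 0; i < sinks.size(); i++) {
            int idx = (next + i) % sinks.size();
            if (sinks.get(idx).test(event)) {
                next = (idx + 1) % sinks.size();
                return idx;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        ToySinkGroup group = new ToySinkGroup(
                List.<Predicate<String>>of(e -> true, e -> false, e -> true));
        System.out.println(group.process("e1")); // sink 0 accepts
        System.out.println(group.process("e2")); // sink 1 fails, sink 2 accepts
        System.out.println(group.process("e3")); // back to sink 0
    }
}
```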

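8. Flume Custom Interceptor

Use a custom Interceptor to implement multiplexing:

Events enter different Channels depending on their header values.

Events whose body contains Error or Exception go to one Channel; all others go to another Channel.

The Sink of each Channel prints to the console.

Coding the custom Interceptor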

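package com.ipinyou.flume.interceptor;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Tags every event with a "type" header so a multiplexing selector can route it.
public class TypeInterceptor implements Interceptor {

    // Reused across batches; cleared at the start of each batch call
    private List<Event> eventList;

    @Override
    public void initialize() {
        eventList = new ArrayList<>();
    }

    // Single event: type=error if the body contains Error or Exception, otherwise type=normal
    @Override
    public Event intercept(Event event) {
        Map<String, String> headers = event.getHeaders();
        // Decode as UTF-8 explicitly instead of relying on the platform default charset
        String body = new String(event.getBody(), StandardCharsets.UTF_8);
        if (body.contains("Error") || body.contains("Exception")) {
            headers.put("type", "error");
        } else {
            headers.put("type", "normal");
        }
        return event;
    }

    // Batch: apply the single-event logic to each event in order
    @Override
    public List<Event> intercept(List<Event> list) {
        eventList.clear();
        for (Event event : list) {
            eventList.add(intercept(event));
        }
        return eventList;
    }

    @Override
    public void close() {
        // Nothing to release
    }

    // Flume instantiates the interceptor through this Builder (referenced as TypeInterceptor$Builder)
    public static class Builder implements Interceptor.Builder{

        @Override
        public Interceptor build() {
            return new TypeInterceptor();
        }

        @Override
        public void configure(Context context) {
            // No configuration parameters
        }
    }
}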
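The routing rule itself can be checked without Flume on the classpath. The sketch below pulls the same contains check into a hypothetical helper class, ToyTypeRule, which is not part of the interceptor above; it is only meant to show what the rule matches.

```java
// Toy model only: hypothetical helper mirroring the interceptor's rule.
class ToyTypeRule {

    // Same check as the interceptor: case-sensitive substring match.
    static String classify(String body) {
        return (body.contains("Error") || body.contains("Exception")) ? "error" : "normal";
    }

    public static void main(String[] args) {
        System.out.println(classify("java.lang.NullPointerException at Foo.bar"));
        System.out.println(classify("Error: disk full"));
        System.out.println(classify("disk error, retrying")); // lower-case "error" does not match
    }
}
```

Because contains is case-sensitive, lines that only say "error" or "ERROR" are tagged normal. Whether that is acceptable depends on the log format being collected.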

Package it and upload the jar to Flume's lib directory
Write the collection plans

flume-s1-s2.conf

a1.sources = r1
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop102
a1.sources.r1.port = 6666
a1.sources.r1.channels = c1 c2
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = com.ipinyou.flume.interceptor.TypeInterceptor$Builder
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = type
a1.sources.r1.selector.mapping.error = c1
a1.sources.r1.selector.mapping.normal = c2

a1.channels = c1 c2
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000
a1.channels.c2.type = memory
a1.channels.c2.capacity = 10000
a1.channels.c2.transactionCapacity = 10000
a1.channels.c2.byteCapacityBufferPercentage = 20
a1.channels.c2.byteCapacity = 800000


a1.sinks = k1 k2
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = hadoop103
a1.sinks.k1.port = 7771
a1.sinks.k2.type = avro
a1.sinks.k2.channel = c2
a1.sinks.k2.hostname = hadoop104
a1.sinks.k2.port = 7772

flume-console1.conf

a1.sources = r1
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = hadoop103
a1.sources.r1.port = 7771

a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000

a1.sinks = k1
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1

flume-console2.conf

a1.sources = r1
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = hadoop104
a1.sources.r1.port = 7772

a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000

a1.sinks = k1
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1

Start

# Start in this order: hadoop103, hadoop104, hadoop102
bin/flume-ng agent -c conf -f jobs/flume-console1.conf -n a1 -Dflume.root.logger=INFO,console
bin/flume-ng agent -c conf -f jobs/flume-console2.conf -n a1 -Dflume.root.logger=INFO,console
bin/flume-ng agent -c conf -f jobs/flume-s1-s2.conf -n a1

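9. Flume Custom Source

Implementation

The custom class extends AbstractSource and implements Configurable and PollableSource
Implement configure(): read the configuration file
Implement process(): receive external data, wrap it in an Event, and write it to the Channel

Package it into the lib directory
Write the configuration file

source type: the fully qualified class name

Start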

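10. Flume Custom Sink

Implementation

The custom class extends AbstractSink and implements Configurable
Implement configure(): read the configuration file
Implement process(): take data from the Channel, open a transaction, and write to the target location

The remaining steps are the same as for the custom Source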

11. Flume Monitoring

Monitoring relies on Ganglia, a third-party open-source tool.

Ganglia consists of: web (displays data), gmetad (stores data), gmond (collects data)

11.1 Installing Ganglia

Install

# 102 103 104
yum install -y epel-release
# 102
yum install -y ganglia-gmetad
yum install -y ganglia-web
yum install -y ganglia-gmond
# 103 104
yum install -y ganglia-gmond

Edit the configuration files

/etc/httpd/conf.d/ganglia.conf

# Inside the Location block, allow the IP of your Windows machine
Require ip 192.168.xxx.xxx

/etc/ganglia/gmetad.conf

data_source "my cluster" hadoop102

/etc/ganglia/gmond.conf: distribute to hadoop102, 103, and 104

# Change the following settings
name = "my cluster"
host = hadoop102
bind = 0.0.0.0

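Disable SELinux in /etc/selinux/config. This takes effect only after a reboot; alternatively, disable it temporarily.

SELINUX=disabled

# Temporary, until the next reboot
setenforce 0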

11.2 Starting Ganglia

# If permissions are insufficient, change them
chmod -R 777 /var/lib/ganglia
# hadoop102
systemctl start gmond
systemctl start httpd
systemctl start gmetad

# hadoop103 hadoop104
systemctl start gmond

Open the Web UI in a browser:

http://hadoop102/ganglia

11.3 Starting Flume

bin/flume-ng agent -n a1 -c conf -f jobs/xxx \
-Dflume.root.logger=INFO,console \
-Dflume.monitoring.type=ganglia \
-Dflume.monitoring.hosts=hadoop102:8649

Source: https://blog.csdn.net/qq_41200768/article/details/122155427