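Big Data Tutorial, SD Edition, Part 9: Flume
Author: Internet
Flume is a log collection tool. Since it is a tool, the focus here is on how to use it.
It is a distributed streaming framework for collecting, processing, and aggregating data.
You collect data by writing a collection plan, that is, a configuration file. The available options are described in the official documentation.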
1. Flume Architecture
- Agent: a JVM process
- Source: receives data
- Channel: buffers data
- Sink: outputs data
- Event: the unit of transfer
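The roles listed above can be pictured with a small plain-Java model. This is only a sketch and does not use the Flume API: `AgentModelSketch`, `SimpleEvent`, `receive`, and `drain` are hypothetical names, and a bounded `ArrayBlockingQueue` stands in for a memory channel with a fixed capacity.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class AgentModelSketch {
    // Hypothetical stand-in for a Flume Event: optional headers plus a byte[] body.
    static class SimpleEvent {
        final Map<String, String> headers = new HashMap<>();
        final byte[] body;

        SimpleEvent(String text) {
            this.body = text.getBytes(StandardCharsets.UTF_8);
        }
    }

    // Source side: wrap incoming text in an event and offer it to the channel.
    // Returns false when the channel is full.
    static boolean receive(BlockingQueue<SimpleEvent> channel, String line) {
        return channel.offer(new SimpleEvent(line));
    }

    // Sink side: drain everything currently buffered and decode the bodies.
    static List<String> drain(BlockingQueue<SimpleEvent> channel) {
        List<SimpleEvent> batch = new ArrayList<>();
        channel.drainTo(batch);
        List<String> out = new ArrayList<>();
        for (SimpleEvent event : batch) {
            out.add(new String(event.body, StandardCharsets.UTF_8));
        }
        return out;
    }

    public static void main(String[] args) {
        // The channel is the buffer between source and sink; capacity is 2 here.
        BlockingQueue<SimpleEvent> channel = new ArrayBlockingQueue<>(2);
        System.out.println(receive(channel, "a")); // true
        System.out.println(receive(channel, "b")); // true
        System.out.println(receive(channel, "c")); // false: the channel is full
        System.out.println(drain(channel));        // [a, b]
    }
}
```

The bounded queue is the point of the sketch: the channel decouples the rate at which the source receives data from the rate at which the sink writes it out.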
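2. Installing Flume
Configure the Java and Hadoop environment variables beforehand. After that, Flume works as soon as the archive is extracted.
3. Official Flume Example
The official documentation has sample configurations for the various sources, channels, and sinks.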
# example.conf : port -> console
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start command
bin/flume-ng agent -c conf -f jobs/example.conf -n a1 -Dflume.root.logger=INFO,console
Send data
# yum install -y nc
nc localhost 44444
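4. Flume Examples
4.1 File New Content -> HDFS
Collects content appended to a file and writes it to HDFS. It cannot resume from where it left off after a restart.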
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /data/test.log
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events/%Y%m%d
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 24
a1.sinks.k1.hdfs.roundUnit = hour
a1.sinks.k1.hdfs.fileType = DataStream
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start
bin/flume-ng agent -c conf -f jobs/log2hdfs.conf -n a1
4.2 Dir New File -> HDFS
Collects new files that appear in a directory and writes them to HDFS. It cannot track changes to the content of a file.
a1.sources = src-1
a1.sources.src-1.type = spooldir
a1.sources.src-1.channels = c1
a1.sources.src-1.spoolDir = /data/data1
a1.sources.src-1.fileHeader = true
a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d/%H
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.fileType = DataStream
Start
bin/flume-ng agent -c conf -f jobs/file2hdfs.conf -n a1
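4.3 Dir New File and Content -> HDFS
Monitors files across multiple directories, including changes to their content, and writes them to HDFS. It can resume from the last recorded position. Note that log4j renames log files when it rolls them, and a renamed file is uploaded again from the beginning.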
a1.sources = r1
a1.sources.r1.type = TAILDIR
a1.sources.r1.channels = c1
a1.sources.r1.positionFile = /var/log/flume/taildir_position.json
a1.sources.r1.filegroups = f1 f2
a1.sources.r1.filegroups.f1 = /data/data2/.*file.*
a1.sources.r1.filegroups.f2 = /data/data3/.*log.*
a1.sources.r1.maxBatchCount = 1000
a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events2/%Y-%m-%d/%H
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.fileType = DataStream
Start
bin/flume-ng agent -c conf -f jobs/dir2hdfs.conf -n a1
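Example entry in the position file:
[{"inode":786450,"pos":1501,"file":"/data/data2/file1.txt"}]
The source code uses the inode and the file path together to identify a file.
To handle the rename problem, modify TailFile.java (line 123) and ReliableTaildirEventReader.java (line 256), rebuild, and replace the taildir source jar in Flume's lib directory.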
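To see why a rename triggers a re-upload, the following plain-Java sketch models the position lookup. It is not Flume's source code: `PositionLookupSketch`, `PositionEntry`, `resumeOffsetStrict`, and `resumeOffsetByInode` are hypothetical names. The strict variant requires both the inode and the path to match, as described above, while the inode-only variant illustrates the general idea behind the patch.

```java
import java.util.ArrayList;
import java.util.List;

class PositionLookupSketch {
    // Hypothetical model of one record in the TAILDIR position file.
    static class PositionEntry {
        final long inode;
        final long pos;
        final String file;

        PositionEntry(long inode, long pos, String file) {
            this.inode = inode;
            this.pos = pos;
            this.file = file;
        }
    }

    // Strict matching: inode AND path must both match, otherwise reading starts at offset 0.
    static long resumeOffsetStrict(List<PositionEntry> entries, long inode, String path) {
        for (PositionEntry entry : entries) {
            if (entry.inode == inode && entry.file.equals(path)) {
                return entry.pos;
            }
        }
        return 0L;
    }

    // Relaxed matching: the inode alone identifies the file, so a rename keeps the offset.
    static long resumeOffsetByInode(List<PositionEntry> entries, long inode) {
        for (PositionEntry entry : entries) {
            if (entry.inode == inode) {
                return entry.pos;
            }
        }
        return 0L;
    }

    public static void main(String[] args) {
        List<PositionEntry> entries = new ArrayList<>();
        entries.add(new PositionEntry(786450L, 1501L, "/data/data2/file1.txt"));

        // The file is renamed by log rolling; its inode stays the same.
        String renamed = "/data/data2/file1.txt.1";
        System.out.println(resumeOffsetStrict(entries, 786450L, renamed)); // 0
        System.out.println(resumeOffsetByInode(entries, 786450L));         // 1501
    }
}
```

With strict matching the renamed file looks like a new file and is read from offset 0, which is the duplicate upload described above.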
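5. Flume Transactions
A Source pushes events to the Channel, and a Sink pulls events from the Channel. In both directions the events pass through a temporary buffer first.
- Source -> Channel: doPut stages events in putList. A rollback simply discards the staged events, so data can be lost unless the source keeps a position record and can read the data again.
- Channel -> Sink: doTake moves events into takeList. A rollback writes the taken events back to the head of the Channel queue, so data can be duplicated.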
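The put and take semantics above can be sketched in plain Java. This is a simplified single-threaded model, not Flume's implementation: `ChannelTransactionSketch` and its method names are hypothetical, and real Flume transactions are separate objects with their own lifecycle.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class ChannelTransactionSketch {
    private final Deque<String> queue = new ArrayDeque<>();    // the channel itself
    private final List<String> putList = new ArrayList<>();    // staged by the source
    private final Deque<String> takeList = new ArrayDeque<>(); // handed to the sink

    // Source side: events are staged first and reach the queue only on commit.
    void put(String event) { putList.add(event); }
    void commitPut() { queue.addAll(putList); putList.clear(); }
    void rollbackPut() { putList.clear(); } // staged events are dropped

    // Sink side: events leave the queue but are remembered until commit.
    String take() {
        String event = queue.pollFirst();
        if (event != null) {
            takeList.addLast(event);
        }
        return event;
    }
    void commitTake() { takeList.clear(); }
    void rollbackTake() {
        // Push the taken events back to the head of the queue in their original order.
        while (!takeList.isEmpty()) {
            queue.addFirst(takeList.pollLast());
        }
    }

    List<String> snapshot() { return new ArrayList<>(queue); }

    public static void main(String[] args) {
        ChannelTransactionSketch channel = new ChannelTransactionSketch();
        channel.put("a");
        channel.rollbackPut();
        System.out.println(channel.snapshot()); // []

        channel.put("a");
        channel.put("b");
        channel.put("c");
        channel.commitPut();
        channel.take();
        channel.take();
        channel.rollbackTake();
        System.out.println(channel.snapshot()); // [a, b, c]

        channel.take();
        channel.commitTake();
        System.out.println(channel.snapshot()); // [b, c]
    }
}
```

If a sink has already written part of a batch to its destination before the rollback, the restored events are delivered again on the next attempt, which is where duplicates come from.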
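6. Flume Agent Internals
- Source: receives data
- Source -> Channel Processor: the processor handles the events
- Channel Processor -> Interceptor: events are intercepted and filtered
- Channel Processor -> Channel Selector: replicating by default; multiplexing is also available
- Channel Processor -> Channel n: events are written to the selected channels
- Channel -> Sink Processor: three types: Default (a single Sink, the default), LoadBalancing (load balancing), and Failover (failover)
- Sink Processor -> Sink: the event is handed to the Sink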
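The difference between the two channel selectors can be sketched in plain Java. This does not use the Flume API: `SelectorSketch`, `replicating`, and `multiplexing` are hypothetical names, channels are represented by their names only, and the fallback list plays the role of a default channel for unmapped header values.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SelectorSketch {
    // Replicating: every event is written to every channel.
    static List<String> replicating(List<String> allChannels) {
        return allChannels;
    }

    // Multiplexing: the value of one header picks the channels;
    // events with a missing or unmapped value go to the default channels.
    static List<String> multiplexing(Map<String, String> headers, String headerName,
                                     Map<String, List<String>> mapping,
                                     List<String> defaultChannels) {
        String value = headers.get(headerName);
        List<String> mapped = (value == null) ? null : mapping.get(value);
        return (mapped != null) ? mapped : defaultChannels;
    }

    public static void main(String[] args) {
        Map<String, List<String>> mapping = new HashMap<>();
        mapping.put("error", Collections.singletonList("c1"));
        mapping.put("normal", Collections.singletonList("c2"));
        List<String> defaults = Collections.singletonList("c2");

        Map<String, String> headers = new HashMap<>();
        headers.put("type", "error");

        System.out.println(multiplexing(headers, "type", mapping, defaults));                       // [c1]
        System.out.println(multiplexing(new HashMap<String, String>(), "type", mapping, defaults)); // [c2]
        System.out.println(replicating(Arrays.asList("c1", "c2")));                                 // [c1, c2]
    }
}
```

Because multiplexing depends on a header, it is usually paired with an interceptor that sets that header before the selector runs.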
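7. Flume Topologies
Multiple Flume agents are connected to each other through Avro.
Polling strategy: when a Sink pulls no data, the processor moves on to another Sink.
- Simple chaining: Sink -> Source
- Replicating and multiplexing: multiple Channels -> multiple Sinks
- Load balancing and failover: Channel -> multiple Sinks
- Aggregation: multiple Sinks -> Source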
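How one Channel can feed several Sinks is easier to see in code. The sketch below models the two selection policies in plain Java, without the Flume API: `SinkProcessorSketch`, `failover`, and `RoundRobin` are hypothetical names. It assumes that a larger priority value wins for failover and that load balancing rotates through the sinks in order.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class SinkProcessorSketch {
    // Failover: choose the healthy sink with the highest priority value.
    static String failover(Map<String, Integer> priorities, Set<String> failed) {
        String best = null;
        int bestPriority = Integer.MIN_VALUE;
        for (Map.Entry<String, Integer> entry : priorities.entrySet()) {
            if (failed.contains(entry.getKey())) {
                continue; // skip sinks that are currently marked as failed
            }
            if (best == null || entry.getValue() > bestPriority) {
                best = entry.getKey();
                bestPriority = entry.getValue();
            }
        }
        return best; // null when every sink has failed
    }

    // Load balancing: hand out the sinks in turn.
    static class RoundRobin {
        private final List<String> sinks;
        private int next = 0;

        RoundRobin(List<String> sinks) {
            this.sinks = sinks;
        }

        String pick() {
            String sink = sinks.get(next);
            next = (next + 1) % sinks.size();
            return sink;
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> priorities = new LinkedHashMap<>();
        priorities.put("k1", 5);
        priorities.put("k2", 10);

        System.out.println(failover(priorities, Collections.<String>emptySet()));         // k2
        System.out.println(failover(priorities, new HashSet<String>(Arrays.asList("k2")))); // k1

        RoundRobin balancer = new RoundRobin(Arrays.asList("k1", "k2"));
        System.out.println(balancer.pick()); // k1
        System.out.println(balancer.pick()); // k2
        System.out.println(balancer.pick()); // k1
    }
}
```

Failover keeps all traffic on one sink until it breaks, while load balancing spreads traffic across all sinks all the time.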
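8. Custom Flume Interceptor
Use a custom Interceptor to implement multiplexing:
Events enter different Channels depending on a header value.
Events whose body contains Error or Exception go to one Channel; all other events go to another Channel.
The Sink of each Channel prints to the console.
- Write the custom Interceptor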
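package com.ipinyou.flume.interceptor;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Tags every event with a "type" header so the multiplexing selector can route it.
public class TypeInterceptor implements Interceptor {

    @Override
    public void initialize() {
        // No state to set up.
    }

    // Single event: mark it "error" if the body mentions Error or Exception, else "normal".
    @Override
    public Event intercept(Event event) {
        Map<String, String> headers = event.getHeaders();
        // Decode explicitly as UTF-8 instead of relying on the platform default charset.
        String body = new String(event.getBody(), StandardCharsets.UTF_8);
        if (body.contains("Error") || body.contains("Exception")) {
            headers.put("type", "error");
        } else {
            headers.put("type", "normal");
        }
        return event;
    }

    // Batch: tag each event. A fresh list per call avoids sharing state between batches.
    @Override
    public List<Event> intercept(List<Event> list) {
        List<Event> tagged = new ArrayList<>(list.size());
        for (Event event : list) {
            tagged.add(intercept(event));
        }
        return tagged;
    }

    @Override
    public void close() {
        // Nothing to release.
    }

    // Flume creates the interceptor through this Builder, which is why the
    // configuration refers to TypeInterceptor$Builder.
    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new TypeInterceptor();
        }

        @Override
        public void configure(Context context) {
            // This interceptor takes no configuration.
        }
    }
}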
- Package the class and upload the jar to Flume's lib directory
- Write the collection configurations
flume-s1-s2.conf
a1.sources = r1
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop102
a1.sources.r1.port = 6666
a1.sources.r1.channels = c1 c2
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = com.ipinyou.flume.interceptor.TypeInterceptor$Builder
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = type
a1.sources.r1.selector.mapping.error = c1
a1.sources.r1.selector.mapping.normal = c2
a1.channels = c1 c2
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000
a1.channels.c2.type = memory
a1.channels.c2.capacity = 10000
a1.channels.c2.transactionCapacity = 10000
a1.channels.c2.byteCapacityBufferPercentage = 20
a1.channels.c2.byteCapacity = 800000
a1.sinks = k1 k2
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = hadoop103
a1.sinks.k1.port = 7771
a1.sinks.k2.type = avro
a1.sinks.k2.channel = c2
a1.sinks.k2.hostname = hadoop104
a1.sinks.k2.port = 7772
flume-console1.conf
a1.sources = r1
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = hadoop103
a1.sources.r1.port = 7771
a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000
a1.sinks = k1
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
flume-console2.conf
a1.sources = r1
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = hadoop104
a1.sources.r1.port = 7772
a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000
a1.sinks = k1
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
Start
# Start the agents in this order: hadoop103, hadoop104, hadoop102
bin/flume-ng agent -c conf -f jobs/flume-console1.conf -n a1 -Dflume.root.logger=INFO,console
bin/flume-ng agent -c conf -f jobs/flume-console2.conf -n a1 -Dflume.root.logger=INFO,console
bin/flume-ng agent -c conf -f jobs/flume-s1-s2.conf -n a1
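9. Custom Flume Source
- Implementation
  - Write a class that extends AbstractSource and implements Configurable and PollableSource
  - Implement configure(): read settings from the configuration file
  - Implement process(): receive external data, wrap it in an Event, and write it to the Channel
- Package the jar into lib
- Write the configuration file
source type: the fully qualified class name
- Start the agent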
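10. Custom Flume Sink
- Implementation
  - Write a class that extends AbstractSink and implements Configurable
  - Implement configure(): read settings from the configuration file
  - Implement process(): take events from the Channel inside a transaction and write them to the destination
- The remaining steps are the same as for a custom Source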
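11. Flume Monitoring
Monitoring relies on Ganglia, a third-party open-source tool.
Ganglia components: ganglia-web displays the data, gmetad stores the data, and gmond collects the data.
11.1 Installing Ganglia
- Install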
# hadoop102 hadoop103 hadoop104
yum install -y epel-release
# hadoop102
yum install -y ganglia-gmetad
yum install -y ganglia-web
yum install -y ganglia-gmond
# hadoop103 hadoop104
yum install -y ganglia-gmond
- Edit the configuration files
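/etc/httpd/conf.d/ganglia.conf
# Inside the Location block, allow the IP of the Windows machine that opens the web UI
Require ip 192.168.xxx.xxx
/etc/ganglia/gmetad.conf
data_source "my cluster" hadoop102
/etc/ganglia/gmond.conf (distribute to hadoop102, hadoop103, and hadoop104)
# Change the following settings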
name = "my cluster"
host = hadoop102
bind = 0.0.0.0
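Disable SELinux in /etc/selinux/config. The file change takes effect only after a reboot; setenforce applies a temporary change right away.
SELINUX=disabled
# Temporary, until the next reboot
setenforce 0
11.2 Starting Ganglia
# If permissions are insufficient, change them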
chmod -R 777 /var/lib/ganglia
# hadoop102
systemctl start gmond
systemctl start httpd
systemctl start gmetad
# hadoop103 hadoop104
systemctl start gmond
Open the web UI in a browser:
http://hadoop102/ganglia
11.3 Starting Flume
bin/flume-ng agent -n a1 -c conf -f jobs/xxx \
-Dflume.root.logger=INFO,console \
-Dflume.monitoring.type=ganglia \
-Dflume.monitoring.hosts=hadoop102:8649
Source: https://blog.csdn.net/qq_41200768/article/details/122155427