
Notes on Learning Flume

2022-06-09 17:04:21 Author: 互联网 (Internet)

Flume Definition

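　　Flume is a highly available, highly reliable, distributed system for collecting, aggregating, and transporting massive volumes of log data. It was originally developed by Cloudera and is now an Apache project. Flume is built on a streaming architecture, which keeps it flexible and simple. Its most common use is to read data from a server's local disk in real time and write that data to HDFS.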

Basic Architecture

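1 Agent
An Agent is a JVM process that moves data from its origin to its destination in the form of events. An Agent consists of three main parts: Source, Channel, and Sink.

2 Source
A Source is the component that receives data into the Flume Agent. Source components can handle log data of many types and formats, including avro, thrift, exec, jms, spooling directory, netcat, taildir, sequence generator, syslog, http, and legacy.

3 Sink
A Sink continuously polls the Channel for events and removes them in batches, then writes those batches to a storage or indexing system or sends them on to another Flume Agent. Sink destinations include hdfs, logger, avro, thrift, ipc, file, HBase, solr, and custom sinks.

4 Channel
A Channel is the buffer between a Source and a Sink, which lets the Source and the Sink run at different rates. A Channel is thread-safe and can handle writes from several Sources and reads from several Sinks at the same time. Two commonly used built-in Channels are Memory Channel and File Channel. Memory Channel is an in-memory queue. It is suitable when data loss is acceptable; when it is not, Memory Channel should be avoided, because a process crash, machine failure, or restart loses whatever is still in memory. File Channel writes every event to disk, so data is not lost when the process stops or the machine goes down.

5 Event
The Event is the basic unit of data transfer in Flume: data travels from origin to destination as Events. An Event consists of a Header and a Body. The Header stores attributes of the event as key-value pairs, and the Body stores the data itself as a byte array.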
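The relationship between Event, Source, Channel, and Sink can be illustrated with a small toy model. This is only a conceptual sketch in Python, not Flume's actual Java API: the class names Event and MemoryChannel and the methods put and take_batch are hypothetical names chosen for this illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Event:
    """Toy stand-in for a Flume Event: key-value headers plus a byte-array body."""
    body: bytes
    headers: dict = field(default_factory=dict)


class MemoryChannel:
    """Toy bounded in-memory queue sitting between a source and a sink."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._queue = deque()

    def put(self, event):
        # This toy channel refuses the write when it is full.
        if len(self._queue) >= self.capacity:
            return False
        self._queue.append(event)
        return True

    def take_batch(self, batch_size):
        # The sink side removes up to batch_size events in arrival order.
        count = min(batch_size, len(self._queue))
        return [self._queue.popleft() for _ in range(count)]


channel = MemoryChannel(capacity=3)

# Source side: wrap each incoming line in an Event and hand it to the channel.
lines = ["first", "second", "third", "fourth"]
accepted = [channel.put(Event(line.encode("utf-8"), {"origin": "demo"})) for line in lines]

# Sink side: drain the channel in batches of two.
first_batch = channel.take_batch(2)
second_batch = channel.take_batch(2)

print(accepted)                        # [True, True, True, False]
print([e.body for e in first_batch])   # [b'first', b'second']
print([e.body for e in second_batch])  # [b'third']
```

Because the channel holds events until the sink asks for them, the source can write faster or slower than the sink reads. Everything in this toy lives in memory, so a crash would lose any queued events, which is the trade-off described for Memory Channel.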

Getting Started

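We follow the example from the official site (Welcome to Apache Flume). Note that when writing a configuration, the properties shown in bold in the official documentation are required.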

1. In the flume/job directory, create flume-netcat-logger.conf and add the following content:

# flume-netcat-logger.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

2. Start the agent. The command can be written in two equivalent forms:

First form (long options):

[bawei@h1 flume]$ bin/flume-ng agent --conf conf/ --name a1 --conf-file job/flume-netcat-logger.conf -Dflume.root.logger=INFO,console

Second form (short options):

[bawei@h1 flume]$ bin/flume-ng agent -c conf/ -n a1 -f job/flume-netcat-logger.conf -Dflume.root.logger=INFO,console

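Parameter notes:
--conf/-c: the directory that holds Flume's configuration files, here conf/.
--name/-n: the name of the agent to run, here a1. It must match the agent name used in the configuration file.
--conf-file/-f: the configuration file to read for this run, here flume-netcat-logger.conf in the job folder.
-Dflume.root.logger=INFO,console: -D overrides the flume.root.logger property at run time, here setting the log level to INFO and printing the log to the console. Common log levels include debug, info, warn, and error.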

3. Use the netcat tool to send content to port 44444 on the local machine.
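If netcat is not installed, a few lines of Python can play the same role. The sketch below is a hypothetical helper (the name send_lines is not part of Flume) that opens a TCP connection, sends each string as one newline-terminated line, and reads one reply line per message. It assumes the agent from step 2 is listening on localhost:44444 and that the netcat source replies once per line, which is its default behavior.

```python
import socket


def send_lines(lines, host="localhost", port=44444, timeout=5.0):
    """Send each line to a TCP listener and collect one reply line per message."""
    replies = []
    with socket.create_connection((host, port), timeout=timeout) as conn, \
            conn.makefile("r", encoding="utf-8", newline="\n") as reader:
        for line in lines:
            # The netcat source treats each newline-terminated line as one event.
            conn.sendall((line + "\n").encode("utf-8"))
            # Wait for the listener's acknowledgement before sending the next line.
            replies.append(reader.readline().strip())
    return replies
```

With the agent running, calling send_lines(["hello", "flume"]) should produce one event per line in the agent's console, and each reply is expected to be OK when the source uses its default acknowledgement setting.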

　　4. In the terminal where the Flume agent is running, observe the data as it is received.

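　　　5. This example shows how Flume operates: a source receives data, a channel buffers it, and a sink delivers it. The structure is not complicated.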

　　Getting Started 1.1

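1) Create the configuration file flume-taildir-hdfs.conf
2) Set the source type to TAILDIR
3) Set the monitored file or directory to a location of your choice
4) Set the channel to memory mode
5) Set the sink type to hdfs
6) Set the prefix of uploaded files to upload-
7) Enable time-based rounding for folders (round = true)
8) Create a new folder every hour
9) Use the local timestamp
10) Accumulate 100 events before flushing to HDFS
11) Set the file type; compression is supported
12) Roll to a new file every 60 s
13) Set the roll size of each file to 128 MB
14) Make file rolling independent of the number of events
15) Start Flume
16) Append data to the monitored location
17) Check whether the data has arrived in HDFS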
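Steps 12 to 14 describe three separate roll triggers on the HDFS sink: hdfs.rollInterval (seconds), hdfs.rollSize (bytes), and hdfs.rollCount (events), where a value of 0 turns a trigger off. The following Python sketch models that decision and shows the byte value that corresponds to 128 MB. It is an illustration of the rule, not Flume's implementation, and the function names are hypothetical.

```python
def megabytes_to_bytes(megabytes):
    """Convert a size in MB (1 MB = 1024 * 1024 bytes) to bytes."""
    return megabytes * 1024 * 1024


def should_roll(age_seconds, size_bytes, event_count,
                roll_interval=60, roll_size=megabytes_to_bytes(128), roll_count=0):
    """Return True when any enabled trigger fires; a value of 0 disables that trigger."""
    by_time = roll_interval > 0 and age_seconds >= roll_interval
    by_size = roll_size > 0 and size_bytes >= roll_size
    by_count = roll_count > 0 and event_count >= roll_count
    return by_time or by_size or by_count


print(megabytes_to_bytes(128))                        # 134217728
print(should_roll(10, 1024, 5_000_000))               # False: the count trigger is disabled
print(should_roll(61, 1024, 10))                      # True: older than 60 seconds
print(should_roll(10, megabytes_to_bytes(128), 10))   # True: reached 128 MB
```

Note that 131072 is 128 * 1024, which is 128 KB rather than 128 MB, so the value written to hdfs.rollSize needs to be 134217728 to match step 13.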

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1
# File to tail; adjust to your own log location
a1.sources.r1.filegroups.f1 = /opt/log/a.log
a1.sources.r1.headers.f1.headerKey1 = value1

a1.sources.r1.fileHeader = true
a1.sources.r1.maxBatchCount = 1000

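# Describe the sink
a1.sinks.k1.type = hdfs
# %H adds an hourly subfolder, so the hourly rounding below yields one folder per hour
a1.sinks.k1.hdfs.path = /flume/%Y%m%d/%H
# Prefix for uploaded files
a1.sinks.k1.hdfs.filePrefix = upload-

# Round the timestamp down to the hour before expanding the path
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 1
a1.sinks.k1.hdfs.roundUnit = hour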

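# Use the local time instead of a timestamp header on the event
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Number of events written to a file before it is flushed to HDFS
a1.sinks.k1.hdfs.batchSize = 100

# Compressed output requires a codec
a1.sinks.k1.hdfs.codeC = gzip
a1.sinks.k1.hdfs.fileType = CompressedStream

# Roll to a new file every 60 seconds
a1.sinks.k1.hdfs.rollInterval = 60
# Roll at 128 MB (value in bytes: 128 * 1024 * 1024)
a1.sinks.k1.hdfs.rollSize = 134217728
# 0 disables rolling based on the number of events
a1.sinks.k1.hdfs.rollCount = 0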

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
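The hdfs.round, hdfs.roundValue, and hdfs.roundUnit settings round the event timestamp down before the date escapes in hdfs.path are expanded. The Python sketch below imitates that behavior for hour-based rounding so the effect on the folder name is easy to see. It is an approximation for illustration only: the function names are hypothetical, and it relies on %Y, %m, %d, %H, and %M meaning the same thing in Python's strftime as in the path escapes used here.

```python
from datetime import datetime


def round_down_to_hours(ts, hours=1):
    """Round a timestamp down to a multiple of `hours`, dropping minutes and seconds."""
    return ts.replace(hour=ts.hour - ts.hour % hours, minute=0, second=0, microsecond=0)


def resolve_path(pattern, ts, hours=1):
    """Expand date escapes in a path pattern using the rounded-down timestamp."""
    return round_down_to_hours(ts, hours).strftime(pattern)


ts = datetime(2022, 6, 9, 17, 4, 21)
print(resolve_path("/flume/%Y%m%d", ts))        # /flume/20220609
print(resolve_path("/flume/%Y%m%d/%H", ts))     # /flume/20220609/17
print(resolve_path("/flume/%Y%m%d/%H%M", ts))   # /flume/20220609/1700
```

A path that contains only %Y%m%d resolves to one folder per day regardless of hourly rounding, so a path needs an hour escape such as %H for a new folder to appear every hour.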

Run

bin/flume-ng agent --conf conf --conf-file job/flume-taildir-hdfs.conf --name a1 -Dflume.root.logger=INFO,console

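Result

After data is appended to the monitored file, files whose names start with upload- should appear under /flume in HDFS.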

Source: https://www.cnblogs.com/lenny-z/p/16359933.html