A Troubleshooting Diary: Collecting Data into HDFS with Flume
Author: Internet
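Preface
This post should help anyone who wants to know what to watch out for when using Flume to collect data into HDFS. To simulate a real environment, I set up a temporary virtual machine and stored the data under Tomcat on it; Flume then ships that data from this virtual machine to HDFS on a second virtual machine.
Versions used in this environment: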
- apache-tomcat-8.5.63
- flume-ng-1.6.0-cdh5.14.2
- hadoop-2.6.0-cdh5.14.2
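1. Flume agent configuration
Without further ado, here is the agent configuration, with a short comment explaining each line:
(If anything is still unclear, see the user guide on the official site.)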
# Name the source, channel, and sink
a1.channels = c1
a1.sources = s1
a1.sinks = k1
# Use a Spooling Directory Source (a source type dedicated to ingesting files)
a1.sources.s1.type = spooldir
a1.sources.s1.channels = c1
# Directory to pick files up from
a1.sources.s1.spoolDir = /opt/software/tomcat8563/webapps/mycurd/log
# Input character set (Flume defaults to UTF-8; my logs are GBK-encoded)
a1.sources.s1.inputCharset = GBK
# Use a File Channel
a1.channels.c1.type = file
# Checkpoint directory
a1.channels.c1.checkpointDir = /opt/flume/checkpoint
# Data directory
a1.channels.c1.dataDirs = /opt/flume/data
# Use an HDFS Sink
a1.sinks.k1.type = hdfs
# HDFS target path (the trailing part uses escape sequences for the date)
a1.sinks.k1.hdfs.path = hdfs://192.168.237.130:9000/upload/%Y%m%d
# Prefix for the file names
a1.sinks.k1.hdfs.filePrefix = upload-
# Use the local timestamp when expanding the escape sequences
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Number of events written before a flush to HDFS
a1.sinks.k1.hdfs.batchSize = 100
# File type (DataStream writes the events as an uncompressed stream)
a1.sinks.k1.hdfs.fileType = DataStream
# Seconds to wait before rolling over to the next file
a1.sinks.k1.hdfs.rollInterval = 600
# File size, in bytes, at which the current file is rolled
a1.sinks.k1.hdfs.rollSize = 134217700
# Number of events at which the file is rolled (0 means never roll based on event count)
a1.sinks.k1.hdfs.rollCount = 0
# Minimum number of replicas per HDFS block
a1.sinks.k1.hdfs.minBlockReplicas = 1
# Bind the sink to the channel
a1.sinks.k1.channel = c1
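Two values in the sink settings are easier to read with a quick check. The sketch below assumes the sink expands %Y%m%d the same way date(1) does (four-digit year, two-digit month, two-digit day), and it compares rollSize with 128 MiB, the default HDFS block size in Hadoop 2.x. Whether the cluster in this post keeps that default is an assumption, and the function names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers, for illustration only.

# Print the HDFS directory the sink would target today, assuming
# %Y%m%d expands like date(1) and the local clock is used.
todays_upload_dir() {
  echo "/upload/$(date +%Y%m%d)"
}

# Print how many bytes the configured rollSize falls short of 128 MiB.
roll_size_gap() {
  block_size=$((128 * 1024 * 1024))   # 134217728 bytes
  roll_size=134217700                 # value from the agent config
  echo $((block_size - roll_size))
}

todays_upload_dir
roll_size_gap
```

If the cluster does use 128 MiB blocks, a rollSize just below that limit would keep each rolled file within a single block.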
TIP: the channel's checkpointDir and dataDirs directories must be created on the virtual machine beforehand!
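A minimal sketch of that preparation step, assuming the two paths from the agent configuration (/opt/flume/checkpoint and /opt/flume/data). The helper name make_channel_dirs is hypothetical.

```shell
#!/bin/sh
# Create the File Channel directories under a given base directory.
# The subdirectory names match checkpointDir and dataDirs in the agent config.
make_channel_dirs() {
  base="$1"
  mkdir -p "$base/checkpoint" "$base/data"
}

# On the Flume VM (the user needs write access under /opt):
# make_channel_dirs /opt/flume
```

The user that runs flume-ng also needs read and write permission on both directories.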
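2. Clearing a string of errors
With the configuration done, I could not wait to start the agent and try it out, just like you probably would: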
flume-ng agent --name a1 --conf /opt/software/flume160/conf/ -f /opt/flumeconf/file-hdfs.conf -Dflume.root.logger=DEBUG,console
Then one bucket of cold water after another came pouring down. Let's see which errors spoiled the fun:
org/apache/hadoop/io/SequenceFile$CompressionType
2021-03-10 23:58:20,087 (conf-file-poller-0) [ERROR - org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:146)] Failed to start agent because dependencies were not found in classpath. Error follows.
java.lang.NoClassDefFoundError: org/apache/hadoop/io/SequenceFile$CompressionType
at org.apache.flume.sink.hdfs.HDFSEventSink.configure(HDFSEventSink.java:235)
at org.apache.flume.conf.Configurables.configure(Configurables.java:41)
at org.apache.flume.node.AbstractConfigurationProvider.loadSinks(AbstractConfigurationProvider.java:411)
at org.apache.flume.node.AbstractConfigurationProvider.getConfiguration(AbstractConfigurationProvider.java:102)
at org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:141)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.io.SequenceFile$CompressionType
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 12 more
Fix: copy the following jar from the Hadoop installation into Flume's lib directory.
Jar: ${HADOOP_HOME}/share/hadoop/common/hadoop-common-2.6.0-cdh5.14.2.jar
org/apache/commons/configuration/Configuration
2021-03-11 08:45:13,867 (SinkRunner-PollingRunner-DefaultSinkProcessor) [ERROR - org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:447)] process failed
java.lang.NoClassDefFoundError: org/apache/commons/configuration/Configuration
at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<init>(DefaultMetricsSystem.java:38)
at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<clinit>(DefaultMetricsSystem.java:36)
at org.apache.hadoop.security.UserGroupInformation$UgiMetrics.create(UserGroupInformation.java:139)
at org.apache.hadoop.security.UserGroupInformation.<clinit>(UserGroupInformation.java:259)
at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2979)
at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2971)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2834)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:387)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.flume.sink.hdfs.BucketWriter$1.call(BucketWriter.java:260)
at org.apache.flume.sink.hdfs.BucketWriter$1.call(BucketWriter.java:252)
at org.apache.flume.sink.hdfs.BucketWriter$9$1.run(BucketWriter.java:701)
at org.apache.flume.auth.SimpleAuthenticator.execute(SimpleAuthenticator.java:50)
at org.apache.flume.sink.hdfs.BucketWriter$9.call(BucketWriter.java:698)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: org.apache.commons.configuration.Configuration
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 18 more
Fix: copy the following jar from the Hadoop installation into Flume's lib directory.
Jar: ${HADOOP_HOME}/share/hadoop/common/lib/commons-configuration-1.6.jar
org/apache/hadoop/util/PlatformName
Exception in thread "SinkRunner-PollingRunner-DefaultSinkProcessor" java.lang.NoClassDefFoundError: org/apache/hadoop/util/PlatformName
at org.apache.hadoop.security.UserGroupInformation.getOSLoginModuleName(UserGroupInformation.java:442)
at org.apache.hadoop.security.UserGroupInformation.<clinit>(UserGroupInformation.java:487)
at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2979)
at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2971)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2834)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:387)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.flume.sink.hdfs.BucketWriter$1.call(BucketWriter.java:260)
at org.apache.flume.sink.hdfs.BucketWriter$1.call(BucketWriter.java:252)
at org.apache.flume.sink.hdfs.BucketWriter$9$1.run(BucketWriter.java:701)
at org.apache.flume.auth.SimpleAuthenticator.execute(SimpleAuthenticator.java:50)
at org.apache.flume.sink.hdfs.BucketWriter$9.call(BucketWriter.java:698)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.util.PlatformName
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 16 more
Fix: copy the following jar from the Hadoop installation into Flume's lib directory.
Jar: ${HADOOP_HOME}/share/hadoop/common/lib/hadoop-auth-2.6.0-cdh5.14.2.jar
org/apache/htrace/core/Tracer$Builder
2021-03-11 09:07:27,157 (SinkRunner-PollingRunner-DefaultSinkProcessor) [ERROR - org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:447)] process failed
java.lang.NoClassDefFoundError: org/apache/htrace/core/Tracer$Builder
at org.apache.hadoop.fs.FsTracer.get(FsTracer.java:42)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2803)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:98)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2853)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2835)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:387)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.flume.sink.hdfs.BucketWriter$1.call(BucketWriter.java:260)
at org.apache.flume.sink.hdfs.BucketWriter$1.call(BucketWriter.java:252)
at org.apache.flume.sink.hdfs.BucketWriter$9$1.run(BucketWriter.java:701)
at org.apache.flume.auth.SimpleAuthenticator.execute(SimpleAuthenticator.java:50)
at org.apache.flume.sink.hdfs.BucketWriter$9.call(BucketWriter.java:698)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Fix: copy the following jar from the Hadoop installation into Flume's lib directory.
Jar: ${HADOOP_HOME}/share/hadoop/common/lib/htrace-core4-4.0.1-incubating.jar
No FileSystem for scheme: hdfs
2021-03-11 09:14:59,911 (SinkRunner-PollingRunner-DefaultSinkProcessor) [WARN - org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:443)] HDFS IO error
java.io.IOException: No FileSystem for scheme: hdfs
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2796)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2810)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:98)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2853)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2835)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:387)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.flume.sink.hdfs.BucketWriter$1.call(BucketWriter.java:260)
at org.apache.flume.sink.hdfs.BucketWriter$1.call(BucketWriter.java:252)
at org.apache.flume.sink.hdfs.BucketWriter$9$1.run(BucketWriter.java:701)
at org.apache.flume.auth.SimpleAuthenticator.execute(SimpleAuthenticator.java:50)
at org.apache.flume.sink.hdfs.BucketWriter$9.call(BucketWriter.java:698)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Fix: copy the following jar from the Hadoop installation into Flume's lib directory.
Jar: ${HADOOP_HOME}/share/hadoop/hdfs/hadoop-hdfs-2.6.0-cdh5.14.2.jar
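The five missing-class errors so far all have the same fix, so the copies can be done in one pass. This is a rough sketch, not a definitive procedure: copy_hadoop_jars is a hypothetical helper, the jar paths are the ones listed in this post, and the Flume lib path in the example call is inferred from the --conf path in the launch command.

```shell
#!/bin/sh
# Copy the five Hadoop jars named in this post into Flume's lib directory.
# $1 is the Hadoop home directory, $2 is the Flume lib directory.
copy_hadoop_jars() {
  hadoop_home="$1"
  flume_lib="$2"
  for jar in \
    share/hadoop/common/hadoop-common-2.6.0-cdh5.14.2.jar \
    share/hadoop/common/lib/commons-configuration-1.6.jar \
    share/hadoop/common/lib/hadoop-auth-2.6.0-cdh5.14.2.jar \
    share/hadoop/common/lib/htrace-core4-4.0.1-incubating.jar \
    share/hadoop/hdfs/hadoop-hdfs-2.6.0-cdh5.14.2.jar
  do
    # Stop at the first jar that cannot be copied.
    cp "$hadoop_home/$jar" "$flume_lib/" || return 1
  done
}

# Example call (the lib path is an assumption based on the launch command):
# copy_hadoop_jars "$HADOOP_HOME" /opt/software/flume160/lib
```

Restart the agent after copying so the new jars are picked up on the classpath.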
java.nio.charset.MalformedInputException
2021-03-10 22:07:14,385 (pool-5-thread-1) [ERROR - org.apache.flume.source.SpoolDirectorySource$SpoolDirectoryRunnable.run(SpoolDirectorySource.java:280)] FATAL: Spool Directory source s1: { spoolDir: /opt/software/tomcat8563/webapps/mycurd/log }: Uncaught exception in SpoolDirectorySource thread. Restart or reconfigure Flume to continue processing.
java.nio.charset.MalformedInputException: Input length = 1
at java.nio.charset.CoderResult.throwException(CoderResult.java:281)
at org.apache.flume.serialization.ResettableFileInputStream.readChar(ResettableFileInputStream.java:283)
at org.apache.flume.serialization.LineDeserializer.readLine(LineDeserializer.java:132)
at org.apache.flume.serialization.LineDeserializer.readEvent(LineDeserializer.java:70)
at org.apache.flume.serialization.LineDeserializer.readEvents(LineDeserializer.java:89)
at org.apache.flume.client.avro.ReliableSpoolingFileEventReader.readDeserializerEvents(ReliableSpoolingFileEventReader.java:343)
at org.apache.flume.client.avro.ReliableSpoolingFileEventReader.readEvents(ReliableSpoolingFileEventReader.java:318)
at org.apache.flume.source.SpoolDirectorySource$SpoolDirectoryRunnable.run(SpoolDirectorySource.java:250)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Fix: add a character-set setting to the source in the agent configuration (Flume defaults to UTF-8):
a1.sources.s1.inputCharset = GBK
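MalformedInputException means the source hit bytes that are not valid in the charset it was told to use. One way to spot such files before Flume reads them is to check whether they decode as UTF-8. The sketch below assumes iconv is installed; is_valid_utf8 is a hypothetical helper. Note that failing this check only shows the file is not UTF-8, it does not prove the file is GBK.

```shell
#!/bin/sh
# Return 0 if the file decodes cleanly as UTF-8, non-zero otherwise.
# Relies on iconv rejecting byte sequences that are invalid in the source charset.
is_valid_utf8() {
  iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Example: list files in the spool directory that are not valid UTF-8.
# for f in /opt/software/tomcat8563/webapps/mycurd/log/*; do
#   is_valid_utf8 "$f" || echo "not UTF-8: $f"
# done
```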
java.lang.OutOfMemoryError: GC overhead limit exceeded
Cause: the JVM ran out of heap memory.
Fix: go to the ${FLUME_HOME}/bin directory and edit the flume-ng script:
# set default params
FLUME_CLASSPATH=""
FLUME_JAVA_LIBRARY_PATH=""
JAVA_OPTS="-Xmx1024m" # adjust the JVM heap size
LD_LIBRARY_PATH=""
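As an alternative to editing the launcher itself, the flume-ng script sources flume-env.sh from the --conf directory when that file exists, so the same heap setting can live there instead. A minimal sketch, assuming the conf directory from the launch command (/opt/software/flume160/conf/) and the same 1024 MB limit:

```shell
# Contents of /opt/software/flume160/conf/flume-env.sh (path assumed from the launch command).
# Overrides the default JAVA_OPTS set at the top of the flume-ng script.
export JAVA_OPTS="-Xmx1024m"
```

This keeps the change out of the bin directory, so it is less likely to be lost if the Flume binaries are replaced.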
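PS: I am a beginner who has only just started programming. If anything here is wrong or poorly written, please leave your comments or suggestions in the comment section. If this post helped you, a like would be much appreciated. Thank you!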
Original author: wsjslient
Source: https://blog.csdn.net/wsjslient/article/details/114648813