[Apache Flume] Apache Flume Quick Start and an Overview of Its Basic Components


Agent Configuration Template

# Declare the components
<Agent>.sources = <Source1> <Source2>
<Agent>.sinks = <Sink1> <Sink2>
<Agent>.channels = <Channel1> <Channel2>

# Configure each component
<Agent>.sources.<Source>.<someProperty> = <someValue>
<Agent>.channels.<Channel>.<someProperty> = <someValue>
<Agent>.sinks.<Sink>.<someProperty> = <someValue>

# Wire the components together
<Agent>.sources.<Source>.channels = <Channel1> <Channel2> ...
<Agent>.sinks.<Sink>.channel = <Channel1>

You must be familiar with this template structure; knowing it makes later lookup and configuration much easier.

<Agent>, <Channel>, <Sink>, and <Source> stand for component names; consult the official documentation for the component types the system provides.

QuickStart

① helloword.properties is a single-agent configuration. Place this file in the conf directory under the Flume installation directory.

# Declare the basic components: Source, Channel, Sink
a1.sources = s1
a1.sinks = sk1
a1.channels = c1

# Configure the Source: receive text data over a socket
a1.sources.s1.type = netcat
a1.sources.s1.bind = CentOS
a1.sources.s1.port = 44444

# Configure the Sink: print received data to the log console
a1.sinks.sk1.type = logger

# Configure the Channel: buffers data between Source and Sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the components together
a1.sources.s1.channels = c1
a1.sinks.sk1.channel = c1

1. Install ncat (yum -y install nmap-ncat) to make the tests below easier.
2. Install telnet (yum -y install telnet) as well, for the test step.

② Start the a1 agent

[root@CentOS apache-flume-1.9.0-bin]# ./bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/helloword.properties -Dflume.root.logger=INFO,console

For reference, the startup command's options:

Usage: ./bin/flume-ng <command> [options]...

commands:
  help                      display this help text
  agent                     run a Flume agent
  avro-client               run an avro Flume client
  version                   show Flume version info

global options:
  --conf,-c <conf>          use configs in <conf> directory
  --classpath,-C <cp>       append to the classpath
  --dryrun,-d               do not actually start Flume, just print the command
  --plugins-path <dirs>     colon-separated list of plugins.d directories. See the
                            plugins.d section in the user guide for more details.
                            Default: $FLUME_HOME/plugins.d
  -Dproperty=value          sets a Java system property value
  -Xproperty=value          sets a Java -X option

agent options:
  --name,-n <name>          the name of this agent (required)
  --conf-file,-f <file>     specify a config file (required if -z missing)
  --zkConnString,-z <str>   specify the ZooKeeper connection to use (required if -f missing)
  --zkBasePath,-p <path>    specify the base path in ZooKeeper for agent configs
  --no-reload-conf          do not reload config file if changed
  --help,-h                 display help text

avro-client options:
  --rpcProps,-P <file>   RPC client properties file with server connection params
  --host,-H <host>       hostname to which events will be sent
  --port,-p <port>       port of the avro source
  --dirname <dir>        directory to stream to avro source
  --filename,-F <file>   text file to stream to avro source (default: std input)
  --headerFile,-R <file> File containing event headers as key/value pairs on each new line
  --help,-h              display help text

  Either --rpcProps or both --host and --port must be specified.

Note that if <conf> directory is specified, then it is always included first
in the classpath.

③ Test a1

[root@CentOS apache-flume-1.9.0-bin]# telnet CentOS 44444
Trying 192.168.52.134...
Connected to CentOS.
Escape character is '^]'.
hello world
2020-02-05 11:44:43,546 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] Event: { headers:{} body: 68 65 6C 6C 6F 20 77 6F 72 6C 64 0D             hello world. }

Overview of Basic Components

Source (Input)

Avro Source

Typically used to collect data remotely (as an RPC service): it starts an Avro server internally, accepts requests from Avro clients, and stores the received data into the Channel.

Property | Default | Description
channels | – | Channel(s) to attach to
type | – | Component type; must be avro
bind | – | IP address or hostname to bind to
port | – | Port to listen on
# Declare the component
a1.sources = s1

# Configure the component
a1.sources.s1.type = avro
a1.sources.s1.bind = CentOS
a1.sources.s1.port = 44444

# Attach to the channel
a1.sources.s1.channels = c1
# Declare the basic components: Source, Channel, Sink (example2.properties)
a1.sources = s1
a1.sinks = sk1
a1.channels = c1

# Configure the Source: an Avro server that receives events over RPC
a1.sources.s1.type = avro
a1.sources.s1.bind = CentOS
a1.sources.s1.port = 44444

# Configure the Sink: print received data to the log console
a1.sinks.sk1.type = logger

# Configure the Channel: buffers data between Source and Sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the components together
a1.sources.s1.channels = c1
a1.sinks.sk1.channel = c1
[root@CentOS apache-flume-1.9.0-bin]# ./bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/example2.properties -Dflume.root.logger=INFO,console
[root@CentOS apache-flume-1.9.0-bin]# ./bin/flume-ng avro-client --host CentOS --port 44444  --filename /root/t_employee

Exec Source

Collects the console output of a command. The Flume agent typically has to be deployed on the same machine as the service whose data is being collected.

Property | Default | Description
channels | – | Channel(s) to attach to
type | – | Must be exec
command | – | The command to execute
# Declare the basic components: Source, Channel, Sink (example3.properties)
a1.sources = s1
a1.sinks = sk1
a1.channels = c1

# Configure the Source: collect the output of a command
a1.sources.s1.type = exec
a1.sources.s1.command = tail -F /root/t_user

# Configure the Sink: print received data to the log console
a1.sinks.sk1.type = logger

# Configure the Channel: buffers data between Source and Sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the components together
a1.sources.s1.channels = c1
a1.sinks.sk1.channel = c1
[root@CentOS apache-flume-1.9.0-bin]# ./bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/example3.properties -Dflume.root.logger=INFO,console
[root@CentOS ~]# echo "hello exec source" >> /root/t_user
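To generate a steady stream of test lines, a loop like the following can run in a second terminal (the path /root/t_user matches the command configured above; the message text is arbitrary):

# Append one timestamped line per second to the monitored file (Ctrl-C to stop)
[root@CentOS ~]# while true; do echo "$(date '+%F %T') test line" >> /root/t_user; sleep 1; done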

Spooling Directory Source

Collects text files newly added to a static directory. Once a file has been fully collected its suffix is changed, but the source file itself is not deleted; if you only want each file collected once and then removed, change the source's default behavior (deletePolicy). The Flume agent typically has to be deployed on the same machine as the service producing the files.

Property | Default | Description
channels | – | Channel(s) to attach to
type | – | Must be spooldir
spoolDir | – | The directory to collect files from
fileSuffix | .COMPLETED | Suffix appended to the name of a fully collected file
deletePolicy | never | Either never or immediate
includePattern | ^.*$ | Regex of files to include (default matches all files)
ignorePattern | ^$ | Regex of files to ignore
# Declare the basic components: Source, Channel, Sink (example4.properties)
a1.sources = s1
a1.sinks = sk1
a1.channels = c1

# Configure the Source: collect new files from a spooling directory
a1.sources.s1.type = spooldir
a1.sources.s1.spoolDir = /root/spooldir
a1.sources.s1.fileHeader = true
a1.sources.s1.deletePolicy = immediate
# Configure the Sink: print received data to the log console
a1.sinks.sk1.type = logger

# Configure the Channel: buffers data between Source and Sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the components together
a1.sources.s1.channels = c1
a1.sinks.sk1.channel = c1
[root@CentOS apache-flume-1.9.0-bin]# ./bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/example4.properties -Dflume.root.logger=INFO,console
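To test, drop a file into the spool directory. With deletePolicy = immediate configured above, the source deletes each file once it has been fully collected (the copied file name here is only an example):

[root@CentOS ~]# mkdir -p /root/spooldir
[root@CentOS ~]# cp /etc/hosts /root/spooldir/hosts.txt
# the file's lines appear on the agent console; the file is then removed
[root@CentOS ~]# ls /root/spooldir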

Taildir Source

Monitors text files for appended lines in real time and records the read offset of each collected file, so the next run can resume collecting incrementally. The Flume agent typically has to be deployed on the same machine as the service producing the files.

Property | Default | Description
channels | – | Channel(s) to attach to
type | – | Must be TAILDIR
filegroups | – | Space-separated list of file groups
filegroups.<filegroupName> | – | Absolute path of the file group; regular expressions (not file system patterns) can be used for the filename only
positionFile | ~/.flume/taildir_position.json | Records the read position of each collected file, enabling incremental collection
# Declare the basic components: Source, Channel, Sink (example5.properties)
a1.sources = s1
a1.sinks = sk1
a1.channels = c1

# Configure the Source: tail files in the matched groups and record read offsets
a1.sources.s1.type = TAILDIR
a1.sources.s1.filegroups = g1 g2
a1.sources.s1.filegroups.g1 = /root/taildir/.*\.log$
a1.sources.s1.filegroups.g2 = /root/taildir/.*\.java$
a1.sources.s1.headers.g1.type = log
a1.sources.s1.headers.g2.type = java

# Configure the Sink: print received data to the log console
a1.sinks.sk1.type = logger

# Configure the Channel: buffers data between Source and Sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the components together
a1.sources.s1.channels = c1
a1.sinks.sk1.channel = c1
[root@CentOS apache-flume-1.9.0-bin]# ./bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/example5.properties -Dflume.root.logger=INFO,console
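To test, append lines to a file matching one of the groups, then inspect the position file. The JSON layout below is what the Taildir source writes by default; the inode and pos values shown are illustrative:

[root@CentOS ~]# echo "hello taildir" >> /root/taildir/app.log
[root@CentOS ~]# cat ~/.flume/taildir_position.json
[{"inode":2496275,"pos":14,"file":"/root/taildir/app.log"}]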

Kafka Source

Consumes data from one or more Kafka topics and writes it to the channel.

Property | Default | Description
channels | – | Channel(s) to attach to
type | – | Must be org.apache.flume.source.kafka.KafkaSource
kafka.topics | – | Comma-separated list of topics the Kafka consumer reads messages from
kafka.bootstrap.servers | – | List of brokers in the Kafka cluster used by the source
kafka.topics.regex | – | Regex defining the set of topics to subscribe to; takes precedence over kafka.topics and overrides it if both are set
batchSize | 1000 | Maximum number of messages written to the channel in one batch
# Declare the basic components: Source, Channel, Sink (example9.properties)
a1.sources = s1
a1.sinks = sk1
a1.channels = c1

# Configure the Source: consume messages from a Kafka topic
a1.sources.s1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.s1.batchSize = 100 
a1.sources.s1.batchDurationMillis = 2000
a1.sources.s1.kafka.bootstrap.servers = CentOS:9092
a1.sources.s1.kafka.topics = topic01
a1.sources.s1.kafka.consumer.group.id = g1

# Configure the Sink: print received data to the log console
a1.sinks.sk1.type = logger

# Configure the Channel: buffers data between Source and Sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the components together
a1.sources.s1.channels = c1
a1.sinks.sk1.channel = c1
[root@CentOS apache-flume-1.9.0-bin]# ./bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/example9.properties -Dflume.root.logger=INFO,console
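To feed the source, produce a few messages into topic01 with the Kafka console producer (the same command is used again in the Avro example further below):

[root@CentOS kafka_2.11-2.2.0]# ./bin/kafka-console-producer.sh --broker-list CentOS:9092 --topic topic01
>hello kafka source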

Sink (Output)

Logger Sink

Typically used for testing/debugging purposes.

File Roll Sink

Writes collected data to local files.

# Declare the basic components: Source, Channel, Sink (example6.properties)

a1.sources = s1
a1.sinks = sk1
a1.channels = c1

# Configure the Source: receive text data over a socket
a1.sources.s1.type = netcat
a1.sources.s1.bind = CentOS
a1.sources.s1.port = 44444

# Configure the Sink: write received data to local files
a1.sinks.sk1.type = file_roll
a1.sinks.sk1.sink.directory = /root/file_roll
a1.sinks.sk1.sink.rollInterval = 0

# Configure the Channel: buffers data between Source and Sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the components together
a1.sources.s1.channels = c1
a1.sinks.sk1.channel = c1
[root@CentOS apache-flume-1.9.0-bin]# ./bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/example6.properties
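The sink writes into /root/file_roll, which should exist before events arrive. A quick test: send a line through the netcat source and list the directory (file_roll names its files by a timestamp plus a counter, so the exact name shown here is illustrative):

[root@CentOS ~]# mkdir -p /root/file_roll
[root@CentOS ~]# telnet CentOS 44444
hello file roll
[root@CentOS ~]# ls /root/file_roll
1580873821000-1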

HDFS Sink

Writes data to the HDFS file system.

# Declare the basic components: Source, Channel, Sink (example7.properties)

a1.sources = s1
a1.sinks = sk1
a1.channels = c1

# Configure the Source: receive text data over a socket
a1.sources.s1.type = netcat
a1.sources.s1.bind = CentOS
a1.sources.s1.port = 44444

# Configure the Sink: write received data to HDFS
a1.sinks.sk1.type = hdfs
a1.sinks.sk1.hdfs.path = /flume-hdfs/%y-%m-%d
a1.sinks.sk1.hdfs.rollInterval = 0
a1.sinks.sk1.hdfs.rollSize = 0
a1.sinks.sk1.hdfs.rollCount = 0
a1.sinks.sk1.hdfs.useLocalTimeStamp = true
a1.sinks.sk1.hdfs.fileType = DataStream

# Configure the Channel: buffers data between Source and Sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the components together
a1.sources.s1.channels = c1
a1.sinks.sk1.channel = c1
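No start command was given above for example7.properties; it is launched the same way as the other examples, and the written files can be inspected with the HDFS CLI (assuming an HDFS client is configured on this machine; the date directory shown is illustrative, following the %y-%m-%d pattern above):

[root@CentOS apache-flume-1.9.0-bin]# ./bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/example7.properties -Dflume.root.logger=INFO,console
[root@CentOS ~]# hdfs dfs -ls /flume-hdfs/20-02-05
[root@CentOS ~]# hdfs dfs -cat /flume-hdfs/20-02-05/*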

Kafka Sink

Writes data to a Kafka topic.

# Declare the basic components: Source, Channel, Sink (example8.properties)

a1.sources = s1
a1.sinks = sk1
a1.channels = c1

# Configure the Source: receive text data over a socket
a1.sources.s1.type = netcat
a1.sources.s1.bind = CentOS
a1.sources.s1.port = 44444

# Configure the Sink: write received data to a Kafka topic
a1.sinks.sk1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.sk1.kafka.bootstrap.servers = CentOS:9092
a1.sinks.sk1.kafka.topic = topic01
a1.sinks.sk1.kafka.flumeBatchSize = 20
a1.sinks.sk1.kafka.producer.acks = 1
a1.sinks.sk1.kafka.producer.linger.ms = 1
a1.sinks.sk1.kafka.producer.compression.type = snappy

# Configure the Channel: buffers data between Source and Sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the components together
a1.sources.s1.channels = c1
a1.sinks.sk1.channel = c1
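example8.properties is started the same way; a standard Kafka console consumer then verifies that lines sent to the netcat source arrive in topic01:

[root@CentOS apache-flume-1.9.0-bin]# ./bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/example8.properties -Dflume.root.logger=INFO,console
[root@CentOS kafka_2.11-2.2.0]# ./bin/kafka-console-consumer.sh --bootstrap-server CentOS:9092 --topic topic01 --from-beginning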

Avro Sink: forwards data to an Avro Source, which is how agents are chained together.

(Figure: a two-agent pipeline — agent a1 consumes from Kafka and forwards through an Avro Sink to agent a2's Avro Source, which writes to local files. Original image unavailable.)

# Declare the basic components: Source, Channel, Sink (agent a1, example10.properties)
a1.sources = s1
a1.sinks = sk1
a1.channels = c1

# Configure the Source: consume messages from a Kafka topic
a1.sources.s1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.s1.batchSize = 100 
a1.sources.s1.batchDurationMillis = 2000
a1.sources.s1.kafka.bootstrap.servers = CentOS:9092
a1.sources.s1.kafka.topics = topic01
a1.sources.s1.kafka.consumer.group.id = g1

# Configure the Sink: forward events to the downstream Avro Source
a1.sinks.sk1.type = avro
a1.sinks.sk1.hostname = CentOS
a1.sinks.sk1.port = 44444

# Configure the Channel: buffers data between Source and Sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the components together
a1.sources.s1.channels = c1
a1.sinks.sk1.channel = c1

# Declare the basic components: Source, Channel, Sink (agent a2, also in example10.properties)
a2.sources = s1
a2.sinks = sk1
a2.channels = c1

# Configure the Source: an Avro server that receives events over RPC
a2.sources.s1.type = avro
a2.sources.s1.bind = CentOS 
a2.sources.s1.port = 44444


# Configure the Sink: write received data to local files
a2.sinks.sk1.type = file_roll
a2.sinks.sk1.sink.directory = /root/file_roll
a2.sinks.sk1.sink.rollInterval = 0

# Configure the Channel: buffers data between Source and Sink
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Bind the components together
a2.sources.s1.channels = c1
a2.sinks.sk1.channel = c1
[root@CentOS apache-flume-1.9.0-bin]# ./bin/flume-ng agent --conf conf/ --conf-file conf/example10.properties --name a2
[root@CentOS apache-flume-1.9.0-bin]# ./bin/flume-ng agent --conf conf/ --conf-file conf/example10.properties --name a1
[root@CentOS kafka_2.11-2.2.0]# ./bin/kafka-console-producer.sh --broker-list CentOS:9092 --topic topic01

Channel

Memory Channel

Writes Source data directly into memory. This is not durable: if the agent dies, buffered events may be lost.

Property | Default | Description
type | – | Must be memory
capacity | 100 | Maximum number of events stored in the channel
transactionCapacity | 100 | Batch size per transaction when a Source writes to or a Sink reads from the channel

Note that transactionCapacity <= capacity must hold.

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000 
a1.channels.c1.transactionCapacity = 100

JDBC Channel

Property | Default | Description
type | – | Must be jdbc
db.type | DERBY | Database vendor; must be DERBY

Events are stored in a persistent store backed by a database. The JDBC channel currently supports only embedded Derby. It is a durable channel, well suited to flows where recoverability matters: use the JDBC channel when the buffered data is too important to lose.

a1.channels.c1.type = jdbc

1. If HIVE_HOME is configured, remove the derby jar from either Hive's lib directory or Flume's lib directory (removing it from one side is enough) to avoid a conflict.

2. By default, Flume uses the replicating (broadcast) channel selector.

Kafka Channel

Property | Default | Description
type | – | Must be org.apache.flume.channel.kafka.KafkaChannel
kafka.bootstrap.servers | – | List of brokers in the Kafka cluster used by the channel
kafka.topic | flume-channel | Kafka topic the channel will use
kafka.consumer.group.id | flume | Consumer group ID used to register with Kafka

Writes the data collected by the Source to a Kafka cluster in an external system.

a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c1.kafka.bootstrap.servers = CentOS:9092
a1.channels.c1.kafka.topic = topic_channel
a1.channels.c1.kafka.consumer.group.id = g1
# Declare the basic components: Source, Channel, Sink (example11.properties)
a1.sources = s1
a1.sinks = sk1
a1.channels = c1

# Configure the Source: receive text data over a socket
a1.sources.s1.type = netcat
a1.sources.s1.bind = CentOS
a1.sources.s1.port = 44444

# Configure the Sink: print received data to the log console
a1.sinks.sk1.type = logger

# Configure the Channel: buffer events in a Kafka topic
a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c1.kafka.bootstrap.servers = CentOS:9092
a1.channels.c1.kafka.topic = topic_channel
a1.channels.c1.kafka.consumer.group.id = g1

# Bind the components together
a1.sources.s1.channels = c1
a1.sinks.sk1.channel = c1
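Started like the previous examples (assuming the configuration is saved as conf/example11.properties, matching the comment above):

[root@CentOS apache-flume-1.9.0-bin]# ./bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/example11.properties -Dflume.root.logger=INFO,console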

File Channel

Property | Default | Description
type | – | Must be file
checkpointDir | ~/.flume/file-channel/checkpoint | Directory where checkpoint files are stored
dataDirs | ~/.flume/file-channel/data | Comma-separated list of directories for storing log files

Uses the file system as the channel's backing store, so buffered data survives agent restarts.

a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /root/flume/checkpoint
a1.channels.c1.dataDirs = /root/flume/data
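As a sketch, the file channel drops straight into the netcat-to-logger pipeline used throughout this article; only the channel section changes (example12.properties is an assumed file name):

# Declare the basic components: Source, Channel, Sink (example12.properties, an assumed name)
a1.sources = s1
a1.sinks = sk1
a1.channels = c1

# Configure the Source: receive text data over a socket
a1.sources.s1.type = netcat
a1.sources.s1.bind = CentOS
a1.sources.s1.port = 44444

# Configure the Sink: print received data to the log console
a1.sinks.sk1.type = logger

# Configure the Channel: durable file-backed buffer; events survive agent restarts
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /root/flume/checkpoint
a1.channels.c1.dataDirs = /root/flume/data

# Bind the components together
a1.sources.s1.channels = c1
a1.sinks.sk1.channel = c1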