Hadoop : Load Log data into HDFS using Flume

Hi,

In this blog we will cover how to copy a log file of our choice into HDFS (we are using a single-node cluster).

Most of us have heard of Flume, but let's have a quick run-through of the basics first :-

Flume :- 

Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows.

and that is exactly what we are going to do in this blog.

A Flume agent consists of 3 main parts :-

1. Source

2. Channel

3. Sink

An Event is a unit of data that flows through a Flume agent. The Event flows from Source to Channel to Sink, and is represented by an implementation of the Event interface. An Event carries a payload (byte array) that is accompanied by an optional set of headers (string attributes). A Flume agent is a process (JVM) that hosts the components that allow Events to flow from an external source to an external destination.

(Flume agent architecture diagram :- Source → Channel → Sink)

A Source consumes Events having a specific format, and those Events are delivered to the Source by an external source like a web server. For example, an AvroSource can be used to receive Avro Events from clients or from other Flume agents in the flow. When a Source receives an Event, it stores it into one or more Channels. The Channel is a passive store that holds the Event until that Event is consumed by a Sink. One type of Channel available in Flume is the FileChannel which uses the local filesystem as its backing store. A Sink is responsible for removing an Event from the Channel and putting it into an external repository like HDFS (in the case of an HDFSEventSink) or forwarding it to the Source at the next hop of the flow. The Source and Sink within the given agent run asynchronously with the Events staged in the Channel.
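
To make this concrete :- every agent configuration file simply names these three components and wires them together. A minimal skeleton (the names "myagent", "mysource" etc. are just placeholders) looks like this :-

myagent.sources = mysource
myagent.channels = mychannel
myagent.sinks = mysink

# a source can write to several channels, hence the plural "channels"
myagent.sources.mysource.channels = mychannel
# a sink drains exactly one channel, hence the singular "channel"
myagent.sinks.mysink.channel = mychannel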

Let's come to the example now :-

Let's say we have a log/text file of employee records in the folder /home/user/Desktop/hadoop/

Command :-

ls
access_log    employee    input     sample.txt
apache-flume-1.6.0-bin   metastore_db    table.txt

cat employee
003 Amit Delhi India 12000
004 Anil Delhi India 15000
005 Deepak Delhi India 34000
006 Fahed Agra India 45000
007 Ravi Patna India 98777
008 Avinash Punjab India 120000
009 Saajan Punjab India 54000
001 Harit Delhi India 20000
002 Hardy Agra India 20000
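
(If you want to follow along but don't have a similar file, you can create one with a couple of the records above, e.g. :- )

printf '001 Harit Delhi India 20000\n002 Hardy Agra India 20000\n' > /home/user/Desktop/hadoop/employee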

Next Step :-

We need to configure our Flume agent conf file like below :-

flume-conf.properties.template (important :- here my Flume agent name is "agent"; you can change it to anything, but update the template file and your flume-ng run command accordingly)

# Channel definition
agent.channels = channel1
agent.channels.channel1.type = memory
agent.channels.channel1.capacity = 1000

# Source definition
agent.sources = source1
agent.sources.source1.channels = channel1
agent.sources.source1.type = exec
agent.sources.source1.command = tail -f /home/user/Desktop/hadoop/employee

# Sink definition
agent.sinks = sink1
agent.sinks.sink1.channel = channel1
agent.sinks.sink1.type = hdfs
agent.sinks.sink1.hdfs.useLocalTimeStamp = true
agent.sinks.sink1.hdfs.path = /employeedata/%y-%m-%d
agent.sinks.sink1.hdfs.writeFormat = Text
agent.sinks.sink1.hdfs.fileType = DataStream
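
(Optional :- by default the HDFS sink rolls to a new file every 30 seconds, after 1024 bytes, or after 10 events, whichever comes first, which can leave you with lots of small files. To get fewer, larger files you can add roll settings like the ones below; the values are just an illustration.)

# roll every 10 minutes; 0 disables size- and count-based rolling
agent.sinks.sink1.hdfs.rollInterval = 600
agent.sinks.sink1.hdfs.rollSize = 0
agent.sinks.sink1.hdfs.rollCount = 0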

 

After saving the above file, let's run the Flume agent :-

user@ubuntuvm:~/Desktop/hadoop/apache-flume-1.6.0-bin/bin$ ./flume-ng agent -n agent -c conf -f /home/user/Desktop/hadoop/apache-flume-1.6.0-bin/conf/flume-conf.properties.template
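
Here, the flags mean :-

# -n : the name of the agent to run ("agent", which must match the property prefix in the config file)
# -c : Flume's conf directory (where it picks up flume-env.sh and log4j settings)
# -f : the configuration file that defines our source, channel and sink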

which will produce output like the following (only the last few lines are shown) :-

15/07/08 19:11:13 INFO instrumentation.MonitoredCounterGroup: Monitored counter group for type: SOURCE, name: source1: Successfully registered new MBean.
15/07/08 19:11:13 INFO instrumentation.MonitoredCounterGroup: Component type: SOURCE, name: source1 started
15/07/08 19:11:17 INFO hdfs.HDFSDataStream: Serializer = TEXT, UseRawLocalFileSystem = false
15/07/08 19:11:17 INFO hdfs.BucketWriter: Creating /employeedata/15-07-08/FlumeData.1436397077308.tmp
15/07/08 19:11:18 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Now open another terminal and run the same command we gave the exec source (agent.sources.source1.command), just to see the data Flume is tailing :-

tail -f /home/user/Desktop/hadoop/employee
003 Amit Delhi India 12000
004 Anil Delhi India 15000
005 Deepak Delhi India 34000
006 Fahed Agra India 45000
007 Ravi Patna India 98777
008 Avinash Punjab India 120000
009 Saajan Punjab India 54000
001 Harit Delhi India 20000
002 Hardy Agra India 20000
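
(Tip :- because the source is tail -f, you can also append a new record to the file while the agent is still running and Flume will pick it up, e.g. with a made-up record :- )

echo "010 Tester Noida India 11000" >> /home/user/Desktop/hadoop/employee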

After this, stop the agent using Ctrl + C.

Now, in a terminal, use hadoop fs -ls / to check whether the new directory was created by the Flume agent. Its name comes from the sink path we configured, with %y-%m-%d expanded to the local date (because useLocalTimeStamp is true) :-

agent.sinks.sink1.hdfs.path = /employeedata/%y-%m-%d

hadoop fs -ls /
15/07/08 19:18:18 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 20 items
-rw-r--r-- 1 user supergroup 26219 2015-06-26 18:04 /arc
drwxr-xr-x - user supergroup 0 2015-06-26 17:58 /arcout
-rw-r--r-- 3 user supergroup 26219 2015-06-25 18:01 /blocks
-rw-r--r-- 1 user supergroup 26219 2015-06-25 17:45 /blockstest
-rw-r--r-- 1 user supergroup 3791282 2015-06-25 17:48 /blocktest
drwxr-xr-x - user supergroup 0 2015-07-08 18:21 /data
drwxr-xr-x - user supergroup 0 2015-07-08 19:22 /employeedata

(NOTE :- once Flume rolls and closes the file, the .tmp suffix seen in the agent log above is dropped :-

hadoop fs -ls /emp*/15*
15/07/08 19:23:52 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
-rw-r--r-- 1 user supergroup 254 2015-07-08 19:22 /employeedata/15-07-08/FlumeData.1436397741102

)

Now run the following command to see the results :-

hadoop fs -cat /emp*/15*/F*
15/07/08 19:25:04 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
003 Amit Delhi India 12000
004 Anil Delhi India 15000
005 Deepak Delhi India 34000
006 Fahed Agra India 45000
007 Ravi Patna India 98777
008 Avinash Punjab India 120000
009 Saajan Punjab India 54000
001 Harit Delhi India 20000
002 Hardy Agra India 20000
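
To double-check that nothing was dropped, we can count the records on both sides; both commands should print 9 here (assuming each record in the local file ends with a newline) :-

wc -l < /home/user/Desktop/hadoop/employee
hadoop fs -cat /emp*/15*/F* | wc -l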

This is exactly what we were expecting. You can find more at http://flume.apache.org/FlumeUserGuide.html
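
One caveat before you build on this :- the exec source with tail -f gives no delivery guarantees (if the agent dies, events can be lost or read twice). For anything production-like, the Flume user guide recommends the spooling directory source instead, which in outline looks like this (the directory path is just an example) :-

agent.sources.source1.type = spooldir
agent.sources.source1.spoolDir = /home/user/Desktop/hadoop/logs
agent.sources.source1.channels = channel1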

This is a basic, practical learning example for Flume.

Hope you guys liked it.

Cheers.
