How to know when new data has been added to HDFS?
Author: Internet
I am implementing a notification system based on the publish-subscribe model to notify subscribers when data becomes available as it arrives in (is loaded into) HDFS. I couldn't find where to look for this. Is there an HDFS API that can be used to do this, or what method should I use to get information about new data written to HDFS? I am using Hadoop v2.0.2, and I don't want to use HCatalog; I want to implement my own tool to do this.
What you are looking for is the Oozie Coordinator.
HDFS is a file system, so something must be built on top of it to check for file availability. HBase has coprocessors, which are triggered procedures, but they are only available for HBase tables, so they cannot be used to detect data availability in HDFS.
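Since the question asks about building a custom tool, here is a minimal sketch of the "something built on top" idea: poll a directory listing and report the paths that were not there on the previous poll. The class name `NewFileDetector` and its `poll` method are hypothetical, and the listing is passed in as a plain set of strings so the sketch has no Hadoop dependency. In a real tool the listing would presumably come from the Hadoop client, for example `org.apache.hadoop.fs.FileSystem.listStatus(Path)`; that wiring is an assumption and is left out.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper: detects paths that appeared since the previous listing. */
class NewFileDetector {
    // Paths that were present in the previous poll.
    private final Set<String> known = new HashSet<>();

    /**
     * Returns the paths in currentListing that were not present in the
     * previous poll, in sorted order, so each arrival is reported once.
     */
    Set<String> poll(Set<String> currentListing) {
        // Forget paths that disappeared, so a re-created file counts as new.
        known.retainAll(currentListing);
        Set<String> added = new TreeSet<>(currentListing);
        added.removeAll(known);
        known.addAll(added);
        return added;
    }

    public static void main(String[] args) {
        NewFileDetector detector = new NewFileDetector();
        // Stand-ins for two successive directory listings (assumed paths).
        Set<String> first = new HashSet<>(Arrays.asList("/data/a.csv"));
        Set<String> second = new HashSet<>(Arrays.asList("/data/a.csv", "/data/b.csv"));
        System.out.println("new on first poll:  " + detector.poll(first));
        System.out.println("new on second poll: " + detector.poll(second));
    }
}
```

Each set returned by `poll` could then be handed to the publish side of the notification system. Note that a path showing up in a listing does not by itself mean the file is completely written, so a production tool would likely need an extra completeness signal such as a marker file.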
Oozie is a workflow scheduler system for managing Hadoop jobs. Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability. You can also execute other programs from it.
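To make "triggered by time (frequency) and data availability" concrete, the following is a conceptual sketch only, not Oozie's implementation or API. It steps through nominal times at a fixed frequency, fills a time-based path template, and treats a dataset instance as ready when a marker ("done flag") file exists in it. The class `DataAvailabilityTrigger`, the example paths, and the `exists` predicate (which a real tool would back with an HDFS existence check) are all assumptions made for illustration.

```java
import java.time.Duration;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

/** Conceptual model of a time-and-data trigger (hypothetical, not Oozie code). */
class DataAvailabilityTrigger {
    private final String uriTemplate; // e.g. "/data/in/${YEAR}/${MONTH}/${DAY}/${HOUR}"
    private final String doneFlag;    // marker file that signals "data is complete"

    DataAvailabilityTrigger(String uriTemplate, String doneFlag) {
        this.uriTemplate = uriTemplate;
        this.doneFlag = doneFlag;
    }

    /** Fills the time placeholders of the template for one nominal time. */
    String resolve(ZonedDateTime t) {
        return uriTemplate
            .replace("${YEAR}", String.format("%04d", t.getYear()))
            .replace("${MONTH}", String.format("%02d", t.getMonthValue()))
            .replace("${DAY}", String.format("%02d", t.getDayOfMonth()))
            .replace("${HOUR}", String.format("%02d", t.getHour()));
    }

    /** Returns the instance directories in [start, end) whose done flag exists. */
    List<String> readyInstances(ZonedDateTime start, ZonedDateTime end,
                                Duration frequency, Predicate<String> exists) {
        if (frequency.isZero() || frequency.isNegative()) {
            throw new IllegalArgumentException("frequency must be positive");
        }
        List<String> ready = new ArrayList<>();
        for (ZonedDateTime t = start; t.isBefore(end); t = t.plus(frequency)) {
            String dir = resolve(t);
            if (exists.test(dir + "/" + doneFlag)) {
                ready.add(dir); // this is where a workflow or notification would fire
            }
        }
        return ready;
    }

    public static void main(String[] args) {
        // In-memory stand-in for the file system (assumed paths).
        Set<String> files = new HashSet<>(Arrays.asList("/data/in/2013/01/01/01/_SUCCESS"));
        DataAvailabilityTrigger trigger =
            new DataAvailabilityTrigger("/data/in/${YEAR}/${MONTH}/${DAY}/${HOUR}", "_SUCCESS");
        ZonedDateTime start = ZonedDateTime.of(2013, 1, 1, 0, 0, 0, 0, ZoneOffset.UTC);
        System.out.println(trigger.readyInstances(
            start, start.plusHours(3), Duration.ofHours(1), files::contains));
    }
}
```

In Oozie itself this pairing is declared in the coordinator application definition rather than written as code; the sketch only mirrors the idea for someone building a custom tool.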
Tags: HDFS, added, jobs, system, been, Oozie, new, data. Source: https://blog.csdn.net/hayaqi0504/article/details/100671331