Implementing an Inverted Index with Hadoop MapReduce

2023-09-27 14:22:12


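Build an inverted index (a term-to-document list): count how many times each word occurs in each input file, and print the result in the following format: "word  file path->count;file path->count;..."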

Aili hdfs://>1;hdfs://>1;

baidu hdfs://>1;hdfs://>1;



baidu top1

aili top2

tengxun top3

xiaomi top4

ultrapower top5

java top6

python top7


 c top2

java top1

python top5

c++ top4

aili top0

tengxun top1

c++ top5


java top1

baidu top2

c top3

java top0


2.1 Driver (main) code

2.2 Mapper code
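
As a rough illustration of the map step, here is a minimal plain-Java sketch. It assumes the mapper splits each input line on whitespace and emits one pair per token, keyed by the word together with the file it came from. It deliberately does not use the Hadoop API, and `MapStepSketch`, `map`, the tab-separated composite key, and the path `a.txt` are all hypothetical names chosen for this example, not the code packaged in hadoop-demo-inverseindex.jar.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class MapStepSketch {
    // Emits one ("word<TAB>filePath", 1) pair per whitespace-separated token.
    // A tab is used as the separator because a token can never contain one.
    static List<Map.Entry<String, Integer>> map(String filePath, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) { // an empty line yields a single empty token
                out.add(new SimpleEntry<>(token + "\t" + filePath, 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // "a.txt" stands in for the full HDFS path of the input split.
        for (Map.Entry<String, Integer> pair : map("a.txt", "baidu top1")) {
            System.out.println(pair.getKey() + " => " + pair.getValue());
        }
    }
}
```

Keeping the file path inside the key means that all occurrences of the same word in the same file end up grouped together, which is what the next step needs in order to count them.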

2.3 Combiner code
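
One common arrangement for the combine step, sketched here in plain Java under stated assumptions: the input keys have the hypothetical form "word<TAB>filePath" (one per occurrence), the step sums the occurrences per word and file, and it then re-keys the result by the word alone so that the value becomes "filePath->count". The class and method names are made up for this example and no Hadoop classes are used. Note that Hadoop does not guarantee how many times a combiner runs (possibly zero, possibly more than once), so a real job that changes the key in its combiner depends on it running exactly once; this sketch makes that assumption explicitly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CombineStepSketch {
    // Input: composite keys "word<TAB>filePath", one entry per occurrence.
    // Output: word -> list of "filePath->count" postings.
    static Map<String, List<String>> combine(List<String> compositeKeys) {
        // Sum the occurrences of each (word, file) pair.
        Map<String, Integer> counts = new TreeMap<>();
        for (String key : compositeKeys) {
            counts.merge(key, 1, Integer::sum);
        }
        // Re-key by word and move the file path into the value.
        Map<String, List<String>> postings = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int tab = e.getKey().indexOf('\t');
            String word = e.getKey().substring(0, tab);
            String filePath = e.getKey().substring(tab + 1);
            postings.computeIfAbsent(word, w -> new ArrayList<>())
                    .add(filePath + "->" + e.getValue());
        }
        return postings;
    }

    public static void main(String[] args) {
        // "c.txt" is a placeholder path; "java" occurs twice in it.
        List<String> keys = Arrays.asList("java\tc.txt", "baidu\tc.txt", "java\tc.txt");
        System.out.println(combine(keys));
    }
}
```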

2.4 Reducer code
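
For the reduce step, a minimal plain-Java sketch, assuming the reducer receives a word together with its "filePath->count" postings and only has to join them into one output line in the "word  file path->count;file path->count;" format. `ReduceStepSketch`, `reduce`, and the paths `a.txt` and `c.txt` are hypothetical, and the tab between word and postings is an assumption about the output separator; no Hadoop classes are involved.

```java
import java.util.Arrays;
import java.util.List;

class ReduceStepSketch {
    // Joins all postings of one word into a single line,
    // terminating every posting with ';' as in the target format.
    static String reduce(String word, List<String> postings) {
        StringBuilder line = new StringBuilder(word).append('\t');
        for (String posting : postings) {
            line.append(posting).append(';');
        }
        return line.toString();
    }

    public static void main(String[] args) {
        // Placeholder postings for the word "baidu".
        System.out.println(reduce("baidu", Arrays.asList("a.txt->1", "c.txt->1")));
    }
}
```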





[root@naidong sbin]# hadoop fs -mkdir  /inverseindex


[root@naidong jurf_temp_data]# hadoop fs -put a.txt b.txt c.txt /inverseindex

[root@naidong jurf_temp_data]# hadoop jar hadoop-demo-inverseindex.jar  /inverseindex  /inverseindexout

2019-01-16 21:12:40,752 INFO client.RMProxy: Connecting to ResourceManager at /

2019-01-16 21:12:43,883 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.

2019-01-16 21:12:44,027 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1547639631603_0002

2019-01-16 21:12:48,398 INFO input.FileInputFormat: Total input files to process : 3

2019-01-16 21:12:50,021 INFO mapreduce.JobSubmitter: number of splits:3

2019-01-16 21:12:50,520 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled

2019-01-16 21:12:51,580 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1547639631603_0002

2019-01-16 21:12:51,584 INFO mapreduce.JobSubmitter: Executing with tokens: []

