34 MapReduce Custom InputFormat

Requirement

Both HDFS and MapReduce lose efficiency when dealing with small files, yet in practice scenarios with large numbers of small files are hard to avoid, so a corresponding solution is needed.

Analysis

Optimizing for small files essentially comes down to the following approaches:

  1. At data collection time, merge the small files or small batches of data into large files before uploading them to HDFS.
  2. Before the business processing, run a MapReduce program on HDFS to merge the small files.
  3. During MapReduce processing, use CombineTextInputFormat (a concrete CombineFileInputFormat) to improve efficiency; a sketch follows this list.
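
For the third approach, a minimal driver-side sketch is shown below. It is not part of the original article: it assumes Hadoop's stock CombineTextInputFormat (new mapreduce API) with an identity mapper, and the class name, the paths taken from args, and the 4 MB split ceiling are illustrative placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch of approach 3: pack many small text files into a few large splits,
// so that far fewer map tasks are launched. Illustrative only.
public class CombineSmallFilesDriver {
	public static void main(String[] args) throws Exception {
		Job job = Job.getInstance(new Configuration(), "combine small files demo");
		job.setJarByClass(CombineSmallFilesDriver.class);

		job.setMapperClass(Mapper.class); // identity mapper: pass records through unchanged
		job.setNumReduceTasks(0);         // map-only job
		job.setOutputKeyClass(LongWritable.class);
		job.setOutputValueClass(Text.class);

		// Group many small files into combined splits of at most ~4 MB each
		job.setInputFormatClass(CombineTextInputFormat.class);
		CombineTextInputFormat.setMaxInputSplitSize(job, 4 * 1024 * 1024);

		CombineTextInputFormat.addInputPath(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));
		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}
}

With this configuration the number of map tasks depends on the combined split size rather than on the number of input files.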

Implementation

This section implements the second approach above.

The core mechanism of the program:

  • Define a custom InputFormat
  • Override the RecordReader so that one read consumes a complete file and wraps it as a single key-value pair
  • On output, use SequenceFileOutputFormat to write the merged file

The code is as follows:

Custom InputFormat

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class WholeFileInputFormat extends
		FileInputFormat<NullWritable, BytesWritable> {
	// Mark each small file as non-splittable, so that every file produces exactly one key-value pair
	@Override
	protected boolean isSplitable(JobContext context, Path file) {
		return false;
	}

	@Override
	public RecordReader<NullWritable, BytesWritable> createRecordReader(
			InputSplit split, TaskAttemptContext context) throws IOException,
			InterruptedException {
		WholeFileRecordReader reader = new WholeFileRecordReader();
		reader.initialize(split, context);
		return reader;
	}
}

Custom RecordReader

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {
	private FileSplit fileSplit;                        // the split, i.e. one whole small file
	private Configuration conf;
	private BytesWritable value = new BytesWritable();  // holds the entire file content
	private boolean processed = false;                  // true once the single record has been emitted

	@Override
	public void initialize(InputSplit split, TaskAttemptContext context)
			throws IOException, InterruptedException {
		this.fileSplit = (FileSplit) split;
		this.conf = context.getConfiguration();
	}

	@Override
	public boolean nextKeyValue() throws IOException, InterruptedException {
		if (!processed) {
			byte[] contents = new byte[(int) fileSplit.getLength()];
			Path file = fileSplit.getPath();
			FileSystem fs = file.getFileSystem(conf);
			FSDataInputStream in = null;
			try {
				in = fs.open(file);
				IOUtils.readFully(in, contents, 0, contents.length);
				value.set(contents, 0, contents.length);
			} finally {
				IOUtils.closeStream(in);
			}
			processed = true;
			return true;
		}
		return false;
	}

	@Override
	public NullWritable getCurrentKey() throws IOException,
			InterruptedException {
		return NullWritable.get();
	}

	@Override
	public BytesWritable getCurrentValue() throws IOException,
			InterruptedException {
		return value;
	}

	@Override
	public float getProgress() throws IOException {
		return processed ? 1.0f : 0.0f;
	}

	@Override
	public void close() throws IOException {
		// do nothing
	}
}

Define the MapReduce processing flow

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class SmallFilesToSequenceFileConverter extends Configured implements
		Tool {
	static class SequenceFileMapper extends
			Mapper<NullWritable, BytesWritable, Text, BytesWritable> {
		private Text filenameKey;

		@Override
		protected void setup(Context context) throws IOException,
				InterruptedException {
			InputSplit split = context.getInputSplit();
			Path path = ((FileSplit) split).getPath();
			filenameKey = new Text(path.toString());
		}

		@Override
		protected void map(NullWritable key, BytesWritable value,
				Context context) throws IOException, InterruptedException {
			context.write(filenameKey, value);
		}
	}

	@Override
	public int run(String[] args) throws Exception {
		Configuration conf = new Configuration();
		System.setProperty("HADOOP_USER_NAME", "hdfs");
		String[] otherArgs = new GenericOptionsParser(conf, args)
				.getRemainingArgs();
		if (otherArgs.length != 2) {
			System.err.println("Usage: combinefiles <in> <out>");
			System.exit(2);
		}
		
		Job job = Job.getInstance(conf,"combine small files to sequencefile");
		job.setJarByClass(SmallFilesToSequenceFileConverter.class);
		// The custom input format hands each small file to the mapper as a single record
		job.setInputFormatClass(WholeFileInputFormat.class);
		job.setOutputFormatClass(SequenceFileOutputFormat.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(BytesWritable.class);
		job.setMapperClass(SequenceFileMapper.class);
		// Wire up the input and output paths supplied on the command line
		FileInputFormat.setInputPaths(job, new Path(otherArgs[0]));
		SequenceFileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
		return job.waitForCompletion(true) ? 0 : 1;
	}

	public static void main(String[] args) throws Exception {
		int exitCode = ToolRunner.run(new SmallFilesToSequenceFileConverter(),
				args);
		System.exit(exitCode);
		
	}
}
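
To verify the result, the merged SequenceFile can be read back and each stored file name printed together with its size. The sketch below uses the standard SequenceFile.Reader; the class name SequenceFileDump and the default part-file path are hypothetical, not from the original article.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Sketch: iterate over the merged SequenceFile and print each stored file path and its length.
public class SequenceFileDump {
	public static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();
		// Hypothetical default location; pass the real part file as the first argument
		Path part = new Path(args.length > 0 ? args[0] : "/combined/part-r-00000");
		SequenceFile.Reader reader =
				new SequenceFile.Reader(conf, SequenceFile.Reader.file(part));
		try {
			Text key = new Text();                     // original file path written by the mapper
			BytesWritable value = new BytesWritable(); // raw bytes of that file
			while (reader.next(key, value)) {
				System.out.println(key + "\t" + value.getLength() + " bytes");
			}
		} finally {
			reader.close();
		}
	}
}

A quick alternative check is hadoop fs -text on the part file, which renders SequenceFile keys and values as text.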