hadoop 使用Avro排序
发布时间:2023-09-14 09:02:30
在上例中,使用Avro框架求出数据的最大值,本例使用Avro对数据排序,输入依然是之前的样本,输出使用文本(也可以输出Avro格式)。
1、在Avro的Schema中直接设置排序方向。
dataRecord.avsc,放入resources目录下:
{ "type":"record", "name":"WeatherRecord", "doc":"A weather reading", "fields":[ {"name":"year","type":"int"}, {"name":"temperature","type":"int","order":"descending"} ] }
原常量类:
public class AvroSchemas { private Schema currentSchema; //本例中不使用常量,修改成资源中加载 public static final Schema SCHEMA = new Schema.Parser().parse("{\n" + "\t\"type\":\"record\",\n" + "\t\"name\":\"WeatherRecord\",\n" + "\t\"doc\":\"A weather reading\",\n" + "\t\"fields\":[\n" + "\t\t{\"name\":\"year\",\"type\":\"int\"},\n" + "\t\t{\"name\":\"temperature\",\"type\":\"int\",\"order\":\"descending\"}\n" + "\t]\t\n" + "}"); public AvroSchemas() throws IOException { Schema.Parser parser = new Schema.Parser(); //采用从资源文件中读取Avro数据格式 this.currentSchema = parser.parse(getClass().getResourceAsStream("dataRecord.avsc")); } public Schema getCurrentSchema() { return currentSchema; } }
2、mapper
public class AvroMapper extends Mapper<LongWritable,Text,AvroKey<GenericRecord>,AvroValue<GenericRecord>> { private RecordParser parser = new RecordParser(); // private GenericRecord record = new GenericData.Record(AvroSchemas.SCHEMA); private AvroSchemas schema; private GenericRecord record; public AvroMapper() throws IOException { schema =new AvroSchemas(); record = new GenericData.Record(schema.getCurrentSchema()); } @Override protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { parser.parse(value.toString()); if(parser.isValid()){ record.put("year",parser.getYear()); record.put("temperature",parser.getData()); context.write(new AvroKey<>(record),new AvroValue<>(record)); } } }
3、reducer
public class AvroReducer extends Reducer<AvroKey<GenericRecord>,AvroValue<GenericRecord>,IntPair,NullWritable> { //多文件输出,本例中每年一个文件 private MultipleOutputs<IntPair,NullWritable> multipleOutputs; /** * Called once at the start of the task. * * @param context */ @Override protected void setup(Context context) throws IOException, InterruptedException { multipleOutputs = new MultipleOutputs<>(context); } @Override protected void reduce(AvroKey<GenericRecord> key, Iterable<AvroValue<GenericRecord>> values, Context context) throws IOException, InterruptedException { //在混洗阶段完成排序,reducer只需直接输出数据 for (AvroValue<GenericRecord> value : values){ GenericRecord record = value.datum(); //多文件输出,每年一个文件。 multipleOutputs.write(new IntPair((Integer) record.get("year"),(Integer)(record.get("temperature"))),NullWritable.get(),record.get("year").toString()); // context.write(new IntPair((Integer) record.get("year"),(Integer)(record.get("temperature"))),NullWritable.get()); } } }
4、job
public class AvroSort extends Configured implements Tool { @Override public int run(String[] args) throws Exception { Configuration conf = getConf(); conf.set("mapreduce.job.ubertask.enable","true"); Job job = Job.getInstance(conf,"Avro sort"); job.setJarByClass(AvroSort.class); //通过AvroJob直接设置Avro key和value的输入和输出,而不是使用Job来设置 AvroJob.setMapOutputKeySchema(job, AvroSchemas.SCHEMA); AvroJob.setMapOutputValueSchema(job,AvroSchemas.SCHEMA); // AvroJob.setOutputKeySchema(job,AvroSchemas.SCHEMA); job.setMapperClass(AvroMapper.class); job.setReducerClass(AvroReducer.class); job.setInputFormatClass(TextInputFormat.class); // job.setOutputFormatClass(AvroKeyOutputFormat.class); job.setOutputFormatClass(TextOutputFormat.class); FileInputFormat.addInputPath(job,new Path(args[0])); FileOutputFormat.setOutputPath(job,new Path(args[1])); Path outPath = new Path(args[1]); FileSystem fileSystem = outPath.getFileSystem(conf); //删除输出路径 if(fileSystem.exists(outPath)) { fileSystem.delete(outPath,true); } return job.waitForCompletion(true) ? 0:1; } public static void main(String[] args) throws Exception{ int exitCode = ToolRunner.run(new AvroSort(),args); System.exit(exitCode); } }
相关文章
- [Hadoop]chukwa与ganglia的区别
- Hadoop Mapreduce分区、分组、二次排序过程详解
- Hadoop生态上几个技术的关系与区别:hive、pig、hbase 关系与区别 Pig
- 解决spark on yarn报错:File /tmp/hadoop-root/nm-local-dir/filecache does not exist
- Hadoop MapReduce实例:按手机上网总流量降序排序代码实现及结果演示
- Hadoop hdfs 使用流来下载文件数据代码示例
- Hadoop SSH免密登录公钥生成并实现不同主机间的免密登录
- Hadoop快速入门——第二章、分布式集群(第三节、HDFS Shell的常用命令)
- 9.2.2 hadoop全排序实例详解
- Hadoop阅读笔记(三)——深入MapReduce排序和单表连接
- 017-Hadoop Hive sql语法详解7-去重排序、数据倾斜
- 【大数据project师之路】Hadoop——MapReduce概述
- centOS6.3(64bit)Hadoop的Eclipse开发环境搭建
- 使用hadoop命令rcc生成Record 一个简单的方法来实现自己的定义writable对象
- hadoop系列-hadoop版本选择
- 大数据Hadoop之——Kafka 图形化工具 EFAK(EFAK环境部署)
- 大数据Hadoop之——基于内存型SQL查询引擎Presto(Presto-Trino环境部署)
- Hadoop_HDFS(二):Shell操作之文件的管理(上传下载删除等)