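Handling Non-Standard Dates and Weather Data on the MaxCompute Platform: A Case Study from the Power AI Competition
Abstract: The date formats supported by the MaxCompute platform are generally zero-padded ones such as 20170725 or 2017/07/25. The dates supplied in the Power AI Competition are unpadded, such as 2016/1/1, which is non-standard as far as MaxCompute is concerned, so the date functions of ODPS SQL cannot be applied to them directly. In addition, the weather data supplied in the competition is not numeric, which kept many teams from using it (the teams that have published their solutions so far mostly left the weather condition, wind speed, and wind direction fields unused), even though weather usually has a considerable influence on short-term load forecasting. This article describes in detail how to handle the competition's non-standard dates with ODPS SQL on MaxCompute and how to process the weather data with OPEN_MR. It also walks through the whole pipeline on MaxCompute, from data preprocessing and feature extraction to the final predictions, using ODPS SQL, OPEN_MR, and PAI commands. It is offered for reference, and criticism and corrections are welcome.
Alibaba Cloud's MaxCompute platform offers powerful features and open interfaces, which make it convenient to process all kinds of data and to carry out analysis and prediction quickly and efficiently. Of the work described here, the weather data had been processed earlier in scattered spare time (less than a day in total); all the other code was written in a rush within the two days after the data was switched for the semi-final round, which shows how capable MaxCompute is. Because the OPEN_MR part of the competition platform currently cannot be integrated into ODPS SQL, the run has to be interrupted once. Apart from that, the rest of the code can be run as a batch with a single click of the "Run" button, going all the way from raw data to submitted results. Note that the platform used here is the Tianchi competition platform, a competition-only version of MaxCompute that Alibaba Cloud has trimmed (restricted) to keep the competition data secure; the publicly available MaxCompute platform has fewer restrictions and more features.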
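1. Problem Description
The main data source for this competition is the enterprise power consumption table Tianchi_power, which covers more than 1,000 enterprises in the high-tech zone of Yangzhong City (the data has been desensitized). It contains the enterprise ID (anonymized), the date, and the power consumption. The fields are listed in the following table: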
tianchi_power
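2. Problem Analysis
This is a short-term load forecasting problem. In 2010 State Grid issued the enterprise standard Q/GDW 552-2010, "Technical Specification for Short-Term and Ultra-Short-Term Load Forecasting of Power Grids" (《电网短期超短期负荷预测技术规范》), which introduces the relevant terminology, forecast content, error formulas, and commonly used forecasting algorithms. Because the forecast serves a different purpose in this competition, the forecast content specified by the standard (time granularity and forecast horizon) is not followed exactly, and the error metric is a custom formula as well. The nature of the problem is unchanged, however: it is still short-term load forecasting.
In our earlier work on ultra-short-term power output forecasting for photovoltaic plants, we found that missing values and numerical weather prediction data had the largest effect on forecast accuracy. The State Grid enterprise standard also gives a general overview of the factors that influence load forecasting.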
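Because factors such as social events cannot be known in advance, we concentrated on missing values and weather data in this competition and focused our work on three things:
1) encoding and transforming the official weather data to build a complete set of weather features;
2) building an overfitted model to fill in missing values;
3) building model one on the corrected data to predict the trend and model two on the raw data to predict the consumption level (approximate value), then blending the two models with weights.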
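The weighted blending named in step 3 can be illustrated with a minimal Java sketch. The class name, method name, and the weight 0.25 are assumptions made for illustration only; the article does not state which weights were used in the competition.

```java
// Hypothetical sketch of weighted blending; the weight is illustrative,
// not the value used in the competition.
final class ForecastBlend {
    // Returns w * trend[i] + (1 - w) * level[i] for every forecast day.
    static double[] blend(double[] trend, double[] level, double w) {
        if (trend.length != level.length) {
            throw new IllegalArgumentException("forecasts must cover the same days");
        }
        double[] out = new double[trend.length];
        for (int i = 0; i < trend.length; i++) {
            out[i] = w * trend[i] + (1.0 - w) * level[i];
        }
        return out;
    }

    public static void main(String[] args) {
        double[] blended = blend(new double[] {100, 200}, new double[] {200, 400}, 0.25);
        System.out.println(java.util.Arrays.toString(blended)); // [175.0, 350.0]
    }
}
```

In practice the weight would be chosen on a held-out period rather than fixed by hand.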
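3. Data Preprocessing
3.1 Handling non-standard dates
The year, month, and day are extracted separately with regexp_extract, the regular-expression string function provided by ODPS SQL, and then combined into a standard date format. The code is as follows: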
-- Produce the total daily power consumption
DROP TABLE IF EXISTS t_netivs_daily_sum_consumption;
CREATE TABLE IF NOT EXISTS t_netivs_daily_sum_consumption AS
SELECT
    *   -- assumed: the select list begins with *
    ,(year*10000+month*100+day) as day_int   -- convert to the 20160101 format
    ,(month*100+day) as month_day
    ,(year*100+month) as year_month
    ,((year-2015)*12+month) as month_index
FROM
(
    SELECT
        *   -- assumed: the select list begins with *
        ,cast(regexp_extract(record_date,'(.*)/(.*)/(.*)',1) as bigint) as year    -- extract the year
        ,cast(regexp_extract(record_date,'(.*)/(.*)/(.*)',2) as bigint) as month   -- extract the month
        ,cast(regexp_extract(record_date,'(.*)/(.*)/(.*)',3) as bigint) as day     -- extract the day
    FROM
    (
        SELECT
            record_date
            ,sum(power_consumption) as power_consumption
        FROM odps_tc_257100_f673506e024.tianchi_power2
        GROUP BY record_date
    ) t1
) t2
;
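With this code, non-standard dates such as 2016/1/1 are easily converted into bigint values such as 20160101. These can later be turned into a date type with to_date(cast(xxx as string),'yyyymmdd'), after which the built-in ODPS SQL functions can be used to extract date features.
3.2 Holidays
Since uploading and downloading data is in principle not allowed during the competition, the proper approach is to handle holidays with case when in ODPS SQL. The code for the holiday and date features is given here: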
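The same normalization can also be sketched in Java, the language OPEN_MR jobs are written in, which is handy for checking a few values by hand. The class and method names below are hypothetical and are not part of the original solution.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that mirrors the regexp_extract logic in plain Java.
final class DateNormalizer {
    // Three capture groups for year, month, and day, restricted to digits.
    private static final Pattern YMD = Pattern.compile("(\\d+)/(\\d+)/(\\d+)");

    // Converts an unpadded date such as "2016/1/1" into the integer 20160101.
    static long toDayInt(String recordDate) {
        Matcher m = YMD.matcher(recordDate.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("unexpected date format: " + recordDate);
        }
        long year = Long.parseLong(m.group(1));
        long month = Long.parseLong(m.group(2));
        long day = Long.parseLong(m.group(3));
        return year * 10000 + month * 100 + day;
    }

    public static void main(String[] args) {
        System.out.println(toDayInt("2016/1/1"));   // 20160101
        System.out.println(toDayInt("2016/11/30")); // 20161130
    }
}
```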
-- Produce the extended date features
DROP TABLE IF EXISTS t_netivs_date_features;
CREATE TABLE IF NOT EXISTS t_netivs_date_features AS
SELECT
    day_int
    ,day_index
    ,month_index
    ,year_index
    ,month
    ,day
    ,(month*100+day) as month_day
    ,(year*100+month) as year_month
    -- MaxCompute weekday() returns 0 (Monday) through 6 (Sunday); check that (6,7) matches the intended weekend days
    ,case when (weekday in (6,7) and special_workday == 0) or holiday==1 then 0 else 1 end as workday
    ,weekofyear
    ,day_to_lastday
    ,month_day_num
    ,weekday
    ,holiday
    ,special_workday
    ,special_holiday
    ,day1_before_special_holiday
    ,day2_before_special_holiday
    ,day3_before_special_holiday
    ,day1_before_holiday
    ,day2_before_holiday
    ,day3_before_holiday
    ,day1_after_special_holiday
    ,day2_after_special_holiday
    ,day3_after_special_holiday
    ,day1_after_holiday
    ,day2_after_holiday
    ,day3_after_holiday
FROM
(
    SELECT
        day_int
        ,datediff(dt,to_date('2015-01-01','yyyy-mm-dd'),'dd')+1 as day_index
        ,datediff(dt,to_date('2015-01-01','yyyy-mm-dd'),'mm')+1 as month_index
        ,datepart(dt,'yyyy')-2015+1 as year_index
        ,datepart(dt,'yyyy') as year
        ,datepart(dt,'mm') as month
        ,datepart(dt,'dd') as day
        ,datepart(lastday(dt),'dd') as month_day_num
        ,weekofyear(dt) as weekofyear
        ,datediff(lastday(dt),dt,'dd') as day_to_lastday
        ,weekday(dt) as weekday
        ,holiday
        ,special_workday
        ,special_holiday
        ,case when cast(to_char(dateadd(dt,-1,'dd'),'yyyymmdd') as bigint) in (20150101,20150218,20150305,20150405,20150501,20150620,20150903,20150927,20151001,20160101,20160207,20160404,20160501,20160609,20160915,20161001) then 1 else 0 end as day1_before_special_holiday
        ,case when cast(to_char(dateadd(dt,-2,'dd'),'yyyymmdd') as bigint) in (20150101,20150218,20150305,20150405,20150501,20150620,20150903,20150927,20151001,20160101,20160207,20160404,20160501,20160609,20160915,20161001) then 1 else 0 end as day2_before_special_holiday
        ,case when cast(to_char(dateadd(dt,-3,'dd'),'yyyymmdd') as bigint) in (20150101,20150218,20150305,20150405,20150501,20150620,20150903,20150927,20151001,20160101,20160207,20160404,20160501,20160609,20160915,20161001) then 1 else 0 end as day3_before_special_holiday
        ,case when cast(to_char(dateadd(dt,-1,'dd'),'yyyymmdd') as bigint) in (20150101,20150218,20150404,20150501,20150620,20150903,20150927,20151001,20160101,20160207,20160404,20160501,20160609,20160915,20161001) then 1 else 0 end as day1_before_holiday
        ,case when cast(to_char(dateadd(dt,-2,'dd'),'yyyymmdd') as bigint) in (20150101,20150218,20150404,20150501,20150620,20150903,20150927,20151001,20160101,20160207,20160404,20160501,20160609,20160915,20161001) then 1 else 0 end as day2_before_holiday
        ,case when cast(to_char(dateadd(dt,-3,'dd'),'yyyymmdd') as bigint) in (20150101,20150218,20150404,20150501,20150620,20150903,20150927,20151001,20160101,20160207,20160404,20160501,20160609,20160915,20161001) then 1 else 0 end as day3_before_holiday
        ,case when cast(to_char(dateadd(dt,1,'dd'),'yyyymmdd') as bigint) in (20150101,20150219,20150305,20150405,20150501,20150620,20150903,20150927,20151001,20160101,20160208,20160404,20160501,20160609,20160915,20161001) then 1 else 0 end as day1_after_special_holiday
        ,case when cast(to_char(dateadd(dt,2,'dd'),'yyyymmdd') as bigint) in (20150101,20150219,20150305,20150405,20150501,20150620,20150903,20150927,20151001,20160101,20160208,20160404,20160501,20160609,20160915,20161001) then 1 else 0 end as day2_after_special_holiday
        ,case when cast(to_char(dateadd(dt,3,'dd'),'yyyymmdd') as bigint) in (20150101,20150219,20150305,20150405,20150501,20150620,20150903,20150927,20151001,20160101,20160208,20160404,20160501,20160609,20160915,20161001) then 1 else 0 end as day3_after_special_holiday
        ,case when cast(to_char(dateadd(dt,1,'dd'),'yyyymmdd') as bigint) in (20150103,20150224,20150406,20150503,20150622,20150905,20150927,20151007,20160101,20160213,20160404,20160502,20160611,20160917,20161007) then 1 else 0 end as day1_after_holiday
        ,case when cast(to_char(dateadd(dt,2,'dd'),'yyyymmdd') as bigint) in (20150103,20150224,20150406,20150503,20150622,20150905,20150927,20151007,20160101,20160213,20160404,20160502,20160611,20160917,20161007) then 1 else 0 end as day2_after_holiday
        ,case when cast(to_char(dateadd(dt,3,'dd'),'yyyymmdd') as bigint) in (20150103,20150224,20150406,20150503,20150622,20150905,20150927,20151007,20160101,20160213,20160404,20160502,20160611,20160917,20161007) then 1 else 0 end as day3_after_holiday
    FROM
    (
        SELECT
            day_int
            ,to_date(to_char(day_int),'yyyymmdd') as dt
            ,case when day_int in (20150101,20150102,20150103,20150218,20150219,20150220,20150221,20150222,20150223,20150224,20150404,20150405,20150406,20150501,20150502,20150503,20150620,20150621,20150622,20150903,20150904,20150905,20150927,20151001,20151002,20151003,20151004,20151005,20151006,20151007,20160101,20160207,20160208,20160209,20160210,20160211,20160212,20160213,20160404,20160501,20160502,20160609,20160610,20160611,20160915,20160916,20160917,20161001,20161002,20161003,20161004,20161005,20161006,20161007) then 1 else 0 end as holiday
            ,case when day_int in (20150104,20150215,20150228,20150906,20151010,20160206,20160214,20160612,20160918,20161008,20161009) then 1 else 0 end as special_workday
            ,case when day_int in (20150101,20150218,20150219,20150305,20150405,20150501,20150620,20150903,20150927,20151001,20160101,20160207,20160208,20160404,20160501,20160609,20160915,20161001) then 1 else 0 end as special_holiday
        FROM t_netivs_tianchi_weather_data
    ) t1
) t2
;
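This code covers essentially all the common operations for normalizing dates and extracting date features with ODPS SQL on MaxCompute; anyone using MaxCompute for time-series feature analysis and forecasting may consider borrowing and improving it to extract their own date features. Analysis of the competition data readily shows that holidays have a very large effect on total daily consumption, and that the effect persists: consumption may rise or fall abruptly as some holidays approach, and after some holidays it stays higher or lower for several consecutive days. Holidays are therefore handled in some detail here, with features for 1, 2, and 3 days before and after a holiday.
3.3 Processing the weather data
In the weather data provided, the weather condition, wind speed, and wind direction fields are all strings and have to be converted to numeric values before a machine learning model can use them. Because the set of possible strings is small, one option is to sort the strings and use each string's rank as its code, feeding that directly to the model as an input feature. This is simple and is easy to implement with the built-in row_number function of ODPS SQL. Its drawback is just as clear: it does not exploit the relationships between weather types, for example between 大雨 (heavy rain) and 大到暴雨 (heavy rain to rainstorm). We therefore used OPEN_MR to process the weather data in detail, along the following lines: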
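The idea of a "days before or after a holiday" flag can be sketched in Java as follows. This is an illustrative reading, not the author's code: here a positive offset looks forward in time, so a flag computed with offset +1 marks the day immediately before a listed date. The class name, method name, and the sample date set are hypothetical.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Set;

// Hypothetical sketch of a holiday-neighbour flag on yyyymmdd integers.
final class HolidayNeighbors {
    private static final DateTimeFormatter FMT = DateTimeFormatter.BASIC_ISO_DATE; // yyyyMMdd

    // Returns 1 if the date offsetDays away from dayInt is in the given set, else 0.
    static int flag(long dayInt, int offsetDays, Set<Long> days) {
        LocalDate shifted = LocalDate.parse(Long.toString(dayInt), FMT).plusDays(offsetDays);
        long shiftedInt = Long.parseLong(shifted.format(FMT));
        return days.contains(shiftedInt) ? 1 : 0;
    }

    public static void main(String[] args) {
        Set<Long> holidayStarts = Set.of(20160915L, 20161001L); // sample values only
        System.out.println(flag(20160930L, 1, holidayStarts)); // 1: the next day is 20161001
        System.out.println(flag(20160930L, 2, holidayStarts)); // 0: 20161002 is not in the set
    }
}
```

Working on LocalDate rather than on the raw integer keeps month and year boundaries correct, which is the same reason the SQL goes through dateadd and to_char.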
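1) list every distinct value that occurs in the table and examine its composition and categories;
2) some entries contain a single weather type, such as “大雨、中雨、小雨” (heavy, moderate, or light rain), while others contain two, such as “大到暴雨、多云转阴” (heavy rain to rainstorm, cloudy turning overcast); all entries are therefore unified, and an entry with only one type is represented by two identical types;
3) for each weather type, derive a weather category (晴 sunny, 雪 snow, 雨 rain, and so on), a weather level (小雨, 中雨, 大雨, 暴雨, and so on, numbered from 1), and a weather combination (category plus level).
The table for weather data processed this way can be created with the following ODPS SQL statement; it serves as the output table of the OPEN_MR job: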
-- Output table of the MapReduce job that processes the weather data
-- The December weather data provided online has already been processed together with the rest, so no further change is needed
-- DROP TABLE IF EXISTS t_netivs_encode_weather;
CREATE TABLE IF NOT EXISTS t_netivs_encode_weather (
    day_int bigint
    ,temperature_high bigint
    ,temperature_low bigint
    ,weather1 bigint
    ,weather1_level bigint
    ,weather1_type bigint
    ,weather2 bigint
    ,weather2_level bigint
    ,weather2_type bigint
    ,wind_direction bigint
    ,wind_speed double
    ,wind_speed1 double
    ,wind_speed2 double
);
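To parse the weather data, an OPEN_MR job was written. Its core code is as follows:
The raw data is obtained in the mapper, processed, and then emitted to the reducer. The main flow is as follows: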
// Main flow of the weather data processing
public void weather_encode(long day_int, long temperature_high, long temperature_low, String weather, String wind_direction, String wind_speed, Record vals){
    m_output_vals = vals;
    m_day_int = day_int;
    m_temp_high = temperature_high;
    m_temp_low = temperature_low;
    reset();
    weather_parser(weather);
    wind_direction_parser(wind_direction);
    wind_speed_parse(wind_speed);
    // emit the features
    output();
}
The code that converts the weather condition into codes is as follows:
// -------------- Re-encode the weather condition ---------------------------------------------//
private void weather_parser(String weather){
    String weather1,weather2;
    // A trailing ~ most likely means the record is incomplete, so drop only the trailing ~
    if(weather.endsWith("~")){
        weather = weather.substring(0, weather.length()-1);
    }
    weather = weather.replace("转", "~");
    // Split into the first and second weather condition
    if(weather.contains("~")){
        weather1 = weather.split("~")[0];
        weather2 = weather.split("~")[1];
    }
    else {
        weather1 = weather;
        weather2 = weather;
    }
    // Parse weather1 and weather2
    // 小雨、小到中雨、中雨、中到大雨、大雨、大到暴雨、暴雨、阵雨、雷雨、雷阵雨、小雪、中雪、大雪、雨夹雪、晴、阴、多云
    m_weather1 = get_weather_index(weather1);   // re-encode the weather condition
    m_weather1_level = get_weather_level(weather1);
    m_weather1_type = get_weather_type(weather1);
    m_weather2 = get_weather_index(weather2);
    m_weather2_level = get_weather_level(weather2);
    m_weather2_type = get_weather_type(weather2);
}
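The helpers get_weather_index, get_weather_level, and get_weather_type are not shown in the article. A minimal sketch of what such helpers might look like is given below. Every code value here is an assumption: the category numbering, the choice to read the level from the leading character (so a range such as 中到大雨 gets the level of its lower bound), the choice to treat 雨夹雪 as snow, and the combination formula are all illustrative.

```java
// Hypothetical sketch of the weather helpers; the original numbering is not published.
final class WeatherCodes {
    // Category code: 1 rain, 2 snow, 3 sunny, 4 cloudy, 5 overcast, 0 unknown.
    static int weatherType(String w) {
        if (w.contains("雪")) return 2;   // snow, including 雨夹雪 (sleet) by assumption
        if (w.contains("雨")) return 1;   // any kind of rain
        if (w.equals("晴")) return 3;
        if (w.equals("多云")) return 4;
        if (w.equals("阴")) return 5;
        return 0;
    }

    // Intensity level read from the leading character: 小=1, 中=2, 大=3, 暴=4, otherwise 0.
    static int weatherLevel(String w) {
        if (w.isEmpty()) return 0;
        switch (w.charAt(0)) {
            case '小': return 1;
            case '中': return 2;
            case '大': return 3;
            case '暴': return 4;
            default: return 0;
        }
    }

    // Combined code built from category and level, so related conditions get nearby values.
    static int weatherIndex(String w) {
        return weatherType(w) * 10 + weatherLevel(w);
    }

    public static void main(String[] args) {
        System.out.println(weatherIndex("暴雨")); // 14
        System.out.println(weatherIndex("多云")); // 40
    }
}
```

Unlike a plain rank code, this keeps 小雨, 中雨, and 大雨 ordered by intensity within the same category, which is the relationship the ordinal encoding loses.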
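3.4 Filling missing data with an overfitted model
With the code from the preceding sections, the load data can be reformatted and the date and weather features extracted quickly. Analysis of total daily consumption in November 2016 shows that the large customer 1416 has two days of missing consumption, which makes the totals for those two days abnormally low. This suggests two measures:
1) classify the users and handle each class separately;
2) fill in the missing consumption of such large customers to cancel the effect of chance events on the consumption trend, build a model on the filled data to predict the daily trend, and combine it with the predictions of a model built on the real (unfilled) consumption to obtain the final forecast.
Because this model is built to fill in missing data rather than to predict future data, it should deliberately use the consumption of the same user on both sides of a gap, as well as the consumption of other users over the same period, to build an overfitted model that "looks across" the day being predicted and so fills the gap better. The complete code for feature extraction and prediction with this overfitted filling model is as follows: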
-- 经过详细分析,拟定采用的缺失数据填充规则: -- 1. 11月份缺失值为30,所有历史用电量改成1; -- 2. 除了11月份缺失值为30天的,其他non_default_power_consumption_median 2500的都不处理; -- 3. 总缺失天数大于30的不处理; DROP TABLE IF EXISTS t_netivs_user_missing_info; CREATE TABLE IF NOT EXISTS t_netivs_user_missing_info AS select case when t11.user_id is not null then t11.user_id else t2.user_id end as user_id ,case when t11.missing_day_cnt is null then 0 else t11.missing_day_cnt end as missing_day_cnt ,case when t11.first_default_day_int is null then 0 else t11.first_default_day_int end as first_default_day_int ,case when t11.last_default_day_int is null then 0 else t11.last_default_day_int end as last_default_day_int ,case when t11.last1month_default_day_cnt is null then 0 else t11.last1month_default_day_cnt end as last1month_default_day_cnt ,case when t11.last2month_default_day_cnt is null then 0 else t11.last2month_default_day_cnt end as last2month_default_day_cnt ,case when t11.last3month_default_day_cnt is null then 0 else t11.last3month_default_day_cnt end as last3month_default_day_cnt ,case when t2.power_consumption_avg is null then 0 else t2.power_consumption_avg end as power_consumption_avg ,case when t2.power_consumption_median is null then 0 else t2.power_consumption_median end as power_consumption_median ,case when t2.power_consumption_max is null then 0 else t2.power_consumption_max end as power_consumption_max ,case when t2.power_consumption_min is null then 0 else t2.power_consumption_min end as power_consumption_min ,case when t2.first_non_default_day_int is null then 0 else t2.first_non_default_day_int end as first_non_default_day_int ,case when t2.last_non_default_day_int is null then 0 else t2.last_non_default_day_int end as last_non_default_day_int select from select user_id ,count(*) as missing_day_cnt ,min(day_int) as first_default_day_int ,max(day_int) as last_default_day_int ,SUM(case when day_int =20161101 and day_int 20161201 then 1 else 0 end) as last1month_default_day_cnt ,SUM(case when day_int =20161001 and 
day_int 20161101 then 1 else 0 end) as last2month_default_day_cnt ,SUM(case when day_int =20160901 and day_int 20161001 then 1 else 0 end) as last3month_default_day_cnt from t_netivs_ext_power where power_consumption=1 group by user_id where missing_day_cnt 1 FULL OUTER JOIN select user_id ,avg(power_consumption) as power_consumption_avg ,median(power_consumption) as power_consumption_median ,max(power_consumption) as power_consumption_max ,min(power_consumption) as power_consumption_min ,min(day_int) as first_non_default_day_int ,max(day_int) as last_non_default_day_int from t_netivs_ext_power where power_consumption 1 group by user_id ON t11.user_id = t2.user_id -- 产生要用xgboost来填充的user_id的列表 DROP TABLE IF EXISTS t_netivs_xgb_fill_user_day_list; DROP TABLE IF EXISTS t_netivs_gbdt_fill_user_day_list; CREATE TABLE IF NOT EXISTS t_netivs_gbdt_fill_user_day_list AS SELECT user_id ,day_int FROM t_netivs_ext_power WHERE power_consumption =1 and user_id in SELECT user_id FROM t_netivs_user_missing_info WHERE power_consumption_median 2500 and missing_day_cnt 30 and missing_day_cnt 0
-- 产生要用来训练xgboost模型的user_id列表 DROP TABLE IF EXISTS t_netivs_xgb_fill_train_user_list; DROP TABLE IF EXISTS t_netivs_gbdt_fill_train_user_list; CREATE TABLE IF NOT EXISTS t_netivs_gbdt_fill_train_user_list AS SELECT user_id ,day_int FROM t_netivs_ext_power WHERE power_consumption 1 and user_id in SELECT user_id FROM t_netivs_user_missing_info WHERE power_consumption_median 2500 and missing_day_cnt 30
-- Produce the list of user_ids whose historical data is to be cleared entirely
DROP TABLE IF EXISTS t_netivs_clear_historical_data_user_list;
CREATE TABLE IF NOT EXISTS t_netivs_clear_historical_data_user_list AS
SELECT user_id
FROM t_netivs_user_missing_info
WHERE last1month_default_day_cnt=30
;
DROP TABLE IF EXISTS t_netivs_gbdt_fill_consumption_features; CREATE TABLE IF NOT EXISTS t_netivs_gbdt_fill_consumption_features AS SELECT t1.user_id ,t1.day_int ,case when t2.weekly_power_consumption_avg is null then 0 else t2.weekly_power_consumption_avg end as weekly_power_consumption_avg ,case when t2.weekly_power_consumption_median is null then 0 else t2.weekly_power_consumption_median end as weekly_power_consumption_median ,case when t3.monthly_power_consumption_avg is null then 0 else t3.monthly_power_consumption_avg end as monthly_power_consumption_avg ,case when t3.monthly_power_consumption_median is null then 0 else t3.monthly_power_consumption_median end as monthly_power_consumption_median ,case when t4.last_weekly_power_consumption_avg is null then 0 else t4.last_weekly_power_consumption_avg end as last_weekly_power_consumption_avg ,case when t4.last_weekly_power_consumption_median is null then 0 else t4.last_weekly_power_consumption_median end as last_weekly_power_consumption_median ,case when t5.last_monthly_power_consumption_avg is null then 0 else t5.last_monthly_power_consumption_avg end as last_monthly_power_consumption_avg ,case when t5.last_monthly_power_consumption_median is null then 0 else t5.last_monthly_power_consumption_median end as last_monthly_power_consumption_median FROM t_netivs_ext_power t1 LEFT OUTER JOIN SELECT user_id ,weekofyear ,avg(power_consumption) as weekly_power_consumption_avg ,median(power_consumption) as weekly_power_consumption_median FROM t_netivs_ext_power WHERE power_consumption 1 GROUP BY user_id,weekofyear ON t1.user_id = t2.user_id and t1.weekofyear = t2.weekofyear LEFT OUTER JOIN SELECT user_id ,year_month ,avg(power_consumption) as monthly_power_consumption_avg ,median(power_consumption) as monthly_power_consumption_median FROM t_netivs_ext_power WHERE power_consumption 1 GROUP BY user_id,year_month ON t1.user_id = t3.user_id and t1.year_month = t3.year_month LEFT OUTER JOIN SELECT user_id ,weekofyear 
,avg(power_consumption) as last_weekly_power_consumption_avg ,median(power_consumption) as last_weekly_power_consumption_median FROM t_netivs_ext_power WHERE power_consumption 1 GROUP BY user_id,weekofyear ON t1.user_id = t4.user_id and t1.weekofyear = t4.weekofyear+1 LEFT OUTER JOIN SELECT user_id ,year_month ,avg(power_consumption) as last_monthly_power_consumption_avg ,median(power_consumption) as last_monthly_power_consumption_median FROM t_netivs_ext_power WHERE power_consumption 1 GROUP BY user_id,year_month ON t1.user_id = t5.user_id and t1.year_month = t5.year_month+1
DROP TABLE IF EXISTS t_netivs_gbdt_fill_train_features; CREATE TABLE IF NOT EXISTS t_netivs_gbdt_fill_train_features AS SELECT t1.user_id ,t1.day_int ,t2.temperature_high ,t2.temperature_low ,t2.weather1 ,t2.weather1_level ,t2.weather1_type ,t2.weather2 ,t2.weather2_level ,t2.weather2_type ,t2.wind_direction ,t2.wind_speed ,t2.wind_speed1 ,t2.wind_speed2 ,t3.day_index ,t3.month_index ,t3.year_index ,t3.month ,t3.day ,t3.workday ,t3.weekday ,t3.holiday ,t3.special_workday ,t3.special_holiday ,t3.day1_before_special_holiday ,t3.day2_before_special_holiday ,t3.day3_before_special_holiday ,t3.day1_before_holiday ,t3.day2_before_holiday ,t3.day3_before_holiday ,t3.day1_after_special_holiday ,t3.day2_after_special_holiday ,t3.day3_after_special_holiday ,t3.day1_after_holiday ,t3.day2_after_holiday ,t3.day3_after_holiday ,t4.weekly_power_consumption_avg ,t4.weekly_power_consumption_median ,t4.monthly_power_consumption_avg ,t4.monthly_power_consumption_median ,t4.last_weekly_power_consumption_avg ,t4.last_weekly_power_consumption_median ,t4.last_monthly_power_consumption_avg ,t4.last_monthly_power_consumption_median ,t5.power_consumption FROM t_netivs_gbdt_fill_train_user_list t1 LEFT OUTER JOIN t_netivs_encode_weather t2 ON t1.day_int = t2.day_int LEFT OUTER JOIN t_netivs_date_features t3 ON t1.day_int = t3.day_int LEFT OUTER JOIN t_netivs_gbdt_fill_consumption_features t4 ON t1.user_id = t4.user_id and t1.day_int = t4.day_int LEFT OUTER JOIN t_netivs_ext_power t5 ON t1.user_id = t5.user_id and t1.day_int = t5.day_int -- 产生gbdt填充的测试集 DROP TABLE IF EXISTS t_netivs_gbdt_fill_test_features; CREATE TABLE IF NOT EXISTS t_netivs_gbdt_fill_test_features AS SELECT t1.user_id ,t1.day_int ,t2.temperature_high ,t2.temperature_low ,t2.weather1 ,t2.weather1_level ,t2.weather1_type ,t2.weather2 ,t2.weather2_level ,t2.weather2_type ,t2.wind_direction ,t2.wind_speed ,t2.wind_speed1 ,t2.wind_speed2 ,t3.day_index ,t3.month_index ,t3.year_index ,t3.month ,t3.day ,t3.workday ,t3.weekday 
,t3.holiday ,t3.special_workday ,t3.special_holiday ,t3.day1_before_special_holiday ,t3.day2_before_special_holiday ,t3.day3_before_special_holiday ,t3.day1_before_holiday ,t3.day2_before_holiday ,t3.day3_before_holiday ,t3.day1_after_special_holiday ,t3.day2_after_special_holiday ,t3.day3_after_special_holiday ,t3.day1_after_holiday ,t3.day2_after_holiday ,t3.day3_after_holiday ,t4.weekly_power_consumption_avg ,t4.weekly_power_consumption_median ,t4.monthly_power_consumption_avg ,t4.monthly_power_consumption_median ,t4.last_weekly_power_consumption_avg ,t4.last_weekly_power_consumption_median ,t4.last_monthly_power_consumption_avg ,t4.last_monthly_power_consumption_median FROM t_netivs_gbdt_fill_user_day_list t1 LEFT OUTER JOIN t_netivs_encode_weather t2 ON t1.day_int = t2.day_int LEFT OUTER JOIN t_netivs_date_features t3 ON t1.day_int = t3.day_int LEFT OUTER JOIN t_netivs_gbdt_fill_consumption_features t4 ON t1.user_id = t4.user_id and t1.day_int = t4.day_int -- 用xgb来产生填充值 DROP TABLE IF EXISTS t_netivs_xgb_fill_prediction_result; DROP OFFLINEMODEL IF EXISTS m_xgb_fill_model; -- train -name xgboost -project algo_public -Deta="0.01" ---Dobjective="reg:linear" -Dobjective="reg:linear" -DitemDelimiter="," -Dseed="0" -Dnum_round="3500" -DlabelColName="power_consumption" -DinputTableName="t_netivs_gbdt_fill_train_features" -DenableSparse="false" -Dmax_depth="8" -Dsubsample="0.4" -Dcolsample_bytree="0.6" -DmodelName="m_xgb_fill_model" -Dgamma="0" -Dlambda="50" 
-DfeatureColNames="user_id,temperature_high,temperature_low,weather1,weather1_level,weather1_type,weather2,weather2_level,weather2_type,wind_direction,day_index,month_index,year_index,month,day,workday,weekday,holiday,special_workday,special_holiday,day1_before_special_holiday,day2_before_special_holiday,day3_before_special_holiday,day1_before_holiday,day2_before_holiday,day3_before_holiday,day1_after_special_holiday,day2_after_special_holiday,day3_after_special_holiday,day1_after_holiday,day2_after_holiday,day3_after_holiday,wind_speed,wind_speed1,wind_speed2,weekly_power_consumption_avg,weekly_power_consumption_median,monthly_power_consumption_avg,monthly_power_consumption_median,last_weekly_power_consumption_avg,last_weekly_power_consumption_median,last_monthly_power_consumption_avg,last_monthly_power_consumption_median" -Dbase_score="0.11" -Dmin_child_weight="100" -DkvDelimiter=":";
-DoutputTableName="t_netivs_xgb_fill_prediction_result" -DscoreColName="prediction_score" -DkvDelimiter=":" -DfeatureColNames="user_id,temperature_high,temperature_low,weather1,weather1_level,weather1_type,weather2,weather2_level,weather2_type,wind_direction,day_index,month_index,year_index,month,day,workday,weekday,holiday,special_workday,special_holiday,day1_before_special_holiday,day2_before_special_holiday,day3_before_special_holiday,day1_before_holiday,day2_before_holiday,day3_before_holiday,day1_after_special_holiday,day2_after_special_holiday,day3_after_special_holiday,day1_after_holiday,day2_after_holiday,day3_after_holiday,wind_speed,wind_speed1,wind_speed2,weekly_power_consumption_avg,weekly_power_consumption_median,monthly_power_consumption_avg,monthly_power_consumption_median,last_weekly_power_consumption_avg,last_weekly_power_consumption_median,last_monthly_power_consumption_avg,last_monthly_power_consumption_median" -DinputTableName="t_netivs_gbdt_fill_test_features" -DenableSparse="false"; SELECT * FROM t_netivs_xgb_fill_prediction_result ORDER BY prediction_result desc limit 100; -- 产生修订后的每日用电量详单 -- t_netivs_clear_historical_data_user_list内的user_id全部清零 DROP TABLE IF EXISTS t_netivs_fixed_ext_power; CREATE TABLE IF NOT EXISTS t_netivs_fixed_ext_power AS SELECT t1.user_id ,t1.day_int --,cast(case when t2.prediction_result is not null then round(t2.prediction_result,0) when t3.user_id is not null then 1 else power_consumption end as bigint) as power_consumption ,cast(case when t2.prediction_result is not null then round(t2.prediction_result,0) else power_consumption end as bigint) as power_consumption FROM t_netivs_ext_power t1 LEFT OUTER JOIN t_netivs_gbdt_fill_prediction_result t2 ON t1.user_id = t2.user_id and t1.day_int = t2.day_int LEFT OUTER JOIN t_netivs_clear_historical_data_user_list t3 ON t1.user_id = t3.user_id -- 产生修订后的每日用电量总和 DROP TABLE IF EXISTS t_netivs_fixed_daily_sum_consumption; CREATE TABLE IF NOT EXISTS 
t_netivs_fixed_daily_sum_consumption AS SELECT t1.day_int ,t1.power_consumption ,t2.fixed_power_consumption ,t3.day_index ,t3.month_index ,t3.year_index ,t3.month ,t3.day ,t3.month_day ,t3.year_month ,t3.workday ,t3.weekofyear ,t3.day_to_lastday ,t3.weekday ,t3.holiday ,t3.special_workday ,t3.special_holiday ,t3.day1_before_special_holiday ,t3.day2_before_special_holiday ,t3.day3_before_special_holiday ,t3.day1_before_holiday ,t3.day2_before_holiday ,t3.day3_before_holiday ,t3.day1_after_special_holiday ,t3.day2_after_special_holiday ,t3.day3_after_special_holiday ,t3.day1_after_holiday ,t3.day2_after_holiday ,t3.day3_after_holiday t_netivs_daily_sum_consumption t1 LEFT OUTER JOIN SELECT day_int ,SUM(power_consumption) as fixed_power_consumption FROM t_netivs_fixed_ext_power GROUP BY day_int ON t1.day_int = t2.day_int LEFT OUTER JOIN t_netivs_date_features t3 ON t1.day_int = t3.day_int
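The per-user missing-day statistics gathered at the start of this code can be illustrated with a short Java sketch. It assumes, as the filter power_consumption=1 in the SQL suggests, that a reading of 1 marks a missing day, and that every other reading counts as present; the class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch of the per-user missing-day statistics.
final class MissingStats {
    static final long MISSING = 1L; // marker value assumed from the SQL filter power_consumption=1

    // Number of days whose reading equals the missing marker.
    static long missingDayCount(long[] daily) {
        return Arrays.stream(daily).filter(v -> v == MISSING).count();
    }

    // Median over the days that are present; 0.0 when no day is present,
    // mirroring how the SQL replaces null statistics with 0.
    static double medianOfPresent(long[] daily) {
        long[] present = Arrays.stream(daily).filter(v -> v != MISSING).sorted().toArray();
        if (present.length == 0) return 0.0;
        int mid = present.length / 2;
        return present.length % 2 == 1
                ? present[mid]
                : (present[mid - 1] + present[mid]) / 2.0;
    }

    public static void main(String[] args) {
        long[] daily = {1200, 1, 1300, 1, 1100};
        System.out.println(missingDayCount(daily)); // 2
        System.out.println(medianOfPresent(daily)); // 1200.0
    }
}
```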
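4. Model Building and Fusion
For this task, the approach we settled on was to use two models to predict the trend and the consumption level separately and then blend them, as shown in the figure below:
The feature extraction and model building code for model one is as follows: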
-- 提取每日用电总量的特征 DROP TABLE IF EXISTS t_netivs_daily_sum_features; CREATE TABLE IF NOT EXISTS t_netivs_daily_sum_features AS SELECT t1.day_int ,t1.last_month_same_day_consumption --,t1.last_year_same_day_consumption ,t2.last_month_power_consumption_avg ,t2.last_month_power_consumption_median ,t2.last_month_power_consumption_stddev ,t2.last_month_weekday1_avg ,t2.last_month_weekday1_median ,t2.last_month_weekday0_avg ,t2.last_month_weekday0_median ,t2.last_month_workday1_avg ,t2.last_month_workday1_median ,t2.last_month_workday0_avg ,t2.last_month_workday0_median ,t2.last_month_last3day_avg ,t2.last_month_last3day_median ,t2.last_month_last7day_avg ,t2.last_month_last7day_median ,t2.last_month_first3day_avg ,t2.last_month_first3day_median ,t2.last_month_first7day_avg ,t2.last_month_first7day_median ,t2.last_month_middle_avg ,t2.last_month_middle_median FROM SELECT t11.day_int ,t21.power_consumption as last_month_same_day_consumption --,t31.power_consumption as last_year_same_day_consumption FROM SELECT day_int ,day ,day_to_lastday ,case when day =15 then cast(to_char(dateadd(to_date(cast(day_int as string),yyyymmdd),-1,mm),yyyymmdd) as bigint) else cast(to_char(dateadd(lastday(dateadd(to_date(cast(day_int as string),yyyymmdd),-1,mm)),-day_to_lastday,dd),yyyymmdd) as bigint) end as last_month_same_day --,cast(to_char(dateadd(to_date(cast(day_int as string),yyyymmdd),-1,yyyy),yyyymmdd) as bigint) as last_year_same_day FROM t_netivs_date_features WHERE day_int =20150201 )t11 LEFT OUTER JOIN t_netivs_fixed_daily_sum_consumption t21 ON t11.last_month_same_day = t21.day_int --LEFT OUTER JOIN -- t_netivs_fixed_daily_sum_consumption t31 --ON t11.last_year_same_day = t31.day_int LEFT OUTER JOIN SELECT t1.day_int ,t2.last_month_power_consumption_avg ,t2.last_month_power_consumption_median ,t2.last_month_power_consumption_stddev ,t2.last_month_weekday1_avg ,t2.last_month_weekday1_median ,t2.last_month_weekday0_avg ,t2.last_month_weekday0_median ,t2.last_month_workday1_avg 
,t2.last_month_workday1_median ,t2.last_month_workday0_avg ,t2.last_month_workday0_median ,t2.last_month_last3day_avg ,t2.last_month_last3day_median ,t2.last_month_last7day_avg ,t2.last_month_last7day_median ,t2.last_month_first3day_avg ,t2.last_month_first3day_median ,t2.last_month_first7day_avg ,t2.last_month_first7day_median ,t2.last_month_middle_avg ,t2.last_month_middle_median FROM SELECT * FROM t_netivs_date_features WHERE month_index 1 LEFT OUTER JOIN SELECT month_index ,avg(fixed_power_consumption) as last_month_power_consumption_avg ,median(fixed_power_consumption) as last_month_power_consumption_median ,stddev(fixed_power_consumption) as last_month_power_consumption_stddev ,avg(case when weekday=1 then fixed_power_consumption else null end) as last_month_weekday1_avg ,median(case when weekday=1 then fixed_power_consumption else null end) as last_month_weekday1_median ,avg(case when weekday=0 then fixed_power_consumption else null end) as last_month_weekday0_avg ,median(case when weekday=0 then fixed_power_consumption else null end) as last_month_weekday0_median ,avg(case when workday=1 then fixed_power_consumption else null end) as last_month_workday1_avg ,median(case when workday=1 then fixed_power_consumption else null end) as last_month_workday1_median ,avg(case when workday=0 then fixed_power_consumption else null end) as last_month_workday0_avg ,median(case when workday=0 then fixed_power_consumption else null end) as last_month_workday0_median ,avg(case when day_to_lastday =3 then fixed_power_consumption else null end) as last_month_last3day_avg ,median(case when day_to_lastday =3 then fixed_power_consumption else null end) as last_month_last3day_median ,avg(case when day_to_lastday =7 then fixed_power_consumption else null end) as last_month_last7day_avg ,median(case when day_to_lastday =7 then fixed_power_consumption else null end) as last_month_last7day_median ,avg(case when day =3 then fixed_power_consumption else null end) as 
last_month_first3day_avg ,median(case when day =3 then fixed_power_consumption else null end) as last_month_first3day_median ,avg(case when day =7 then fixed_power_consumption else null end) as last_month_first7day_avg ,median(case when day =7 then fixed_power_consumption else null end) as last_month_first7day_median ,avg(case when day =14 and day_to_lastday =14 then fixed_power_consumption else null end) as last_month_middle_avg ,median(case when day =14 and day_to_lastday =14 then fixed_power_consumption else null end) as last_month_middle_median FROM SELECT t1.day_int ,t1.power_consumption ,t1.fixed_power_consumption ,t2.day_index ,t2.month_index ,t2.workday ,t2.day_to_lastday ,t2.day ,t2.weekday ,t2.holiday FROM t_netivs_fixed_daily_sum_consumption t1 LEFT OUTER JOIN t_netivs_date_features t2 ON t1.day_int = t2.day_int )t2_1 GROUP BY month_index ON t1.month_index = t2.month_index+1 ON t1.day_int = t2.day_int -- 合并特征 DROP TABLE IF EXISTS t_netivs_all_online_features; CREATE TABLE IF NOT EXISTS t_netivs_all_online_features AS SELECT t1.* ,t2.temperature_high ,t2.temperature_low ,t2.weather1 ,t2.weather1_level ,t2.weather1_type ,t2.weather2 ,t2.weather2_level ,t2.weather2_type ,t2.wind_direction ,t2.wind_speed ,t2.wind_speed1 ,t2.wind_speed2 ,t3.day_index ,t3.month_index ,t3.year_index ,t3.month ,t3.day ,t3.workday ,t3.weekday ,t3.holiday ,t3.special_workday ,t3.special_holiday ,t3.day1_before_special_holiday ,t3.day2_before_special_holiday ,t3.day3_before_special_holiday ,t3.day1_before_holiday ,t3.day2_before_holiday ,t3.day3_before_holiday ,t3.day1_after_special_holiday ,t3.day2_after_special_holiday ,t3.day3_after_special_holiday ,t3.day1_after_holiday ,t3.day2_after_holiday ,t3.day3_after_holiday FROM t_netivs_daily_sum_features t1 LEFT OUTER JOIN t_netivs_encode_weather t2 ON t1.day_int = t2.day_int LEFT OUTER JOIN t_netivs_date_features t3 ON t1.day_int = t3.day_int -- 产生训练集 DROP TABLE IF EXISTS t_netivs_online_train_features; CREATE TABLE IF NOT EXISTS 
t_netivs_online_train_features AS SELECT t1.* ,t2.power_consumption ,t2.fixed_power_consumption FROM SELECT * FROM t_netivs_all_online_features WHERE day_int 20161201 LEFT OUTER JOIN t_netivs_fixed_daily_sum_consumption t2 ON t1.day_int = t2.day_int
DROP TABLE IF EXISTS t_netivs_online_test_features; CREATE TABLE IF NOT EXISTS t_netivs_online_test_features AS SELECT * FROM t_netivs_all_online_features WHERE day_int =20161201 -- 用xgb来跑 DROP TABLE IF EXISTS t_netivs_online_xgb_prediction_result; DROP OFFLINEMODEL IF EXISTS m_online_xgb_model; -- train -name xgboost -project algo_public -Deta="0.01" ---Dobjective="reg:linear" -Dobjective="reg:linear" -DitemDelimiter="," -Dseed="0" -Dnum_round="3500" -DlabelColName="power_consumption" -DinputTableName="t_netivs_online_train_features" -DenableSparse="false" -Dmax_depth="8" -Dsubsample="0.4" -Dcolsample_bytree="0.6" -DmodelName="m_online_xgb_model" -Dgamma="0" -Dlambda="50" -DfeatureColNames="last_month_same_day_consumption,last_month_power_consumption_avg,last_month_power_consumption_median,last_month_power_consumption_stddev,last_month_weekday1_avg,last_month_weekday1_median,last_month_weekday0_avg,last_month_weekday0_median,last_month_workday1_avg,last_month_workday1_median,last_month_workday0_avg,last_month_workday0_median,last_month_last3day_avg,last_month_last3day_median,last_month_last7day_avg,last_month_last7day_median,last_month_first3day_avg,last_month_first3day_median,last_month_first7day_avg,last_month_first7day_median,last_month_middle_avg,last_month_middle_median,temperature_high,temperature_low,weather1,weather1_level,weather1_type,weather2,weather2_level,weather2_type,wind_direction,wind_speed,wind_speed1,wind_speed2,month,day,workday,weekday,holiday,special_workday,special_holiday,day1_before_special_holiday,day2_before_special_holiday,day3_before_special_holiday,day1_before_holiday,day2_before_holiday,day3_before_holiday,day1_after_special_holiday,day2_after_special_holiday,day3_after_special_holiday,day1_after_holiday,day2_after_holiday,day3_after_holiday" -Dbase_score="0.11" -Dmin_child_weight="100" -DkvDelimiter=":";
    -DoutputTableName="t_netivs_online_xgb_prediction_result"
    -DscoreColName="prediction_score"
    -DkvDelimiter=":"
    -DfeatureColNames="last_month_same_day_consumption,last_month_power_consumption_avg,last_month_power_consumption_median,last_month_power_consumption_stddev,last_month_weekday1_avg,last_month_weekday1_median,last_month_weekday0_avg,last_month_weekday0_median,last_month_workday1_avg,last_month_workday1_median,last_month_workday0_avg,last_month_workday0_median,last_month_last3day_avg,last_month_last3day_median,last_month_last7day_avg,last_month_last7day_median,last_month_first3day_avg,last_month_first3day_median,last_month_first7day_avg,last_month_first7day_median,last_month_middle_avg,last_month_middle_median,temperature_high,temperature_low,weather1,weather1_level,weather1_type,weather2,weather2_level,weather2_type,wind_direction,wind_speed,wind_speed1,wind_speed2,month,day,workday,weekday,holiday,special_workday,special_holiday,day1_before_special_holiday,day2_before_special_holiday,day3_before_special_holiday,day1_before_holiday,day2_before_holiday,day3_before_holiday,day1_after_special_holiday,day2_after_special_holiday,day3_after_special_holiday,day1_after_holiday,day2_after_holiday,day3_after_holiday"
    -DinputTableName="t_netivs_online_test_features"
    -DenableSparse="false";

select * from t_netivs_online_xgb_prediction_result ORDER BY day_int limit 100;
select avg(prediction_result) from t_netivs_online_xgb_prediction_result;
The feature extraction, model construction, and model fusion code for Model 2 is as follows:
-- Build features from same-period historical information
DROP TABLE IF EXISTS t_netivs_same_period_feature;
CREATE TABLE IF NOT EXISTS t_netivs_same_period_feature AS
SELECT t1.day_int
    ,t1.day
    ,t1.day_to_lastday
    ,t2.power_consumption as last_month_same_day_consumption
    ,t3.power_consumption as last_year_same_day_consumption
    ,t4.last_month_power_consumption_median
    ,t4.last_month_weekday1_median
    ,t4.last_month_weekday0_median
    ,t4.last_month_workday1_median
    ,t4.last_month_last3day_median
    ,t4.last_month_last7day_median
    ,t4.last_month_first3day_median
    ,t4.last_month_first7day_median
    ,t4.last_month_middle_median
    ,t4.last_month_power_consumption_avg
    ,t4.last_month_weekday1_avg
    ,t4.last_month_weekday0_avg
    ,t4.last_month_workday1_avg
    ,t4.last_month_last3day_avg
    ,t4.last_month_last7day_avg
    ,t4.last_month_first3day_avg
    ,t4.last_month_first7day_avg
    ,t4.last_month_middle_avg
    ,t5.power_consumption
    ,t5.fixed_power_consumption
    ,t6.temperature_high
    ,t6.temperature_low
    ,t6.weather1
    ,t6.weather1_level
    ,t6.weather1_type
    ,t6.weather2
    ,t6.weather2_level
    ,t6.weather2_type
    ,t6.wind_direction
    ,t6.wind_speed
    ,t6.wind_speed1
    ,t6.wind_speed2
    ,day_to_lastday
    ,case when day <= 15
        then cast(to_char(dateadd(to_date(cast(day_int as string),'yyyymmdd'),-1,'mm'),'yyyymmdd') as bigint)
        else cast(to_char(dateadd(lastday(dateadd(to_date(cast(day_int as string),'yyyymmdd'),-1,'mm')),-day_to_lastday,'dd'),'yyyymmdd') as bigint)
     end as last_month_same_day
    ,cast(to_char(dateadd(to_date(cast(day_int as string),'yyyymmdd'),-1,'yyyy'),'yyyymmdd') as bigint) as last_year_same_day
    FROM t_netivs_date_features
    WHERE day_int >= 20160101
) t1
LEFT OUTER JOIN t_netivs_fixed_daily_sum_consumption t2
ON t1.last_month_same_day = t2.day_int
LEFT OUTER JOIN t_netivs_fixed_daily_sum_consumption t3
ON t1.last_year_same_day = t3.day_int
LEFT OUTER JOIN t_netivs_dail_sum_features t4
ON t1.day_int = t4.day_int
LEFT OUTER JOIN t_netivs_fixed_daily_sum_consumption t5
ON t1.day_int = t5.day_int
LEFT OUTER JOIN t_netivs_encode_weather t6
ON t1.day_int = t6.day_int
LEFT OUTER JOIN t_netivs_date_features t7
ON t1.day_int = t7.day_int
;

-- Generate the training set
DROP TABLE IF EXISTS t_netivs_online_historical_train_features;
CREATE TABLE IF NOT EXISTS t_netivs_online_historical_train_features AS
SELECT * FROM t_netivs_same_period_feature WHERE day_int < 20161201
;

-- Generate the test set
DROP TABLE IF EXISTS t_netivs_online_historical_test_features;
CREATE TABLE IF NOT EXISTS t_netivs_online_historical_test_features AS
SELECT * FROM t_netivs_same_period_feature WHERE day_int >= 20161201
;
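The last_month_same_day expression above aligns each date with a "same period" day in the previous month. The sketch below restates that rule in plain Python so it can be checked locally; it is not the ODPS code. It rests on two assumptions: the comparison operator lost from the SQL is `day <= 15`, and `day_to_lastday` (defined outside this listing) is the number of days from the date to the last day of its month. The function name is hypothetical.

```python
import calendar
from datetime import date


def last_month_same_day(d: date) -> date:
    """Map a date to its 'same period' day in the previous month.

    Assumed rule: in the first half of the month (day <= 15) keep the same
    day-of-month; otherwise keep the same distance to the end of the month.
    """
    # Year and month of the previous calendar month.
    if d.month > 1:
        prev_year, prev_month = d.year, d.month - 1
    else:
        prev_year, prev_month = d.year - 1, 12

    if d.day <= 15:
        # Days 1..15 exist in every month, so this is always valid.
        return date(prev_year, prev_month, d.day)

    # Assumed meaning of day_to_lastday: days until the month's last day.
    day_to_lastday = calendar.monthrange(d.year, d.month)[1] - d.day
    prev_month_len = calendar.monthrange(prev_year, prev_month)[1]
    # day_to_lastday is at most 15 here, so the result stays inside the month.
    return date(prev_year, prev_month, prev_month_len - day_to_lastday)
```

Under these assumptions, 2016-12-10 maps to 2016-11-10, while 2016-12-31 maps to 2016-11-30, so month-end days are compared with month-end days even when the two months differ in length.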
DROP TABLE IF EXISTS t_netivs_online_historical_xgb_prediction_result;
DROP OFFLINEMODEL IF EXISTS m_online_historical_xgb_model;

-- train
PAI -name xgboost -project algo_public
    -Deta="0.01"
    ---Dobjective="reg:linear"
    -Dobjective="reg:linear"
    -DitemDelimiter=","
    -Dseed="0"
    -Dnum_round="4000"
    -DlabelColName="power_consumption"
    -DinputTableName="t_netivs_online_historical_train_features"
    -DenableSparse="false"
    -Dmax_depth="8"
    -Dsubsample="0.8"
    -Dcolsample_bytree="0.8"
    -DmodelName="m_online_historical_xgb_model"
    -Dgamma="0"
    -Dlambda="50"
    -DfeatureColNames="day,day_to_lastday,last_month_same_day_consumption,last_year_same_day_consumption,last_month_power_consumption_median,last_month_weekday1_median,last_month_weekday0_median,last_month_workday1_median,last_month_last3day_median,last_month_last7day_median,last_month_first3day_median,last_month_first7day_median,last_month_middle_median,last_month_power_consumption_avg,last_month_weekday1_avg,last_month_weekday0_avg,last_month_workday1_avg,last_month_last3day_avg,last_month_last7day_avg,last_month_first3day_avg,last_month_first7day_avg,last_month_middle_avg,power_consumption,fixed_power_consumption,temperature_high,temperature_low,weather1,weather1_level,weather1_type,weather2,weather2_level,weather2_type,wind_direction,wind_speed,wind_speed1,wind_speed2,month,workday,weekday,holiday,special_workday,special_holiday,day1_before_special_holiday,day2_before_special_holiday,day3_before_special_holiday,day1_before_holiday,day2_before_holiday,day3_before_holiday,day1_after_special_holiday,day2_after_special_holiday,day3_after_special_holiday,day1_after_holiday,day2_after_holiday,day3_after_holiday"
    ---Dbase_score="0.11"
    -Dmin_child_weight="50"
    -DkvDelimiter=":";
    -DoutputTableName="t_netivs_online_historical_xgb_prediction_result"
    -DscoreColName="prediction_score"
    -DkvDelimiter=":"
    -DfeatureColNames="day,day_to_lastday,last_month_same_day_consumption,last_year_same_day_consumption,last_month_power_consumption_median,last_month_weekday1_median,last_month_weekday0_median,last_month_workday1_median,last_month_last3day_median,last_month_last7day_median,last_month_first3day_median,last_month_first7day_median,last_month_middle_median,last_month_power_consumption_avg,last_month_weekday1_avg,last_month_weekday0_avg,last_month_workday1_avg,last_month_last3day_avg,last_month_last7day_avg,last_month_first3day_avg,last_month_first7day_avg,last_month_middle_avg,power_consumption,fixed_power_consumption,temperature_high,temperature_low,weather1,weather1_level,weather1_type,weather2,weather2_level,weather2_type,wind_direction,wind_speed,wind_speed1,wind_speed2,month,workday,weekday,holiday,special_workday,special_holiday,day1_before_special_holiday,day2_before_special_holiday,day3_before_special_holiday,day1_before_holiday,day2_before_holiday,day3_before_holiday,day1_after_special_holiday,day2_after_special_holiday,day3_after_special_holiday,day1_after_holiday,day2_after_holiday,day3_after_holiday"
    -DinputTableName="t_netivs_online_historical_test_features"
    -DenableSparse="false";

select * from t_netivs_online_historical_xgb_prediction_result ORDER BY day_int limit 100;

-- Use the unadjusted power_consumption
DROP TABLE IF EXISTS t_netivs_online_historical_fixed_xgb_prediction_result;
DROP OFFLINEMODEL IF EXISTS m_online_historical_fixed_xgb_model;

-- train
PAI -name xgboost -project algo_public
    -Deta="0.01"
    ---Dobjective="reg:linear"
    -Dobjective="reg:linear"
    -DitemDelimiter=","
    -Dseed="0"
    -Dnum_round="4000"
    -DlabelColName="fixed_power_consumption"
    -DinputTableName="t_netivs_online_historical_train_features"
    -DenableSparse="false"
    -Dmax_depth="8"
    -Dsubsample="0.8"
    -Dcolsample_bytree="0.8"
    -DmodelName="m_online_historical_fixed_xgb_model"
    -Dgamma="0"
    -Dlambda="50"
    -DfeatureColNames="day,day_to_lastday,last_month_same_day_consumption,last_year_same_day_consumption,last_month_power_consumption_median,last_month_weekday1_median,last_month_weekday0_median,last_month_workday1_median,last_month_last3day_median,last_month_last7day_median,last_month_first3day_median,last_month_first7day_median,last_month_middle_median,last_month_power_consumption_avg,last_month_weekday1_avg,last_month_weekday0_avg,last_month_workday1_avg,last_month_last3day_avg,last_month_last7day_avg,last_month_first3day_avg,last_month_first7day_avg,last_month_middle_avg,power_consumption,fixed_power_consumption,temperature_high,temperature_low,weather1,weather1_level,weather1_type,weather2,weather2_level,weather2_type,wind_direction,wind_speed,wind_speed1,wind_speed2,month,workday,weekday,holiday,special_workday,special_holiday,day1_before_special_holiday,day2_before_special_holiday,day3_before_special_holiday,day1_before_holiday,day2_before_holiday,day3_before_holiday,day1_after_special_holiday,day2_after_special_holiday,day3_after_special_holiday,day1_after_holiday,day2_after_holiday,day3_after_holiday"
    ---Dbase_score="0.11"
    -Dmin_child_weight="50"
    -DkvDelimiter=":";
    -Dlifecycle="28"
    -DoutputTableName="t_netivs_online_historical_fixed_xgb_prediction_result"
    -DscoreColName="prediction_score"
    -DkvDelimiter=":"
    -DfeatureColNames="day,day_to_lastday,last_month_same_day_consumption,last_year_same_day_consumption,last_month_power_consumption_median,last_month_weekday1_median,last_month_weekday0_median,last_month_workday1_median,last_month_last3day_median,last_month_last7day_median,last_month_first3day_median,last_month_first7day_median,last_month_middle_median,last_month_power_consumption_avg,last_month_weekday1_avg,last_month_weekday0_avg,last_month_workday1_avg,last_month_last3day_avg,last_month_last7day_avg,last_month_first3day_avg,last_month_first7day_avg,last_month_middle_avg,power_consumption,fixed_power_consumption,temperature_high,temperature_low,weather1,weather1_level,weather1_type,weather2,weather2_level,weather2_type,wind_direction,wind_speed,wind_speed1,wind_speed2,month,workday,weekday,holiday,special_workday,special_holiday,day1_before_special_holiday,day2_before_special_holiday,day3_before_special_holiday,day1_before_holiday,day2_before_holiday,day3_before_holiday,day1_after_special_holiday,day2_after_special_holiday,day3_after_special_holiday,day1_after_holiday,day2_after_holiday,day3_after_holiday"
    -DinputTableName="t_netivs_online_historical_test_features"
    -DenableSparse="false";

select * from t_netivs_online_historical_fixed_xgb_prediction_result ORDER BY day_int limit 100;

DROP TABLE IF EXISTS t_netivs_xgb_ensemble_result;
CREATE TABLE IF NOT EXISTS t_netivs_xgb_ensemble_result AS
SELECT t1.day_int
    ,t1.prediction_result + t2.prediction_result*0.05 as prediction_result
FROM t_netivs_online_xgb_prediction_result t1
LEFT OUTER JOIN t_netivs_online_historical_fixed_xgb_prediction_result t2
ON t1.day_int = t2.day_int
ORDER BY day_int
limit 61
;

SELECT avg(prediction_result) FROM t_netivs_xgb_ensemble_result;
SELECT * FROM t_netivs_xgb_ensemble_result ORDER BY day_int limit 100;

INSERT OVERWRITE TABLE tianchi_power_answer
SELECT concat(to_char(datepart(ds,'yyyy')),'/',to_char(datepart(ds,'mm')),'/',to_char(datepart(ds,'dd'))) as predict_date
    ,cast(round(power_consumption,0) as bigint) as power_consumption
FROM (
    SELECT to_date(cast(day_int as string),'yyyymmdd') as ds
        ,prediction_result as power_consumption
    FROM t_netivs_xgb_ensemble_result
) t
;
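The last two statements blend the two models' predictions by day and then convert day_int back to the competition's unpadded date format (for example 2016/12/1). The following Python sketch mirrors that logic for local checking only. The function names are hypothetical, the 0.05 weight is taken from the SQL as shown, and rounding half up is an assumption about how round(x, 0) treats positive values on the platform.

```python
import math


def blend(pred_a, pred_b, weight_b=0.05):
    """Combine two {day_int: prediction} dicts like the ensemble SQL.

    Mirrors LEFT OUTER JOIN semantics: every day of pred_a is kept, and a
    day missing from pred_b yields None (NULL in SQL).
    """
    result = {}
    for day_int, a in pred_a.items():
        b = pred_b.get(day_int)
        result[day_int] = None if b is None else a + b * weight_b
    return result


def to_submission_date(day_int):
    """Turn 20161201 into the unpadded form '2016/12/1'."""
    year, rest = divmod(day_int, 10000)
    month, day = divmod(rest, 100)
    return f"{year}/{month}/{day}"


def round_half_up(x):
    """Round a non-negative value to the nearest integer, halves going up."""
    return int(math.floor(x + 0.5))


def to_submission_rows(blended):
    """Produce (predict_date, power_consumption) rows ordered by day."""
    return [
        (to_submission_date(day_int), round_half_up(value))
        for day_int, value in sorted(blended.items())
        if value is not None
    ]
```

Here `to_submission_date(20161201)` returns `"2016/12/1"`, which matches the unpadded format of the original record_date column, so the answer table uses the same date convention as the input data.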
5. Summary and Outlook
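Using the Power AI competition hosted on Alibaba Cloud's Tianchi big data platform (https://tianchi.aliyun.com/competition/introduction.htm?spm=5176.100066.0.0.3f6e7d83UaNT4W&raceId=231602) as an example, this article has walked through the complete workflow for power system load forecasting on Alibaba Cloud MaxCompute and given all of the core code. In front of code there are no secrets: reading through it makes MaxCompute's powerful features and flexible open interfaces easy to see. In fact, because of the competition platform's restrictions, many Alibaba Cloud features that can assist development were not shown here, such as DataV for visualization and the Quick BI business intelligence engine. Combining load forecasting with these products makes it straightforward to build power system applications with attractive interfaces and strong functionality.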