zl程序教程

您现在的位置是:首页 >  其他

当前栏目

57 Hive案例(数据ETL)

案例数据 hive ETL 57
2023-09-11 14:15:40 时间

需求

对web点击流日志基础数据表进行etl(按照仓库模型设计)

按各时间维度统计来源域名top10

已有数据表 “t_orgin_weblog” :

col_namedata_typecomment
validstring
remote_addrstring
remote_userstring
time_localstring
requeststring
statusstring
body_bytes_sentstring
http_refererstring
http_user_agentstring

数据示例

| true|1.162.203.134| - | 18/Sep/2013:13:47:35| /images/my.jpg                        | 200| 19939 | "http://www.angularjs.cn/A0d9"                      | "Mozilla/5.0 (Windows   |

| true|1.202.186.37 | - | 18/Sep/2013:15:39:11| /wp-content/uploads/2013/08/windjs.png| 200| 34613 | "http://cnodejs.org/topic/521a30d4bee8d3cb1272ac0f" | "Mozilla/5.0 (Macintosh;|

实现步骤

1、对原始数据进行抽取转换
–将来访url分离出host path query query id

drop table if exists t_etl_referurl;
create table t_etl_referurl as
SELECT a.*,b.*
FROM t_orgin_weblog a LATERAL VIEW parse_url_tuple(regexp_replace(http_referer, "\"", ""), 'HOST', 'PATH','QUERY', 'QUERY:id') b as host, path, query, query_id 

2、从前述步骤进一步分离出日期时间形成ETL明细表“t_etl_detail” day tm

drop table if exists t_etl_detail;
create table t_etl_detail as 
select b.*,substring(time_local,0,11) as daystr,
substring(time_local,13) as tmstr,
substring(time_local,4,3) as month,
substring(time_local,0,2) as day,
substring(time_local,13,2) as hour
from t_etl_referurl b;

3、对etl数据进行分区(包含所有数据的结构化信息)

drop table t_etl_detail_prt;
create table t_etl_detail_prt(
valid                   string,
remote_addr            string,
remote_user            string,
time_local               string,
request                 string,
status                  string,
body_bytes_sent         string,
http_referer             string,
http_user_agent         string,
host                   string,
path                   string,
query                  string,
query_id               string,
daystr                 string,
tmstr                  string,
month                  string,
day                    string,
hour                   string) 
partitioned by (mm string,dd string);

导入数据

insert into table t_etl_detail_prt partition(mm='Sep',dd='18')
select * from t_etl_detail where daystr='18/Sep/2013';

insert into table t_etl_detail_prt partition(mm='Sep',dd='19')
select * from t_etl_detail where daystr='19/Sep/2013';

分个时间维度统计各referer_host的访问次数并排序

create table t_refer_host_visit_top_tmp as
select referer_host,count(*) as counts,mm,dd,hh from t_display_referer_counts group by hh,dd,mm,referer_host order by hh asc,dd asc,mm asc,counts desc;

4、来源访问次数topn各时间维度URL
取各时间维度的referer_host访问次数topn

select * from (select referer_host,counts,concat(hh,dd),row_number() over (partition by concat(hh,dd) order by concat(hh,dd) asc) as od from t_refer_host_visit_top_tmp) t where od<=3;