Hive Essential (6): Data Aggregation and Sampling

1. Basic aggregation

Data aggregation is the process of gathering and expressing data in a summary form to get more information about particular groups based on specific conditions. HQL offers several built-in aggregate functions, such as max(...), min(...), and avg(...). It also supports advanced aggregation using keywords such as GROUPING SETS, ROLLUP, and CUBE, and different types of window functions.

The basic built-in aggregate functions are usually used with the GROUP BY clause. If no GROUP BY clause is specified, the functions aggregate over all rows of the table by default. Besides aggregate functions, all other columns that are selected must also be included in the GROUP BY clause. The following are a few examples involving the built-in aggregate functions:

--1. Aggregation without GROUP BY columns
SELECT
count(*) as rowcnt1,
count(1) as rowcnt2 -- same as count(*)
FROM employee;

--2. Aggregation with GROUP BY columns
SELECT
gender_age.gender, count(*) as row_cnt
FROM employee
GROUP BY gender_age.gender;

-- Selecting a column that is not a GROUP BY column causes an error
SELECT
name, gender_age.gender, count(*) as row_cnt
FROM employee GROUP BY gender_age.gender;

Error: Error while compiling statement: FAILED: SemanticException [Error 10025]: Line 2:1 Expression not in GROUP BY key 'name' (state=42000,code=10025)

--NOTE: If we have to select columns that are not GROUP BY columns, one way is to use window functions
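
As a sketch of that workaround (using the same employee table), a window function can carry the group-level count alongside columns that are not grouped:

SELECT
name, gender_age.gender,
count(*) OVER (PARTITION BY gender_age.gender) as row_cnt
FROM employee;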

An aggregate function can be used with other aggregate functions in the same SELECT statement. It can also be used with other functions, such as conditional functions, in a nested way. However, nested aggregate functions are not supported. See the following examples for more details:

--1. Multiple aggregate functions in the same SELECT statement
SELECT
gender_age.gender, avg(gender_age.age) as avg_age, count(*) as row_cnt
FROM employee 
GROUP BY gender_age.gender;

--2. Aggregate functions can also be used with CASE WHEN THEN ELSE END, coalesce(...), or if(...)
SELECT
sum(CASE WHEN gender_age.gender = 'Male' THEN gender_age.age ELSE 0 END)/
count(CASE WHEN gender_age.gender = 'Male' THEN 1 ELSE NULL END) as male_age_avg
FROM employee;

SELECT
sum(coalesce(gender_age.age,0)) as age_sum,
sum(if(gender_age.gender = 'Female',gender_age.age,0)) as  female_age_sum
FROM employee;

--3. GROUP BY can also apply to expressions
SELECT
if(name = 'Will', 1, 0) as name_group,
count(name) as name_cnt
FROM employee
GROUP BY if(name = 'Will', 1, 0);

--4. Verify that nested aggregate functions are not allowed
SELECT avg(count(*)) as row_cnt FROM employee;
Error: Error while compiling statement: FAILED: SemanticException [Error 10128]: Line 1:11 Not yet supported place for UDAF 'count' (state=42000,code=10128)
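
If a nested aggregation is needed, a subquery can compute the inner aggregate first. A minimal sketch against the same employee table:

-- Workaround: aggregate in a subquery, then aggregate its result
SELECT avg(cnt) as avg_cnt
FROM (
  SELECT count(*) as cnt FROM employee GROUP BY gender_age.gender
) t;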

--5. Aggregate functions such as max(...) or min(...) accept NULL and return NULL.
--   However, sum(...) and avg(...) cannot accept a NULL literal.
--   count(null) returns 0.
SELECT max(null), min(null), count(null);
+------+------+-----+
| _c0  | _c1  | _c2 |
+------+------+-----+
| NULL | NULL | 0   |
+------+------+-----+

SELECT sum(null), avg(null);
Error: Error while compiling statement: FAILED: UDFArgumentTypeException Only numeric or string type arguments are accepted but void is passed. (state=42000,code=40000)

In addition, we may encounter a special behavior when aggregating an expression that spans multiple columns: if any column in the expression is NULL for a row, the whole expression becomes NULL and that row is ignored by the aggregation. To avoid this, we can use coalesce(...) to assign a default value when the column value is NULL. See the following example:

-- Create a table t for testing
CREATE TABLE t (val1 int, val2 int);

INSERT INTO TABLE t VALUES (1, 2),(null,2),(2,3);

-- Check the rows in the table created
SELECT * FROM t;
+---------+---------+
| t.val1  | t.val2  |
+---------+---------+
| 1       | 2       |
| NULL    | 2       |
| 2       | 3       |
+---------+---------+

-- The 2nd row (NULL, 2) is ignored when doing sum(val1 + val2).
-- Compare the results of the two queries below to see the difference.
SELECT sum(val1), sum(val1 + val2) FROM t;
+------+------+
| _c0  | _c1  |
+------+------+
| 3    | 8    |
+------+------+

SELECT sum(coalesce(val1,0)), sum(coalesce(val1,0) + val2) FROM t;
+------+------+
| _c0  | _c1  |
+------+------+
| 3    | 10   |
+------+------+

 

Aggregate functions can also be used with the DISTINCT keyword to aggregate on unique values:

SELECT
count(DISTINCT gender_age.gender) as gender_uni_cnt,
count(DISTINCT name) as name_uni_cnt
FROM employee;
+-----------------+---------------+
| gender_uni_cnt  | name_uni_cnt  |
+-----------------+---------------+
| 2               | 5             |
+-----------------+---------------+

 

When COUNT and DISTINCT are used together, Hive ignores the setting for the number of reducers used (such as mapred.reduce.tasks = 20) and uses only one reducer. In this case, the single reducer becomes the bottleneck when processing large volumes of data. The workaround is to use a subquery as follows:

-- May trigger single reducer during the whole processing
SELECT count(distinct gender_age.gender) as gender_uni_cnt FROM employee;

-- Use subquery to select unique value before aggregations
SELECT
count(*) as gender_uni_cnt
FROM 
(
  SELECT DISTINCT gender_age.gender FROM employee
) a;

In this case, the first stage of the query, which implements DISTINCT, can use more than one reducer. In the second stage, the mapper produces less output, just for the COUNT purpose, since the data is already unique after DISTINCT is applied. As a result, the reducer will not be overloaded.
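
With this workaround, the reducer setting mentioned above can take effect for the first stage. A sketch, using the property name quoted earlier:

-- Allow up to 20 reducers for the DISTINCT stage
SET mapred.reduce.tasks = 20;
SELECT count(*) as gender_uni_cnt
FROM (SELECT DISTINCT gender_age.gender FROM employee) a;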

Sometimes, we may need to find the max. or min. value of particular columns together with other columns, for example, to answer this question: who are the oldest male and female employees in the employee table, and how old are they? To achieve this, we can use max/min on a struct as follows, instead of using subqueries or window functions:

SELECT gender_age.gender,
max(struct(gender_age.age, name)).col1 as age,
max(struct(gender_age.age, name)).col2 as name
FROM employee
GROUP BY gender_age.gender;
+-------------------+-----+------+
| gender_age.gender | age | name |
+-------------------+-----+------+
| Female            | 57  | Lucy |
| Male              | 35  | Will |
+-------------------+-----+------+

Although it still needs the GROUP BY clause, this query is more efficient than a regular subquery, as it triggers only one job.
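
The same pattern works in the other direction. Since structs compare field by field starting with the first one, min(struct(...)) would return the youngest employee per gender; a sketch against the same employee table:

SELECT gender_age.gender,
min(struct(gender_age.age, name)).col1 as age,
min(struct(gender_age.age, name)).col2 as name
FROM employee
GROUP BY gender_age.gender;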

The hive.map.aggr property controls aggregation in the map task. The default value for this setting is true, so Hive does first-level aggregation directly in the map task for better performance, at the cost of more memory. Turn it off if you run out of memory in the map phase.
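
For reference, a minimal sketch of toggling this property in a session:

SET hive.map.aggr = true;  -- default: partial aggregation in the map task
SET hive.map.aggr = false; -- disable if map tasks run out of memory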

 

2. Enhanced aggregation

2.1 Grouping sets

GROUPING SETS implements multiple advanced GROUP BY operations against the same set of data. Actually, GROUPING SETS is a shorthand way of connecting several GROUP BY result sets with UNION ALL. The GROUPING SETS keyword completes all processes in a single stage of the job, which is more efficient. A blank set () in the GROUPING SETS clause calculates the overall aggregation. For better understanding, we can say that the outer level of parentheses in GROUPING SETS defines the result sets to be combined with UNION ALL, while each inner level defines the GROUP BY columns used in one of those result sets. The following are a few examples showing the equivalences:

--1. Grouping set with one element of column pairs:
SELECT
name, start_date, count(sin_number) as sin_cnt
FROM employee_hr
GROUP BY name, start_date
GROUPING SETS((name, start_date));
--||-- equals to
SELECT
name, start_date, count(sin_number) as sin_cnt
FROM employee_hr
GROUP BY name, start_date;
+---------+------------+---------+
| name    | start_date | sin_cnt |
+---------+------------+---------+
| Lucy    | 2010-01-03 | 1       |
| Michael | 2014-01-29 | 1       |
| Steven  | 2012-11-03 | 1       |
| Will    | 2013-10-02 | 1       |
+---------+------------+---------+

--2. Grouping set with two elements:
SELECT
name, start_date, count(sin_number) as sin_cnt
FROM employee_hr
GROUP BY name, start_date
GROUPING SETS(name, start_date);
--||-- equals to
SELECT
name, null as start_date, count(sin_number) as sin_cnt
FROM employee_hr
GROUP BY name
UNION ALL
SELECT
null as name, start_date, count(sin_number) as sin_cnt
FROM employee_hr
GROUP BY start_date;
+---------+------------+---------+
| name    | start_date | sin_cnt |
+---------+------------+---------+
| NULL    | 2010-01-03 | 1       |
| NULL    | 2012-11-03 | 1       |
| NULL    | 2013-10-02 | 1       |
| NULL    | 2014-01-29 | 1       |
| Lucy    | NULL       | 1       |
| Michael | NULL       | 1       |
| Steven  | NULL       | 1       |
| Will    | NULL       | 1       |
+---------+------------+---------+

--3. Grouping set with two elements, a column pair, and a column:
SELECT
name, start_date, count(sin_number) as sin_cnt
FROM employee_hr
GROUP BY name, start_date
GROUPING SETS((name, start_date), name);
--||-- equals to
SELECT
name, start_date, count(sin_number) as sin_cnt
FROM employee_hr
GROUP BY name, start_date
UNION ALL
SELECT
name, null as start_date, count(sin_number) as sin_cnt
FROM employee_hr
GROUP BY name;
+---------+------------+---------+
| name    | start_date | sin_cnt |
+---------+------------+---------+
| Lucy    | NULL       | 1       |
| Lucy    | 2010-01-03 | 1       |
| Michael | NULL       | 1       |
| Michael | 2014-01-29 | 1       |
| Steven  | NULL       | 1       |
| Steven  | 2012-11-03 | 1       |
| Will    | NULL       | 1       |
| Will    | 2013-10-02 | 1       |
+---------+------------+---------+

--4. Grouping set with four elements, including all combinations of columns:
SELECT
name, start_date, count(sin_number) as sin_cnt
FROM employee_hr
GROUP BY name, start_date
GROUPING SETS((name, start_date), name, start_date, ());
--||-- equals to
SELECT
name, start_date, count(sin_number) as sin_cnt
FROM employee_hr
GROUP BY name, start_date
UNION ALL
SELECT
name, null as start_date, count(sin_number) as sin_cnt
FROM employee_hr
GROUP BY name
UNION ALL
SELECT
null as name, start_date, count(sin_number) as sin_cnt
FROM employee_hr
GROUP BY start_date
UNION ALL
SELECT
null as name, null as start_date, count(sin_number) as sin_cnt
FROM employee_hr;
+---------+------------+---------+
| name    | start_date | sin_cnt |
+---------+------------+---------+
| NULL    | NULL       | 4       |
| NULL    | 2010-01-03 | 1       |
| NULL    | 2012-11-03 | 1       |
| NULL    | 2013-10-02 | 1       |
| NULL    | 2014-01-29 | 1       |
| Lucy    | NULL       | 1       |
| Lucy    | 2010-01-03 | 1       |
| Michael | NULL       | 1       |
| Michael | 2014-01-29 | 1       |
| Steven  | NULL       | 1       |
| Steven  | 2012-11-03 | 1       |
| Will    | NULL       | 1       |
| Will    | 2013-10-02 | 1       |
+---------+------------+---------+

2.2 Rollup and Cube

The ROLLUP statement enables a SELECT statement to calculate multiple levels of aggregations across a specified group of dimensions. The ROLLUP statement is a simple extension of the GROUP BY clause with high efficiency and minimal overhead for a query. Compared to GROUPING SETS , which creates specified levels of aggregations, ROLLUP creates n+1 levels of aggregations, where n is the number of grouping columns. First, it calculates the standard aggregate values specified in the GROUP BY clause. Then, it creates higher-level subtotals, moving from right to left through the list of combinations of grouping columns. For example, GROUP BY a,b,c WITH ROLLUP is equivalent to GROUP BY a,b,c GROUPING SETS ((a,b,c),(a,b),(a),()) .
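
Restating that equivalence in runnable form against the employee_hr table used above (a sketch):

SELECT
name, start_date, count(sin_number) as sin_cnt
FROM employee_hr
GROUP BY name, start_date WITH ROLLUP;
--||-- equals to
SELECT
name, start_date, count(sin_number) as sin_cnt
FROM employee_hr
GROUP BY name, start_date
GROUPING SETS((name, start_date), (name), ());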

The CUBE statement takes a specified set of grouping columns and creates aggregations for all of their possible combinations. If n columns are specified for CUBE, there will be 2^n combinations of aggregations returned. For example, GROUP BY a,b,c WITH CUBE is equivalent to GROUP BY a,b,c GROUPING SETS ((a,b,c),(a,b),(b,c),(a,c),(a),(b),(c),()).

The GROUPING__ID function works as an extension to distinguish aggregation levels from each other. It returns the decimal equivalent of a bit vector built from the columns specified after GROUP BY. Each bit is 1 when the corresponding column is aggregated away in that row and 0 when it is present, with the first GROUP BY column (name here) as the most significant bit. The grouping(...) function reports the same information for a single column in the GROUP BY clause, returning the 1 or 0 directly. In the following example, the first row of the result set is the overall aggregation, where neither column is present, so GROUPING__ID is 3 (binary 11). Compare this example with the last example in the GROUPING SETS section for a better understanding of GROUPING__ID and grouping(...):

SELECT
name, start_date, count(employee_id) as emp_id_cnt,
GROUPING__ID as gid,
grouping(name) as gp_name,
grouping(start_date) as gp_sd
FROM employee_hr
GROUP BY name, start_date
WITH CUBE ORDER BY name, start_date;
+---------+------------+------------+-----+---------+-------+
| name    | start_date | emp_id_cnt | gid | gp_name | gp_sd |
+---------+------------+------------+-----+---------+-------+
| NULL    | NULL       | 4          | 3   | 1       | 1     |
| NULL    | 2010-01-03 | 1          | 2   | 1       | 0     |
| NULL    | 2012-11-03 | 1          | 2   | 1       | 0     |
| NULL    | 2013-10-02 | 1          | 2   | 1       | 0     |
| NULL    | 2014-01-29 | 1          | 2   | 1       | 0     |
| Lucy    | NULL       | 1          | 1   | 0       | 1     |
| Lucy    | 2010-01-03 | 1          | 0   | 0       | 0     |
| Michael | NULL       | 1          | 1   | 0       | 1     |
| Michael | 2014-01-29 | 1          | 0   | 0       | 0     |
| Steven  | NULL       | 1          | 1   | 0       | 1     |
| Steven  | 2012-11-03 | 1          | 0   | 0       | 0     |
| Will    | NULL       | 1          | 1   | 0       | 1     |
| Will    | 2013-10-02 | 1          | 0   | 0       | 0     |
+---------+------------+------------+-----+---------+-------+

3. Aggregation condition

Since v0.7.0, HAVING has been added to support the conditional filtering of aggregation results directly. By using HAVING , we can avoid using a subquery after the GROUP BY statement. See the following example:

SELECT
gender_age.age
FROM employee
GROUP BY gender_age.age
HAVING count(*)=1;
+----------------+
| gender_age.age |
+----------------+
| 27             |
| 30             |
| 35             |
| 57             |
+----------------+

SELECT
gender_age.age,
count(*) as cnt     -- Column aliases can be used in HAVING, as in ORDER BY
FROM employee
GROUP BY gender_age.age HAVING cnt=1;
+----------------+-----+
| gender_age.age | cnt |
+----------------+-----+
| 27             | 1   |
| 30             | 1   |
| 35             | 1   |
| 57             | 1   |
+----------------+-----+

HAVING supports filtering on regular columns too. However, for better performance, it is recommended to put such filters in the WHERE clause rather than in HAVING, so rows are dropped before aggregation.
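
A minimal sketch of that recommendation against the same employee table (the age threshold is illustrative):

-- Filter rows before aggregation instead of after it
SELECT gender_age.age, count(*) as cnt
FROM employee
WHERE gender_age.age > 30
GROUP BY gender_age.age;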

 

If we do not use HAVING , we can use a subquery instead as follows:

SELECT
    a.age
FROM 
(
    SELECT 
        count(*) as cnt, gender_age.age
    FROM 
        employee 
    GROUP BY 
        gender_age.age
) a 
WHERE a.cnt <= 1;
+--------+
| a.age  |
+--------+
| 27     |
| 30     |
| 35     |
| 57     |
+--------+

 

4. Window functions

Window functions, available since Hive v0.11.0, are a special group of functions that scan multiple input rows to compute each output value. Window functions are usually used with OVER, PARTITION BY, ORDER BY, and a windowing specification. Different from the regular aggregate functions used with the GROUP BY clause, which are limited to one result value per group, window functions operate on windows where the input rows are ordered and grouped using flexible conditions expressed through the OVER and PARTITION BY clauses. Window functions give aggregate results, but they do not group the result set; they return the group value multiple times, with each record. Window functions offer great flexibility compared to the regular GROUP BY clause and make special aggregations in HQL easier and more powerful. The syntax for a window function is as follows:

Function (arg1,..., argn) OVER ([PARTITION BY <...>] [ORDER BY <....>] [<window_expression>])

Function (arg1,..., argn) can be any function in the following three categories:

  • Aggregate Functions: Regular aggregate functions, such as sum(...) and max(...)
  • Sort Functions: Functions for sorting data, such as rank(...) and row_number(...)
  • Analytics Functions: Functions for statistics and comparisons, such as lead(...), lag(...), and first_value(...)

The OVER [PARTITION BY <...>] clause is similar to the GROUP BY clause. It divides the rows into groups containing identical values in one or more partition columns. These logical groups are known as partitions, which is not the same term as used for partitioned tables. Omitting the PARTITION BY clause applies the operation to all the rows in the table.

The [ORDER BY <....>] clause is the same as the regular ORDER BY clause. It makes sure the rows produced by the PARTITION BY clause are ordered by specifications, such as ascending or descending order. Next, we'll learn more details of each category of window functions through examples.

4.1 Window aggregate functions

Using regular aggregate functions as window functions brings more flexibility than GROUP BY, which requires every selected column to appear in the GROUP BY clause. Since Hive v2.2.0, DISTINCT has been supported for use with aggregate functions in window functions:

--1. Prepare the table and data for demonstration:
CREATE TABLE IF NOT EXISTS employee_contract (
name string,
dept_num int,
employee_id int,
salary int,
type string,
start_date date
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED as TEXTFILE;

LOAD DATA INPATH '/tmp/hivedemo/data/employee_contract.txt'
OVERWRITE INTO TABLE employee_contract;

--2. The regular aggregations are used as window functions:
SELECT
name,
dept_num as deptno,
salary,
count(*) OVER (PARTITION BY dept_num) as cnt,
count(distinct dept_num) OVER (PARTITION BY dept_num) as dcnt,
sum(salary) OVER(PARTITION BY dept_num ORDER BY dept_num) as sum1,
sum(salary) OVER(ORDER BY dept_num) as sum2,
sum(salary) OVER(ORDER BY dept_num, name) as sum3
FROM employee_contract
ORDER BY deptno, name;
+---------+--------+--------+-----+-----+-------+-------+-------+
| name    | deptno | salary | cnt | dcnt| sum1  | sum2  | sum3  |
+---------+--------+--------+-----+-----+-------+-------+-------+
| Lucy    | 1000   | 5500   | 5   | 1   | 24900 | 24900 | 5500  |
| Michael | 1000   | 5000   | 5   | 1   | 24900 | 24900 | 10500 |
| Steven  | 1000   | 6400   | 5   | 1   | 24900 | 24900 | 16900 |
| Wendy   | 1000   | 4000   | 5   | 1   | 24900 | 24900 | 20900 |
| Will    | 1000   | 4000   | 5   | 1   | 24900 | 24900 | 24900 |
| Jess    | 1001   | 6000   | 3   | 1   | 17400 | 42300 | 30900 |
| Lily    | 1001   | 5000   | 3   | 1   | 17400 | 42300 | 35900 |
| Mike    | 1001   | 6400   | 3   | 1   | 17400 | 42300 | 42300 |
| Richard | 1002   | 8000   | 3   | 1   | 20500 | 62800 | 50300 |
| Wei     | 1002   | 7000   | 3   | 1   | 20500 | 62800 | 57300 |
| Yun     | 1002   | 5500   | 3   | 1   | 20500 | 62800 | 62800 |
+---------+--------+--------+-----+-----+-------+-------+-------+

 

4.2 Window sort functions

Window sort functions provide sorting information, such as row number and rank, within specific groups as part of the data returned. The most commonly used sort functions are as follows:

  • row_number : Assigns a unique sequence number starting from 1 to each row, according to the partition and order specification.
  • rank : Ranks items in a group, such as finding the top N rows for specific conditions.
  • dense_rank : Similar to rank , but leaves no gaps in the ranking sequence when there are ties. For example, if we rank a match using dense_rank and have two players tied for second place, we would see that the two players were both in second place and that the next person is ranked third. However, the rank function would rank two people in second place, but the next person would be in fourth place.
  • percent_rank : Uses rank values rather than row counts in its numerator as (current rank - 1)/(total number of rows - 1). Therefore, it returns the percentage rank of a value relative to a group of values.
  • ntile : Divides an ordered dataset into a number of buckets and assigns an appropriate bucket number to each row. It can be used to divide rows into equal sets and assign a number to each row.


Here are some examples using window sort functions in HQL:

SELECT
    name,
    dept_num as deptno,
    salary,
    row_number() OVER () as rnum, -- sequence in original table
    rank() OVER (PARTITION BY dept_num ORDER BY salary) as rk,
    dense_rank() OVER (PARTITION BY dept_num ORDER BY salary) as drk,
    percent_rank() OVER(PARTITION BY dept_num ORDER BY salary) as prk,
    ntile(4) OVER(PARTITION BY dept_num ORDER BY salary) as ntile
FROM employee_contract
ORDER BY deptno, name;
+---------+--------+--------+------+----+-----+------+-------+
| name    | deptno | salary | rnum | rk | drk | prk  | ntile |
+---------+--------+--------+------+----+-----+------+-------+
| Lucy    | 1000   | 5500   | 7    | 4  | 3   | 0.75 | 3     |
| Michael | 1000   | 5000   | 11   | 3  | 2   | 0.5  | 2     |
| Steven  | 1000   | 6400   | 8    | 5  | 4   | 1.0  | 4     |
| Wendy   | 1000   | 4000   | 9    | 1  | 1   | 0.0  | 1     |
| Will    | 1000   | 4000   | 10   | 1  | 1   | 0.0  | 1     |
| Jess    | 1001   | 6000   | 5    | 2  | 2   | 0.5  | 2     |
| Lily    | 1001   | 5000   | 6    | 1  | 1   | 0.0  | 1     |
| Mike    | 1001   | 6400   | 4    | 3  | 3   | 1.0  | 3     |
| Richard | 1002   | 8000   | 1    | 3  | 3   | 1.0  | 3     |
| Wei     | 1002   | 7000   | 3    | 2  | 2   | 0.5  | 2     |
| Yun     | 1002   | 5500   | 2    | 1  | 1   | 0.0  | 1     |
+---------+--------+--------+------+----+-----+------+-------+

Since Hive v2.1.0, we have been able to use aggregate functions in the OVER clause as follows:

SELECT
dept_num,
rank() OVER (PARTITION BY dept_num ORDER BY sum(salary)) as rk
FROM employee_contract
GROUP BY dept_num;
+----------+----+
| dept_num | rk |
+----------+----+
| 1000     | 1  |
| 1001     | 1  |
| 1002     | 1  |
+----------+----+

 

4.3 Window analytics functions

Window analytics functions provide extended data analytics, such as getting lag, lead, last, or first rows in the ordered set. The most commonly used analytics functions are as follows:

  • cume_dist : Computes the fraction of rows whose value is smaller than or equal to the value of the current row, that is, (number of rows ≤ current row)/(total number of rows).
  • lead : This function, lead(value_expr[,offset[,default]]) , is used to return data from the next row. The number (offset) of rows to lead can optionally be specified; one is the default. The function returns default, or NULL when no default is specified, if the lead for the current row extends beyond the end of the window.
  • lag : This function, lag(value_expr[,offset[,default]]) , is used to access data from a previous row. The number (offset) of rows to lag can optionally be specified; one is the default. The function returns default, or NULL when no default is specified, if the lag for the current row extends beyond the beginning of the window.
  • first_value : Returns the first result from an ordered set.
  • last_value : Returns the last result from an ordered set.


Here are some examples using window analytics functions in HQL:

SELECT
    name,
    dept_num as deptno,
    salary,
    cume_dist() OVER (PARTITION BY dept_num ORDER BY salary) as cume,
    lead(salary, 2) OVER (PARTITION BY dept_num ORDER BY salary) as lead,
    lag(salary, 2, 0) OVER (PARTITION BY dept_num ORDER BY salary) as lag,
    first_value(salary) OVER (PARTITION BY dept_num ORDER BY salary) as fval,
    last_value(salary) OVER (PARTITION BY dept_num ORDER BY salary) as lval,
    last_value(salary) OVER (PARTITION BY dept_num ORDER BY salary
      RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as lval2
FROM employee_contract
ORDER BY deptno, salary;
+---------+--------+--------+------+------+------+------+------+-------+
| name    | deptno | salary | cume | lead | lag  | fval | lval | lval2 |
+---------+--------+--------+------+------+------+------+------+-------+
| Will    | 1000   | 4000   | 0.4  | 5500 | 0    | 4000 | 4000 | 6400  |
| Wendy   | 1000   | 4000   | 0.4  | 5000 | 0    | 4000 | 4000 | 6400  |
| Michael | 1000   | 5000   | 0.6  | 6400 | 4000 | 4000 | 5000 | 6400  |
| Lucy    | 1000   | 5500   | 0.8  | NULL | 4000 | 4000 | 5500 | 6400  |
| Steven  | 1000   | 6400   | 1.0  | NULL | 5000 | 4000 | 6400 | 6400  |
| Lily    | 1001   | 5000   | 0.33 | 6400 | 0    | 5000 | 5000 | 6400  |
| Jess    | 1001   | 6000   | 0.67 | NULL | 0    | 5000 | 6000 | 6400  |
| Mike    | 1001   | 6400   | 1.0  | NULL | 5000 | 5000 | 6400 | 6400  |
| Yun     | 1002   | 5500   | 0.33 | 8000 | 0    | 5500 | 5500 | 8000  |
| Wei     | 1002   | 7000   | 0.67 | NULL | 0    | 5500 | 7000 | 8000  |
| Richard | 1002   | 8000   | 1.0  | NULL | 5500 | 5500 | 8000 | 8000  |
+---------+--------+--------+------+------+------+------+------+-------+

For last_value , the result (the lval column) is a little unexpected. This is because the default window clause (introduced in the next section) is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW , which, in this example, means the current row is always the last value. Changing the window clause to RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING gives us the expected result (see the lval2 column).

 

4.4 Window expression
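
A window expression specifies the range of rows a window function operates over, in the form [ROWS | RANGE] BETWEEN <start> AND <end> , where the boundaries can be UNBOUNDED PRECEDING , N PRECEDING , CURRENT ROW , N FOLLOWING , or UNBOUNDED FOLLOWING . As noted above, when ORDER BY is present the default window is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW . A minimal sketch against the same employee_contract table (the alias names are illustrative):

SELECT
    name,
    dept_num as deptno,
    salary,
    -- running total from the start of the partition to the current row
    sum(salary) OVER (PARTITION BY dept_num ORDER BY salary
      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as run_total,
    -- moving window covering the previous, current, and next rows
    sum(salary) OVER (PARTITION BY dept_num ORDER BY salary
      ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) as win3
FROM employee_contract
ORDER BY deptno, salary;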

 

5. Sampling

When the data volume is extremely large, we may need to find a subset of the data to speed up analysis. This is sampling: a technique used to identify and analyze a subset of data in order to discover patterns and trends in the whole dataset. In HQL, there are three ways of sampling data: random sampling, bucket table sampling, and block sampling.

5.1 Random sampling

Random sampling uses the rand() function and the LIMIT keyword to get a sample of the data, as shown in the following example. The DISTRIBUTE BY and SORT BY keywords are used here to make sure the data is randomly distributed among mappers and reducers efficiently. ORDER BY rand() can achieve the same purpose, but its performance is poor because the global ordering forces all data through a single reducer:

SELECT name FROM employee_hr
DISTRIBUTE BY rand() SORT BY rand() LIMIT 2;
+--------+
| name   |
+--------+
| Will   |
| Steven |
+--------+

 

5.2 Bucket table sampling

This is a special sampling method optimized for bucket tables, as shown in the following example. The ON clause of TABLESAMPLE specifies the column to sample data on; the rand() function can be used instead to sample entire rows. If the sample column is also the CLUSTERED BY column, the sampling is more efficient because only the required buckets need to be read:

-- Sampling based on the whole row
SELECT name FROM employee_trans
TABLESAMPLE(BUCKET 1 OUT OF 2 ON rand()) a;
+--------+
| name   |
+--------+
| Steven |
+--------+

-- Sampling based on the bucket column, which is efficient
SELECT name FROM employee_trans
TABLESAMPLE(BUCKET 1 OUT OF 2 ON emp_id) a;
+---------+
| name    |
+---------+
| Lucy    |
| Steven  |
| Michael |
+---------+

 

5.3 Block sampling

This type of sampling lets a query pick n rows of data, n percent of the data size, or n bytes of data. The sampling granularity is the HDFS block size. Refer to the following examples:

-- Sample by number of rows
SELECT name
FROM employee TABLESAMPLE(1 ROWS) a;
+----------+
| name     |
+----------+
| Michael  |
+----------+

-- Sample by percentage of data size
SELECT name
FROM employee TABLESAMPLE(50 PERCENT) a;
+----------+
|name      |
+----------+
| Michael  |
| Will     |
+----------+

-- Sample by data size
-- Support b/B, k/K, m/M, g/G
SELECT name FROM employee TABLESAMPLE(1B) a;
+----------+
|name      |
+----------+
| Michael  |
+----------+