[Solved-1 Solution] Max/Min for whole sets of records in PIG ?
What is max()
- The Pig Latin MAX() function is used to calculate the highest value for a column (numeric values or chararrays) in a single-column bag. While calculating the maximum value, the Max() function ignores the NULL values.
- To get the global maximum value, we need to perform a Group All.operation, and calculate the maximum value using the MAX() function.
- To get the maximum value of a group, we need to group it using the Group By operator and proceed with the maximum function.
Syntax
- Given below is the syntax of the Max() function.
grunt> Max(expression)
What is min()
- The MIN() function of Pig Latin is used to get the minimum (lowest) value (numeric or chararray) for a certain column in a single-column bag. While calculating the minimum value, the MIN() function ignores the NULL values.
- To get the global minimum value, we need to perform a Group All operation, and calculate the minimum value using the MIN() function.
- To get the minimum value of a group, we need to group it using theGroup By operator and proceed with the minimum function.
Syntax
- Here is the syntax of the MIN() function.
grunt> MIN(expression)
Problem:
- Here is an example of set of records that we loading from a file and the first thing we need to do is get the max and min of a column.
- In SQL we would do this with a subquery like this:
select c.state, c.population,
(select max(c.population) from state_info c) as max_pop,
(select min(c.population) from state_info c) as min_pop
from state_info c
- It has a MAX and MIN function but when we tried doing the following it didn't work:
records=LOAD '/Users/Winter/School/st_incm.txt' AS (state:chararray, population:int);
with_max = FOREACH records GENERATE state, population, MAX(population);
- This one is not working. Is there any solution?
Solution 1:
- We need to group all the data together but no extra column is required if we can use GROUP ALL
Pig:
records = LOAD 'states.txt' AS (state:chararray, population:int);
records_group = GROUP records ALL;
with_max = FOREACH records_group
GENERATE
FLATTEN(records.(state, population)), MAX(records.population);
Input:
CA 10
VA 5
WI 2
Output:
(CA,10,10)
(VA,5,10)
(WI,2,10)