[Solved-1 Solution] Find average by joining two datasets in pig ?
What is average()
- The Pig-Latin AVG() function is used to compute the average of the numerical values within a bag. While calculating the average value, the AVG() function ignores the NULL values.
Syntax
- Given below is the syntax of the AVG() function.
grunt> AVG(expression)
- We can find average by using average function which is available in apache pig.
What is foreach ?
- The FOREACH operator is used to generate specified data transformations based on the column data.
grunt> Relation_name2 = FOREACH Relatin_name1 GENERATE (required data);
Problem:
How to find average by joining two datasets in pig ?
Solution 1:
We can also solve using foreach statement
- Group command will not work inside nested Foreach, its restricted in pig. Only few commands are allowed inside the nested foreach (CROSS, DISTINCT, FILTER, FOREACH, LIMIT, and ORDER BY).
EmpDetail = load '/Users/bmohanty6/EmployeeDetails/EmpDetail.txt' as (id:int, name:chararray, gender:chararray, location:chararray);
SalaryDetail = load '/Users/bmohanty6/EmployeeDetails/EmpSalary.txt' as (id:int, salary:float);
JoinedEmpDetail = join EmpDetail by id, SalaryDetail by id;
GroupedByLocation = group JoinedEmpDetail by (location,gender);
AverageSalary = FOREACH GroupedByLocation GENERATE FLATTEN(group),AVG(JoinedEmpDetail.SalaryDetail::salary);
DUMP AverageSalary;
Output for the above program:
(Mumbai,Male,8300.0)
(Mumbai,Female,11750.0)
(Chennai,Male,16900.0)
(Chennai,Female,17000.0)
(Bangalore,Male,11825.0)
(Bangalore,Female,9900.0)