[Solved-1 Solution] Find average by joining two datasets in pig ?



What is average()

  • The Pig-Latin AVG() function is used to compute the average of the numerical values within a bag. While calculating the average value, the AVG() function ignores the NULL values.

Syntax

  • Given below is the syntax of the AVG() function.
grunt> AVG(expression) 
  • We can find average by using average function which is available in apache pig.

What is foreach ?

  • The FOREACH operator is used to generate specified data transformations based on the column data.
grunt> Relation_name2 = FOREACH Relatin_name1 GENERATE (required data);

Problem:

How to find average by joining two datasets in pig ?

Solution 1:

We can also solve using foreach statement

  • Group command will not work inside nested Foreach, its restricted in pig. Only few commands are allowed inside the nested foreach (CROSS, DISTINCT, FILTER, FOREACH, LIMIT, and ORDER BY).
EmpDetail = load '/Users/bmohanty6/EmployeeDetails/EmpDetail.txt' as (id:int, name:chararray, gender:chararray, location:chararray);
SalaryDetail = load '/Users/bmohanty6/EmployeeDetails/EmpSalary.txt' as (id:int, salary:float);                                     
JoinedEmpDetail = join EmpDetail by id, SalaryDetail by id;
GroupedByLocation = group JoinedEmpDetail by (location,gender);
AverageSalary = FOREACH GroupedByLocation GENERATE FLATTEN(group),AVG(JoinedEmpDetail.SalaryDetail::salary);
DUMP AverageSalary;

Output for the above program:

(Mumbai,Male,8300.0)
(Mumbai,Female,11750.0)
(Chennai,Male,16900.0)
(Chennai,Female,17000.0)
(Bangalore,Male,11825.0)
(Bangalore,Female,9900.0)

Related Searches to Find average by joining two datasets in pig