Apache Pig - DIFF() Function
What is the DIFF() Function in Apache Pig?
- The DIFF() function in Apache Pig compares two bags (fields) in a tuple.
- DIFF() takes two expressions as arguments, written in parentheses; these are normally two bags that belong to the same tuple.
- Any tuples that are in one bag but not in the other are returned in a bag. If the two bags match, an empty bag is returned.
- If the fields are not bags, they are wrapped in tuples and returned in a bag if they do not match; an empty bag is returned if they match.
- The implementation assumes that both bags fit into memory at the same time. It still works on larger bags, but it becomes very slow.
Syntax
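A minimal sketch of the general form, following the Pig built-in function reference. The names `relation`, `bag1`, `bag2` and `result` are placeholders chosen for illustration and do not come from this tutorial.

```pig
-- General form: DIFF(expression, expression)
-- Both expressions are normally bags that sit in the same tuple.
-- Placeholder names: relation, bag1, bag2, result.
result = FOREACH relation GENERATE DIFF(bag1, bag2);
```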
Example
- Assume that we have two files, wikitechy_employee_sales.txt and wikitechy_employee_bonus.txt, in the HDFS directory /pig_data/:
wikitechy_employee_sales.txt
wikitechy_employee_bonus.txt
We have loaded the two files into Pig as the relations employee_sales and employee_bonus.
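The two load statements can be sketched as follows. The file paths and relation names come from the text above; the comma delimiter and every field other than sno are assumptions, since the file contents are not shown here, and should be adjusted to the real data.

```pig
-- Assumed: comma-separated files; the schema is hypothetical except for sno.
employee_sales = LOAD '/pig_data/wikitechy_employee_sales.txt'
    USING PigStorage(',')
    AS (sno:int, name:chararray, amount:int);

employee_bonus = LOAD '/pig_data/wikitechy_employee_bonus.txt'
    USING PigStorage(',')
    AS (sno:int, name:chararray, amount:int);
```

Giving both relations the same schema matters here, because DIFF() compares whole tuples rather than individual fields.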
- Group the tuples of the relations employee_sales and employee_bonus by the key sno using the COGROUP operator, and store the result in the relation cogroup_data.
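The grouping step can be sketched like this, assuming both relations were loaded with a field named sno as described above.

```pig
-- One output tuple per distinct sno; each tuple carries one bag per input relation.
cogroup_data = COGROUP employee_sales BY sno, employee_bonus BY sno;
```

Each tuple of cogroup_data has the form (group, {employee_sales tuples}, {employee_bonus tuples}), where the two bags are named after the input relations.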
Verify the relation cogroup_data by using the DUMP operator.
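A sketch of the verification step. DESCRIBE is added here only as a convenient way to see the schema before running the job; it is not part of the original walkthrough.

```pig
-- Print the schema: group key plus one bag per input relation.
DESCRIBE cogroup_data;

-- Run the job and print every grouped tuple to the console.
DUMP cogroup_data;
```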
Calculating the Difference between Two Relations
Calculate the difference between the two relations by applying the DIFF() function to the two bags in each tuple of cogroup_data, and store the result in the relation diff_data.
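This step can be sketched as follows, assuming cogroup_data was built with COGROUP on employee_sales and employee_bonus so that its two bags carry those names.

```pig
-- For each sno group, keep the tuples that appear in only one of the two bags.
diff_data = FOREACH cogroup_data GENERATE DIFF(employee_sales, employee_bonus);

-- Inspect the result.
DUMP diff_data;
```

A group yields an empty bag when both files hold identical rows for that sno; otherwise the bag contains the rows that are present in only one of the two files.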