Apache Pig Distinct Operator
What is the DISTINCT Operator in Apache Pig?
- The DISTINCT operator removes duplicate (redundant) tuples from a relation, leaving only unique records.
- It compares entire tuples, not individual fields. To deduplicate on a single field, project that field first and then apply DISTINCT.
- It is comparable to SELECT DISTINCT in SQL, which filters a result set to remove duplicate rows.
- DISTINCT is often combined with an aggregate function, typically COUNT(), to count unique values.
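Because DISTINCT works on whole tuples, getting the unique values of one field takes two steps. The following is a minimal sketch of that pattern; the relation name `students`, the file `students.txt`, and its schema are hypothetical and serve only as an illustration.

```pig
-- Hypothetical relation and file, for illustration only.
students = LOAD 'students.txt' USING PigStorage(',')
    AS (id:int, firstname:chararray, city:chararray);

-- DISTINCT compares whole tuples, so project the field of interest first.
cities = FOREACH students GENERATE city;

-- Each city now appears once, however many students share it.
unique_cities = DISTINCT cities;
DUMP unique_cities;
```

Applying DISTINCT directly to `students` would keep every row whose `id` or `firstname` differs, even when the `city` values repeat.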
Pig Operations - Deduplication

- Only preserves unique tuples

Syntax
grunt> Relation_name2 = DISTINCT Relation_name1;
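Pig Latin also allows an optional PARALLEL clause on DISTINCT to request a number of reduce tasks. The sketch below shows how that might look; the relation `records`, the file `input.txt`, and the value 4 are assumptions chosen for illustration.

```pig
-- Hypothetical relation and file name, used only for illustration.
records = LOAD 'input.txt' USING PigStorage(',')
    AS (id:int, name:chararray);

-- PARALLEL requests 4 reduce tasks for the deduplication step.
-- The value 4 is an arbitrary choice for this sketch.
unique_records = DISTINCT records PARALLEL 4;
```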
Example:
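Assume the file wikitechy_student_details.txt, stored in the HDFS directory /pig_data/, contains the following records, some of which are duplicated: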
001,Sabrina,Reddy,9848022337,Hyderabad
002,Arvin,Battacharya,9848022338,Kolkata
002,Arvin,Battacharya,9848022338,Kolkata
003,Arun,Khanna,9848022339,Delhi
003,Arun,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Sruti,Mohanthy,9848022336,Bhuwaneshwar
006,Vanitha,Mishra,9848022335,Chennai
006,Vanitha,Mishra,9848022335,Chennai
- We load this file into Pig as the relation wikitechy_student_details:
grunt> wikitechy_student_details = LOAD 'hdfs://localhost:9000/pig_data/wikitechy_student_details.txt' USING PigStorage(',')
as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
- We remove the redundant tuples from the relation wikitechy_student_details using the DISTINCT operator and store the result in another relation named distinct_data:
grunt> distinct_data = DISTINCT wikitechy_student_details;
Verification
grunt> DUMP distinct_data;
Output:
(1,Sabrina,Reddy,9848022337,Hyderabad)
(2,Arvin,Battacharya,9848022338,Kolkata)
(3,Arun,Khanna,9848022339,Delhi)
(4,Preethi,Agarwal,9848022330,Pune)
(5,Sruti,Mohanthy,9848022336,Bhuwaneshwar)
(6,Vanitha,Mishra,9848022335,Chennai)
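DISTINCT is commonly paired with COUNT, as mentioned at the start. One way to sketch this, assuming the relation wikitechy_student_details loaded above, is to apply DISTINCT inside a nested FOREACH block; the aliases `grouped`, `cities`, `unique_cities`, `city_count`, and `num_cities` are names introduced here for illustration.

```pig
-- Put every tuple into a single group so that one overall count is produced.
grouped = GROUP wikitechy_student_details ALL;

city_count = FOREACH grouped {
    -- Project the city field from the grouped bag.
    cities = wikitechy_student_details.city;
    -- Remove repeated cities before counting.
    unique_cities = DISTINCT cities;
    GENERATE COUNT(unique_cities) AS num_cities;
};

DUMP city_count;
```

The sample file contains six different cities, so the count should reflect six unique values even though the file has nine rows.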