[Solved-1 Solution] Removing duplicates using PigLatin ?
What is pig latin ?
- Pig is a high level scripting language that is used with Apache Hadoop. Pig enables data workers to write complex data transformations without knowing Java.
- Pig's simple SQL-like scripting language is called Pig Latin, and appeals to developers already familiar with scripting languages and SQL.
Problem:
- If you using PigLatin. And you may want to remove the duplicates from the bags and want to retain the last element of the particular key.
Input:
User1 7 LA
User1 8 NYC
User1 9 NYC
User2 3 NYC
User2 4 DC
Output:
User1 9 NYC
User2 4 DC
- Here the first filed is a key. And if you want the last record of that particular key to be retained in the output.
- we know how to retain the first element. It is as below. But not able to retain the last element.
inpt = load '......' ......;
user_grp = GROUP inpt BY $0;
filtered = FOREACH user_grp {
top_rec = LIMIT inpt 1;
GENERATE FLATTEN(top_rec);
};
Solution 1:
- If order by one of the fields in descending order. Its possible to get the last record. In the below code, have ordered by second field of input
Input :
User1,7,LA
User1,8,NYC
User1,9,NYC
User2,3,NYC
User2,4,DC
Pig snippet :
user_details = LOAD 'user_details.csv' USING PigStorage(',') AS (user_name:chararray,no:long,city:chararray);
user_details_grp_user = GROUP user_details BY user_name;
required_user_details = FOREACH user_details_grp_user {
user_details_sorted_by_no = ORDER user_details BY no DESC;
top_record = LIMIT user_details_sorted_by_no 1;
GENERATE FLATTEN(top_record);
}
Output : DUMP required_user_details
(User1,9,NYC )
(User2,4,DC)