
[Solved-1 Solution] Removing duplicates using Pig Latin

What is Pig Latin?

Pig is a high-level scripting language that is used with Apache Hadoop. Pig enables data workers to write complex data transformations without knowing Java.
Pig's simple SQL-like scripting language is called Pig Latin, and appeals to developers already familiar with scripting languages and SQL.

Problem:

Suppose you are using Pig Latin and want to remove duplicates from the bags, retaining only the last element for each key.

Input:

User1  7 LA
User1  8 NYC
User1  9 NYC
User2  3 NYC
User2  4 DC

Output:

User1  9 NYC
User2  4 DC

Here the first field is the key, and the last record for each key should be retained in the output.
We know how to retain the first element, as shown below, but not how to retain the last element.

inpt = load '......' ......;
user_grp = GROUP inpt BY $0;
filtered = FOREACH user_grp {
      top_rec = LIMIT inpt 1;
      GENERATE FLATTEN(top_rec);
};

Solution 1:

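A bag has no guaranteed order, so LIMIT on its own cannot pick the last record. If you order the bag by one of the fields in descending order, it is possible to get the last record. In the code below, the bag is ordered by the second field of the input, which assumes the last record for a key is the one with the largest value in that field.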

Input :

User1,7,LA
User1,8,NYC
User1,9,NYC
User2,3,NYC
User2,4,DC

Pig snippet :

user_details = LOAD 'user_details.csv'  USING  PigStorage(',') AS (user_name:chararray,no:long,city:chararray);

user_details_grp_user = GROUP user_details BY user_name;

required_user_details = FOREACH user_details_grp_user {
    user_details_sorted_by_no = ORDER user_details BY no DESC;
    top_record = LIMIT user_details_sorted_by_no 1;
    GENERATE FLATTEN(top_record);
};

Output : DUMP required_user_details;

(User1,9,NYC)
(User2,4,DC)
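
The ORDER and LIMIT steps above could also be collapsed into a single call to Pig's built-in TOP function, which returns the top N tuples of a bag ranked by a given column index. The following is only a sketch under the same assumptions as the snippet above (same file name, same schema, and "last" meaning the largest value in the second field); it is not part of the original solution.

```pig
-- Sketch: same input file and schema as in Solution 1.
user_details = LOAD 'user_details.csv' USING PigStorage(',') AS (user_name:chararray, no:long, city:chararray);

-- Group all records that share the same key (user_name).
user_details_grp_user = GROUP user_details BY user_name;

-- TOP(topN, column, relation): keep 1 tuple per group, ranked by
-- column index 1 (0-based), which is the 'no' field here.
required_user_details = FOREACH user_details_grp_user
    GENERATE FLATTEN(TOP(1, 1, user_details));

DUMP required_user_details;
```

For the sample input this should yield the same two records as Solution 1. If two records for a key share the same largest value, which of them is kept is not defined.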
