[Solved-2 Solutions] Pig: Get top n values per group?
Problem :
The data below is already grouped and aggregated.
user value count
---- -------- ------
wiki third 5
wiki first 11
wiki second 10
wiki fourth 2
...
tiki second 20
tiki third 18
tiki first 21
tiki fourth 8
- For every user (wiki and tiki), we want to retrieve the top n values (say 2), sorted by 'count' in descending order. The desired output is:
wiki first 11
wiki second 10
tiki first 21
tiki second 20
How can we accomplish that?
Solution 1:
- The code below retrieves the top n values per group (here n = 2):
records = LOAD '/user/nubes/ncdc/micro-tab/top.txt' AS (user:chararray,value:chararray,counter:int);
grpd = GROUP records BY user;
top2 = foreach grpd {
    -- sort each user's bag by counter, highest first
    sorted = order records by counter desc;
    -- keep the first 2 tuples of the sorted bag
    top = limit sorted 2;
    generate group, flatten(top);
};
Input :
wiki third 5
wiki first 11
wiki second 10
wiki fourth 2
tiki second 20
tiki third 18
tiki first 21
tiki fourth 8
Output :
(wiki,wiki,first,11)
(wiki,wiki,second,10)
(tiki,tiki,first,21)
(tiki,tiki,second,20)
Solution 2:
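- Solution 1 uses top as a relation name:
top = limit sorted 2;
- top is also the name of a built-in function in Pig (TOP), so this alias may throw an error. The only change needed for that is to rename the relation.
- Solution 1 also generates the group key next to the flattened tuples:
generate group, flatten(top);
so the user appears twice in every row:
Output:
(wiki,wiki,first,11)
(wiki,wiki,second,10)
(tiki,tiki,first,21)
(tiki,tiki,second,20)
- Generating only the flattened bag removes the duplicate column. The modified script is shown below: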
records = load 'test1.txt' using PigStorage(',') as (user:chararray, value:chararray, count:int);
grpd = GROUP records BY user;
top2 = foreach grpd {
    sorted = order records by count desc;
    top1 = limit sorted 2;
    generate flatten(top1);
};
Output:
(wiki,first,11)
(wiki,second,10)
(tiki,first,21)
(tiki,second,20)
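Both solutions hard-code n = 2. Since the question asks for the top n values, here is a minimal sketch of the same script with n passed in through Pig parameter substitution. The parameter name n and the relation names top_n and limited are hypothetical, chosen only for this example; the field is called counter as in Solution 1, and the input is assumed to be the comma-delimited test1.txt from Solution 2.

```pig
-- Hypothetical parameter: number of rows to keep per user (defaults to 2).
%default n 2

records = LOAD 'test1.txt' USING PigStorage(',')
          AS (user:chararray, value:chararray, counter:int);
grpd = GROUP records BY user;
top_n = FOREACH grpd {
    -- Sort each user's bag by counter, highest first.
    sorted = ORDER records BY counter DESC;
    -- The parameter is substituted textually before the script is parsed.
    limited = LIMIT sorted $n;
    -- Emit only the flattened tuples, so the user is not repeated.
    GENERATE FLATTEN(limited);
};
DUMP top_n;
```

If the script is saved under a name such as topn.pig (a made-up file name), the default can be overridden from the command line, for example: pig -param n=3 topn.pig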