What is TOP() function in Apache Pig ?
- The TOP() function of Pig Latin returns the top N tuples of a bag.
- It takes three inputs: the number of tuples you need, the index of the column whose values are compared (0 denotes the first column), and the relation (bag) to read the tuples from.
- It returns a bag containing the N tuples that have the highest values in that column.
Syntax
grunt> TOP(topN,column,relation)
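As a rough illustration of how the three arguments line up, the sketch below applies TOP() to a whole relation by first collecting it into a single bag with GROUP ... ALL. It assumes the relation emp_data with the schema (id, name, age, city) that is loaded in the example that follows; the aliases emp_all and highest_ids are made up for this sketch.

```pig
-- Assumes emp_data has the schema (id, name, age, city) used in the example below.
-- GROUP ... ALL places every tuple of emp_data into one bag that TOP() can read.
emp_all = GROUP emp_data ALL;

-- TOP(topN, column, relation):
--   3        -> number of tuples to keep
--   0        -> index of the column to compare (id is the first column)
--   emp_data -> the bag inside emp_all that holds the tuples
highest_ids = FOREACH emp_all GENERATE TOP(3, 0, emp_data);
```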
Example
- Ensure we have a file named wikitechy_emp_details.txt in the HDFS directory /pig_data/, with the following content.
wikitechy_emp_details.txt
111,Anu,22,newyork
112,Bastin,23,Kolkata
113,Cimen,23,Tokyo
114,Darathy,25,London
115,Enba,23,Bhuwaneshwar
116,Favin,22,Chennai
117,Robert,22,newyork
118,Syam,23,Kolkata
119,Mary,25,Tokyo
120,Vincent,25,London
121,Preethi,25,Bhuwaneshwar
122,Antony,22,Chennai
- Load this file into Pig under the relation name emp_data as given below.
grunt> emp_data = LOAD 'hdfs://localhost:9000/pig_data/wikitechy_emp_details.txt' USING PigStorage(',')
as (id:int, name:chararray, age:int, city:chararray);
- Group the relation emp_data by age, and store it in the relation emp_group.
grunt> emp_group = Group emp_data BY age;
Now verify the relation emp_group using the Dump operator as given below.
grunt> Dump emp_group;
(22,{(122,Antony,22,Chennai),(117,Robert,22,newyork),(116,Favin,22,Chennai),(111,Anu,22,newyork)})
(23,{(118,Syam,23,Kolkata),(115,Enba,23,Bhuwaneshwar),(113,Cimen,23,Tokyo),(112,Bastin,23,Kolkata)})
(25,{(121,Preethi,25,Bhuwaneshwar),(120,Vincent,25,London),(119,Mary,25,Tokyo),(114,Darathy,25,London)})
Now, you can get the two records with the highest id in each group as given below.
grunt> data_top = FOREACH emp_group {
top = TOP(2, 0, emp_data);
GENERATE top;
}
- In this example we retrieve the 2 tuples of each group that have the greatest id.
- Because the comparison is based on id, we pass the index of the id column (0) as the second parameter of the TOP() function.
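The statement above produces one bag per group and drops the group key. If one row per selected employee is easier to work with, a variant along the following lines may help. It is only a sketch: the aliases data_top_flat and top2 are illustrative names, not part of the original example.

```pig
-- Illustrative variant of the FOREACH above; data_top_flat and top2 are made-up aliases.
data_top_flat = FOREACH emp_group {
    -- Same TOP() call as before: keep 2 tuples, compared on column 0 (id).
    top2 = TOP(2, 0, emp_data);
    -- Keep the group key (age) and turn the bag into one output row per tuple.
    GENERATE group AS age, FLATTEN(top2);
};
```

With this form each output row starts with the age of the group, followed by the fields of one selected tuple.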
Verification
You can verify the contents of the data_top relation using the Dump operator as given below.
grunt> Dump data_top;
({(117,Robert,22,newyork),(122,Antony,22,Chennai)})
({(115,Enba,23,Bhuwaneshwar),(118,Syam,23,Kolkata)})
({(120,Vincent,25,London),(121,Preethi,25,Bhuwaneshwar)})
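TOP() selects the tuples with the highest values. If the two lowest ids per group are wanted instead, one possible approach is to swap TOP() for a nested ORDER followed by LIMIT, as sketched below. This is a different technique from TOP(), shown only for comparison; the aliases data_bottom, sorted and lowest are illustrative.

```pig
-- Alternative to TOP(): sort each group's bag and keep the first 2 tuples.
-- data_bottom, sorted and lowest are made-up aliases for this sketch.
data_bottom = FOREACH emp_group {
    -- Sort the tuples of the current group by id, smallest first.
    sorted = ORDER emp_data BY id ASC;
    -- Keep only the first two tuples of the sorted bag.
    lowest = LIMIT sorted 2;
    GENERATE group AS age, lowest;
};
```

For the sample file, this should pick ids 111 and 116 for age 22, ids 112 and 113 for age 23, and ids 114 and 119 for age 25.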