What is the PluckTuple() Function in Apache Pig?
- PluckTuple() is an Apache Pig function that plucks fields out of a relation by name, using either a string prefix or a regex pattern.
- We can use the function PluckTuple() after performing operations like join to differentiate the columns of the two schemas.
- We define a string prefix, and PluckTuple() keeps only the columns in the relation whose names begin with that prefix (or match it, when a regex pattern is given).
- We can include the flag 'false' to keep the columns that do not match the prefix or regex pattern instead.
Syntax
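DEFINE pluck PluckTuple(expression1)
DEFINE pluck PluckTuple(expression1,expression3)
pluck(expression2)
- expression1 is the prefix or regex pattern to pluck by.
- expression2 is the set of fields to apply the pluck to, usually just *.
- expression3 is an optional boolean flag ('true' or 'false') that says whether the matching fields are included or excluded.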
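- As a rough sketch of how the three expressions fit together, consider two placeholder relations a and b. The file names 'a' and 'b', the field names x and y, and the aliases pluckNegative, c, d and e are assumptions made only for illustration. A join names each field after the relation it came from, and that qualified name is what the prefix is compared against:
a = LOAD 'a' AS (x, y);
b = LOAD 'b' AS (x, y);
c = JOIN a BY x, b BY x;
-- expression1 alone: keep the fields whose names begin with a::
DEFINE pluck PluckTuple('a::');
-- expression1 plus expression3 set to 'false': keep the fields that do not begin with a::
DEFINE pluckNegative PluckTuple('a::', 'false');
-- expression2 is *, so the pluck is applied to every field of c
d = FOREACH c GENERATE FLATTEN(pluck(*));
e = FOREACH c GENERATE FLATTEN(pluckNegative(*));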
Example
- Assume that we have two files, wikitechy_employee_sales.txt and wikitechy_employee_bonus.txt, in the HDFS directory /pig_data/.
wikitechy_employee_sales.txt
1,Joseph,22,25000,sales
2,BOB,23,30000,sales
3,Saya,23,25000,sales
4,Sarah,25,40000,sales
5,John,23,45000,sales
6,Vanitha,22,35000,sales
wikitechy_employee_bonus.txt
1,Joseph,22,25000,sales
2,Jaya,23,20000,admin
3,Saya,23,25000,sales
4,Preethi,25,50000,admin
5,John,23,45000,sales
6,Sruti,30,30000,admin
- We have loaded these files into Pig as the relations employee_sales and employee_bonus:
employee_sales
grunt> employee_sales = LOAD 'hdfs://localhost:9000/pig_data/wikitechy_employee_sales.txt' USING PigStorage(',')
as (sno:int, name:chararray, age:int, salary:int, dept:chararray);
employee_bonus
grunt> employee_bonus = LOAD 'hdfs://localhost:9000/pig_data/wikitechy_employee_bonus.txt' USING PigStorage(',')
as (sno:int, name:chararray, age:int, salary:int, dept:chararray);
- We join these two relations on sno by using the join operator, as given below:
grunt> join_data = join employee_sales by sno, employee_bonus by sno;
- We can verify the relation join_data by using the Dump operator, as given below:
grunt> Dump join_data;
(1,Joseph,22,25000,sales,1,Joseph,22,25000,sales)
(2,BOB,23,30000,sales,2,Jaya,23,20000,admin)
(3,Saya,23,25000,sales,3,Saya,23,25000,sales)
(4,Sarah,25,40000,sales,4,Preethi,25,50000,admin)
(5,John,23,45000,sales,5,John,23,45000,sales)
(6,Vanitha,22,35000,sales,6,Sruti,30,30000,admin)
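- Before plucking, it can help to check how the join has named the fields. A quick sketch, assuming the two relations were loaded with the schemas shown above: the join qualifies each field with the name of the relation it came from, and these qualified names are what a PluckTuple() prefix is matched against.
grunt> Describe join_data;
join_data: {employee_sales::sno: int, employee_sales::name: chararray, employee_sales::age: int,
employee_sales::salary: int, employee_sales::dept: chararray, employee_bonus::sno: int,
employee_bonus::name: chararray, employee_bonus::age: int, employee_bonus::salary: int,
employee_bonus::dept: chararray}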
Using PluckTuple() Function
- We need to define the prefix by which we want to differentiate the columns by using the PluckTuple() function. After the join, the columns that came from employee_sales carry the prefix employee_sales::, so we pluck by that prefix.
grunt> DEFINE pluck PluckTuple('employee_sales::');
- We filter the columns in the join_data relation as given below:
grunt> data = foreach join_data generate FLATTEN(pluck(*));
- We can describe the relation named data by using the Describe operator, as given below. Only the employee_sales columns remain, each qualified with plucked:: because it was flattened out of the tuple returned by pluck:
grunt> Describe data;
data: {plucked::employee_sales::sno: int, plucked::employee_sales::name: chararray, plucked::employee_sales::age: int,
plucked::employee_sales::salary: int, plucked::employee_sales::dept: chararray}
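- To keep the other side of the join instead, the 'false' flag described earlier can be added. A minimal sketch, assuming the join_data relation from above; the names pluck_bonus and bonus_data are chosen here only for illustration:
grunt> DEFINE pluck_bonus PluckTuple('employee_sales::', 'false');
grunt> bonus_data = foreach join_data generate FLATTEN(pluck_bonus(*));
grunt> Describe bonus_data;
- With this definition every column whose name does not begin with employee_sales:: is kept, so bonus_data should contain only the five columns that came from employee_bonus.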