Apache Pig Split Operator
What is the Split Operator in Apache Pig?
- The SPLIT operator partitions a relation into two or more relations based on conditions.
- SPLIT takes a single input relation and leaves it unchanged.
- It produces two or more output relations, one per condition, plus an optional OTHERWISE relation for tuples that match no condition.
- Depending on the conditions, a tuple may be assigned to more than one relation, or to none.
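As a rough illustration of how each condition is checked independently for every tuple, the sketch below splits the employee file from the Example section on two overlapping age ranges. The relation names emp, under_23 and age_22_23 are made up for this illustration, and the comments describe what follows from the conditions rather than captured output.

```pig
-- Assumes the comma-delimited employee file from the Example section.
emp = LOAD 'hdfs://localhost:9000/pig_data/wikitechy_employee_details.txt' USING PigStorage(',')
      AS (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);

-- Each condition is tested against every tuple.
SPLIT emp INTO under_23 IF age < 23, age_22_23 IF (age > 21 AND age < 24);

-- age 21: matches only the first condition  -> under_23
-- age 22: matches both conditions           -> under_23 and age_22_23
-- age 23: matches only the second condition -> age_22_23
-- age 24: matches neither condition         -> in neither output relation
```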


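- SPLIT users into kids if age < 18, adults if age >= 18 and age < 65, seniors otherwise;
- SPLIT data into testing if RANDOM() <= 0.10, training otherwise;
- The SPLIT operator cannot handle nondeterministic functions (such as RANDOM), so the second statement does not give a reliable split. A workaround is to store the random value in a field first and then split on that field, as in the following macro: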
DEFINE split_into_training_testing(inputData, split_percentage)
RETURNS training, testing {
    -- Attach one random number to each tuple so that both branches test the same value.
    data = FOREACH $inputData GENERATE RANDOM() AS random_assignment, *;
    SPLIT data INTO testing_data IF random_assignment <= $split_percentage, training_data OTHERWISE;
    -- Drop the helper column ($0) and keep the original fields.
    $training = FOREACH training_data GENERATE $1..;
    $testing = FOREACH testing_data GENERATE $1..;
};
inData = LOAD 'some_files.txt' USING PigStorage('\t');
training, testing = split_into_training_testing(inData, 0.1);
Syntax for macro definition:
DEFINE macro_name (param [, param ...]) RETURNS {void | alias [, alias ...]} { pig_latin_fragment };
Syntax for macro expansion:
alias [, alias ...] = macro_name (param [, param ...]) ;
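To make the two syntax forms concrete, here is a minimal sketch of a macro with one return alias. The macro name younger_than, its parameters, and the aliases emp and young are hypothetical; the LOAD statement assumes the employee file from the Example section.

```pig
-- Hypothetical macro: return the tuples of $rel whose age field is below $max_age.
DEFINE younger_than(rel, max_age) RETURNS result {
    $result = FILTER $rel BY age < $max_age;
};

-- Assumes the comma-delimited employee file from the Example section.
emp = LOAD 'hdfs://localhost:9000/pig_data/wikitechy_employee_details.txt' USING PigStorage(',')
      AS (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);

-- Macro expansion: the alias returned by the macro is bound to "young".
young = younger_than(emp, 23);
```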
Syntax
grunt> SPLIT Relation1_name INTO Relation2_name IF (condition1), Relation3_name IF (condition2);
Example
Ensure that we have a file named wikitechy_employee_details.txt in the HDFS directory /pig_data/ with the contents given below.
111,Anu,Shankar,23,9876543210,Chennai
112,Barvathi,Nambiayar,24,9876543211,Chennai
113,Kajal,Nayak,24,9876543212,Trivendram
114,Preethi,Antony,21,9876543213,Pune
115,Raj,Gopal,21,9876543214,Hyderabad
116,Yashika,Kannan,22,9876543215,Delhi
117,siddu,Narayanan,22,9876543216,Kolkata
118,Timple,Mohanthy,23,9876543217,Bhuwaneshwar
- Load this file into Pig under the relation name wikitechy_employee_details as given below.
wikitechy_employee_details = LOAD 'hdfs://localhost:9000/pig_data/wikitechy_employee_details.txt' USING PigStorage(',')
AS (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
- Now split the relation into two, one listing the employees younger than 23 and the other listing the employees whose age is above 22 and below 25.
SPLIT wikitechy_employee_details INTO wikitechy_employee_details1 IF age<23, wikitechy_employee_details2 IF (22<age AND age<25);
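For comparison, the same two groups (age below 23; age above 22 and below 25) can be written as one FILTER per output relation. This is a sketch of roughly equivalent logic, not a description of how Pig executes SPLIT internally, and the relation names younger_than_23 and between_22_and_25 are made up for this illustration.

```pig
-- Assumes wikitechy_employee_details was loaded with an int field named age.
younger_than_23   = FILTER wikitechy_employee_details BY age < 23;
between_22_and_25 = FILTER wikitechy_employee_details BY (age > 22 AND age < 25);
```

SPLIT tends to read more compactly when several conditions apply to the same input relation.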
Verification
Now verify the relations wikitechy_employee_details1 and wikitechy_employee_details2 using the DUMP operator as shown below.
grunt> Dump wikitechy_employee_details1;
grunt> Dump wikitechy_employee_details2;
Output
- The following output displays the contents of the relations wikitechy_employee_details1 and wikitechy_employee_details2 respectively.
grunt> Dump wikitechy_employee_details1;
114,Preethi,Antony,21,9876543213,Pune
115,Raj,Gopal,21,9876543214,Hyderabad
116,Yashika,Kannan,22,9876543215,Delhi
117,siddu,Narayanan,22,9876543216,Kolkata
grunt> Dump wikitechy_employee_details2;
111,Anu,Shankar,23,9876543210,Chennai
112,Barvathi,Nambiayar,24,9876543211,Chennai
113,Kajal,Nayak,24,9876543212,Trivendram
118,Timple,Mohanthy,23,9876543217,Bhuwaneshwar
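A variation, sketched here under the assumption that the installed Pig version supports OTHERWISE in SPLIT: the third relation collects every employee that the first two conditions do not cover. The relation names age_21, age_22 and older are made up for this illustration.

```pig
-- Assumes wikitechy_employee_details was loaded with an int field named age.
SPLIT wikitechy_employee_details INTO age_21 IF age == 21, age_22 IF age == 22, older OTHERWISE;

-- In the sample file, the rows with age 23 or 24 match neither condition,
-- so they are the ones routed to "older".
```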