Apache Pig TOKENIZE() Function
What is the TOKENIZE() function in Apache Pig?
- The TOKENIZE() function in Apache Pig splits a string held in a single tuple and returns a bag that contains the output of the split operation.
- The input string is broken into tokens at delimiter characters, not at a regular expression pattern. If no delimiter is given, the defaults are space, double quote ("), comma (,), parentheses, and asterisk (*).
- Each token is placed in its own single-field tuple inside the returned bag.
- If the input string contains no delimiter, TOKENIZE() returns a bag with one tuple, which contains the whole input string.
- Otherwise, each substring found between delimiter matches becomes one tuple in the bag; the delimiters themselves are not returned.
Syntax
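TOKENIZE(expression [, 'field_delimiter'])
Here expression is an expression of type chararray, and 'field_delimiter' is an optional delimiter given in single quotes.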
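The optional second argument can be sketched as follows. This is a minimal illustration, not a definitive recipe: it assumes a Pig release that accepts the two-argument form, it reuses the file path and schema from the example in this tutorial, and the relation names students and name_tokens are made up for the sketch.

```pig
-- Hypothetical relation names; the path and schema come from this tutorial's example.
students = LOAD 'hdfs://localhost:9000/pig_data/wikitechy_student_details.txt'
           USING PigStorage(',')
           AS (id:int, name:chararray, age:int, city:chararray);

-- Split name on the space character only, instead of the default delimiter set.
name_tokens = FOREACH students GENERATE TOKENIZE(name, ' ');

DUMP name_tokens;
```

With the sample data, the result should match the output of the single-argument call shown later, since the names contain no other default delimiter characters.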
Example
Assume the file wikitechy_student_details.txt in the HDFS directory /pig_data/ contains the following data:
111,Suresh Reddy,21,Hyderabad
112,Arvin Battacharya,22,Kolkata
113,Ramesh Khanna,22,Delhi
114,Preethi Agarwal,21,Pune
115,Sruthi Mohanthy,23,Bhuwaneshwar
116,Vanitha Mishra,23,Chennai
117,Kamala Nayak,24,trivendram
118,Bhargavi Nambiayar,24,Chennai
Load the file into Pig as the relation wikitechy_student_details, as shown below:
grunt> wikitechy_student_details = LOAD 'hdfs://localhost:9000/pig_data/wikitechy_student_details.txt' USING PigStorage(',')
as (id:int, name:chararray, age:int, city:chararray);
Tokenizing a String
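We can use the TOKENIZE() function to split a string into tokens. The statement below splits the name field of each student, which breaks at the space between the first and last name.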
grunt> student_name_tokenize = FOREACH wikitechy_student_details GENERATE TOKENIZE(name);
Verification
grunt> Dump student_name_tokenize;
Output
({(Suresh),(Reddy)})
({(Arvin),(Battacharya)})
({(Ramesh),(Khanna)})
({(Preethi),(Agarwal)})
({(Sruthi),(Mohanthy)})
({(Vanitha),(Mishra)})
({(Kamala),(Nayak)})
({(Bhargavi),(Nambiayar)})
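Each row above holds a single bag. When one token per row is more convenient, the bag can be unnested with FLATTEN. The sketch below assumes the wikitechy_student_details relation has been loaded as shown earlier; the relation name student_name_tokens and the field name token are hypothetical.

```pig
-- Assumes wikitechy_student_details is already loaded as shown earlier.
-- student_name_tokens and token are illustrative names, not part of the original example.
-- FLATTEN unnests the bag returned by TOKENIZE, producing one row per token.
student_name_tokens = FOREACH wikitechy_student_details
                      GENERATE id, FLATTEN(TOKENIZE(name)) AS token;

DUMP student_name_tokens;
```

For student 111 this should yield two rows, (111,Suresh) and (111,Reddy), rather than one row containing a bag.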