[Solved-3 Solutions] Pig Script REPLACE with pipe symbol?
Problem:
- Suppose you want to strip the characters outside the curly brackets in rows that look like the following:
35|{......}|
- That is, strip the '35|' from the front and the trailing '|' from the end, leaving:
{.....}
- Starting with just the first 3 characters, you try the following, but it removes everything:
a = LOAD '/file' as (line1:chararray);
b = FOREACH a generate REPLACE(line1, '35|','');
Solution 1:
- |, { and } are special characters in regular expressions, and the second parameter of REPLACE is a regular expression.
- Try escaping the characters:
b = FOREACH a generate REPLACE(line1, '35\\|','');
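The escaped pattern above removes only the leading '35|'. As a rough illustration of how a single pattern could strip both the numeric prefix and the trailing pipe, here is a minimal JavaScript sketch. The function name stripPipes and the pattern itself are assumptions for illustration, not part of the original answer. Pig's REPLACE takes a Java regular expression, so in a Pig string literal the backslashes would need doubling (for example '^[0-9]+\\||\\|$'), and that form should be checked against your Pig version.

```javascript
// Hypothetical helper: strips a leading "<digits>|" and a trailing "|".
function stripPipes(line) {
    // ^[0-9]+\|  matches digits followed by a literal pipe at the start
    // \|$        matches a literal pipe at the very end
    // The unescaped | between them is alternation: either case is removed.
    return line.replace(/^[0-9]+\||\|$/g, '');
}

console.log(stripPipes('35|{"key":"value"}|')); // prints {"key":"value"}
```

Because both alternatives are anchored to the start or end of the line, pipes that appear inside the curly brackets are left alone.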
Solution 2:
- If we want to transform the data into a more complex form that can't be achieved simply with REPLACE, we can create a JavaScript/Java/Jython/Ruby/Groovy/Python User Defined Function (UDF), which takes your data as input and returns the processed data.
Example of a JavaScript UDF:
Pig Script:
--including the js file containing the UDF
register 'test.js' using javascript as myfuncs;
a = LOAD '/file' as (line1:chararray);
--Processing each line1 by calling UDF
b = FOREACH a generate myfuncs.processData(line1);
dump b;
test.js
processData.outputSchema = "word:chararray,num:long";
function processData(word){
    return {word:word, num:word.length};
}
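The processData example only returns the input together with its length. As a sketch of how a UDF could perform the stripping described in the problem, the function below keeps the text from the first '{' to the last '}'. The name stripOutside and its schema are hypothetical, and it assumes Pig passes the chararray field in as a JavaScript string.

```javascript
// Hypothetical UDF: keeps the text from the first '{' to the last '}'.
stripOutside.outputSchema = "line:chararray";
function stripOutside(line) {
    // Pass missing input through as null.
    if (line === null || line === undefined) {
        return null;
    }
    var start = line.indexOf('{');
    var end = line.lastIndexOf('}');
    // Return null when no {...} span is present.
    if (start === -1 || end < start) {
        return null;
    }
    return line.substring(start, end + 1);
}
```

If this function were added to test.js, it could presumably be called the same way as processData, for example myfuncs.stripOutside(line1).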
Solution 3:
- We could use REGEX_EXTRACT:
b = FOREACH a generate REGEX_EXTRACT(line1, '.*(\\{.*\\}).*', 1);
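To show what capture group 1 of this pattern returns, here is a small JavaScript sketch. The helper extractGroup1 and the second pattern are assumptions for illustration. Pig evaluates the pattern with Java's regex engine, which should treat these particular patterns the same way, but that is worth confirming on real data.

```javascript
// Hypothetical helper mirroring REGEX_EXTRACT(line1, pattern, 1):
// returns capture group 1, or null when the pattern does not match.
function extractGroup1(line, pattern) {
    var m = pattern.exec(line);
    return m === null ? null : m[1];
}

var greedy = /.*(\{.*\}).*/;            // pattern from Solution 3
var anchored = /^[^{]*(\{.*\})[^}]*$/;  // assumed variant for nested braces

console.log(extractGroup1('35|{abc}|', greedy));           // prints {abc}
console.log(extractGroup1('35|{"a":{"b":1}}|', greedy));   // prints {"b":1}}
console.log(extractGroup1('35|{"a":{"b":1}}|', anchored)); // prints {"a":{"b":1}}
```

The leading .* is greedy, so with nested brackets the group starts at the last '{' rather than the first. If rows can contain nested brackets, a pattern whose prefix cannot cross a '{', such as the assumed variant above, may be a safer choice.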