[Solved-1 Solutions] Apache Pig: load an entire relation into a UDF?
UDF:
- Apache Pig provides extensive support for User Defined Functions (UDFs). With UDFs, we can define our own functions and call them from Pig scripts.
- UDFs can be written in six programming languages: Java, Jython, Python, JavaScript, Ruby and Groovy.
- Java has complete UDF support; the remaining languages have limited support.
- In Java, we can write UDFs for every part of the processing, including data load/store, column transformation, and aggregation.
- Since Apache Pig itself is written in Java, UDFs written in Java generally run more efficiently than those written in the other languages.
- Apache Pig also has a Java repository of UDFs named Piggybank. Through Piggybank, we can use Java UDFs written by other users and contribute our own.
Problem:
We have a Pig script that works with two relations, say A and B. A is a small relation and B is a big one. Our UDF should load all of A into memory on each machine and then use it while processing B. Currently we do it like this:
A = FOREACH smallRelation GENERATE ...;
B = FOREACH largeRelation GENERATE propertyOfB;
STORE A INTO 'templocation';
C = FOREACH B GENERATE CustomUdf(propertyOfB);
Every machine then loads A from 'templocation'. This works, but we have two problems with it:
- How can we load a relation directly into the HDFS cache?
- When we reload the file in the UDF, we have to write our own logic to parse the output that was written to the file for A. We would rather work directly with bags and tuples (is there a built-in Pig Java function to parse strings back into Bag/Tuple form?).
Does anyone know how this should be done?
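For context, the hand-written parsing that the second bullet complains about might look like the minimal sketch below. It assumes A was stored with Pig's default storage (tab-delimited text, one tuple per line) and contains only flat scalar fields. The class and method names are hypothetical, and this is plain Java rather than Pig's own Tuple/DataBag API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: turns lines of a stored relation back into field lists.
class StoredRelationParser {

    // Split one tab-delimited line into its fields.
    // The -1 limit keeps trailing empty fields instead of dropping them.
    static List<String> parseLine(String line) {
        return Arrays.asList(line.split("\t", -1));
    }

    // Parse every non-empty line; each inner list stands in for one tuple of A.
    static List<List<String>> parseLines(List<String> lines) {
        List<List<String>> tuples = new ArrayList<>();
        for (String line : lines) {
            if (!line.isEmpty()) {
                tuples.add(parseLine(line));
            }
        }
        return tuples;
    }
}
```

Nested bags or tuples inside a field would need considerably more parsing than this, which is why keeping the data in Pig's own types is attractive.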
Solution 1:
- We can GROUP ALL on A first, which "bags" all the data in A into a single field. Then we add an artificial common field to both A and B and join on it.
- This way, for each tuple in the enhanced B, the full data of A is available for the UDF to use.
Here is the code used to load the entire relation into the UDF:
-- add an artificial join key with value 'xx'
B_aux = FOREACH B GENERATE 'xx' AS join_key, fb1, fb2;
A_all = GROUP A ALL;
-- $1 is the bag that holds every tuple of A
A_aux = FOREACH A_all GENERATE 'xx' AS join_key, $1 AS A_bag;
A_B_JOINED = JOIN B_aux BY join_key, A_aux BY join_key USING 'replicated';
C = FOREACH A_B_JOINED GENERATE CustomUdf(fb1, fb2, A_bag);
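- Because this is a replicated join, it runs as a map-side join: the small relation (A_aux, listed last) is loaded into memory on each mapper, so the join itself needs no reduce phase.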
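The effect of the script above can be pictured with a conceptual sketch in plain Java. This is not Pig's implementation and does not use the Pig API. The class name, the single field fb1, and the body of customUdf are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Conceptual model of GROUP ALL plus a replicated join on a constant key.
class ReplicatedJoinSketch {

    // Hypothetical stand-in for CustomUdf: it receives one field of B
    // together with the complete contents of A.
    static String customUdf(String fb1, List<String> allOfA) {
        return fb1 + ":" + (allOfA.contains(fb1) ? "in A" : "not in A");
    }

    static List<String> process(List<String> smallA, List<String> largeB) {
        // GROUP A ALL: a single in-memory "bag" containing every tuple of A.
        List<String> bagOfA = new ArrayList<>(smallA);

        List<String> out = new ArrayList<>();
        // Map-side pass over B: every record carries the same constant key,
        // so each one is paired with that single bag.
        for (String fb1 : largeB) {
            out.add(customUdf(fb1, bagOfA));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(process(Arrays.asList("a", "b"), Arrays.asList("a", "c")));
    }
}
```

The sketch only works because A fits in memory, which is the same condition the replicated join places on the small relation.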