[Solved-1 Solutions] Apache Pig: load an entire relation into a UDF?

UDF:

Apache Pig provides extensive support for User Defined Functions (UDFs). With UDFs, we can define our own functions and use them in Pig scripts.
UDF support is provided in six programming languages: Java, Jython, Python, JavaScript, Ruby and Groovy.
Java has complete support for writing UDFs; the remaining languages have limited support.
In Java, we can write UDFs for every part of the processing, including data load/store, column transformation, and aggregation.
Since Apache Pig itself is written in Java, UDFs written in Java tend to run more efficiently than those written in the other languages.
Apache Pig also has a Java repository of UDFs named Piggybank. Through Piggybank, we can use Java UDFs written by other users and contribute our own.
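
As a rough illustration of what a UDF looks like, here is a minimal sketch in Python, one of the languages listed above. The function name to_upper, the file name udfs.py and the relation and field names in the comments are made up for this example. It assumes the script is registered with streaming_python; the fallback decorator is only there so the file can also be run outside Pig.

```python
# udfs.py -- hypothetical file name, used only for illustration.
# Inside a Pig script it could be registered and called roughly like this:
#   REGISTER 'udfs.py' USING streaming_python AS myfuncs;
#   upper_names = FOREACH names GENERATE myfuncs.to_upper(name);

try:
    # Helper module that Pig provides to Python UDFs run via streaming_python.
    from pig_util import outputSchema
except ImportError:
    # Fallback (assumption for this sketch) so the file also runs outside Pig.
    def outputSchema(schema):
        def decorator(func):
            func.outputSchema = schema
            return func
        return decorator


@outputSchema("upper:chararray")
def to_upper(value):
    """Return the upper-cased string, or None for a null input."""
    if value is None:
        return None
    return value.upper()
```

The decorator tells Pig which schema the function returns; the function body itself is ordinary Python.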

Problem:

We have a Pig script that works with two Pig relations, let’s say A and B. A is a small relation, and B is a big one. Our UDF should load all of A into memory on each machine and then use it while processing B. Currently we do it like this:

A = foreach smallRelation Generate ...
B = foreach largeRelation Generate propertyOfB;
store A into 'templocation';
C = foreach B Generate CustomUdf(propertyOfB);

Every machine then loads A from 'templocation'. This works, but we have two problems with it.

How can we load a relation directly into the HDFS cache?
When we reload the file in the UDF, we have to write logic to parse the output of A that was written to the file, when we would rather be using bags and tuples directly (is there a built-in Pig Java function to parse Strings back into Bag/Tuple form?).

Does anyone know how this should be done?

Solution 1:

We can GROUP ALL on A first, which "bags" all the data in A into one field. Then we artificially add a common field to both A and B and join them on it.
This way, for each tuple in the enhanced B, the UDF has the full data of A available.

Here is the code used to load the entire relation into the UDF:

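-- add an artificial join key with value 'xx'
B_aux = FOREACH B GENERATE 'xx' AS join_key, fb1, fb2;
-- collect every tuple of A into a single bag ($1 of the grouped relation)
A_all = GROUP A ALL;
A_aux = FOREACH A_all GENERATE 'xx' AS join_key, $1 AS a_bag;
-- the replicated join keeps the small relation (listed last) in memory
A_B_JOINED = JOIN B_aux BY join_key, A_aux BY join_key USING 'replicated';
-- each tuple of B now carries the full contents of A in a_bag
C = FOREACH A_B_JOINED GENERATE CustomUdf(fb1, fb2, a_bag);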
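
The question does not say what CustomUdf does with A, so the following is only a sketch of how a UDF might consume the bag passed as its last argument. It is written in Python and assumes, purely for illustration, that each tuple of A is a (key, value) pair and that the UDF looks up fb1 in A, falling back to fb2 when there is no match. The name custom_udf and the return schema are hypothetical, and the fallback decorator is only there so the code runs outside Pig.

```python
try:
    # Helper module that Pig provides to Python UDFs run via streaming_python.
    from pig_util import outputSchema
except ImportError:
    # Fallback (assumption for this sketch) so the file also runs outside Pig.
    def outputSchema(schema):
        def decorator(func):
            func.outputSchema = schema
            return func
        return decorator


@outputSchema("result:chararray")
def custom_udf(fb1, fb2, a_bag):
    """Look up fb1 among A's tuples; return fb2 when no key matches."""
    # Assumed here: the bag arrives as a list of tuples, each tuple
    # being a (key, value) pair. An empty or null bag gives no matches.
    lookup = dict((row[0], row[1]) for row in (a_bag or []))
    return lookup.get(fb1, fb2)
```

This rebuilds the lookup table on every call, which keeps the sketch short; a real UDF would probably build it once and reuse it, since the bag is the same for every tuple of B.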