Skewed join in Pig
- Joining skewed data with the Apache Pig skewed join. In a distributed processing environment, data skew is a serious problem. It occurs when the tuples coming out of the map phase are not evenly distributed across the join keys, so a few reducers receive most of the data.
- Apache Pig provides the skewed join to help with data skew in joins.
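- Skewed join works with two-table joins (inner and outer); joining more than two tables in one skewed join is not supported.
- Construct the join with USING 'skewed' to force Pig to use the skewed join, for example: C = JOIN big BY b1, massive BY m1 USING 'skewed';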
pig.skewedjoin.reduce.memusage
- Specifies the fraction of heap available to the reducer to perform the join.
- A low fraction forces Pig to use more reducers but increases the copying cost.
- Parallel joins are vulnerable to the presence of skew in the underlying data.
- If the underlying data is sufficiently skewed, load imbalances swamp the parallelism gains.
- Skewed join does not place a restriction on the size of the input keys.
- It accomplishes this by splitting one of the inputs on the join predicate and streaming the other input.
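To make the load-imbalance point concrete, here is a minimal Python sketch (not Pig code) of what happens when every record with the same key is sent to a single reducer. The function name, the toy data, and the use of `key % num_reducers` as a stand-in for a hash partitioner are all assumptions made for illustration.

```python
from collections import Counter

def hash_partition_load(keys, num_reducers):
    """Count records per reducer when each key is pinned to one reducer."""
    load = Counter()
    for key in keys:
        # Stand-in for a hash partitioner: same key always maps to the same reducer.
        load[key % num_reducers] += 1
    return [load[r] for r in range(num_reducers)]

# Toy input: key 0 is "hot" (900 records), keys 1..100 appear once each.
keys = [0] * 900 + list(range(1, 101))
loads = hash_partition_load(keys, num_reducers=4)
print(loads)  # [925, 25, 25, 25]
```

One reducer ends up with 925 of the 1000 records, so adding more reducers would barely shorten the job. This is the situation the skewed join is meant to address.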
Implementation:
- Skewed join translates into two map/reduce jobs.
- The first job samples the input records and computes a histogram of the underlying key space.
- The second job partitions the input table and performs a join on the predicate.
- To join the two tables, one table is partitioned and the other is streamed to the reducers.
- The map task of the join job uses the pig.keydist file to determine the number of reducers per key.
- It then sends the key to each of those reducers in round-robin (RR) fashion. The skewed join itself happens in the reduce phase of the join job.
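The two jobs described above can be sketched in plain Python. This is a simplified simulation under stated assumptions, not Pig's actual implementation: the function names, the `max_records_per_reducer` threshold, and the in-memory dictionary standing in for the pig.keydist file are all hypothetical, and the "sample" here counts every record rather than a fraction of them.

```python
import math
from collections import Counter
from itertools import cycle

def sample_histogram(left_keys):
    """Job 1 (sketch): histogram of join keys in the partitioned input."""
    # Assumption: counts every record; a real sampler reads only a subset.
    return Counter(left_keys)

def reducers_per_key(histogram, max_records_per_reducer, num_reducers):
    """Hypothetical stand-in for the key distribution file: key -> reducer ids."""
    assignment = {}
    next_reducer = 0
    for key, count in sorted(histogram.items()):
        # A hot key gets as many reducers as it needs, capped at the total.
        needed = min(num_reducers, math.ceil(count / max_records_per_reducer))
        assignment[key] = [(next_reducer + i) % num_reducers for i in range(needed)]
        next_reducer = (next_reducer + needed) % num_reducers
    return assignment

def partition_left(left_keys, assignment):
    """Job 2, map side (sketch): spread each key's records round robin."""
    cursors = {key: cycle(reducers) for key, reducers in assignment.items()}
    load = Counter()
    for key in left_keys:
        load[next(cursors[key])] += 1
    return load

def route_right(key, assignment):
    """Streamed input (sketch): a row must reach every reducer holding its key."""
    return assignment[key]

# Same toy data as a skewed input: key 0 has 900 records, keys 1..100 one each.
keys = [0] * 900 + list(range(1, 101))
assignment = reducers_per_key(sample_histogram(keys), 250, num_reducers=4)
load = partition_left(keys, assignment)
print([load[r] for r in range(4)])  # [250, 250, 250, 250]
```

In this sketch the hot key is spread over all four reducers, so each reducer handles 250 records. The price is visible in `route_right`: a streamed row whose key was split has to be copied to every reducer assigned to that key, which is one way to read the copying cost mentioned for low pig.skewedjoin.reduce.memusage values.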