Apache Pig with Apache Tez
![MapReduce vs Tez](https://wikitechy.com/tutorials/apache-pig/img/apache-pig-images/mapreduce-vs-tez.png)
Tez DAG - Directed Acyclic Graph
![Pig Tez DAG](https://wikitechy.com/tutorials/apache-pig/img/apache-pig-images/pig-tez-dag.png)
- 2 DISTINCT + JOIN + 2 GROUP BY
![Tez directed acyclic graph](https://wikitechy.com/tutorials/apache-pig/img/apache-pig-images/tez-directed-acyclic-graph.png)
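A script with the operator mix named above (2 DISTINCT, a JOIN, and 2 GROUP BY) might look like the following sketch. The input paths, relation names, and fields are hypothetical and are not taken from the figures; only the sequence of operators matters.

```pig
-- Hypothetical inputs and field names, used only to show the operator mix.
users = load 'users' as (user_id:chararray, country:chararray);
clicks = load 'clicks' as (user_id:chararray, url:chararray);

-- 2 DISTINCT
u = distinct users;
c = distinct clicks;

-- JOIN
j = join u by user_id, c by user_id;

-- 2 GROUP BY, both reading the same join output
by_country = group j by u::country;
country_counts = foreach by_country generate group as country, COUNT(j) as n;
by_url = group j by c::url;
url_counts = foreach by_url generate group as url, COUNT(j) as n;

store country_counts into 'out_by_country';
store url_counts into 'out_by_url';
```

Run under Tez (for example with `pig -x tez`), a script of this shape can typically be planned as one DAG instead of a chain of separate MapReduce jobs.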
High Depth DAG - Directed Acyclic Graph
![Tez high-depth directed acyclic graph](https://wikitechy.com/tutorials/apache-pig/img/apache-pig-images/tez-high-depth-directed-acyclic-graph.png)
Wide DAG - Directed Acyclic Graph
![Tez wide directed acyclic graph](https://wikitechy.com/tutorials/apache-pig/img/apache-pig-images/tez-wide-directed-acyclic-graph.png)
Disjoint Trees DAG - Directed Acyclic Graph
![Tez disjoint trees directed acyclic graph](https://wikitechy.com/tutorials/apache-pig/img/apache-pig-images/tez-disjoint-trees-acyclic-graph.png)
Bloom Filter in Tez
![Apache Tez Bloom filter](https://wikitechy.com/tutorials/apache-pig/img/apache-pig-images/apache-tez-bloom-filter.png)
Pig Script - Bloom UDF
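-- BuildBloom arguments as given here; verify their order and meaning against the BuildBloom Javadoc for your Pig version.
define bb BuildBloom('128', '3', 'jenkins');
small = load 'S' as (x, y, z);
grpd = group small all;
-- Build one bloom filter over all values of small.x.
fltrd = foreach grpd generate bb(small.x);
store fltrd into 'mybloom';
-- Run the statements above now so that 'mybloom' exists before Bloom() reads it.
exec;
define bloom Bloom('mybloom');
large = load 'L' as (a, b, c);
-- Drop rows of large whose key cannot match any small.x.
flarge = filter large by bloom(a);
joined = join small by x, flarge by a;
store joined into 'results';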
Pig Script - Bloom Join
large = load 'L' as (a, b, c);
small = load 'S' as (x, y, z);
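-- The bloom filter is built from the join keys of small (the right-most relation) and applied to large before the join.
joined = join large by a, small by x using 'bloom';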
store joined into 'results';
Bloom Filter Tuning
- pig.bloomjoin.vectorsize.bytes
  - The size in bytes of the bit vector used for the bloom filter.
  - A larger vector is needed as the number of distinct keys grows. The default is 1048576 (1 MB).
- pig.bloomjoin.hash.type
  - The type of hash function to use.
  - Valid values are 'jenkins' and 'murmur'. The default is 'murmur'.
- pig.bloomjoin.hash.functions
  - The number of hash functions used in the bloom computation.
  - It affects the probability of false positives: more hash functions generally mean fewer false positives, but too high a value increases CPU time.
  - The default is 3.
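The three settings above can be applied at the top of a script with Pig's `set` command. This is a minimal sketch, assuming the property names `pig.bloomjoin.vectorsize.bytes`, `pig.bloomjoin.hash.type`, and `pig.bloomjoin.hash.functions` from the Pig performance documentation. The values are illustrative rather than recommendations, and the relations are the ones from the Bloom Join script.

```pig
-- Illustrative values only; tune them against your own key counts.
-- 2 MB bit vector instead of the 1 MB default.
set pig.bloomjoin.vectorsize.bytes 2097152;
-- Hash function: 'jenkins' or 'murmur'.
set pig.bloomjoin.hash.type 'murmur';
-- Number of hash functions applied to each key.
set pig.bloomjoin.hash.functions 3;

large = load 'L' as (a, b, c);
small = load 'S' as (x, y, z);
joined = join large by a, small by x using 'bloom';
store joined into 'results';
```

For a fixed number of keys, a larger vector should lower the false-positive rate, at the cost of a bigger filter to build and distribute.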
Apache Pig Hash Join
![Apache Pig hash join](https://wikitechy.com/tutorials/apache-pig/img/apache-pig-images/pig-hash-join.png)
Apache Tez - Bloom Join - Map Strategy
![Apache Tez Bloom join](https://wikitechy.com/tutorials/apache-pig/img/apache-pig-images/apache-tez-bloom-join.png)
Apache Pig - Apache Tez - Bloom Join - Reduce Strategy
![Apache Pig on Apache Tez reduce strategy](https://wikitechy.com/tutorials/apache-pig/img/apache-pig-images/apache-pig-apache-tez-reduce-strategy.png)
Apache Tez - Partitioned Bloom Filters
![Apache Tez partitioned Bloom filters](https://wikitechy.com/tutorials/apache-pig/img/apache-pig-images/apache-tez-partitioned-bloom-filters.png)
Apache Pig - Apache Tez - Bloom Join - Execution Tuning
- pig.bloomjoin.strategy
  - Valid values are 'map' and 'reduce'. The default is 'map'.
  - The map strategy creates bloom filters in each map task and combines them in the reducer. It is fast and ideal for small to medium datasets or distinct join keys.
  - The reduce strategy sends the join keys to a reducer and creates the bloom filter there. It is ideal for large datasets or repeating join keys.
- pig.bloomjoin.num.filters
  - The number of bloom filters that will be created.
  - That many reducers are used to create the bloom filters in parallel.
  - The default is 1 for the map strategy and 11 for the reduce strategy.
- pig.bloomjoin.nocombiner
  - Turns off the combiner with the reduce strategy when the keys are mostly distinct.
  - The default is false.
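The execution settings above can be combined with a bloom join as in the following sketch. It assumes the property names `pig.bloomjoin.strategy`, `pig.bloomjoin.num.filters`, and `pig.bloomjoin.nocombiner` from the Pig performance documentation; the chosen values are illustrative only.

```pig
-- Illustrative values only.
-- Build the filter on the reduce side, for large inputs or repeating join keys.
set pig.bloomjoin.strategy 'reduce';
-- Number of bloom filters, built in parallel by that many reducers (11 is the reduce-strategy default).
set pig.bloomjoin.num.filters 11;
-- Keep the combiner on; set this to true when join keys are mostly distinct.
set pig.bloomjoin.nocombiner false;

large = load 'L' as (a, b, c);
small = load 'S' as (x, y, z);
joined = join large by a, small by x using 'bloom';
store joined into 'results';
```

Leaving all three properties unset keeps the map strategy with a single filter, which matches the defaults listed above.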