pig tutorial - apache pig tutorial - Apache Pig - Join Operator - pig latin - apache pig - pig hadoop
What is Join?
- JOIN keyword is used to combine rows from two or more logs, based on a common fields .
- Left join returns all logs from left log(table) and matching logs from the right log.

Learn Apache Pig - Apache Pig tutorial - customer id column - Apache Pig examples - Apache Pig programs
Pig Operations - Joining

- Many different join implementations
- Left, right and full outer joins are supported
- Joining on multiple keys is supported
- sets are pre-sorted by the join key
- sets are pre-sorted and one set has few ( < 1% of its total) matching keys
- one set is very large, while other sets are small enough to fit into memory
- when a large number of records for some values of the join key is expected
- Our classic database operator for relations!

- Our classic database operator for relations!

How to use join operator in Apache Pig ?
- The JOIN operator is used to combine records from two or more relations. While performing a join operation, we declare one (or a group of) tuple(s) from each relation, as keys. When these keys match, the two particular tuples are matched, else the records are dropped.
- Self-join
- Inner-join
- Outer-join − left join, right join, and full join
wikitechy_customers.txt
1,Raj,32,Ahmedabad,2000.00
2,Keerthana,25,Delhi,1500.00
3,karthi,23,Kota,2000.00
4,Shyam,25,Mumbai,6500.00
5,Hari,27,Bhopal,8500.00
6,Kannan,22,MP,4500.00
7,Mukil,24,Indore,10000.00
orders.txt
102,2017-01-08 00:00:00,3,3000
100,2017-01-08 00:00:00,3,1500
101,2017-02-20 00:00:00,2,1560
103,2016-05-20 00:00:00,4,2060
And these two files into Pig with the relations wikitechy_customers and orders as given below.
grunt> wikitechy_customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
as (id:int, name:chararray, age:int, address:chararray, salary:int);
grunt> orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',')
as (oid:int, date:chararray, customer_id:int, amount:int);
- To perform various Join operations on these two relations.
Self - join
- Self-join is used to join a table with itself as if the table were two relations, temporarily renaming at least one relation.
- Usually, in Apache Pig, to perform self-join, we will load the same data multiple times, under different aliases (names).
- Therefore load the contents of the file wikitechy_customers.txt as two tables as given below.
grunt> customers1 = LOAD 'hdfs://localhost:9000/pig_data/wikitechy_customers.txt' USING PigStorage(',')
as (id:int, name:chararray, age:int, address:chararray, salary:int);
grunt> customers2 = LOAD 'hdfs://localhost:9000/pig_data/wikitechy_customers.txt' USING PigStorage(',')
as (id:int, name:chararray, age:int, address:chararray, salary:int);
Syntax
- Given below is the syntax of performing self-join operation using the JOIN operator.
grunt> Relation3_name = JOIN Relation1_name BY key, Relation2_name BY key ;
Example
- Let us perform self-join operation on the relation customers, by joining the two relations customers1 and customers2 as given below.
grunt> customers3 = JOIN customers1 BY id, customers2 BY id;
Verification
- Now verify the relation customers3 using the DUMP operator as given below.
grunt> Dump customers3;
Output
- The following output, displaying the contents of the relation customers
(1,Raj,32,Ahmedabad,2000,1,Raj,32,Ahmedabad,2000)
(2,Keerthana,25,Delhi,1500,2,Keerthana,25,Delhi,1500)
(3,karthi,23,Kota,2000,3,karthi,23,Kota,2000)
(4,Shyam,25,Mumbai,6500,4,Shyam,25,Mumbai,6500)
(5,Hari,27,Bhopal,8500,5,Hari,27,Bhopal,8500)
(6,Kannan,22,MP,4500,6,Kannan,22,MP,4500)
(7,Mukil,24,Indore,10000,7,Mukil,24,Indore,10000)
Inner Join
- Inner Join is used quite frequently; it is also referred to as equijoin. An inner join returns rows when there is a match in both tables.
- It creates a new relation by combining column values of two relations (say A and B) based upon the join-predicate.
- The query compares each row of A with each row of B to find all pairs of rows which satisfy the join-predicate.
- When the join-predicate is satisfied, the column values for each matched pair of rows of A and B are combined into a result row.

Learn Apache Pig - Apache Pig tutorial - inner join - Apache Pig examples - Apache Pig programs
Syntax
- Here is the syntax of performing inner join operation using the JOIN operator.
grunt> result = JOIN relation1 BY columnname, relation2 BY columnname;
Example
- Let us perform inner join operation on the two relations wikitechy_customers and orders as given below.
grunt> wikitechy_coustomer_orders = JOIN customers BY id, orders BY customer_id;
Verification
- Verify the relation wikitechy_coustomer_orders using the DUMP operator as given below.
grunt> Dump wikitechy_coustomer_orders;
Output
- The following output that will the contents of the relation named wikitechy_coustomer_orders.
(2,Keerthana,25,Delhi,1500,101,2017-01-20 00:00:00,2,1560)
(3,karthi,23,Kota,2000,100,2017-01-08 00:00:00,3,1500)
(3,karthi,23,Kota,2000,102,2017-02-08 00:00:00,3,3000)
(4,Shyam,25,Mumbai,6500,103,2016-05-20 00:00:00,4,2060)
Note
- Outer Join: Unlike inner join, outer join returns all the rows from at least one of the relations. An outer join operation is carried out in three ways ,
- Left outer join
- Right outer join
- Full outer join
Left Outer Join
- The left outer Join operation returns all rows from the left table, even if there are no matches in the right relation.

Learn Apache Pig - Apache Pig tutorial - left outer join - Apache Pig examples - Apache Pig programs
Syntax
- Let us perform left outer join operation using the JOIN operator.
grunt> Relation3_name = JOIN Relation1_name BY id LEFT OUTER, Relation2_name BY customer_id;
Example
- Let us perform left outer join operation on the two relations customers and orders as given below.
grunt> outer_left = JOIN customers BY id LEFT OUTER, orders BY customer_id;
Verification
- Verify the relation outer_left using the DUMP operator as given below.
grunt> Dump outer_left;
Output
- The following output, displaying the contents of the relation outer_left.
(1,Raj,32,Ahmedabad,2000,,,,)
(2,Keerthana,25,Delhi,1500,101,2017-01-20 00:00:00,2,1560)
(3,karthi,23,Kota,2000,100,2017-01-08 00:00:00,3,1500)
(3,karthi,23,Kota,2000,102,2017-02-08 00:00:00,3,3000)
(4,Shyam,25,Mumbai,6500,103,2016-05-20 00:00:00,4,2060) (5,Hari,27,Bhopal,8500,,,,)
(6,Kannan,22,MP,4500,,,,)
(7,Mukil,24,Indore,10000,,,,)
Right Outer Join
- The right outer join operation returns all rows from the right table, even if there are no matches in the left table.

Learn Apache Pig - Apache Pig tutorial - right outer join - Apache Pig examples - Apache Pig programs
Syntax
- Given below is the syntax of performing right outer join operation using the JOIN operator.
grunt> outer_right = JOIN customers BY id RIGHT, orders BY customer_id;
Example
- Let us perform right outer join operation on the two relations wikitechy_customers and orders as given below.
grunt> outer_right = JOIN customers BY id RIGHT, orders BY customer_id;
Verification
- Verify the relation outer_right using the DUMP operator as given below.
grunt> Dump outer_right
Output
- The following output, displaying the contents of the relation outer_right.
(2,Keerthana,25,Delhi,1500,101,2017-01-20 00:00:00,2,1560)
(3,karthi,23,Kota,2000,100,2017-01-08 00:00:00,3,1500)
(3,karthi,23,Kota,2000,102,2017-02-08 00:00:00,3,3000)
(4,Shyam,25,Mumbai,6500,103,2016-05-20 00:00:00,4,2060)
- The full outer join operation returns rows when there is a match in one of the relations

Learn Apache Pig - Apache Pig tutorial - full join new - Apache Pig examples - Apache Pig programs
Syntax
- Given below is the syntax of performing full outer join using the JOIN operator.
grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;
Example
- Let us perform full outer join operation on the two relations wikitechy_customers and orders as given below.
grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;
Verification
- Verify the relation outer_full using the DUMP operator as given below.
grun> Dump outer_full;
Output
- The following output, displaying the contents of the relation outer_full.
(1,Raj,32,Ahmedabad,2000,,,,)
(2,Keerthana,25,Delhi,1500,101,2017-01-20 00:00:00,2,1560)
(3,karthi,23,Kota,2000,100,2017-01-08 00:00:00,3,1500)
(3,karthi,23,Kota,2000,102,2017-02-08 00:00:00,3,3000)
(4,Shyam,25,Mumbai,6500,103,2016-05-20 00:00:00,4,2060) (5,Hari,27,Bhopal,8500,,,,)
(6,Kannan,22,MP,4500,,,,)
(7,Mukil,24,Indore,10000,,,,)
Using Multiple Keys
- Let us perform JOIN operation using multiple keys are given below.
Syntax
- Here is how you can perform a JOIN operation on two tables using multiple keys.
grunt> Relation3_name = JOIN Relation2_name BY (key1, key2), Relation3_name BY (key1, key2);
- Ensure that we have two files namely wikitechy_employee.txt and wikitechy_employee_contact.txt in the /pig_data/ directory of HDFS as given below.
wikitechy_employee.txt
111,Anu,Shankar,21,programmer,003
112,Barvathi,Nambiayar,22,programmer,003
113,Kajal,Nayak,22,programmer,003
114,Preethi,Antony,21,programmer,003
115,Raj,Gopal,23,programmer,003
116,Yashika,Kannan,23,programmer,003
117,siddu,Narayanan,24,teamlead,002
118,Timple,Mohanthy,24,manager,001
wikitechy_employee_contact.txt
111,9876543210,anu@gmail.com,Hyderabad,003
112,9876543211,barvathi@gmail.com,Kolkata,003
113,9876543212,kajal@gmail.com,Delhi,003
114,9876543213,preethi@gmail.com,Pune,003
115,9876543214,raj@gmail.com,Bhuwaneshwar,003
116,9876543215,yashika@gmail.com,Chennai,003
117,9876543216,siddu@gmail.com,trivendram,002
118,9876543217,timple@gmail.com,Chennai,001
- And we have loaded these two files into Pig with relations wikitechy_employee and wikitechy_employee_contact as given below.
grunt> wikitechy_employee_contact = LOAD 'hdfs://localhost:9000/pig_data/employee_contact.txt' USING PigStorage(',')
as (id:int, phone:chararray, email:chararray, city:chararray, jobid:int);
- Here join the contents of these two relations using the JOIN operator as given below.
grunt> emp = JOIN wikitechy_employee BY (id,jobid), wikitechy_employee_contact BY (id,jobid);
Verification
- To verify the relation emp using the DUMP operator as shown below.
grunt> Dump emp;
Output
- The following output, displaying the contents of the relation named emp as given below.
(111,Anu,Shankar,21,programmer,113,111,9876543210,anu@gmail.com,Hyderabad,113)
(112,Barvathi,Nambiayar,22,programmer,113,112,9876543211,barvathi@gmail.com,Kolka ta,113)
(113,Kajal,Nayak,22,programmer,113,113,9876543212,kajal@gmail.com,Delhi,113)
(114,Preethi,Antony,21,programmer,113,114,9876543213,preethi@gmail.com,Pune,113)
(115,Raj,Gopal,23,programmer,113,115,9876543214,raj@gmail.com,Bhuwaneshw ar,113)
(116,Yashika,Kannan,23,programmer,113,116,9876543215,yashika@gmail.com,Chennai,113)
(117,siddu,Narayanan,24,teamlead,112,117,9876543216,siddu@gmail.com,trivendram,112)
(118,Timple,Mohanthy,24,manager,111,118,9876543217,timple@gmail.com,Chennai,111)