pig tutorial - apache pig tutorial - Pig Example - Apache pig example - pig latin - apache pig - pig hadoop

Pig is a high level scripting language that is used with Apache Hadoop. Pig excels at describing data analysis problems as data flows. Pig is complete in that you can do all the required data manipulations in Apache Hadoop with Pig.

Problem Statement : 1

Example: movies_data.csv

1,Dhadakebaz,1986,3.2,7560
2,Dhumdhadaka,1985,3.8,6300
3,Ashi hi banva banvi,1988,4.1,7802
4,Zapatlela,1993,3.7,6022
5,Ayatya Gharat Gharoba,1991,3.4,5420
6,Navra Maza Navsacha,2004,3.9,4904
7,De danadan,1987,3.4,5623
8,Gammat Jammat,1987,3.4,7563
9,Eka peksha ek,1990,3.2,6244
10,Pachhadlela,2004,3.1,6956

Load data

$ pig -x local
grunt> movies = LOAD 'movies_data.csv' USING PigStorage(',') as (id,name,year,rating,duration)
grunt> dump movies; it displays the contents

Filter data

grunt> movies_greater_than_35 = FILTER movies BY (float)rating > 3.5;
grunt> dump movies_greater_than_35;

learn apache pig - apache pig tutorial - pig tutorial - apache pig examples - big data - apache pig script - apache pig program - apache pig download - apache pig example - apache grunt filter data

Store the results data from pig

grunt> store movies_greater_than_35 into 'my_movies';
It stores the result in local file system directory named 'my_movies'.

Display the result

Now display the result from local file system.

cat my_movies/part-m-00000

Load command

The load command specified only the column names. We can modify the statement as follows to include the data type of the columns:

  grunt> movies = LOAD 

'movies_data.csv' USING

PigStorage(',') as (id:int,

name:chararray, year:int,

rating:double, duration:int);

Check the filters

List the movies that were released between 1950 and 1960

 grunt> movies_between_90_95 = FILTER

      movies by year > 1990 and year < 1995;

List the movies that start with the Alpahbet D

 grunt> movies_starting_with_D = FILTER   
       movies by name matches 'D.*';

List the movies that have duration greater that 2 hours

grunt> movies_duration_2_hrs = FILTER 
movies by duration > 7200;

Pig Code Output

Foreach statement in Pig

FOREACH gives a simple way to apply transformations based on columns. Let’s understand this with an example.

List the movie names its duration in minutes

grunt> movie_duration = FOREACH movies
GENERATE name, (double)(duration/60);

The above statement generates a new alias that has the list of movies and it duration in minutes.

You can check the results using the DUMP command.

Output of the foreach statement in PIG

Group Statement in PIG

The GROUP keyword is used to group fields in a relation.

List the years and the number of movies released each year.

grunt> grouped_by_year = group movies
by year;
grunt> count_by_year = FOREACH
grouped_by_year GENERATE group,
COUNT(movies);

Order by in PIG Statement

Let us question the data to illustrate the ORDER BY operation.

List all the movies in the ascending order of year.

grunt> desc_movies_by_year = ORDER
movies BY year ASC;
grunt> DUMP desc_movies_by_year;

List all the movies in the descending order of year.

grunt> asc_movies_by_year = ORDER movies
by year DESC;
grunt> DUMP asc_movies_by_year;

Ascending by year

Limit operator in pig

Use the LIMIT keyword to get only a limited number for results from relation.

grunt> top_5_movies = LIMIT movies 5;
grunt> DUMP top_10_movies;

Pig: Modes of Execution

Pig programs can be run in three methods which work in both local and MapReduce mode. They are

Script Mode
Grunt Mode
Embedded Mode

Script mode in pig

Script Mode or Batch Mode: In script mode, pig runs the commands specified in a script file. The following example shows how to run a pig programs from a script file:

$ vim scriptfile.pig
A = LOAD 'script_file';
DUMP A;
$ pig x
local scriptfile.pig

Grunt mode in pig

Grunt Mode or Interactive Mode: The grunt mode can also be called as interactive mode. Grunt is pig's interactive shell.

It is started when no file is specified for pig to run.

$ pig -x local
grunt> A = LOAD 'grunt_file';
grunt> DUMP A;

You can also run pig scripts from grunt using run and exec commands.

grunt> run scriptfile.pig
grunt> exec scriptfile.pig

Embedded mode in PIG

You can embed pig programs in Java, python and ruby and can run from the same.

Problem Statement 2:

Using Pig find the most occurred start letter.

Solution:

Case 1:

Load the data into bag named "lines". The entire line is stuck to element line of type character array.

grunt> lines  = LOAD "/user/Desktop/data.txt" AS (line: chararray);  

Case 2:

The text in the bag lines needs to be tokenized this produces one word per row.

grunt>tokens = FOREACH lines GENERATE flatten(TOKENIZE(line))   As token: chararray;

Case 3:

To retain the first letter of each word type the below command .This commands uses substring method to take the first character.

grunt>letters = FOREACH tokens  GENERATE SUBSTRING(0,1)   as letter : chararray;  

Case 4:

Create a bag for unique character where the grouped bag will contain the same character for each occurrence of that character.

grunt>lettergrp = GROUP letters by letter;

Case 5:

The number of occurrence is counted in each group.

grunt>countletter  = FOREACH  lettergrp  GENERATE group  , COUNT(letters);

Case 6:

Arrange the output according to count in descending order using the commands below.

grunt>OrderCnt = ORDER countletter  BY  $1  DESC;  

Case 7:

Limit to One to give the result.

grunt> result  =LIMIT    OrderCnt    1;  

Case 8:

Store the result in HDFS . The result is saved in output directory under sonoo folder.

grunt> STORE   result   into 'home/sonoo/output';  

pig tutorial - apache pig tutorial - Pig Example - Apache pig example - pig latin - apache pig - pig hadoop

What is pig?

Problem Statement : 1

Store the results data from pig

Display the result

Load command

Check the filters

Pig Code Output

Foreach statement in Pig

Output of the foreach statement in PIG

Group Statement in PIG

Order by in PIG Statement

Pig: Modes of Execution

Script mode in pig

Grunt mode in pig

Embedded mode in PIG

Problem Statement 2:

Solution:

Case 1:

Case 2:

Case 3:

Case 4:

Case 5:

Case 6:

Case 7:

Case 8:

Related Searches to Pig Example

Wikitechy

Workshop

Join our Community

Other Languages

pig tutorial - apache pig tutorial - Pig Example - Apache pig example - pig latin - apache pig - pig hadoop

What is pig?

Problem Statement : 1

Store the results data from pig

Display the result

Load command

Check the filters

Pig Code Output

Foreach statement in Pig

Output of the foreach statement in PIG

Group Statement in PIG

Order by in PIG Statement

Pig: Modes of Execution

Script mode in pig

Grunt mode in pig

Embedded mode in PIG

Problem Statement 2:

Solution:

Case 1:

Case 2:

Case 3:

Case 4:

Case 5:

Case 6:

Case 7:

Case 8:

Related Searches to Pig Example

Summer Offline Internship

Summer Online Internship

Internship in Chennai

Programming / Technology Internship in Chennai

Wikitechy

Workshop

Join our Community

Other Languages