what is apache pig - apache pig tutorial - What is Apache Pig - pig latin - apache pig - pig hadoop

What is Big Data?

Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.

Systems and enterprises generate huge amounts of data, from terabytes to even petabytes of information.

It is very difficult to manage data at this scale.

Hadoop and its Characteristics

Apache Hadoop is a framework that allows the distributed processing of large data sets across clusters of commodity computers using a simple programming model.

It is an open-source data management technology with scale-out storage and distributed processing.

Hadoop Ecosystem

Need for Pig

Where to use Pig?

Pig is a data flow language, so it is most suitable for:

Quickly changing data processing requirements
Processing data from multiple channels
Quick hypothesis testing
Time-sensitive data refreshes
Data profiling using sampling

What is Pig?

It is an open source data flow language.

Pig Latin is used to express queries and data manipulation operations in simple scripts.

Pig converts the scripts into a sequence of underlying MapReduce jobs.

What does it mean to be Pig?

Pigs Eat Everything

Pig can operate on data whether it has metadata or not. It can operate on data that is relational, nested, or unstructured. And it can easily be extended to operate on data beyond files, including key/value stores, databases, etc.

Pigs Live Everywhere

Pig is intended to be a language for parallel data processing. It is not tied to one particular parallel framework; for example, Pig can also run on Apache Tez.

Pigs Are Domestic Animals

Pig is designed to be easily controlled and modified by its users.
Pig allows integration of user code wherever possible, so it currently supports user-defined field transformation functions, user-defined aggregates, and user-defined conditionals.
Pig supports user-provided load and store functions.
It supports external executables via its STREAM command and MapReduce jars via its MAPREDUCE command.
It allows users to provide a custom partitioner for their jobs in some circumstances and to set the level of reduce parallelism for their jobs.

Pigs Fly

Pig processes data quickly. Its designers want to consistently improve its performance, and not implement features in ways that weigh Pig down so it can't fly.

Apache Pig - Platforms

A platform that makes analyzing large data sets easier

Pig Latin: a simple but powerful data flow language, similar to scripting languages
Pig Latin is a high-level, easy-to-understand data flow programming language
Provides common data operations (e.g. filters, joins, ordering) and nested types (e.g. tuples, bags, maps)
It is more natural for analysts than MapReduce
Opens Hadoop to non-Java programmers
Pig Engine: parses, optimizes, and automatically executes Pig Latin scripts as a series of MapReduce jobs on a Hadoop cluster

Where does Pig live

Pig is installed on the user's machine.

There is no need to install anything on the Hadoop cluster.

Pig and Hadoop versions must be compatible.

Pig submits jobs to the Hadoop cluster, where they are executed.

How does Pig work?

Apache pig - data model

Tuple

A record that is formed by an ordered set of fields is known as a tuple, the fields can be of any type. A tuple is similar to a row in a table of RDBMS.
Example: (wikitechy, 30)

Bag

A bag is an unordered set of tuples. In other words, a collection of tuples (non-unique) is known as a bag. Each tuple can have any number of fields (flexible schema).
A bag is represented by ‘{}’. It is similar to a table in an RDBMS, but unlike a table in an RDBMS, it is not necessary that every tuple contain the same number of fields or that the fields in the same position (column) have the same type.
Example: {(Raja, 30), (Mohammad, 45)}
A bag can be a field in a relation; in that context, it is known as an inner bag.
Example: (wikitechy, 30, {(984xxxxx338, wikitechy.com@gmail.com)})

Relation

A relation is a bag of tuples. The relations in Pig Latin are unordered (there is no guarantee that tuples are processed in any particular order).

Map

A map (or data map) is a set of key-value pairs. The key needs to be of type chararray and should be unique. The value might be of any type. It is represented by ‘[]’
Example: [name#wikitechy, age#30]
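
As a rough illustration of how these types nest, the sketch below models them with plain Python values. The mapping used here (Pig tuple to Python tuple, bag to list, map to dict) and the extra sample rows are assumptions made only for illustration; Pig's own types are implemented in Java and behave differently in detail.

```python
# Illustrative model of Pig's data types using plain Python values.
# All Python names below are hypothetical; they are not part of Pig.

# Tuple: an ordered set of fields, which may be of any type.
person = ("wikitechy", 30)

# Bag: a collection of tuples; duplicates are allowed and tuples
# need not have the same number of fields.
people = [("Raja", 30), ("Mohammad", 45), ("Raja", 30), ("Anonymous",)]

# Map: key-value pairs with unique string (chararray) keys.
details = {"name": "wikitechy", "age": 30}

# Inner bag: a bag used as a field inside a tuple.
contact = ("wikitechy", 30, [("984xxxxx338", "wikitechy.com@gmail.com")])


def field_counts(bag):
    """Return the number of fields in each tuple of a bag."""
    return [len(t) for t in bag]


print(field_counts(people))
```

The varying field counts show the flexible schema described above: nothing forces every tuple in a bag to have the same shape.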

Internalizing Pig

Let’s find the people who, “overall”, visit “highly ranked” pages.
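
One way to read this task is: join each visit to the rank of the visited page, group by user, average the ranks, and keep the users whose average is above a cutoff. The plain-Python sketch below follows that reading. The relation layouts (`visits` as user and URL, `pages` as URL and rank), the sample rows, and the 0.5 threshold are all assumptions for illustration and do not come from the original example.

```python
from collections import defaultdict


def visitors_of_top_pages(visits, pages, threshold=0.5):
    """Return users whose average page rank over their visits exceeds threshold."""
    rank_of = dict(pages)                  # url -> rank
    ranks_by_user = defaultdict(list)
    for user, url in visits:
        if url in rank_of:                 # inner join on url
            ranks_by_user[user].append(rank_of[url])
    # "Overall" is read as the average; "highly ranked" as above the threshold.
    return sorted(
        user
        for user, ranks in ranks_by_user.items()
        if sum(ranks) / len(ranks) > threshold
    )


# Hypothetical sample data.
visits = [("amy", "a.com"), ("amy", "b.com"), ("fred", "c.com"), ("fred", "b.com")]
pages = [("a.com", 0.9), ("b.com", 0.7), ("c.com", 0.2)]

print(visitors_of_top_pages(visits, pages))  # ['amy']
```

In Pig Latin the same flow would typically be written with the JOIN, GROUP, FOREACH, and FILTER operators described later on this page.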

Pig in real time

Since Pig is a data flow language, it is naturally suited to:

Data factory operations
Typically, data is brought from multiple servers into HDFS
Pig is used for cleaning the data and preprocessing it
It helps data analysts and researchers quickly prototype their theories
Since Pig is extensible, it becomes much easier for data analysts to run programs written in scripting languages (such as Ruby or Python) against large data sets

Ways to Handle Pig

Grunt Mode:

The interactive mode of Pig
Very useful for testing, syntax checking, and ad hoc data exploration

Script Mode:

Runs a set of instructions from a file
Similar to a SQL script file

Embedded Mode:

Executes Pig programs from a Java program
Suitable for creating Pig scripts on the fly

Modes of Pig

All of the different Pig invocations can run in the following modes:

Local

In this mode, the entire Pig job runs as a single JVM process
Reads and stores data on the local file system

/* local mode */
pig -x local …
java -cp pig.jar org.apache.pig.Main -x local …

Map Reduce

In this mode, a Pig job runs as a series of MapReduce jobs
Input and output paths are assumed to be HDFS paths

/* mapreduce mode (the default when -x is omitted) */
pig …
pig -x mapreduce …
java -cp pig.jar org.apache.pig.Main ...
java -cp pig.jar org.apache.pig.Main -x mapreduce ...

Pig Components

Working with Data in pig

Pig Programs Execution

Pig is a wrapper on top of the MapReduce layer.

It parses, optimizes, and converts the Pig script into a series of MapReduce jobs.

Apache Pig Sample Script

LOAD

Loads data from the file system.
LOAD 'data' [USING function] [AS schema];
If you specify a directory name, all the files in the directory are loaded.
A = LOAD 't.txt' USING PigStorage('\t') AS (f1:int, f2:int);

STORE

Stores or saves results to the file system.
STORE alias INTO 'directory' [USING function];
A = LOAD 't.txt' USING PigStorage('\t') AS (f1:int, f2:int);
-- 'output' is a placeholder directory name
STORE A INTO 'output' USING PigStorage('*');

LIMIT

Limits the number of output tuples.
alias = LIMIT alias n;
A = LOAD 't.txt' USING PigStorage('\t') AS (f1:int, f2:int);
B = LIMIT A 5;

FILTER

Selects tuples from a relation based on some condition.
alias = FILTER alias BY expression;
A = LOAD 't.txt' USING PigStorage('\t') AS (f1:int, f2:int);
B = FILTER A BY f2 > 2;
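
To make the effect of LOAD with a schema, FILTER, and LIMIT concrete, here is a small plain-Python sketch that mimics the three steps on in-memory lines. The helper names (`load`, `filter_by`, `limit`) and the sample rows are hypothetical, and the sketch ignores everything Pig does around distribution and error handling.

```python
def load(lines, delimiter="\t", schema=(int, int)):
    """Split each line on the delimiter and cast fields, like LOAD ... AS (f1:int, f2:int)."""
    return [
        tuple(cast(value) for cast, value in zip(schema, line.split(delimiter)))
        for line in lines
    ]


def filter_by(relation, predicate):
    """Keep only the tuples for which the predicate holds, like FILTER ... BY."""
    return [t for t in relation if predicate(t)]


def limit(relation, n):
    """Keep at most n tuples, like LIMIT. Pig does not promise which tuples
    are kept unless the relation was ordered first; this sketch keeps the first n."""
    return relation[:n]


# Hypothetical contents of a tab-separated file.
raw = ["1\t2", "4\t2", "8\t3", "4\t3", "7\t9"]

A = load(raw)
B = filter_by(A, lambda t: t[1] > 2)   # f2 > 2
C = limit(B, 2)
print(C)  # [(8, 3), (4, 3)]
```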

A = LOAD '/user/mapr/training/pig/emp.csv' USING PigStorage(',')
    AS (id, firstname, lastname, designation, city);

-- DUMP prints the relation to the screen and takes no INTO clause
DUMP A;

-- STORE writes the relation to the file system
STORE A INTO '/user/mapr/training/pig/output';

Apache Pig Example Scripts

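-- Load two employee files that share the same layout
X = LOAD '/user/mapr/training/pig/emp_pig1.csv' USING PigStorage(',') AS
(id, firstname, lastname, designation, city);
Y = LOAD '/user/mapr/training/pig/emp_pig2.csv' USING PigStorage(',') AS
(id, firstname, lastname, designation, city);
-- Join the two relations on designation
Z = JOIN X BY (designation), Y BY (designation);
-- Keep only the joined rows whose designation is exactly 'Manager'
final = FILTER Z BY X::designation MATCHES 'Manager';
-- Group the first relation by city
A = GROUP X BY city;
-- Project two columns from the first relation
B = FOREACH X GENERATE id, designation;
STORE final INTO '/user/mapr/training/pig/output';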
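
One detail in the script above is easy to miss: MATCHES tests the regular expression against the whole field value, so 'Manager' does not match 'Senior Manager'. The sketch below approximates that behaviour in Python with `re.fullmatch`. The `matches` helper and the sample designations are hypothetical, and Python's regex dialect is not identical to the Java one Pig uses, so treat this only as an approximation.

```python
import re


def matches(value, pattern):
    """Approximate Pig's MATCHES: the pattern must match the entire string."""
    return re.fullmatch(pattern, value) is not None


# Hypothetical designation values.
designations = ["Manager", "Senior Manager", "Developer"]

exact = [d for d in designations if matches(d, "Manager")]
contains = [d for d in designations if matches(d, ".*Manager.*")]

print(exact)     # ['Manager']
print(contains)  # ['Manager', 'Senior Manager']
```

To select every designation that contains the word, a pattern such as '.*Manager.*' would be needed in the FILTER statement.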

Apache Pig Advanced Scripts

Get distinct of elements in pig
process data in parallel in pig
sample data in pig
order by elements in pig

DISTINCT

Removes duplicate tuples in a relation.
alias = DISTINCT alias [PARTITION BY partitioner] [PARALLEL n];
A = LOAD 't.txt' USING PigStorage('\t') AS (f1:int, f2:int);
B = DISTINCT A;

DUMP

Dumps or displays results to screen.
DUMP alias;
A = LOAD 't.txt' USING PigStorage('\t') AS (f1:int, f2:int);
DUMP A;

ORDER BY

Sorts a relation based on one or more fields.
alias = ORDER alias BY { * [ASC|DESC] | field_alias [ASC|DESC] [, field_alias [ASC|DESC] …] } [PARALLEL n];
A = LOAD 't.txt' USING PigStorage('\t') AS (f1:int, f2:int);
B = ORDER A BY f2;
DUMP B;

UNION

Computes the union of two or more relations.
alias = UNION [ONSCHEMA] alias, alias [, alias …];
L1 = LOAD 'f1' AS (a : int, b : float);
L2 = LOAD 'f1' AS (a : long, c : chararray);
U = UNION ONSCHEMA L1, L2;
DESCRIBE U;
U: {a: long, b: float, c: chararray}
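
The merged schema shown by DESCRIBE suggests what ONSCHEMA does: fields are aligned by name, and a field missing from one input is null in that input's rows. The plain-Python sketch below models only that alignment. The function name and sample rows are hypothetical, and type promotion (such as int to long for field a) is not modelled.

```python
def union_onschema(rel1, fields1, rel2, fields2):
    """Merge two relations by field name, filling missing fields with None (null)."""
    merged = list(fields1) + [f for f in fields2 if f not in fields1]

    def widen(row, fields):
        by_name = dict(zip(fields, row))
        return tuple(by_name.get(f) for f in merged)

    rows = [widen(r, fields1) for r in rel1] + [widen(r, fields2) for r in rel2]
    return merged, rows


# Hypothetical rows for L1 (a, b) and L2 (a, c).
fields, rows = union_onschema([(1, 2.0)], ["a", "b"], [(5, "x")], ["a", "c"])

print(fields)  # ['a', 'b', 'c']
print(rows)    # [(1, 2.0, None), (5, None, 'x')]
```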

Join(Inner)

Performs an inner join of two or more relations based on common field values.
alias = JOIN alias BY {expression|'('expression [, expression …]')'} (, alias BY {expression|'('expression [, expression …]')'} …) [USING 'replicated' | 'skewed' | 'merge' | 'merge-sparse'] [PARTITION BY partitioner] [PARALLEL n];
A = load 'mydata';
B = load 'mydata';
C = join A by $0, B by $0;
DUMP C;

Join(Outer)

Performs an outer join of two relations based on common field values.
alias = JOIN left-alias BY left-alias-column [LEFT|RIGHT|FULL] [OUTER], right-alias BY right-alias-column [USING 'replicated' | 'skewed' | 'merge'] [PARTITION BY partitioner] [PARALLEL n];
A = LOAD 'a.txt' AS (n:chararray, a:int);
B = LOAD 'b.txt' AS (n:chararray, m:chararray);
C = JOIN A by $0 LEFT OUTER, B BY $0;
DUMP C;
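
The difference between the inner join and the left outer join can be seen on a few rows. The following plain-Python sketch joins two small relations on their first field ($0). The `join` helper and the sample data are hypothetical, and Pig's special handling of null join keys is not modelled.

```python
def join(left, right, how="inner"):
    """Join two relations on their first field ($0).

    how="inner" keeps only matching pairs; how="left" also keeps unmatched
    left tuples, padded with None for the right-hand fields.
    """
    out = []
    right_width = len(right[0]) if right else 0
    for l in left:
        partners = [r for r in right if r[0] == l[0]]
        if partners:
            out.extend(l + r for r in partners)
        elif how == "left":
            out.append(l + (None,) * right_width)
    return out


# Hypothetical contents of a.txt (n, a) and b.txt (n, m).
A = [("alice", 1), ("bob", 2)]
B = [("alice", "x"), ("carol", "y")]

print(join(A, B))              # [('alice', 1, 'alice', 'x')]
print(join(A, B, how="left"))  # [('alice', 1, 'alice', 'x'), ('bob', 2, None, None)]
```

As in Pig's output, the join key appears twice in each result tuple, once from each input relation.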

Apache Pig user defined Functions

FOREACH

Generates data transformations based on columns of data.
alias = FOREACH { block | nested_block };
X = FOREACH A GENERATE f1;
X = FOREACH B { S = FILTER A BY 'xyz' == '3'; GENERATE COUNT (S.$0); }

CROSS

Computes the cross product of two or more relations.
alias = CROSS alias, alias [, alias …] [PARTITION BY partitioner] [PARALLEL n];
A = LOAD 'data1' AS (a1:int,a2:int,a3:int);
B = LOAD 'data2' AS (b1:int,b2:int);
X = CROSS A, B;

(CO)GROUP

Groups the data in one or more relations.
The GROUP and COGROUP operators are identical; by convention, GROUP is used with one relation and COGROUP with two or more.
alias = GROUP alias { ALL | BY expression} [, alias ALL | BY expression …] [USING 'collected' | 'merge'] [PARTITION BY partitioner] [PARALLEL n];
A = load 'student' AS (name:chararray, age:int, gpa:float);
B = GROUP A BY age;
DUMP B;
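
The result of GROUP is a relation with one tuple per key, where the second field is a bag holding all the input tuples for that key. The plain-Python sketch below shows that shape. The `group_by` helper and the student rows are hypothetical, and the keys are sorted here only to make the printed output stable; Pig itself gives no ordering guarantee.

```python
from collections import defaultdict


def group_by(relation, key_index):
    """Return (key, bag) pairs, where bag lists every tuple that has that key."""
    groups = defaultdict(list)
    for t in relation:
        groups[t[key_index]].append(t)
    return sorted(groups.items())


# Hypothetical student rows: (name, age, gpa).
A = [("John", 18, 4.0), ("Mary", 19, 3.8), ("Bill", 20, 3.9), ("Joe", 18, 3.8)]

B = group_by(A, 1)  # group by age
for key, bag in B:
    print(key, bag)
```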

Apache Pig Storage

Pig Latin vs hiveql

Pig’s Debugging Operators

Use the DUMP operator to display results to your terminal screen.

Use the DESCRIBE operator to review the schema of a relation.

Use the EXPLAIN operator to view the logical, physical, or map reduce execution plans to compute a relation.

Use the ILLUSTRATE operator to view the step-by-step execution of a series of statements.

Shortcuts for Debugging Operators

\d alias - shortcut for DUMP. If alias is omitted, the last defined alias is used.

\de alias - shortcut for DESCRIBE. If alias is omitted, the last defined alias is used.

\e alias - shortcut for EXPLAIN. If alias is omitted, the last defined alias is used.

\i alias - shortcut for ILLUSTRATE. If alias is omitted, the last defined alias is used.

\q - quits the Grunt shell.

Pig Advanced Operations

ASSERT

Asserts a condition on the data.
ASSERT alias BY expression [, message];
A = LOAD 'data' AS (a0:int,a1:int,a2:int);
ASSERT A by a0 > 0, 'a0 should be greater than 0';

CUBE

Performs cube/rollup operations.
alias = CUBE alias BY { CUBE expression | ROLLUP expression }, [ CUBE expression | ROLLUP expression ] [PARALLEL n];
cubedinp = CUBE salesinp BY CUBE(product,year);
rolledup = CUBE salesinp BY ROLLUP(region,state,city);
cubed_and_rolled = CUBE salesinp BY CUBE(product,year), ROLLUP(region, state, city);
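
The two options differ in which grouping combinations they produce for each input tuple: CUBE yields every subset of the dimensions, while ROLLUP yields only the hierarchical prefixes. The plain-Python sketch below enumerates both, using None where a dimension is aggregated away. The function names and sample values are hypothetical, and the aggregation step that normally follows is left out.

```python
from itertools import combinations


def cube_groupings(dims):
    """All 2**n combinations of dims; a dropped dimension becomes None."""
    n = len(dims)
    out = []
    for k in range(n, -1, -1):
        for keep in combinations(range(n), k):
            out.append(tuple(d if i in keep else None for i, d in enumerate(dims)))
    return out


def rollup_groupings(dims):
    """The n + 1 hierarchical prefixes of dims, from most to least detailed."""
    n = len(dims)
    return [tuple(dims[:k]) + (None,) * (n - k) for k in range(n, -1, -1)]


# Hypothetical dimension values for one input tuple.
print(cube_groupings(("car", 2012)))
print(rollup_groupings(("west", "CA", "SF")))
```

For n dimensions, the cube has 2**n groupings per tuple and the rollup has n + 1, which is why ROLLUP is the cheaper choice when the dimensions form a hierarchy such as region, state, city.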

SAMPLE

Selects a random sample of data based on the specified sample size.
SAMPLE alias size;
A = LOAD 'data' AS (f1:int,f2:int,f3:int);
X = SAMPLE A 0.01;
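
The size given to SAMPLE is a fraction between 0 and 1, and the sampling is probabilistic, so the number of returned tuples is only approximately that fraction of the input. A minimal plain-Python sketch of this idea, with a hypothetical `sample` helper and an optional seed added purely to make runs repeatable:

```python
import random


def sample(relation, fraction, seed=None):
    """Keep each tuple independently with probability `fraction`."""
    rng = random.Random(seed)
    return [t for t in relation if rng.random() < fraction]


# Hypothetical input relation of 1000 single-field tuples.
A = [(i,) for i in range(1000)]

X = sample(A, 0.01, seed=42)
print(len(X))  # roughly 10, but not guaranteed to be exactly 10
```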

RANK

Returns each tuple with the rank within a relation.
alias = RANK alias [ BY { * [ASC|DESC] | field_alias [ASC|DESC] [, field_alias [ASC|DESC] …] } [DENSE] ];
B = rank A;
C = rank A by f1 DESC, f2 ASC;
C = rank A by f1 DESC, f2 ASC DENSE;
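
The effect of DENSE shows up only when there are ties. As far as the operator's documented behaviour goes, tied tuples share a rank; without DENSE the following rank skips ahead, and with DENSE it does not. The plain-Python sketch below illustrates that reading on a single descending field. The `rank_by` helper and the scores are hypothetical, and the sketch should be checked against the Pig version in use.

```python
def rank_by(values, dense=False):
    """Rank values in descending order; tied values share the same rank."""
    ordered = sorted(values, reverse=True)
    ranked, rank = [], 0
    for i, v in enumerate(ordered):
        if i == 0 or v != ordered[i - 1]:
            # Dense ranking counts distinct values; otherwise use the position.
            rank = rank + 1 if dense else i + 1
        ranked.append((rank, v))
    return ranked


# Hypothetical field values with one tie.
scores = [90, 75, 90, 60]

print(rank_by(scores))              # [(1, 90), (1, 90), (3, 75), (4, 60)]
print(rank_by(scores, dense=True))  # [(1, 90), (1, 90), (2, 75), (3, 60)]
```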

MAPREDUCE

Executes native MapReduce jobs inside a Pig script.
alias1 = MAPREDUCE 'mr.jar' STORE alias2 INTO 'inputLocation' USING storeFunc LOAD 'outputLocation' USING loadFunc AS schema [`params, ... `];
A = LOAD 'WordcountInput.txt';
B = MAPREDUCE 'wordcount.jar' STORE A INTO 'inputDir' LOAD 'outputDir' AS (word:chararray, count: int) `org.myorg.WordCount inputDir outputDir`;

IMPORT

Import macros defined in a separate file.
IMPORT 'file-with-macro';

STREAM

Sends data to an external script or program.
alias = STREAM alias [, alias …] THROUGH {`command` | cmd_alias } [AS schema] ;
A = LOAD 'data';
B = STREAM A THROUGH `perl stream.pl -n 5`;

Built-in functions

Eval functions

AVG
CONCAT
COUNT
COUNT_STAR

Math functions

ABS
SQRT
Etc …

STRING functions

ENDSWITH
TRIM
…

Datetime functions

AddDuration
GetDay
GetHour
…

Dynamic Invokers

DEFINE UrlDecode InvokeForString('java.net.URLDecoder.decode', 'String String');
encoded_strings = LOAD 'encoded_strings.txt' as (encoded:chararray);
decoded_strings = FOREACH encoded_strings GENERATE UrlDecode(encoded, 'UTF-8');
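
The invoker above simply calls Java's URLDecoder.decode on each value. To show what that transformation does to the data, the sketch below performs a comparable decoding in Python with `urllib.parse.unquote_plus`, which likewise turns '+' into a space and resolves percent escapes. The `url_decode` wrapper and the sample strings are hypothetical, and edge cases may differ between the Java and Python implementations.

```python
from urllib.parse import unquote_plus


def url_decode(encoded, charset="utf-8"):
    """Decode a form-encoded string: '+' becomes a space, %XX becomes a character."""
    return unquote_plus(encoded, encoding=charset)


# Hypothetical contents of encoded_strings.txt.
encoded_strings = ["hello%20world", "a%2Bb+c"]

decoded_strings = [url_decode(s) for s in encoded_strings]
print(decoded_strings)  # ['hello world', 'a+b c']
```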

File System Commands with Apache Pig - Hadoop

Hadoop - Apache pig Utility Commands

Some more commands in PIG

To select a few columns from a dataset

S1 = FOREACH A GENERATE a1, a2;

Simple calculation on a dataset

K = FOREACH A GENERATE $1, $2, $1*$2;

To keep only 100 records

B = LIMIT A 100;

To see the structure (schema)

DESCRIBE A;

To union two datasets

C = UNION A, B;

Using Hive tables with HCatalog

HCatalog (which is a component of Hive) provides access to Hive’s metastore, so that Pig queries can reference schemas by name instead of specifying them in full each time.

For example, after loading data into a Hive table, Pig can access the table’s schema and data as follows:

pig -useHCatalog
grunt> records = LOAD 'School_db.student_tbl' USING org.apache.hcatalog.pig.HCatLoader();
grunt> DESCRIBE records;
grunt> DUMP records;
