[Solved - 2 Solutions] Pig: Hadoop jobs fail?
What is Hadoop?
- Hadoop is an open-source, Java-based programming framework that supports the processing and storage of extremely large data sets in a distributed computing environment.
Problem:
We have a Pig script that queries data from a CSV file.
The script has been tested locally with both small and large .csv files.
On a small cluster, the job starts processing the script but fails after completing roughly 40% of the work.
The error is:
Failed to read data from "path to file"
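For reference, a minimal sketch of the kind of script involved; the path, aliases, and schema below are hypothetical placeholders, not the original script:
-- hypothetical Pig script that reads and queries a CSV file
records = LOAD '/data/input/sample.csv' USING PigStorage(',')
          AS (id:int, name:chararray, amount:double);
big = FILTER records BY amount > 100.0;   -- example query over the loaded data
DUMP big;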
Solution 1:
- An answer to the general problem is to change the logging levels in the configuration files. These are log4j settings, so they belong in a log4j properties file (for example Pig's conf/log4j.properties) rather than in mapred-site.xml. Add these two lines:
log4j.logger.org.apache.hadoop=error,A
log4j.logger.org.apache.pig=error,A
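Those lines reference an appender named A, which must also be defined. A minimal sketch of a standalone log4j file and how to pass it to Pig follows; the file name pig-quiet.properties and the script name myscript.pig are placeholders:
# pig-quiet.properties - console appender plus error-only levels for Hadoop and Pig
log4j.appender.A=org.apache.log4j.ConsoleAppender
log4j.appender.A.layout=org.apache.log4j.PatternLayout
log4j.appender.A.layout.ConversionPattern=%d [%t] %-5p %c - %m%n
log4j.logger.org.apache.hadoop=error,A
log4j.logger.org.apache.pig=error,A
# run the script with this log4j configuration (-4 / -log4jconf overrides the default log conf)
$ pig -4 pig-quiet.properties myscript.pig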
In this case the underlying failure turned out to be a kind of OutOfMemory exception, which is what Solution 2 addresses.
Solution 2:
- First check the logs, and increase the verbosity level if needed; see the sketch below for one way to capture and inspect them.
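For example, Pig can be told where to write its client-side log, which can then be searched for the real cause; the log path and script name here are illustrative only:
$ pig -logfile /tmp/pig_run.log myscript.pig
$ grep -i "OutOfMemory" /tmp/pig_run.log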
To change the client-side memory in Hadoop, edit the hadoop-env.sh file:
# The following applies to multiple commands (fs, dfs, fsck, distcp etc)
export HADOOP_CLIENT_OPTS="-Xmx128m ${HADOOP_CLIENT_OPTS}"
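To actually raise the client heap, edit that line; as an illustration, 1 GB instead of 128 MB (the right value is an assumption here and depends on your data size and the memory available on the node):
# in hadoop-env.sh - give client commands a 1 GB heap instead of 128 MB
export HADOOP_CLIENT_OPTS="-Xmx1024m ${HADOOP_CLIENT_OPTS}"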
For Apache Pig, the header of the pig launch script (a bash file) documents the heap setting:
# PIG_HEAPSIZE The maximum amount of heap to use, in MB.
# Default is 1000.
Since the value is a plain number of megabytes, we can raise it with export:
$ export PIG_HEAPSIZE=4096
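Run the script from the same shell so the variable is picked up by the pig launcher; myscript.pig is a placeholder name:
$ pig myscript.pig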