[Solved-1 Solution] Datetime parsing in Apache Pig?
What is parsing?
- Parsing converts the string representation of a date and time into an equivalent datetime object.
- Parsing depends on format information such as the strings used for date and time separators and the names of months, days, and eras.
- In .NET, this information comes from the current DateTimeFormatInfo object, which is supplied implicitly by the current thread culture or explicitly through the IFormatProvider parameter of a parsing method.
- For the IFormatProvider parameter, pass either a CultureInfo object, which represents a culture, or a DateTimeFormatInfo object.
- In Pig, the ToDate function does this job: it takes the string and a format pattern, as in ToDate(Date, 'M/d/yy h:mm a').
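To make the idea of pattern-based parsing concrete, here is a minimal sketch in plain Java. It uses java.time as a stand-in and is not Pig's own implementation; the class name ParseDemo, the helper parseOrNull, and both sample strings are hypothetical. The pattern is the one used in the Pig script in this article.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Locale;

public class ParseDemo {
    // Same pattern letters as the Pig script in this article. Locale.ENGLISH
    // fixes the AM/PM marker text so the result does not depend on the machine.
    static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("M/d/yy h:mm a", Locale.ENGLISH);

    // Returns the parsed value, or null when the text does not fit the pattern.
    static LocalDateTime parseOrNull(String text) {
        try {
            return LocalDateTime.parse(text, FORMAT);
        } catch (DateTimeParseException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample value that fits the pattern.
        System.out.println(parseOrNull("3/9/16 11:55 PM"));
        // Hypothetical value with a different layout; it is rejected.
        System.out.println(parseOrNull("2016-03-09 23:55"));
    }
}
```

The point of the sketch is that the pattern and the text must agree field by field; a string in another layout is rejected instead of being guessed at.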
Problem:
We are trying to parse a date in a Pig script. The job fails, and Hadoop does not return any error message. Here is an example of the date format: 3/9/16 11:55 PM
data = LOAD 'cleaned.txt'
AS (Date, Block, Primary_Type, Description, Location_Description, Arrest, Domestic, District, Year);
times = FOREACH data GENERATE ToDate(Date, 'M/d/yy h:mm a') As Time;
It looks like the error is caused by the STORE command on "times".
If we run a DUMP instead, we get this error:
ERROR 1066: Unable to open iterator for alias times
It happens only when we use the ToDate function.
Solution 1:
- We need to specify the loader in the LOAD statement:
USING PigStorage('\t')
- Always specify the schema with a type for each field:
data = LOAD 'SO/date2parse.txt' USING PigStorage('\t') AS (Date:chararray, Block:chararray, Primary_Type:chararray, Description:chararray, Location_Description:chararray, Arrest:chararray, Domestic:chararray, District:chararray, Year:chararray);
After this, the date conversion works fine:
(2016-03-09T23:55:00.000Z)
(2016-03-09T23:55:00.000Z)
(2016-03-09T23:55:00.000Z)
Use the code below:
data = LOAD 'SO/date2parse.txt' USING PigStorage('\t') AS (Date:chararray, Block:chararray, Primary_Type:chararray, Description:chararray, Location_Description:chararray, Arrest:chararray, Domestic:chararray, District:chararray, Year:chararray);
times = FOREACH data GENERATE ToDate(Date, 'M/d/yy h:mm a') As Time;
DUMP times;
- PigStorage is the default load function for the LOAD operator, and tab is its default delimiter, so the USING clause only makes this explicit.
- The original issue was caused by the missing data types: if you don't assign types, fields default to type bytearray.
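As a rough analogy in plain Java (not Pig's internals), an untyped field is just raw bytes, and it has to be turned into text before a date pattern can be applied. The sketch below assumes UTF-8 input; the class name UntypedFieldDemo, the helper parseRawField, and the sample value are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

public class UntypedFieldDemo {
    static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("M/d/yy h:mm a", Locale.ENGLISH);

    // An untyped field is only raw bytes. Decode it to text first (the rough
    // counterpart of declaring Date:chararray), then parse the text.
    static LocalDateTime parseRawField(byte[] rawField) {
        String text = new String(rawField, StandardCharsets.UTF_8);
        return LocalDateTime.parse(text, FORMAT);
    }

    public static void main(String[] args) {
        // Hypothetical sample value, stored as bytes the way an untyped field would be.
        byte[] raw = "3/9/16 11:55 PM".getBytes(StandardCharsets.UTF_8);
        System.out.println(parseRawField(raw));
    }
}
```

Declaring Date:chararray in the LOAD schema plays the role of the decoding step here, which is why ToDate works once the types are in place.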