How to parse a JSON string from a column with Pig ?
What is JSON string ?
- JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate. It is based on a subset of the JavaScript Programming Language.
Parsing Simple JSON Objects
- If JSON data contains just key-value pairs i.e. {"key":"value","key2":"value2",...}
- We can use JsonStringToMap UDF to extract the required values by key
For example,
The last statement outputs values for a, b and c keys for every row:
Parsing JSON Arrays
Unfortunately, JsonStringToMap UDF does not work with JSON arrays, i.e. data in the following format:
- There are a dozen of custom UDFs written to work with JSON arrays, here is an example of how to write your own Python UDF to iterate JSON arrays and extract any required data.
- Let’s assume that payload column now contains the following data:
- Here we have 3 rows, every row contains a JSON array. Note that the first array contains 3 JSON objects, the second array contains 2 objects, and the third array contains just one JSON object
Write a Pig UDF in Python that returns the number elements in array, and the last value for a key in each array:
Now we can invoke our UDF as follows:
The last statement outputs the number of items in the array and the last value for a for every row: