[Solved-1 Solution] Apache Pig - URL parsing into a map
What's a URL?
- Uniform Resource Locators (URLs) provide a way to locate a resource using a specific scheme, most often but not limited to HTTP. Just think of a URL as an address to a resource, and the scheme as a specification of how to get there.
Parsing a URL
- Java's java.net.URL class provides several methods that let you query URL objects. You can get the protocol, authority, host name, port number, path, query, filename, and reference from a URL.
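As a minimal sketch of those accessors, the plain Java class below collects each component of a URL into a map. The class name UrlParts, the method name describe, and the example.com address are illustrative assumptions and do not come from the original post.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Pig.
final class UrlParts {

    // Returns the components exposed by java.net.URL, keyed by name.
    static Map<String, String> describe(String spec) {
        URL url;
        try {
            url = URI.create(spec).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a usable URL: " + spec, e);
        }
        Map<String, String> parts = new LinkedHashMap<>();
        parts.put("protocol", url.getProtocol());
        parts.put("authority", url.getAuthority());
        parts.put("host", url.getHost());
        parts.put("port", String.valueOf(url.getPort())); // -1 when no port is given
        parts.put("path", url.getPath());
        parts.put("query", url.getQuery());               // null when there is no query
        parts.put("file", url.getFile());                 // path plus "?query" if present
        parts.put("ref", url.getRef());                   // fragment after '#'
        return parts;
    }

    public static void main(String[] args) {
        // Placeholder URL, assumed for the example.
        describe("http://example.com:8080/docs/index.html?name=pig&lang=en#top")
            .forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

The query component comes back as one undivided string, which is why the splitting steps discussed next are still needed to turn it into a map.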
Problem:
How can a URL be parsed into a map in Apache Pig?
Solution 1:
Using FLATTEN
- The FLATTEN operator looks like a UDF syntactically, but it is actually an operator that changes the structure of tuples and bags in a way that a UDF cannot. FLATTEN un-nests tuples as well as bags. The idea is the same, but the operation and the result are different for each type of structure.
- FLATTEN the result of STRSPLIT so that the tuples carry no unnecessary level of nesting, and FLATTEN again inside the nested FOREACH.
- Also, STRSPLIT has an optional third argument to give the maximum number of output strings. Use that to guarantee a schema for its output.
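The effect of the split limit can be sketched in plain Java, whose String.split takes the same kind of regex and limit arguments as STRSPLIT. The input format (key=value tokens joined by "&") and the names SplitSketch and splitPairs are assumptions made for illustration; the original post's input data is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Pig.
final class SplitSketch {

    // Splits "k1=v1&k2=v2" into a list of two-element [key, value] arrays.
    static List<String[]> splitPairs(String query) {
        List<String[]> pairs = new ArrayList<>();
        for (String token : query.split("&")) {
            // Limit 2 means at most two pieces, so a value that itself
            // contains '=' is kept whole instead of being split further.
            String[] kv = token.split("=", 2);
            String value = kv.length > 1 ? kv[1] : ""; // key with no '=' gets an empty value
            pairs.add(new String[] { kv[0], value });
        }
        return pairs;
    }
}
```

Because every pair always has exactly two fields, downstream code can rely on a fixed shape, which is the same reason the limit argument helps guarantee a schema in Pig.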
The code below helps with URL parsing:
Output
- After splitting out the tags and values, also group by the tag to get a bag of values, then put that into a map. Note that this assumes that two lines with the same id (test2, here) should be combined.
- Unfortunately, there is apparently no way to combine maps without resorting to a UDF, but this should be just about the simplest of all possible UDFs.
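The core logic of such a map-combining UDF might look like the sketch below. It is written as plain Java so that it has no Pig dependency; in an actual UDF this logic would sit inside the exec(Tuple) method of a class extending org.apache.pig.EvalFunc. The names MapMergeSketch and combine are hypothetical, and representing each bag of values as a List<String> is an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not the original post's UDF.
final class MapMergeSketch {

    // Merges several tag -> values maps into one. When the same tag
    // appears in more than one map, its values are concatenated in order.
    static Map<String, List<String>> combine(List<Map<String, List<String>>> maps) {
        Map<String, List<String>> merged = new LinkedHashMap<>();
        for (Map<String, List<String>> m : maps) {
            for (Map.Entry<String, List<String>> e : m.entrySet()) {
                merged.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                      .addAll(e.getValue());
            }
        }
        return merged;
    }
}
```

Concatenating values for a repeated tag is one possible policy; keeping only the first or last value would be equally easy, depending on what the data calls for.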
With a UDF like that, we can do
For more robust URL parsing, add error-checking code to the UDF.
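One way such error checking might look is sketched below in plain Java, again without Pig dependencies. The names SafeQueryParser and parseQuery are hypothetical, and returning an empty map for bad input is an assumed policy (a Pig UDF could just as well return null).

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not the original post's UDF.
final class SafeQueryParser {

    // Parses the query string of a URL into a key -> value map.
    // Null, blank, or malformed input yields an empty map instead of an exception.
    static Map<String, String> parseQuery(String url) {
        Map<String, String> params = new LinkedHashMap<>();
        if (url == null || url.trim().isEmpty()) {
            return params;
        }
        String query;
        try {
            // The raw query is used so that percent-encoded '&' or '='
            // inside a value cannot confuse the splitting below.
            query = new URI(url.trim()).getRawQuery();
        } catch (URISyntaxException e) {
            return params; // malformed URL
        }
        if (query == null || query.isEmpty()) {
            return params; // URL without a query part
        }
        for (String token : query.split("&")) {
            String[] kv = token.split("=", 2);
            if (kv[0].isEmpty()) {
                continue; // skip empty tokens and tokens with no key, such as "=x"
            }
            params.put(kv[0], kv.length > 1 ? kv[1] : "");
        }
        return params;
    }
}
```

Values are left percent-encoded in this sketch; decoding them would be a separate step applied to each value after splitting.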