Cloudera’s Impala
Impala was the first to bring SQL querying to the public, in April 2013. It comes with a number of notable features:
- Impala can query many file formats, such as Parquet, Avro, Text, RCFile and SequenceFile;
- Impala supports data stored in HDFS, Apache HBase and Amazon S3;
- Impala supports multiple compression codecs:
  - Snappy (recommended for its effective balance between compression ratio and decompression speed),
  - Gzip (recommended when the highest level of compression is the priority),
  - Deflate (not supported for text files), Bzip2, LZO (for text files only);
- Impala provides security through:
  - authorization based on Sentry (OS user ID): which users are allowed to access which resources, and which operations they are allowed to perform,
  - authentication based on Kerberos, plus the ability to specify an Active Directory username/password: how Impala verifies the identity of users to confirm that they are allowed to exercise the privileges assigned to them,
  - auditing: which operations were attempted and whether they succeeded, which makes it possible to track down suspicious activity; audit data are collected by Cloudera Manager;
- Impala supports SSL network encryption between Impala and client programs, and between the Impala-related daemons running on different nodes in the cluster;
- Impala supports user-defined functions (UDFs) and user-defined aggregate functions (UDAFs);
- Impala automatically chooses an efficient join order;
- Impala provides admission control – prioritization and queueing of queries within Impala;
- Impala supports concurrent queries from multiple users;
- Impala caches frequently accessed data in memory;
- Impala computes statistics (with COMPUTE STATS);
- Impala provides window functions (aggregation with OVER and PARTITION BY, RANK, LEAD, LAG, NTILE, and so on) for more advanced SQL analytic capabilities (since version 2.0);
- Impala allows disk-based (external) joins and aggregations (since version 2.0) – operations can spill to disk if their internal state exceeds the aggregate memory size;
- Impala allows subqueries inside WHERE clauses;
- Impala supports incremental statistics – statistics are computed only on new or changed data, which makes the computation faster;
- Impala enables queries on complex nested structures including maps, structs and arrays;
- Impala enables merging updates into existing tables (MERGE);
- Impala enables some OLAP functions (ROLLUP, CUBE, GROUPING SETS);
- Impala can be used for inserts and updates into HBase.
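
To make the window-function and subquery items above more concrete, here is a minimal sketch of the kind of query they refer to. It is an illustration under stated assumptions, not Impala itself: Python's built-in sqlite3 module (SQLite 3.25 or later) stands in for Impala so that the example runs without a cluster, and the `sales` table, its columns and its rows are hypothetical. The SQL uses only standard constructs (RANK, LAG, LEAD, an aggregate with OVER and PARTITION BY, and a subquery inside WHERE), so it would be expected to carry over to Impala 2.0 or later largely unchanged, although that was not verified here.

```python
import sqlite3

# Assumption: SQLite is only a stand-in engine so the example is runnable;
# the "sales" table and its contents are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, month INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("east", 1, 100),
        ("east", 2, 150),
        ("east", 3, 120),
        ("west", 1, 200),
        ("west", 2, 180),
    ],
)

query = """
SELECT region,
       month,
       amount,
       -- rank of each month within its region, highest amount first
       RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS amount_rank,
       -- previous and next month's amount within the same region
       LAG(amount)  OVER (PARTITION BY region ORDER BY month) AS prev_amount,
       LEAD(amount) OVER (PARTITION BY region ORDER BY month) AS next_amount,
       -- aggregation over a partition: total per region on every row
       SUM(amount)  OVER (PARTITION BY region)                AS region_total
FROM sales
-- subquery inside the WHERE clause: keep regions with at least two rows
WHERE region IN (
    SELECT region FROM sales GROUP BY region HAVING COUNT(*) >= 2
)
ORDER BY region, month
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Each result row keeps the detail columns and adds the per-region rank, neighbouring values and total, which is what distinguishes window functions from a plain GROUP BY that would collapse each region into a single row.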