Cloudera’s Impala

Impala was the first to bring interactive SQL querying on Hadoop to the public, in April 2013. It comes with a number of notable features:
  • Impala can query many file formats, such as Parquet, Avro, Text, RCFile and SequenceFile;
  • Impala supports data stored in HDFS, Apache HBase and Amazon S3;
  • Impala supports multiple compression codecs:
    • Snappy (recommended for its effective balance between compression ratio and decompression speed),
    • Gzip (recommended when the highest level of compression is needed),
    • Deflate (not supported for text files), Bzip2, LZO (for text files only);
  • Impala provides security through:
    • authorization based on Sentry (OS user ID): defining which users are allowed to access which resources and what operations they are allowed to perform,
    • authentication based on Kerberos, plus the ability to specify an Active Directory username/password: how Impala verifies the identity of users to confirm that they are allowed to exercise the privileges assigned to them,
    • auditing: what operations were attempted and whether they succeeded, which makes it possible to track down suspicious activity; audit data are collected by Cloudera Manager;
  • Impala supports SSL network encryption between Impala and client programs, and between the Impala-related daemons running on different nodes in the cluster;
  • Impala supports UDFs and UDAFs;
  • Impala automatically orders joins for the most efficient execution;
  • Impala provides admission control – prioritization and queueing of queries within Impala;
  • Impala allows multi-user concurrent queries;
  • Impala caches frequently accessed data in memory;
  • Impala computes statistics (with COMPUTE STATS);
  • Impala provides window functions (aggregation with OVER and PARTITION BY, RANK, LEAD, LAG, NTILE, and so on) for more advanced SQL analytic capabilities (since version 2.0);
  • Impala supports external joins and aggregations using disk (since version 2.0), so operations can spill to disk if their internal state exceeds the aggregate memory size;
  • Impala allows subqueries inside WHERE clauses;
  • Impala supports incremental statistics, computing statistics only on new or changed data for faster statistics computation;
  • Impala enables queries on complex nested structures including maps, structs and arrays;
  • Impala enables merging (MERGE) updates into existing tables;
  • Impala enables some OLAP functions (ROLLUP, CUBE, GROUPING SETS);
  • Impala supports inserts and updates into HBase.
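
To make the window functions from the list above a bit more concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module (SQLite 3.25 or later) as a stand-in rather than Impala itself, on the assumption that RANK, LAG and LEAD follow the same standard SQL semantics in both engines. The sales table and its columns are hypothetical.

```python
import sqlite3

# SQLite stands in for Impala here (assumption: same standard SQL window semantics).
# The table name, column names and data are hypothetical, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, month INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("east", 1, 100),
        ("east", 2, 150),
        ("east", 3, 120),
        ("west", 1, 200),
        ("west", 2, 180),
    ],
)

# Each window is evaluated per region (PARTITION BY), without collapsing rows:
#   RANK ranks months by amount, LAG/LEAD look at the previous/next month.
query = """
SELECT region,
       month,
       amount,
       RANK()       OVER (PARTITION BY region ORDER BY amount DESC) AS amount_rank,
       LAG(amount)  OVER (PARTITION BY region ORDER BY month)       AS prev_amount,
       LEAD(amount) OVER (PARTITION BY region ORDER BY month)       AS next_amount
FROM sales
ORDER BY region, month
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Unlike a GROUP BY, every input row survives in the result and the per-region values are attached alongside it. Against a real cluster, a SELECT of the same shape would be submitted through an Impala client instead of sqlite3.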
