Performance of Hive tables with Parquet & ORC
Source: http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
Datasets
Table A – Text file format – 2.5 GB
Table B – ORC – 652 MB
Table C – ORC with Snappy – 802 MB
Table D – Parquet – 1.9 GB
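For reference, a minimal sketch of how such tables might be declared in Hive. The table and column names are placeholders, not the schema from the original test; only the storage formats match the datasets above.

```sql
-- Hypothetical source text table plus the three columnar copies.
CREATE TABLE table_a (id BIGINT, amount DOUBLE, category STRING, created STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE;

CREATE TABLE table_b STORED AS ORC
  TBLPROPERTIES ("orc.compress" = "ZLIB")      -- ORC default codec
  AS SELECT * FROM table_a;

CREATE TABLE table_c STORED AS ORC
  TBLPROPERTIES ("orc.compress" = "SNAPPY")    -- ORC with Snappy
  AS SELECT * FROM table_a;

CREATE TABLE table_d STORED AS PARQUET
  AS SELECT * FROM table_a;
```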
Parquet gave the worst compression for my table.
My tests with the above tables yielded the following results (representative queries are sketched after the results).
Row count operation
Text Format Cumulative CPU – 123.33 sec
Parquet Format Cumulative CPU – 204.92 sec
ORC Format Cumulative CPU – 119.99 sec
ORC with SNAPPY Cumulative CPU – 107.05 sec
Sum of a column operation
Text Format Cumulative CPU – 127.85 sec
Parquet Format Cumulative CPU – 255.2 sec
ORC Format Cumulative CPU – 120.48 sec
ORC with SNAPPY Cumulative CPU – 98.27 sec
Average of a column operation
Text Format Cumulative CPU – 128.79 sec
Parquet Format Cumulative CPU – 211.73 sec
ORC Format Cumulative CPU – 165.5 sec
ORC with SNAPPY Cumulative CPU – 135.45 sec
Selecting 4 columns from a given range using a WHERE clause
Text Format Cumulative CPU – 72.48 sec
Parquet Format Cumulative CPU – 136.4 sec
ORC Format Cumulative CPU – 96.63 sec
ORC with SNAPPY Cumulative CPU – 82.05 sec
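The four operations above roughly correspond to queries of the following shape; the column names and the range predicate are assumptions, since the original answer does not show the exact statements.

```sql
-- Row count
SELECT COUNT(*) FROM table_b;

-- Sum of a column
SELECT SUM(amount) FROM table_b;

-- Average of a column
SELECT AVG(amount) FROM table_b;

-- Selecting 4 columns from a given range using a WHERE clause
SELECT id, amount, category, created
FROM table_b
WHERE id BETWEEN 1000000 AND 2000000;
```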
Additional comments
Both of these formats have their own specific advantages. Parquet might be better if you have highly nested data, because it stores its elements as a tree, like Google Dremel does.
Apache ORC might be better if your file structure is flatter.
And as far as I know, Parquet does not support indexes yet. ORC comes with a lightweight index and, since Hive 0.14, an additional Bloom filter, which might be the reason for its better query speed, especially for the sum operations.
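For example, ORC Bloom filters can be requested per column through table properties; the table and column names below are placeholders.

```sql
-- Hive 0.14+: the ORC row-group index is on by default; Bloom filters are opt-in per column.
CREATE TABLE table_b_bf (id BIGINT, amount DOUBLE, category STRING, created STRING)
STORED AS ORC
TBLPROPERTIES (
  "orc.create.index" = "true",            -- lightweight min/max index (default true)
  "orc.bloom.filter.columns" = "id",      -- comma-separated list of columns
  "orc.bloom.filter.fpp" = "0.05"         -- target false-positive rate
);
```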
The Parquet default compression is SNAPPY.
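If you want Snappy for Parquet output explicitly, it can be set as a session property or a table property; the effective default can vary by Hive/Parquet version, so treat this as a sketch.

```sql
-- Session-wide setting for subsequent Parquet writes
SET parquet.compression=SNAPPY;

-- Or per table (table name is a placeholder)
CREATE TABLE table_d_snappy STORED AS PARQUET
TBLPROPERTIES ("parquet.compression" = "SNAPPY")
AS SELECT * FROM table_d;
```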
References:
- https://community.hortonworks.com/questions/2067/orc-vs-parquet-when-to-use-one-over-the-other.html
- http://hortonworks.com/blog/orcfile-in-hdp-2-better-compression-better-performance/