Performance of Hive tables with Parquet & ORC
Source: http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
Datasets
Table A – Text file format – 2.5 GB
Table B – ORC – 652 MB
Table C – ORC with Snappy – 802 MB
Table D – Parquet – 1.9 GB
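For reference, a minimal sketch of how such tables might be declared in Hive. The table and column names are placeholders, not the schema from the original test; only the storage formats match the datasets above.

```sql
-- Hypothetical source text table plus the three columnar copies.
CREATE TABLE table_a (id BIGINT, amount DOUBLE, category STRING, created STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE;

CREATE TABLE table_b STORED AS ORC
  TBLPROPERTIES ("orc.compress" = "ZLIB")      -- ORC default codec
  AS SELECT * FROM table_a;

CREATE TABLE table_c STORED AS ORC
  TBLPROPERTIES ("orc.compress" = "SNAPPY")    -- ORC with Snappy
  AS SELECT * FROM table_a;

CREATE TABLE table_d STORED AS PARQUET
  AS SELECT * FROM table_a;
```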
Parquet gave the worst compression for my table.
My tests with the above tables yielded the following results (representative queries are sketched after the results).
Row count operation
Text Format Cumulative CPU – 123.33 sec
Parquet Format Cumulative CPU – 204.92 sec
ORC Format Cumulative CPU – 119.99 sec
ORC with SNAPPY Cumulative CPU – 107.05 sec
Sum of a column operation
Text Format Cumulative CPU – 127.85 sec
Parquet Format Cumulative CPU – 255.2 sec
ORC Format Cumulative CPU – 120.48 sec
ORC with SNAPPY Cumulative CPU – 98.27 sec
Average of a column operation
Text Format Cumulative CPU – 128.79 sec
Parquet Format Cumulative CPU – 211.73 sec
ORC Format Cumulative CPU – 165.5 sec
ORC with SNAPPY Cumulative CPU – 135.45 sec
Selecting 4 columns from a given range using a WHERE clause
Text Format Cumulative CPU – 72.48 sec
Parquet Format Cumulative CPU – 136.4 sec
ORC Format Cumulative CPU – 96.63 sec
ORC with SNAPPY Cumulative CPU – 82.05 sec
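The four operations above roughly correspond to queries of the following shape; the column names and the range predicate are assumptions, since the original answer does not show the exact statements.

```sql
-- Row count
SELECT COUNT(*) FROM table_b;

-- Sum of a column
SELECT SUM(amount) FROM table_b;

-- Average of a column
SELECT AVG(amount) FROM table_b;

-- Selecting 4 columns from a given range using a WHERE clause
SELECT id, amount, category, created
FROM table_b
WHERE id BETWEEN 1000000 AND 2000000;
```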
Additional comments
Both of these formats have their own specific advantages. Parquet might be better if you have highly nested data, because it stores its elements as a tree, like Google Dremel does.
Apache ORC might be better if your file structure is flatter.
And as far as I know, Parquet does not support indexes yet. ORC comes with a lightweight index and, since Hive 0.14, an additional Bloom filter, which might be the reason for its better query speed, especially for the sum operations.
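For example, ORC Bloom filters can be requested per column through table properties; the table and column names below are placeholders.

```sql
-- Hive 0.14+: the ORC row-group index is on by default; Bloom filters are opt-in per column.
CREATE TABLE table_b_bf (id BIGINT, amount DOUBLE, category STRING, created STRING)
STORED AS ORC
TBLPROPERTIES (
  "orc.create.index" = "true",            -- lightweight min/max index (default true)
  "orc.bloom.filter.columns" = "id",      -- comma-separated list of columns
  "orc.bloom.filter.fpp" = "0.05"         -- target false-positive rate
);
```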
The Parquet default compression is SNAPPY.
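If you want Snappy for Parquet output explicitly, it can be set as a session property or a table property; the effective default can vary by Hive/Parquet version, so treat this as a sketch.

```sql
-- Session-wide setting for subsequent Parquet writes
SET parquet.compression=SNAPPY;

-- Or per table (table name is a placeholder)
CREATE TABLE table_d_snappy STORED AS PARQUET
TBLPROPERTIES ("parquet.compression" = "SNAPPY")
AS SELECT * FROM table_d;
```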
References:
- https://community.hortonworks.com/questions/2067/orc-vs-parquet-when-to-use-one-over-the-other.html
- http://hortonworks.com/blog/orcfile-in-hdp-2-better-compression-better-performance/