Fastest way of compressing file(s) in Hadoop
Compressing files in Hadoop
Okay, well… it may or may not be the fastest. Email me if you find a better alternative 😉
Short background:
- The technique uses a simple Pig script
- Make Pig use the Tez engine (and set the queue name appropriately)
- You can change the codec in the Pig script
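For example, swapping Snappy for gzip only means changing the codec property; GzipCodec and BZip2Codec both ship with stock Hadoop. A sketch of the two lines you would change in the script below:

set output.compression.enabled true;
set output.compression.codec org.apache.hadoop.io.compress.GzipCodec;

Everything else in the script stays the same; the output files will then carry a .gz extension instead of .snappy.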
Da Pig script
[compress-snappy.pig]
/*
 * compress-snappy.pig: Pig script to compress a directory
 *
 * input:  IN_DIR:  hdfs input directory to compress
 * output: OUT_DIR: hdfs output directory
 */

set output.compression.enabled true;
set output.compression.codec org.apache.hadoop.io.compress.SnappyCodec;

set exectype tez;
set tez.queue.name DBA;
set mapred.job.queue.name DBA;

-- comma-separated list of hdfs directories to compress
input0 = LOAD '$IN_DIR' USING PigStorage();

-- single output directory
STORE input0 INTO '$OUT_DIR' USING PigStorage();
Execute it with:
$ pig -p IN_DIR=/dir/large/files/dt=2017-05-01 -p OUT_DIR=/tmp/compression/test/dt=2017-05-01 -f compress-snappy.pig -x tez
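Since the input and output directories are passed as parameters, compressing several daily partitions is just a shell loop around the same invocation. A sketch, reusing the paths from the example above (the extra dates are hypothetical):

$ for dt in 2017-05-01 2017-05-02 2017-05-03; do
    pig -p IN_DIR=/dir/large/files/dt=$dt \
        -p OUT_DIR=/tmp/compression/test/dt=$dt \
        -f compress-snappy.pig -x tez
  done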
The output directory may contain multiple .snappy files. They can be safely merged using the following command:
hdfs dfs -cat /tmp/compression/test/dt=2017-05-01/part* | hdfs dfs -put - /data/final/dataset-01/dt=2017-05-01/merged-2017-05-01.snappy
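To sanity-check that the merged file is still a valid compressed stream, hdfs dfs -text (which decompresses files by recognized codec extension) can be used to peek at the first few records:

$ hdfs dfs -text /data/final/dataset-01/dt=2017-05-01/merged-2017-05-01.snappy | head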
You like it? You share it 😉
References
- https://stackoverflow.com/questions/7153087/hadoop-compress-file-in-hdfs