Fastest way of compressing file(s) in Hadoop
Compressing files in Hadoop
Okay, well… it may or may not be the fastest. Email me if you find a better alternative 😉
Short background:
- The technique uses a simple Pig script
- Make Pig use the Tez engine (and set the queue name appropriately)
- You can change the codec in the Pig script
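For example, swapping Snappy for gzip only means changing the codec property; GzipCodec and BZip2Codec both ship with stock Hadoop. A sketch of the two lines you would change in the script below:

set output.compression.enabled true;
set output.compression.codec org.apache.hadoop.io.compress.GzipCodec;

Everything else in the script stays the same; the output files will then carry a .gz extension instead of .snappy.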
Da Pig script
[compress-snappy.pig]
/*
 * compress-snappy.pig: Pig script to compress a directory
 *
 * input:  IN_DIR:  hdfs input directory to compress
 * output: OUT_DIR: hdfs output directory
 */

set output.compression.enabled true;
set output.compression.codec org.apache.hadoop.io.compress.SnappyCodec;

set exectype tez;
set tez.queue.name DBA;
set mapred.job.queue.name DBA;

-- comma-separated list of hdfs directories to compress
input0 = LOAD '$IN_DIR' USING PigStorage();

-- single output directory
STORE input0 INTO '$OUT_DIR' USING PigStorage();
Execute it with:
$ pig -p IN_DIR=/dir/large/files/dt=2017-05-01 -p OUT_DIR=/tmp/compression/test/dt=2017-05-01 -f compress-snappy.pig -x tez
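Since the input and output directories are passed as parameters, compressing several daily partitions is just a shell loop around the same invocation. A sketch, reusing the paths from the example above (the extra dates are hypothetical):

$ for dt in 2017-05-01 2017-05-02 2017-05-03; do
    pig -p IN_DIR=/dir/large/files/dt=$dt \
        -p OUT_DIR=/tmp/compression/test/dt=$dt \
        -f compress-snappy.pig -x tez
  done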
The output directory may contain multiple .snappy files. They can be safely merged using the following command:
hdfs dfs -cat /tmp/compression/test/dt=2017-05-01/part* | hdfs dfs -put - /data/final/dataset-01/dt=2017-05-01/merged-2017-05-01.snappy
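To sanity-check that the merged file is still a valid compressed stream, hdfs dfs -text (which decompresses files by recognized codec extension) can be used to peek at the first few records:

$ hdfs dfs -text /data/final/dataset-01/dt=2017-05-01/merged-2017-05-01.snappy | head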
You like it? You share it 😉
References
- https://stackoverflow.com/questions/7153087/hadoop-compress-file-in-hdfs