AWK quick reference
Tips
- awk regex
awk '/[0-9]+ /{print}' file.txt
- not beginning with an expression
awk '!/^anexpression/{print}' file.txt
- containing x and not containing y
awk '/x/ && !/y/' file.txt
- Mixing single and double quotes
$ awk 'BEGIN { print "Here is a single quote <'"'"'>" }'
Here is a single quote <'>
References
AWK quick reference
Source: http://www.adminschoice.com/awk
awk is an interpreted language used to process and validate data, generate reports, and experiment with algorithms that can later be ported to other languages.
awk name & History
The name awk comes from the initials of its creators: Alfred V. Aho, Peter J. Weinberger, and Brian W. Kernighan.
The original version of awk was written in 1977 at AT&T Bell Laboratories. Paul Rubin wrote gawk (GNU awk) in 1986.
Using awk
awk can be used directly on the command line, run from a program file passed to awk with the -f option, or run as an executable program file.
awk Command Line
awk can be used on the command line as a tool to process and format data from one or more input files or from the output of another program.
Syntax to use a data file as input and run an awk command on it:
awk '<awk command>' <file>
or
Use the output of another command as input through a pipe:
<command> | awk '<awk command>'
awk variable assignments:
awk reads its input line by line, splits each line into columns (fields), and assigns the line and each column to variables.
$0 = Entire line
$1 = First Column
$2 = Second Column
$3 = Third Column
and so on
A column is a run of characters delimited by whitespace (spaces or tabs, by default). Common Linux/Unix commands such as df, ls, and ps produce columnar output, and awk is very useful for listing and processing column data. The print statement prints variables.
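As a small self-contained illustration of the field variables, the sketch below feeds one line to awk through printf instead of reading a real file. The sample text is made up for the demonstration.

```shell
# Feed a single sample line to awk and print selected fields.
# The input text is made up for this illustration.
printf 'alpha beta gamma delta\n' | awk '{ print $1, $3; print $0 }'
```

The first print shows fields 1 and 3 (alpha gamma); the second shows the whole line through $0.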
Working with awk commands
awk commands are enclosed in single quotes: the first single quote after the awk options starts the awk command, and the matching single quote ends it.
awk Examples
1. Extract the used space column for each mount point from df output
localhost ~]$ df | awk '{ print $3 }'
Used
10831280
0
252
1020
0
176
123767
51118256
66006
A similar operation extracts the 1st and 4th columns from a file called testfile containing the following lines
column1 column2 column3 column4
1111 2222 3333 4444
1111 2222 3333 4444
1111 2222 3333 4444
localhost ~]$ awk '{print $1,$4}' testfile
column1 column4
1111 4444
1111 4444
1111 4444
Separating fields with a comma in print puts the Output Field Separator, OFS, between the output fields. OFS is a special awk variable whose default value is a single space; it can be assigned any other value, such as the pipe symbol, |, in the example below.
localhost ~]$ awk '{OFS="|" ; print $3,$4}' testfile
column3|column4
3333|4444
3333|4444
3333|4444
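OFS can also be set once, before any input is read, rather than inside the per-line action. A minimal sketch, assuming two made-up input lines in place of testfile:

```shell
# Set OFS before the first record is read with -v;
# BEGIN { OFS = "|" } would have the same effect.
# The two sample lines are made up for this illustration.
printf '1111 2222 3333 4444\n5555 6666 7777 8888\n' |
  awk -v OFS='|' '{ print $3, $4 }'
```

This prints 3333|4444 and 7777|8888, and keeps the separator setting apart from the per-record logic.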
awk BEGIN and END statement
A multi-line program uses BEGIN and END statements to execute statements once at the beginning, before any input is read, and once at the end, after all input has been processed.
The basic construction is:
BEGIN { <statement> }
<processing statements>
END { <statement> }
Example:
localhost ~]$ awk 'BEGIN { print "Count Records " }
/4444/ { ++num }
END { print "Recs " num }' testfile
Count Records
Recs 3
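The same BEGIN / main / END structure can be tried without any data file. The sketch below uses three made-up numbers supplied by printf; the variable names sum and n are chosen only for this illustration.

```shell
# BEGIN runs before input, the middle rule runs per record, END runs after input.
# The input numbers are made up for this illustration.
printf '10\n20\n30\n' | awk '
BEGIN { print "Summing values" }
      { sum += $1; n++ }
END   { print "Records:", n, "Total:", sum }
'
```

It prints the heading first, then "Records: 3 Total: 60" after the last line has been read.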
awk program File
awk programs can be written in a file and run with the -f option, or run directly as a script by providing the awk interpreter location in the first line.
Syntax:
$ awk -f <program file> <datafile>
Create an awk program file, chkrec, as below.
#! /bin/awk -f
BEGIN { print "Count Records " }
/4444/ { ++num }
END { print "Recs " num }
Execute file with -f option
localhost ~]$ awk -f chkrec testfile
Count Records
Recs 3
or make it executable & directly execute with data file as argument
localhost ~]$ chmod 755 chkrec
localhost ~]$ ./chkrec testfile
Count Records
Recs 3
Awk Example programs
1. Compare values
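Print the Available, Use% and Mounted columns if the used percentage is more than 60%. Adding 0 to $5 converts a value such as 92% to the number 92, so the comparison is numeric; NR == 1 keeps the header line.
localhost ~]$ df | awk 'NR == 1 || $5 + 0 > 60 { print $4, $5, $6 }'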
Available Use% Mounted
4522188 92% /home
32298 68% /boot/efi
2. Sum operations
Add up the sizes of selected files, /var/log/yum*. The size column of each line is added to the variable n, and the total is printed by the END statement.
localhost ~]$ ls -l /var/log/yum* | awk '{ n += $5 }
END { print "Total bytes =", n }'
Total bytes = 63665
3. if else conditions
Check used space: print ok in front of the line if Use% is 60% or less, and Problem if it is more than 60%. Adding 0 to $5 forces a numeric comparison; the header line evaluates to 0 and is therefore marked ok.
df | awk '{ if ($5 + 0 > 60) print "Problem", $0
else
print "ok", $0
}'
ok Filesystem 1K-blocks Used Available Use% Mounted on
ok /dev/mapper/fedora-root 51475068 10831316 38005928 23% /
ok devtmpfs 1956180 0 1956180 0% /dev
ok tmpfs 1966388 252 1966136 1% /dev/shm
ok tmpfs 1966388 992 1965396 1% /run
ok tmpfs 1966388 0 1966388 0% /sys/fs/cgroup
ok tmpfs 1966388 176 1966212 1% /tmp
ok /dev/sda9 487652 123767 334189 28% /boot
Problem /dev/mapper/fedora-home 58642620 51118476 4522188 92% /home
Problem /dev/sda2 98304 66006 32298 68% /boot/efi
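One detail worth checking in comparisons like the one above: a field such as 7% does not look like a number, so awk compares it with 60 as a string. The sketch below, using two made-up percentage values, contrasts the string comparison with a numeric one forced by adding 0.

```shell
# "7%" is not numeric, so $1 > 60 is a string comparison ("7" sorts after "6").
# Adding 0 converts the leading digits to a number and forces a numeric comparison.
# The input values are made up for this illustration.
printf '7%%\n92%%\n' | awk '{
    s = ($1 > 60)     ? "over" : "under"   # string comparison
    n = ($1 + 0 > 60) ? "over" : "under"   # numeric comparison
    print $1, "string:", s, "numeric:", n
}'
```

For 7% the string comparison reports "over" while the numeric one reports "under"; for 92% both agree.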
4. For loop
Print the numbers 1 to 5 using a for loop by providing an initial value, a final value and an increment.
localhost ~]$ awk 'BEGIN { for (i = 1; i <= 5; ++i) print i }'
1
2
3
4
5
5. awk Arrays , creating and sorting
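Create an array by assigning values to array indexes:
A["ZZ"] = "Last"
A["DD"] = "Middle"
A["AA"] = "First"
Sorting arrays (asort and asorti are gawk extensions)
asort – Array Sort by value
asort(A) sorts the values and replaces the original indexes with the integers 1 to n:
A[1] = "First"
A[2] = "Last"
A[3] = "Middle"
asorti – Array Sort by Indices
asorti(A) sorts the indexes; the sorted indexes become the values, again indexed 1 to n:
A[1] = "AA"
A[2] = "DD"
A[3] = "ZZ"
Pass a second array, as in asort(A, B) or asorti(A, B), to store the result in B and leave A unchanged.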
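Where gawk is not available, a common workaround is to print the index and value pairs and let the external sort command order them. A minimal sketch of that approach, reusing the array contents from the example above:

```shell
# Portable alternative to sorting by index inside awk:
# emit "index value" pairs and let sort(1) order them.
awk 'BEGIN {
    A["ZZ"] = "Last"; A["DD"] = "Middle"; A["AA"] = "First"
    for (k in A)            # the order of a for-in loop is unspecified
        print k, A[k]
}' | sort
```

The output lists AA First, DD Middle, ZZ Last, and the array itself is left untouched.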
awk regular expressions
gsub
Global substitution for the pattern in target
gsub(regexp, replacement [, target])
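A short sketch of gsub in use, applied to one of the rpm names that appears later in this page. The replacement character _ is an arbitrary choice for the illustration.

```shell
# gsub returns the number of substitutions and edits the target ($0 by default) in place.
printf 'glx-utils-8.1.0-4.fc20.x86_64\n' | awk '{ n = gsub(/-/, "_"); print n, $0 }'
```

This prints 3 followed by glx_utils_8.1.0_4.fc20.x86_64: every dash is replaced, not only the first.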
gensub()
It is a general substitution function (a gawk extension) that provides more features than the standard sub() and gsub() functions, such as the ability to refer to components of the regexp in the replacement text. It returns the modified string and leaves the target unchanged.
localhost ~]$ df | awk '{ print gensub(/%/, " Percent", 1) }'
Filesystem 1K-blocks Used Available Use Percent Mounted on
/dev/mapper/fedora-root 51475068 10831316 38005928 23 Percent /
devtmpfs 1956180 0 1956180 0 Percent /dev
/dev/sda9 487652 123767 334189 28 Percent /boot
/dev/mapper/fedora-home 58642620 51118476 4522188 92 Percent /home
/dev/sda2 98304 66006 32298 68 Percent /boot/efi
index(in, find)
Find the position of a substring within a string (0 if it is not found).
localhost ~]$ awk 'BEGIN { print index("SomeLongString", "tr") }'
10
length([string])
Find the length of a string; the example below prints the length of each line.
localhost ~]$ awk '{ print length($0) }' testfile
31
29
29
29
match(string, regexp [, array])
Match lowercase letters in the file and print each whole line that contains one.
localhost ~]$ awk 'match($0, /[a-z]/) { print $0 }' testfile
column1 column2 column3 column4
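match also sets the built-in variables RSTART and RLENGTH, which can be combined with substr to extract the matched text. A minimal sketch, using one rpm name from this page and a version-number pattern chosen for the illustration:

```shell
# RSTART is the position of the match, RLENGTH its length.
# The pattern looks for digits separated by dots, e.g. a version number.
printf 'libgcc-4.8.3-7.fc20.i686\n' |
  awk 'match($0, /[0-9]+(\.[0-9]+)+/) { print RSTART, RLENGTH, substr($0, RSTART, RLENGTH) }'
```

Here the match starts at position 8 and is 5 characters long, so the line printed is "8 5 4.8.3".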
split(string, array [, fieldsep [, seps ] ])
Split a list of rpm names at dashes.
Contents of the file rpms:
libhbalinux-1.0.16-2.fc20.x86_64
gucharmap-3.10.1-1.fc20.x86_64
libplist-1.11-2.fc20.x86_64
libgcc-4.8.3-7.fc20.i686
glx-utils-8.1.0-4.fc20.x86_64
vlgothic-fonts-20140801-1.fc20.noarch
Split along dashes, keep the pieces in the array ary, and print selected elements; the separators are kept in an array called seps (the fourth argument is a gawk extension).
localhost ~]$ cat rpms | awk '{split($0, ary, "-", seps) ; print ary[1],ary[2],ary[3]}'
libhbalinux 1.0.16 2.fc20.x86_64
gucharmap 3.10.1 1.fc20.x86_64
libplist 1.11 2.fc20.x86_64
libgcc 4.8.3 7.fc20.i686
glx utils 8.1.0
vlgothic fonts 20140801
Print both arrays: ary and the contents of the separator array seps.
localhost ~]$ cat rpms | awk '{split($0, ary, "-", seps) ; print ary[1],ary[2],ary[3],seps[1],seps[2]}'
libhbalinux 1.0.16 2.fc20.x86_64 - -
gucharmap 3.10.1 1.fc20.x86_64 - -
libplist 1.11 2.fc20.x86_64 - -
libgcc 4.8.3 7.fc20.i686 - -
glx utils 8.1.0 - -
vlgothic fonts 20140801 - -
sub(regexp, replacement [, target])
Substitute the first match of a pattern with a string; in the example below, replace the first dash followed by a digit with -->
localhost ~]$ cat rpms | awk 'sub(/-[0-9]/, " --> ")'
libhbalinux --> .0.16-2.fc20.x86_64
gucharmap --> .10.1-1.fc20.x86_64
libplist --> .11-2.fc20.x86_64
libgcc --> .8.3-7.fc20.i686
glx-utils --> .1.0-4.fc20.x86_64
vlgothic-fonts --> 0140801-1.fc20.noarch
substr(string, start [, length ])
Get a substring of a given length starting at a given position.
Let's use this file having two fields
localhost ~]$ cat nums
123456789 abcdef
Print two characters from the first field, starting at the 3rd position.
localhost ~]$ awk '{print substr($1,3,2) }' nums
34
Print two characters from the second field, starting at the 3rd position.
localhost ~]$ awk '{print substr($2,3,2) }' nums
cd
tolower(string)
Convert the letters in a string to lower case
tolower("MiXeD cAsE 123") returns "mixed case 123".
Changing an entire file to lowercase in the example below
localhost ~]$ cat letters
This is Just Some Random Text Here ..
localhost ~]$ awk '{ print tolower($0)}' letters
this is just some random text here ..
toupper(string)
Convert the letters in a string to upper case
localhost ~]$ awk '{ print toupper($0)}' letters
THIS IS JUST SOME RANDOM TEXT HERE ..
The operation can be applied to selected fields; to print only the first field in upper case:
awk '{ print toupper($1)}' letters
THIS
Built in Operational Variables
IGNORECASE <digit>
If IGNORECASE is nonzero or non-null, then all string comparisons and all regular expression matching are case-independent.
OFS
The Output Field Separator. It is output between the fields printed by a print statement. Its default value is " ", a string consisting of a single space.
localhost ~]$ awk '{ OFS="|" ; print $1,$2,$3,$4}' testfile
column1|column2|column3|column4
1111|2222|3333|4444
1111|2222|3333|4444
1111|2222|3333|4444
The input field separator, FS, is a separate variable and can also be set with the -F option. The following example sets the field separator to : and prints the first field.
awk -F: '{ print $1}' /etc/passwd
ORS
The Output Record Separator determines how output records (lines) are separated. Its default value is "\n", the newline character.
Let's use the rpms file from earlier to print lines separated by ||.
localhost ~]$ awk '{ ORS="||" ; print $0}' rpms
libhbalinux-1.0.16-2.fc20.x86_64||gucharmap-3.10.1-1.fc20.x86_64||libplist-1.11-2.fc20.x86_64||libgcc-4.8.3-7.fc20.i686||glx-utils-8.1.0-4.fc20.x86_64||vlgothic-fonts-20140801-1.fc20.noarch||
NF
Number of Fields in the current record, separated by whitespace or by the FS value.
Count the number of fields separated by :
localhost ~]$ awk -F: '{ print $0,NF}' /etc/passwd
root:x:0:0:root:/root:/bin/bash 7
bin:x:1:1:bin:/bin:/sbin/nologin 7
daemon:x:2:2:daemon:/sbin:/sbin/nologin 7
adm:x:3:4:adm:/var/adm:/sbin/nologin 7
lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin 7
sync:x:5:0:sync:/sbin:/bin/sync 7
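Because NF holds the field count, $NF refers to the last field and $(NF - 1) to the one before it. A minimal sketch, using the root entry from the listing above as inline input:

```shell
# $NF is the last field; $(NF - 1) is the second-to-last.
printf 'root:x:0:0:root:/root:/bin/bash\n' | awk -F: '{ print NF, $NF, $(NF - 1) }'
```

This prints "7 /bin/bash /root", which is handy when the number of fields varies from line to line.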
RS
The input Record Separator. The default is a newline, but it can be changed to other values depending on the input file.
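One commonly used setting is an empty RS, which selects paragraph mode: records are separated by blank lines. The sketch below assumes a made-up two-paragraph input and sets FS to a newline so that each line of a paragraph becomes one field.

```shell
# RS = "" selects paragraph mode: blank lines separate records.
# FS = "\n" then makes each line of a paragraph one field.
# The input text is made up for this illustration.
printf 'name: a\nsize: 1\n\nname: b\nsize: 2\n' |
  awk 'BEGIN { RS = ""; FS = "\n" } { print NR ": " $1 " / " $2 }'
```

Each paragraph is read as a single record, so the output is "1: name: a / size: 1" followed by "2: name: b / size: 2".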
ARGC, ARGV
The command-line arguments available to awk programs are stored in an array called ARGV.
ARGC is the number of command-line arguments.
ARGV holds the argument values and is indexed from 0 to ARGC - 1.
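A quick way to inspect these variables is to loop over ARGV in a BEGIN block. In the sketch below, first and second are hypothetical arguments, not real files; since the program has only a BEGIN rule, awk does not try to open them.

```shell
# List the command-line arguments from a BEGIN block.
# ARGV[0] holds the program name, which varies by implementation, so start at 1.
# "first" and "second" are placeholder arguments for this illustration.
awk 'BEGIN { for (i = 1; i < ARGC; i++) print i, ARGV[i] }' first second
```

This prints "1 first" and "2 second"; ARGC is 3 here because ARGV[0] is counted as well.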
AWKPATH
gawk gets its search path for program files from the AWKPATH environment variable. If that variable does not exist, or if it has an empty value, gawk uses a default path, '.:/usr/local/share/awk'.