Linux tips for web developers: working with text files, part 2

By Ryan Robinson

Previously, we used the Linux commands cat and grep to search text files for specific strings. Today, we will combine cat and grep with a new command, awk, which lets you do much more with text files, log files, and CSVs (or any text-based spreadsheet).

AWK is actually a programming language specifically designed for processing text. We’ll barely scratch the surface of all of awk’s capabilities, but I’ll show you some ways it can help you in a development environment.

Let’s start with a sample web server access log:


192.168.1.1 - - [29/Aug/2016:20:36:00] "GET /test.html HTTP/1.1" 200 
192.168.1.2 - - [29/Aug/2016:20:38:00] "GET /test.html HTTP/1.1" 200
192.168.1.2 - - [29/Aug/2016:20:42:00] "GET /test.html HTTP/1.1" 200
192.168.1.4 - - [29/Aug/2016:20:53:00] "GET /test.html HTTP/1.1" 200
192.168.1.4 - - [29/Aug/2016:20:54:47] "GET /test.html HTTP/1.1" 200

We learned how to pull only lines that contain certain strings (an IP address, for example) using grep in the previous blog post, but what if we only want a certain piece of information on the line, such as the date? We can do that using awk!


awk '{print $4}' access.log

Which returns:


[29/Aug/2016:20:36:00]
[29/Aug/2016:20:38:00]
[29/Aug/2016:20:42:00]
[29/Aug/2016:20:53:00]
[29/Aug/2016:20:54:47]

We’ve told awk to print the 4th column of text. By default, awk separates columns on whitespace (spaces or tabs). You can also print multiple columns at once. Let’s get the timestamp and the page visited.


awk '{print $4 "\t" $6}' access.log

Which returns:


[29/Aug/2016:20:36:00]	/test.html
[29/Aug/2016:20:38:00]	/test.html
[29/Aug/2016:20:42:00]	/test.html
[29/Aug/2016:20:53:00]	/test.html
[29/Aug/2016:20:54:47]	/test.html

Neat! We’ve printed the 4th and 6th columns. I’ve separated them with a tab by putting \t in double quotes, but the separator is optional and can be any string; with nothing between $4 and $6, awk prints the two values run together.
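To see why $4 and $6 are the right columns, it can help to have awk number the fields of a single line. The sketch below is self-contained: it writes one of the sample log lines to a temporary file (the path comes from mktemp and is only for illustration), lists each field, and then prints the two columns with a comma and with nothing between them.

```shell
# Write one sample line to a temporary file (illustrative path from mktemp).
log=$(mktemp)
printf '%s\n' '192.168.1.1 - - [29/Aug/2016:20:36:00] "GET /test.html HTTP/1.1" 200' > "$log"

# NF is the number of fields on the line; $i is the i-th field.
fields=$(awk '{for (i = 1; i <= NF; i++) print i ": " $i}' "$log")
echo "$fields"

# A comma in print inserts the output field separator (a single space by default).
with_comma=$(awk '{print $4, $6}' "$log")
# Placing the fields side by side with no separator concatenates them.
no_separator=$(awk '{print $4 $6}' "$log")
echo "$with_comma"
echo "$no_separator"

rm -f "$log"
```

The page ends up in field 6 rather than field 5 because the quoted request ("GET /test.html HTTP/1.1") is itself split on its spaces.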

Now, what if we just needed the pages visited by a specific IP address? We can combine cat, grep, and awk to customize our output.


cat access.log | grep "192.168.1.2" | awk '{print $1 " visited " $6}'

Which returns:


192.168.1.2 visited /test.html
192.168.1.2 visited /test.html

We’re displaying the log with cat, piping it to grep to search for our IP, and then piping it to awk to get the desired columns of text. Cool, huh?
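awk can also do the filtering on its own, which saves the extra processes. The following is a sketch rather than a drop-in replacement: it recreates the sample log in a temporary file (an illustrative path from mktemp) and compares the first column to the IP address as an exact string. A grep pattern like "192.168.1.2" would also match an address such as 192.168.1.20, while the comparison below only matches the whole field.

```shell
# Recreate the sample access log in a temporary file (illustrative path).
log=$(mktemp)
cat > "$log" <<'EOF'
192.168.1.1 - - [29/Aug/2016:20:36:00] "GET /test.html HTTP/1.1" 200
192.168.1.2 - - [29/Aug/2016:20:38:00] "GET /test.html HTTP/1.1" 200
192.168.1.2 - - [29/Aug/2016:20:42:00] "GET /test.html HTTP/1.1" 200
192.168.1.4 - - [29/Aug/2016:20:53:00] "GET /test.html HTTP/1.1" 200
192.168.1.4 - - [29/Aug/2016:20:54:47] "GET /test.html HTTP/1.1" 200
EOF

# The condition before the braces selects lines; the action inside them prints.
visits=$(awk '$1 == "192.168.1.2" {print $1 " visited " $6}' "$log")
echo "$visits"

rm -f "$log"
```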

You can also separate columns of text by characters other than spaces. This often comes up when working with comma-separated files such as CSVs. For example:


Peter Griffin, 31 Spooner St, Quahog
Homer Simpson, 732 Evergreen Terr, Springfield
Fred Flintstone, 301 Cobblestone Way, Bedrock

Let’s get a list of only the street addresses from this file.


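awk -F ', ' '{print $2}' spreadsheet.csv

Which returns:


31 Spooner St
732 Evergreen Terr
301 Cobblestone Way

Using the -F flag, we’ve told awk to treat each comma followed by a space as the column separator. With a bare comma (-F ','), each address would keep the leading space that follows the comma in the file.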
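One detail worth checking with -F is whitespace around the separator. As a rough sketch using the sample rows above (written to a temporary file from mktemp, purely for illustration), the commands below print the second field of the first row between angle brackets so that any leading space is visible, first with a bare comma as the separator and then with a comma followed by a space.

```shell
# Recreate the sample CSV in a temporary file (illustrative path).
csv=$(mktemp)
cat > "$csv" <<'EOF'
Peter Griffin, 31 Spooner St, Quahog
Homer Simpson, 732 Evergreen Terr, Springfield
Fred Flintstone, 301 Cobblestone Way, Bedrock
EOF

# Separator is a bare comma: the space after it stays in the field.
bare=$(awk -F ',' 'NR == 1 {print "<" $2 ">"}' "$csv")
# Separator is a comma plus a space: the field starts at the street number.
spaced=$(awk -F ', ' 'NR == 1 {print "<" $2 ">"}' "$csv")
echo "$bare"
echo "$spaced"

rm -f "$csv"
```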

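Finally, let’s split the FirstName LastName column into two separate columns and put them in a new file.


awk -F '[ ,]+' '{print $1 "," $2}' spreadsheet.csv > new.csv

Here the separator is a regular expression that matches any run of spaces and commas, so $1 is the first name and $2 is the last name. With the default separator, $2 would be "Griffin," with the trailing comma still attached.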
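If the address and city should be carried over into the new file as well, one possible approach (an assumption about the desired layout, not something the command above does) is to break up the name field with awk's split function and print every column with a comma as the output separator. The file names below are temporary paths from mktemp, used only for illustration.

```shell
# Recreate the sample CSV and pick an output file (both illustrative paths).
csv=$(mktemp)
new_csv=$(mktemp)
cat > "$csv" <<'EOF'
Peter Griffin, 31 Spooner St, Quahog
Homer Simpson, 732 Evergreen Terr, Springfield
Fred Flintstone, 301 Cobblestone Way, Bedrock
EOF

# -F ', ' splits input on comma-space; OFS="," joins printed fields with commas.
# split() breaks the first field ("First Last") into the array "name" on whitespace.
awk -F ', ' -v OFS=',' '{split($1, name, " "); print name[1], name[2], $2, $3}' "$csv" > "$new_csv"

first_row=$(head -n 1 "$new_csv")
row_count=$(wc -l < "$new_csv" | tr -d ' ')
cat "$new_csv"

rm -f "$csv" "$new_csv"
```

This sketch assumes every name is exactly two words; a middle name or a multi-word surname would need different handling.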

There is much more you can do with awk, but hopefully this has shown you some of the possibilities of this powerful command.