Almost every developer interacts with some form of a Unix environment on a daily basis, either when working on a development machine (usually Mac or Linux) or when interacting with a server (usually Linux). Despite that, many programmers are not proficient with the best way of interacting with such systems - the Unix shell.
For a long time, I hated shell scripting as well.
After working with languages like Python or C, the shell seemed like a complete mess.
When starting out, I ran into problems such as errors caused by spaces in variable assignment (A=1 is OK while A = 1 is not), the dollar sign appearing or disappearing depending on the context (A=1 vs echo $A), and the convoluted way of expressing control flow.
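To make these quirks concrete, here is a minimal snippet you can paste into any POSIX shell:
A=1          # assignment: no spaces around the = are allowed
echo $A      # expansion: reading the variable requires a $ prefix
A = 1        # error: the shell tries to run a program named A with arguments = and 1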
I avoided Unix shell scripting as much as possible.
In recent years, however, I found a new appreciation for the shell, so much so,
that it is now my favorite way of accomplishing tasks.
I have discovered that the strength of the shell is not as a
programming language by itself, but as a glue between other programs.
To that end, the shell is far more convenient than any other programming language.
The thing that made the difference for me
was becoming familiar with the Unix philosophy and realizing how it's applied
in practice.
The Unix philosophy originated with Ken Thompson, and was famously summarized by Doug McIlroy: "Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface."
The mechanism that makes this work is the pipe. Each program reads its input from standard input (a.k.a. stdin) and writes its output to standard output (a.k.a. stdout).
For example:
ls dir_of_stuff | grep "abc" | sort
When run, this command instantiates three programs: ls prints the files in the directory dir_of_stuff, grep filters for the names containing abc, and sort sorts the resulting records in ascending order.
None of the participating programs is aware of the existence of the others, only of the text stream they receive as input.
This type of modularity allows for decade-old programs
written in C to work seamlessly with modern programs written in Go.
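Any program that reads stdin and writes stdout can join a pipeline, no matter what language it is written in. As a sketch (my_filter is a hypothetical program of yours, written in any language you like):
ls dir_of_stuff | ./my_filter | sort   # my_filter just reads lines from stdin and writes lines to stdout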
I like to think of the programs as verbs, their arguments as adjectives, and the data as the nouns. Each component is meaningless on its own, but combined they can express anything.
The POSIX standard, which defines what a Unix system implements, comes with many built-in programs. The programs ls, grep, and sort shown above are some of them.
As a demonstration of the usefulness of the shell,
I'll present
a pipeline I frequently use for file manipulation.
I'll be using the commands find, awk, and xargs, where:
find - Defines what to operate on.
awk - Defines what to do.
xargs - Defines how to do it.
The find program is used to locate files that satisfy a query, starting from a root path.
The command looks like this:
find <root path> [query]
For example, given the following files:
/root_path/
  - file1.txt
  - file2.txt
  - file3.jpg
  - sub_path/
    - file4.txt
The query
find /root_path -name "*.txt"
will print:
/root_path/file1.txt
/root_path/file2.txt
/root_path/sub_path/file4.txt
Some of the query arguments I use frequently are:
-name <pattern> - A pattern of the searched file name.
-type [f/d/l] - Find only files, directories, or symbolic links.
-mindepth / -maxdepth - Control how many levels below the root directory the search descends.
The command find /root_path -mindepth 2
will print:
/root_path/sub_path/file4.txt
The command find /root_path -mindepth 1 -maxdepth 1 -name "*.txt"
will print:
/root_path/file1.txt
/root_path/file2.txt
find also has an -exec option which lets you perform an action on the files you find. I intentionally do not discuss it here, since I believe it breaks the Unix principle of doing one thing - finding files. Instead, I prefer to use xargs for that purpose.
xargs is a useful tool for composing commands together, although it is hard to explain what it does in isolation. What xargs does is best explained through an example.
Say you have a directory with the following files in it:
a.txt b.txt ab.txt
If you'd like to print the files sorted, you can run
find . | sort
which will work as expected, since sort gets its input through stdin.
However, if you would like to remove all the files that find prints, the following won't work:
find . -mindepth 1 -maxdepth 1 | rm
This won't work because the rm command does not get the files to remove through stdin but as arguments, i.e.:
rm file1 file2 ...
So what do you do when the files arrive through standard input but the program expects them as arguments? That's exactly the gap xargs fills: it turns its stdin into arguments for the command you give it. The following command:
find . -mindepth 1 -maxdepth 1 | xargs rm
is equivalent to:
rm a.txt b.txt ab.txt
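One caveat worth knowing: by default xargs splits its input on whitespace, so file names containing spaces would break the command above. If your find and xargs support the widespread -print0 and -0 options (both GNU and BSD versions do), the null-delimited form is safer:
find . -mindepth 1 -maxdepth 1 -print0 | xargs -0 rm   # file names are separated by NUL bytes instead of whitespace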
You may also define the command as a pattern through the -I argument. If, for example, you want to add the extension .ext to all the files in the directory, you could run:
find . -type f | xargs -I {} mv {} {}.ext
where -I {} means {} will be replaced by the input line in the given pattern.
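A quick way to see the substitution happen is to echo the constructed command instead of running it (the input lines here are made up for the demo):
printf 'a.txt\nb.txt\n' | xargs -I {} echo "mv {} {}.ext"
# prints: mv a.txt a.txt.ext
#         mv b.txt b.txt.ext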
Another useful feature of xargs is running commands on multiple processes using the -P<#processes> flag.
For example, we will rescale 500 images using ImageMagick, a command-line tool for image processing. The command takes about 30 seconds on my machine:
time find . -name "*.jpg" | xargs -I {} magick {} -resize '50%' smallres/{}
67.70s user 8.57s system 242% cpu 31.433 total
Using -P4 reduced the running time to 20 seconds:
time find . -name "*.jpg" | xargs -P4 -I {} magick {} -resize '50%' smallres/{}
74.93s user 8.33s system 389% cpu 21.358 total
Not only does this feature allow the utilization of multiprocessing with existing programs, but it also affects how I write my own programs. I usually find it unnecessary to use multiprocessing in the program code (which can be a huge pain in some languages, such as C). I can leave the multiprocessing to xargs, thus making my programs smaller and simpler.
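As a sketch of that workflow, suppose process_one is a hypothetical single-threaded program of yours that handles exactly one file per invocation; all the parallelism then comes from xargs:
find inputs -type f | xargs -P8 -I {} ./process_one {}   # 8 worker processes, no threading code in process_one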
Although the xargs command allows the definition of a command through a pattern, it is rather limited. As before, we want to follow the Unix philosophy of using programs that do one thing, therefore we will use awk to construct the commands.
awk is a domain-specific programming language focused on the processing of text streams. Although learning a new programming language can be discouraging, it shouldn't be in this case. awk is small and simple, and you can learn its most common usages within the next short paragraph.
The structure of an awk invocation is as follows:
<previous command> | awk '{awk_command1; awk_command2; ...}'
The commands inside the {} block are run for every row fed by the pipe. Each line is automatically split into fields, where $0 represents the whole line and $1, $2, ... represent the first field, the second field, and so on. The output is handled by the print and printf functions, and there are built-in string manipulation functions such as split.
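For a quick taste, here is a toy invocation of each of those features (the echo inputs are made up for the demo):
echo "one two three" | awk '{print $3, $1}'            # prints: three one
echo "a/b/c" | awk '{split($0, p, "/"); print p[2]}'   # prints: b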
As a more realistic example, the command ps aux prints the running processes on the computer in the following format:
USER PID %CPU %MEM COMMAND
root 1 0.0 0.0 runit
root 2 0.0 0.0 [kthreadd]
root 3 0.0 0.0 [rcu_gp]
root 4 0.0 0.0 [rcu_par_gp]
root 6 0.0 0.0 [kworker/0:0H-events_highpri]
root 8 0.0 0.0 [mm_percpu_wq]
root 9 0.0 0.0 [rcu_tasks_kthre]
root 10 0.0 0.0 [rcu_tasks_rude_]
root 11 0.0 0.0 [rcu_tasks_trace]
We can print the PIDs of all python processes:
ps aux | grep python | awk '{print $2}'
803
812
815
816
824
865
869
885
898
907
If you would like to kill these processes, you can run
ps aux | grep python | awk '{print $2}' | xargs kill -9
Now we can use the complete find | awk | xargs pipeline, in which awk will be used to construct the commands to be run. We will run the commands using
xargs -I {} sh -c {}
which means "run the given line as a shell command" (with any of the examples, the -P option can be used to utilize multiple cores).
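To see the sh -c trick in isolation, feed it a couple of hand-written command lines (made up for the demo):
printf 'echo one\necho two\n' | xargs -I {} sh -c {}
# runs sh -c "echo one" and then sh -c "echo two", printing:
# one
# two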
I will finish with some examples of this pipeline.
Directories left and right both contain videos with the same names. The following will produce a side-by-side comparison of each video in the out directory, using the video editing tool FFmpeg. The command to stack two videos side by side looks like this:
ffmpeg -i vid1.mp4 -i vid2.mp4 -filter_complex hstack vid_out.mp4
We will use the following to run this command on all the videos:
find left -type f -name "*.mp4" | \
awk '{l = r = o = $0; sub("left", "right", r); sub("left", "out", o); print l, r, o}' | \
awk '{printf "ffmpeg -i %s -i %s -filter_complex hstack %s\n", $1, $2, $3 }' | \
xargs -I {} sh -c {}
Try to read through the code and see if you can figure out what it does before reading the explanation:
1. Find all the videos in left.
2. Build the three paths involved: the left video, the matching right video, and the output video (sub in awk substitutes the first occurrence of a substring).
3. Construct the ffmpeg commands.
When constructing a pipeline like this, I debug it by adding each stage individually and observing the output.
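In practice that means dropping the final xargs stage and eyeballing the generated commands first, for example:
find left -type f -name "*.mp4" | \
awk '{l = r = o = $0; sub("left", "right", r); sub("left", "out", o); print l, r, o}' | \
head -3   # inspect the first few generated path triples before wiring up xargs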
Copy all the images in a directory dir to a directory dir_flat, so that all the images end up at the root:
find dir -name "*.jpg" | \
awk '{split($0, p, "/"); printf "cp %s dir_flat/%d_%s\n", NR, $0, p[length(p)] }' | \
xargs -I {} sh -c {}
Notice how the row number (automatically stored in NR) is prepended to the file name to deal with duplicate names.
The Unix shell is more than just another development tool. It can completely change the way you view computing. As a user, you'll prefer small, constrained programs that perform single tasks well over monolithic ones that do many things badly. As a developer, you'll make your own programs smaller and give them a text stream interface, letting them take advantage of the wonderful Unix ecosystem. Learning and using the shell is a very rewarding experience; I continuously find better ways to do my tasks. You likely have access to a terminal right now. Open it and experience the magic.
If you have any comments, I'll be happy to hear from you.