I ran into an interesting problem recently. Looking to do some analysis of the log for a Rails app I maintain, I fetched a copy from the server. Since I don’t have a log rotator running for this app, the log is roughly 340 megabytes, or in other words, a ton of text! I wanted to explore this data with grep and similar tools to see which routes are accessed most, how quickly requests complete, and so on. The log contains data from October 2013 to the present, and since the log is appended to, the most recent data is at the end. That left me with a problem: how do I find the most recent matches for a grep search?
Since this is a very large file, I didn’t want to grep the whole thing and pipe the result to tail (it turns out that on my MacBook Pro even 340 MB of text only takes a few seconds to go through, so it wasn’t actually a big deal). While looking for a solution and talking to a friend of mine about the problem, I wondered: is there a way to pipe data between unix tools and have them only process the data required? As it turns out, there is! The feature we need is called process substitution, and shells such as bash and zsh support it. Process substitution allows us to easily pass data from one command to another command that expects a file as input. Here is a basic example. Say we are trying to find a process by name using:
$ ps aux | grep 'ruby'
We can get the same result using process substitution by changing that to:
$ grep 'ruby' <(ps aux)
Now there are several things to explain about this.
- The “<” should look familiar if you’ve used I/O redirection in the terminal before, but here it is part of the “<(…)” syntax and does not redirect grep’s stdin. Instead, the shell runs the command in parentheses and hands grep a file name from which that command’s stdout can be read.
- One of the features of using process substitution is parallelism. In this example, the ps command is actually run in a child process, so ps and grep run at the same time. This allows for better performance assuming we have multiple CPUs or CPU cores.
- This wiring is accomplished by the shell substituting “<(ps aux)” with the path of a file descriptor (something like /dev/fd/63) that refers to the read end of a unix pipe. The write end of that pipe is connected to the stdout of the ps command running in the child process.
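To make the substitution visible, here is a minimal sketch, assuming bash. The sample text fed to printf is invented, and the exact path printed will vary by system (a /dev/fd entry on many systems, a named pipe on others).

```shell
#!/usr/bin/env bash
# What a command actually receives: the shell replaces <(...) with a path.
fd_path=$(echo <(printf 'ignored\n'))
echo "substituted argument: $fd_path"

# Opening that path reads the inner command's stdout, just like reading a file.
line_count=$(grep -c 'ruby' <(printf 'ruby a\npython b\nruby c\n'))
echo "matching lines: $line_count"
```

The first echo prints a path because echo never opens its argument, while grep opens the path and reads the three sample lines from it.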
How does this allow us to make a solution for the original problem? As a reminder, I wanted to find the most recent matches for some regex pattern without needing to process the entire log file. I can find the 20 most recent request-start lines, each with 5 lines of context, from the Rails log by running:
$ tail -r <(grep --max-count 20 --before-context 5 'Started' <(tail -r production.log))
Starting from the inside and working out, first we start tail reading the production.log file. The “-r” switch tells it to display the input in reverse order by line (this is the BSD/OS X tail; on GNU systems the tac command does the same job). This is what lets us read from the file backwards. Next, that output is handed to grep as a file to read. We tell grep to stop after 20 matches and to give us 5 lines of context from before each match (for reference, the line that starts a new request in a Rails log begins with ‘Started’). Why 5 lines before the match instead of after? Because the input to grep is in reverse order, of course! So that gives us the 5 lines that appear below the match in production.log. Lastly, this output is passed into another instance of tail that reverses it again, which gives us the result we want: the last 20 new requests from the production log, in their original order.
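The same reverse, grep, reverse idea can be tried on a tiny file. This is only a sketch under a few assumptions: it runs in bash, the log content is invented, and reverse_lines is a hypothetical helper (not part of any tool) that uses tac where available and falls back to tail -r.

```shell
#!/usr/bin/env bash
# reverse_lines: hypothetical helper; prefers GNU tac, falls back to BSD tail -r.
reverse_lines() {
  if command -v tac >/dev/null 2>&1; then
    tac "$@"
  else
    tail -r "$@"
  fi
}

# Build a small stand-in for production.log (invented content, 4 requests).
sample_log=$(mktemp)
for i in 1 2 3 4; do
  printf 'Started GET "/page/%s"\n  Processing request %s\n  Completed 200 OK\n' "$i" "$i"
done > "$sample_log"

# The last 2 requests, each with the 2 lines that follow it in the log.
recent=$(reverse_lines <(grep --max-count 2 --before-context 2 'Started' <(reverse_lines "$sample_log")))
printf '%s\n' "$recent"
rm -f "$sample_log"
```

On a real log, where the context around one match does not always touch the next, grep also prints “--” separator lines between groups, so expect those in the output.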
Astute readers will have noticed by now that there is an easier way to write the same command:
$ tail -r production.log | grep --max-count 20 --before-context 5 'Started' | tail -r
Not only does the above give us the exact same output as using process substitution, but the shell also spawns each of those commands in child processes, allowing them to run in parallel too. That doesn’t mean that process substitution is useless, however. The Wikipedia page on process substitution has an example I wish I had known about when I needed it a few days ago. diff takes two files as arguments, and a pipe can supply at most one of them (as stdin), so we can’t wire it up with pipes the way we did in the log example. This is where process substitution comes in. My problem was that I wanted to compare the file lists of two directories to see what was different. Since I didn’t know about process substitution, I resorted to writing each file list to a temporary file and then comparing those. The alternative I now know about would be to run this instead:
$ diff <(find dir_a -type f | cut -f 2- -d '/') <(find dir_b -type f | cut -f 2- -d '/')
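If you’re not familiar with find, “-type f” tells it to look for only files. find prints each path starting with the directory it was given, so the pipe into cut is there to remove that leading directory name from each side. find makes no promise about the order of its output, so if the two listings come back in different orders, adding “| sort” to each side keeps diff from reporting spurious differences.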
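Here is a self-contained sketch of the directory comparison, assuming bash. The directory and file names are made up, and list_files is a hypothetical helper. Instead of cut, it changes into each directory before running find, a swapped-in variation that also works when the directories sit at a nested path, and it sorts each listing.

```shell
#!/usr/bin/env bash
# Create two throwaway directories with slightly different contents (invented names).
work=$(mktemp -d)
mkdir -p "$work/dir_a/sub" "$work/dir_b/sub"
touch "$work/dir_a/shared.txt" "$work/dir_a/sub/only_in_a.txt"
touch "$work/dir_b/shared.txt" "$work/dir_b/sub/only_in_b.txt"

# list_files: hypothetical helper; prints sorted file paths relative to the given directory.
list_files() {
  (cd "$1" && find . -type f | sort)
}

# diff exits with status 1 when its inputs differ, so "|| true" keeps that from looking like a failure.
listing_diff=$(diff <(list_files "$work/dir_a") <(list_files "$work/dir_b") || true)
printf '%s\n' "$listing_diff"
rm -rf "$work"
```

Lines starting with “<” are paths found only in the first directory and lines starting with “>” only in the second; files present in both do not appear.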
The unix shell and the standard unix tools are a major part of what makes me love to use OS X or a Linux distro. There are so many common problems that are really easy to solve with a little knowledge and a few commands such as the comparison of two directories above.