Monday, March 21, 2011

Daily wtf

My wife is taking a shell programming class, and one of the questions on the homework was:

How would you use fgrep to replace this command? grep '[a-z]\{9\}'

Now, fgrep is grep without the regular expressions. So this seems to be a very strange question. Nevertheless, let us assume that there is some hidden wisdom the problem is designed to help us uncover.

Approach 1: can we turn regexes back on with a flag? The manpage is unclear on this point; it tells us that fgrep is equivalent to grep -F, but doesn't specify what happens if we were to give, say, grep -F -E. At least in ubuntu lucid, it produces "grep: conflicting matchers specified".

Trying 'fgrep -E' produces 'fgrep: invalid option -- 'E'', which was my first hint that fgrep is /not/ in fact exactly equivalent to 'grep -F'. /bin/fgrep is in fact its own binary, about half the size of /bin/grep.

Now, the paragraph talking about fgrep in the manpage looks like this:

       In  addition,  three  variant  programs  egrep,  fgrep  and  rgrep  are
       available.   egrep  is  the  same  as  grep -E.   fgrep  is the same as
       grep -F.  rgrep is the same as grep -r.  Direct  invocation  as  either
       egrep  or  fgrep  is  deprecated,  but  is provided to allow historical
       applications that rely on them to run unmodified.

Note the last sentence, which tells us that if we call grep as fgrep, it should behave like fgrep. This turns out not to be the case. Symlinking or simply copying grep to fgrep doesn't appear to change grep's behavior at all. So that's a second bug in the manpage.

Approach 2: Perhaps the instructor really does want us to avoid regexes entirely. The manpage has this as the description of -F:
        -F, --fixed-strings
              Interpret PATTERN as a  list  of  fixed  strings,  separated  by
              newlines,  any  of  which is to be matched.  (-F is specified by
              POSIX.)

Separated by newlines? That's a little weird. I can't actually figure out how to put a newline into a command line argument. Inserting a literal one with ^V doesn't seem to do the trick, nor do things like fgrep 'a\nb'. So as far as I can tell, that's another bug, although fgrep -e pattern1 -e pattern2 does seem to work.

The other way to specify multiple patterns is to put them in a file, and then use fgrep -f pattern-file.

So we could, in theory, enumerate all the possible matching strings in a file and thus avoid the regex. A file with all 9-letter lowercase words, one per line, has 26^9 = 5.4T lines of 10 characters each, for a total of 54TB. Newegg currently has a 2TB disk for $69.99, so not counting tax, shipping, or filesystem overhead, a cool $1900.05 worth of disk would be enough to hold the pattern file. There do also appear to be several options for transparently compressed filesystems in linux, which makes the problem much more manageable.

You could also give fgrep -f - and let it read the patterns from stdin, piping in a list generated on the fly (although it's not clear if this would violate the terms of the question, since it requires more than just fgrep, unless you're a very determined typist).

The next problem is what fgrep does with the patterns. If it copies them into RAM or puts them into some sort of data structure (like, say, a search tree), you're going to need a whole lot of swap space. But this limitation might in turn be overcome by compressed swap.

And of course all of this assumes a 64-bit OS, and an implementation of fgrep that can handle trillions of patterns.

So my answer to the question How would you use fgrep to replace this command? grep '[a-z]\{9\}' would have to be:

Enumerate all 9-letter lowercase words to a 54 terabyte text file called 'patterns'. Then run fgrep -f patterns, ensuring there's enough memory available to store the set of matching patterns.

No comments: