Sunday, October 26, 2008

Friday, October 24, 2008

Free the TV spectrum

Google has recently been promoting a very worthy cause: Free the Airwaves. With the move to digital TV this February, there will be really valuable spectrum available, and it's important that ordinary people get to use it, not just big companies.

If you've ever used a laptop with wireless internet access, you've used free spectrum. The spectrum coming up for grabs is much more useful than the spectrum used for wifi.

Please consider signing the petition and calling your congressperson to support this measure. The FCC will be voting on it November 4th, and the NAB is fighting hard to keep it from happening.

This is a cause that will make a really significant difference in your own life: it'll affect the kinds of gadgets you use in a few years, how easily they connect to each other, and whether you have to pay for the "privilege".

Saturday, October 18, 2008

Tarring up lots of duplicate files

Lately I've been building tar archives with lots of directories that have almost exactly the same contents (and it'd be impractical to reduce them to forests of symlinks). I was disappointed at how poorly the tarfiles compress, considering how many copies of the exact same data are there. Both gzip and bzip2 do poorly. (And zip is even worse, since it only compresses one file at a time.)

So I figured I'd give the compression algorithms a little help, by feeding in the identical files in groups. This works fine since tar doesn't care about the file order being different (even if the files are in different subdirectories).

1. I ran find -not -type d on the root directory of what I wanted to tar up, to get a list of files. The -not -type d is important: if you tell tar to pack up both foo and foo/bar, it'll recurse into foo/ and pick up bar, then pack up foo/bar again, and you'll end up with two copies of foo/bar in your tarball. Weird, eh? (Of course, -not -type d also matches symlinks and other non-file types, so beware of those.)

2. I wrote a little perl script to sort the list of files by their basenames (the "baz.txt" in "/foo/bar/baz.txt"), and put the list in /tmp/files.

3. I ran tar -jcvf foo.tar.bz2 -T /tmp/files

So, all the files with the same basename got tarred up together regardless of what directory they were in, and that was enough to give bzip2 the hint that they were thus very compressible.
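A rough shell equivalent of steps 1 through 3, as a sketch rather than the perl script I actually used. The function name list_by_basename is made up for illustration, and it assumes no path contains a tab or a newline:

```shell
# Print every non-directory path under "$1", ordered by basename, so that
# same-named files from different directories end up next to each other.
# Assumption: no path contains a tab or a newline.
list_by_basename() {
    find "$1" ! -type d |                    # "!" is the portable spelling of -not
        awk -F/ '{ print $NF "\t" $0 }' |    # prefix each path with its basename
        LC_ALL=C sort |                      # sort on basename, then on full path
        cut -f2-                             # drop the basename prefix again
}

# Usage (paths are examples):
#   list_by_basename ./tar-me-up > /tmp/files
#   tar -jcvf foo.tar.bz2 -T /tmp/files
```

Since the list contains no directories, empty directories won't make it into the tarball this way.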

How well did it work? Pretty well, at least for my present dataset:

In all I had 6100 files totalling 177MB (so an average of 30k/file), with each file duplicated about 100 times (so, 60 or so unique files). Doing an ordinary tar -jcvf produced a 94MB .tar.bz2 file, so about 2:1 compression. gzip fared almost as well, at 96MB.

After sorting the file list, gzip got down to 83MB, but bzip2 did much better: 17MB! So, it went from 2:1 compression to 10:1, just by sorting the file list.

For comparison, I also built a tarball with essentially one copy of each of the 60 or so unique files, and came up with about a megabyte. So, we should have been able to do almost 200:1, but 10:1 is still a big improvement over 2:1.
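To double-check those ratios, here's the arithmetic as a tiny shell function. The name ratio is made up; the sizes are the ones quoted above, in MB, and "about a megabyte" is rounded to 1:

```shell
# Print the compression ratio original:compressed, to one decimal place.
ratio() {
    awk -v orig="$1" -v packed="$2" 'BEGIN { printf "%.1f:1\n", orig / packed }'
}

ratio 177 94   # plain bzip2      -> 1.9:1
ratio 177 96   # plain gzip       -> 1.8:1
ratio 177 83   # sorted, gzip     -> 2.1:1
ratio 177 17   # sorted, bzip2    -> 10.4:1
ratio 177 1    # unique files only -> 177.0:1
```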

Of course, this trick only works if the identical files have the same filenames. But here's another trick: instead of sorting the files by base filename, sort by md5sum:

find ./tar-me-up/ -type f -exec md5sum \{\} \; | sort | cut -f3- -d' ' >/tmp/files

It's slower, since it has to read and hash all the files, but now all the identical files will be together regardless of filename.
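A variant of that one-liner wrapped in a function, as a sketch. The name list_by_checksum is made up. It uses -exec ... + so that find hands md5sum many files per invocation instead of starting one process per file, and it assumes the usual md5sum output layout (a 32-character hash, two separator characters, then the path) and paths without newlines or backslashes:

```shell
# Print every regular file under "$1", ordered by MD5 checksum, so that
# files with identical contents are adjacent whatever they are called.
# Assumption: md5sum prints "<32 hex chars><2 separator chars><path>".
list_by_checksum() {
    find "$1" -type f -exec md5sum {} + |   # "+" batches many files per md5sum run
        LC_ALL=C sort |                     # identical hashes sort together
        cut -c35-                           # keep only the path, spaces included
}

# Usage (paths are examples):
#   list_by_checksum ./tar-me-up > /tmp/files
#   tar -jcvf foo.tar.bz2 -T /tmp/files
```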

Now, what'd be really great for me is if tar could run those checksums internally, and simply point to the contents of an earlier file when it encounters a duplicate. That would be pretty simple, but would break tar's streaming nature, and get complicated when it comes to modifying existing tarballs. Really it's the compression algorithm's problem to notice redundancy like that; maybe later I'll play around with bzip2 and see if it can be persuaded to do a better job of noticing patterns.
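Finding which files are copies of an earlier one is the easy part, and can be done outside of tar. A sketch, with a made-up function name, that makes the same assumptions about md5sum's output layout as before and also assumes paths without tabs, newlines, or backslashes:

```shell
# For each file under "$1" whose contents match an earlier file, print
# "<duplicate path><TAB><first path with the same contents>".
# "Earlier" here just means first in sorted order, not first on disk.
list_duplicates() {
    find "$1" -type f -exec md5sum {} + |
        LC_ALL=C sort |
        awk '{
            sum  = substr($0, 1, 32)     # the hash
            path = substr($0, 35)        # everything after the separator
            if (sum == prev) {
                print path "\t" first    # same hash as the previous line
            } else {
                prev  = sum              # new hash: remember its first file
                first = path
            }
        }'
}
```

It only compares checksums, so it trusts MD5 not to collide; a careful version would confirm each match with cmp.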

Friday, October 17, 2008

Gigabit ethernet port won't connect at 1000Mbit

A while back I spent a few hours trying to figure out why my workstation was only connecting to the LAN at 100Mbit/sec. mii-diag unhelpfully reports 100baseTx-FD even when mii-tool tells me I'm at 1000baseT-HD, which is probably a bug in mii-diag, but in this case I really was connecting at 100Mbit.

Anyway, today it happened again. Turns out it was the cable: gigabit ethernet uses all 8 wires in the cable, whereas 100baseT uses only 4. So I bet the cable I was using had a flaky connection on one of the wires that only gigabit uses.
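Besides mii-tool, Linux kernels that expose /sys/class/net/<interface>/speed report the negotiated speed there in Mbit/sec (worth checking that your kernel has it). A small sketch that flags a link slower than expected; link_speed_ok and the NET_SYSFS override are made-up names, the latter only there so the function can be tried without real hardware:

```shell
# Report the negotiated speed of interface "$1" and return non-zero when
# it is below the wanted speed "$2" (default 1000, in Mbit/sec).
# NET_SYSFS is a hypothetical override for the sysfs directory.
link_speed_ok() {
    iface=$1
    want=${2:-1000}
    file="${NET_SYSFS:-/sys/class/net}/$iface/speed"
    speed=$(cat "$file" 2>/dev/null) || {
        echo "$iface: speed unavailable"
        return 2
    }
    if [ "$speed" -lt "$want" ]; then
        echo "$iface: linked at ${speed}Mbit, expected ${want}Mbit (try another cable)"
        return 1
    fi
    echo "$iface: linked at ${speed}Mbit"
}

# Usage (the interface name is an example):
#   link_speed_ok eth0
```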

Thursday, October 16, 2008

Sunday, October 12, 2008

TV Tropes Wiki

I just discovered the TV Tropes Wiki, a very amusing encyclopedia of literary conventions.

Bechdel-compliant literature (also, the origins of Istanbul)

Tonight I heard a really fun quickstep version of Istanbul (not Constantinople). So I came home and looked it up on Wikipedia, and it turns out it's been done by quite a lot of people: The Four Lads did it first, then Frankie Vaughan, Caterina Valente, Santo & Johnny (#141), Bette Midler, an avant-garde representation by The Residents, and more recently, Lee Press-on and the Nails.

Amusingly, I don't think any of those versions is the one I heard tonight.

But that's not why I brought you all here today. A while back I heard about The Bechdel Test, and recently it came up again in several conversations. While the actual list of Bechdel-compliant movies isn't particularly enlightening, it's a really interesting thought experiment to see how many books and movies you can list that follow Bechdel's Rule, and to notice (and wonder why) when things don't. Oh, here's the original article I read that introduced me to Bechdel, in which an ascending screenwriter learns that Bechdel's Rule must not be followed, and having considered the options, decides to leave screenwriting behind.

Saturday, October 11, 2008

Soros on the financial crisis

I thought this was a fascinating perspective on the economic situation:

George Soros and Bill Moyers

I love that they included the transcript. I started watching the video, but found that the transcript suited me better.

Sunday, October 05, 2008

On partisanship in Congress

Interesting NY Times article on an influential member of Congress who's decided not to run for reelection due to frustration with partisan politics in Washington: Tom Davis gives up. It reminds me of Larry Lessig, who recently gave up copyfighting to fight political corruption. I'm glad we have people like that taking a stand.

Saturday, October 04, 2008

Loss of control fuels rituals, superstition

Here's a rather vague description of recent research into control and belief. The elephant in the room with this article is the implication that religion must have simply arisen in chaotic times, a sentiment I'm rather indifferent to. But it's an interesting thing to watch for in myself. It also supports the Eastern notion of cultivating acceptance of what happens to us, without trying to fit it all into a framework of meaning.

Drunk histories

I find this series of 4 videos really hilarious. Basically, somebody gets drunk and then talks about a famous historical event. (Uses strong language.)

Wednesday, October 01, 2008

If all movies had cell phones

A few swear words, but overall just funny as heck:

If all movies had cell phones.

Warning, includes spoilers for The Notebook and Fight Club.