"I've got too many books," I said. "Not too many, of course. But I have now reached the point where I can forget that I own a book."
"You should catalog your collection," someone said. "Keep the database on your Palm."
"Hell no. It would take forever, and then it would keep taking forever every time I bought more books. I don't need anything else to remember regularly."
"You know," someone said, "I bet you could get a bar-code scanner cheap, and scan the ISBN codes. Then it would be a Small Matter of Programming to look up the titles on the Net."
My reply was offensive and heartfelt.
I apologize for not keeping this stuff up to date, but I don't use it any more. There are great shareware alternatives out there, and probably some open-source ones, but I haven't investigated them.
Looking up ISBNs - Converting UPC to ISBN - Buying a scanner - Querying for Titles - Afterword
Or, if you don't care about the historical details, here's How To Make It Work For You.
You probably want to check the Page of Updates to This Document, since some details have changed since I actually did this myself.
And finally... the Book List. Also the sublist of recent acquisitions.
Useful, but not what I was looking for. Since the site is just referring queries to Amazon.com (and so on), I might as well go to those sites directly. I don't need price comparisons. Possibly, if Amazon turns out to have an incomplete ISBN database, I might want to come back to isbn.nu rather than design queries for several booksellers myself. But Amazon probably has the goods, right?
The Amazon search form is large, ugly, and full of crap I don't need. I was resigned to figuring out all its details when someone on rec.arts.sf.written pointed out that a URL such as http://www.amazon.com/exec/obidos/ISBN=1565922867/ would work as well. Bingo!
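As a rough illustration of that URL trick, here is a minimal Python 3 sketch that builds the query URL from an ISBN. The function name is made up for this example, and the URL pattern is simply the one quoted above; whether Amazon still honors it is not something this sketch can promise.

```python
def amazon_isbn_url(isbn):
    """Build an ISBN query URL in the form quoted above (hypothetical helper)."""
    # Drop hyphens, spaces and other punctuation; keep digits and a possible X.
    cleaned = "".join(ch for ch in isbn.upper() if ch.isdigit() or ch == "X")
    return "http://www.amazon.com/exec/obidos/ISBN=%s/" % cleaned


print(amazon_isbn_url("1-56592-286-7"))
```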
In fact, Amazon finds some 85% of my test run. By checking both Amazon.com and Chapters.ca, I pushed that up to some 92%. Amazon.co.uk gave me a few more. The remaining intractables weren't found by any other web database I tried, so I expect that's the best I'm going to get.
(Bowker provides on-line access to Books Out Of Print... for thirty bucks per week. Oh well.)
An ISBN is a ten-digit number (the last digit may be "X").
The bar-code on the book, though, is a thirteen-digit number starting with 978. A conversion must be found.
Fortunately, it's trivial. The http://isbn.nu/ search form actually does it in JavaScript:
    if (isbn.indexOf("978") == 0) {
        isbn = isbn.substr(3,9);
        var xsum = 0;
        var add = 0;
        var i = 0;
        for (i = 0; i < 9; i++) {
            add = isbn.substr(i,1);
            xsum += (10 - i) * add;
        }
        xsum %= 11;
        xsum = 11 - xsum;
        if (xsum == 10) { xsum = "X"; }
        if (xsum == 11) { xsum = "0"; }
        isbn += xsum;
    }
Not that JavaScript isn't annoying and stupid, of course. (I keep it turned off, so I can't even use the UPC search function on that web site.) But it saved me from looking up the details of UPC and ISBN checksums.
(Small footnote: Chris Taylor contributes the Java code equivalent to the JavaScript above.)
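For comparison, the same conversion can be sketched in Python 3. This is a reading of the JavaScript above, not code from the actual scripts, and the function names are invented for the example. It drops the 978 and the EAN's own check digit, keeps the nine digits in between, and computes the ISBN check digit over them.

```python
def isbn10_check_digit(first_nine):
    """Return the ISBN check character for the first nine ISBN digits."""
    # Weights run 10, 9, ..., 2 across the nine digits.
    total = sum((10 - i) * int(d) for i, d in enumerate(first_nine))
    check = (11 - total % 11) % 11
    return "X" if check == 10 else str(check)


def ean_to_isbn10(ean):
    """Convert a thirteen-digit 978 EAN to a ten-character ISBN."""
    if len(ean) != 13 or not ean.isdigit() or not ean.startswith("978"):
        raise ValueError("not a 978 EAN: %r" % ean)
    body = ean[3:12]  # skip "978", ignore the EAN check digit
    return body + isbn10_check_digit(body)


print(ean_to_isbn10("9781565922860"))
```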
Did I say the bar-code was a thirteen-digit number, starting with 978? I lied. Some books have that EAN code. Others have a true UPC code, which is twelve digits. Other books have both (the EAN is often inside the cover.)
Moreover, either kind of code can have a five-digit extension. On an EAN, the extension gives the book's suggested retail price. On a UPC, the second half of the main barcode gives the price, and the extension gives half the ISBN, and the other half of the ISBN is...
Missing. I told you the snag was a doozy.
The first half of the main UPC barcode is a publisher number, which corresponds to an ISBN prefix. You have to look it up in a table. Combine the prefix with the five-digit extension, tack on the checksum, and you have the full ISBN.
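The lookup-and-combine step might look something like this in Python 3. It is a sketch under assumptions: the names are invented, the table is taken to map the first six UPC digits to a four-digit ISBN prefix, and the single table entry in the example is made up purely to exercise the code.

```python
def isbn10_check_digit(first_nine):
    """Return the ISBN check character for the first nine ISBN digits."""
    total = sum((10 - i) * int(d) for i, d in enumerate(first_nine))
    check = (11 - total % 11) % 11
    return "X" if check == 10 else str(check)


def upc_to_isbn10(upc, extension, upc_map):
    """Combine a UPC's publisher prefix and its five-digit extension into an ISBN."""
    if len(upc) != 12 or not upc.isdigit():
        raise ValueError("expected a twelve-digit UPC")
    if len(extension) != 5 or not extension.isdigit():
        raise ValueError("expected a five-digit extension")
    prefix = upc[:6]  # first half of the main barcode
    if prefix not in upc_map:
        raise ValueError("Unknown UPC prefix %s" % prefix)
    body = upc_map[prefix] + extension
    if len(body) != 9:
        raise ValueError("prefix plus extension must give nine digits")
    return body + isbn10_check_digit(body)


# The mapping below is invented for illustration, not a real publisher entry.
print(upc_to_isbn10("012345000005", "92286", {"012345": "1565"}))
```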
But you can't get such a table anywhere. I don't think Bowker sells one -- they're in charge of ISBNs, not UPC publisher numbers.
How to deal? The obvious suggestion (which wasn't obvious to me until Christopher Davis suggested it, thank you Christopher) is to use those books that have both kinds of barcodes. When you scan, scan both whenever possible. The clever scripts can then use that information to build up a table of correspondences.
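A minimal Python 3 sketch of that table-building idea follows. It is not upcfind.py itself; the function name and argument layout are assumptions, and the conflict message merely imitates the one the real script is described as printing. Given one book scanned both ways, it checks that the extension matches the tail of the ISBN and records the prefix pair.

```python
def learn_prefix(ean, upc, extension, upc_map):
    """Record the UPC-prefix to ISBN-prefix pair implied by one doubly scanned book."""
    isbn_body = ean[3:12]  # nine ISBN digits inside a 978 EAN
    if isbn_body[4:] != extension:
        raise ValueError("the two barcodes do not look like the same book")
    upc_prefix = upc[:6]
    isbn_prefix = isbn_body[:4]
    known = upc_map.get(upc_prefix)
    if known is not None and known != isbn_prefix:
        raise ValueError("UPC prefix %s is already in the list as %s -- not %s"
                         % (upc_prefix, known, isbn_prefix))
    upc_map[upc_prefix] = isbn_prefix
    return upc_map


table = {}
# Made-up UPC digits; only the EAN here corresponds to a valid ISBN.
learn_prefix("9781565922860", "012345000005", "92286", table)
print(table)
```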
(Why does this silly EAN/UPC system exist? Basically, I'm told, because mass-market books (mostly paperbacks) are sold in mass-market outlets, like grocery stores and drugstores. Mass-market outlets often have ancient, creaky old scanners which only understand UPC codes.)
(In the distant future -- 2005, specifically -- all scanners will be smart, and publishers can start putting the EAN on every book, even mass-market paperbacks. Of course, everyone's collection will still be full of books without EANs. Life is hard.)
(Okay, I don't actually see CSI on the Google result list now. What the hell, I got there somehow.)
CSI sells a couple of relevant toys. They have a pistol-grip point-and-zap scanner (CCD-8000), and a smaller wand scanner (MT-605). These cost about a hundred bucks each, give or take, depending on model. (They have more expensive models too, but I assume the typical book-scanner has spent all his money on books.)
(I went for the wand, on the theory that simpler is better. Also, a bit cheaper.)
One must mind the interfaces. Both of these products come in several forms. You can get them with an RS-232 connector, or a "keyboard wedge". The latter is a clever interface that plugs into the keyboard port of a computer, so that the barcodes you scan appear just as if you'd typed them on the keyboard. The keyboard wedge means that you don't need any data-capture software; just start up any text editor.
I actually wanted the thing to work on both my older Macintoshes, which use ADB connectors, and my PowerBook, which has USB. CSI sells a separate USB adaptor gizmo (MT-606). This converts one of the keyboard wedge interfaces to a USB connection. Thirty bucks. However, my older Macintoshes lose; the manufacturer no longer makes the wand with a Mac ADB interface.
I did have to be careful to program the scanner correctly. (How do you program a bar-code scanner? Right! You point it at a special table of bar-codes! I love it.) I set it up to read EAN and UPC, always including the first and last digit, and optionally including the five-digit extension.
This is a job for Perl!
Well, I don't know Perl. Perl is fuggly. I've put off learning it this long; I have no great desire to wade in now. Someone (a different someone :-) suggested Python. Python is simple. The dumbass whitespace formatting is annoying, but not in a way that makes the language harder to learn. Python it was.
(Footnote: It strikes me that one could modify the Python compiler to ignore whitespace, and use -- for example -- ":" by itself as a block terminator. Since the compiler is part of the run-time environment, this may even be trivial. It would make a lot of people happier with the language, wouldn't it?)
I decided to split the task into its parts. The first script, upcfind.py, goes over a list of scanned codes (both EAN and UPC) and updates a master table called upc-map. This table, as described above, maps UPC prefixes to ISBN prefixes.
The second script, makeisbn.py, reads the same list of scanned codes; it spits out a list of ISBNs. (If it finds any ISBNs in the original list, it leaves them alone. Any line that looks entirely confusing stays in the list, but the script puts a "#" mark before it so that later programs can ignore it.) This script uses the upc-map table, of course. It's also smart enough to find instances where you scanned both UPC and EAN barcodes of the same book, and only spit out the ISBN once.
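The behaviour described for makeisbn.py could be sketched roughly as below, in Python 3. This is not the real script. It assumes each scanned line is a bare run of digits with any five-digit extension appended directly (so 13 or 18 digits for an EAN, 17 for a UPC with extension), which is a guess about the scanner output. For simplicity it marks every line it cannot convert with "#", not just the thoroughly confusing ones.

```python
def isbn10_check_digit(first_nine):
    """Return the ISBN check character for the first nine ISBN digits."""
    total = sum((10 - i) * int(d) for i, d in enumerate(first_nine))
    check = (11 - total % 11) % 11
    return "X" if check == 10 else str(check)


def scan_to_isbn(code, upc_map):
    """Turn one scanned line (ISBN, EAN, or UPC plus extension) into an ISBN."""
    code = code.strip().upper()
    if len(code) == 10 and code[:9].isdigit() and (code[9].isdigit() or code[9] == "X"):
        return code  # already an ISBN: leave it alone
    if not code.isdigit():
        raise ValueError("Unrecognized format")
    if len(code) in (13, 18) and code.startswith("978"):
        body = code[3:12]  # EAN, with or without a price extension
    elif len(code) == 12:
        raise ValueError("UPC barcode requires five-digit extension")
    elif len(code) == 17:
        prefix = code[:6]
        if prefix not in upc_map:
            raise ValueError("Unknown UPC prefix %s" % prefix)
        body = upc_map[prefix] + code[12:]
    else:
        raise ValueError("Unrecognized format")
    return body + isbn10_check_digit(body)


def make_isbn_list(lines, upc_map):
    """Convert scanned lines to ISBNs, skipping repeats and '#'-marking failures."""
    seen = set()
    out = []
    for line in lines:
        try:
            isbn = scan_to_isbn(line, upc_map)
        except ValueError:
            out.append("#" + line.strip())
            continue
        if isbn not in seen:  # same book scanned as both UPC and EAN
            seen.add(isbn)
            out.append(isbn)
    return out
```

With a made-up table entry `{"012345": "1565"}`, the EAN `9781565922860` and the UPC-plus-extension `01234500000592286` both come out as the same ISBN, so it is listed only once.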
The third script, shelve.py, is the one that actually hits the Internet. It reads the list of ISBNs, and writes two output lists. The "out-err" file is a list of ISBNs that the databases (Amazon and Chapters) didn't manage to find. The "out-good" file has three lines for every ISBN successfully queried. The three lines are ISBN, author, and title.
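The good/error split can be sketched without touching the network. In the Python 3 sketch below, each "database" is a hypothetical function that takes an ISBN and returns an (author, title) pair or None; the real shelve.py queries web sites instead, and its internals are not shown in this document.

```python
def shelve(isbns, lookups):
    """Split ISBNs into three-line good records and a list of misses."""
    good = []  # ISBN, author, title, repeated for every hit
    err = []   # ISBNs that no database recognized
    for isbn in isbns:
        if not isbn or isbn.startswith("#"):
            continue  # blank or commented-out line
        for lookup in lookups:
            result = lookup(isbn)
            if result is not None:
                author, title = result
                good.extend([isbn, author, title])
                break
        else:
            err.append(isbn)  # every lookup came back empty
    return good, err


# Stand-in lookups with placeholder data, purely for illustration.
def first_db(isbn):
    return ("Some Author", "Some Title") if isbn == "1565922867" else None


def second_db(isbn):
    return None


good, err = shelve(["1565922867", "#junk", "080442957X"], [first_db, second_db])
print(good)
print(err)
```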
The fourth script, collate.py, is only relevant if you want to turn the data files into a JFile database. (JFile is a shareware database app for the Palm.) The collate.py script just takes one or more data files, strips out blank lines and comments, sorts the data, and adds the one-line header which you need to convert a text file to a JFile PDB. (JFile comes with Windows tools to import data; I wrote jtrans, which does the same job on Unix.)
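The strip-and-sort part of that job might be sketched like this in Python 3. It is not collate.py: the JFile header line is left out because its format is not given here, and sorting by author and then title is an assumption about what order is wanted.

```python
def collate(*data_files):
    """Merge data files (each a list of lines) into sorted (isbn, author, title) records."""
    lines = []
    for file_lines in data_files:
        for line in file_lines:
            line = line.strip()
            if line and not line.startswith("#"):  # drop blanks and comments
                lines.append(line)
    if len(lines) % 3 != 0:
        raise ValueError("expected three lines (ISBN, author, title) per book")
    records = [tuple(lines[i:i + 3]) for i in range(0, len(lines), 3)]
    # Assumed ordering: author first, then title, ignoring case.
    return sorted(records, key=lambda rec: (rec[1].lower(), rec[2].lower()))


# Placeholder authors and titles, purely for illustration.
first = ["1565922867", "Zed, A", "Book B", "", "# a comment"]
second = ["080442957X", "Abel, B", "Book A"]
for record in collate(first, second):
    print(record)
```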
The guts of these scripts are straightforward. Python has library modules for reading lines of text, manipulating them, sending HTTP queries, and returning the results. A bit of regexp cleverness was needed to parse Amazon's HTML, but nothing painful.
(Of course, it's possible that Amazon will change its response page format. They may even remove the ISBN query URL entirely. I don't guarantee that these scripts will work. They worked for me, is all.)
Different publishers hyphenate ISBNs in all sorts of inconsistent ways. Ignore the punctuation, and look for ten digits. (And don't trust the older "SBN" too far -- sometimes you can get a valid ISBN by prepending a zero, but in general you can't.)
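A small Python 3 sketch of "ignore the punctuation, and look for ten digits", with a checksum test added as an extra safeguard. The function name is invented. Nine-digit SBNs are deliberately rejected rather than padded, in line with the warning above.

```python
import re


def normalize_isbn(text):
    """Return a bare ten-character ISBN, or None if the text does not hold a valid one."""
    cleaned = re.sub(r"[^0-9Xx]", "", text).upper()  # drop hyphens, spaces, etc.
    if not re.fullmatch(r"\d{9}[\dX]", cleaned):
        return None
    # Weighted sum over all ten characters must be divisible by 11 (X counts as 10).
    total = sum((10 - i) * (10 if ch == "X" else int(ch))
                for i, ch in enumerate(cleaned))
    return cleaned if total % 11 == 0 else None


print(normalize_isbn("1-56592-286-7"))
print(normalize_isbn("0 8044 2957 X"))
```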
Scanning books is a lot of work. Do not imagine that this project consists of a few minutes of beep-beep-beep, followed by four scripts and you're done. I have six well-filled bookcases (figure 250 books each), and scanning in all the books in a bookcase takes something like 45 minutes. Could be faster if you're lucky, but if you have any quantity of old or obscure books, you'll be typing a lot of comments by hand -- that eats time.
Then, after the scripts are run, you have to go back and fill in missing titles, fix typos and outright mistakes, and generally massage the data. Make sure authors' names and series titles are spelled consistently through the data -- that sort of thing. That's another 45 minutes per bookshelf.
So, overall, I probably blew eight or nine hours on this project -- not counting the programming time. Was this worthwhile?
Hell yes. I could probably type in the titles and authors of 250 books in less than ninety minutes... but it would be a two-person job: one to read off books, one to type the data in. (Try to do both, and it's a running battle whether you drop from exhaustion, turning from the computer to the bookshelf and back, or just petrify your eyeballs from the focussing strain.)
And the typing job would be awful and tedious, even by itself. And you'd have that editing-proofing-massaging stage to do anyway.
So this work certainly saved me folks-hours. It saved me effort; editing a generated list for mistakes is much easier than generating the list yourself, even if it takes nearly as much time. And, I shouldn't even need to say, the prospect of doing the job geekly got me to do it -- I would never have started if the only option had been dronely.
8/28/00: Dan Poirier reports that the Amazon web searcher has to be jiggered. In shelve.py, line 73, change
    re.compile('/Author=([^/"]*)')
to
    re.compile('&field-author=([^/"]*)')
I haven't tried this myself.
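To see what the two patterns capture, here is a small Python 3 check. The two HTML fragments are invented for illustration and are not real Amazon markup; only the regular expressions come from the note above.

```python
import re

OLD_AUTHOR = re.compile('/Author=([^/"]*)')
NEW_AUTHOR = re.compile('&field-author=([^/"]*)')

# Hypothetical link fragments, made up to exercise the patterns.
old_html = '<a href="/exec/obidos/Author=Doe%2C%20Jane/102">Jane Doe</a>'
new_html = '<a href="/exec/obidos/search?index=books&field-author=Doe%2C%20Jane/102">Jane Doe</a>'

# Each pattern captures everything after its marker up to the next slash or quote.
print(OLD_AUTHOR.search(old_html).group(1))
print(NEW_AUTHOR.search(new_html).group(1))
```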
8/29/00: Radio Shack is giving away free barcode scanners as part of some marketing program I don't understand. Skip the Penguin has put up a page about using the CueCat scanner for your own purposes (including cataloging books). Linux and Windows instructions included. Or see the Lineo page for another Linux CueCat driver.
Further updates on this page, since appending them here in the middle doesn't make sense.
Actually, if you skipped the sections above, you're going to get confused. But I'll try to hit the highlights.
XXXXXX=YYYY (no spaces or other characters). This indicates that the six-digit UPC prefix XXXXXX maps to the ISBN prefix YYYY.
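Reading lines of that form into a lookup table could be sketched as follows in Python 3. The function name is invented, and skipping blank lines is an assumption; the real scripts may be stricter or more lenient.

```python
import re


def parse_upc_map(lines):
    """Parse XXXXXX=YYYY lines into a dict from UPC prefix to ISBN prefix."""
    upc_map = {}
    for number, line in enumerate(lines, 1):
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are harmless
        match = re.fullmatch(r"(\d{6})=(\d+)", line)
        if match is None:
            raise ValueError("line %d: not of the form XXXXXX=YYYY" % number)
        upc_map[match.group(1)] = match.group(2)
    return upc_map


# Made-up entries, not real publisher numbers.
print(parse_upc_map(["012345=1565", "", "543210=0804"]))
```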
upcfind.py scanfilename
line L: UPC prefix X is already in the list as Y -- not Z
makeisbn.py scanfilename > isbnfilename
UPC barcode requires five-digit extension -- indicates that you scanned a UPC code without the extension.
Unrecognized format -- indicates that the line doesn't seem to be either a UPC, EAN, or ISBN.
Unknown UPC prefix X -- indicates that the prefix wasn't found in the upc-map table.
XXXXXX=YYYY line to your file.
See above.
XXXXXX=YYYY lines that are necessary.
shelve.py isbnfilename > datafilename
collate.py datafilename1 datafilename2 datafilename3 > dbfile
jtrans -e -o -n Books dbfile dbfile.pdb
collate.py datafilename1 datafilename2 datafilename3 | htmlmake.sh > books.html