Ed note: we are very pleased to welcome this guest post by Mark Sandler, Director, Center for Library Initiatives, Committee on Institutional Cooperation (CIC).

There was an article in the Chronicle of Higher Education last week about Google slowing down its pace of scanning, and there seemed to be credible affirmation in the article that this is indeed the case.  The comment above was one of many—some for the scanning project, some against it.  Rather than sounding cranky about this comment by Poster 11134078 in the Chronicle’s big, open marketplace of ideas, I thought I’d vent my pique in the friendly confines of Evanston.

The Chronicle report states that Google has digitized 20 million volumes from about forty libraries around the world.  Twenty million volumes would give Google the second-largest book collection in the U.S., smaller only than that of the Library of Congress, with 34 million books.  LC has been at the business of being a library since 1800, and currently has a staff in excess of 4,000 to keep tabs on all its volumes (a staff down by more than a thousand employees from the 1990s).  Google, on the other hand, might have two or three librarians and a couple of engineers keeping tabs on its “smaller” collection of 20 million volumes, a collection it began building in 2004.

HathiTrust Digital Library has holdings in excess of ten million volumes, placing it somewhere around sixth or seventh on the list of U.S. library collections, behind the University of Illinois (12 million vols.) and at about the same level of holdings as Columbia University.  Like Google, Hathi also manages with a few FTEs (probably between 2 and 5, depending on how positions are sliced and diced), and they’ve been working to build and organize their collection since 2008.

In both cases, Google and HathiTrust, the records being used are records provided by libraries, either directly or through OCLC.  Putting aside the variance of record quality within and across libraries, just the fact of managing differing record formats in a single system is a considerable hurdle to overcome.  Will it ever get better?  Sure, give them 200 years and 4,000 workers and I’m sure they could clean up the database to the satisfaction of Commenter 11134798.  In fact, I expect that Google and HathiTrust will make considerable progress on improving discoverability (or content organizing frameworks) over the next five years, as they turn their attention from mass digitization and mass ingest to a sharper focus on improving the user experience.  In the meantime, I think librarians, scholars, and other Chronicle readers should be standing on the sidelines cheering on these efforts, awestruck by their scale and impact.

Does LC have cleaner records?  Probably so, but they also have a gazillion pieces of whatever that are NOT cataloged despite a long history and hefty staff.  And how accessible are their marvelous holdings?  Depends on how quickly you can catch a flight to DC.  Oh, and are the contents of their collection searchable (important, since you can’t browse their holdings)?  No?  Just the records?  What a shame.  Well, I guess that bricks-and-mortar library also has some work to do to tidy up the store.  Let’s see how far they get in the next few years on getting their collections scanned so we can see what’s behind all those pristine MARC records.

So, to close this cranky post, I just want to say that our digital collections are in their infancy.  We’re learning as we go how best to represent them and make them meaningfully accessible to users.  We’re also getting more aggressive about redirecting staff attention from managing dwindling print resources to organizing the growing body of digital content owned and licensed by our libraries.  Faceted browsing may or may not prove to be the best approach, but it just strikes me as incredibly naïve to think that twenty million “anythings” could be easily organized, or made seamlessly accessible, in a matter of months.  Perhaps we could agree that, for now, we will begin our public comments by acknowledging how much has been accomplished in short order by the likes of Google, HathiTrust, and partnering libraries, rather than griping about the shortcomings of a work in progress.

Mark Sandler
Director, Center for Library Initiatives
Committee on Institutional Cooperation (CIC)