We are very pleased to welcome this guest post by Martin Mueller, Professor Emeritus of English and Classics at Northwestern University. Read more from Martin at his Scalable Reading blog.

I have put the following proposal to the Board and Technical Council of the TEI. The proposal sits in a tag cloud of Library, Interoperability, Customization, and Outreach. I start with the details of the proposal (which I formulate in terms of ‘is’ rather than ‘should be’) and follow them up with some broader reflections.

  1. The TEI Consortium assumes responsibility for the “Best Practices for TEI in Libraries” document (BP hereafter), integrates it fully into the TEI Guidelines, and makes the maintenance of BP a central and continuing responsibility of the Technical Council.
  2. BP abandons its current five levels and focuses on the current “Level 4,” using as its model a P5 version of the EEBO schema, which has been battle tested with some 50,000 texts from three centuries of Western print culture. (The EEBO schema operates at pretty much the same level of granularity as the Level 4 schema, as do the various Chadwyck-Healey collections.)
  3. Maintenance of BP by the TEI goes beyond maintaining the rules of the schema and involves the development and maintenance of a “cradle to grave processing model,” a set of default procedures that support the display and querying of BP-encoded texts in a sharable and reliable manner (the quoted phrase is Sebastian Rahtz’s shorthand for these functionalities); a rough sketch of what one such default query might look like follows right after this list. Building and maintaining the various bridges that lead from the encoding standard to a project may well be the most challenging part of this proposal.
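
To make the “default procedures” a little more concrete: the hope is that the same few lines of code could run, unmodified, over any BP-encoded text. Below is a minimal sketch of what one such default query might look like, written in Python with lxml; the corpus directory, file layout, and the particular elements counted are my own illustrative assumptions, not anything specified by BP or by the proposal.

```python
# A minimal sketch of a "default query" over uniformly encoded texts,
# assuming (hypothetically) a directory of files that all validate
# against the same BP/EEBO-style schema. Paths and element choices
# are illustrative only.
from pathlib import Path
from lxml import etree

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def structural_profile(tei_file: Path) -> dict:
    """Return a few coarse facts that any text encoded to the common schema should yield."""
    tree = etree.parse(str(tei_file))
    return {
        "file": tei_file.name,
        "title": tree.findtext(".//tei:titleStmt/tei:title", namespaces=TEI_NS),
        "divisions": len(tree.findall(".//tei:div", namespaces=TEI_NS)),
        "paragraphs": len(tree.findall(".//tei:p", namespaces=TEI_NS)),
        "page_breaks": len(tree.findall(".//tei:pb", namespaces=TEI_NS)),
    }

if __name__ == "__main__":
    for f in sorted(Path("corpus").glob("*.xml")):  # hypothetical corpus directory
        print(structural_profile(f))
```

Nothing in this sketch is sophisticated, and that is precisely the point: it does not have to be, provided the encoding is uniform.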

An intelligent implementation of some version of this proposal would be very popular and go quite a ways in strengthening the role of the TEI as a key technology in text-centric humanities disciplines. It also appears to run against the grain of the TEI’s delight in flexibility and customization. I say “appears” on purpose. Flexibility and customization are great things, but among the many customizations there should be one that is tightly controlled and appeals to users (and there are many of them) who will happily settle for a solution that meets their needs with little or no customization. You could call this tightly controlled customization TEI-Satisfice, echoing Herbert Simon. I will call it for the moment TEI-Nudge because it is offered in the spirit of Nudge, the recent book by Thaler and Sunstein in which they argue persuasively that in many walks of life people will do better if they are offered well-designed default solutions to opt out from rather than opt into. If there were a well-designed TEI-Nudge, many users for a variety of projects with common problems would see no need to go beyond it. Others would outgrow it at some point, and once they do they will have learned enough to strike out on their own. There are of course projects for which no version of TEI-Nudge would offer a good starting point. But in this proposal I do not focus on the projects that would not fit into it, but on the very considerable number of projects that could be done well enough within its limits. I like to remind people of the wisdom of Winnicott’s good enough mother in the context of such discussions.

From what I have heard from or about members of the Council, some of them think it would be a good idea to try something like this, while others may detest the very thought and think it is a betrayal of core principles and in any event impossible because every project is different. Well, every project is different, but not that different. An astonishing variety of texts from all walks of life has been encoded within the boundaries of the EEBO-TCP schema, and folks familiar with EEBO-TCP texts can tell a lot of stories about their cultural and linguistic diversity. A processing model may make all these texts look like the equivalent of the blue Oxford, orange Teubner, maroon Pleiade, or yellow Garnier editions. But this would be OK with folks who don’t see such uniformity as a disadvantage and may even see uniformity of presentation as a way of highlighting diversity of content. It would certainly be a significant step forward if large numbers of texts encoded with limited or no variance could be queried in (much) the same way for (much) the same encodings.

Which brings us to the question of interoperability and re-usable data. As I was thinking about these things, two emails floated across my screen. Kevin Hawkins drew my attention to James Cummings’ Oddly Pragmatic talk at the JADH conference in Kyoto. Patrick Durusau, in a posting to the TEI list, pointed to an article, “BioC: a minimalist approach to interoperability for biomedical text processing,” which, if you follow it up, leads you to a website that lists the BioC goals as

  1. simplicity
  2. interoperability
  3. broad use and reuse

Laurent Romary calls the article “interestingly naive” and may well be right. In any event, the biomedical literature, vast as it is, is in many ways more homogeneous than the stuff that the TEI tries to wrangle into corsets of varying shapes. But here is a paragraph from the LEXUS project at the Max Planck Institute for Psycholinguistics in Nijmegen:

Lexicography in general is a domain where uniformity and interoperability have never been the operative words: depending on the purpose and tools used different formats, structures and terminologies are being adopted, which makes cross lexica search, merging, linking and comparison an extremely difficult task. LEXUS is also an attempt at putting an end to this problem. Being based on the Lexical Markup Framework (LMF), an abstract model for the creation of customized lexicons defined following the recommendations of the ISO/TC 37/SC 4 group on the standardization of linguistic terminology, LEXUS allows on the one hand to create purpose-specific and tailor-made lexica, and at the same time assures their comparability and interoperability with other resources.

“Comparability and interoperability with other resources” is an increasingly important topic on various Digital Humanities agendas. You find echoes of it in a recent “work set construction” Mellon grant to the HathiTrust Research Center, and, under the heading “Wissenschaftliche Sammlungen” (“scholarly collections”), it is a major part of an ambitious DARIAH project. Progress towards it is slow, tedious, and partial, but it is also necessary. And “simplicity, interoperability, broad use and reuse” are pretty good goals to keep in mind for quite a few purposes.

The TEI has had trouble engaging with this topic because in its collective rhetoric of praising the virtues of customization and flexibility it fails to appreciate the substantial virtues of standardization where they can and should be practised. And the TEI keeps insisting on a distinction between ‘interoperability’ and ‘interchange’ that makes very little sense to folks outside the TEI’s discursive realm. Cummings’ Kyoto talk is a case in point. He has a paragraph about the “Unmediated Interoperability Fantasy” in which he argues “that being able to seamlessly integrate highly complex and changing digital structures from a variety of heterogeneous sources through interoperable methods without either significant conditions or intermediary agents is a deluded fantasy.” Of course it is. Sharing data is never easy, whether in technical or social terms. There is a lot of talk about “agile data integration” because typically it is anything but.

But Cummings underestimates the power of interoperability that depends on “limited or … only structural granularity,” which he opposes to the “necessary mediation, investigation, transformation, exploration, analysis, and systems design [that are] the interesting and important heart of digital humanities.” From this you might conclude that TEI-based encoding becomes interesting and worthwhile only when it involves complex customization. But that is not always true. I would argue, on the contrary, that for many projects the major pay-off comes from the consistent application of simple categories across large data sets. It’s a little like plumbing: the biggest thing is having running water in the first place. Having hot water is also important, but single-lever faucets that blend hot and cold water add relatively little.

Take the ~50,000 currently existing TCP texts and compare plain flat-file versions with their current and quite coarse encoding. The difference in query potential is very large, especially if you add to that coarse encoding simple forms of linguistic annotation or named-entity tagging that can be applied in a largely algorithmic fashion. I can imagine more subtle forms of encoding, but as the encoding gets subtler, the law of diminishing returns sets in. So the term “only structural granularity” misses an important point. For a lot of current and future users of the TEI the really important benefits come from the simple stuff, and beyond some level of complexity they begin to feel some sympathy with Andrew Prescott’s not very kind phrase about “angels dancing on angle brackets.”
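
To illustrate the point with a small invented example (the sentence, names, and element choices below are mine, not taken from any TCP text): the flat string supports little beyond keyword search, while the same sentence with coarse markup plus the kind of tags an automatic tagging pass could plausibly add answers “who, where, when” questions with one-line queries.

```python
# An invented before/after example of the jump in query potential, using
# Python with lxml. The sentence, names, and tagging are hypothetical.
from lxml import etree

TEI = "http://www.tei-c.org/ns/1.0"
NS = {"tei": TEI}

# Flat version: full-text search only; anything else is regex guesswork.
flat = "In the yeare 1611 Master Iohn Donne preached at Paules Crosse."

# The same sentence with coarse structural markup and a few tags of the
# sort that an algorithmic tagging pass could add.
encoded = f"""<div xmlns="{TEI}" type="sermon">
  <p>In the yeare <date when="1611">1611</date> Master
     <persName>Iohn Donne</persName> preached at
     <placeName>Paules Crosse</placeName>.</p>
</div>"""

root = etree.fromstring(encoded)

# "Who preached, where, and when?" becomes a one-line query per question.
people = [e.text for e in root.findall(".//tei:persName", namespaces=NS)]
places = [e.text for e in root.findall(".//tei:placeName", namespaces=NS)]
dates = [e.get("when") for e in root.findall(".//tei:date", namespaces=NS)]
print(people, places, dates)  # ['Iohn Donne'] ['Paules Crosse'] ['1611']
```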

From what I know about libraries, they are likely to respond positively to a TEI commitment to a common scheme for the not inconsiderable number of projects that can in fact be adequately managed within it, with the added benefit that they will from the beginning have a higher degree of “interoperability” than they otherwise would. Interoperability is always a relative and partial thing, but it is a lot easier to get more of it if it is designed into projects from the beginning. And, as I said before, the TEI does its cause little good by dismissing ‘interoperable’ while adopting ‘interchange.’

From the perspective of the TEI, libraries should not be a “Special Interest Group.” They are a core constituency, perhaps the core constituency. They hold a very large percentage of encoded texts, and given their commitment to long-term preservation and access, a TEI project that ends up in a library has a much better chance of remaining accessible in the long run. The scholarly end users who encounter a TEI-encoded text (often without knowing that it is a TEI-encoded text) are most likely to encounter it via some library or other. Take away libraries, and the TEI will dwindle into a niche technology. If the TEI integrates some form of TEI-Nudge into its Guidelines, libraries will know what they are paying their dues for. And libraries pay a lot of the TEI dues.

The final term in my initial tag cloud is ‘outreach.’ At my university (and I suspect at many others) a student or faculty member wanting to do some text encoding project will turn to the Library first. At my university library (and I suspect at many others) the relevant folks may not know a whole lot about the TEI, or they may know quite a bit, but there is no organized service with the capabilities for dealing with the routine aspects of a project in a routine manner. Now imagine a TEI-Nudge that library staff are, or can readily become, familiar with. So you look at a project and make an initial guess as to whether you can do it in the standard manner or whether you need to “opt out” and do something special. Quite a few projects can be done within the standard option. There is of course the moral hazard that things will be done in the standard way when they could be done better in a special way. But there is the greater and more immediate hazard that the project doesn’t get done at all, or gets done in some quirky way, when it would have been better off staying within the standard schema and the predictable display and query routines that come with it.

I am aware that all of this runs against the ingrained Duns Scotism of the humanists who delight in the haecceitas of each project. But haecceitas is very expensive, and foundations, as well as libraries and IT organizations, take an increasingly dim view of it. Many years ago, faced with a text-centric problem of sorts, a 20-year-old entry-level employee at Bell & Howell was put in charge of answering customer letters. He discovered that a high percentage of such letters could be answered by one of eight form letters. This was the beginning of a rapid career that made Charles Percy the chairman of Bell & Howell at age 29 and later one of the most distinguished senators of his age. Of course, documents worth encoding in TEI are very different from customer letters. But not that different, and eight out of ten probably will benefit from staying within the confines of a well thought-out standard schema and its surrounding processing rules. And even the two that don’t may benefit from staying within that standard schema as far as possible.

Finally, a few words about the Guidelines. Integrating BP into the Guidelines, and doing so in the right way, would be a huge step forward in making the TEI as a whole more approachable. I am a great fan of the Guidelines, and I take it from the recent French survey that this has been the experience of expert users who have learned how to find their way around them. The world would be a better place if all digital projects were documented with as much care and understanding as the TEI has been. But novice users find the Guidelines daunting, for reasons that I can understand. If you think of the Guidelines as the Large Catechism of text encoding, they would greatly benefit from a Small Catechism that acts as a gateway to them. A really well-documented version of TEI-Nudge could play that role. “Really well documented” means that the limited schema is described in terms that will be intelligible to novices and that there are clearly marked paths to the more complex phenomena. This is a non-trivial intellectual and editorial project. It certainly cannot be solved by taking some version of the current BP and just plopping it somewhere in the Guidelines.

The Guidelines in their current form are a product of the nineties, when storage and bandwidth were limited and images were expensive. In terms of storage and bandwidth it would now be quite cheap to include illustrations in the documentation. Many problems of encoding become a lot easier to “see” when you look at the image of the printed or manuscript page and intuitively grasp the meaning of the “whitespace XML” that shapes the layout of the page. It will be a time-consuming and tedious business to transform the Guidelines into a text-and-image project, but it would be well worth doing. And if the TEI were to commit a non-trivial portion of its current cash cushion, it might well attract matching funding from elsewhere.

What I call TEI-Nudge shares some outreach goals with the TAPAS project but goes about them in a different way. The two approaches may well be complementary. As I understand it, TAPAS aims at creating a shared infrastructure while enabling and indeed encouraging customization. My proposal starts from the observation that a schema like the EEBO DTD can in fact accommodate a great many projects under one umbrella. Some people might call it a “lowest common denominator” approach. I’d call it a “highest common factor” approach.