John Wilkin - Transcript
University Librarian for Library Information Technology at the University of Michigan
-- Beginnings --
[Why was the Digital Library Production Service at the University of Michigan created and how did you become involved?]
Sure. I think—that’s probably one of the most interesting questions in terms of our evolution. And the things that we were talking about there a moment ago—Making of America and JSTOR factored into that. So we had a number of things that we were spitting out that—including those things and something called PEAK, which you’ll find literature on, which is now long dead. We put Elsevier’s journals online for a number of organizations and institutions before Elsevier had an online service. The museum education site licensing program—MESL—we quickly recognized that—that it was going to be impossible to sustain those activities without a—production organization. Without infrastructure and staff devoted to something that was less ad hoc and always changing. And so we took some of the staff who were involved in those activities and allocated new funding to build an organization that would be responsible for the sustained support of those activities and others like them.
I think that the key thing here, and one that I think is important for the students to keep in mind, is sustainability. We talk about sustainability all the time, you know, with regard to energy and digital formats, but this really is about being able to sustain an enterprise where there is an ebb and flow of systems and materials. Being able to do that year after year in a way that pays attention to whether you’re going to be able to support it. It often feels easier (it is easier, I would say) to build a system in isolation from questions of sustainability. But when you have to pay attention to the fact of turnover in staff and intersections with other systems, the burden is heavier. So that was a lot of what went into creating DLPS.
[What was your role?]
I was responsible for some of the projects. You’ll find a paper or a presentation or two online that says something like “from project to production,” and that was the language we were using. So I was a key system builder then, and I hope that it was because of this that there was a fair amount of trust in me, that I was able to give attention to things like that. So rather than run a search, the directors of the library, of the computing organization on campus, and of the Media Union appointed me head of DLPS and provided me with resources from the several different organizations.
[What needed to be accomplished in order to create the DLPS in terms of technology, people, and infrastructure?]
The work that we were doing? Yeah, I think that coalescence provided some interesting opportunities and some needs. So if you’ve got—say—early years of development—a server. And that’s all you need, a server. To get something going. And some storage. Here. And some there. It doesn’t look like you need a data center, but when you’ve got the coalescence of a lot of different things and your storage needs grow, and frankly, back in ’96 the form factor for storage devices was different. You needed something more than just what was under—possible under someone’s desk. My desk. My little office downstairs back then didn’t need heat, didn’t, because the machines in the room generated more than enough heat for me. But—but when you have that coalescing of things, you have economies of scale, and you have new and different types of needs. So we needed a data center.
We began to take advantage of a data center up on north campus, you know, it doesn’t matter where the machines are, the networks would be robust enough. We needed more storage because I think often you’re able to sweep problems under the carpet when it’s just this project, and you can ignore a little bit of new stuff coming in, it doesn’t push you over the edge. But when you have all these things coming together you need more storage. So we were then buying large amounts of RAID, at the time, early. You know, sort of a—you know what RAID is right?
RAID is Redundant Array of Independent Discs. So when you have a storage device with these discs in there, there’s either software or hardware that controls the way that the information is spread across the discs. And RAID devices are designed to ensure that you won’t lose any information if you lose a disc. And that idea is—is now elevated to some fairly advanced and sophisticated systems—we have systems that can lose entire nodes of—of storage devices without losing any data because there is such distribution to replication. So we began, you know, in ‘96 investing in RAID devices, for example, to store our data. We couldn’t afford to lose a disc because we would lose data. So having RAID was important.
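The idea that a RAID device can lose a disc without losing information can be sketched in a few lines. This is a minimal illustration of single-parity striping (the approach used by RAID levels 4 and 5), not a description of the devices DLPS actually bought; the block contents and function names are made up for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    """Return the data blocks plus one parity block (one block per disc)."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def rebuild(stripe, lost_index):
    """Reconstruct the block on the failed disc from all surviving blocks."""
    survivors = [block for i, block in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)


# Three data discs and one parity disc; the block contents are arbitrary.
stripe = make_stripe([b"MOA1", b"JSTR", b"PEAK"])
recovered = rebuild(stripe, lost_index=1)  # pretend the second disc failed
```

Because XOR is its own inverse, any single missing block, data or parity, equals the XOR of the others. Losing two discs at once defeats this scheme, which is why larger systems add more parity or replicate across nodes.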
So those were some of the sorts of things. Also, when you get to scale like that, you start thinking about staff in different ways. Everybody, before, was a jack of all trades. And just like the saying goes, jack of all trades, master of none, I can say that my—early interfaces sucked and my programming was not much better. And so we—when you coalesce those needs, you’re able to say, these people were responsible for interface, for usability, for—design of the system, for management of the system. Everybody did everything, you know. This is true.
One of our very talented programmers today was at the time interface designer, programmer, system administrator, the equipment sat on his desk. It was a terrible, terrible waste. Tim would be the first to tell you that he knows nothing about design of interfaces and what he did was at least as bad as what I did. And it was really kind of a shame to have Tim doing security patches on systems when his real talent was your thoughtful, creative design of systems that could respond to interface demand. So we hired, probably in the second year of our existence, an interface specialist, that was the title we gave at the time. Now we have an entire user experience department. But you know, this is the difference I think in these sorts of things as you—as the organization matures. We have an entire department devoted to both system administration and system integration. They do creative things on top of that—of the system. But back in—those days and the early days of DLPS it was the beginnings of us having areas of specialization big enough to have people devoted to programming or interface.
[What was your workflow? Did each person need to fix the issues they encountered?]
Kinda how it went. And for the longest time we could all fit in one room. And so workflow—designing a formal workflow was more cumbersome than the value it contributed. We—we still struggle with the balance between formality in doing things and—and the freedom to operate in more improvisational ways. And I’d say that though—I said we struggle—it is a very productive and healthy tension. We have fewer formal methods for managing things and more informal ad hoc methods and I think we feel very agile because of that. There’s not a form we fill out to do things. Because the form would not reflect the way that things change from activity to activity. So we stay close, we communicate frequently. We have people now who are project managers. Everybody managed projects then, now we have people who are project managers. And they knit things together.
[Please tell us about the Making of America Project]
Sure, sure. You asked how we became one of the partners, and I think we have always been very close to Cornell. As long as I can remember, Anne Kenney and Wendy Lougee and (I forgot his name, he’s going off to EDUCAUSE now) and I were able to say, hey, we see some common interests here, we’ve done this work for JSTOR and for TULIP and you’ve done this work with the Xerox DocuTech system. Maybe we can come together to do something around early American printing, to digitize that content. Cornell was, as I said earlier, before we started the conversation, really intent on creating reprints. We at Michigan knew that we were going to use the digital images for online access. I think the funding came in ’96, and by ’97 we were demonstrating the fruits of the digitization and the online access system, and it clearly caught everybody’s imagination. Cornell and Michigan brought together other pre-DLF institutions and we looked at it and said, there’s nothing to do here, this is really very meaningful. So it was really friendships and a recognition of the way that our different experiences can come together around this one problem.
Cornell did all the heavy lifting on digitization and specs. Anne Kenney and Steve Chapman did lots of great work on benchmarking. We had used that benchmarking in the work for JSTOR, but they guided the way in the contract with Northern Micrographics for digitization. We guided the way on metadata specifications for structure around the content and then on the online delivery. We built the delivery system and shared the code with them and then implemented their content at Cornell for them.
[Did you work collaboratively with other institutions in the early days?]
Yes. Yeah. If you look at the early days of the Digital Library Federation (and it was called NDLF at first, National Digital Library Federation), you’ll see that that was really what it was about. It was about sharing. It was about best practices and standards and, where best practices and standards were inappropriate, it was about sharing experiences and strategies. A lot of the early conversations were about architectures: scalable, sustainable architectures. And I can remember meetings where, you know, institution after institution put up architectural diagrams on slides to explain how they were doing things. And then from there, it was thinking about ways we could knit our resources together to do things in more productive ways. Indeed, a meeting in Ithaca in ’97 was really, gee, how can we take the digitization of five thousand volumes and turn it into something that looks like a comprehensive digitization effort? And everybody said pfffftt, you can’t do that! And it took a while, but, you know, we have done it, and everybody wants to do it now, right? But a lot of it was really about sharing, and there still is, you know, because of the nature of the library community, a lot of that. You guys know about NDSA, right?
-- Challenges --
[What were the main challenges that you faced and overcame?]
One I’ll mention that—from PEAK that I think I’m—we’re very proud of and I’m disappointed we don’t see this in the marketplace—we—you know, when we did this thing with Elsevier’s journals, nobody had put that much content online before in a public system. It was more than what we were seeing in the private sector by a long shot. It was twelve hundred journals, as far back as Elsevier had had them in digital form which was not back to the beginning, but it was certainly back aways for journals when they’d begun digitization. But one reason why we were able to do the project, why—or we had support for the activity was that an economist at the University of Michigan convinced Elsevier that it was important to explore different economic models for access to online journals.
So PEAK stands for Pricing Electronic Access to Knowledge; it’s a bad acronym. But Jeff MacKie-Mason, who’s now the dean of the iSchool here and who was in the Ford School of Public Policy and Department of Economics at the time (I tell you, it’s really kind of almost an accident that he’s ended up in the iSchool, but look where we are and the sort of things we’re doing), said, you’ve got traditional subscription models, but that may not work well for electronic journals. So let’s do traditional subscription models and let’s do pay by the drink because that’s what you think you want, where people will buy access to individual articles. But let’s do something else, where the institution can say, “buy tokens.” And the tokens get spent automatically, recognizing that if somebody accesses an article, more people will also be likely to access that article. The long tail, or the opposite of the long tail, right? This was 1997. Jeff was pretty smart. And so, you could buy tokens up front, but you needed to predict at the beginning of the year the number of tokens you wanted to buy. Once a token was spent on an article, everybody else at your institution would be able to access that article for free. If you run out of tokens, your only choice is to buy things by the drink. Does that make sense? So he was trying to create a little dilemma for institutions, to say, we don’t want to overbuy because then we leave money on the table, and we don’t want to underbuy because then we’re paying for individual articles at that point. So does all this make sense?
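The overbuy/underbuy dilemma in the token model can be made concrete with a small simulation. This is only a sketch of the rules as described above, not PEAK's actual accounting: the prices are invented, and it assumes that a by-the-drink purchase covers a single access and does not unlock the article for the rest of the institution.

```python
def cost_of_year(accesses, tokens_bought, token_price, per_article_price):
    """Total spend and leftover tokens for one institution in one year.

    accesses: article ids in the order users requested them.
    Prices are hypothetical whole-number units.
    """
    unlocked = set()                      # articles already paid for with a token
    tokens_left = tokens_bought
    spend = tokens_bought * token_price   # tokens are bought up front
    for article in accesses:
        if article in unlocked:
            continue                      # free for everyone at the institution
        if tokens_left > 0:
            tokens_left -= 1              # a token is spent automatically
            unlocked.add(article)
        else:
            spend += per_article_price    # out of tokens: pay by the drink
    return spend, tokens_left


# Six requests touching four distinct articles.
accesses = ["a", "b", "a", "c", "b", "d"]
underbuy = cost_of_year(accesses, tokens_bought=2, token_price=5, per_article_price=8)
exact = cost_of_year(accesses, tokens_bought=4, token_price=5, per_article_price=8)
overbuy = cost_of_year(accesses, tokens_bought=10, token_price=5, per_article_price=8)
```

With these made-up numbers, buying exactly four tokens is cheapest. Two tokens force two per-article purchases at the higher price, and ten tokens leave six unused, which is the "money on the table" case.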
So, great idea, right? How do you do that? It was hard! And we did it. So we implemented all three models, and we had institutions that were forced into using one of the models so that we had reasonable representation across them. And there were a lot of things that Jeff and the economics doctoral student working with Jeff learned from that, and you’ll find that in the literature. One of the most important things we found was the way that authentication is a barrier. And it still is a barrier today: when people have to authenticate, they often don’t. And they turn away from services in doing that. I believe that we’re getting closer to an environment where single sign on makes that less relevant. I think that’s what we need to have, where single sign on means, I don’t have to authenticate, I’ve already done it once. And you don’t have to authenticate again. But that was one of the things that they learned.
[What is single sign on?]
So, single sign on, you know, we talk about—we use that phrase when—the institutional infrastructure supports credentials across systems. Typically in the past, you had to sign on to each institution, and each system one by one. Single sign on is a—a mechanism or a strategy for having the credentials be observed by all the systems. So understood in a common way and used by the different—different systems.
So that, you know, that didn’t used to exist. We struggled hard to get to single sign on, and now we’re struggling to get to inter-institutional authentication: Shibboleth. And we’re getting there. It’s looking good.
That was one; size was an issue. You know, there weren’t, like, search engines out there that you could use. We developed our own search engine. It’s called FTL. We found that the amount of data that we had outstripped what FTL could do; we had to enhance FTL; FTL was used for years after that on JSTOR. We did all the work on making it scale; we had far more than JSTOR did for a very long time. You know, hardware, scaling, those sorts of things. Most of these things are just not relevant today. It’s, you know, it’s the user problems.
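The core of single sign on, credentials issued once and then understood in a common way by every system, can be sketched as a signed ticket. This is a deliberately simplified illustration: real deployments such as Shibboleth exchange SAML assertions signed with public-key cryptography, whereas this sketch assumes one secret shared by all services. The key, the username, and the function names are hypothetical.

```python
import hashlib
import hmac

# Assumption: a single secret provisioned to the login service and every application.
SHARED_KEY = b"hypothetical-campus-secret"


def issue_ticket(username):
    """Login service: sign the username once, when the user authenticates."""
    signature = hmac.new(SHARED_KEY, username.encode(), hashlib.sha256).hexdigest()
    return f"{username}:{signature}"


def accept_ticket(ticket):
    """Any participating system: return the username if the ticket is genuine, else None."""
    username, _, signature = ticket.rpartition(":")
    expected = hmac.new(SHARED_KEY, username.encode(), hashlib.sha256).hexdigest()
    return username if hmac.compare_digest(signature, expected) else None


ticket = issue_ticket("reader42")     # the user signs on once
journals_user = accept_ticket(ticket)  # a journal system trusts the ticket
images_user = accept_ticket(ticket)    # so does an image system, with no second login
```

A production ticket would also carry an expiry time and an audience, and would not depend on a secret that every service can use to forge tickets.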
[What are some of the challenges you faced with the Making of America Project?]
I think some of the things that we did early on—helped us to understand scale in this kind of thing. Every digital object needs an identifier to manage the digital object. How do you get an identifier? What is a reasonable identifier? Have you done anything like this? This kind of thing?
[They haven’t been involved in it, no.]
We had done this work with the American Verse Project. It was pretty large scale, and we were inventing identifiers that look a lot like what Internet Archive does: some parts of the author’s name, some parts of the title, distinguishing characteristics, taking into account ISO parameters, don’t put special characters in, that kind of thing. And it taught us pretty quickly that that was not going to scale to the thousands of works that we would digitize. You know, “twa” for “Twain” and “king” for “King Arthur’s Court,” you’re going to run out of unique letters pretty soon. And you look at the IA identifiers like that, they’re really long and they have a bunch of random letters in there to make ’em unique. Well, you know, semantically rich identifiers are stupid. They really don’t help you very much at all. And so it taught us that we needed to move to things that were more scalable. You know, every book had a cataloging record and every cataloging record had an identifier in that NOTIS database system, and they were letters and numbers, so we used those. Good, right? But somebody typed them in. And what happens when you type in things like AKV232448? They type the wrong thing, right? And then you’ve got to find this thing, which is labeled wrong, and what record it came from, and then you’re doing reverse cataloging. That was, you know, a lesson that we learned and applied to something else, which taught us a new lesson. It should be automated. It should be validatable. Do you know what we use today? Barcodes.
Yeah. You can log them in. You can’t type anything wrong. They have check digits. They’re in the record already. If they’re not in the record already, you get them in the record. And barcodes are pretty nifty that way. So, getting better and better.
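What a check digit buys you can be shown with a short validator. The interview does not say which check-digit scheme the library's barcodes use, so this sketch assumes the common mod-10 (Luhn) formula purely for illustration; the sample numbers are a standard Luhn test value, not real barcodes.

```python
def luhn_check_digit(payload):
    """Compute the mod-10 (Luhn) check digit for a string of digits."""
    total = 0
    # Walk right to left, doubling every second digit starting with the rightmost.
    for position, char in enumerate(reversed(payload)):
        digit = int(char)
        if position % 2 == 0:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return (10 - total % 10) % 10


def is_valid_barcode(barcode):
    """True if the barcode is all digits and its last digit is the right check digit."""
    if not barcode.isdigit() or len(barcode) < 2:
        return False
    return luhn_check_digit(barcode[:-1]) == int(barcode[-1])


good = is_valid_barcode("79927398713")      # check digit matches
mistyped = is_valid_barcode("79927398703")  # one digit keyed wrong
typed_id = is_valid_barcode("AKV232448")    # catalog-style id has no check digit
```

A mistyped digit changes the expected check digit, so the error is caught at entry time rather than discovered later as a mislabeled object. An identifier like AKV232448 carries no such redundancy, so a wrong character still looks like a perfectly good identifier.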
-- Hindsight --
[Looking back, would you have done anything differently? What would you have changed if you could?]
I don’t feel like we had any real dead ends that caused us problems.
Let me tell you about something else we did at the time that was sort of figuring things out and that we built on. You know, we had lots of different collections of content, and each one required a different access system. And we used similar methodologies to build those different access systems. We kept doing the same thing, starting from scratch and borrowing code from things we’d developed previously, and after a while we’d make a lot of improvements to the thing that was the most important today and we’d realize that the improvements were not being inherited by the earliest systems that we’d developed. And it became clear to us that we needed modularity. That we needed to build the code around modules that were constant and worked for all the systems, and that what we did for each new one would be the unique piece for that. So building in modularity. We didn’t get it right the first time, and there were things that were very hard for us to fix in later years, and I’m sure that there are programmers who would say that the search logic piece of the core module for DLXS was badly done and we should have done that one differently. That’s not on my radar, but we learned, you know, through this process that we needed to build out in that way. I think these are principles that we, you know, we develop and learn about in the process of doing these things.
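The modularity principle described here, a constant core that every system shares plus one unique piece per collection, might look like the following. This is an illustrative sketch only and does not reflect how DLXS is actually written; the class names, record fields, and sample data are all hypothetical.

```python
class CollectionCore:
    """Shared behavior that every collection's access system inherits."""

    def __init__(self, records):
        self.records = records

    def normalize(self, text):
        # Improve this once and every collection benefits.
        return text.lower().strip()

    def search(self, query):
        wanted = self.normalize(query)
        return [
            record for record in self.records
            if wanted in self.normalize(self.searchable_text(record))
        ]

    def searchable_text(self, record):
        # The one piece each collection must supply for itself.
        raise NotImplementedError


class BookCollection(CollectionCore):
    def searchable_text(self, record):
        return record["title"]


class JournalCollection(CollectionCore):
    def searchable_text(self, record):
        return record["article"] + " " + record["journal"]


books = BookCollection([{"title": "A Connecticut Yankee in King Arthur's Court"}])
journals = JournalCollection([{"article": "Pricing Access", "journal": "Sample Journal"}])
book_hits = books.search("  KING ARTHUR ")
journal_hits = journals.search("sample journal")
```

The copy-and-borrow approach would have duplicated `search` and `normalize` in every system. Here a fix to either one is inherited by the oldest collection as well as the newest.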
[What were some successes that you are proud of?]
I think we can draw the lineage from HathiTrust right back into these things. We learned about scale and sustainability from that. And in 2001 or thereabouts, which was not many years later, when Larry Page said, I’d like to digitize the whole library, that didn’t seem like a crackpot thing, it seemed like something that had value for us. We’d already moved by that point in time from digitization of historical collections to digitization as a preservation method, and we knew what the value to us was. So when we started working on what we then called MBooks, getting the digitized content back into the systems at the University of Michigan, we knew how to lick the scalability problem. We had no issues. It was not that it was a slam dunk, but we knew what we were doing.
The RAID systems that I described for you in those early days, some of them were Jetstore, and we lashed together a whole bunch of them, knowing that it was going to cause us problems and that it was not a strategy that would sustain us for very long, but that we could do that for now. And use it to bridge to the next level when we could get a more scalable storage infrastructure. And, you know, we knew what we were doing. We had experience in all of those areas. We could see the next step. Everybody pulled together and did great work, but there were very few outright mysteries.
Scaling a full-text search to something in the millions of volumes was a mystery. We didn’t know how to do that. We knew that what we were using would not work. And that required some real R&D. And yet our experience about load and user behavior and what’s being searched allowed us to guide that R&D in meaningful ways and produce what we have today, which works.
-- Advice --
[What are some current issues that you face?]
One thing that comes up over and over again is—is ambiguity in—we talk about historical digital collections, ambiguity, we don’t have—a census of the materials available. Very important meeting a couple of weeks ago, small group of people, library directors, heads of consortia, coming together to try to figure out—how we can deal with government documents. We need them online, and we need a comprehensive corpus. Nobody knows what the comprehensive corpus of US government documents is. You would think that we would know that, but we don’t know that.
Big regional depository library collections are not cataloged. Not cataloged comprehensively, they’re not cataloged at the item level, so we don’t know how many volumes we have. Is it 1.7 million, 2 million, 2.2 million? I’m kinda shooting in the dark a little bit, and it’s that way for all sorts of things. Nobody can say what a comprehensive corpus of the public domain is or of—of, you know, candidates for copyright renewal determination. We don’t know. We—there’s so much that we don’t know, and we ought to know better. A lot of our approaches to record-keeping have—have undermined our ability to know. And we will have to undo that a bit to move forward. OCLC is a repository of records. It’s a record sharing mechanism. It’s not an inventory of what’s in our collections.
[What do you think students should be aware of about digital materials, libraries, and cultural heritage right now?]
I think that your education should be about strategies and methodologies rather than specific skills, because the skills are going to change. You know, when I went to library school, I had, and I still have, a copy of AACR2 on my shelf. I never consult it, but you know, it’s good to have that one souvenir. (Laughs.) But, you know, it was about learning AACR2, and I think that we need to be cognizant that things are changing enough that it’s about frameworks and strategies to do those sorts of things and then how we apply them. If you think about, you know, becoming a digital preservation specialist, you can’t know all the formats. You should know about formats, you should know some key formats, you should know why those formats are meaningful and how to extrapolate things.
We just hired our second—I’m sorry. We hired for the second time our digital preservation librarian. A nice, new kind of position. And Lance, you know, Lance knows some formats very well, but he knows strategies to understand other sorts of formats, so, we’re trying to figure out video right now, and folks will tell you that there is no one answer on video, there are lots of different strategies. And Lance can help us to shape our strategies with digital video because of what he knows and the way that he knows things. So, you know, picking an area of usability or user experience or HCI and figuring out strategies is more effective than learning this framework for that thing.
[Are project management and digital curation two promising fields?]
Yeah. Needed, particularly. I’m in the final stages of hiring what we call a special projects librarian. It’s an adjunct for me, a junior person. We used to have a research library residency program here, people right out of library school or nearly right out of library school working in research libraries. We don’t have that anymore. I think it’s a real loss. It was a great opportunity for me; that’s how I came to Michigan in ’86. So I use the special projects librarian as sort of a quasi-residency program. But the person is essentially a project manager. And we, I think, have come to realize that there’s no magic bullet. You know, Basecamp. You know what, it’s pretty good for some things, not good for others, and doesn’t suit everybody really well. JIRA, Footprint, you know, it doesn’t matter, it’s not about the tools, it’s about the way you use things. And, you know, you can probably use Google Docs to good ends there if you’ve got the skills and knowledge. It’s the things you pull together to make things work that matter most.
[Is there anything else you feel students should know?]
No, I mean, I do think it’s a very exciting time and there are lots and lots of different areas that one can bring these things to bear on. I was telling a colleague today that I’ve had a conversation with a succession of three deans or interim deans here about the nature of our work in the library and how it’s changed. And I go through the list of the last, you know, fifteen librarian hires of the last year. And not a one of them is a cataloger or a reference librarian. It’s a copyright specialist or a user interface specialist or, gee, sometimes a programmer, you know, somebody coming out of library school with programming skills, but skills that he or she can apply to, you know, particular types of problems. Project management, that kind of thing. And I think that’s an important fact. It’s an exciting time with lots of very interesting opportunities in the field where we’re making things happen.