Digital Archives

Bob Stepno's Other Journalism Weblog
Explorations of personal and community journalism...
Traditional, Alternative, Online...
2002-2009 blog page archive

Digital Archives

From old newspapers to screencasts
and from disappearing footnotes to "read-write" scholarship

Aug. 7, 2006 (Updated Aug. 11) -- On the closing day of last week's Association for Education in Journalism and Mass Communication convention, I was part of a panel discussion of digital archives -- one that combined "glass half empty" and "glass half full" approaches. The official title was quite a mouthful itself: "The Ethereal Electronic Archive: The Life and Death of Information in the Electronic Age."

On the "half full" -- or at least "being filled" -- side, I expressed optimism about the National Digital Newspaper Program, an effort to get a huge number of historic newspapers scanned and made available on the Web at no charge through the Library of Congress and state libraries. The "test bed" stage of the project has involved a handful of universities and libraries, working on scanning 100,000 digital page images from their states, covering just the first decade of the twentieth century.

Alas, optimism was all I could show -- the small San Francisco Marriott conference room we were in didn't have Internet access or a digital projector. Otherwise I would have shown these sites for more information, along with a few linked farther below in this essay:

Library of Congress: National Digital Newspaper Program
Aug. 10: Updated Grant Guidelines (also links to a 64-page TechNotes PDF dated July 21).
2005 grant awardees (from March 2005 -- no 2006 awards?)
National Endowment for the Humanities: NDNP project page
Organization of American Historians: NDNP article, May 2004
NDNP conference presentation, 36-page slide show, May 2006: Thinking Ahead, Designing Now.
More publications and conference presentations about NDNP (mostly PDF files).
NDNP: The Kentucky Testbed (The University of Kentucky's scanning project.)
Elsewhere: U.K. NewsArchive Plus, Library Newspaper Cooperative.
Afterthought: The video side of digital archives.

Not everyone was optimistic: I heard in San Francisco that some weren't pleased with the way the project was going. At the start of August the NEH grants website still carried two-year-old application information, with a note that new guidelines would be "available in summer 2006." [Note: As indicated above, that page was updated three days after the first draft of this essay. Previously, you had to find a separate part of the NEH site to get the November 2006 application deadline.]

But NDNP is a long-range project. The 20-year goal sounds good to me: to have a free online Library of Congress collection of historically significant newspapers published between 1836 and 1922, representing all the states and U.S. territories.

Sidebar: I may have missed it, but I haven't seen an explanation of the 1900-1910 decade as a starting point. Maybe it's a liberal media conspiracy to document the start of the "progressive era." More likely it's just a "new century" thing, complete with news about the Wright brothers, the Panama Canal, the spread of the phonograph, the Photostat machine and the automobile... even the first Hollywood movie, all good topics for communication researchers. Or perhaps the choice of dates has something to do with the project's goal of including papers from all the states (remember, there were only 46 of them until 1912). And, yes, I got some of that information from Wikipedia... in preparation for the section below on being careful quoting websites.

Most of the scanning will be from papers already on microfilm at state libraries or universities. See the U.S. Newspaper Program, which encouraged microfilming. As an example, see the University of Tennessee's contribution to the effort.

More than 1,000 pages of me... on pay-per-view

I'm also "glass is half full" about commercial digitization projects, such as the ProQuest archives of The New York Times and other papers -- including my alma mater, The Hartford Courant, which is now available from 1764 to 1979. That means my students can search for the stories I wrote in the 1970s and find embarrassing bad examples. Oh joy. But they'll need motivation -- the University of Tennessee Library buys access to the Times ProQuest archive (from 1851 on), but not to the Courant. I suspect these archives turn into six-figure budget lines at university libraries.

For casual users, the Times and Courant websites provide access to the archives for a few dollars per article -- from papers that sold for a few cents per issue. (You can get the cost down to 50 cents an article if you agree to buy a "500 article pack" or a monthly subscription.) It still beats scrolling through rolls of microfilm. Been there; done that.

I'm not ready to write a check for $1,000 to see what I had to say back in the 1970s. On the other hand, maybe I can use those pay-per-view dollar figures to point out to students the value of the digital resources they are already paying for with their tuition -- and should take more advantage of. A century and a half of the Times is one amazing supplement to a journalism textbook!

Another voice on the "glass half full" side was Ray Abruzzi from Thomson Gale publishing, which is compiling its own 19th Century U.S. Newspapers Digital Archive -- a 1.5-million-page collection drawn from 200 newspapers, being selected with the help of a panel of journalism historians. The papers are being scanned and formatted so that researchers will be able to search the full text, choose from information categories ("editorials," "arts") and view full-page facsimiles or individual articles. Like the ProQuest archives of individual newspapers, the Thomson Gale project will be marketed to libraries, not to individuals.

(That reminds me: I forgot to ask the panel and audience members whether any of them had used the digital archives of The New Yorker or National Geographic magazines for media-history research projects, something I've been curious about.)

Early Web Disinformation

On the "glass half-empty" side, I talked about one of my first Web projects, a paper I wrote with a classmate (also named Bob), for Paul Jones' first Information Science course at UNC Chapel Hill. Our half-term project ("by 'the Bobs'") was to research questions of "Information Quality and Disinformation" on the World Wide Web -- in 1995 -- and make a Web page to show the class.

The Web was still young. As search engines crawled for keywords like "disinformation," I was astonished when librarians and researchers all over the world linked to our term paper as a resource. We were even cited by someone at Oxford (footnote 8), even if his paper cited the wrong person as my co-author! A librarian from New England was such a fan that she wrote to me several semesters later when she discovered that a student at another university had copied the entire multi-part essay. (I wrote to him about the plagiarism. He said his version -- which didn't change a word of our paper, just put his name on it -- was "an experiment in disinformation" for his own class. It even kept the phrase "Two people named Bob might even disagree..." I think it's gone now.)

A further complication: After a while, our professor took that particular class-project Web server offline -- it was called http://blake.oit.unc.edu -- and all of those links went dead (or "404" as the browsers put it). When librarians wrote to complain, I posted a backup copy in my personal Web space at UNC. And when I graduated -- expecting that site to go away, too -- I bought my own domain, "http://stepno.com," and put all of my UNC work there, including a copy of the original http://www.stepno.com/unc/disinfo.htm.

In fact, I'm hoping to do the same someday with all of my http://www.stepno.com/oldblog/ blog pages, just in case Radio Userland's server doesn't last for all eternity. That will mean any "outside" links to those pages will break, unless patient readers who have bookmarked articles kept a copy with a unique phrase or two that Google (or some search engine of the future) can use to track down a particular page.

By now there are a few dead links and missing graphics in that old Disinformation file, but the main content is preserved, at least until my money for the domain runs out. (Donations of perpetual-endowment funds gleefully accepted.) Coincidentally -- and adding to the confusion -- when I resumed paying tuition to defend my dissertation, UNC restored my second-generation version of the page, at http://www.unc.edu/~rbstepno/disinfo.htm -- complete with out-of-date e-mail contact information.

On the other hand, I'm glad to see some librarians have kept their bookmarks up to date... But is that their problem? Is it mine? Is it my professor's? I check "none of the above" on this one: Unfortunately, it's just the way of the Web. Right?

Disaster for Scholars?

That may be, but for academic scholars, disappearing source material is a disaster. If someone quoted the original address of that page in a doctoral dissertation (mistaking me for an authoritative source), how could the student's dissertation committee or some other future researcher double-check the facts? (And, in that case, point out that the student was just quoting some speculation by a couple of other students, not some carefully quantifiable research study.)

The missing-footnote question provided a lot of the "glass is half-empty" side of our digital archive panel discussion. Nora Paul, director of the Institute for New Media Studies at the University of Minnesota, talked about attempts to archive websites, such as http://archive.org and the Library of Congress's Minerva, as well as the problem of previously public records being withdrawn from circulation over security concerns. She had quite a bit of other information and links about digital archives in PowerPoint slides that she couldn't project, so I'll write to her and see if I can link them here. (I'm kicking myself for not turning on my MP3 recorder and making a podcast of the whole panel discussion. I had the recorder in my bag; I simply forgot to turn it on. Talk about disappearing data!)

The academic "disappearing footnote" problem has been well covered in ongoing research by Iowa State University professors Daniela Dimitrova and Michael Bugeja, who were the last two panelists. Coincidentally or ironically, tonight Google's top hit for Professor Dimitrova is a "404" missing page, not unlike the topic of her research.

Frankly, I hadn't thought of the problem that much until their presentation. Professor Bugeja, director of the Iowa State journalism school, had a wake-up call a few years ago when he was fact-checking the manuscript of his own most recent book. He discovered that 30 percent of his Web citations were no longer accurate. Since then he has been finding more and more evidence that recent research is relying way too much on flimsy website URLs for substantiation. The reasons are obvious. "Footnotes malfunction for many reasons," he says. "Technicians reformat folders and redesign sites or, especially worrisome, revise content at the same online address."

"It's the way the Web works; it shouldn't be trusted to be otherwise" was my first reaction. Links come and go. Students graduate and universities delete their pages. Companies merge, fail, and webmasters decide that a Flash home page is cooler than the old one without learning how to make the contents searchable.

But as more research turns to the Web -- or is about the Web -- how can traditional scholarship keep functioning? For instance, this year's AEJMC convention had -- I didn't count them -- quite a few research papers that were about blogs, podcasts, MySpace, Facebook, Wikipedia and other recent Web phenomena. What did they do for footnotes? What good will those footnotes be in five or ten years? Not much. Are naive academics trusting them?

My modest suggestion: Fight bits with bits. Any serious study using a website as a source of material should include a copy of that cited material -- either as a printed image or as a digital file (a PDF, or a GIF, PNG or TIFF image of the page in question) -- as well as the original URL and the date it was read by the researcher. For archival purposes, saving in more than one format wouldn't be a bad idea.
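For the programmers in the audience, here is a rough Python sketch of what "fight bits with bits" might look like in practice. It is only an illustration, not a tool I or anyone on the panel actually uses: the function names (make_citation_record, save_snapshot, content_changed) and the file layout are my own assumptions. The idea is to keep the captured copy together with the URL and the date it was read, plus a checksum so a later reader can tell whether the page at the same address has been revised.

```python
import hashlib
import json
from datetime import date
from pathlib import Path


def make_citation_record(url: str, content: bytes, accessed: date) -> dict:
    """Describe one cited page: its address, the date read, and a fingerprint."""
    return {
        "url": url,
        "accessed": accessed.isoformat(),
        "sha256": hashlib.sha256(content).hexdigest(),
        "bytes": len(content),
    }


def save_snapshot(record: dict, content: bytes, folder: Path) -> Path:
    """Store the captured bytes next to a JSON record of the citation."""
    folder.mkdir(parents=True, exist_ok=True)
    stem = record["sha256"][:12]  # hypothetical naming scheme for the files
    (folder / f"{stem}.snapshot").write_bytes(content)
    record_path = folder / f"{stem}.json"
    record_path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return record_path


def content_changed(record: dict, current: bytes) -> bool:
    """True if the page now at the cited address differs from the saved copy."""
    return hashlib.sha256(current).hexdigest() != record["sha256"]
```

The sketch deliberately leaves out the fetching step: the bytes could come from a saved HTML file, a PDF printout or a screen capture, which fits the advice to save in more than one format.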

I did something like that 10 years ago with my dissertation research, back when capturing a screen wasn't as easy as it is now. One result: Images I grabbed, glued together with Photoshop, reduced and put online as "thumbnails" now appear in another professor's book. WRAL-TV, where I did the research, no longer had pictures of its 1996 and 1997 page designs. The professor read about them in my 1999 conference paper and asked the TV station for a copy of the images. I was surprised when they sent him back to me. Result: He gives me credit for the images in his book and on his own website; unfortunately (I've just noticed), he misidentifies me as a former employee of the TV station. I'll send him e-mail about that.

If copyright considerations do not allow publication of a researcher's source material, I suppose the source copies could be kept in a "for scholars only" archive at the author's home (or degree-granting) university, which is how I remember confidential anthropology and ethnomusicology research tapes being handled during my first academic incarnation.

Storycasts and Memex Model

And there's new hope -- especially for research about the Web. To see how completely a site can change over time -- part of the problem we're talking about -- and how to save a movie of online material, I think the best example is Jon Udell's "screencast" demonstration using the Wikipedia "Heavy Metal Umlaut" page. That particular screencast is a Flash video file, with narration, capturing a series of actions or pages on a computer screen. (Here's Jon's "The Making of the Movie" article.) Similar "screen movie" recordings are becoming wildly popular with computer gamers, who record their best attempts to kill, nuke or build things in cyberspace. (Software I haven't tried: Camtasia, SnapZ, Replay Screencast, Fraps, Wink, more, more and more.)

Coincidentally, some of the software that makes such captures possible also might make it possible to copy movies whose creators don't want them copied. Intellectual and social issues of "read-write" culture versus "read-only" culture were a theme of the AEJMC conference's keynote address by Lawrence Lessig. Among other things, it was probably the first time most journalism educators had seen AMVs -- Anime Music Videos made by combining "anime" cartoons from one source with music by another.

So how do you do read-only scholarship in a read-write culture? I compared this idea to Vannevar Bush's "Memex" concept from 1945 -- a storage medium that would contain all of a scholar's work and a republishable "trail" through all of the source material. Articles about Memex don't always mention the "republish" part, which comes in the last paragraph of part seven of the article. After a Memex user has joined together bits of various books, his own notes, and more books, he learns that a friend is interested in a related topic:

So he sets a reproducer in action, photographs the whole trail out, and passes it to his friend for insertion in his own memex, there to be linked into the more general trail.
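To make the "republish" idea concrete, here is a toy Python sketch of a trail that can be copied out and linked into a friend's memex. Bush described a microfilm machine, not software, so everything here (the TrailItem, Trail and Memex classes and their method names) is my own hypothetical modeling, offered only as a way to think about it. The point it tries to capture is that the trail carries the source material itself, not just pointers that can go "404."

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TrailItem:
    source: str   # where the material came from (a book, a page, a note)
    excerpt: str  # the captured material itself, not just a pointer to it


@dataclass
class Trail:
    topic: str
    items: list = field(default_factory=list)

    def add(self, source: str, excerpt: str) -> None:
        self.items.append(TrailItem(source, excerpt))

    def reproduce(self) -> "Trail":
        """Copy the whole trail so it can be handed to someone else."""
        return Trail(self.topic, list(self.items))


@dataclass
class Memex:
    owner: str
    trails: dict = field(default_factory=dict)

    def insert(self, trail: Trail) -> None:
        """Link a received trail into any existing trail on the same topic."""
        existing = self.trails.setdefault(trail.topic, Trail(trail.topic))
        for item in trail.items:
            if item not in existing.items:  # skip material already on the trail
                existing.items.append(item)
```

Because reproduce() hands over a copy, the friend's version survives even if the original owner later edits or loses his own trail.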

Of course if trails of captured screens or screencasts become part of dissertations, they will need preserving, too, whether as an open-standard file (TIFF, MP3) or some proprietary format like PDF or Flash. Does that mean including a DVD of captured images with every dissertation? Or another acre of petabyte Web servers at http://archive.org? I don't know. It's a big glass; maybe a swimming pool. It's half full... or half empty... or something.

Meanwhile, if you're thinking of quoting or citing this page in something more serious than a weblog, be warned that it's just a blog entry, a first draft, and a fairly casual "work in progress." Even the page title has been fine-tuned and broken into two lines since the first posting. Within four days, things the first draft said were no longer true, and I discovered the Aug. 11 changes just by luck. I can't promise to keep everything that up to date, but I may edit it, add to it, change my mind about something, or move it to a new address at http://stepno.com. Google should still find it for you, if you include enough key words along with my rather rare last name. If enough people think my train of thought is worthy of adding to the formal world of scholarship, I'll ask a committee of peer-reviewers to look at a spruced-up version, make a last batch of changes, then declare it finished and put it somewhere like Newspaper Research Journal, the kind of publication a tenure committee respects.

Since I've removed this from the standard blog-page layout,
please e-mail me any corrections or comments, or just
click the "comment" link at the bottom of the original blog item.