Digital Archives
From old newspapers to screencasts and from
disappearing footnotes to "read-write" scholarship
Aug. 7, 2006 (Updated Aug. 11) -- On the
closing day of last week's Association
for Education in Journalism and Mass Communication
convention, I was part of a panel discussion of digital archives -- one that combined "glass half empty" and "glass half full"
approaches. The official title was quite a mouthful itself: "The Ethereal
Electronic Archive: The Life and Death of Information in the Electronic
Age."
On the "half full" -- or at least
"being filled" -- side, I expressed optimism about the National Digital Newspaper
Program, an effort to get a huge number of historic
newspapers scanned and made available on the Web at no charge through
the Library of Congress and state libraries. The "test bed" stage of
the project has involved a handful of universities and libraries,
working on producing 100,000
digital page images of newspapers from their states, covering just the
first decade of the twentieth century.
Alas, optimism was all I could
show -- the small San Francisco Marriott conference room we were in
didn't have Internet access or a digital projector. Otherwise I would
have shown these sites for more information, along with a few linked
farther below in this essay:
Not everyone was
optimistic: I heard in San Francisco that some weren't pleased with the way
the project was going. At the start of August the NEH
grants website still carried two-year-old application
information, with a note that new guidelines would be "available in summer 2006." [Note: As indicated above, that page was updated three days after the first draft of this essay. Previously, you had to find a separate part of the NEH site to get the November 2006 application
deadline.]
But NDNP is a
long-range project. The 20-year goal sounds good to me: to have a free
online Library of Congress collection of historically significant
newspapers published between 1836 and 1922, representing all the states
and U.S. territories.
Sidebar: I may have
missed it, but I haven't seen an explanation of the 1900-1910 decade as
a starting point. Maybe it's a liberal media conspiracy to document the
start of the "progressive
era." More likely it's just a "new century" thing, complete
with news about the Wright
brothers, the Panama Canal, the spread of the phonograph, the
Photostat
machine and the automobile... even the first Hollywood
movie, all good topics for communication researchers. Or
perhaps the choice of dates has something to do with the project's goal
of including papers from all the states (remember, there were only
46
of them until 1912). And, yes, I got some of that information
from Wikipedia... in preparation for the section below on being careful
quoting websites.
Most of the scanning will be from
papers already on microfilm at state libraries or universities. See the
U.S. Newspaper
Program, which encouraged microfilming. As an example, see
the University
of Tennessee's contribution to the
effort.
More than 1,000 pages of me... on pay-per-view
I'm also "glass is half full" about
commercial digitization projects, such as the ProQuest archives of
The
New York Times and other papers -- including my alma mater, The
Hartford Courant, which is now available from 1764 to 1979.
That means my students can search for the stories I wrote in the
1970s and find embarrassing bad examples. Oh joy. But they'll need
motivation -- the University of Tennessee Library buys access to the
Times ProQuest
archive (from 1851 on), but not to the Courant. I suspect these archives turn into
six-figure budget lines at university libraries.
For casual users, the Times and Courant websites provide access to the
archives for a few dollars per
article -- from papers that sold for a few cents per
issue. (You can get the cost down to 50 cents an article if you agree
to buy a "500 article pack" or a monthly subscription.) It still beats
scrolling through rolls of microfilm. Been there; done that.
I'm not ready to write a check for $1,000 to see what I
had to say back in the 1970s. On the other hand, maybe I can use those
pay-per-view dollar figures to point out to students the value of the digital
resources they are already paying for with their tuition -- and should take more advantage of. A century and a half of the Times is one amazing
supplement to a journalism textbook!
Another voice
on the "glass half full" side was Ray Abruzzi from Thomson Gale publishing, which is
compiling its own 19th Century U.S. Newspapers Digital
Archive -- a 1.5-million-page collection drawn
from 200 newspapers, being selected with the help of a panel of
journalism historians. The papers are being scanned and formatted so
that researchers will be able to search the full text, choose from information
categories ("editorials," "arts") and view full-page facsimiles or individual articles. Like the ProQuest archives of individual
newspapers, the Thomson Gale project will be marketed to libraries, not
to individuals.
(That reminds me: I forgot to ask
the panel and audience members whether any of them had used the digital
archives of The
New Yorker or National
Geographic magazines for media-history research projects,
something I've been curious about.)
Early Web Disinformation
On the "glass
half-empty" side, I talked about one of my first Web projects, a paper
I wrote with a classmate (also named Bob), for Paul Jones' first
Information Science course at UNC Chapel Hill. Our half-term project
("by 'the Bobs'") was to research questions of "Information Quality and
Disinformation" on the World Wide Web -- in 1995 -- and make a Web page
to show the class.
The Web was still young. As
search engines crawled for keywords like "disinformation," I was
astonished when librarians and researchers all
over the world linked to our term paper as a resource. We
were even cited by someone at
Oxford (footnote 8), even if his paper cited the wrong person
as my co-author! A librarian from New England was such a fan that she
wrote to me several semesters later when she discovered that a student
at another university had copied the entire multi-part essay. (I wrote
to him about the plagiarism. He said his version -- which didn't change
a word of our paper, just put his name on it -- was "an experiment in
disinformation" for his own class. It even kept the phrase "Two people
named Bob might even disagree..." I think it's gone now.)
A further complication: After a while, our
professor took that particular class-project Web server offline -- it
was called http://blake.oit.unc.edu -- and all of those links went dead
(or "404" as the browsers put it). When librarians wrote to complain, I
posted a backup copy in my personal Web space at UNC. And when I
graduated -- expecting that site to go away, too -- I bought my own
domain, "http://stepno.com," and put all of my UNC work there,
including a copy of the original http://www.stepno.com/unc/disinfo.htm.
In fact, I'm hoping to do the same someday with all
of my http://www.stepno.com/oldblog/ blog pages, just in case Radio
Userland's server doesn't last for all eternity. That move will
break any "outside" links to those pages, so I'll have to hope that
patient readers who have bookmarked articles keep a copy with a unique
phrase or two that Google (or some search engine of the future) will be
able to use to track down a particular page.
By now
there are a few dead links and missing graphics in that old
Disinformation file, but the main content is preserved, at least until
my money for the domain runs out. (Donations of perpetual-endowment
funds gleefully accepted.) Coincidentally -- and adding to the
confusion -- when I resumed paying tuition to defend my dissertation,
UNC restored my second-generation version of the page, at
http://www.unc.edu/~rbstepno/disinfo.htm -- complete with out-of-date
e-mail contact information.
On the other hand, I'm
glad to see some
librarians have kept their bookmarks up to date... But is
that their problem? Is it mine? Is it my professor's? I check
"none of the above" on this one: Unfortunately, it's just the way of
the Web. Right?
Disaster for Scholars?
That may be, but for academic
scholars, disappearing source material is a disaster. If someone quoted
the original address of that page in a doctoral dissertation (mistaking
me for an authoritative source), how could the student's dissertation
committee or some other future researcher double-check the facts? (And,
in that case, point out that the student was just quoting some
speculation by a couple of other students, not a carefully
quantified research study.)
The missing-footnote
question provided a lot of the "glass is half-empty" side of our
digital archive panel discussion. Nora Paul, director of the Institute
for New Media Studies at the University of Minnesota, talked about
attempts to archive websites, such as http://archive.org and the Library of Congress's Minerva, as well as the problem of previously public records being taken out of circulation because of security concerns. She had quite a bit of other information and links about digital archives in PowerPoint slides that she
couldn't project, so I'll write to her and see if I can link them here.
(I'm kicking myself for not turning on my MP3 recorder and making a
podcast
of the whole panel discussion. I had the recorder in my bag; I simply forgot to
turn it on. Talk about disappearing data!)
The
academic "disappearing footnote" problem has been well covered in ongoing
research by Iowa State University professors Daniela Dimitrova
and Michael Bugeja, who were the last two panelists. Coincidentally or
ironically, tonight Google's top hit
for Professor Dimitrova is a "404" missing page, not unlike
the
topic of her research.
Frankly, I hadn't
thought of the problem that much until their presentation. Professor
Bugeja, director of the Iowa State journalism school, had a wake-up
call a few years ago when he was fact-checking the manuscript of his own most recent
book. He discovered that 30
percent of his Web citations were no longer accurate. Since
then he has been finding more and more evidence that recent research is
relying way too much on flimsy website URLs for substantiation. The
reasons are obvious. "Footnotes malfunction for many reasons," he says.
"Technicians reformat folders
and redesign sites or, especially worrisome, revise content at the same
online address."
"It's the way the Web works; it
shouldn't be trusted to be otherwise" was my first reaction. Links come
and go. Students graduate and universities delete their pages.
Companies merge or fail, and webmasters decide that a Flash home page is
cooler than the old one without learning how to make the contents
searchable.
But as more research turns to the Web
-- or is about the
Web -- how can traditional scholarship keep functioning? For instance,
this year's AEJMC
convention had -- I didn't count them -- quite a few research
papers that were about blogs, podcasts, MySpace, Facebook,
Wikipedia and other recent Web phenomena. What did they do for
footnotes? What good will those footnotes be in five or ten years? Not
much. Are naive academics trusting them?
My modest
suggestion: Fight bits with bits. Any serious study using a website as
a source of material should include a copy of that cited material --
either as a printed image or as a digital file (PDF, GIF, PNG or TIFF)
of the page in question -- as well as the original URL and the date it
was read by the researcher. For archival purposes, saving in more than
one format wouldn't be a bad idea.
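For readers who want to try the "bits with bits" idea, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the function names, the record fields and the file layout are hypothetical, not part of any established archiving tool. It saves a copy of the cited material next to a small record holding the original URL, the date it was read and a checksum, and it can later report whether the address still serves the same content.

```python
import hashlib
import json
from datetime import date
from pathlib import Path


def save_citation_snapshot(url, content, archive_dir, accessed=None):
    """Save a copy of cited material plus a record of where and when it was read.

    `content` is the raw bytes of the page (or a PDF/PNG capture of it).
    Fetching or capturing those bytes is left to the caller.
    """
    accessed = accessed or date.today().isoformat()
    digest = hashlib.sha256(content).hexdigest()

    archive = Path(archive_dir)
    archive.mkdir(parents=True, exist_ok=True)

    # Name both files after the checksum so one snapshot never overwrites another.
    copy_path = archive / f"{digest[:16]}.bin"
    copy_path.write_bytes(content)

    record = {
        "url": url,               # the original address cited
        "accessed": accessed,     # the date the researcher read it
        "sha256": digest,         # fingerprint of the saved copy
        "saved_copy": copy_path.name,
    }
    (archive / f"{digest[:16]}.json").write_text(json.dumps(record, indent=2))
    return record


def check_citation(record, current_content):
    """Compare what the URL serves today with the saved copy.

    Pass None for `current_content` when the address no longer resolves.
    """
    if current_content is None:
        return "missing"      # the classic "404" dead footnote
    if hashlib.sha256(current_content).hexdigest() == record["sha256"]:
        return "unchanged"
    return "revised"          # same address, different content
```

To use it, a researcher would pass in whatever bytes were actually read (for example, the result of `urllib.request.urlopen(url).read()`, or a screen capture saved as an image) and keep the resulting archive folder with the manuscript. Calling `save_citation_snapshot` once per format would cover the "more than one format" suggestion.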
I did something
like that 10 years ago with my dissertation
research, back when capturing a screen wasn't as easy as it
is now. One result: Images I grabbed, glued together with Photoshop,
reduced and put online as "thumbnails" now appear in another
professor's book. WRAL-TV, where I did the research, no longer had
pictures of its 1996 and 1997 page designs. The professor read about
them in my 1999 conference paper and asked the TV station for a copy of
the images. I was surprised when they sent him back to me. Result: He
gives
me credit for the images in his book and on his own website;
unfortunately (I've just noticed), he misidentifies me as a former
employee of the TV station. I'll send him e-mail about
that.
If copyright considerations do not allow
publication of a researcher's source material, I suppose the source
copies could be kept in a "for scholars only" archive at the author's
home (or degree-granting) university, which is how I remember
confidential anthropology and ethnomusicology
research tapes being handled during my
first academic incarnation.
Storycasts and Memex Model
And there's
new hope -- especially for research about the Web. To see how completely a site
can change over time -- part of the problem we're talking about -- and how to
save a movie of online material, I think the best example is Jon
Udell's "screencast" demonstration using the Wikipedia "Heavy
Metal Umlaut" page. That particular screencast is a Flash video file,
with narration, capturing a series of actions or pages on a computer
screen. (Here's Jon's "The
Making of the Movie" article.) Similar "screen movie"
recordings are becoming wildly popular with computer gamers, who record
their best attempts to kill, nuke or build things in cyberspace.
(Software I haven't tried: Camtasia, SnapZ, Replay Screencast,
Fraps, Wink and more.)
Coincidentally, some
of the software that makes such captures possible also might make it
possible to copy movies whose creators don't want them copied.
Intellectual and social issues of "read-write"
culture versus "read-only" culture were a theme of the AEJMC
conference's keynote address by Lawrence Lessig. Among other things, it
was probably the first time most journalism educators had seen AMVs -- Anime
Music Videos made by combining "anime" cartoons from one source with
music by another. So how do you do
read-only scholarship in a read-write culture? I compared this idea to
Vannevar
Bush's "Memex" concept from 1945 -- a storage medium that
would contain all of a scholar's work and a republishable "trail"
through all of the source material. Articles about Memex don't always
mention the "republish" part, which appears in the last paragraph
of part seven of Bush's article. After a Memex user has joined
together bits of various books, his own notes, and more books, he
learns that a friend is interested in a related
topic:
So he sets a reproducer in action, photographs the whole trail out,
and passes it to his friend for insertion in his own memex, there to
be linked into the more general trail.
Of
course if trails of captured screens or screencasts become part of
dissertations, they will need preserving, too, whether as an
open-standard file (TIFF, MP3) or some proprietary format like PDF or
Flash. Does that mean including a DVD of captured images with every
dissertation? Or another acre of petabyte Web servers at
http://archive.org? I don't know. It's a big glass; maybe a swimming
pool. It's half full... or half empty... or
something.
Meanwhile, if you're thinking of quoting
or citing this page in something more serious than a weblog, be warned
that it's just a blog entry, a first draft, and a fairly casual "work
in progress." Even the page title has been fine-tuned and broken into two
lines since the first posting. Within four days, things the first draft said were no longer true, and
I discovered the Aug. 11 changes just by luck. I can't promise to keep
everything that up to date, but I may edit it, add to it, change my mind
about something, or move it to a new address at http://stepno.com.
Google should still find it for you, if you include enough key words
along with my rather rare last name. If enough people think my train of thought is worthy
of adding to the formal world of scholarship, I'll ask a committee of
peer-reviewers to look at a spruced-up version, make a last batch of
changes, then declare it finished and put it somewhere like Newspaper Research
Journal, the kind of publication a tenure committee
respects.
Since I've removed this from the standard blog-page layout, please e-mail me any corrections or comments, or just click the "comment" link at the bottom of the original blog item.
© Copyright 2009 Bob Stepno.
Last update: 7/27/09; 3:57:42 AM.