echo: osdebate
to: All
from: mike
date: 2007-03-29 18:13:52
subject: Linux to help the Library of Congress save American history

http://www.linux.com/article.pl?sid=07/03/26/1157212

===
The Library of Congress, where thousands of rare public domain documents
relating to America's history are stored and slowly decaying, is about to
begin an ambitious project to digitize these fragile documents using
Linux-based systems and publish the results online in multiple formats.


Thanks to a $2 million grant from the Sloan Foundation, "Digitizing
American Imprints at the Library of Congress" will begin the task of
digitizing these rare materials -- including Civil War and genealogical
documents, technical and artistic works concerning photography, scores of
books, and the 850 titles written, printed, edited, or published by
Benjamin Franklin. According to Brewster Kahle of the Internet Archive,
which developed the digitizing technology, open source software will play
an "absolutely critical" role in getting the job done.

The main component is Scribe, a combination of hardware and free software.
"Scribe is a book-scanning system that takes high-quality images of
books and then does a set of manipulations, gets them in optical character
recognition and compressed, so you can get beautiful, printable versions of
the book that are also searchable," says Kahle.

While previous versions were written for both Linux and Windows, the
Internet Archive has migrated Scribe entirely to Linux, and Windows support
has been dropped. Kahle says the project uses Ubuntu now.

When asked why the Library of Congress chose Scribe for this project, Dr.
Jeremy E. A. Adamson, the library's director for collections and services,
replies that the Internet Archive has already demonstrated "the
efficient production of high-quality images" with it.

Kahle says that a Linux-based Scribe workstation at the Library of Congress
will hold the material to be scanned in a V-shaped cradle -- it doesn't
crack books all the way open -- while two cameras take images of it. A
human operator performs quality assurance, then Scribe sends the digital
images across the breadth of the country to the Internet Archive in San
Francisco, where they are processed and eventually posted online in various
formats. Free software is used almost every step of the way.

"[It's a] Linux-based station out there in the field. It rsyncs the
files up to the servers, [and then] it goes and does the processing on a
Linux cluster of over 1,000 machines, and then posts it online -- also on
Linux machines," Kahle says.

Image processing for an average book takes about 10 hours on the cluster.
The project still uses proprietary optical character recognition (OCR)
software, but Kahle says that many open source applications come into
play, including the netpbm utilities and ImageMagick. The software
performs "a lot of image manipulation, cropping, deskewing, correcting
color to normalize it -- [it] does compression, optical character
recognition, and packaging into a searchable, downloadable PDF; searchable,
downloadable DjVu files; and an on-screen representation we call the Flip
Book."
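
To make the cleanup steps Kahle lists (cropping, deskewing, normalizing
color) a little more concrete, here is a minimal Python sketch that builds
one ImageMagick command line per scanned page. This is not Scribe's code,
which the article does not show. The function names, the directory names,
and the 40% deskew threshold are assumptions for illustration only; the
options -deskew, -trim, +repage, and -normalize are standard ImageMagick
options. The sketch only builds the commands and does not run them.

```python
from pathlib import Path


def cleanup_command(src: Path, out_dir: Path) -> list[str]:
    """Build an ImageMagick command that cleans up one page image.

    Hypothetical helper: the real Scribe processing steps are not public
    in the article, so the option choices here are illustrative.
    """
    dst = out_dir / (src.stem + ".png")
    return [
        "convert", str(src),
        "-deskew", "40%",    # straighten a slightly rotated page (assumed threshold)
        "-trim", "+repage",  # crop the uniform border, then reset the canvas offset
        "-normalize",        # stretch contrast to even out color between pages
        str(dst),
    ]


def plan_book(page_dir: Path, out_dir: Path) -> list[list[str]]:
    """Return one cleanup command per .jpg page, in filename order."""
    pages = sorted(page_dir.glob("*.jpg"))
    return [cleanup_command(page, out_dir) for page in pages]


if __name__ == "__main__":
    # "scans" and "cleaned" are placeholder directory names.
    for cmd in plan_book(Path("scans"), Path("cleaned")):
        print(" ".join(cmd))
```

Each printed line could be run by hand or handed to a job scheduler; OCR,
compression, and PDF/DjVu packaging would be separate later stages.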

The Flip Book is used at The Open Library, a charmingly retro Web interface
for online books that mimics old technologies (clicking "Details"
for a title brings up a yellowed card catalog entry), which the Internet
Archive says was "inspired by a British Library kiosk."

The books are stored in the PetaBox, the Internet Archive's massive
million-gigabyte (one-petabyte) storage system -- a system that Kahle
says is "all built on open source software."

Caring for brittle books

A good number of the historic materials in question are old, fragile, and
in such rough shape that placing them in Scribe's cradle, or even
attempting to read them, could irreparably damage them. Adamson says that
some of the books, for example, have pages "that have become brittle
with age." The materials are in a broad range of conditions that limit
their physical handling, he says, and he uses the general term
"brittle books" to describe them. No list of such brittle materials
at the Library of Congress has been made, but Adamson says that "they
comprise a percentage of virtually every collection." Adamson says the
project's objectives include the development of a more formal
classification and description of these "brittle" materials, and
to "establish digitization workflows based on that classification of
condition."

If scanning the brittle materials demands new software and digitization
techniques, the Library of Congress will work in conjunction with the
Internet Archive to make the innovations available to the public. But
there's no way to know at this point what they may be, because the project
is only getting underway.

"The project proposal calls for months of planning before any scanning
or engineering is to begin," Adamson says. And the planning, he says,
is "significant": "Space needs to be prepared to accommodate
the physical scanning of books, server storage allocated, project plans
need to be written, project team members briefed, along with myriad other
details required for a project of this magnitude and complexity."

Eventually, Adamson says, when the scanning and processing of materials has
been completed, the high-quality digitized versions of these historic
documents (and metadata associated with them, such as indices and contents)
will be freely accessible online -- which Kahle says is a "huge
step" in broadening the reach of the ever-too-small public domain.

"There may be public domain books that are sitting on shelves, but if
you can't get access to [something], what good does it do to be in the
public domain?" says Kahle. "The Library of Congress is dedicated
to keeping [these digitized holdings] public domain, which I think is a
great step that's not being followed by everybody else."

The program is part of larger efforts at both institutions: at the Library
of Congress, to preserve old media and records, and at the Internet
Archive, which is already scanning public domain materials with its Open
Content Alliance, a consortium of about 40 libraries. Kahle says that the
alliance is presently operating in five cities, using the Scribe software,
at a brisk clip of 12,000 books a month.

"We're part of the 'open world' through and through -- we use open
source software, we generate open source software, we generate open
content," says Kahle. "We're trying to take this open source idea
to the next level, which is open content and open access to cultural
materials, which means 'publicly downloadable in bulk.' I think we're
really seeing the next level up of this whole movement -- we had the open
network, then open source software, now we're starting to see open source
content."


Links

"Library of Congress" - http://loc.gov/
"Sloan Foundation" - http://www.sloan.org/
"previous versions" - http://sourceforge.net/projects/scribesw/
"Ubuntu" - http://ubuntu.com/
"Internet Archive" - http://archive.org/
"netpbm utilities" - http://netpbm.sourceforge.net/
"ImageMagick" - http://applications.linux.com/article.pl?sid=05/03/29/1525217&tid=39
"The Open Library" - http://www.openlibrary.org/
"PetaBox" - http://www.archive.org/web/petabox.php
"preserve old media and records" - http://www.digitalpreservation.gov/
"Open Content Alliance" - http://www.opencontentalliance.org/

===

   /m

--- BBBS/NT v4.01 Flag-5
* Origin: Barktopia BBS Site http://HarborWebs.com:8081 (1:379/45)
SEEN-BY: 633/267
@PATH: 379/45 1 633/267
