
Electronic Newspaper Archiving

Karl Bridges

Introduction

During the past year Booth Library, under the auspices of a Library Services and Construction Act Title III grant from the Illinois State Library, has developed a prototype system for the digitization of newspaper microfilm. This has been an exciting year as Booth Library, in cooperation with the Journalism Department, EIU Student Publications, and EIU Computer Services, has developed new technology for the delivery of newspaper resources. In this report, I will discuss what we have accomplished, the technology and methodology developed, and the implications of this project for future research in this area.

Recently, there has been increased interest in developing electronic newspapers. With more than 60 million personal computers in American homes, and that number climbing every day, newspaper publishers are looking for methods to get their traditionally paper-based products into an electronic format. These efforts have occurred in many places and in different ways, but overall the results have not met with commercial acceptance, either by the industry or by consumers.

The main reason for this lack of success is that electronic newspapers thus far have not looked like newspapers. Modeled on full-text wire services, they do not have the look and feel that newspaper readers have been used to for more than 200 years. In addition, electronic newspapers have not been economically viable because their production has required additional staff and resources to, essentially, re-edit the newspaper into an electronic format.

At Eastern Illinois University we have engaged in a pilot project to develop a fully formatted electronic newspaper. Using portable document format (PDF) technology integrated with the World Wide Web (WWW), we can offer our university newspaper, the Daily Eastern News, in a fully digital format that preserves the "look and feel" of traditional newspapers with expanded advertising possibilities. An additional benefit of our techniques is that we also can obviate the need for microfilm copies and replace our microfilm newspaper backfile with a searchable digital version.

Converging Needs of Newspapers and Libraries

This project is the result of the converging needs of the student publications department at Eastern and the library. Student publications maintained a traditional clip file "morgue" of newspaper stories. For its part, the library maintained a card index file of newspaper stories. While these functioned well, there were definite problems with both systems, primarily the time required for upkeep and the difficulty users had in accessing newspaper information.

The central problem with newspaper indexing, regardless of whether you use pre- or post-coordinated indexing, is that the index developer rather than the user decides what is important. An essential aspect of newspaper databases is that users can search for what is important to them. This user-centered philosophy was at the core of our thinking in this project.

A variety of options were considered—such as maintaining the existing system or developing a database of the existing card index file. However, after careful thought it was decided that we had to develop a fully digital system that met a variety of criteria including:

• being fully searchable

• available over our campus network and the Internet

• capacity to be fully automated

• capacity to include a digitized version of existing microfilm

Initially, prior to getting this grant, we settled on using Gopher, primarily because it met at least some of our criteria while also fitting the financial constraints of the project (there was no funding for either equipment or personnel). This worked well, at least to the extent that we were able to offer the full text of the newspaper via Gopher. However, it also had severe limitations, especially the fact that the material didn't look like a newspaper.

*Karl Bridges, Assistant Professor, Booth Library, Eastern Illinois University, Charleston. A similar article will appear in Campus Wide Information Systems, Vol. 12, No. 4.

Methodology

Technology Overview

One of the first decisions we had to make was what kind of computer platform to utilize. There were three primary choices: Macintosh, UNIX and Windows. We chose the Macintosh environment for several reasons:

It is the accepted platform used in the newspaper industry. Because the student publications department uses Macintosh equipment, it was easy to integrate our project with their operations.

Using MacHTTP, Macintosh represented the easiest and fastest way to develop and maintain a functional World Wide Web (WWW) platform. While, arguably, a WWW server operated on a UNIX workstation has some advantages, they are outweighed by the high initial cost of equipment and ongoing support costs. In addition, the MacHTTP WWW environment supports a variety of add-on software that greatly enhances functionality by providing features such as user logging statistics.

The Macintosh environment also offered us a fairly easy way to automate many of the tasks involved because Apple Computer's AppleScript allows us to integrate the various pieces of hardware and software used in the project.

Hardware Specifics

The central piece of hardware purchased was the Macintosh 8150 server. This robust server, running at 80 MHz with 48 MB of RAM, could handle the load of potentially hundreds of daily users accessing the WWW. In addition, we also purchased two 4 gigabyte hard drives for storage of the digitized newspaper pages.

Redundancy is an important aspect of the system. We are running each 4 gigabyte drive as a RAID—Redundant Array of Independent Disks. Essentially, we have subdivided the drives into 2 gigabyte drives that mirror each other. In the event of drive failure, the data is duplicated on the mirror and no data is lost, thus increasing the system's stability. The software used is the Apple RAID Software that came with the server. In addition, we have a tape backup drive that daily and automatically copies the drive's contents to an 8mm tape as a third safeguard.

Besides the server we have two Macintosh 8100 workstations, one placed in the library and one currently in student publications. Although we originally thought all processing would be done in the library, much of the file processing was done in the student publications lab, primarily to avoid transferring large files over the university network. We were very concerned about not overtaxing the university network.

We also purchased two scanners: one Avec Colour 2400 flatbed scanner and one Nikon 3510 slide scanner. The flatbed scanner is used for scanning index cards of the existing Daily Eastern News Index, while the Nikon scanner is used for microfilm. In addition, each scanner serves as an input device for pictures and other materials used in World Wide Web pages being developed for the project and in the library as a whole.

Portable Document Format—Overview

Our technology is based on a recently developed technology from Adobe Systems: the portable document format (PDF). This format, made possible by Adobe's Acrobat Distiller, Exchange, and Reader software, allows us to take the newspaper page (produced in QuarkXPress) and translate it into a compressed PDF file that is readable, using free Reader software, on any kind of computer platform.

The Readers give the user a variety of tools, including the ability to enlarge and shrink the document, copy text and graphics, print, and search within individual documents. You can think of the Reader as being in the same class as the helper applications associated with graphical WWW browsers, such as SoundMachine (sounds) or GIF Viewer (GIF format pictures). The Acrobat Reader is fully compatible with the most commonly used graphical WWW browsers, including Mosaic and Netscape. Once properly configured, Netscape or Mosaic will automatically launch Acrobat. Acrobat documents also are accessible using Gopher, although, given the hardware requirements of Acrobat (a 386 or better with color monitor, or the Macintosh equivalent), you might want to consider just using Netscape. Gopher is now a subset of the WWW.
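The helper-application idea described above can be sketched in a few lines of Python. This is only an illustration of the general mechanism, not how Netscape or Mosaic is actually configured: the `HELPERS` table and the `helper_for` function are hypothetical names, and a real browser acts on the content type reported by the server rather than guessing from the file name as this sketch does.

```python
import mimetypes

# Hypothetical helper table: content type -> application the browser launches.
# The application names come from the text; the table itself is an assumption.
HELPERS = {
    "application/pdf": "Acrobat Reader",
    "image/gif": "GIF Viewer",
}

def helper_for(filename):
    """Return the helper application for a file, or None if the browser
    would handle the file itself (for example, an HTML page)."""
    content_type, _encoding = mimetypes.guess_type(filename)
    return HELPERS.get(content_type)

print(helper_for("front_page.pdf"))
print(helper_for("index.html"))
```

The point of the table is that the reader never chooses a viewer by hand: once the PDF content type is mapped to Acrobat, clicking a link to a newspaper page is enough.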

File size is a major concern, especially to those accessing the Internet over a modem (the dirt road to the information river, as I like to think of it). Acrobat offers great file compression of between 10 and 20 times, depending on the content of the pages. We're typically able to turn a 1 megabyte QuarkXPress file into a 50K Acrobat document. Of course, if you make heavy use of color, expect file size to increase, probably to around 200K. An area for future investigation is the concept of "streaming" the file. It is possible, for example, to load a large sound file in "pieces," so that part of the sound is played as other parts are downloaded. It should be possible to do the same thing with large Acrobat files.
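To make the file-size figures above concrete, here is a rough back-of-the-envelope calculation in Python. It assumes an idealized 14,400 bits-per-second modem link with no protocol overhead and no modem-level compression, so real transfers would differ; the function name `transfer_seconds` is just an illustrative choice.

```python
# Idealized transfer-time estimate; ignores protocol overhead and line noise.
MODEM_BITS_PER_SECOND = 14_400  # assumed speed of a 14.4 dial-up modem

def transfer_seconds(size_bytes, bits_per_second=MODEM_BITS_PER_SECOND):
    """Seconds needed to move size_bytes over the link at full speed."""
    return size_bytes * 8 / bits_per_second

quark_bytes = 1 * 1024 * 1024   # the "1 megabyte" page file from the text
pdf_bytes = 50 * 1024           # the "50K" Acrobat document
color_pdf_bytes = 200 * 1024    # the "around 200K" color-heavy page

compression_ratio = quark_bytes / pdf_bytes
print(f"compression ratio: about {compression_ratio:.0f} to 1")
for label, size in [("original page file", quark_bytes),
                    ("PDF page", pdf_bytes),
                    ("color PDF page", color_pdf_bytes)]:
    print(f"{label}: {transfer_seconds(size):.0f} seconds")
```

Under these assumptions the 50K page arrives in roughly half a minute, while the uncompressed page file would tie up the line for close to ten minutes, which is why compression matters so much to dial-up readers.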

Electronic Archiving Techniques

The process involves taking the newspaper page (or other document), printing it as a PostScript file, and then having the Acrobat Distiller software make it into a PDF file. This system is largely automated since the Acrobat Distiller software has a feature that allows a folder to be defined as "watched." Every 10 seconds the Acrobat software checks that folder for new files, and if they are there it automatically processes them into PDF files that are deposited in an "out" folder from which they can be sent elsewhere, in our case to our World Wide Web server. Because we are a Macintosh environment (at least in terms of our WWW server and our newspaper graphics department), we can further automate the process through the use of AppleScript, which allows us to write a script that will automatically move files across the network at a set time of day. I should stress that although we make use of Macintosh equipment, Acrobat software is available for other commonly used platforms, such as Windows, DOS and UNIX. A newspaper that did its pagination in a Windows environment would be equally able to take advantage of this technology. And, of course, the equipment used for producing the original pages and the PDF files is completely irrelevant to the end user.
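The "watched folder" pattern described above can be sketched as a simple polling loop. This Python sketch is not Acrobat Distiller and does not use AppleScript; `convert_to_pdf` is a hypothetical stand-in that merely copies the file under a new name, and the folder names are assumptions chosen for illustration. Only the 10-second polling interval comes from the text.

```python
import shutil
import time
from pathlib import Path

POLL_SECONDS = 10  # the text says the folder is checked every 10 seconds

def convert_to_pdf(ps_file, out_dir):
    """Hypothetical stand-in for the PostScript-to-PDF step.
    It only copies the bytes under a .pdf name; a real system would
    hand the file to a converter such as Distiller here."""
    target = out_dir / (ps_file.stem + ".pdf")
    shutil.copyfile(ps_file, target)
    return target

def process_watched_folder(in_dir, out_dir):
    """Convert every .ps file waiting in in_dir and return the outputs."""
    out_dir.mkdir(parents=True, exist_ok=True)
    produced = []
    for ps_file in sorted(in_dir.glob("*.ps")):
        produced.append(convert_to_pdf(ps_file, out_dir))
        ps_file.unlink()  # remove the source so it is not processed twice
    return produced

def watch(in_dir, out_dir, max_polls=None):
    """Poll in_dir forever, or max_polls times if a limit is given."""
    polls = 0
    while max_polls is None or polls < max_polls:
        process_watched_folder(in_dir, out_dir)
        polls += 1
        time.sleep(POLL_SECONDS)

# Example call (runs until interrupted):
# watch(Path("in"), Path("out"))
```

Removing each source file after conversion is one simple way to make sure a page is never processed twice; moving it to a "done" folder would work equally well and keeps the original for troubleshooting.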

After producing the PDF files we move them to our World Wide Web server in the library. In our case, we chose a Macintosh 8150 server. This choice was made primarily because of cost and a desire to avoid setting up a UNIX box in the library, because we had neither the time nor the expertise to accomplish that task. The choice of a Macintosh-based WWW server is debatable. As discussed above, there are some definite limitations compared with UNIX, especially in the ability to automate and program the server as well as in the capacity of the server for a large number of users. The WWW server, http://www.booth.eiu.edu, has been a major development for the library and has greatly increased our ability to effectively serve our user population. The server and the MacHTTP software are robust and do a good job, both in terms of utility for users and the costs (primarily staff time) involved in keeping up the server.

The main cost for us is the time involved in maintaining the World Wide Web server. This involves updating the HTML pages relating to the newspaper pages on a daily basis, which takes an hour or two daily. At present, we also have additional costs involved with processing our two-year backfile of newspaper pages. We expect, however, to have those pages done and on the WWW by June 1996. The biggest problem we've encountered, aside from the issue of searching, is that the success and visibility of this project has created such a demand for WWW services, digitization and the production of PDF pages across campus that I find it difficult at times to focus on the newspaper project itself!
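The daily HTML update described above is the kind of chore that could, in principle, be scripted. The following Python sketch builds a simple index page linking to one day's PDF pages. It is not the procedure we use; the function name `build_index_html`, the page title, and the zero-padded file names (`page01.pdf`, `page02.pdf`) are all assumptions made for illustration.

```python
from html import escape

def build_index_html(issue_date, pdf_names):
    """Return an HTML page listing one issue's PDF pages in order.
    Assumes zero-padded names so that alphabetical order is page order."""
    items = "\n".join(
        f'<li><a href="{escape(name, quote=True)}">Page {number}</a></li>'
        for number, name in enumerate(sorted(pdf_names), start=1)
    )
    title = escape(f"Daily Eastern News, {issue_date}")
    return (
        f"<html><head><title>{title}</title></head>\n"
        f"<body><h1>{title}</h1>\n<ul>\n{items}\n</ul>\n</body></html>\n"
    )

page = build_index_html("1995-10-02", ["page02.pdf", "page01.pdf"])
print(page)
```

A script along these lines, run after the PDF files arrive on the server, would replace hand editing with a single step, at the cost of requiring a consistent file-naming scheme.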

University Networking: Integrating Token Ring and Ethernet Networks

A major issue in this project was the integration of two different networking schemes in our university environment. Student Publications uses Ethernet while the university as a whole uses the Token Ring topology. We needed an effective bridge between the two in order to successfully move the newspaper pages to the library. This also was important because we planned to provide dial-up access for our off-campus partners through separate phone lines located in Student Publications. For several reasons, mainly the saturation of the existing university phone lines, it was not possible to have dial-up access through the university's existing modem pool.

The solution we decided on was the Internet Router. This software, running on a low-end machine (a Macintosh IIci), allowed us to effectively bridge the two networks. It acts as a "translator" that passes the Ethernet signals to the Token Ring network, making the Ethernet-connected computers visible to the Token Ring-connected computers on the campus. This seems to work fairly well, although there was a long start-up period because we had to diagnose and fix (with the cooperation of Computer Services) several bad interactions with various campus computers. Several devices, mainly printers, had to be powered down and back up in order to be properly visible to the router. This process also had the effect of dividing the campus network into zones; previously the university network had been completely contiguous.

This solution, while connecting Student Publications and Booth Library, did not solve the connectivity problems for our off-campus partners. Cowden-Herrick High School became connected to the Internet through an ISBE initiative. We connected them through a 14.4 Kbps dial-up connection. Charleston High School was already connected to the Internet through EIU. The main problem area was the Charleston Public Library. They have dial-up capability, but they have no one to call. For various policy reasons the university will not provide dial-up access. At the same time, the local phone company, although rapidly becoming involved as an Internet provider, has offered the service at a price ($700+ per month) that is not affordable. This is a textbook example of why we need access for public institutions to be provided for in the development of the National Information Infrastructure. We are currently researching other options, including third-party Internet service providers, like Prodigy, and making cooperative arrangements with other institutions, like the Charleston public school system, which are making efforts themselves to become Internet connected. Since a large percentage of public school students are public library users, this seems a logical development.

Searching PDF Documents—Area for Improvement

At present the major drawback for this project is the lack of an effective search engine. Although we can produce the PDF documents with relative ease, we lack a search engine capable of effectively searching them. Adobe does offer a search engine that allows searching of PDF documents. However, it is strictly Windows-based and does not offer the ability to integrate fully with the World Wide Web. A commercial search engine that met all our requirements would allow searching of PDF documents and other file formats, as well as allow the development of indexes from remote databases. This means that each department in the university could develop a database of documents that could be indexed from a central location.

We have identified three commercially available search engines that meet our requirements, some partially and some completely. At this point, we have decided to work with a demo version of the PLWeb software. Fundamentally, it was the only search engine we could find that was affordable. Although it runs in a UNIX environment, it has a variety of features (including the ability to index documents on remote computers) that we found attractive. This decision was partly fueled by the interest of the university in PLWeb as a possible solution for storage and indexing of various university documents.

Searching Solutions

AppleSearch

The solution we initially implemented was Apple Computer's AppleSearch. In conjunction with the AppleWebSearch application (available free on the Internet), this allows full-text searching that can be fully integrated with the WWW and graphical browsers such as Netscape. The problem is that it does not support searching of PDF documents and, as far as I have been able to determine, Apple has no interest in developing this, although it has been rumored that third parties are developing the appropriate software.

Verity

Verity Corporation's Topic Websearcher (http://www.verity.com) was also considered. Aside from the price (approximately $7,000), it seemed to meet our major requirement: the ability to search PDF documents. It seemed to have some fairly sophisticated search and retrieval tools, with an excellent ability to search and retrieve information across several databases and the Internet using standard searching features, including proximity and Boolean searching. A drawback was that the product was separate from the Web server software. Integrating these two elements was an added support cost that we didn't wish to incur.

PLWeb

PLWeb, manufactured by Personal Library Software of Rockville, Maryland, was the software we decided to test. Based on the testing we have performed over the Internet (http://www.pls.com), it appears to be the integrated solution that we have been seeking.

The company's documentation indicates that this UNIX-based product (Hewlett-Packard HP-UX, Sun Solaris) supports a variety of document formats, including ASCII, HTML and PDF. There also is a user configurable search interface and a custom thesaurus. There is support for Natural Language Searching as well as Relevance Ranking, Concept Searching, and Boolean and Wild Card searching. The server element seems robust and has a variety of security and operational features that would make it a good choice for a library environment. My experiences in dealing with the company were positive. They seemed friendly and knowledgeable and, alone among vendors I dealt with, are willing and able to offer a 30-day free trial that can be downloaded over the Internet. The price, approximately $4,900, also was reasonable given the features.
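For readers unfamiliar with the Boolean and wild card searching mentioned above, the underlying idea can be shown with a tiny inverted index in Python. This is a generic textbook sketch and says nothing about how PLWeb, Verity or AppleSearch work internally. It assumes the text has already been extracted from each PDF page; the sample headlines and all function names are made up for illustration.

```python
import re
from collections import defaultdict
from fnmatch import fnmatchcase

def build_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return dict(index)

def lookup(index, term):
    """Documents matching one term; '*' in the term acts as a wild card."""
    pattern = term.lower()
    matches = set()
    for word, doc_ids in index.items():
        if fnmatchcase(word, pattern):
            matches |= doc_ids
    return matches

def search_all(index, terms):
    """Boolean AND: documents containing every term."""
    results = [lookup(index, term) for term in terms]
    return set.intersection(*results) if results else set()

def search_any(index, terms):
    """Boolean OR: documents containing at least one term."""
    return set().union(*(lookup(index, term) for term in terms))

# Made-up sample text standing in for extracted newspaper pages.
sample_pages = {
    "page1": "Board raises tuition",
    "page2": "Tuition freeze proposed by board",
    "page3": "Panthers win homecoming game",
}
sample_index = build_index(sample_pages)
print(search_all(sample_index, ["tuition", "board"]))
print(search_any(sample_index, ["freeze", "panthers"]))
print(lookup(sample_index, "home*"))
```

Because the index is built from words the user types rather than from subject headings chosen in advance, this approach fits the user-centered philosophy described earlier: the reader, not the indexer, decides what is important.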

Scanning Newspaper Microfilm

The digitization of newspaper microfilm is another important element of this project. As most librarians are aware, microfilm is not the ideal medium for meeting user needs. It is difficult to work with, expensive to manufacture and maintain (especially in terms of equipment), and it is geographically specific. By digitizing newspaper microfilm using the PDF format it is possible to overcome some of these problems. Using a high-end digital scanner (a Nikon AF-3510) originally designed for professional photographic work, we are able to get sufficient resolution to successfully digitize newspaper microfilm. Converted into a PDF format this "digital microfilm" has a quality comparable to that obtained with conventional microfilm viewing techniques with the additional advantages that were outlined earlier vis-a-vis newspapers.

From a production standpoint there are some major concerns. Microfilm is difficult to work with. We are having to develop a mechanical holder for the film because our scanner does not come with a roll film attachment. An exhaustive search among most major camera and microfilm manufacturers found no one who manufactures a roll film carrier suitable for our scanner. There also are problems with the film itself. When digitizing film we found a variety of scratches, dust and other damage to the film, primarily related to wear and tear from conventional microfilm readers. With normal usage this damage would probably go unnoticed, but it quickly becomes obvious when digitizing the film at high resolution. This was surprising and, perhaps, would bear closer examination at a later date. We also found that the standard process of printing microfilm on blue film rather than clear meant extra steps had to be taken to "wash" this color from the digitized microfilm image. These are the kinds of problems we expected at the project's outset, and they can be successfully dealt with using standard photographic techniques and tools associated with Adobe Photoshop. One solution is to work from new copies of the microfilm. We can get replacement copies from the master microfilm at the State Library for about $22 a roll.

Providing User Access—The Netscape Solution

This project was designed from the ground up to be accessible by Internet users. Early on we focused our attention on the Netscape client as the primary means we would recommend for our users, in particular those gaining access via dial-up. There were a variety of reasons for this approach, including:

• Netscape handles images faster than any other client available

• Netscape is free to educational users

• Netscape provides access to HTML Level 3 commands.

Although not widely in use now, HTML Level 3 is expected to be the new standard for HTML documents because it is more functional. Another important aspect was that Netscape announced that support for PDF would be built into its browser. This will eliminate the need for an Acrobat Reader helper application, thus simplifying installation and lowering support costs.

As previously noted, we equipped three partner organizations as part of this grant: Cowden-Herrick High School, Charleston High School and the Charleston Public Library. Each institution received a Quadra 630 computer with keyboard and monitor, a 14.4 Kbps fax-capable modem and a Hewlett-Packard DeskWriter C color printer. Installations were done in the winter and spring of 1994-1995. The equipment was received with great joy and eagerness by all parties. In the interest of promoting maximum interest and encouraging learning, the participants were encouraged to have as many people as possible use the equipment, especially the Internet capabilities. It also was stressed that if they found other uses in their organizations for the computer they should pursue them.

In the case of the two high schools, the addition of color printing capability was seen as a plus, and both took advantage of it.

Areas for Future Development

The most important aspect of this project, from the standpoint of journalism researchers and the newspaper industry is that the newspaper is delivered.

One important aspect of PDF technology is that it allows us to make interactive Internet links from within documents. Using Acrobat Exchange we can define active areas within documents, "hotspots" that the user can click on, and link them to Internet URLs. In short, we can use Acrobat to create hypertext documents.

There are some clear advantages to this. From a newspaper standpoint it allows us to make using the electronic newspaper much more like reading a regular newspaper. When you want to change pages you simply click on a link and move backward and forward between documents. It also offers the potential to link to other kinds of files. For example, advertisers could define their newspaper ads as active links. By clicking on an ad a user could link to an HTML form that could be used to order a product. This form could then be sent via e-mail or fax to the merchant. A more mundane library use might be to make links to other kinds of files or WWW pages. One idea we've been toying with is an interactive university yearbook with Internet links to graphics, sounds and full-motion video. We continue to research both these concepts.

This technology allows more effective use of existing newspaper resources. For example, we take all our newspaper photographs in color. The cost of color printing, however, is too expensive to allow the newspaper to print in color very much. With this technology, we potentially have the capability to offer the newspaper in color over the Internet on a daily basis, while continuing to print the paper newspaper version in black and white. We are currently researching methodologies to be able to do this. Additionally, this technology allows the newspaper to print all the information it gathers rather than just selected information. Although newspaper reporters now take copious notes, only a small amount of that information ever sees print due to newspaper space limitations. By using the WWW it would be possible to publish all the information. Clearly this represents a more cost-effective use of the newspaper's expenditure for the reporter's time and effort in information gathering.

The project is only in its initial stages. However, it is clear that it represents an exciting and distinctly different approach to putting newspapers on the Internet. By being able to offer the newspaper in a format that looks like a newspaper, while also offering a cost-effective methodology for doing so, we are developing an alternative for electronic publishing that we feel in time will gain acceptance and usage.

Conclusion

While this project didn't meet all its initial goals (in particular, we failed to connect the public library), I feel that, in the aggregate, it was successful. From a technical standpoint we have developed and integrated a number of elements into a functional system.

The technology, however, was the least important aspect of the project. More than anything else, the project has exposed a large number of people to the possibilities inherent in the Internet. As more people became exposed to what we were doing, we could see the excitement level building as they saw how they could adapt what we were doing to their own purposes. The idea of putting fully formatted documents on the Internet is a viable one. During the course of the year we received several requests from members of the university community to put their documents, flyers and newsletters on the Internet. Such a response is heartening and shows that people are beginning to understand the potential of such projects.

We also built infrastructure, both within the university and between the university and the community. The contacts and interaction between the various individuals, both on and off campus, was a valuable exercise that will be very beneficial in the years to come, especially as we further explore the potential of networks and distance education.

Eastern Illinois University's electronic newspaper archive can be searched at http://www.booth.eiu.edu/HTML/archive-html

Illinois Periodicals Online (IPO) is a digital imaging project at the Northern Illinois University Libraries funded by the Illinois State Library