

Hydra-headed Metadata

After Sept. 11, authorities said information-stove-piping by intelligence agencies was one of the biggest stumbling blocks in the fight against terrorism. Now, two leading researchers discuss different approaches to merging government files, and cracking open their secrets.

Jamie Callan
Language Technologies Institute, Carnegie Mellon University
W. Bruce Croft
Computer Science Department, University of Massachusetts, Amherst
Eduard Hovy
Digital Government Research Center
Information Sciences Institute, University of Southern California


The terrorist events of September 11 reminded everyone of the need for accurate and timely government intelligence. Some of the information that might prevent disasters is secret, and therefore inaccessible. But in many cases, the information is present somewhere, and freely available. The problem is getting hold of it in a usable form.

Unfortunately, while present-day government in almost all its branches has collected, analyzed, and stored information, most of it is non-uniform. Information is scattered across hundreds of different formats, systems, and versions. You don't know where to find it, how to access it, or how to convert it into a format you can work with once you actually have it.

One of the principal problems facing those trying to standardize non-homogeneous data sets is variation in terminology. For example, what one agency calls salary, another might call income, and a third calls wages, while using salary to mean something else entirely. Definitions vary as well: one agency might calculate monthly average prices of unleaded gasoline in California by measuring wholesale rates each month, while another measures prices at selected pumps weekly and averages them. The results will differ, but both will be called "average monthly gasoline prices in California".
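One simple way to picture the terminology problem is a per-agency mapping from local field names to shared concepts. The sketch below is purely illustrative: the agency names, field names, and canonical concepts are invented, and real integration efforts involve far more than renaming fields.

```python
# Hypothetical per-agency term mappings; agency and field names are
# invented for illustration, not drawn from any real system.
AGENCY_TERM_MAPS = {
    "agency_a": {"salary": "annual_pay"},
    "agency_b": {"income": "annual_pay"},
    "agency_c": {"wages": "annual_pay",
                 "salary": "stock_compensation"},  # same word, different meaning
}

def normalize_record(agency, record):
    """Rename agency-specific field names to shared canonical concepts."""
    mapping = AGENCY_TERM_MAPS[agency]
    return {mapping.get(field, field): value for field, value in record.items()}

print(normalize_record("agency_a", {"salary": 52000}))  # → {'annual_pay': 52000}
print(normalize_record("agency_c", {"salary": 1000}))   # → {'stock_compensation': 1000}
```

Note that "salary" maps to different concepts depending on which agency produced the record: the mapping itself is metadata, and someone has to create and maintain it.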

Clearly, this state of affairs causes confusion not only for Government workers, but also for journalists, congressional staffers, students, the general public, and intelligence officers. All would benefit from government information systems that locate, retrieve, and integrate desired information quickly, transparently handling the details of which databases contain the information and in what format it is presented. No system should expect its patrons to trust its results unquestioningly, so these information systems should also make it easy to examine the relationships among documents and/or databases with similar content if desired.

The basis for any new system is metadata, that is data that describes data or collections of data. The Dewey Decimal system, the Library of Congress Subject Headings, Medical Subject Headings (MESH), and many other controlled vocabularies (sometimes called ontologies) are all familiar forms of metadata. Each document is catalogued by a small number of terms from the controlled vocabulary, as is each information request, and matching them is very simple.

But practical experience has shown that integrating vast and disparate term sets and data definitions to create new forms of metadata is fraught with difficulty. The U.S. Government has funded several metadata initiatives, including the Government Information Locator Service (GILS) and the Advanced Search Facility (ASF). These projects perform exemplary work in establishing a structure of cooperation and standards between agencies, including structural information (formats, encodings, links). However, they do not focus on the actual creation of metadata, nor do they define the algorithms needed to generate it.

Experience with traditional forms of metadata has shown that it is expensive and time-consuming to produce, that people (e.g., authors) often resist creating it when there is no immediate or direct benefit, and that information-seekers often find it difficult to relate their requests to pre-specified ontologies or controlled vocabularies. Generating a common ontology for a domain also tends to be controversial. New standards for communicating metadata, such as XML, do nothing to address the underlying issue of where it originates. Controlled vocabularies and relatively static ontologies are not solid foundations for information systems that must cover a wide range of subjects, support rapid integration of new information, be easy for the general population to use, and remain maintainable at moderate expense. Large-scale use of metadata requires new answers to fundamental questions.

Recently, the Digital Government program of the National Science Foundation has funded a number of projects to address the challenge of integrating large, heterogeneous, widely distributed and disparate Government data collections. In this paper, we describe two complementary approaches: data-access planning based on a large ontology and small, semi-automatically acquired domain models; and dynamic metadata creation from language models.
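To give a flavor of the second approach, the sketch below summarizes each collection by its word statistics (a unigram language model) rather than by hand-assigned terms, and routes a query to the collection most likely to have generated it. This is a minimal illustration under invented sample texts, not the projects' actual algorithm; real systems use far larger models and more sophisticated smoothing.

```python
import math
from collections import Counter

def language_model(text):
    """Summarize a collection by its word counts: metadata derived from content."""
    words = text.lower().split()
    return Counter(words), len(words)

def score(query, model, vocab_size, alpha=1.0):
    """Laplace-smoothed log probability that this model generated the query."""
    counts, total = model
    return sum(
        math.log((counts[w] + alpha) / (total + alpha * vocab_size))
        for w in query.lower().split()
    )

# Invented sample collections standing in for agency databases.
collections = {
    "labor-stats": language_model("wages employment salary hours labor"),
    "energy-data": language_model("gasoline prices fuel energy pump"),
}
vocab = {w for counts, _ in collections.values() for w in counts}

query = "gasoline prices"
best = max(collections, key=lambda c: score(query, collections[c], len(vocab)))
print(best)  # → energy-data
```

Because the model is computed from the data itself, it can be regenerated whenever a collection changes, sidestepping the cost and staleness problems of manually assigned metadata discussed above.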
