DG Project Studies Ways of Unifying Heterogeneous Databases
You need to determine the cost of housing and the quality of the schools, and you want to know where your spouse can find a job, whether traffic is likely to be congested, how hot the weather can get - and the answers to myriad other questions.
Do you know the correct language for querying the databases that contain the information? Better still, wouldn't it be wonderful if you didn't have to search dozens of Web sites for the answers?
DG researchers are currently working on a solution by creating "GovStat," a Web-based way to help citizens take fuller advantage of federal, state and local governments' rich statistical resources. GovStat is conceived as the first working step towards a full-fledged "Statistical Knowledge Network" (SKN). The SKN would bring together all of the information in hundreds of government databases in a form that could be easily understood and accessed by any citizen. At best, the SKN could be decades away from implementation. But many practical pieces have the potential to come into being within the next few years as part of GovStat.
Funded by the NSF, GovStat is being developed by teams at the University of North Carolina Interaction Design Lab and the University of Maryland Human-Computer Interaction Lab. Federal partners include the Bureau of Labor Statistics, the National Center for Health Statistics, the Energy Information Administration, the Social Security Administration, the National Agricultural Statistics Service and the Census Bureau.
The principal investigator (PI) on the project is Gary Marchionini of the University of North Carolina at Chapel Hill. Marchionini, who is the Cary C. Boshamer Professor in the School of Information and Library Science, knows the extent of the challenge first-hand. At the University of Maryland, he worked on a similar project for Montgomery County: "It took us a couple of years to gather all this information for just one county," he said.
Marchionini says that FedStats - the current government Web site that allows people to get Federal statistical information from different agencies - can be seen as "a good example of the beginnings of an integrated view."
"It is still more of a portal, a pointer to information," he says.
GovStat, however, was conceived as a more fully integrated service that pulls the information itself together for the user, rather than just generating a simple index of documents where the information can be found.
What makes a perfect SKN such a challenge is the "back end" - the underlying code that the public never sees. Writing the software that would make thousands of heterogeneous databases interoperate smoothly is a head-spinning task. Not all databases were written in the same programming language, and even those that were might use different instructions to perform the same tasks. Queries and fields may be set up differently - even something as simple as storing "first name, last name" instead of "last name, first name" can take a major effort to coordinate across dozens of databases.
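To make the name-order example concrete, here is a minimal Python sketch of the kind of normalization a back end would need before records from two databases could be compared. The helper `normalize_name` and the two input conventions are assumptions made for illustration; they are not part of GovStat.

```python
def normalize_name(raw: str) -> tuple:
    """Return (first, last) from 'Last, First' or 'First Last'.

    Hypothetical helper: real records have middle names, suffixes
    and other cases that this sketch ignores.
    """
    raw = raw.strip()
    if "," in raw:
        # 'Last, First' convention: split on the first comma
        last, first = (part.strip() for part in raw.split(",", 1))
    else:
        # 'First Last' convention: split on the final space
        first, _, last = raw.rpartition(" ")
    return first, last


# Two hypothetical databases storing the same person differently
record_a = normalize_name("Marchionini, Gary")
record_b = normalize_name("Gary Marchionini")
```

Both calls yield the same `(first, last)` pair, so the two records can be matched. Multiplying this one small rule by every field in dozens of databases gives a sense of the coordination effort described above.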
One long-term approach is to create alternative information architectures, using techniques such as statistical clustering and principal component analysis, says Marchionini:
"We're trying to automatically generate 'bins' to throw tens of thousands of Web pages in, and find ways to label these bins. BLS has over 30,000 Web sites that are kind of organic - they evolved over time. They've worked very hard, and we've worked with them, to make it all more user orientated. We're trying to come up with automatic ways to do this, not only for individual agencies, but [also for] the million Web pages that deal with federal statistics alone."
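The idea of generating "bins" and labeling them can be illustrated with a toy Python sketch. It uses a small k-means-style pass over word counts with cosine similarity, which is only one of many possible clustering techniques and is not claimed to be the project's method; principal component analysis is not shown. The page texts, seed choices and function names are all invented for the example.

```python
from collections import Counter
import math


def vectorize(text):
    """Bag-of-words vector: a Counter of lowercase tokens."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster(pages, seeds, rounds=5):
    """Assign each page to its most similar bin, then rebuild the bins."""
    vectors = [vectorize(p) for p in pages]
    centroids = [vectors[i].copy() for i in seeds]
    assignment = []
    for _ in range(rounds):
        assignment = [
            max(range(len(centroids)), key=lambda k: cosine(v, centroids[k]))
            for v in vectors
        ]
        for k in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignment) if a == k]
            if members:
                # Summed counts point in the same direction as the mean,
                # which is all that cosine similarity looks at.
                centroids[k] = sum(members, Counter())
    return assignment, centroids


def label(centroid, n=2):
    """Label a bin with its n most frequent terms."""
    return [term for term, _ in centroid.most_common(n)]


# Invented page texts standing in for agency Web pages
pages = [
    "coal production coal prices",
    "coal mining output",
    "unemployment rate wages",
    "wages employment unemployment",
]
assignment, centroids = cluster(pages, seeds=[0, 2])
```

Here the two coal pages land in one bin and the two labor pages in the other, and `label(centroids[0], 1)` names the first bin after its most frequent term. A real system would also need stop-word removal, term weighting and a principled way to choose the number of bins.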
But the greatest challenge is likely to be nomenclature: How do different agencies refer to common concepts? That is where Marchionini's group is concentrating its efforts: "Take a concept like 'income' - if you're the Department of Labor, 'income' is earned income, what you get from your job. It's not alimony, it's not investment income, it's not rent, not all these other things. But at the Department of Commerce, they actually mean something different."
"Gary's tack is how can we bridge across between the everyday world and these heritage-bound data repositories, where the last thing you want to do is change the way you compute the CPI," says John Bosley of the Bureau of Labor Statistics, which has already worked with the lab for several years. "A lot of what he's trying to do is work around some facts of life - economics, staffing levels, legacy databases, inherent conservatism, and even the fact that agencies don't want to look like each other too much."
One tool, a Statistical Glossary, is at the prototype stage and about to undergo user testing with students and members of the general public. It has four levels or types of definitions, explains Marchionini, using the Consumer Price Index as an example: "The standard 6th-grade-level general definition, the technical definition, a graphical explanation and then an animated explanation. You actually see a visualization, like a market basket with all these things going into it, and then out plops the CPI."
Says Bosley, "Our main role is to keep their research grounded in the practicalities of providing data to everybody in the country, from people who have PhDs in statistics and economics to my eleven-year-old grandson using them to write school reports. That's one of the reasons the glossary project is so important to us; we realize we are still jargon-ridden. Nobody outside of compensation analysts uses the word 'compensation' to talk about pay."
The researchers are also experimenting with whether people respond better to seeing a general definition first, or whether all the definitions should be context-sensitive. For instance, would it be preferable to set up a glossary listing several agency-specific definitions for the word "income," or simply the common dictionary definitions? They are also developing a statistical ontology, which would establish the conceptual relationships between terms.
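The trade-off between a general definition and context-sensitive ones can be sketched as a lookup that prefers an agency-specific entry and falls back to the general one. This is a hypothetical illustration, not the prototype glossary: the names `GENERAL`, `BY_AGENCY` and `define` are invented, the general definition is a placeholder, and only the Department of Labor wording paraphrases the quote above.

```python
from typing import Optional

# Placeholder general definition (not taken from any real glossary)
GENERAL = {"income": "general dictionary definition (placeholder)"}

# Agency-specific senses; only the Labor entry paraphrases the article.
BY_AGENCY = {
    ("income", "Department of Labor"):
        "earned income: what you get from your job",
}


def define(term: str, agency: Optional[str] = None) -> str:
    """Prefer an agency-specific definition, else fall back to the general one."""
    key = term.lower()
    if agency is not None and (key, agency) in BY_AGENCY:
        return BY_AGENCY[(key, agency)]
    return GENERAL[key]


labor_sense = define("income", "Department of Labor")
default_sense = define("income")
```

An agency with no entry of its own simply receives the general definition. Showing the general definition first, as the researchers are testing, would amount to always displaying `default_sense` before any agency-specific sense.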
Bosley and UNC professor Stephanie Haas are working together to develop a statistical literacy test as a way to discover the core concepts people need in order to understand statistics. "[It's about] what are the minimum number of statistical concepts that people really need to keep themselves out of trouble. I can't compare crime statistics to labor statistics straight out of the box because they're based on different samples, so people need to know the concept of sampling."
The lab's innovations in interface design also come into play, in what Marchionini calls the "Relation Browser."
"It allows you to look at columns of entities," he explains. "One column might be topics, another might be geographic regions, another dates within topics. I roll my mouse over a topic, for example - 'Energy,' [with sub-categories]: 'petroleum,' 'nuclear,' 'electricity,' 'coal,' 'solar.' When coal gets highlighted, I see over in 'date' and 'region' how much data and how many Web sites or how many tables or how many units I have for a given date category or geographic category. If I click on it, and freeze it, I get all the results on the same screen. On one page, I'm understanding all the relationships between all these entities - and they're all hot-linked, so I can just go to them."
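The behavior Marchionini describes, where highlighting one entity updates the counts shown in the other columns, resembles faceted counting. The Python sketch below illustrates that idea under stated assumptions: the records, facet names and the function `facet_counts` are invented, and nothing here reflects how the Relation Browser is actually implemented.

```python
from collections import Counter

# Hypothetical catalog: one record per table or Web page
records = [
    {"topic": "coal", "region": "East", "date": "1990s"},
    {"topic": "coal", "region": "West", "date": "1990s"},
    {"topic": "coal", "region": "East", "date": "2000s"},
    {"topic": "solar", "region": "West", "date": "2000s"},
]


def facet_counts(records, facet, value):
    """For records where facet == value, count items in every other facet."""
    selected = [r for r in records if r[facet] == value]
    others = [f for f in records[0] if f != facet] if records else []
    return {f: Counter(r[f] for r in selected) for f in others}


# Rolling the mouse over 'coal' would show these counts in the other columns
coal_view = facet_counts(records, "topic", "coal")
```

With this toy catalog, `coal_view["region"]` reports two items for "East" and one for "West", and `coal_view["date"]` reports two for "1990s" and one for "2000s".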
"As GovStat becomes more and more realized, it will start to bridge the gap between how information is normally stored and retrieved in databases and how citizens actually think about information," Marchionini says.
"People don't look for the 'consumer price index', they ask, 'How much does it cost?'" Marchionini concludes. His lab has already done fruitful work with the Bureau of Labor Statistics for six years, he says, and GovStat is part of the continued commitment to allow citizens to - in the words of the project's motto - "Find what you need, understand what you find."
Bosley adds that his former boss, Cathy Dippo, was extremely sensitive to metadata. "She would say, 'Don't just throw numbers at people, let them know what they mean,'" he recalls. "She gave individual consulting contracts to help us work on metadata issues even before NSF had the program. Now we can continue to work together on the project, because we're empowered as a participating agency."