
Archive for February, 2010

How metadata could pay for newspapers

February 13th, 2010

What if newspapers published not just stories but databases? Dan Conover’s vision for the future of newspapers is inspired in part by his first reporting job, for NATO:

When we spotted something interesting, we recorded it in a highly structured way that could be accurately and quickly communicated over a two-way radio, to be transcribed by specialists at our border camp and relayed to intelligence analysts in Brussels.

The story, says Conover, is only one aspect of reporting. The other part? Gathering structured metadata, which could be stored in a database—or expressed as linked data. ((Some news organizations, like the New York Times (see Linked Open Data) and the BBC (overview, tech blog) are already embracing linked data.))

Newspapers already have classification systems and professional taxonomists. The New York Times’ classification system, in use since 1851, now aggregates stories from the archives in Times Topics, a website and API. ((I delved into Times Topics’ taxonomy and vocabulary in an earlier post.))

What if, in addition to these classifications, each story had even more structured metadata?

Capturing metadata ranges from automatic to manual. Some automatic capture is already standard (timestamps) or could be (saving GPS coordinates from a photo), and some information that needs manual capture (like the number of alarms at a fire) is already reported.

Dan compares the “old way” with his “new way”:

The old way:

Dan the reporter covers a house fire in 2005. He gives the street address, the date and time, who was victimized, who put it out, how extensive the fire was and what investigators think might have caused it. He files the story, sits with an editor as it’s reviewed, then goes home. Later, he takes a phone call from another editor. This editor wants to know the value of the property damaged in the fire, but nobody has done that estimate yet, so the editor adds a statement to that effect. The story is published and stored in an electronic archive, where it is searchable by keyword.

The new way:

Dan the reporter covers a house fire in 2010. In addition to a street address, he records a six-digit grid coordinate that isn’t intended for publication. His word-processing program captures the date and time he writes in his story and converts it to a Zulu time signature, which is also appended to the file.

As he records the names of the victimized and the departments involved in putting out the fire, he highlights each first reference for computer comparison. If the proper name he highlights has never been mentioned by the organization, Dan’s newswriting word processor prompts him to compare the subject to a list of near-matches and either associate the name with an existing digital file or approve the creation of a new one.

When Dan codes the story subject as “fire,” his word processor gives him a new series of fields to complete. How many alarms? Official cause? Forest fire (y/n)? Official damage estimate? Addresses of other properties damaged by the fire? And so on. Every answer he can’t provide is coded “Pending.”

Later, Dan sits with an editor as his story is reviewed, but a second editor decides not to call him at home because he sees the answer to the damage-estimate question in the file’s metadata. The story is published and archived electronically, along with extensive metadata that now exists in a relational database. New information (the name of victims, for instance) automatically generates new files, which are retained by the news organization’s database but not published.

And those information fields Dan coded as “Pending?” Dan and his editors will be prompted to provide that structured information later — and the prompting will continue until the data set is completed.

– Dan Conover in The “Lack of Vision” thing? Well, here’s a hopeful vision for you
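To make the “new way” a little more concrete, here is a minimal Python sketch of what such a record might look like. It is my own illustration, not anything from Dan’s post: the class name `FireStory`, the field names, and the helpers `to_zulu` and `near_matches` are all hypothetical, and the 0.8 similarity cutoff is an arbitrary assumption. It shows three of the ideas above: fields that default to “Pending” until someone fills them in, a timestamp converted to Zulu (UTC) time, and a near-match check on a proper name using the standard library’s `difflib`.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from difflib import get_close_matches
from typing import List, Union

PENDING = "Pending"  # placeholder for answers the reporter can't provide yet


@dataclass
class FireStory:
    """Hypothetical structured metadata for a story coded as 'fire'."""
    street_address: str
    filed_at: datetime  # timezone-aware time the story was written
    alarms: Union[int, str] = PENDING
    official_cause: str = PENDING
    forest_fire: Union[bool, str] = PENDING
    damage_estimate_usd: Union[int, str] = PENDING

    def pending_fields(self) -> List[str]:
        """Names of fields still coded 'Pending', in declaration order."""
        return [name for name, value in vars(self).items() if value == PENDING]


def to_zulu(moment: datetime) -> str:
    """Render a timezone-aware datetime as a Zulu (UTC) time string."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def near_matches(name: str, known_names: List[str]) -> List[str]:
    """Suggest existing names that closely resemble a newly highlighted one."""
    return get_close_matches(name, known_names, n=3, cutoff=0.8)


# Example: a two-alarm fire filed at 9:30 a.m., five hours behind UTC.
story = FireStory(
    street_address="12 Elm St",
    filed_at=datetime(2010, 2, 13, 9, 30, tzinfo=timezone(timedelta(hours=-5))),
    alarms=2,
)
print(story.pending_fields())   # the fields editors would be prompted about
print(to_zulu(story.filed_at))  # the same moment expressed in UTC
print(near_matches("Jon Smith", ["John Smith", "Jane Doe"]))
```

A real newsroom system would presumably keep these records in a relational database, as Dan describes, rather than in memory; the point here is only that “Pending” can be an explicit, queryable value rather than a gap.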

And that data set? It might even be saleable, even though each individual story had perhaps been given away for free. Dan highlights some possibilities, and entire industries have grown around repackaging free and non-free data (e.g. U.S. Census data, phone book data). I think of mashups such as Everyblock and hyperlocal news sites like outside.in.

Tags: business models, future of journalism, future of reporting, journalism, linked data, metadata, newspapers, structured data
Posted in future of publishing, information ecosystem, semantic web | Comments (1)

Opening bibliographic data

February 7th, 2010

I love the CERN library’s message of “Raw bibliographic book data available now!”, framed by two dates:
1989: TimBL invented WWW at CERN
2009: TimBL calls for “Open Data Now” at TED

CERN is the latest library to share its book data, as CERN emerging technologies librarian Patrick Danowski announced on Twitter. The Open Book Data Project is further described on their website and in a YouTube video (below) purpose-made for the occasion. The data is dual-licensed as CC0 and PDDL.

This isn’t the first time that library data has been shared with a splash.

After speaking at Code4Lib 2008 (my first Code4Lib conference), Brewster Kahle was presented with MARC records from the Oregon Summit consortium.

In 2007, a number of Library of Congress records were deposited in connection with ~~Scriblio~~ Open Source Endeca, a faceted catalog Casey Bisson ~~Durfee~~ described at Code4Lib2007. ~~Scriblio~~ It has gone through several incarnations; the open source Kochief project is the latest.

Further, as Jonathan Gorman and I were discussing in #code4lib earlier this week, there are several collections of MARC records hosted at the Internet Archive, including more that were donated to Open Library. A few are misclassified, so also consider keyword searches (‘MARC’ and ‘MARC libraries’) if you’re trying to find all the MARC records that archive.org has.

Linked data in libraries is coming along more slowly; fruit, perhaps, for another post.

Where do you look for bibliographic records? Feel free to leave tips in the comments!

Updated 2010-04-14, with thanks to Dan Scott for corrections!

Tags: bibliographic data, CERN, Internet Archive, Kochief, MARC, Open Book Data Project, open data, Open Library, Scriblio
Posted in library and information science | Comments (2)

jodischneider.com/blog

reading, technology, stray thoughts
