How metadata could pay for newspapers

February 13th, 2010
by jodi

What if newspapers published not just stories but databases? Dan Conover’s vision for the future of newspapers is inspired in part by his first reporting job, for NATO:

When we spotted something interesting, we recorded it in a highly structured way that could be accurately and quickly communicated over a two-way radio, to be transcribed by specialists at our border camp and relayed to intelligence analysts in Brussels.

The story, says Conover, is only one aspect of reporting. The other part? Gathering structured metadata, which could be stored in a database—or expressed as linked data. ((Some news organizations, like the New York Times (see Linked Open Data) and the BBC (overview, tech blog), are already embracing linked data.))

Newspapers already have classification systems and professional taxonomists. The New York Times’ classification system, in use since 1851, now aggregates stories from the archives in Times Topics, a website and API. ((I delved into Times Topics’ taxonomy and vocabulary in an earlier post.))

What if, in addition to these classifications, each story had even more structured metadata?

Capturing metadata ranges from automatic to manual. Some automatic capture is already standard (timestamps) or could be (saving GPS coordinates from a photo), and some information needing manual capture (like the number of alarms of a fire) is already reported.

Dan compares the “old way” with his “new way”:

The old way:

Dan the reporter covers a house fire in 2005. He gives the street address, the date and time, who was victimized, who put it out, how extensive the fire was and what investigators think might have caused it. He files the story, sits with an editor as it’s reviewed, then goes home. Later, he takes a phone call from another editor. This editor wants to know the value of the property damaged in the fire, but nobody has done that estimate yet, so the editor adds a statement to that effect. The story is published and stored in an electronic archive, where it is searchable by keyword.

The new way:

Dan the reporter covers a house fire in 2010. In addition to a street address, he records a six-digit grid coordinate that isn’t intended for publication. His word-processing program captures the date and time he writes in his story and converts it to a Zulu time signature, which is also appended to the file.

As he records the names of the victimized and the departments involved in putting out the fire, he highlights each first reference for computer comparison. If the proper name he highlights has never been mentioned by the organization, Dan’s newswriting word processor prompts him to compare the subject to a list of near-matches and either associate the name with an existing digital file or approve the creation of a new one.

When Dan codes the story subject as “fire,” his word processor gives him a new series of fields to complete. How many alarms? Official cause? Forest fire (y/n)? Official damage estimate? Addresses of other properties damaged by the fire? And so on. Every answer he can’t provide is coded “Pending.”

Later, Dan sits with an editor as his story is reviewed, but a second editor decides not to call him at home because he sees the answer to the damage-estimate question in the file’s metadata. The story is published and archived electronically, along with extensive metadata that now exists in a relational database. New information (the name of victims, for instance) automatically generates new files, which are retained by the news organization’s database but not published.

And those information fields Dan coded as “Pending?” Dan and his editors will be prompted to provide that structured information later — and the prompting will continue until the data set is completed.

– Dan Conover in The “Lack of Vision” thing? Well, here’s a hopeful vision for you
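
One way to picture the fire-story fields Conover describes is as a small record type. The sketch below is only an illustration: the class name, the field names, and the `PENDING` marker are my own assumptions, not anything taken from Conover's post or from a real newsroom system. It shows two ideas from the quote: converting the filing time to a Zulu (UTC) timestamp, and listing the fields that still need an answer so that someone can be prompted for them later.

```python
from dataclasses import dataclass, fields
from datetime import datetime, timedelta, timezone
from typing import Any

PENDING = "Pending"  # hypothetical marker for answers the reporter cannot give yet


def to_zulu(moment: datetime) -> str:
    """Convert a timezone-aware datetime to a Zulu (UTC) timestamp string."""
    if moment.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


@dataclass
class FireStoryMetadata:
    """Hypothetical structured fields for a story coded as "fire"."""
    grid_coordinate: str          # recorded, but not intended for publication
    filed_at_zulu: str
    alarms: Any = PENDING
    official_cause: Any = PENDING
    forest_fire: Any = PENDING
    damage_estimate: Any = PENDING
    other_addresses_damaged: Any = PENDING

    def pending_fields(self) -> list:
        """Names of the fields that still need an answer."""
        return [f.name for f in fields(self) if getattr(self, f.name) == PENDING]


# Example: the reporter knows the number of alarms and that it was not a forest fire.
filed = datetime(2010, 2, 13, 9, 30, tzinfo=timezone(timedelta(hours=-5)))
story = FireStoryMetadata(
    grid_coordinate="123456",
    filed_at_zulu=to_zulu(filed),
    alarms=2,
    forest_fire=False,
)
print(story.filed_at_zulu)     # 2010-02-13T14:30:00Z
print(story.pending_fields())  # ['official_cause', 'damage_estimate', 'other_addresses_damaged']
```

A real system would presumably keep these records in a database and rerun something like `pending_fields()` to drive the reminders; here the list is simply printed.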

And that data set? It might even be saleable, even though each individual story had perhaps been given away for free. Dan highlights some possibilities, and entire industries have grown around repackaging free and non-free data (e.g. U.S. Census data, phone book data). I think of mashups such as Everyblock and hyperlocal news sites like outside.in.

Posted in future of publishing, information ecosystem, semantic web | Comments (1)

Opening bibliographic data

February 7th, 2010
by jodi

I love the CERN library’s message of “Raw bibliographic book data available now!”, framed as:
1989: TimBL invented WWW at CERN
2009: TimBL calls for “Open Data Now” at TED

CERN is the latest library to share its book data, as CERN emerging technologies librarian Patrick Danowski announced on Twitter. The Open Book Data Project is further described on their website and in a YouTube video purpose-made for the occasion. The data is dual-licensed as CC0 and PDDL.

This isn’t the first time that library data has been shared with a splash.

After speaking at Code4Lib 2008 (my first Code4Lib conference), Brewster Kahle was presented with MARC records from the Oregon Summit consortium.

In 2007, a number of Library of Congress records were deposited in connection with Open Source Endeca, a faceted catalog Casey Durfee described at Code4Lib 2007. It has gone through several incarnations; the open source Kochief project is the latest.

Further, as Jonathan Gorman and I were discussing in #code4lib earlier this week, there are several collections of MARC records and more donated to Open Library hosted at the Internet Archive. A few are misclassified so also consider keyword searches (‘MARC’ and ‘MARC libraries’) if you’re trying to find all the MARC records that archive.org has.

Linked data in libraries is coming along more slowly; fruit, perhaps, for another post.

Where do you look for bibliographic records? Feel free to leave tips in the comments!

Updated 2010-04-14, with thanks to Dan Scott for corrections!

Posted in library and information science | Comments (2)

Salmon Protocol: Comments Swimming Upstream

February 3rd, 2010
by jodi

Salmon, an aggregation protocol, is championed by Google’s John Panzer, and described as “an open, simple, standards-based solution” for “unifying the conversations”.

‘Conversations’ is deliberately plural, I think, to evoke the many conversations, invisible to one another: “The comments, ratings, and annotations increasingly happen at the aggregator and are invisible to the original source.”

Using Salmon, an aggregator pushes comments back to a “Salmon endpoint” (via POST). These can be published (or moderated) upstream at the original source. See also the summary of the Salmon protocol.
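
To make the "push upstream via POST" step concrete, here is a rough sketch of what an aggregator might build. It is not the Salmon specification: the function name and the endpoint URL are hypothetical, the comment is a bare-bones Atom entry that omits elements a valid entry would need (such as an id and an updated time), and it leaves out any signing or verification a real deployment would require. The request is constructed but never sent.

```python
import urllib.request
from xml.sax.saxutils import escape, quoteattr


def build_salmon_request(endpoint_url, author, comment, in_reply_to):
    """Build (but do not send) a POST carrying a comment as a minimal Atom entry.

    Assumption: the upstream source accepts an unsigned Atom entry; a real
    Salmon endpoint would expect more than this sketch provides.
    """
    entry = (
        '<?xml version="1.0" encoding="utf-8"?>\n'
        '<entry xmlns="http://www.w3.org/2005/Atom" '
        'xmlns:thr="http://purl.org/syndication/thread/1.0">'
        f"<author><name>{escape(author)}</name></author>"
        f"<content>{escape(comment)}</content>"
        # Atom threading extension: which upstream item this comment replies to
        f"<thr:in-reply-to ref={quoteattr(in_reply_to)}/>"
        "</entry>"
    )
    return urllib.request.Request(
        endpoint_url,
        data=entry.encode("utf-8"),
        headers={"Content-Type": "application/atom+xml"},
        method="POST",
    )


# Hypothetical endpoint and post URLs, for illustration only.
request = build_salmon_request(
    "https://blog.example/salmon-endpoint",
    "A reader",
    "Great post!",
    "https://blog.example/posts/42",
)
print(request.get_method(), request.full_url)
```

The source blog would then decide whether to publish the comment straight away or hold it for moderation, as described above.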

Comments swimming upstream…

Posted in information ecosystem, social web | Comments (0)

Problems and Opportunities for the Social Web 2010

February 3rd, 2010
by jodi

In a post at ZDNet, Dion Hinchcliffe delineates 7 problems of today’s social web:

  1. Fragmentation of conversation.
  2. Disconnects between older and newer generations of social media.
  3. Lack of control of identity, contacts, and data.
  4. A better social Web on mobile devices.
  5. Poor integration between social media and location services.
  6. Difficulty of coherently engaging in social activity across many channels.
  7. Coping with and getting value from the expanding information volume of social media.

from “The social Web in 2010: The emerging standards and technologies to watch” encountered via Ed H. Chi’s post at the PARC Augmented Social Cognition blog.

The trends? Openness, portability, aggregation of distributed content. Hopefully we’ll see more on all these fronts in 2010 and beyond. Hinchcliffe also suggests that we want “Better social and location capabilities added to the core of mobile devices.”

See the full post at ZDNet for more discussion and references to a number of standards, formats, and related developments. In the next post, I’ll highlight Salmon, a protocol for distributed commenting, which I’d neither encountered nor heard of.

Posted in information ecosystem, social web | Comments (1)

Juxtaposition

January 28th, 2010
by jodi

Sometimes it’s the juxtaposition that amuses me:

Tweetie

Jill Gengler: I love being able to save someone’s bacon.

Tom Coates: The great slab of fatty pork that I presume to call a brain is almost totally recumbent this morning. Come on piggy! Do some thinking!

We’re making progress at archiving individual streams, I think. But the overall conversation, “what was I seeing then”, and the links between things? Needs work, at least chez moi!

Updated 2010-04-14 to fix typos. :)

Posted in information ecosystem, random thoughts | Comments (2)

A taxonomy of tweets

January 11th, 2010
by jodi

Here’s a taxonomy of tweets from an experiment at SemanticHacker Blog:

  • User’s current status
  • Private conversations
  • Links to web content
    • links to blog and news articles
    • links to images and videos
    • other links
  • Politics, sports, current events
  • Product recommendations/complaints
  • Advertising “posted from a company’s twitter account”
  • Spam
  • Other messages “that don’t quite fit under any of the above categories. Fan messages to celebrities, shoutouts to other users, web-based polls and quizzes, and so on.”

via Hak-Lae Kim on Twitter

Posted in social web | Comments (0)

Starving the subconscious

November 30th, 2009
by jodi

Your brain builds something from whatever mental flotsam and jetsam is in your head. Perhaps it’s a useful thing, an answer to a question you didn’t know you needed. Perhaps it’s just an interesting combination of thoughts put into a story. It’s dreaming, but you’re awake.

-[Rands]

…when you have a real important problem you don’t let anything else get the center of your attention – you keep your thoughts on the problem. Keep your subconscious starved so it has to work on your problem, so you can sleep peacefully and get the answer in the morning, free.

-[Richard Hamming]

Metaresearch?

Posted in PhD diary, random thoughts | Comments (0)

Galway: recommending a photoessay

November 29th, 2009
by jodi

I’m moving from “getting settled” to “getting down to work”. Photosharing isn’t a priority (in part because I’m shy of that kind of social networking: photos always reveal more than you think they do).

Shawn Micallef from Spacing Toronto has a lovely photoessay about Galway. I think he captures the city well. It gives a flavor of the place, from a North American perspective:

Geese at Galway Bay by Shawn Micallef

Go read/see the whole thing.

Posted in random thoughts | Comments (3)

Ribbit: Google Voice with social web, your own number, (and eventually a fee)

November 26th, 2009
by jodi

Based in Silicon Valley, Ribbit is an internet telephony startup and a subsidiary of British Telecom. Ribbit has an “open platform for voice innovation”, with API access for developers (see also getting started) and several end-user products.

Ribbit Mobile is similar to Google Voice: it’s a next-generation phone system currently aimed at the US and UK markets. You use your own (presumably mobile) number. What really got my attention, though, was a new social feature they call “Caller ID 2.0”:

Ribbit wants to leverage your social networks for Caller ID

When a call comes in, Ribbit Mobile will reach into the social web and bring you the recent LinkedIn updates, Facebook updates, Tweets, and Flickr photos of the person calling you. Ribbit Mobile lets you know not just who is calling but what the caller has been up to on the web.

I was already excited about Ribbit when I first encountered them.

I guess it’s time to try out their Google Wave gadgets! And if Google Voice or VoIP with your mobile number has any appeal, I’d advise you to request an invite for their beta. Note that they’ve already got plans to charge for their services.

Posted in social web | Comments (0)

What types of data do social networks have? See Schneier’s Taxonomy.

November 20th, 2009
by jodi

Rights to data may depend, says Bruce Schneier, on what type of data it is and who provided it. He provides a useful enumeration:

1. Service data. Service data is the data you need to give to a social networking site in order to use it. It might include your legal name, your age, and your credit card number.

2. Disclosed data. This is what you post on your own pages: blog entries, photographs, messages, comments, and so on.

3. Entrusted data. This is what you post on other people’s pages. It’s basically the same stuff as disclosed data, but the difference is that you don’t have control over the data — someone else does.

4. Incidental data. Incidental data is data the other people post about you. Again, it’s basically the same stuff as disclosed data, but the difference is that 1) you don’t have control over it, and 2) you didn’t create it in the first place.

5. Behavioral data. This is data that the site collects about your habits by recording what you do and who you do it with.

See Schneier’s post for discussion. Via a pointer on Rob Styles’ blog, in turn via Rob’s tweet.

Have you come across other taxonomies for social networking data?

Here’s one simple but far less expressive way to characterize data on social networks: is it “about you” or “from you”? It could be the first, the second, neither, or both. “Aboutness”, however, is ontologically challenging. Any use for this?
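
Spelled out, the “about you”/“from you” question gives four cells. The sketch below is just my own illustration of that two-by-two grid; the enum and function names are made up, and the two sample items are hypothetical.

```python
from enum import Enum


class Provenance(Enum):
    """The four combinations of "about you" and "from you"."""
    ABOUT_AND_FROM_YOU = (True, True)
    ABOUT_YOU_ONLY = (True, False)
    FROM_YOU_ONLY = (False, True)
    NEITHER = (False, False)


def classify(about_you: bool, from_you: bool) -> Provenance:
    """Look up the cell for a piece of data."""
    return Provenance((about_you, from_you))


# A photo of you that a friend posted: about you, but not from you.
print(classify(about_you=True, from_you=False).name)  # ABOUT_YOU_ONLY

# A comment you left on a friend's page about a film: from you, not about you.
print(classify(about_you=False, from_you=True).name)  # FROM_YOU_ONLY
```

The first example lines up with what Schneier calls incidental data; deciding the `about_you` flag for less clear-cut items is exactly where the “aboutness” problem bites.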

Collaboration/shared control isn’t considered in this taxonomy. For instance, “entrusted data” doesn’t capture the notion of “shared data” in a collaborative system such as Wave, a wiki, or perhaps even email.

For behavioral data in libraries, see also “intentional data”, as used by Lorcan Dempsey, back to 2005 (and many times since) [for instance, in discussion with “emergent knowledge”]. I prefer “behavioral data” since much data about intention is by no means deliberate/intentional!

Posted in social web | Comments (3)