Posts Tagged ‘metadata’

Let’s link the world’s metadata!

December 9th, 2010

Together we can continue building a global metadata infrastructure. I am tasking you with helping. How can you do that?

For evangelists, practitioners, and consultants:

  • Thanks for bringing Linked Data to where it is today! We’re counting on you for even more yummy cross-disciplinary Linked Data!
  • What tools and applications are most urgently needed? Researchers and developers need to hear your use cases: please partner with them to share these needs!
  • How do you and your clients choose [terms, concepts, schemas, ontologies]? What helps the most?
  • Overall, what is working (and what is not)? How can we amplify what *is* working?

For Semantic Web researchers:

  • Build out the trust and provenance infrastructure.
  • Mature the query languages (e.g. SPARQL) [perhaps someone could say more about what this would mean?]
  • Building tools and applications for end-users is really important: value this work, and get to know some real use cases and end-users!

For information scientists:

  • How can we identify ‘universals’ across languages, disciplines, and cultures? Does the Colon classification help?
  • What are the best practices for sharing and reusing [terms, concepts, schemas, ontologies]? What is working and what is failing with metadata registries? What are the alternatives?

For managers, project leaders, and business people:

  • How do we create and justify the business case for Terminology services [like MIME types, library subject headings, New York Times Topics]?
  • Please collect and share your usage data! Do we need infrastructure for sharing usage data?
  • Share the economic and business successes of Linked Data!

That ends the call to action, but here’s where it comes from.

Yesterday Stuart Weibel gave a talk called “Missing Pieces in the Global Metadata Landscape” [slideshare] at the InfoCom International Symposium in Tokyo. Stu asked 11 of us what those missing pieces were, posing 3 questions: the conceptual issues, the organizational impediments, and the most important overall issue. This last question, “What is the most important missing infrastructural link in establishing globally interoperable metadata systems?”, is my favorite, so I’ll talk about it a little further.

Stu summarizes that the infrastructure is mostly there, but that broad adoption (of standards, conventions, and common practice) is key. Overall these are the key issues he reports:

  • Tools to support and encourage the reuse of terms, concepts, schemas, ontologies (e.g., metadata registries, and more)
  • Widespread, cross-disciplinary adoption of a common metadata approach (Linked Data)
  • Query languages for the open web (SPARQL) are not fully mature
  • Trust and provenance infrastructure
  • Nothing’s missing… just use RDF, Linked Data, and the open web.  The key is broad adoption, and that requires better tools and applications. It’s a social problem, not a technical problem.
  • The ability to identify ‘universals’ across languages, disciplines, and cultures – revive Ranganathan’s facets?
  • Terminology services [like MIME types, library subject headings, New York Times Topics] have long been proposed as important services, but they are expensive to create, curate, and manage, and the economic models are weak
  • Stuff that does not work is often obvious. We need usage data to see what does work, and amplify it

You may notice, now, that the “call” looks a little familiar!

Posted in information ecosystem, library and information science, semantic web | Comments (0)

How metadata could pay for newspapers

February 13th, 2010

What if newspapers published not just stories but databases? Dan Conover’s vision for the future of newspapers is inspired in part by his first reporting job, for NATO:

When we spotted something interesting, we recorded it in a highly structured way that could be accurately and quickly communicated over a two-way radio, to be transcribed by specialists at our border camp and relayed to intelligence analysts in Brussels.

The story, says Conover, is only one aspect of reporting. The other part? Gathering structured metadata, which could be stored in a database—or expressed as linked data.1

Newspapers already have classification systems and professional taxonomists. The New York Times’ classifications system, in use since 1851, now aggregates stories from the archives in Times Topics, a website and API.2

What if, in addition to these classifications, each story had even more structured metadata?

Capturing metadata ranges from automatic to manual. Some automatic capture is already standard (timestamps) or could be (saving GPS coordinates from a photo), and some information that needs manual capture (like the number of alarms of a fire) is already reported.

Dan compares the “old way” with his “new way”:

The old way:

Dan the reporter covers a house fire in 2005. He gives the street address, the date and time, who was victimized, who put it out, how extensive the fire was and what investigators think might have caused it. He files the story, sits with an editor as it’s reviewed, then goes home. Later, he takes a phone call from another editor. This editor wants to know the value of the property damaged in the fire, but nobody has done that estimate yet, so the editor adds a statement to that effect. The story is published and stored in an electronic archive, where it is searchable by keyword.

The new way:

Dan the reporter covers a house fire in 2010. In addition to a street address, he records a six-digit grid coordinate that isn’t intended for publication. His word-processing program captures the date and time he writes in his story and converts it to a Zulu time signature, which is also appended to the file.

As he records the names of the victimized and the departments involved in putting out the fire, he highlights each first reference for computer comparison. If the proper name he highlights has never been mentioned by the organization, Dan’s newswriting word processor prompts him to compare the subject to a list of near-matches and either associate the name with an existing digital file or approve the creation of a new one.

When Dan codes the story subject as “fire,” his word processor gives him a new series of fields to complete. How many alarms? Official cause? Forest fire (y/n)? Official damage estimate? Addresses of other properties damaged by the fire? And so on. Every answer he can’t provide is coded “Pending.”

Later, Dan sits with an editor as his story is reviewed, but a second editor decides not to call him at home because he sees the answer to the damage-estimate question in the file’s metadata. The story is published and archived electronically, along with extensive metadata that now exists in a relational database. New information (the name of victims, for instance) automatically generates new files, which are retained by the news organization’s database but not published.

And those information fields Dan coded as “Pending?” Dan and his editors will be prompted to provide that structured information later — and the prompting will continue until the data set is completed.

- Dan Conover in The “Lack of Vision” thing? Well, here’s a hopeful vision for you
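The “fire” fields and the “Pending” coding in Dan’s scenario can be pictured as a small record type. This is only a sketch: the class name, field names, and types below are made up for illustration and do not come from Dan’s post or any real newsroom system.

```python
from dataclasses import dataclass
from typing import List, Optional

PENDING = "Pending"  # the code used for every answer the reporter can't provide yet


@dataclass
class FireStoryMetadata:
    """Hypothetical structured fields for a story coded as "fire"."""
    alarms: Optional[int] = None
    official_cause: Optional[str] = None
    forest_fire: Optional[bool] = None
    official_damage_estimate: Optional[str] = None
    other_addresses_damaged: Optional[List[str]] = None

    def coded(self):
        """Return every field, with unanswered ones coded as "Pending"."""
        return {name: (PENDING if value is None else value)
                for name, value in vars(self).items()}

    def pending_fields(self):
        """List the fields the reporter and editors still need to fill in."""
        return [name for name, value in vars(self).items() if value is None]


# The reporter knows the number of alarms and that it was not a forest fire.
story = FireStoryMetadata(alarms=2, forest_fire=False)
print(story.pending_fields())
# ['official_cause', 'official_damage_estimate', 'other_addresses_damaged']
```

The list returned by `pending_fields()` is what a system like this would use to keep prompting until the data set is complete.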

And that data set? It might even be saleable, even though each individual story had perhaps been given away for free. Dan highlights some possibilities, and entire industries have grown around repackaging free and non-free data (e.g. U.S. Census data, phone book data). I think of mashups such as Everyblock and hyperlocal news sites like

  1. Some news organizations, like the New York Times (see Linked Open Data) and the BBC (overview, tech blog), are already embracing linked data.
  2. I delved into Times Topics’ taxonomy and vocabulary in an earlier post.

Posted in future of publishing, information ecosystem, semantic web | Comments (1)

Organizing a PDF library: Mendeley for information extraction, Zotero for open source goodness

August 27th, 2009

I’ve been using Zotero for a while now. I make no secret of the fact that I’m a big fan. In early July I was testing out Mendeley to give a workshop with a colleague who’s been excited about it.

I wanted to see whether Mendeley could reduce any of my pain points. While I’m not moving to Mendeley*, I do plan to take advantage of its whizz-bang PDF organization. When Mendeley offers Zotero integration, I think I’ll be set. *Zotero is open source; Mendeley is merely free at the moment. Zotero also offers web archiving features while Mendeley is strictly for PDF organization.

I spend a lot of time reading and pulling materials into my library; I spend far less time organizing materials. So I decided I’d try the PDF metadata functions of each. Zotero can pull in materials in lots of different ways, but it doesn’t yet have a “pull this PDF in from this URL” button for reports and things that aren’t in databases. I don’t want to spend my time typing up metadata (I’m lazy and busy, what can I say), but I do want to have an organized library. (Hey, got an organizing business? I’d pay for your services.) So the “get metadata for this PDF” features are of prime interest to me.

I usually have a “to read” pile lying around. I did a very non-scientific test, starting with a folder of 44 PDFs (“PDFs to read”). I dragged them into each program.

Zotero had a small point of failure: I expected “get PDF metadata” to be in the Preferences menu, but I had to look up its location on their website. Happily, the Retrieve PDF Metadata page is easy to find from the Support page. It explains that metadata comes from Google Scholar, based on the DOI if it’s embedded. That sounds like a reasonable methodology, but one that’s only going to work for recent journal articles and books published by e-savvy publishers. Most of the files I dump into “PDFs to read” are preprints from personal websites or reports from nonprofits’ websites. DOIs aren’t expected in that context.
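To make the “DOI if it’s embedded” step concrete, here is a minimal sketch of spotting a DOI in text already extracted from a PDF. It is not Zotero’s actual code: the function name is invented, the regular expression is just a commonly used pattern for modern DOIs, and the sample DOI is made up.

```python
import re

# Assumed pattern: "10.", a 4-9 digit registrant prefix, "/", then a suffix.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def find_doi(text):
    """Return the first DOI-like string in text, or None if there is none."""
    match = DOI_PATTERN.search(text)
    if match is None:
        return None
    # Drop sentence punctuation that the suffix pattern may have swallowed.
    return match.group(0).rstrip(".,;")


print(find_doi("See doi:10.1000/xyz123."))                       # 10.1000/xyz123
print(find_doi("Business Communications Review, October 2007"))  # None
```

A preprint or nonprofit report with only a footer like the second example gives a lookup of this kind nothing to work with, which matches what I saw.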

Of my 44 test cases, Zotero says “No matching references found.” on 26 of them. Results from the 18 “successful” matches are spottier. The first one I checked leads me to believe that things haven’t changed since the last time I tried out this feature, maybe 8 or 10 months ago. It’s an article called A New Approach to Search [PDF], by Joe Weinman, and it’s available from his website. I can identify the source as Business Communications Review, October 2007 from small type in the footer. So can Mendeley. But Zotero calls it Peters, R. S. 1970. Ethics and education. Allen & Unwin Australia. I’m not really sure why. Google search, perhaps?

Zotero’s ‘identification’ of the next article is even stranger:

Capital, R. Sheriff’s Office moves to new facility. Cell 224: 6547.

(Notice: the title and journal don’t even belong together!) This article is actually the contest-winning federated search article published by Computers in Libraries [PDF]. It’s available from the publisher’s website. While Information Today publishes some great articles about technology, their HTML doesn’t have any semantic information. Since no one’s yet written a screen scraper for their site, Zotero can’t auto-grab the metadata. But Mendeley successfully identifies this PDF, too.

I wondered whether Mendeley was grabbing metadata from the files so I took a closer look at these two files. Nope, there was very little usable metadata. (Adobe Bridge is great for reading XMP metadata.) Furthermore, the first article (by Weinman) lists its creator as Sharon Wallach; clearly neither program is pulling that.

Onward and upward: overall there are 4 bad identifications and 22 good identifications of the 44, from Zotero. The false positive score of 9% is the part that bothers me the most.
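For anyone checking the arithmetic, the 9% is the 4 bad identifications taken as a share of all 44 PDFs. The variable names are just for illustration.

```python
total_pdfs = 44           # size of the "PDFs to read" test folder
bad_identifications = 4   # PDFs Zotero labeled with the wrong reference

false_positive_share = bad_identifications / total_pdfs
print(f"{false_positive_share:.0%}")  # 9%
```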

Mendeley does better but it’s not perfect. At first it appears to have identified all 44 PDFs, but there’s a fair bit of missing information (for instance 13 missing the “Published in” field). When I looked closely, I found 26 had bad data, 4 could be improved, and 2 weren’t identified. Which means I’m satisfied with only 12 of these, but there’s another important factor: Mendeley marks these files as ‘unreviewed’, meaning that the metadata is suspect until I review and/or correct it. So the false positives are easy to detect. This is reassuring. Especially since (unlike Zotero) only one of Mendeley’s identifications was worse than none at all, and it was dead easy to spot:
Fohjoft, W. J., Jg, J. T., Vtfe, T. F., Jo, F., Epo, O., Bcpvu, N. E., et al. (n.d.). !12 3/4 “#$%&$’,5.
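A side observation on that garbled entry: the “family names” turn into ordinary English words if every letter is shifted back by one. That hints, as a guess only, that the PDF’s text layer uses an offset character encoding and that a line of running text was read as an author list. The helper below is an illustrative sketch and has nothing to do with Mendeley’s internals.

```python
def shift_back(text, shift=1):
    """Shift each ASCII letter back by `shift` places, wrapping around the alphabet."""
    out = []
    for ch in text:
        if ch.isascii() and ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            out.append(chr((ord(ch) - base - shift) % 26 + base))
        else:
            out.append(ch)  # leave punctuation, digits, and spaces alone
    return "".join(out)


# The family names from the garbled citation above.
garbled_names = ["Fohjoft", "Jg", "Vtfe", "Jo", "Epo", "Bcpvu"]
print([shift_back(name) for name in garbled_names])
# ['Engines', 'If', 'Used', 'In', 'Don', 'About']
```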

It’s interesting to look at where Mendeley fails: non-scientific articles and documents with non-standard title pages. Mendeley chokes on Open Provenance Model and Funny in Farsi (no metadata at all) and labels a Master’s report only with the year (2000).

I’m most interested in Funny in Farsi; I would expect better metadata from Random House, but sure enough Bridge doesn’t find any. I like Mendeley’s auto-rename feature, but on the files it doesn’t label, that renaming is a big disadvantage: filenames are often reasonable metadata. These three filenames (opm-v1.01.pdf, Funny_in_Farsi.pdf, and 2576.pdf) give either information about the contents or a chance at refinding it with a search engine. For opm-v1.01.pdf, googling the filename finds it immediately. For Funny_in_Farsi.pdf, searching for Funny in Farsi provides 8 search results, and a savvy searcher could get more metadata (e.g. the publisher’s name) from the results. Searching for 2576.pdf clarke open source finds the third.
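Keeping the original filename around costs almost nothing, and turning it into a search query is a one-liner. A rough sketch, with an invented helper name; the only rules it applies are dropping the extension and treating underscores as spaces:

```python
from pathlib import PurePath


def filename_to_query(filename):
    """Turn a PDF filename into a plain search string."""
    stem = PurePath(filename).stem  # drops the final ".pdf"
    return stem.replace("_", " ").strip()


for name in ["opm-v1.01.pdf", "Funny_in_Farsi.pdf", "2576.pdf"]:
    print(filename_to_query(name))
# opm-v1.01
# Funny in Farsi
# 2576
```

The third case shows the limit: a bare number needs extra words I happened to remember (clarke, open source) before a search engine can do anything with it.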

I’m also interested in what neither Zotero nor Mendeley got right. Neither correctly identified a PDF with Highlights of the National Museum of American History. Drag and drop of citations (with ugly special characters and all) gives

Parton, J. 2004. Revolutionary Heroes and Other Historical Papers. Kessinger Publishing.

Museum, N., & History, A. (2008). Star-Spangled Banner, 1814. Smithsonian.

Neither does well on the Palmer report, either:

Bird, A. 1994. Careers as repositories of knowledge: a new perspective on boundaryless careers. Journal of Organizational Behavior: 325-344.

Factors, I., Palmer, C. I., Teffeau, P. I., Newton, P. C., Assistant, R., Research, I., et al. (2008). No title. Library, (August).

With a closer look, you can see Mendeley takes the authors as:
Factors, Identifying
Palmer, C I C Institutional Repository Development Final Report Carole L
Teffeau, Principal Investigator Lauren C
Newton, Project Coordinator Mark P
Assistant, Research
Research, Informatics

If you want more details, please leave a comment or drop me a line; I had hoped to add info but decided just to push this out of my queue. I was thinking about it because Mendeley really does help me review the papers I’ve been meaning to read. Guess it’s time to think about that Mendeley to Zotero workflow again!

Posted in information ecosystem, reviews | Comments (7)