Posts Tagged ‘scholarly publishing’

Wanted: the ultimate mobile app for scholarly ereading

January 7th, 2011

Nicole Henning suggests that academic libraries and scholarly presses work together to create the ultimate mobile app for scholarly ereading. I think about this a bit differently, in terms of functional requirements.

The main functions are obtaining materials, reading them, organizing them, keeping them, and sharing them.

For obtaining materials, the key new requirement is to simplify authentication: handle campus authentication systems and personal subscriptions. Multiple credentialed identities should be supported. A secondary consideration is that RSS feeds (e.g. for journal tables of contents) should be supported.
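A journal table-of-contents feed is ordinary RSS, so the "obtaining" side can be sketched with the standard library alone. The feed below is a hand-rolled stand-in for a publisher's real TOC feed, and the element names follow plain RSS 2.0; nothing here is a real publisher API:

```python
# Sketch: pulling article titles and links from a journal
# table-of-contents RSS feed (RSS 2.0 structure assumed).
import xml.etree.ElementTree as ET

def toc_entries(rss_xml: str):
    """Return (title, link) pairs from an RSS table-of-contents feed."""
    root = ET.fromstring(rss_xml)
    return [
        (item.findtext("title", ""), item.findtext("link", ""))
        for item in root.iter("item")
    ]

# A tiny hand-rolled feed, standing in for a publisher's real TOC feed:
sample = """<rss version="2.0"><channel>
  <title>Journal of Examples, Vol. 1</title>
  <item><title>On Ereading</title><link>http://example.org/1</link></item>
  <item><title>On Metadata</title><link>http://example.org/2</link></item>
</channel></rss>"""

for title, link in toc_entries(sample):
    print(title, link)
```

In a real app the same function would be fed the XML fetched from each subscribed journal's feed URL, behind whatever authentication layer the campus requires.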

For reading materials, the key requirement is to support multiple formats in the same application. I don’t know of a single web or mobile app that supports all three of PDF, EPUB, and HTML. Reading interfaces matter: look to Stanza and Ibis Reader for best-in-class examples.

For organizing materials, the key is synergy between the user’s data and existing data. Allow tags, folders, and multiple collections. But also leverage existing publisher and library metadata. Keep it flexible, allowing the user to modify metadata for personal use (e.g. for consistency or personal terminology) and to optionally submit corrections.

For keeping materials, support import, export, and sync to and from the user’s chosen cloud-based storage and WebDAV servers. No other device (e.g. laptop or desktop) should be needed.

For sharing materials, support lightweight micropublishing on social networks and email; networks should be extensible and user-customizable. Sync to or integrate with citation managers and social cataloging/reading list management systems.

Regardless of the ultimate system, I’d stress that device independence is important, meaning that an HTML5 website would probably be the place to start: look to Ibis Reader as a model.

Posted in books and reading, future of publishing, information ecosystem, library and information science, scholarly communication | Comments (5)

The Social Semantic Web – a message for scholarly publishers

November 15th, 2010

I always appreciate how Geoffrey Bilder can manage to talk about the Social Semantic Web and early modern printing in (nearly) the same breath. See for yourself in the presentation he gave to scholarly publishers at the International Society of Managing and Technical Editors last month.

Geoff’s presentation is outlined, to a large extent, in an interview Geoff gave 18 months ago (search “key messages” to find the good bits). I hope to blog further about these, because Geoff has so many good things to say, which deserve unpacking!

I especially love the timeline from slide 159, which shows that we’re just past the incunabula age of the Internet:

The Early Modern Internet

We're still in the Early Modern era of the Internet. Compare to the history of print.

Posted in future of publishing, information ecosystem, PhD diary, scholarly communication, semantic web, social semantic web, social web | Comments (3)

A Model-View-Controller perspective of scholarly articles

November 13th, 2010

A scholarly paper is not a PDF. A PDF is merely one view of a scholarly paper. To push ‘beyond the PDF’, we need design patterns that allow us to segregate the user interface of the paper (whether it is displayed as an aggregation of triples, a list of assertions, a PDF, an ePub, HTML, …) from the thing itself.

Towards this end, Steve Pettifer has a Model-View-Controller perspective on scholarly articles, which he shared in a post on the Beyond the PDF listserv, where discussions are leading up to a workshop in January. I am awe-struck: I wish I’d thought of this way of separating the structure and explaining it.

I think a lot of the disagreement about the role of the PDF can be put down to trying to overload its function: to try to imbue it with the qualities of both ‘model’ and ‘view’. … One of the things that software architects (and I suspect designers in general) have learned over the years is that if you try to give something functions that it shouldn’t have, you end up with a mess; if you can separate out the concerns, you get a much more elegant and robust solution.

My personal take on this is that we should keep these things very separate, and that if we do this, then many of the problems we’ve been discussing become more clearly defined (and I hope, many of the apparent contradictions, resolved).

So… a PDF (or come to that, an e-book version or a html page) is merely a *view* of an article. The article itself (the ‘model’) is a completely different (and perhaps more abstract) thing. Views can be tailored for a particular purpose, whether that’s for machine processing, human reading, human browsing, etc etc.

[paragraph break inserted]

The relationship between the views and their underlying model is managed by the concept of a ‘controller’. For example, if we represent an article’s model in XML or RDF (its text, illustrations, associated nanopublications, annotations and whatever else we like), then that model can be transformed into any number of views. In the case of converting XML into human-readable XHTML, there are many stable and mature technologies (XSLT etc). In the case of doing the same with PDF, the traditional controller is something that generates PDFs.

[paragraph break inserted]

The thing that’s been (somewhat) lacking so far is the two-way communication between view and model (via controller) that’s necessary to prevent the views from ossifying and becoming out of date (i.e. there’s no easy way to see that comments have been added to the HTML version of an article’s view if you happen to be reading the PDF version, so the view here can rapidly diverge from its underlying model).

[paragraph break inserted, link added]

Our Utopia software is an attempt to provide this two-way controller for PDFs. I believe that once you have this bidirectional relationship between view and model, then the actual detailed affordances of the individual views (i.e. what can a PDF do well / badly, what can HTML do well / badly) become less important. They are all merely means of channeling the content of an article to its destination (whether that’s human or machine).

The good thing about having this ‘model view controller’ take on the problem is that only the model needs to be pinned down completely …

Perhaps separating out our concerns in this way — that is, treating the PDF as one possible representation of an article — might help focus our criticisms of the current state of affairs? I fear at the moment we are conflating the issues to some degree.

– Steve Pettifer in a Beyond the PDF listserv post
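The model/view split Pettifer describes can be made concrete with a toy example. Here the article "model" is a made-up XML vocabulary of my own (not any real publisher schema), and a small "controller" function renders it into one possible HTML view; a PDF controller would consume the same model:

```python
# Sketch of the model/view/controller split for a scholarly article.
# The XML vocabulary below is invented for illustration only.
import xml.etree.ElementTree as ET

ARTICLE_MODEL = """<article>
  <title>Beyond the PDF</title>
  <author>A. Researcher</author>
  <section><heading>Introduction</heading>
    <p>A paper is not a PDF.</p></section>
</article>"""

def html_view(model_xml: str) -> str:
    """Controller: transform the abstract article model into an HTML view."""
    root = ET.fromstring(model_xml)
    parts = ["<h1>%s</h1>" % root.findtext("title"),
             "<p class='author'>%s</p>" % root.findtext("author")]
    for sec in root.iter("section"):
        parts.append("<h2>%s</h2>" % sec.findtext("heading"))
        parts.extend("<p>%s</p>" % p.text for p in sec.iter("p"))
    return "\n".join(parts)

print(html_view(ARTICLE_MODEL))
```

Only the model's schema needs to be pinned down; adding an EPUB or triple-store view means writing another controller against the same model, which is exactly the point Pettifer makes above.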

I’m particularly interested in hearing if this perspective, using the MVC model, makes sense to others.

Posted in books and reading, future of publishing, information ecosystem, library and information science, scholarly communication, social semantic web | Comments (9)

Organizing a PDF library: Mendeley for information extraction, Zotero for open source goodness

August 27th, 2009

I’ve been using Zotero for a while now. I make no secret of the fact that I’m a big fan. In early July I was testing out Mendeley to give a workshop with a colleague who’s been excited about it.

I wanted to see whether Mendeley could reduce any of my pain points. While I’m not moving to Mendeley*, I do plan to take advantage of its whizz-bang PDF organization. When Mendeley offers Zotero integration, I think I’ll be set. *Zotero is open source; Mendeley is merely free at the moment. Zotero also offers web archiving features, while Mendeley is strictly for PDF organization.

I spend a lot of time reading and pulling materials into my library; I spend far less time organizing materials. So I decided I’d try the PDF metadata functions of each. Zotero can pull in materials lots of different ways, but it doesn’t yet have a “pull this PDF in from this URL” button for reports and things that aren’t in databases. I don’t want to spend my time typing up metadata (I’m lazy and busy, what can I say), but I do want to have an organized library. (Hey, got an organizing business? I’d pay for your services.) So the “get metadata for this PDF” features are of prime interest to me.

I usually have a “to read” pile lying around. I did a very non-scientific test, starting with a folder of 44 PDFs (“PDFs to read”). I dragged them into each program.

Zotero had a small point of failure: I expected “get PDF metadata” to be in the Preferences menu, but I had to look up its location on their website. Happily, it’s easy to find via the Retrieve PDF Metadata page of their support documentation. That page explains that metadata comes from Google Scholar, based on the DOI if one is embedded. That sounds like a reasonable methodology, but one that’s only going to work for recent journal articles and books from e-savvy publishers. Most of the files I dump into “PDFs to read” are preprints from personal websites or reports from nonprofits’ websites. DOIs aren’t expected in that context.
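The DOI-spotting step that this methodology depends on is easy to approximate. The regex below is a common heuristic for DOI syntax, not Zotero's actual implementation, and the Google Scholar lookup that would follow is omitted:

```python
# Sketch: scan text extracted from a PDF for something shaped like a DOI.
# The pattern is a widely used heuristic, not Zotero's real code.
import re

DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+')

def find_doi(text: str):
    """Return the first DOI-like string in text, or None."""
    match = DOI_PATTERN.search(text)
    return match.group(0) if match else None

print(find_doi("see doi:10.1000/example.123 for details"))
```

When a file is a preprint or a nonprofit's report, this step simply finds nothing, which is exactly the failure mode described above.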

Of my 44 test cases, Zotero says “No matching references found.” on 26 of them. Results from the 18 “successful” matches are spottier. The first one I checked leads me to believe that things haven’t changed since the last time I tried out this feature, maybe 8 or 10 months ago. It’s an article called A New Approach to Search [PDF], by Joe Weinman, and it’s available from his website. I can identify the source as Business Communications Review, October 2007 from small type in the footer. So can Mendeley. But Zotero calls it Peters, R. S. 1970. Ethics and education. Allen & Unwin Australia. I’m not really sure why. Google search, perhaps?

Zotero’s ‘identification’ of the next article is even stranger:
Capital, R. Sheriff’s Office moves to new facility. Cell 224: 6547. (Notice: the title and journal don’t even belong together!) This article is actually the contest-winning federated search article published by Computers in Libraries [PDF]. It’s available from the publisher’s website. While Information Today publishes some great articles about technology, their HTML doesn’t carry any semantic information. Since no one has yet written a screen scraper for their site, Zotero can’t auto-grab the metadata. But Mendeley successfully identifies this PDF, too.

I wondered whether Mendeley was grabbing metadata from the files so I took a closer look at these two files. Nope, there was very little usable metadata. (Adobe Bridge is great for reading XMP metadata.) Furthermore, the first article (by Weinman) lists its creator as Sharon Wallach; clearly neither program is pulling that.

Onward and upward: overall, Zotero produced 4 bad identifications and 22 good identifications out of the 44. The 9% false-positive rate is the part that bothers me the most.

Mendeley does better, but it’s not perfect. At first it appears to have identified all 44 PDFs, but there’s a fair bit of missing information (for instance, 13 are missing the “Published in” field). When I looked closely, I found 26 with bad data, 4 that could be improved, and 2 that weren’t identified at all. That means I’m satisfied with only 12 of them, but there’s another important factor: Mendeley marks these files as ‘unreviewed’, meaning the metadata is suspect until I review and/or correct it. So the false positives are easy to detect. This is reassuring, especially since (unlike with Zotero) only one of Mendeley’s identifications was worse than none at all, and it was dead easy to spot:
Fohjoft, W. J., Jg, J. T., Vtfe, T. F., Jo, F., Epo, O., Bcpvu, N. E., et al. (n.d.). !12 3/4 “#$%&$’,5.

It’s interesting to look at where Mendeley fails: non-scientific articles and documents with non-standard title pages. Mendeley chokes on Open Provenance Model and Funny in Farsi (no metadata at all) and labels a Master’s report only with the year (2000).

I’m most curious about Funny in Farsi; I would expect better metadata from Random House, but sure enough, Bridge doesn’t find any. I like Mendeley’s auto-rename feature, but on the files it can’t label, that renaming is a big disadvantage: filenames are often reasonable metadata. These three filenames (opm-v1.01.pdf, Funny_in_Farsi.pdf, and 2576.pdf) give either information about the contents or a chance at refinding the file with a search engine. For opm-v1.01.pdf, googling the filename finds it immediately. For Funny_in_Farsi.pdf, searching for Funny in Farsi provides 8 search results, and a savvy searcher could get more metadata (e.g. the publisher’s name) from them. Searching for 2576.pdf clarke open source finds the third.
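The filename-to-query fallback that auto-renaming destroys is trivial to preserve; a minimal sketch (purely illustrative, not a feature of either program):

```python
# Sketch: recover a search-engine query from a PDF filename, the
# fallback that aggressive auto-renaming throws away.
import re
from pathlib import PurePath

def filename_query(filename: str) -> str:
    """Turn e.g. 'Funny_in_Farsi.pdf' into a plausible search query."""
    stem = PurePath(filename).stem          # drop the .pdf extension
    return re.sub(r'[_\-]+', ' ', stem).strip()

print(filename_query("Funny_in_Farsi.pdf"))   # -> Funny in Farsi
print(filename_query("opm-v1.01.pdf"))        # -> opm v1.01
```

A reference manager could simply record the original filename before renaming, keeping this last-resort metadata available.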

I’m also interested in what neither Zotero nor Mendeley got right. Neither correctly identified a PDF of Highlights of the National Museum of American History. Dragging and dropping the citations (ugly special characters and all) gives:

Parton, J. 2004. Revolutionary Heroes and Other Historical Papers. Kessinger Publishing.

Museum, N., & History, A. (2008). Star-Spangled Banner, 1814. Smithsonian.

Neither does well on the Palmer report, either:

Bird, A. 1994. Careers as repositories of knowledge: a new perspective on boundaryless careers. Journal of Organizational Behavior: 325-344.

Factors, I., Palmer, C. I., Teffeau, P. I., Newton, P. C., Assistant, R., Research, I., et al.
(2008). No title. Library, (August).

With a closer look, you can see Mendeley takes the authors as:
Factors, Identifying
Palmer, C I C Institutional Repository Development Final Report Carole L
Teffeau, Principal Investigator Lauren C
Newton, Project Coordinator Mark P
Assistant, Research
Research, Informatics

If you want more details, please leave a comment or drop me a line; I had hoped to add info but decided just to push this out of my queue. I was thinking about it because Mendeley really does help me review the papers I’ve been meaning to read. Guess it’s time to think about that Mendeley to Zotero workflow again!

Posted in information ecosystem, reviews | Comments (7)