A Model-View-Controller perspective of scholarly articles

November 13th, 2010
by jodi

A scholarly paper is not a PDF. A PDF is merely one view of a scholarly paper. To push ‘beyond the PDF’, we need design patterns that allow us to segregate the user interface of the paper (whether it is displayed as an aggregation of triples, a list of assertions, a PDF, an ePub, HTML, …) from the thing itself.

Towards this end, Steve Pettifer has a Model-View-Controller perspective on scholarly articles, which he shared in a post on the Beyond the PDF listserv, where discussions are leading up to a workshop in January. I am awe-struck: I wish I’d thought of this way of separating the structure and explaining it.

I think a lot of the disagreement about the role of the PDF can be put down to trying to overload its function: to try to imbue it with the qualities of both ‘model’ and ‘view’. … One of the things that software architects (and I suspect designers in general) have learned over the years is that if you try to give something functions that it shouldn’t have, you end up with a mess; if you can separate out the concerns, you get a much more elegant and robust solution.

My personal take on this is that we should keep these things very separate, and that if we do this, then many of the problems we’ve been discussing become more clearly defined (and I hope, many of the apparent contradictions, resolved).

So… a PDF (or come to that, an e-book version or an HTML page) is merely a *view* of an article. The article itself (the ‘model’) is a completely different (and perhaps more abstract) thing. Views can be tailored for a particular purpose, whether that’s for machine processing, human reading, human browsing, etc.

[paragraph break inserted]

The relationship between the views and their underlying model is managed by the concept of a ‘controller’. For example, if we represent an article’s model in XML or RDF (its text, illustrations, associated nanopublications, annotations and whatever else we like), then that model can be transformed into any number of views. In the case of converting XML into human-readable XHTML, there are many stable and mature technologies (XSLT etc). In the case of doing the same with PDF, the traditional controller is something that generates PDFs.

[paragraph break inserted]

The thing that’s been (somewhat) lacking so far is the two-way communication between view and model (via controller) that’s necessary to prevent the views from ossifying and becoming out of date (i.e. there’s no easy way to see that comments have been added to the HTML version of an article’s view if you happen to be reading the PDF version, so the view here can rapidly diverge from its underlying model).

[paragraph break inserted, link added]

Our Utopia software is an attempt to provide this two-way controller for PDFs. I believe that once you have this bidirectional relationship between view and model, then the actual detailed affordances of the individual views (i.e. what can a PDF do well / badly, what can HTML do well / badly) become less important. They are all merely means to channeling the content of an article to its destination (whether that’s human or machine).

The good thing about having this ‘model view controller’ take on the problem is that only the model needs to be pinned down completely …

Perhaps separating out our concerns in this way — that is, treating the PDF as one possible representation of an article — might help focus our criticisms of the current state of affairs? I fear at the moment we are conflating the issues to some degree.

– Steve Pettifer in a Beyond the PDF listserv post
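
To make the separation concrete, here is a minimal Python sketch of the idea in the quote above. It is only an illustration under my own assumptions: the names (`Article`, `html_view`, `text_view`, `Controller`) are hypothetical and come neither from Utopia nor from Steve's post, and the plain-text view merely stands in for a print-oriented format such as PDF.

```python
from dataclasses import dataclass, field
from html import escape
from typing import Callable, Dict, List


@dataclass
class Article:
    """The model: the article itself, independent of any rendering."""
    title: str
    paragraphs: List[str]
    comments: List[str] = field(default_factory=list)


def html_view(article: Article) -> str:
    """One view: an HTML rendering for browsers."""
    parts = [f"<h1>{escape(article.title)}</h1>"]
    parts += [f"<p>{escape(p)}</p>" for p in article.paragraphs]
    parts += [f"<aside>{escape(c)}</aside>" for c in article.comments]
    return "\n".join(parts)


def text_view(article: Article) -> str:
    """Another view: plain text (a stand-in for a print format like PDF)."""
    lines = [article.title.upper(), ""]
    lines += article.paragraphs
    lines += [f"[comment] {c}" for c in article.comments]
    return "\n".join(lines)


class Controller:
    """Mediates between the model and its views, in both directions."""

    def __init__(self, article: Article,
                 views: Dict[str, Callable[[Article], str]]) -> None:
        self.article = article
        self.views = views
        self.rendered: Dict[str, str] = {}
        self.refresh()

    def refresh(self) -> None:
        """Model -> views: re-render every registered view from the model."""
        self.rendered = {name: render(self.article)
                         for name, render in self.views.items()}

    def add_comment(self, text: str) -> None:
        """View -> model: record the change once, then update all views."""
        self.article.comments.append(text)
        self.refresh()


# Hypothetical usage: a comment added via one interface reaches every view.
article = Article("A scholarly paper is not a PDF",
                  ["A PDF is merely one view of a paper."])
controller = Controller(article, {"html": html_view, "text": text_view})
controller.add_comment("Does this hold for open access journals?")
print(controller.rendered["text"])
```

The point of the sketch is that the comment is stored only in the model. Because every view is regenerated from that model, the plain-text rendering cannot silently fall behind the HTML one, which is the divergence the quote warns about.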

I’m particularly interested in hearing if this perspective, using the MVC model, makes sense to others.

Posted in books and reading, future of publishing, information ecosystem, library and information science, scholarly communication, social semantic web | Comments (9)

  • Bruce says:

    I think MVC here is different jargon for what has otherwise been called “semantic documents.” What else is an XML format like NLM than a way to encode the abstract logic of an article in ways that can be easily repurposed for different views?

    FWIW, I’d really like to see us push the boundaries of web technologies (HTML, RDF, JS, etc.). PDF is boring.

  • Jodi says:

    To me, MVC is more evocative than ‘semantic documents’: you can have semantic publishing without offering different interfaces. In other words, I see ‘semantic documents’ as dealing with the model, but not necessarily with the view.

    Further, people don’t understand ‘semantic’. MVC, while similarly opaque outside of CS, at least provides another way to explain the problem.

    Some semantic publishing evangelists don’t seem to see anything wrong with PDF-only and PDF-mainly publishing.

    Are you following ‘beyondthepdf’ discussions? I think you’d have a lot to add there, Bruce!

    Personally, I’m looking for a lot more ePub, which has all the goodness of HTML (JS and RDF depend on the ereader client software) with the advantages of single-document packaging that contribute to PDF’s popularity.

  • Steve says:

    PDF may be boring, but in spite of all the publishers’ efforts to do cool things with HTML/RDF/JS (and there are lots of these!), over 80% of scholarly articles are downloaded as PDF files (even when ostensibly ‘better’ alternatives are pushed in preference). So PDF seems to be doing something right for scientists that current on-line offerings don’t do.

  • Jodi says:

    I’m curious if the percentage of PDF downloads is different for open access journals. The “download now, read wherever/whenever” is still a selling point of PDF. Even with good end-user oriented web archiving tools (like the bibliographic management software Zotero), HTML is still not a download-without-checking archiving format.

    The CACM article discussed recently on beyondthepdf noted that formatting is also an issue–this is very true for well-typeset journals with insets and extra content that is often harder to notice and use in HTML versions.

  • Jonathan says:

    I’m not sure I understand what MVC has to offer here, but I _am_ sure that it makes sense to treat a PDF as one possible representation of an article. In your software design, in your data modelling.

    I think this is more a data modelling question than a software design question, and MVC is more about software design. Data modelling questions have implications in how you should most usefully design URLs in your app; an MVC design may or may not be helpful there. Data modelling questions also have implications in how you store data internally in your app, and how you share it with others, which is not really about MVC at all.

  • [And if people don’t understand data modelling, I don’t think it’s helpful to try and ‘educate’ them by talking about software architecture as if it were data modelling. That will just confuse them further. MVC is about software architecture.]

  • Jodi says:

    You’re right, Jonathan–this is only useful for software architects.

    That said, I don’t know what sort of data modeling or document modeling, well, model, to suggest. How do you formalize “separate content and structure?”

  • Louis says:

    I don’t work in this area, but the reason PDF is so popular is clear: everybody has a reader/note taking program, authors can easily produce it, and the PDF captures the document the way the author imagined it in a single, highly portable file.

    This question is like asking why unstructured search demolished user-added RDF in the late 90’s. The reason is built into the publishing model, not the enabling technology. If few-author papers are replaced with polymath-style publications as the norm, PDF will probably be replaced by HTML.