Organizing a PDF library: Mendeley for information extraction, Zotero for open source goodness

I’ve been using Zotero for a while now. I make no secret of the fact that I’m a big fan. In early July I was testing out Mendeley to give a workshop with a colleague who’s been excited about it.

I wanted to see whether Mendeley could reduce any of my pain points. While I’m not moving to Mendeley*, I do plan to take advantage of its whizz-bang PDF organization. When Mendeley offers Zotero integration, I think I’ll be set. *Zotero is open source; Mendeley is merely free at the moment. Zotero also offers web archiving features, while Mendeley is strictly for PDF organization.

I spend a lot of time reading and pulling materials into my library; I spend far less time organizing materials. So I decided I’d try the PDF metadata functions of each. Zotero can pull in materials lots of different ways, but it doesn’t yet have a “pull this PDF in from this URL” button for reports and things that aren’t in databases. I don’t want to spend my time typing up metadata (I’m lazy and busy, what can I say), but I do want to have an organized library. (Hey, got an organizing business? I’d pay for your services.) So the “get metadata for this PDF” features are of prime interest to me.

I usually have a “to read” pile lying around. I did a very non-scientific test, starting with a folder of 44 PDFs (“PDFs to read”). I dragged them into each program.

Zotero had a small point of failure: I expected “get PDF metadata” to be in the Preferences menu, but I had to look up its location on their website. Happily, it’s easy to find from the Support page of zotero.org: Retrieve PDF Metadata. The page explains that metadata comes from Google Scholar, based on the DOI if it’s embedded. That sounds like a reasonable methodology, but one that’s only going to work for recent journal articles and books published by e-savvy publishers. Most of the files I dump into “PDFs to read” are preprints from personal websites or reports from nonprofits’ websites. DOIs aren’t expected in that context.
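To make the DOI point concrete, here is a minimal sketch of what "look for an embedded DOI" could mean. It is not Zotero's actual code: the pattern is a commonly used approximation of DOI syntax, and the two sample strings are made up, standing in for text already extracted from a PDF's first page.

```python
import re

# Approximate DOI shape: "10.", a 4-9 digit registrant code, "/", then a suffix.
# This is an illustrative pattern, not a complete DOI validator.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')


def find_doi(text):
    """Return the first DOI-like string in text, or None if there is none."""
    match = DOI_PATTERN.search(text)
    if match is None:
        return None
    # Trailing punctuation usually belongs to the sentence, not the DOI.
    return match.group(0).rstrip(".,;)")


# Hypothetical samples: a journal article page and a nonprofit report page.
journal_page = "Journal of Examples 12(3). doi:10.1234/example.2009.001. Received May 2009."
nonprofit_report = "Annual Report 2008, prepared by the staff of a small nonprofit."

print(find_doi(journal_page))      # 10.1234/example.2009.001
print(find_doi(nonprofit_report))  # None
```

A preprint or report with no DOI falls straight through to `None`, which is the situation most of my “PDFs to read” are in.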

Of my 44 test cases, Zotero says “No matching references found.” on 26 of them. Results from the 18 “successful” matches are spottier. The first one I checked leads me to believe that things haven’t changed since the last time I tried out this feature, maybe 8 or 10 months ago. It’s an article called A New Approach to Search [PDF], by Joe Weinman, and it’s available from his website. I can identify the source as Business Communications Review, October 2007 from small type in the footer. So can Mendeley. But Zotero calls it Peters, R. S. 1970. Ethics and education. Allen & Unwin Australia. I’m not really sure why. Google search, perhaps?

Zotero’s ‘identification’ of the next article is even stranger:
Capital, R. Sheriff’s Office moves to new facility. Cell 224: 6547. (Notice: the title and journal don’t even belong together!) This article is actually the contest-winning federated search article published by Computers in Libraries [PDF]. It’s available from the publisher’s website. While Information Today publishes some great articles about technology, their HTML doesn’t have any semantic information. Since no one’s yet written a screenscraper for their site, Zotero can’t auto-grab the metadata. But Mendeley successfully identifies this PDF, too.

I wondered whether Mendeley was grabbing metadata from the files so I took a closer look at these two files. Nope, there was very little usable metadata. (Adobe Bridge is great for reading XMP metadata.) Furthermore, the first article (by Weinman) lists its creator as Sharon Wallach; clearly neither program is pulling that.

Onward and upward: overall, Zotero gave 4 bad identifications and 22 good identifications out of the 44. The false positive rate of 9% (4 of 44) is the part that bothers me the most.

Mendeley does better, but it’s not perfect. At first it appears to have identified all 44 PDFs, but there’s a fair bit of missing information (for instance, 13 are missing the “Published in” field). When I looked closely, I found 26 with bad data, 4 that could be improved, and 2 that weren’t identified. That means I’m satisfied with only 12 of these, but there’s another important factor: Mendeley marks these files as ‘unreviewed’, meaning that the metadata is suspect until I review and/or correct it. So the false positives are easy to detect. This is reassuring, especially since (unlike Zotero) only one of Mendeley’s identifications was worse than none at all, and it was dead easy to spot:
ï»¿Fohjoft, W. J., Jg, J. T., Vtfe, T. F., Jo, F., Epo, O., Bcpvu, N. E., et al. (n.d.). !12 3/4 “#$%&$’,5.
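For anyone who wants to check the arithmetic, the tallies above reduce to a few lines. This is only a sketch using the counts reported in this post; the variable names are mine.

```python
TOTAL_PDFS = 44

# Zotero: 4 of the 44 PDFs were matched to the wrong reference.
zotero_bad = 4
zotero_false_positive_rate = zotero_bad / TOTAL_PDFS

# Mendeley: counts of PDFs whose metadata I was not satisfied with.
mendeley_problems = {"bad data": 26, "could be improved": 4, "not identified": 2}
mendeley_satisfactory = TOTAL_PDFS - sum(mendeley_problems.values())

print(f"Zotero false positives: {zotero_false_positive_rate:.0%}")
print(f"Mendeley satisfactory: {mendeley_satisfactory} of {TOTAL_PDFS}")
```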

It’s interesting to look at where Mendeley fails: non-scientific articles and documents with non-standard title pages. Mendeley chokes on Open Provenance Model and Funny in Farsi (no metadata at all) and labels a Master’s report only with the year (2000).

I’m most interested in Funny in Farsi; I would expect better metadata from Random House, but sure enough Bridge doesn’t find any. I like Mendeley’s auto-rename feature, but on the files it doesn’t label, that renaming is a big disadvantage: filenames are often reasonable metadata. These three filenames (opm-v1.01.pdf, Funny_in_Farsi.pdf, and 2576.pdf) give either information about the contents or a chance at refinding the file with a search engine. For opm-v1.01.pdf, googling the filename finds it immediately. For Funny_in_Farsi.pdf, searching for Funny in Farsi provides 8 search results, and a savvy searcher could get more metadata (e.g. the publisher’s name) from the results. Searching for 2576.pdf clarke open source finds the third.
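The “filenames are metadata” idea is easy to sketch: before a tool renames a file, the original name can be turned into a search phrase. The helper below is hypothetical (it is not a feature of Zotero or Mendeley) and only handles the simple cases from my three examples.

```python
import re
from pathlib import Path


def filename_to_query(filename):
    """Turn a PDF filename into a plain search phrase."""
    stem = Path(filename).stem            # drop the ".pdf" extension
    words = re.sub(r"[_\s]+", " ", stem)  # underscores become spaces
    return words.strip()


for name in ["opm-v1.01.pdf", "Funny_in_Farsi.pdf", "2576.pdf"]:
    print(name, "->", filename_to_query(name))
```

A bare number like 2576 still needs extra keywords to be findable, which is why the third search above adds clarke open source.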

I’m also interested in what neither Zotero nor Mendeley got right. Neither correctly identified a PDF with Highlights of the National Museum of American History. Drag and drop of citations (with ugly special characters and all) gives

Zotero:
Parton, J. 2004. Revolutionary Heroes and Other Historical Papers. Kessinger Publishing.

Mendeley:
ï»¿Museum, N., & History, A. (2008). Star-Spangled Banner, 1814. Smithsonian.

Neither does well on the Palmer report, either:

Zotero:
Bird, A. 1994. Careers as repositories of knowledge: a new perspective on boundaryless careers. Journal of Organizational Behavior: 325-344.

Mendeley:
ï»¿Factors, I., Palmer, C. I., Teffeau, P. I., Newton, P. C., Assistant, R., Research, I., et al.
(2008). No title. Library, (August).

With a closer look, you can see Mendeley takes the authors as:
Factors, Identifying
Palmer, C I C Institutional Repository Development Final Report Carole L
Teffeau, Principal Investigator Lauren C
Newton, Project Coordinator Mark P
Assistant, Research
Research, Informatics

If you want more details, please leave a comment or drop me a line; I had hoped to add info but decided just to push this out of my queue. I was thinking about it because Mendeley really does help me review the papers I’ve been meaning to read. Guess it’s time to think about that Mendeley to Zotero workflow again!

August 27th, 2009

by jodi

Tags: information extraction, mendeley, metadata, organizing PDFs, scholarly publishing, zotero
Posted in information ecosystem, reviews | Comments (7)

Jan says:

August 28, 2009 at 5:33 am

Hi Jodi,

thanks for this detailed test and report, and sorry if Mendeley doesn’t get everything right, but we are trying hard to further improve, as you can guess.

Would you mind sending the research papers/files where the relevant metadata has been extracted incorrectly to support@mendeley.com so that we can have a look at them?

BTW, as an additional comment, you can also use Mendeley to store and archive additional file types (e.g. HTML, DOC – simply attach them to the metadata set), and Mendeley’s bookmarklet will soon be able to grab whole web pages as well, but we really hope that the connection to Zotero will improve the user experience even more.

Let me know if you have any further questions – you can also have a look at our feedback forum (http://feedback.mendeley.com) to see upcoming features and improvements.

Thanks again
Jan
(jan.reichelt@mendeley.com)

Jodi Schneider says:

August 28, 2009 at 9:37 am

Thanks, Jan! I’ve shared the problem files with support@mendeley.com.
I know you’ve pushed out several Mendeley updates since I completed this test on July 2nd.

js says:

August 28, 2009 at 10:48 am

I’ve been experimenting with Zotero & Mendeley for a while now, both as a LIS student and professionally as a research assistant on a big old meta analysis project. Like you, I decided that Mendeley would be best put to use to organize the hundreds and potentially ultimately thousands of PDFs we are using in our study. So far, so good, though I’m also looking forward to the proposed collaboration between Mendeley & Zotero (more for my personal studies than at work).

My workflow is set up so that Mendeley watches a folder that our research associates save PDFs to, and I go in on a weekly basis and check the metadata for review (the new lookup features for questionable metadata are great in the newest release, especially if you’re lucky enough to be looking for articles that tend to be in PubMed, as I am. Google still leaves something to be desired.) and then tag and sort the PDFs in the “unsorted” um…pile in Mendeley into groups. I imagine that if you’re pulling PDFs through Zotero, you could set up Mendeley to watch the folder that Zotero saves to, so that you don’t have to do too much moving of PDFs.

Thanks for doing this side-by-side comparison. I love bibliographic tool geekouts.

-Jess

August 28, 2009 at 10:57 am

PS. I’m a Johnnie too. SF02!

Brook says:

September 30, 2009 at 1:27 am

So I’m a 1st year Ph.D. student in O&M (orgs and management – part business, part sociology), and I am debating which platform to use to organize the next 5-10 years worth of research… any thoughts on which would be a better option for those of us starting basically from square one?

October 9, 2009 at 9:34 pm

Brook,

Had to think about this. I’d stick with Zotero for now. You can push files from Zotero to Mendeley easily right now–but not vice versa yet.

NellyG says:

October 16, 2009 at 6:09 pm

Hi,
Thanks for the great review!
I am about to start using Zotero to manage my huge PDF collection (thousands). I prefer Mendeley’s interface, but unfortunately its capacity is limited to 500 MB, so I chose Zotero instead.
I was wondering how to make the best of Zotero, and especially how you organise your PDFs: do you use categories? By project/topic? Tags alone? Or a combination of tags and folders?
I want to make the right decision now and not have to redo the whole thing in 6 months, so I would really appreciate some advice.

Thanks!

jodischneider.com/blog
