Yesterday the NYTimes announced a new API, TimesTags, “based on the taxonomy and controlled vocabulary used by Times indexers since 1851”. The browseable version of this vocabulary, http://topics.nytimes.com/ , is a great entry into NYTimes articles published since 1981.
Ed Summers did some scraping while also asking the Open NYTimes team for a SKOS version. Meanwhile, I’m playing around with the classification (online and scraped). Its quirks seem to reflect how it’s been used, and how it has evolved over time. Classification systems can highlight the material classified; they also tend to give insight into the worldview of the people classifying materials or creating the system. The interplay makes integration of classification systems, such as through topic maps, an interesting research area. But that’s a topic for another day.
Here are some things I’ve noticed while playing around with the vocabulary.
Overall Structure
The NYTimes’ main navigation lists 15 sections. The NYTimes taxonomy has 3 top-level categories: news, opinion, and reference. 7 sections fit within the news taxonomy. Opinion has its own category. Travel is an explicit subject within the reference category. Technology, arts, and style are topical, drawing primarily on the reference category. (Cooking, however, is similar to travel in its treatment.) The 3 advertising sections (jobs, real estate, and auto) are already classified, and thus, out of scope.
The remaining 7 sections we dub “news”. Here are examples of taxonomy terms, showing the category structure:
News
- World: international/countriesandterritories
http://topics.nytimes.com/top/news/international/countriesandterritories/canada
- U.S.: national/usstatesterritoriesandpossessions/
http://topics.nytimes.com/top/news/national/usstatesterritoriesandpossessions/michigan
- N.Y. / Region: newyorkandregion, nyregion
http://topics.nytimes.com/top/news/newyorkandregion/columns/lens/
http://topics.nytimes.com/top/news/nyregion/columns/clydehaberman/
Both nyregion and newyorkandregion are in use, but they are not interchangeable: neither redirects to the other.
- Business: business/companies
http://topics.nytimes.com/top/news/business/companies/spicy-pickle-franchising-inc
- Science: science/topics
http://topics.nytimes.com/top/news/science/topics/quasars
- Health: health/diseasesconditionsandhealthtopics
http://topics.nytimes.com/top/news/health/diseasesconditionsandhealthtopics/amnesia
As the name (diseases, conditions, health topics) suggests, this encompasses a wide range of topics: particular drugs such as Ritalin, categories of drugs such as antibiotics, topics such as smoking, sleep, teenage pregnancy, and twins, and professional groups such as surgery and surgeons.
- Sports: sports, olympics
http://topics.nytimes.com/top/news/sports/baseball/majorleague/philadelphiaphillies
http://topics.nytimes.com/top/news/sports/probasketball/nationalbasketballassociation/atlantahawks
Beyond sports, subcategory names vary considerably. Other sections, such as for the Olympics, are outside the main hierarchy:
http://topics.nytimes.com/olympics/2008/swimming
Opinion: opinion
http://topics.nytimes.com/top/opinion/editorialsandoped/oped/columnists/bobherbert
http://topics.nytimes.com/top/opinion/thepubliceditor/calame
Again, beyond opinion, there is variation. However, editorialsandoped is the main subcategory.
Reference: reference
http://topics.nytimes.com/top/reference/timestopics/organizations/m/mozilla_foundation
http://topics.nytimes.com/top/reference/timestopics/subjects/s/swimming
Travel is handled as a subject: http://topics.nytimes.com/top/reference/timestopics/subjects/t/travel_and_vacations
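The URL patterns above can be taken apart mechanically. Here is a minimal sketch in Python, assuming that every path in the main hierarchy starts with "top" and that the last segment is the term itself; the function name parse_topic_url and the dictionary keys are my own labels, not anything the NYTimes defines.

```python
from urllib.parse import urlparse


def parse_topic_url(url):
    """Split a topics.nytimes.com URL into category, path, and slug.

    Assumption: URLs in the main hierarchy begin with /top/, while
    others (such as the Olympics pages) do not.
    """
    segments = [s for s in urlparse(url).path.split("/") if s]
    in_main = bool(segments) and segments[0] == "top"
    if in_main:
        segments = segments[1:]
    return {
        "in_main_hierarchy": in_main,
        "category": segments[0] if segments else None,
        "path": segments[1:-1],  # everything between category and slug
        "slug": segments[-1] if segments else None,
    }


if __name__ == "__main__":
    for example in (
        "http://topics.nytimes.com/top/news/science/topics/quasars",
        "http://topics.nytimes.com/olympics/2008/swimming",
    ):
        print(parse_topic_url(example))
```

Trailing slashes are ignored, so the two N.Y. / Region examples parse the same way as the others.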
Spelling Discrepancies
Drugs (Pharmaceuticals) has two spellings: drugs_pharmaceuticals and drugspharmaceuticals are aliases.
The pair E TRADE Financial Corporation and E*Trade Financial Corporation, however, looks like an error: the two entries share some data but not all of it. Either that, or there is a bizarre story behind it.
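One way to hunt for more pairs like these in a scraped list is to compare terms after stripping case and punctuation. This is a rough sketch, not a method the Times uses: normalize_term and find_alias_candidates are hypothetical helpers, and the rule of dropping everything except letters and digits is my own assumption. It only flags candidates; deciding whether a pair is an alias or an error still takes a human look.

```python
import re
from collections import defaultdict


def normalize_term(term):
    """Lowercase a term and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", term.lower())


def find_alias_candidates(terms):
    """Group terms that collapse to the same normalized form."""
    groups = defaultdict(list)
    for term in terms:
        groups[normalize_term(term)].append(term)
    return [sorted(group) for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    scraped = [
        "drugs_pharmaceuticals",
        "drugspharmaceuticals",
        "E TRADE Financial Corporation",
        "E*Trade Financial Corporation",
        "fossils",
    ]
    for group in find_alias_candidates(scraped):
        print(group)
```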
Differences in usage
Where to put recipes
Apples is a subcategory of cooking:
http://topics.nytimes.com/top/reference/timestopics/subjects/c/cooking_and_cookbooks/apples
Perhaps because apples tend to be used as a cultural reference? Still, where do apple recipes belong?
Pumpkins, on the other hand, has a subcategory for recipes:
http://topics.nytimes.com/top/reference/timestopics/subjects/p/pumpkins/recipes
Dogs are in science, but fossils are not
While most subjects are classified only alphabetically, there are exceptions. Compare fossils to dogs.
Fossils is a plain-old subject (subjects/f):
http://topics.nytimes.com/top/reference/timestopics/subjects/f/fossils/
Dogs, however, is a science topic (news/science/topics):
http://topics.nytimes.com/top/news/science/topics/dogs/
I wonder if that’s because dogs are a more common subject than fossils?
Saying what you mean
Disambiguation, eh? Here, shrimp is a topic within science, so don’t expect recipes (except in the ads):
http://topics.nytimes.com/top/news/science/topics/shrimp
Category structure
Prominent subtopics
Subtopics are sometimes listed at the top level. For instance, United States Attorneys seems to contain United States Attorneys: Editorials & Opinion. Both are listed at the top of the topics tree.
I find it fascinating that Cookies and Cookies, Recipes are separate topics. Again, culturally justified.
Depth of categories
There may be several levels of subcategories, e.g.
http://topics.nytimes.com/top/news/science/topics/space_shuttle/atlantis
http://topics.nytimes.com/top/reference/timestopics/subjects/w/wines/alsace
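To measure depth across a scraped set of URLs, the topic chain can be pulled out of each path. The sketch below assumes only the two URL shapes seen in this post: reference terms filed as timestopics/<type>/<letter>/<topic>/..., and news terms filed under a "topics" segment. The function name topic_chain is hypothetical, and any other shape simply returns an empty list.

```python
from urllib.parse import urlparse


def topic_chain(url):
    """Return the topic and its subtopics, outermost first.

    Assumption: only the two URL shapes shown in this post are handled.
    """
    segments = [s for s in urlparse(url).path.split("/") if s]
    if "timestopics" in segments:
        # Skip "timestopics", the type (e.g. subjects), and the letter bucket.
        start = segments.index("timestopics") + 3
    elif "topics" in segments:
        start = segments.index("topics") + 1
    else:
        return []
    return segments[start:]


if __name__ == "__main__":
    for example in (
        "http://topics.nytimes.com/top/news/science/topics/space_shuttle/atlantis",
        "http://topics.nytimes.com/top/reference/timestopics/subjects/w/wines/alsace",
    ):
        chain = topic_chain(example)
        print(len(chain), chain)
```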
Mixing of keyword and controlled terminology
I’m surprised to find “hot dogs” pieces as the top two “articles about dogs”, right after some nice featured content. Keyword matches appear to be mixed in with the controlled terms; NYTimes may want to refine its handling of multiword terms.
Another example is “Baby Quasar(Skin Care Devise)” showing up under quasars.
By versus About
Times writers (e.g. Tom Zeller Jr.) are listed in italics and classified as people. The ‘by’ versus ‘about’ distinction is made primarily in meta tags. “PSST” seems to identify Times writers. For instance, compare the meta tags from Tom Zeller Jr.’s page:
<meta name="PT" content="Topic" />
<meta name="CG" content="Times Topics" />
<meta name="GTN" content="Zeller, Tom Jr." />
<meta name="PST" content="People" />
<meta name="PSST" content="Writer" />
to those on (non-Times) writer Toni Morrison’s page:
<meta name="PT" content="Topic" />
<meta name="CG" content="Times Topics" />
<meta name="GTN" content="Morrison, Toni" />
<meta name="PST" content="People" />
<meta name="SCG" content="The Public Editor" />
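If the PSST guess holds, the ‘by’ versus ‘about’ check can be automated over scraped pages. Here is a small sketch using Python's standard html.parser; the rule that PST of "People" plus PSST of "Writer" marks a Times writer is my inference from these two pages, not documented behavior, and MetaTagParser, read_meta, and is_times_writer are names I made up.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect name/content pairs from the meta tags of a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing tags (<meta ... />) are also routed here by HTMLParser.
        if tag == "meta":
            attr_map = dict(attrs)
            if "name" in attr_map and "content" in attr_map:
                self.meta[attr_map["name"]] = attr_map["content"]


def read_meta(html):
    parser = MetaTagParser()
    parser.feed(html)
    return parser.meta


def is_times_writer(html):
    # Assumption: PSST == "Writer" on a People page marks a Times writer.
    meta = read_meta(html)
    return meta.get("PST") == "People" and meta.get("PSST") == "Writer"


ZELLER = """
<meta name="PT" content="Topic" />
<meta name="GTN" content="Zeller, Tom Jr." />
<meta name="PST" content="People" />
<meta name="PSST" content="Writer" />
"""

MORRISON = """
<meta name="PT" content="Topic" />
<meta name="GTN" content="Morrison, Toni" />
<meta name="PST" content="People" />
<meta name="SCG" content="The Public Editor" />
"""

if __name__ == "__main__":
    for page in (ZELLER, MORRISON):
        print(read_meta(page)["GTN"], is_times_writer(page))
```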
Final thoughts
The world of electronic publishing blurs the lines between producers and indexers. Archival content, served up by organization, person, or topic, is a great offering. The secondary publishing market (abstracting, indexing, etc.) is changing quickly. Source-based browsing, as at NYTimes Topics, is part of that change.