Archive for October, 2008

NYTimes Topics: Quirky, Useful Classification, Finding Aid

October 23rd, 2008

Yesterday the NYTimes announced a new API, TimesTags, “based on the taxonomy and controlled vocabulary used by Times indexers since 1851”. The browseable version of this vocabulary, http://topics.nytimes.com/ , is a great entry into NYTimes articles published since 1981.

NYTimes Topics


Ed Summers did some scraping while also asking the Open NYTimes team for a SKOS version. Meanwhile, I’m playing around with the classification (online and scraped). Its quirks seem to reflect how it’s been used, and how it has evolved over time. Classification systems can highlight the material classified; they also tend to give insight into the worldview of the people classifying materials or creating the system. The interplay makes integration of classification systems, such as through topic maps, an interesting research area. But that’s a topic for another day.

Here are some things I’ve noticed while playing around with the vocabulary.

Overall Structure

The NYTimes’ main navigation lists 15 sections. The NYTimes taxonomy has 3 top-level categories: news, opinion, and reference. 7 sections fit within the news taxonomy. Opinion has its own category. Travel is an explicit subject within the reference category. Technology, arts, and style are topical, drawing primarily on the reference category. (Cooking, however, is similar to travel in its treatment.) The 3 advertising sections (jobs, real estate, and auto) are already classified, and thus, out of scope.

We dub the 7 sections that fit within the news taxonomy “news”. Here are examples of taxonomy terms, showing the category structure:

News

  1. World: international/countriesandterritories
    http://topics.nytimes.com/top/news/international/countriesandterritories/canada
  2. U.S.: national/usstatesterritoriesandpossessions/
    http://topics.nytimes.com/top/news/national/usstatesterritoriesandpossessions/michigan
  3. N.Y. / Region: newyorkandregion, nyregion
    http://topics.nytimes.com/top/news/newyorkandregion/columns/lens/

    http://topics.nytimes.com/top/news/nyregion/columns/clydehaberman/
    nyregion and newyorkandregion are both used, but they are not interchangeable (in the sense that there aren’t redirects)
  4. Business: business/companies
    http://topics.nytimes.com/top/news/business/companies/spicy-pickle-franchising-inc
  5. Science: science/topics
    http://topics.nytimes.com/top/news/science/topics/quasars

  6. Health: health/diseasesconditionsandhealthtopics
    http://topics.nytimes.com/top/news/health/diseasesconditionsandhealthtopics/amnesia
    As the name (diseases, conditions, health topics) suggests, this encompasses a wide range of topics: particular drugs such as Ritalin, categories of drugs such as antibiotics, topics such as smoking, sleep, teenage pregnancy, and twins, and professional groups such as surgery and surgeons.
  7. Sports: sports, olympics
    http://topics.nytimes.com/top/news/sports/baseball/majorleague/philadelphiaphillies
    http://topics.nytimes.com/top/news/sports/probasketball/nationalbasketballassociation/atlantahawks

    Beyond sports, subcategory names vary considerably. Other sections, such as for the Olympics, are outside the main hierarchy:
    http://topics.nytimes.com/olympics/2008/swimming

Opinion: opinion

http://topics.nytimes.com/top/opinion/editorialsandoped/oped/columnists/bobherbert
http://topics.nytimes.com/top/opinion/thepubliceditor/calame
Again, beyond opinion, there is variation. However, editorialsandoped is the main subcategory.

Reference: reference

http://topics.nytimes.com/top/reference/timestopics/organizations/m/mozilla_foundation

http://topics.nytimes.com/top/reference/timestopics/subjects/s/swimming

Travel
is handled as a subject: http://topics.nytimes.com/top/reference/timestopics/subjects/t/travel_and_vacations
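
The URL patterns above can be taken apart mechanically. Here is a minimal sketch, assuming that every page inside the main hierarchy starts with /top/ followed by one of the three top-level categories. The function name parse_topic_url is made up for illustration; it is not part of any NYTimes API.

```python
from urllib.parse import urlparse

# The three top-level categories described in the post
TOP_LEVEL = {"news", "opinion", "reference"}

def parse_topic_url(url):
    """Split a topics.nytimes.com URL into (top-level category, remaining path segments).

    Hypothetical helper: assumes main-hierarchy URLs look like /top/<category>/...
    """
    segments = [s for s in urlparse(url).path.split("/") if s]
    if len(segments) >= 2 and segments[0] == "top" and segments[1] in TOP_LEVEL:
        return segments[1], segments[2:]
    # Anything else (e.g. the Olympics pages) sits outside the main hierarchy
    return None, segments

print(parse_topic_url("http://topics.nytimes.com/top/news/science/topics/quasars"))
# ('news', ['science', 'topics', 'quasars'])
print(parse_topic_url("http://topics.nytimes.com/olympics/2008/swimming"))
# (None, ['olympics', '2008', 'swimming'])
```

The number of remaining segments gives a rough measure of how deep a term sits in its category.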

Spelling Discrepancies

Drugs (Pharmaceuticals) has two spellings: drugs_pharmaceuticals and drugspharmaceuticals are aliases.

E TRADE Financial Corporation and E*Trade Financial Corporation, however, appear to be an error: the two entries have some data in common and other data not in common. Either it’s an error or there’s a bizarre story behind it.
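
One way to hunt for more discrepancies like these in a scraped copy of the vocabulary is to group terms that become identical once case, spaces, and punctuation are ignored. This is only a rough sketch: the helper names are invented for illustration, and a shared key only flags candidates for a human to check; it does not prove two entries are aliases.

```python
import re
from collections import defaultdict

def normalize_term(term):
    """Lowercase a term and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", term.lower())

def group_possible_aliases(terms):
    """Return groups of terms that share the same normalized key."""
    groups = defaultdict(list)
    for term in terms:
        groups[normalize_term(term)].append(term)
    return [sorted(group) for group in groups.values() if len(group) > 1]

terms = [
    "drugs_pharmaceuticals",
    "drugspharmaceuticals",
    "E TRADE Financial Corporation",
    "E*Trade Financial Corporation",
    "fossils",
]
for group in group_possible_aliases(terms):
    print(group)
```

Both pairs mentioned above are flagged, while fossils is left alone.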

Differences in usage

Where to put recipes

Apples is a subcategory of cooking:
http://topics.nytimes.com/top/reference/timestopics/subjects/c/cooking_and_cookbooks/apples

Perhaps because apples tend to be used as a cultural reference? Still, where do apple recipes belong?

Pumpkins, on the other hand, has a subcategory for recipes:
http://topics.nytimes.com/top/reference/timestopics/subjects/p/pumpkins/recipes

Dogs are in science, but fossils are not

While most subjects are classified only alphabetically, there are exceptions. Compare fossils to dogs.
Fossils is a plain-old subject (subjects/f):
http://topics.nytimes.com/top/reference/timestopics/subjects/f/fossils/

Dogs, however, is a science topic (news/science/topics):
http://topics.nytimes.com/top/news/science/topics/dogs/
I wonder if that’s because dogs are a more common subject than fossils?

Saying what you mean

Disambiguation, eh? Here, shrimp is a topic within science, so don’t expect recipes (except in the ads):
http://topics.nytimes.com/top/news/science/topics/shrimp

Category structure

Prominent subtopics

Subtopics are sometimes listed at the top level. For instance, United States Attorneys seems to contain United States Attorneys: Editorials & Opinion. Both are listed at the top of the topics tree.

I find it fascinating that Cookies and Cookies, Recipes are separate topics. Again, culturally justified.

Depth of categories

There may be several levels of subcategories, e.g.

http://topics.nytimes.com/top/news/science/topics/space_shuttle/atlantis

http://topics.nytimes.com/top/reference/timestopics/subjects/w/wines/alsace

Mixing of keyword and controlled terminology

I’m surprised to find “hot dogs” in the top two “articles about dogs”, after some nice featured content. The NYTimes may also want to refine its handling of multiword terms.

Hot dogs turn up in dogs


Another example is “Baby Quasar(Skin Care Devise)” showing up under quasars.
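
To illustrate the kind of refinement I have in mind, here is a small sketch of keyword matching that first blanks out known multiword terms, so that “hot dogs” no longer counts as a match for “dogs”. I don’t know how the Times actually matches articles to topics. The function name, the list of compound terms, and the sample headlines are all made up for illustration.

```python
import re

def mentions_term(headline, term, compound_terms=()):
    """True if `term` appears as a whole word outside any listed compound term."""
    text = headline.lower()
    # Blank out known compound terms so their parts can no longer match
    for compound in compound_terms:
        text = re.sub(r"\b" + re.escape(compound.lower()) + r"\b", " ", text)
    return re.search(r"\b" + re.escape(term.lower()) + r"\b", text) is not None

# Made-up headlines, not actual Times articles
headlines = ["Hot Dogs at the Ballpark", "Training Dogs to Detect Disease"]

for headline in headlines:
    naive = mentions_term(headline, "dogs")
    refined = mentions_term(headline, "dogs", compound_terms=["hot dogs"])
    print(headline, "| naive:", naive, "| refined:", refined)
```

The catch is that someone has to maintain the list of compound terms, which is exactly the kind of work a controlled vocabulary is supposed to capture.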

By versus About

Times writers (e.g. Tom Zeller Jr.) are listed in italics and classified as people. The ‘by’ versus ‘about’ distinction is made primarily in meta tags. “PSST” seems to identify Times writers. For instance, compare the meta tags from Tom Zeller Jr.’s page:

<meta name="PT" content="Topic" />
<meta name="CG" content="Times Topics" />
<meta name="GTN" content="Zeller, Tom Jr." />
<meta name="PST" content="People" />
<meta name="PSST" content="Writer" />

to those on (non-Times) writer Toni Morrison’s page:

<meta name="PT" content="Topic" />
<meta name="CG" content="Times Topics" />
<meta name="GTN" content="Morrison, Toni" />
<meta name="PST" content="People" />
<meta name="SCG" content="The Public Editor" />
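
A scraper could use this difference to separate ‘by’ pages from ‘about’ pages. Below is a minimal sketch using Python’s standard html.parser. It assumes that a PSST value of “Writer” on a People page marks a Times writer, which is only my reading of these two examples, and it uses plain straight quotes in the sample markup. The class and function names are invented for illustration.

```python
from html.parser import HTMLParser

class MetaTagParser(HTMLParser):
    """Collect <meta name="..." content="..."> pairs into a dict."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta ... /> tags are also routed here by HTMLParser
        if tag == "meta":
            attributes = dict(attrs)
            if "name" in attributes and "content" in attributes:
                self.meta[attributes["name"]] = attributes["content"]

def is_times_writer(html):
    """Guess whether a topic page is about a Times writer (assumption: PSST == "Writer")."""
    parser = MetaTagParser()
    parser.feed(html)
    return parser.meta.get("PST") == "People" and parser.meta.get("PSST") == "Writer"

zeller = """
<meta name="PT" content="Topic" />
<meta name="GTN" content="Zeller, Tom Jr." />
<meta name="PST" content="People" />
<meta name="PSST" content="Writer" />
"""
morrison = """
<meta name="PT" content="Topic" />
<meta name="GTN" content="Morrison, Toni" />
<meta name="PST" content="People" />
<meta name="SCG" content="The Public Editor" />
"""
print(is_times_writer(zeller))    # True
print(is_times_writer(morrison))  # False
```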

Final thoughts

The world of electronic publishing blurs the lines between producers and indexers. Archival content, served up by organization, person, or topic, is a great offering. The secondary publishing market (abstracting, indexing, etc.) is changing quickly. Source-based browsing, as at NYTimes Topics, is part of that change.
