Tuesday, April 8, 2008

Of data and metadata

Anand Rajaraman on Datawocky writes about improving the sorting of database search results in contexts such as Netflix recommendations and Google search results. He concludes that adding independent data is much more important than improving algorithms: Netflix recommendations are improved by adding movie genre information from IMDb, and Google search results are improved by adding links and anchor text. This has been borne out in linguistics as well, where brute-force statistical analysis (data) has defeated sophisticated theoretical analysis (algorithms) in allowing computers to process speech and text, and corpus linguistics has grown rapidly.

Rajaraman’s specific examples are really about adding metadata rather than adding data. When choosing among a set of movies, it makes sense that it would help to know more about the movies. Google uses inbound links and the text used in those links to learn more about a web page. Google has also succeeded by increasing the size of the database (indexing more web pages), but a useful ranking of search results is even more important when the set of search results is larger. Simply adding data to the database has made searching Amazon far more frustrating in my experience, because the set of search results is now far larger and Amazon’s ranking of those results rarely matches what I’m really looking for. Once I’ve found a particular book, it’s great to have metadata, descriptions, reviews, and the full text of the book. But none of that helps if I can’t find the book in the first place.

Rajaraman’s post got me thinking in several directions: Could we have Netflix-style personalized recommendations for YouTube? Isn’t this all an argument in favor of Total Information Awareness and its spawn? Wouldn’t spell-checking be better if the data on corrections were aggregated the way that Google adjusts search results and ads based on click-throughs? How user-friendly and transparent could we make privacy trade-offs, so that people could decide in a reasonable way what data to pass along to Netflix and actually understand what will happen to that data? And when are we going to stop using terms like “data mining” and start talking about “database smart growth strategies”?
