Thursday, October 4, 2007

Dark data and grey literature

Thomas Goetz writes in Wired about “dark data,” the negative results that come out of many or most scientific experiments. The traditional publication process favors results that establish correlations, so studies that find no correlation generally go unreported. Yet such studies can be useful as well, establishing that a particular off-label use of a drug is ineffective or showing that gay marriage doesn’t actually cause the collapse of, well, anything.

Negative results are of less general interest, but are incredibly important in assessing particular hypotheses. They are also a natural consequence of broad scientific research. The sort of research that can find startling results is often the sort of research that examines factors and variables that we would not expect to be correlated. Our intuition that there is not a correlation is frequently borne out, but we should not discourage counter-intuitive inquiry.

The biggest obstacle to publishing dark data has been the lack of space in journals. If space is limited, why not publish positive results rather than negative ones? The traditional grey literature of departmental working papers and conference proceedings is a farm system for the journals, where results and ideas compete for limited attention as the best ones graduate to the big show. Goetz correctly points out that the web provides plenty of space to host massive quantities of data, but he’s wrong when he posits that this is a full solution to the dark data problem. The web has allowed an enormous expansion of the grey literature, extending downwards into rough drafts. Publishing data is a further step down from that if the data are not accompanied by full explanations of the methodology used to collect and analyze them, and if that methodology is not scrutinized by a rigorous peer review process. Goetz falls into a common misunderstanding of science in believing that data show truth independent of methodology. Any publishing scientist will tell you that there is far more to publication than dumping data at a journal’s doorstep. Dumping data on the web is not sufficient either, no matter what type of data you’re dumping.
