The Open Data licensing issue

May 11, 2008

A little tied up this weekend, so will keep it brief. I have added a number of comments on Friendfeed to posts I have shared from Google Reader about what the licensing of data should be.

The whole thing started with Antony Williams announcing CC support for data on ChemSpider. That was followed by a chain of events and a ton of confusion. Let me add my voice to this debate, since Open Data is near and dear to my heart.

I classify scientific data into the following categories:

  • Raw data: This is the kind of data deposited in Tranche, or RCSB, or GenBank. Sequence data, structural data, raw proteomics data. There are associated metadata that are required for quality and reproducibility.
  • Processed data: These are the results of doing something with the raw data, e.g. molecular simulation results from a PDB structure. Processed data form a continuum.

I can’t help but agree with John Wilbanks. Here is the part that all of us should read again and again:

The public domain is not an “unlicensed commons”. The public domain does not equal the BSD. It is not a licensing option.

It is the natural legal state of data.

It is a damn shame that we no longer think of the public domain as an option that is attractive. It’s a sign of the victory of the content holders that the free licensing movements work against that something without a license – something that is truly free, not just free “as in” – is somehow thought to be worse. We’ve bought into their games if we allow the public domain to be defined as the BSD. The idea of the public domain has been subjected to continuous erosion thanks to both the big content companies and our own movements, to the point where we think freedom only comes in a contract.

The public domain is not contractually constructed. It just is. It cannot be made more free, only less free. And if we start a culture of licensing and enclosing the public domain (stuff that is actually already free, like the human genome) in the name of “freedom” we’re playing a dangerous game.

The public domain is the natural place for raw scientific data. That’s where it belongs and always has been. We, myself included, have been guilty of making things more complicated than they need to be. There is a data commons already. Our goal should be to make sure people respect it, and make data available in ways that we can take advantage of it.

Our discussion on content licensing should be limited to processed data, i.e. what we do with data in the public domain. There, we need to allow people to make choices, but keep the raw data unfettered. Those who want to associate copyleft licenses with raw data are being dogmatic. Scientific data doesn’t have to be viral or anything like that; it’s there for the greater scientific good, and there’s only one logical mechanism for it. In fact, I would argue that putting copyleft on it (a sequenced genome doesn’t belong to anyone) is as wrong as full-on copy protection. You may have some embargo on making it publicly available, especially with things like structures where you might want to do something with the data before anyone else, but in the end the data belong in the public domain.

I would like to thank John for putting this down so emphatically and clearly. A lot of us have been saying the same thing for a while, but this is the clearest distillation that I’ve read yet.

That does not mean we can skip the discussion about how we make content (not raw data, but follow-on content) available, and what the implications are. Antony was confused for good reason.

Further reading

More from John
Cameron Neylon
Egon Willighagen
More from Egon
Bill Hooker
Web as platform: Bret Taylor on Open Data
Open Science and licensing
Protocol for implementing open access data
bbgm post on protocol for open data
