[Corpora-List] Commercial/competitive value of training data

chris brew cbrew at acm.org
Wed Oct 22 17:28:46 CEST 2008

The premise is probably wrong. Annotated training data is expensive to produce, especially if the skills needed by the annotators are substantial. The value of an annotated data set is the information that it contains about what the correct answers are. There are probably a large number of learning algorithms and approaches that would be roughly as effective as each other at drawing this information out of the data set and making it available for deployment in an application.

The creators of the data have two advantages: the first is access, and the second is the possibility that the data might have been organized into a form that supports efficient learning with the particular software and algorithms that they have in mind to deploy. The first is a clear commercial advantage that cannot easily be nullified. But the second (the form of the data) is exactly the sort of thing that a clever programmer can readily adjust, so that advantage is easily nullified.

There is no obvious reason why an ordinary commercial company that has prepared such a data set would give away the competitive advantage associated with access to the data. Sometimes big companies such as Netflix, Google, Yahoo or Microsoft do choose to do this anyway, because they are interested in directing the attention of the research community to the problems that matter to them. This has a payoff in terms of recruitment and so on (perhaps). I think the big companies are also betting that their advantages in terms of infrastructure and sheer numbers of deployable staff will outweigh any risks of giving away competitive advantage.

On Wed, Oct 22, 2008 at 9:28 AM, Seth Grimes <grimes at altaplana.com> wrote:

> Have others considered the competitive value of training data?
> I'm referring to data that would be usable for commercial purposes, unlike
> data provided through the Linguistic Data Consortium (LDC) for research
> purposes. The trade-off for a commercial organization is the opportunity
> to recapture the expense of annotating a data set against the risk of
> accelerating time to market, or promoting a sale at one's own expense, of
> a competing product or service.
> My premise is that a software system's greatest value lies in what it can
> do with the training data rather than in the training data itself. But
> what considerations do others see?
