[Corpora-List] POS-tagger maintenance and improvement

Andras Kornai andras at kornai.com
Fri Feb 27 03:39:23 CET 2009

On Thu, Feb 26, 2009 at 10:36:51PM +0000, Jimmy O'Regan wrote:
> I think you'll find that their statement was carefully worded to
> merely portray the issues in the area without giving any direct,
> specific advice: these kinds of legal analyses are quite often given
> to law students to perform: they are not lawyers, and cannot give
> legal advice.

Not sure what you mean: they _are_ lawyers, pretty high-powered ones, paid by UPenn.

> You only focus on the 'economic harm' aspect; you should also consider
> that, if any of the publishers also produce corpora,

Who does? In general, publishers are lousy about publishing their material in any but the most traditional format. Some content providers (WSJ, Reuters, etc.) have a more enlightened attitude, but this is rare, and it is trivial to avoid stepping on their toes.

> or if any of the translators sell their translation memories,

After some looking I located exactly one vendor selling TM content, http://www.tmmarketplace.com, and their white paper seems to make the exact same legal argument that the UPenn lawyers made. Check it out. I may even check out their English-Hungarian material, which made me curious, but of course I wouldn't dream of including it in our corpus. (On the other hand, I wouldn't be shocked to find they are repackaging and selling our material. Well, they are welcome to it.)

> then they have a very real case where you are causing them economic harm.
> Economic harm is far from the only factor in copyright; at best, you
> simply won't be held liable to pay a large amount in damages. Who
> wants to put the work into compiling a corpus, only to be hit with a
> cease and desist notice?

You sound as if you speak from vast experience about corpus linguists getting hit with all kinds of notices, being held liable for vast amounts of damages, and in the end getting tarred, feathered, and run out of town. I would be interested in hearing about any such cases.

As for the "who wants to" question, there are always reasons not to do something. What if one of the sentences is offensive to some group of people, incites violence, or advocates breaking the law? This is quite possible; we certainly didn't check over 2m sentences by hand. What if the corpus contains defamatory statements or somebody's trade secrets? Oh, the possibilities! (TM:-)

We did it, very real lawyers said it was OK (a similar opinion, also coming from real lawyers, was discussed in Corpora #4162), and I recommend this course to anyone who prefers getting work done to getting bogged down in phantom speculation based on armchair lawyering. One rarely, if ever, sees scholars sued for publishing their corpus; the risk seems to be bearable.

A more real issue, familiar from all branches of science, is that people are often reluctant to part with their data (which took a lot of effort to gather) before they have fully exploited it themselves, and by the time they are done the material is often stale. I see a lot of debate in biology about the sharing of sequence data (which has, let us not delude ourselves, orders of magnitude more commercial value than the texts we tend to work with). I'm sure many of us have asked other workers in the field for some of the data they created and got nothing in response.

Andras Kornai
