They haven't provided a separate written statement (nor have we asked for one), but they did explain their reasoning. Let R be the copyright holder of some work, B be a potential buyer, and M be the maker of the corpus. The prime mover behind copyright cases is economic harm. As long as M sells copyrighted material, or even gives it away, M is taking away B's reason to buy from the source that would pay royalties to R, so M is causing economic harm. Here it is clear that no harm is done, since the users of your corpus have not actually gained access to the copyrighted work and the corpus can't be exploited for pirate editions.
> Actually, I just looked up the licence agreement for the Hunglish
> "1.2. User shall not publish, retransmit, display, redistribute,
> reproduce or commercially exploit the Data in any form, except that User
> may include limited excerpts from the Data in articles, reports and
> other documents describing the results of User’s linguistic education
> and research. "
> So I guess the answer to my question is no.
This is the generic LDC policy, and again it doesn't enjoin you from the main goal you'd want to use a corpus for, namely training and testing computational linguistic models. Whether using the trained system in a for-profit system would be infringing I'm not sure, IANAL. But the world is full of systems that were optimized on LDC corpora, probably because these works, from an economic standpoint, do not harm the copyright holders. From a legal standpoint I'm not sure; this may even depend on the laws of the country you are in, but in a large corpus the impact of any single work on training is so minimal that "de minimis non curat lex" is probably applicable.
So the WSJ could possibly come after you if you used a model trained only on the WSJ in a commercial system (I say possibly since you still have the "transformative use" defense), but why would you ever want to do such a thing? A pure WSJ model already shows signs of strain on the NYT, and if your goal is a system that works on journalistic prose you are far better off training it on a broad mixture of newspaper sources. If, on the other hand, your goal is to do something value-added specifically for WSJ readers, you should be getting the opinion of WSJ lawyers anyway.
Andras Kornai, NAL
PS. In the hope of steering the conversation back to Adam's original point, let me say here that even if one were inclined to dispute the statement that the use of some copyrighted work is de minimis, surely corrections to this work are de minimis!