As for new training data, Chelba et al.'s arXiv paper 'One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling' <http://arxiv.org/abs/1312.3005> may be of interest, along with work by the Web As Corpus SIG: http://www.aclweb.org/anthology/sigwac.html
But as Adam says, there are many more issues here, including Anglo-centrism (note the Universal Dependencies <http://universaldependencies.github.io/docs/> project), representativeness / sample bias (note: Temnikova et al 2014 <http://www.lrec-conf.org/proceedings/lrec2014/pdf/675_Paper.pdf>, Hovy, Plank & Søgaard 2014 <http://www.lrec-conf.org/proceedings/lrec2014/pdf/476_Paper.pdf>), and so on.
On 2 February 2015 at 04:15, Adam Kilgarriff <adam.kilgarriff at sketchengine.co.uk> wrote:
> > surely the adequacy of the training data must be as important a factor
> > as the algorithm.
> Absolutely! Well said, Ciarán. I often think, if NLP people put as much
> effort into the training data as the algorithms, our systems would perform
> much better.
> There's so much more I could say on the subject (on the problem of
> language-specific work in Computer Science, on what's open and what's
> proprietary, on generalising across text types, on statistical vs
> rule-based systems and the in-built biases in favour of statistical ones
> when we compare performance figures) - but the post would quickly get
> much too long.
> On 1 February 2015 at 17:26, Ciarán Ó Duibhín <
> ciaran at oduibhin.freeserve.co.uk> wrote:
>> Thanks to Alexandr for a very interesting survey.
>> Tagging programs (algorithms) for a given language are often compared,
>> but surely the adequacy of the training data must be as important a factor
>> as the algorithm. As far as I can see, most available models for English
>> are trained on the Wall Street Journal, which is a rather restrictive
>> domain. I use such a model myself when tagging "general" English text, and
>> unsurprisingly it mistags words whose senses would have been rare in the
>> training data.
>> Take as a simple example "The dog bit the postman." A WSJ-trained model
>> is likely to tag "bit" as a noun. Likewise, words such as "round" or
>> "crude" are likely to be tagged as nouns rather than the more common (in
>> general text) adjective.
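The "bit"-as-noun failure mode is easy to reproduce in miniature. The thread doesn't name the tagger or its internals, so the sketch below uses a toy unigram tagger with invented WSJ-style frequency counts (in financial/technical text "bit" is dominated by the noun sense, as in "64-bit"): a model that assigns each word its most frequent training tag will tag the verb "bit" as NN.

```python
# Toy illustration of domain bias in a unigram tagger. The "training
# counts" below are invented for illustration, not from any real corpus.
from collections import Counter

wsj_like_counts = {
    "the": Counter({"DT": 1000}),
    "dog": Counter({"NN": 5}),
    "bit": Counter({"NN": 40, "VBD": 2}),   # noun sense dominates in-domain
    "postman": Counter({"NN": 1}),
}

def unigram_tag(tokens):
    """Assign each token its most frequent training tag (NN if unseen)."""
    return [
        (t, wsj_like_counts.get(t.lower(), Counter({"NN": 1})).most_common(1)[0][0])
        for t in tokens
    ]

print(unigram_tag("The dog bit the postman".split()))
# "bit" comes out as NN, not the correct past-tense verb VBD
```

Real taggers use context as well as unigram frequencies, but the underlying bias is the same: lexical statistics learned from one domain carry over to text where they no longer hold.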
>> The tagger I use, and its supplied WSJ model, have been around for years,
>> and it surprises me that little effort is being made (that I am aware of)
>> to improve the model. I regularly see this tagger being used on large
>> corpora of English, and while the training model is not mentioned, I would
>> assume it is the supplied WSJ model, and I doubt whether extensive manual
>> post-editing follows.
>> Comments anyone? Perhaps there are taggers which have a more
>> "domain-independent" model of English than that provided by WSJ? Is there
>> a survey of this aspect?
>> Ciarán Ó Duibhín.
>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>> Corpora mailing list
>> Corpora at uib.no
> Adam Kilgarriff <http://www.kilgarriff.co.uk/>
> adam at sketchengine.co.uk
> Director Lexical Computing Ltd
> Visiting Research Fellow University of Leeds
> Blog <http://blog.kilgarriff.co.uk> at blog.kilgarriff.co.uk
> *Corpora for all* with the Sketch Engine <http://www.sketchengine.co.uk/>
> and SKELL <http://skell.sketchengine.co.uk/>