Best, Oliver
On 16 November 2010 21:55, Adam Kilgarriff <adam at lexmasterclass.com> wrote:
> Oliver,
> there are easier ways than taking language models beyond the sentence. (And
> even if we did, the clever spammer could use those language models in
> 'generation' mode to keep ahead of us.) One is to piggyback on the large
> amount of work that Google and Bing do to stay ahead of the spammers, e.g. by
> using BootCaT. They are putting lots of effort into not giving spam as top
> search hits, so if we use pages that they propose, we avoid most spam.
>
> Adam
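
A minimal sketch of the query-building half of that idea, assuming the BootCaT-style procedure of combining seed words into random tuples and sending each tuple to a search engine as a query. The function names (`seed_tuples`, `to_query`) and the seed words are made up for illustration, and the search step itself is left out because it depends on whichever search API is available.

```python
import itertools
import random
from typing import List, Sequence, Tuple


def seed_tuples(seeds: Sequence[str], size: int = 3, count: int = 10,
                rng_seed: int = 0) -> List[Tuple[str, ...]]:
    """Pick `count` distinct random combinations of `size` seed words.

    Hypothetical helper: the defaults are illustrative, not BootCaT's own.
    """
    # Enumerate every unordered combination of distinct seed words.
    combos = list(itertools.combinations(sorted(set(seeds)), size))
    # Shuffle with a fixed seed so the selection is reproducible.
    rng = random.Random(rng_seed)
    rng.shuffle(combos)
    return combos[:count]


def to_query(words: Tuple[str, ...]) -> str:
    """Join one tuple into a plain search-engine query string."""
    return " ".join(words)


if __name__ == "__main__":
    # Illustrative seed words only.
    seeds = ["corpus", "lemma", "collocation", "concordance", "token"]
    for t in seed_tuples(seeds, size=3, count=4):
        print(to_query(t))
```

Each query string would then go to the search engine, and the top hits become candidate pages for the corpus.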
> On 16 November 2010 14:54, Oliver Mason <O.Mason at bham.ac.uk> wrote:
>>
>> I believe language models need to take structure beyond the sentence
>> into account. Then it would be fairly obvious that you're looking at a
>> list of sentences rather than a text; just as we can already
>> distinguish between a list of words and a proper sentence.
>>
>> The problem, then, is how to push language models up one level...
>>
>> Oliver
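
One crude way to look above the sentence, short of a real discourse-level language model, is to measure how much vocabulary adjacent sentences share. The sketch below is only an assumed proxy, not a method proposed in this thread: the stopword list, the function names and the choice of Jaccard overlap are all illustrative.

```python
import re
from typing import List, Set

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "has", "on", "of", "and", "to", "in",
             "her", "down", "is", "it"}


def content_words(sentence: str) -> Set[str]:
    """Lowercase the sentence and keep the non-stopword tokens as a set."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return {w for w in tokens if w not in STOPWORDS}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Share of words common to both sets; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def adjacent_cohesion(sentences: List[str]) -> float:
    """Mean Jaccard overlap between each sentence and the one after it."""
    sets = [content_words(s) for s in sentences]
    pairs = list(zip(sets, sets[1:]))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    coherent = ["The phone has a bright screen.",
                "The screen resists scratches.",
                "Scratches rarely show on the phone."]
    mixed = ["The phone has a bright screen.",
             "The carriage rattled down the lane.",
             "Her aunt poured the tea."]
    print(adjacent_cohesion(coherent), adjacent_cohesion(mixed))
```

A page of randomly mixed sentences would be expected to score near zero. A spammer who repeats the product name in every sentence would push the score up, so this is a starting point rather than a defence.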
>>
>> On 16 November 2010 12:34, Justin Washtell <lec3jrw at leeds.ac.uk> wrote:
>> > Hi Serge,
>> >
>> > I can think of one or two half-hearted angles of attack, but nothing off
>> > the top of my head which couldn't readily be out-foxed by the very next wave
>> > of link-spammers. Indeed, any half-decent language models we do develop are
>> > ripe for direct exploitation by the spammers. Given that very fundamental
>> > trait of language, its generative capacity, I am inclined to think that the
>> > spammers have the upper hand in this one. It's a bit like a war between
>> > viruses and anti-virus software, except in a world where a "legitimate"
>> > program is largely defined by the fact that it self-replicates and
>> > self-obfuscates. My initial suspicion is therefore that this is a genuinely
>> > hard - borderline impossible - problem. Mind you, that's exactly what makes
>> > it interesting... so I shall give it some more thought :-)
>> >
>> > Justin Washtell
>> > University of Leeds
>> >
>> > ________________________________________
>> > From: corpora-bounces at uib.no [corpora-bounces at uib.no] On Behalf Of Serge
>> > Sharoff [s.sharoff at leeds.ac.uk]
>> > Sent: 16 November 2010 09:12
>> > To: corpora at uib.no
>> > Subject: [Corpora-List] Deviations in language models on the web
>> >
>> > Dear all,
>> >
>> > in doing webcrawls for linguistic purposes, I recently came across an
>> > approach to link spamming or SEO that involves taking
>> > sentences from a large range of texts (mostly out-of-copyright fiction),
>> > mixing the sentences randomly, injecting the name of a product (or other
>> > keywords) and creating thousands of webpages.
>> >
>> > The intent is probably to fool search engines into thinking these are
>> > product reviews or descriptions, but the implication for linguistics is
>> > that we get polluted language models, in which mobile phones collocate
>> > with horse drawn carriages.
>> >
>> > SEO-enhanced pages I came across in the past contained random word lists
>> > with keywords injected. It was possible to deal with such cases by
>> > n-gram filtering. However, this simple trick doesn't work any longer,
>> > as the sentences are to a very large extent entirely grammatical.
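
The n-gram filtering described above could look roughly like the sketch below, assuming a set of bigrams collected from trusted text. The function names and the 0.3 threshold are illustrative assumptions, not taken from any actual pipeline. The sketch also shows why the trick stops working on shuffled sentences: almost every bigram inside a grammatical sentence is known, and only the seams between sentences are not.

```python
from typing import Iterable, List, Set, Tuple

Bigram = Tuple[str, str]


def bigrams(tokens: List[str]) -> List[Bigram]:
    """Return the adjacent word pairs of a token list."""
    return list(zip(tokens, tokens[1:]))


def train_bigrams(sentences: Iterable[str]) -> Set[Bigram]:
    """Collect every bigram seen in the trusted reference sentences."""
    seen: Set[Bigram] = set()
    for sentence in sentences:
        seen.update(bigrams(sentence.lower().split()))
    return seen


def known_bigram_ratio(text: str, seen: Set[Bigram]) -> float:
    """Fraction of the text's bigrams that also occur in the reference."""
    grams = bigrams(text.lower().split())
    if not grams:
        return 0.0
    return sum(g in seen for g in grams) / len(grams)


def looks_like_word_list(text: str, seen: Set[Bigram],
                         threshold: float = 0.3) -> bool:
    """Flag text whose known-bigram ratio is below the assumed threshold."""
    return known_bigram_ratio(text, seen) < threshold


if __name__ == "__main__":
    reference = ["the horse drew the carriage along the road",
                 "she bought a new mobile phone yesterday"]
    seen = train_bigrams(reference)
    word_list = "phone road carriage yesterday bought horse mobile"
    shuffled = ("she bought a new mobile phone yesterday "
                "the horse drew the carriage along the road")
    print(looks_like_word_list(word_list, seen))  # random word list
    print(looks_like_word_list(shuffled, seen))   # mixed whole sentences
```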
>> >
>> > Any experience from others, and suggestions on how to deal with this
>> > phenomenon, would be welcome.
>> >
>> > Best,
>> > Serge
>> >
>> >
>> > _______________________________________________
>> > Corpora mailing list
>> > Corpora at uib.no
>> > http://mailman.uib.no/listinfo/corpora
>> >
>>
>>
>>
>> --
>> Dr Oliver Mason
>> Technical Director of the Centre for Corpus Research
>> Head of Postgraduate Studies (Doctoral Research)
>> School of English, Drama, and ACS
>> The University of Birmingham
>> Birmingham B15 2TT
>>
>
>
>
> --
> ================================================
> Adam Kilgarriff
> http://www.kilgarriff.co.uk
> Lexical Computing Ltd http://www.sketchengine.co.uk
> Lexicography MasterClass Ltd http://www.lexmasterclass.com
> Universities of Leeds and Sussex adam at lexmasterclass.com
> ================================================
>
--
Dr Oliver Mason
Technical Director of the Centre for Corpus Research
Head of Postgraduate Studies (Doctoral Research)
School of English, Drama, and ACS
The University of Birmingham
Birmingham B15 2TT