While doing web crawls for linguistic purposes, I recently came across an approach to link spamming / SEO that involves taking sentences from a wide range of texts (mostly out-of-copyright fiction), shuffling the sentences randomly, injecting the name of a product (or other keywords), and generating thousands of web pages.
The intent is presumably to fool search engines into treating these pages as product reviews or descriptions, but the implication for linguistics is that we end up with polluted language models, in which mobile phones collocate with horse-drawn carriages.
SEO-enhanced pages I came across in the past contained random word lists with keywords injected. It was possible to deal with such cases by n-gram filtering. However, this simple trick no longer works, as the sentences in the new spam are to a very large extent entirely grammatical.
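For what it's worth, the kind of filter I mean is roughly the following: score each page by the share of its trigrams attested in a trusted reference corpus, and discard pages scoring near zero. This is only a minimal Python sketch with illustrative names and threshold, not a production filter, and it shows exactly why the trick fails now: pages built from whole grammatical sentences score just as well as genuine text.

    from collections import Counter

    def trigrams(tokens):
        return zip(tokens, tokens[1:], tokens[2:])

    def build_reference(corpus_sentences):
        """Collect attested trigrams from a trusted reference corpus."""
        ref = Counter()
        for sent in corpus_sentences:
            ref.update(trigrams(sent.lower().split()))
        return ref

    def attested_ratio(page_text, ref):
        """Fraction of the page's trigrams seen in the reference corpus.

        Random word lists produce mostly unattested trigrams and score
        near 0; sentence-shuffled spam passes this check unharmed.
        """
        grams = list(trigrams(page_text.lower().split()))
        if not grams:
            return 0.0
        return sum(1 for g in grams if g in ref) / len(grams)

    # Hypothetical usage, with a threshold one would tune on held-out data:
    # ref = build_reference(line for line in open("reference.txt"))
    # keep = [p for p in pages if attested_ratio(p, ref) > 0.2]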
Does anyone have experience with this phenomenon, or suggestions on how to deal with it?