[Corpora-List] stopword lists for norwegian and danish

Trevor Jenkins trevor.jenkins at suneidesis.com
Thu Feb 14 19:34:28 CET 2008


On Mon, 11 Feb 2008, Roxana Angheluta <roxana at attentio.com> wrote:


> I am looking for stopword lists for Norwegian and Danish.

Sorry, I can't help with the lists themselves. But I'd like to swap hats from corpora to computing science and enquire why you're looking for them.

Back when I worked in the R&D of a major text retrieval system we deliberately did not support stop-lists, for three reasons: they increased the code complexity; their original purpose (reducing the size of inverted index files) was no longer relevant with large discs and compressed indices; and, more importantly, very few end-users understood what they were for. During my 15 years with the company we encountered only one real requirement for imposing a stop-list, which was to obfuscate controversial word usage by a British PM.

There are some anecdotal examples of English phrases where stop-lists should not be applied: "Lloyds of London" (the insurance market), "Prince of Wales". The Lloyds example is particularly troublesome because Lloyds of London is situated in the City of London, not far from the headquarters of Lloyds Bank, which is a separate institution, and around the corner from Lloyds the chemist. Removing "of" would leave the words "Lloyds" and "London" adjacent in text about any of the three, creating an ambiguity that cannot be resolved easily, if at all.
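To make the adjacency point concrete, here is a minimal Python sketch of a phrase search with and without a stop-list. Everything in it is an assumption made up for illustration: the three toy documents, the tiny stop-list, and the function names. It is not how any particular retrieval system works.

```python
# Hypothetical toy stop-list and documents, invented for illustration only.
STOP_WORDS = frozenset({"a", "of", "the", "in"})

DOCS = {
    "insurance": "Lloyds of London underwrites marine risks",
    "bank": "the headquarters of Lloyds in London",
    "chemist": "a branch of Lloyds the London chemist",
}


def tokenize(text, stop_words=frozenset()):
    """Lowercase, split on whitespace, and drop any stop words."""
    return [w for w in text.lower().split() if w not in stop_words]


def phrase_match(doc_tokens, query_tokens):
    """Return True if query_tokens occur as a contiguous run in doc_tokens."""
    n = len(query_tokens)
    return any(doc_tokens[i:i + n] == query_tokens
               for i in range(len(doc_tokens) - n + 1))


def search(query, stop_words=frozenset()):
    """Return the sorted names of documents containing the query as a phrase."""
    q = tokenize(query, stop_words)
    return sorted(name for name, text in DOCS.items()
                  if phrase_match(tokenize(text, stop_words), q))


if __name__ == "__main__":
    print(search("Lloyds of London"))              # no stop-list
    print(search("Lloyds of London", STOP_WORDS))  # stop-list applied
```

Without the stop-list only the insurance document matches the phrase. With it, both the query and the documents collapse to "lloyds london", and all three documents match.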

YMMV.

Regards, Trevor

<>< Re: deemed!


