In text classification practice, we prefer feature selection, or even distributional word clustering, over stop lists as a way of managing feature vector size when that is necessary.
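A minimal sketch of what feature selection can look like, assuming a chi-square score over binary term presence as the selection criterion (the criterion is not prescribed here; it is just one common choice). The toy corpus, the labels and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical toy corpus of (label, text) pairs, for illustration only.
DOCS = [
    ("spam", "win money now"),
    ("spam", "win a prize now"),
    ("ham", "meeting at noon"),
    ("ham", "lunch at noon now"),
]


def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 term/class contingency table.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def select_features(docs, target, k):
    """Return the k terms most associated (either way) with `target`."""
    in_class, out_class = Counter(), Counter()
    n_in = n_out = 0
    for label, text in docs:
        terms = set(text.lower().split())  # presence, not frequency
        if label == target:
            n_in += 1
            in_class.update(terms)
        else:
            n_out += 1
            out_class.update(terms)

    scores = {}
    for term in set(in_class) | set(out_class):
        a, b = in_class[term], out_class[term]
        scores[term] = chi_square(a, b, n_in - a, n_out - b)

    # Highest score first; ties broken alphabetically for a stable result.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


selected = select_features(DOCS, "spam", 3)
```

Terms are kept or dropped by how well they separate the classes in the data at hand, not because they appear on a fixed list, so the vector size is controlled through `k`.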
-----Original Message-----
From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of Trevor Jenkins
Sent: Thursday, February 14, 2008 11:34 AM
To: corpora at uib.no
Subject: Re: [Corpora-List] stopword lists for norwegian and danish
On Mon, 11 Feb 2008, Roxana Angheluta <roxana at attentio.com> wrote:
> I am looking for stopword lists for Norwegian and Danish.
Sorry can't help with lists. But I'd like to swap hats from corpora to computing science and enquire why you're looking for such lists.
Back when I worked in the R&D of a major text retrieval system we deliberately did not support stop-lists; they increased the code complexity, the original intent behind stop-lists (of reducing the size of inverted index files) was no longer relevant with large discs and compressed indices, and more importantly very few end-users understood their purpose. During my 15 years with the company we only encountered one real requirement for imposing stop lists, which was to obfuscate controversial word usage by a British PM.
There are some anecdotal examples of English phrases where stop lists should not be applied: "Lloyds of London" (the insurance market), "Prince of Wales". The Lloyds example is particularly troublesome because Lloyds of London is situated in the City of London not far from the headquarters of Lloyds Bank, which is a separate institution, and around the corner from Lloyds the chemist. Removing "of" would leave only the adjacent words "Lloyds London", which fits all three and creates an ambiguity that cannot be resolved easily, if at all.
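The adjacency problem can be shown with a small sketch. Everything in it is hypothetical: the two-word stop list, the sample sentences and the function names are made up for illustration and do not describe any particular retrieval system.

```python
# Hypothetical, deliberately tiny stop list for illustration.
STOP_WORDS = {"of", "the"}

# Hypothetical documents: one about the insurance market, one about a bank.
DOCS = {
    "insurance": "a syndicate at Lloyds of London underwrote the risk",
    "bank": "the Lloyds London branch approved the loan",
}


def tokenize(text):
    """Lowercase and split on whitespace."""
    return text.lower().split()


def strip_stop_words(tokens):
    """Drop every token that appears in the stop list."""
    return [t for t in tokens if t not in STOP_WORDS]


def contains_phrase(tokens, phrase):
    """True if `phrase` occurs in `tokens` as a run of adjacent words."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


def phrase_search(docs, query, use_stop_list):
    """Return the names of documents containing the query as a phrase."""
    phrase = tokenize(query)
    if use_stop_list:
        phrase = strip_stop_words(phrase)
    hits = []
    for name, text in docs.items():
        tokens = tokenize(text)
        if use_stop_list:
            tokens = strip_stop_words(tokens)
        if contains_phrase(tokens, phrase):
            hits.append(name)
    return hits


without_stop_list = phrase_search(DOCS, "Lloyds of London", use_stop_list=False)
with_stop_list = phrase_search(DOCS, "Lloyds of London", use_stop_list=True)
```

Without the stop list the phrase query matches only the insurance document. With it, the query shrinks to "lloyds london" and the bank document matches as well.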
<>< Re: deemed!
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora