I wonder whether there is a measure for assessing the heterogeneity of a particular corpus, for example from a semantic or structural point of view. The background to my question is my LinkedIn contribution (see below). I would appreciate it if you would share your ideas. Thanks in advance!
------------------------------------------- Genre identification vs. opinion mining
You may know that Marina Santini is working on automatic genre identification; I was exploring opinion mining in my thesis. Recently, I made a controversial statement: it doesn't matter much which algorithm you use for automatic identification -- the significant issue is feature extraction. In fact, in my experiments I found that NaiveBayes or SVM are good enough -- sometimes NaiveBayes is better, sometimes SVM, but these two are always the usual suspects for classification. What significantly changes classification results is feature extraction.
Since genre identification and opinion mining are similar tasks (the data are texts and the result is obtained through statistical analysis), I asked Marina to give me her data ( http://www.nltg.brighton.ac.uk/home/Marina.Santini/ ) to test whether I would get similar results on genre classification using "my" features, as was the case in opinion mining. For simplicity, I extracted only stopwords.
I performed a brief analysis of Marina's corpus. I used my InfoFramework to process the data, which contain 1400 HTML files corresponding to 7 genres -- BLOG, ESHOP, FAQS, FRONTPAGE, LISTING, PHP, SPAGE. I built my dataset automatically, extracting features that correspond to the 526 stopwords in WEKA. I compared the results with those for Marina's dataset, where SMO gave 89% recall and 89.07% precision, and NaiveBayes gave 67.14% recall and 68.86% precision.
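To make the idea of stopword-only features concrete, here is a minimal sketch in Python. It is not the InfoFramework or WEKA pipeline used above: the short stopword list, the crude tag-stripping regex, and the function name `stopword_features` are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical mini stopword list for illustration only;
# the experiment described above used 526 stopwords from WEKA.
STOPWORDS = ["the", "a", "of", "and", "to", "in", "is", "you", "how", "what"]

TAG_RE = re.compile(r"<[^>]+>")    # crude HTML tag stripper (assumption)
TOKEN_RE = re.compile(r"[a-z']+")  # lowercase word tokens


def stopword_features(html, stopwords=STOPWORDS):
    """Return the relative frequency of each stopword in one HTML document."""
    text = TAG_RE.sub(" ", html).lower()
    tokens = TOKEN_RE.findall(text)
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for empty pages
    return [counts[w] / total for w in stopwords]


doc = ("<html><body><h1>FAQ</h1>"
       "<p>How do you return the item? What is the fee?</p></body></html>")
features = stopword_features(doc)
```

Each document becomes one fixed-length vector (one value per stopword), which can then be handed to any classifier such as NaiveBayes or an SVM.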
The main news: the corpus is very unusual. I have already analyzed several corpora, and the result was always about three times the chance level. So for a corpus with 9 classes the classification result was about 3 x 11.1% = 33.3%.
In the case of Marina's corpus, it is something different. The results using SMO were unexpectedly high: 65.27% recall and 71.27% precision, which is about five times the chance level. The NaiveBayes results are similar: 55.79% recall and 59.67% precision. I optimized my dataset using FFS and obtained 56.57% (-8.72%) recall and 65.64% precision using SMO, and 59.64% (+3.85%) recall and 64.4% precision using NaiveBayes. Although I did not expect to be able to optimize Marina's dataset, I also ran FFS optimization on it. For SMO, I got 87.79% (-1.21%) recall and 87.9% precision. Surprisingly, NaiveBayes improved significantly, to 80.14% (+13%) recall and 79.77% precision -- I even repeated the classification manually in the WEKA GUI.
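The "times chance" comparison can be checked with a few lines of Python. This is only a sketch of the arithmetic, assuming that the chance level means uniform random guessing over the classes (100% divided by the number of classes); the function names are hypothetical.

```python
def chance_level(n_classes):
    """Percentage expected from uniform random guessing over n_classes."""
    return 100.0 / n_classes


def times_chance(score_percent, n_classes):
    """Express a score (in %) as a multiple of the chance level."""
    return score_percent / chance_level(n_classes)


# Earlier corpora: 9 classes, results about three times chance.
print(round(chance_level(9), 1))       # 11.1
print(round(3 * chance_level(9), 1))   # 33.3

# Marina's corpus: 7 genres, my stopword dataset.
print(round(times_chance(65.27, 7), 2))  # SMO recall: 4.57
print(round(times_chance(71.27, 7), 2))  # SMO precision: 4.99
print(round(times_chance(55.79, 7), 2))  # NaiveBayes recall: 3.91
print(round(times_chance(59.67, 7), 2))  # NaiveBayes precision: 4.18
```

With 7 genres the chance level is about 14.3%, so the SMO precision is almost exactly five times chance, while the recall values are somewhat lower multiples.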
The classification results for my dataset are amazingly high if we consider that my features are extracted in groups and Marina's are not. I assume it is even possible to improve the classification results further. In my opinion, such an improvement may be the result of the corpus composition; however, I would appreciate it if you told me your opinion.
Alexander