[Corpora-List] InfoFramework

Alexander Osherenko osherenko at gmx.de
Tue Apr 10 12:25:06 CEST 2012


Hi George,

In my thesis on opinion mining I considered features that are better known from authorship attribution and assessed their impact on the classification of opinions in texts. I call this group stylometric features; it comprises word-length features, letter features, the standard deviation of sentence lengths, the standard deviation of word lengths, and digrams. I extracted these stylometric features for opinion mining in 4 textual corpora.
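
As a rough illustration of the feature group listed above, here is a minimal Python sketch. It is not the framework's code: the function name stylometric_features, the naive sentence splitter, the restriction to ASCII letters, and the reading of "digrams" as letter digrams inside words are all assumptions made for the example.

```python
import re
import statistics
from collections import Counter


def stylometric_features(text):
    """Return simple stylometric features for one text (illustrative only)."""
    # Assumption: sentences end with ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Assumption: a word is a run of ASCII letters.
    words = re.findall(r"[A-Za-z]+", text)
    word_lengths = [len(w) for w in words]
    sentence_lengths = [len(re.findall(r"[A-Za-z]+", s)) for s in sentences]

    # Assumption: "digrams" means letter digrams inside words, lower-cased.
    digrams = Counter(
        w[i:i + 2].lower() for w in words for i in range(len(w) - 1)
    )
    return {
        "word_length_counts": Counter(word_lengths),
        "letter_counts": Counter(c for w in words for c in w.lower()),
        # Population standard deviation; 0.0 when there is nothing to measure.
        "std_sentence_length": (
            statistics.pstdev(sentence_lengths) if sentence_lengths else 0.0
        ),
        "std_word_length": (
            statistics.pstdev(word_lengths) if word_lengths else 0.0
        ),
        "digram_counts": digrams,
    }
```

Calling stylometric_features on a single movie review returns one dictionary of feature values; a real study would of course use a proper tokenizer and sentence splitter.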

However, the framework works with any type of information, not only with texts. In the generation step it uses instances of custom classes and a custom generator function to generate datasets. These classes and the function are, of course, modality-DEPENDENT in every sense and can be added or removed as necessary. After generation, the processing is fully modality-INDEPENDENT.

For example, in opinion mining the framework generates datasets from combinations of instances of custom classes: one for word lengths, one for sentence lengths, one for the standard deviation of lengths, etc. The generator function returns sequences of the information to be analyzed (in my case, movie reviews).

Hence, to work with new data I implement custom feature classes that conform to a specific class interface and can be considered parts of a data mining instance. The custom generator function supplies the information to process. For instance, to work with neurobiological data I would implement neurobiological classes that represent features of brain regions and a generator function that returns sequences of useful neurobiological information segments. In opinion mining these classes and the generator function were often very small, so pilot studies of new corpora required very little time, because steps such as evaluating, fusing, and optimizing the datasets are already implemented. In your case, you would write a PERL-Jython wrapper class and let the framework do its work.
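
The split between custom feature classes and a generator function might look roughly like the following Python sketch. The interface shown here is hypothetical: the names Feature, extract, generate_segments and build_dataset are invented for illustration and are not taken from the framework.

```python
from abc import ABC, abstractmethod


class Feature(ABC):
    """Hypothetical interface that every custom feature class implements."""

    name = "feature"

    @abstractmethod
    def extract(self, segment):
        """Return a numeric value for one information segment."""


class MeanWordLength(Feature):
    name = "mean_word_length"

    def extract(self, segment):
        # Whitespace tokens, punctuation included (a simplification).
        words = segment.split()
        return sum(len(w) for w in words) / len(words) if words else 0.0


class SentenceCount(Feature):
    name = "sentence_count"

    def extract(self, segment):
        # Counts sentence-final marks as a crude proxy for sentences.
        return sum(segment.count(mark) for mark in ".!?")


def generate_segments():
    """Hypothetical generator function: yields the segments to analyze."""
    yield "Great film. Loved it!"
    yield "Dull plot."


def build_dataset(features, segments):
    """Modality-independent step: one row of feature values per segment."""
    return [{f.name: f.extract(s) for f in features} for s in segments]


rows = build_dataset([MeanWordLength(), SentenceCount()], generate_segments())
```

Only the Feature subclasses and generate_segments know anything about text; build_dataset would work unchanged if they were replaced by classes and a generator for another modality.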

Next, what does the framework actually do with the custom classes? It instantiates them and composes all non-empty combinations of the class instances, which gives 2^N - 1 combinations, where N is the number of features. In an exhaustive study, you create a dataset for every combination of features. For instance, 5 stylometric features yield 2^5 - 1 = 31 datasets.
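
The count of feature combinations can be checked with a short Python sketch. The feature names and the helper feature_subsets are placeholders chosen for the example, not identifiers from the framework.

```python
from itertools import combinations


def feature_subsets(features):
    """Yield every non-empty combination of the given features."""
    for size in range(1, len(features) + 1):
        for subset in combinations(features, size):
            yield subset


# Placeholder names for the 5 stylometric features.
FEATURES = [
    "word_lengths",
    "letters",
    "std_sentence_length",
    "std_word_length",
    "digrams",
]

subsets = list(feature_subsets(FEATURES))
print(len(subsets))  # 31, i.e. 2**5 - 1
```

Each subset would correspond to one generated dataset in an exhaustive study; since the count doubles with every added feature, exhaustive enumeration is only practical for small N.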

You can normalize feature values. BTW, what sort of normalization do you mean in your email? The framework relies on information segments, for example sequences of movie reviews. You can derive useful information and store it globally, for example ngram-related information. In opinion mining on movie reviews, for instance, I keep the BNC frequency list as a global variable. I considered normalizing feature values by the length of a sentence in words or the length of a word in characters. However, such normalization wasn't beneficial (see the thesis).
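
One possible reading of length normalization for ngram counts is sketched below in Python: raw counts are divided by a length measure to give relative frequencies. This is only an assumption about what is meant by normalization; char_ngrams and normalize_by_length are hypothetical helpers, not framework functions.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count overlapping character n-grams in a text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def normalize_by_length(counts, length):
    """Turn raw counts into relative frequencies for a given length."""
    if length == 0:
        return {}
    return {key: value / length for key, value in counts.items()}


review = "good good film"
tokens = review.split()

# Word unigram counts normalized by the length of the text in words.
relative_words = normalize_by_length(Counter(tokens), len(tokens))

# Character bigram counts normalized by the length of the text in characters.
relative_bigrams = normalize_by_length(char_ngrams(review, 2), len(review))
```

Whether such normalized values help classification is an empirical question for each corpus.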

Hope I could answer your questions.

Best Alexander

2012/4/9 Georgios Mikros <gmikros at isll.uoa.gr>


> Alexander Hi,
>
> Your framework seems very interesting. What kind of features can be
> counted? My research focuses in authorship attribution and I use many
> different scripts in PERL for counting many different feature sets for my
> experiments. I was wondering whether your framework can count and normalize
> for text length character and word ngrams.
>
> Best
>
> George Mikros
>
> ------------------------------------------------------------------------------
> Dr. George K. Mikros,
> Associate Professor of Computational Linguistics and Quantitative
> Linguistics
> Department of Italian Language and Literature
> School of Philosophy,
> National and Kapodistrian University of Athens
> Panepistimioupoli Zografou, GR 15784
> Athens
> Greece
> Tel/Fax: +30 210 6511344
> Email: gmikros at isll.uoa.gr
> Web: http://users.uoa.gr/~gmikros/