[Corpora-List] Comparing word lengths

Alexander Osherenko osherenko at gmx.de
Thu Jan 12 17:19:54 CET 2017


Dear Muhammad,

I studied the impact of stylometric features, such as word lengths, in my dissertation on opinion mining and lexical affect sensing. I also studied whether the analysis improves if I normalize words by the length of the words in a text abstract. Accordingly, for the statistical analysis I extracted features known from authorship attribution: standard deviation of word lengths, standard deviation of sentence lengths, digrams, and letters. I did not use the mean word lengths.
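To illustrate the kind of features listed above, here is a minimal sketch in Python. It is not the code from the dissertation: the function name `stylometric_features`, the naive regex tokenization, and the reading of "digrams" as character pairs within a word are all assumptions made for illustration.

```python
import re
from collections import Counter
from statistics import pstdev


def stylometric_features(text):
    """Hypothetical helper: compute a few authorship-attribution style features.

    Assumptions (not from the original post): words are runs of ASCII letters,
    sentences end at '.', '!' or '?', and digrams are character pairs in a word.
    """
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = re.findall(r"[a-z]+", text.lower())

    word_lengths = [len(w) for w in words]
    # Sentence length is measured in words here.
    sentence_lengths = [len(re.findall(r"[a-z]+", s.lower())) for s in sentences]

    letters = Counter(ch for w in words for ch in w)
    digrams = Counter(w[i:i + 2] for w in words for i in range(len(w) - 1))

    return {
        "word_length_sd": pstdev(word_lengths) if word_lengths else 0.0,
        "sentence_length_sd": pstdev(sentence_lengths) if sentence_lengths else 0.0,
        "letters": letters,
        "digrams": digrams,
    }


if __name__ == "__main__":
    feats = stylometric_features("I am here. You are there now!")
    print(feats["word_length_sd"], feats["sentence_length_sd"])
    print(feats["digrams"].most_common(3))
```

The letter and digram counts would normally be turned into relative frequencies before being passed to a classifier; that step is left out to keep the sketch short.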

I used SVM and Naive Bayes as classification algorithms. Normalization did not seem to improve classification significantly. Moreover, stylometric features were not the best-performing features; Bag of Words was.
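For the narrower question in the quoted message (whether normalization shifts mean word length per file), one possible approach is a paired test on the per-file means. The sketch below uses a Monte Carlo sign-flip permutation test, chosen here only because it needs nothing beyond the Python standard library; it is not a method taken from the thesis, and the names `mean_word_length` and `paired_sign_flip_test` are hypothetical.

```python
import random
from statistics import mean


def mean_word_length(tokens):
    """Mean length of the tokens in one file (tokens: non-empty list of strings)."""
    return mean(len(t) for t in tokens)


def paired_sign_flip_test(before, after, n_resamples=10000, seed=0):
    """Two-sided paired permutation test on per-file means.

    before[i] and after[i] are the mean word lengths of the same file
    without and with normalization. Under the null hypothesis the sign of
    each per-file difference is arbitrary, so signs are flipped at random.
    Returns (mean difference, approximate p-value).
    """
    diffs = [a - b for a, b in zip(after, before)]
    observed = abs(mean(diffs))
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    hits = 0
    for _ in range(n_resamples):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value above zero.
    return mean(diffs), (hits + 1) / (n_resamples + 1)


if __name__ == "__main__":
    # Made-up illustrative numbers, not real corpus data.
    raw = [4.0, 4.25, 3.5, 5.0, 4.5, 3.75]
    normalized = [4.1, 4.2, 3.6, 5.0, 4.6, 3.7]
    print(paired_sign_flip_test(raw, normalized))
```

A statistically significant difference is not necessarily a large one, so the mean difference itself is worth reporting next to the p-value.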

You might be interested in this PhD thesis <http://www.worldcat.org/title/opinion-mining-and-lexical-affect-sensing/oclc/725467116>, which is also available for free on the Internet.

HTH, Alexander

-- Alexander Osherenko, Dr. rer. nat. Senior HCI architect

Founder and R&D Socioware Development <http://www.socioware.de/osherenko_page.html>

Humboldt Innovation <http://www.humboldt-innovation.de/> Humboldt-Universität zu Berlin <http://www.hu-berlin.de/~osherena/>

Profile: ResearchGate <https://www.researchgate.net/profile/Alexander_Osherenko> Channel: LinkedIn <https://www.linkedin.com/pub/alexander-osherenko/1/30a/a74> Channel: Google+ <https://plus.google.com/105305790720313348886>, Google Scholar <https://scholar.google.com/citations?user=q_0QJBoAAAAJ&hl=en> Channel: Youtube <https://www.youtube.com/user/MrOsherenko> Channel: Twitter <https://twitter.com/mrosherenko>

Social interaction, globalization and computer-aided analysis <http://www.springer.com/us/book/9781447162599> at Springer

2017-01-03 13:18 GMT+01:00 Muhammad Shakir Aziz <true.friend2004 at gmail.com>:


> Dear Corpora Members
> I am dealing with online conversational texts which contain a lot of short
> hand spellings. I have normalized these spellings (longer standard
> spellings like brother for bro) or (short standard spellings like so for
> sooooooooo). Since word length is an important variable for my analysis, I
> just want to make sure that there is no significant /overall difference
> between normalized and non-normalized texts. The question: is it OK to
> simply compare mean word lengths from each corpus category? Or should I put
> mean score from each file in two columns (normalized versus non-normalized)
> and apply some significance test?
> PS: My guess is that about 10% words (at maximum) are affected by this
> normalization process, but I just wanted to make sure it is negligible.
> Regards