[Corpora-List] AntConc 3.2.2 released for Windows and Mac OS X

Laurence Anthony anthony0122 at gmail.com
Wed Apr 13 16:32:45 CEST 2011

Dear Michal and all,

I'll reply about the two issues separately.

>I have two issues that haven't been either noticed or perhaps required in Antconc.
>The first is that Antconc does not read files with file names containing 2-byte characters (even after changing the encoding in Global Settings). Since you work in Japan, didn't you have any problems with that?

The problem with developing AntConc as a multi-language program is that I have to deal with the horrible character encoding issues on Windows systems. Basically, all (pre-Win 7?) Windows systems had their own legacy encodings, which varied from country to country. So, even if you have a file saved as UTF-8, the file *name* is saved in the legacy encoding. AntConc only offers one encoding setting, and assumes that the file contents *and* the filename use the same encoding. But this will cause problems, as you have noticed. The files will still open, but the filename will just become jumbled in the display. (Actually, I would recommend that everyone stick with ASCII filenames regardless of the system they use.)
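
To illustrate the mismatch described above, here is a minimal Python sketch (not AntConc code, which is written in Perl). The filename is hypothetical, and the sketch assumes the name reaches the program as bytes in a legacy Japanese encoding (Shift JIS) while the single encoding setting is UTF-8:

```python
# Hypothetical filename, used only for illustration.
name = "日本語.txt"

# Assumption: the legacy system hands the name over as Shift JIS bytes.
legacy_bytes = name.encode("shift_jis")

# Decoding with the matching legacy encoding recovers the name.
recovered = legacy_bytes.decode("shift_jis")

# Decoding with the encoding chosen for the file contents (UTF-8)
# does not; invalid bytes are replaced, so the name comes out jumbled.
jumbled = legacy_bytes.decode("utf-8", errors="replace")

print(recovered)
print(jumbled)
```

An all-ASCII name such as "corpus.txt" yields the same bytes in both encodings, which is why ASCII filenames sidestep the problem.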

Having said that, I just tried to get AntConc 3.2.2 to display a Japanese filename (in Shift JIS) without success! It opened the file correctly and displayed the internal UTF-8 text without problem, but when I selected Shift JIS, the filename appeared blank. It works properly in AntConc 3.2.1, so perhaps Perl 5.10 (which I use to program with) is doing something a little differently. (I'll check and release another bug fix.)

>The second is calculating ranks of words. I noticed that words that have the same occurrence (hit-rate) have subsequent ranks (which probably comes from alphabetical sorting). This means that if there is 1000 words of only 1 occurrence per each, the word starting with "Aa" will have rank = 1, and word starting with "Zz" will have rank = 1000, although statistically they should be of the same rank.
>Do you consider the above as issues or is it irrelevant in your research?

As Mike Scott says, the Rank column is not a rank of the frequencies; it's the position of the word in the sort order. But I can understand the issue. Perhaps "Index" or "Sort Rank" would be a better label. (Thank you for the kind comment, Mike!)

William Fletcher writes,

> One way to avoid the problem of assigning different ranks to
> items with the same frequency is to use "shared ranks"
> instead, so that all items with the same frequency have the
> same rank.
> Shared rank is the mean of the lowest index (=position in
> list) and the highest index of items with the same frequency.
> In Michal's example all items would have the rank 500.5
> (1 + 1000) / 2
> Bill Fletcher

Perhaps this could be added as a separate statistic. Let me think about it.
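
For reference, the shared-rank calculation Bill describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not AntConc code: the function name `shared_ranks` and the sample word lists are hypothetical, and the list is assumed to be sorted by descending frequency, then alphabetically.

```python
from itertools import groupby

def shared_ranks(freqs):
    """Give each word the mean of the first and last list positions
    occupied by words with the same frequency."""
    # Assumed sort order: descending frequency, then alphabetical.
    ordered = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
    ranks = {}
    position = 1  # 1-based position in the sorted list
    for _, group in groupby(ordered, key=lambda kv: kv[1]):
        words = [word for word, _ in group]
        first, last = position, position + len(words) - 1
        for word in words:
            ranks[word] = (first + last) / 2
        position = last + 1
    return ranks

# Michal's example: 1000 words, each occurring once.
example = {f"w{i:04d}": 1 for i in range(1000)}
ranks = shared_ranks(example)
print(set(ranks.values()))  # {500.5}

# A small mixed list: "a" and "b" share positions 2 and 3.
small = shared_ranks({"the": 5, "a": 3, "b": 3, "c": 1})
print(small)  # {'the': 1.0, 'a': 2.5, 'b': 2.5, 'c': 4.0}
```

The existing Rank column corresponds to `position` in this sketch, so both values could be shown side by side.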

