[Corpora-List] Criteria to Building a Corpus for Text Classification

Mohsen Al-Thubaity althubaity at gmail.com
Thu Jun 15 17:58:01 CEST 2006

Hi all

My sincere thanks to Ylva, Eric and Ozlem for their responses. All responses
are included in this e-mail.

What I mean by "text classification" is "a program or algorithm to decide
what genre or domain a text document belongs to".

Actually, I am aware of text size.

Is it possible to have different text sizes ranging from 100 words to
several thousands of words?

Government reports, for example, show this variation in text size.

Newspaper articles do not have this variation.

Best wishes


On 15/06/06, Mohsen Al-Thubaity <althubaity at gmail.com> wrote:

Hi all

I am working on a research project investigating Arabic text classification.

The first part of this project requires building a corpus to train and test
the classifier.

Are there any criteria or standards that must be followed to build such a
corpus?

Any suggestions or references are most appreciated.

Best wishes



On 15/06/06, Ylva Berglund <ylva.berglund at oucs.ox.ac.uk> wrote:

Dear Mohsen,

Selection of texts for a (training) corpus is a very complex and
important issue. Unfortunately I don't think there are any hard and fast
rules defining what to include. You would have to consider not only what
kind of text classes there are and what would be suitable examples of
these, but also what is available to you (text resources as well as
time, money, expertise etc). Some issues relating to corpus creation
(including text selection) are discussed in the fairly recent book:
'Developing Linguistic Corpora: A Guide to Good Practice' which is
available online at
http://www.ahds.ac.uk/creating/guides/linguistic-corpora/ (hard copies
from Oxbow books: http://www.oxbowbooks.com/bookinfo.cfm/ID/32969 ).
Maybe that can be of use to you.

Good luck with your project.

-- Ylva

On 15/06/06, Eric Atwell <eric at comp.leeds.ac.uk> wrote:


You don't say what you mean by "text classification" - do you mean you
are developing a program or algorithm to decide what genre or domain
a text document belongs to? Or are you trying to develop a set of
genres which cover the needs of Arabic corpus linguistics? Or something
else?

My colleague Latifa Al-Sulaiti and I have looked into text-types or
genres which Arabic language teachers and language engineers would like
to see in a Corpus of Contemporary Arabic, see

Al-Sulaiti, Latifa; Atwell, Eric. The Design of a Corpus of Contemporary
Arabic. To appear in International Journal of Corpus Linguistics,
vol.11, 2006. [Preprint at http://www.comp.leeds.ac.uk/eric/rae/ ]

Another colleague, Serge Sharoff, has developed a set of text
classification categories which he has demonstrated apply to
100-million-word corpora covering a range of languages, see

- I believe he has a paper forthcoming on this topic; you will have to
ask him directly for a preprint.

Please send me any publication(s) you have on your work; I would
like to find out more, as we have interests in common.


Eric Atwell
