[Corpora-List] (no subject)

Mike Maxwell maxwell at umiacs.umd.edu
Fri May 25 14:33:50 CEST 2012

On 5/23/2012 2:24 AM, fatima zuhra wrote:
> Some of my work includes the development of a corpus, a
> morphological analyzer, a parser and a transliterator for the Pashto language. I have also worked on
> a part of speech tagger for Pashto and the work is in progress. I am interested in the knowledge
> and discussions about copyright rules. In my view, a more severe problem is that if someone
> integrates in his/her software an algorithm (or even the software code) from another scholar's
> work (e.g. my morphological analyzer code and methodology) without the knowledge of the scholar.
> It will be very hard to check the code of such a larger software for 'plagiarism'!!!!

Very few researchers today would write a new algorithm to do morphological parsing of a particular language. Rather, most morphological analyzers these days are built from three components: a language-agnostic parsing engine (which contains the algorithms); a set of grammar rules for the morphology; and a lexicon. Commonly used parsing engines include the Xerox finite-state tool (xfst) and the Stuttgart finite-state transducer tools (SFST), among others.
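To make the three-way split concrete, here is a minimal sketch in Python. It is not how xfst or SFST work internally (those compile rules into finite-state transducers); it is only a toy suffix-stripping engine. The names LEXICON, SUFFIX_RULES and analyze are made up for illustration, and the data is a handful of English words, not Pashto.

```python
from typing import Dict, List, Tuple

# Component 3: the lexicon (toy English stems with a category; hypothetical data).
LEXICON: Dict[str, str] = {"walk": "V", "talk": "V", "cat": "N", "dog": "N"}

# Component 2: the grammar rules, here just (suffix, stem category, feature tag).
SUFFIX_RULES: List[Tuple[str, str, str]] = [
    ("", "V", "+Base"),
    ("", "N", "+Sg"),
    ("s", "N", "+Pl"),
    ("s", "V", "+3Sg"),
    ("ed", "V", "+Past"),
]


# Component 1: the engine. It knows nothing about any particular language;
# everything language-specific arrives through its arguments.
def analyze(word: str,
            lexicon: Dict[str, str],
            rules: List[Tuple[str, str, str]]) -> List[str]:
    analyses = []
    for suffix, category, tag in rules:
        if suffix and not word.endswith(suffix):
            continue
        # Strip the suffix (an empty suffix leaves the word unchanged).
        stem = word[: len(word) - len(suffix)]
        if lexicon.get(stem) == category:
            analyses.append(f"{stem}+{category}{tag}")
    return analyses


if __name__ == "__main__":
    for w in ["cats", "walked", "walks", "xyz"]:
        print(w, analyze(w, LEXICON, SUFFIX_RULES))
```

Swapping in a different lexicon and rule list would give an analyzer for a different (toy) language without touching analyze, which is the point of the division of labor.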

If two groups use the same engine for the same language, there will be significant similarities in their code--the same affixes, for example. It could be hard to demonstrate plagiarism there, simply because the code *has* to be similar. Even morphosyntactic feature names will often be the same (how many ways can you say "tense" or "number"?).

On the other hand, if there are significant morpho-phonological processes, that part of the grammar could and probably would differ in analysis, because there are different ways to describe the natural classes involved, or to order the rules. Or if there is no agreed-on set of declension classes (as there is not, for Pashto), different teams would likely differ in that part of the grammar.
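As a rough illustration of why rule ordering is an analytical choice, the sketch below applies the same two rewrite rules in both orders. This is the familiar textbook case of English plural formation, not Pashto, and the regex-based apply_rules function is a stand-in I made up for the example, not the rule formalism of any real engine.

```python
import re
from typing import List, Tuple

Rule = Tuple[str, str]  # (regex pattern, replacement)

# Insert a vowel between two sibilants across the morpheme boundary "+".
EPENTHESIS: Rule = (r"([sz])\+([sz])", r"\1+i\2")
# Devoice the suffix z after a voiceless consonant.
DEVOICING: Rule = (r"([ptks])\+z", r"\1+s")


def apply_rules(underlying: str, rules: List[Rule]) -> str:
    """Apply each rewrite rule once, in the given order, then drop boundaries."""
    form = underlying
    for pattern, replacement in rules:
        form = re.sub(pattern, replacement, form)
    return form.replace("+", "")


ORDER_A = [EPENTHESIS, DEVOICING]  # epenthesis removes the context for devoicing
ORDER_B = [DEVOICING, EPENTHESIS]  # devoicing applies first, then epenthesis

if __name__ == "__main__":
    for underlying in ["bus+z", "kat+z", "dog+z"]:
        print(underlying,
              apply_rules(underlying, ORDER_A),
              apply_rules(underlying, ORDER_B))
```

The two orderings agree on "kat+z" and "dog+z" but give different surface forms for "bus+z". A team that chose a different ordering would have to state one of the rules differently to get the same output, and that is the kind of difference one would expect between independently written grammars.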

Mike Maxwell
maxwell at umiacs.umd.edu
"My definition of an interesting universe is
one that has the capacity to study itself."
--Stephen Eastmond
