[Corpora-List] quantities of publicly available parallel text?

Alexandre Rafalovitch arafalov at gmail.com
Wed Feb 27 20:01:15 CET 2008


On Wed, Feb 27, 2008 at 10:46 AM, Adam Kilgarriff <adam at lexmasterclass.com> wrote:
> But aren't all these official, centralised corpora both of rather peculiar
> genres, and rather small? More interesting, to my mind, is Tiedemann and
> Nygard's work, based on the neat observations that

Peculiar genre, perhaps. But so are the legal and biomedical domains, and those have been getting some attention recently.

As for small, what counts as too small? I have 5 million (uncleaned) tokens for one language in a single document subtype (Resolutions of the General Assembly). Is that too small for the kind of work you envisage?

If so, what would be a good number? Apologies if this question has already been answered.

Regards,

Alex.


