[Corpora-List] quantities of publicly available parallel text?
arafalov at gmail.com
Wed Feb 27 20:01:15 CET 2008
On Wed, Feb 27, 2008 at 10:46 AM, Adam Kilgarriff
<adam at lexmasterclass.com> wrote:
> But aren't all these official, centralised corpora both of rather peculiar
> genres, and rather small? More interesting, to my mind, is Tiedemann and
> Nygard's work, based on the neat observations that
Peculiar genre, perhaps. So are the legal and biomedical domains, and
those have been getting some recent attention.
As to small, what is considered to be too small? I have 5 million
(uncleaned) tokens for one language in one subtype of documents
(Resolutions of the General Assembly). Is that too small for the kind
of work you envisage?
If so, what would be a good number? Apologies if this question has
already been answered.