> Can the WebBootCaT tool you mention be used independently of
> SketchEngine ...

No, but the price is affordable. BootCaT is available for free, so it may well suit people with the skills to run Perl scripts. WebBootCaT handles cleaning up the data, removing duplicates, POS-tagging and lemmatising (for quite a few languages) and loading the result into the corpus tool, and it hosts the data. Even for some people with Perl skills, that will be worth a couple of cups of coffee a month.
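To give a feel for what the "removing duplicates" step involves, here is a minimal sketch in Python. It is only an illustration under my own assumptions, not how BootCaT or WebBootCaT actually do it: the function names `normalise` and `remove_duplicates` are hypothetical, and it only catches pages that are identical after whitespace and case are normalised, not near-duplicates.

```python
import hashlib
import re


def normalise(text: str) -> str:
    """Collapse runs of whitespace and lowercase, so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def remove_duplicates(pages):
    """Keep the first page for each distinct normalised text, preserving order.

    Assumption: `pages` is a list of plain-text strings, one per downloaded page.
    """
    seen = set()
    unique = []
    for page in pages:
        # Hash the normalised text so the set stays small for long pages.
        digest = hashlib.sha1(normalise(page).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(page)
    return unique
```

Calling `remove_duplicates(pages)` on the downloaded texts returns the list with exact repeats dropped. Pages that differ only in boilerplate (menus, footers) would need a fuzzier comparison than this.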
Filtering out texts that show evidence of not being written in good English is a topic of current research. I'm not sure whether that fits what you mean by unauthoritative sources. There is usually a tradeoff between "getting exactly what you want" and taking too narrow a view of the language type you are seeking.
The other trouble with 'authoritative sources' is that it implies checking them one by one, so the corpora are correspondingly much smaller and slower to produce. So people are often stuck with a choice: get a corpus that is large, quick, and on target but without knowing exactly what is in it, OR make do with one that is much smaller and/or doesn't really fit their research agenda or teaching plan.
2008/4/28 <M.I.Friedbichler at uibk.ac.at>:
> Michael Friedbichler wrote on Sat, 26 Apr 2008 11:21:27 +0200:
> > > You should be aware, though, that this is not a project you can
> > > complete within a few weeks.
> Adam Kilgarriff wrote on Mon, 28 Apr 2008 07:58:07 +0100:
> > This kind of corpus-building can be done very quickly using
> > BootCaT and related tools, eg WebBootCaT (available at
> > http://www.sketchengine.co.uk ).
> > The basic process takes a few minutes, and a series of
> > iterations, to refine and improve the corpus, may be a day or two's
> > work. We also build in lemmatising, POS-tagging and loading into a
> > corpus query tool.
> Adam, dear corpora list members:
> If one doesn't mind the noise in corpora derived from the web, this is
> indeed an elegant solution. Getting rid of all the unauthoritative
> sources, however, might be a time-consuming task lurking behind the
> seemingly instant harvest from the web.
> Whether WaC-tools (Web as Corpus) like WebBootCaT -- which represent a
> great step forward in compiling DIY corpora for computer-assisted
> translation (isn't this where BootCaT got its name?) -- are also ideal for
> the purpose at hand, is open to question. For teaching purposes, esp. in
> ESP, I think I'd rather have authoritative sources. After all,
> distinguishing between professional language use and unreliable, poorly
> edited sources is evidently not a task for language learners. You're not
> going to get clear water from a mudpot!
> Another point of interest in this context: Can the WebBootCaT tool you
> mention be used independently of SketchEngine or is it accessible only for
> those who have purchased the corpus query tool?
> Michael Friedbichler
> Innsbruck Medical University
--
================================================
Adam Kilgarriff                      http://www.kilgarriff.co.uk
Lexical Computing Ltd                http://www.sketchengine.co.uk
Lexicography MasterClass Ltd         http://www.lexmasterclass.com
Universities of Leeds and Sussex     adam at lexmasterclass.com
================================================