[Corpora-List] Corpus of threats?

Tristan Miller miller at ukp.informatik.tu-darmstadt.de
Fri Nov 2 16:20:19 CET 2012


Greetings.

On 01/11/12 05:59 PM, Tyler Schnoebelen wrote:
> I was looking over the records of searches that led to my corpus blog
> (http://corplinguistics.wordpress.com) and came across:
>
> “death threat corpus linguistics”
>
> This actually is a pretty interesting idea for a corpus. Does anyone
> know about such a corpus or something similar that would help
> researchers investigate the language of threatening/intimidation?

You might be able to construct one yourself semi-automatically using Wikipedia. Editors sometimes post death threats against other editors or against the organization which hosts the encyclopedia. Since this contravenes Wikipedia's policies, other editors often remove these threats, leaving clues in their edit summary such as "rv death threat".

If you obtain a Wikipedia database dump which includes the revision history, and the appropriate API to process it (e.g., JWPL), you could identify and extract these removal edits (including the exact text which was removed).
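
As a rough sketch of that extraction step, the following Python assumes you have already pulled each page's revisions out of the dump as chronological (edit summary, page text) pairs. It is not tied to JWPL: the function names, the summary pattern, and the in-memory revision list are all assumptions made up for illustration, and real edit summaries vary far more than this pattern allows for.

```python
import difflib
import re

# Assumed pattern: a revert/removal verb followed by "death threat".
# Real-world summaries are much more varied, so treat this as a starting point.
SUMMARY_RE = re.compile(
    r"\b(?:rv|revert\w*|remov\w*|undid)\b.*\bdeath threats?\b",
    re.IGNORECASE,
)


def removed_lines(before, after):
    """Return the lines that are in `before` but not in `after`."""
    diff = difflib.ndiff(before.splitlines(), after.splitlines())
    return [line[2:] for line in diff if line.startswith("- ")]


def extract_removed_text(revisions):
    """Collect text deleted by edits whose summary matches SUMMARY_RE.

    `revisions` is a chronological list of (summary, text) pairs for one page.
    """
    found = []
    for (_, prev_text), (summary, text) in zip(revisions, revisions[1:]):
        if SUMMARY_RE.search(summary):
            found.extend(removed_lines(prev_text, text))
    return found


# Toy revision history with a neutral placeholder instead of a real threat.
revisions = [
    ("create page", "Hello.\nWelcome."),
    ("comment", "Hello.\nWelcome.\nPLACEHOLDER THREAT TEXT"),
    ("rv death threat", "Hello.\nWelcome."),
]
print(extract_removed_text(revisions))
```

A line-level diff is the simplest choice here. A threat inserted into the middle of an existing line would come back together with the rest of that line, so the output would still need manual checking.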

Legal threats are also against Wikipedia policy, but they're not usually removed by other editors, so they're not as easy to identify automatically. It's no problem identifying editors who have been blocked or banned for issuing legal threats, since this information is normally included in the block message posted on their user page, but identifying which of their edits constituted the threat itself would be problematic.

Regards,
Tristan

-- 
Tristan Miller, Doctoral Researcher
Ubiquitous Knowledge Processing Lab (UKP-TUDA)
Department of Computer Science, Technische Universität Darmstadt
Tel: +49 6151 16 6166 | Web: http://www.ukp.tu-darmstadt.de/
