[Corpora-List] Clean corpus including user relationships (Enron? Twitter?)

Vincent mailinglists at vinnl.nl
Sun Apr 6 15:42:20 CEST 2014


Hi all,

For my master's thesis I want to compare relationship-based community detection methods with text-based methods. Hence, I need a corpus that includes both.

Currently, I'm thinking of the Enron email dataset. It includes relationships (who mailed whom?) and text (the actual emails). It has a few issues though:

- Users can have multiple email addresses.
- Not all text is produced by the sender of the email, or even by a human (think quotes, signatures, spam and whatnot).

Does anyone have access to a cleaned-up dataset that includes both, or perhaps a script that cleans up email text so that only content written by the email's sender remains? Alternatively, is there a different clean dataset that includes both relationships between people and text produced by those people? Twitter comes to mind.
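To illustrate the kind of cleanup meant here, below is a minimal sketch in Python. It is only a heuristic under stated assumptions: the function names (sender_text, merge_addresses), the regular expressions, and the alias table are all hypothetical and have not been tested against the real Enron data. It drops quoted lines, cuts the body at a signature delimiter or a quoted-message header, and maps multiple addresses of one user onto a single canonical address.

```python
import re

# Heuristic patterns. These are assumptions for illustration,
# not a rule set validated on the Enron corpus.
QUOTE_HEADER = re.compile(r"^\s*-+\s*(Original Message|Forwarded by)\b", re.IGNORECASE)
ON_WROTE = re.compile(r"^On .+ wrote:\s*$")
SIG_DELIMITER = re.compile(r"^--\s*$")


def sender_text(body: str) -> str:
    """Return only the lines that the sender most likely wrote."""
    kept = []
    for line in body.splitlines():
        # Stop at a signature delimiter or at the header of a quoted message;
        # everything after it is assumed not to be new text by the sender.
        if SIG_DELIMITER.match(line) or QUOTE_HEADER.match(line) or ON_WROTE.match(line):
            break
        # Lines starting with '>' are treated as quoted replies and skipped.
        if line.lstrip().startswith(">"):
            continue
        kept.append(line)
    return "\n".join(kept).strip()


def merge_addresses(edges, aliases):
    """Map each address to a canonical user and drop self-loops.

    edges   -- iterable of (sender, recipient) address pairs
    aliases -- dict from a lower-cased address to its canonical address
               (a hypothetical, hand-built table)
    """
    merged = set()
    for sender, recipient in edges:
        a = aliases.get(sender.lower(), sender.lower())
        b = aliases.get(recipient.lower(), recipient.lower())
        if a != b:
            merged.add((a, b))
    return merged
```

Spam and machine-generated mail are not handled by this sketch, and building the alias table is the hard part, so a ready-made cleaned dataset would still be preferable.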

Thanks in advance,
-- Vincent


