[Corpora-List] Reducing n-gram output

Chunyu Kit ctckit at cityu.edu.hk
Sat Nov 1 04:29:06 CET 2008


one may have a look into this paper if interested in suffix array stuff ... didn't expect so many papers on it in recent years, but there are some smart algorithms for its implementation.

best, kit

A taxonomy of suffix array construction algorithms <citation.cfm?id=1242471.1242472&coll=GUIDE&dl=ACM&CFID=8957818&CFTOKEN=40985920>

Simon J. Puglisi <author_page.cfm?id=81100272116&coll=GUIDE&dl=ACM&CFID=8957818&CFTOKEN=40985920>, W. F. Smyth <author_page.cfm?id=81338490986&coll=GUIDE&dl=ACM&CFID=8957818&CFTOKEN=40985920>, Andrew H. Turpin <author_page.cfm?id=81100229960&coll=GUIDE&dl=ACM&CFID=8957818&CFTOKEN=40985920>

July 2007

Computing Surveys (CSUR) , Volume 39 Issue 2 Publisher: ACM Full text available: PdfPdf <ft_gateway.cfm?id=1242472&type=pdf&coll=GUIDE&dl=ACM&CFID=8957818&CFTOKEN=40985920> (381.17 KB)

Additional Information: full citation <citation.cfm?id=1242471.1242472&jmp=cit&coll=GUIDE&dl=ACM&CFID=8957818&CFTOKEN=40985920#CIT>, abstract <citation.cfm?id=1242471.1242472&jmp=abstract&coll=GUIDE&dl=ACM&CFID=8957818&CFTOKEN=40985920#abstract>, references <citation.cfm?id=1242471.1242472&jmp=references&coll=GUIDE&dl=ACM&CFID=8957818&CFTOKEN=40985920#references>, index terms <citation.cfm?id=1242471.1242472&jmp=indexterms&coll=GUIDE&dl=ACM&CFID=8957818&CFTOKEN=40985920#indexterms>

Bibliometrics: Downloads (6 Weeks): 53, Downloads (12 Months): 649, Citation Count: 3

In 1990, Manber and Myers proposed suffix arrays as a space-saving alternative to suffix trees and described the first algorithms for suffix array construction and use. Since that time, and especially in the last few years, suffix array construction ...

Alexandre Rafalovitch wrote:


>One of the interesting papers around suffix arrays is by Chunyu KIT:
>"The Virtual Corpus approach to deriving n-gram statistics from large
>scale corpora"
>http://personal.cityu.edu.hk/~ctckit/papers/vc.pdf
>
>I have some (scary) java code based around those concepts that can do
>statistical analysis on n-grams with n above 140 and does look at
>(n-1)-grams and (n+1)-grams with the same length.
>
>If that's of interest, I would be happy to discuss this further in a
>direct email (to not bother the list).
>
>Regards,
> Alex.
>
>Personal blog: http://blog.outerthoughts.com/
>Research group: http://www.clt.mq.edu.au/Research/
>
>
>
>On Tue, Oct 28, 2008 at 11:21 AM, Adam Lopez <alopez at inf.ed.ac.uk> wrote:
>
>
>>>I was wondering whether anybody is aware of ideas and/or automated
>>>processes to reduce n-gram output by solving the common problem that
>>>shorter n-grams can be fragments of larger structures (e.g. the 5-
>>>gram 'at the end of the' as part of the 6-gram 'at the end of the
>>>day')
>>>
>>>I am only aware of Paul Rayson's work on c-grams (collapsed-grams).
>>>
>>>
>>Suffix trees, suffix arrays, and their relatives are compact data
>>structures following exactly the intuition that smaller strings are
>>substrings of larger ones. They represent all possible n-grams
>>(without limit on n) of a text in space proportional to the length of
>>the text and support efficient retrieval, counting, and other queries
>>on substrings of the text; there is a vast literature on their various
>>applications (and theory linking them to compressibility, etc.).
>>
>>
>
>_______________________________________________
>Corpora mailing list
>Corpora at uib.no
>http://mailman.uib.no/listinfo/corpora
>
>
>

-- Chunyu Kit, PhD Associate Professor, Computational Linguistics

Dept. of Chinese, Translation & Linguistics City University of Hong Kong 83 Tat Chee Ave., Kowloon

http://personal.cityu.edu.hk/~ctckit/ Tel: (+852)2788 9310 (O), 9380 1738 (M)

(+86)135 3948 3937 (China Mobile) Fax: (+852)2788 8706, 2788 7320

-------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: text/html Size: 7528 bytes Desc: not available Url : https://mailman.uib.no/public/corpora/attachments/20081101/a011260a/attachment.txt -------------- next part -------------- A non-text attachment was scrubbed... Name: imagetypes\pdf_logo.gif Type: image/gif Size: 355 bytes Desc: not available Url : https://mailman.uib.no/public/corpora/attachments/20081101/a011260a/attachment.gif -------------- next part -------------- A non-text attachment was scrubbed... Name: doc_blank.gif Type: image/gif Size: 51 bytes Desc: not available Url : https://mailman.uib.no/public/corpora/attachments/20081101/a011260a/attachment-0001.gif -------------- next part -------------- A non-text attachment was scrubbed... Name: blanks.gif Type: image/gif Size: 54 bytes Desc: not available Url : https://mailman.uib.no/public/corpora/attachments/20081101/a011260a/attachment-0002.gif



More information about the Corpora mailing list