[Corpora-List] Syntax-based Sentence Similarity measures

Jason Eisner jason at cs.jhu.edu
Sat Nov 22 19:17:03 CET 2008

2008/11/22 ben dbabis samira <bendbabis_samira at yahoo.fr>:
> I'm working on sentence similarity. I want to know if there are
> measures that calculate the similarity between two sentences using
> syntactic information (grammatical categories, dependency relations, ...),
> i.e., measures that take into account the structure of the whole sentence
> (not a word-level measure that treats a sentence as a bag of words).

There have been a number of papers on various tree kernels and path kernels (easily found by searching). Each parse tree is mapped to a high-dimensional vector that records the counts of various substructures such as complete and incomplete subtrees, subcategorization frames, and/or dependency paths. The similarity of two trees is then defined as the dot product of their vectors. This dot product can typically be found efficiently by dynamic programming over the pair of trees, without having to expand out the actual high-dimensional vector for each tree. (An instance of the "kernel trick.")
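As a rough illustration of the dynamic program described above, here is a minimal Python sketch in the style of the subset-tree kernel of Collins and Duffy. It is not taken from any particular paper's code. The tree encoding (nested tuples whose first element is the label and whose leaf children are word strings), the function names, and the `decay` parameter are all assumptions made for this example.

```python
from functools import lru_cache
from math import sqrt

# Assumed encoding: a node is a tuple (label, child, ...); a word is a plain str.

def production(node):
    """The node's label plus its children's labels (words are tagged as such)."""
    kids = tuple(c[0] if isinstance(c, tuple) else ("word", c) for c in node[1:])
    return (node[0], kids)

def nodes(tree):
    """Yield every internal (non-word) node of the tree."""
    yield tree
    for child in tree[1:]:
        if isinstance(child, tuple):
            yield from nodes(child)

def tree_kernel(t1, t2, decay=1.0):
    """Weighted count of tree fragments shared by t1 and t2.

    Equals the dot product of the two (never materialized) fragment-count
    vectors; with decay < 1, larger fragments are downweighted.
    """
    @lru_cache(maxsize=None)
    def common(n1, n2):
        # Fragments rooted at both n1 and n2 must start with the same production.
        if production(n1) != production(n2):
            return 0.0
        score = decay
        for c1, c2 in zip(n1[1:], n2[1:]):
            if isinstance(c1, tuple):
                # Each child is either left unexpanded (1) or expanded further.
                score *= 1.0 + common(c1, c2)
        return score

    return sum(common(a, b) for a in nodes(t1) for b in nodes(t2))

def similarity(t1, t2, decay=1.0):
    """Cosine-style normalization so that similarity(t, t) == 1."""
    k12 = tree_kernel(t1, t2, decay)
    return k12 / sqrt(tree_kernel(t1, t1, decay) * tree_kernel(t2, t2, decay))

dog = ("S", ("NP", ("D", "the"), ("N", "dog")), ("VP", ("V", "barks")))
cat = ("S", ("NP", ("D", "the"), ("N", "cat")), ("VP", ("V", "barks")))

print(tree_kernel(dog, dog))  # 24.0 fragments in the tree itself
print(tree_kernel(dog, cat))  # 15.0 fragments shared by the two trees
print(similarity(dog, cat))   # 0.625
```

The memoized `common` function is where the work is shared: it is evaluated at most once per pair of distinct subtrees, so the cost grows with the product of the two tree sizes rather than with the (exponentially large) number of fragments.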

Alternatively, for an asymmetric measure, see work on quasi-synchronous grammar, e.g., "What is the Jeopardy Model? A Quasi-Synchronous Grammar for QA" by Mengqiu Wang, Noah A. Smith, and Teruko Mitamura (EMNLP 2007). http://www.cs.cmu.edu/~nasmith/papers/wang+smith+mitamura.emnlp07.pdf

Most of these methods can be extended naturally to work efficiently over packed forests of parse trees, so that you don't have to commit to a single parse tree for each sentence.

-cheers, jason

On Sat, Nov 22, 2008 at 6:33 AM, Paul McNamee <paul.mcnamee at jhuapl.edu> wrote:
> Cui et al. had a paper at SIGIR 2005, "Question Answering Passage Retrieval
> Using Dependency Relations":
> http://doi.acm.org/10.1145/1076034.1076103
> They looked for sentences that might contain an answer to a question
> for experiments in question answering at TREC. And I believe some of
> their source code was made publicly available.
> You might also find some relevant work from the RTE evaluations:
> http://www.nist.gov/tac/tracks/2008/rte/
> - Paul
