[Corpora-List] looking for natural language questions on computer science publication domain

Lushan Han lushan1 at umbc.edu
Tue Aug 27 16:11:25 CEST 2013

Dear Corpora List,

We are developing a question-answering system on a publication dataset that combines data from DBLP, CiteSeerX and ArnetMiner. Our system can now interpret many interesting questions, from simple ones like “who published papers on the CIKM conference in 2009” or “give me papers in the subject decision trees” to more complicated ones like “give me the institutions of the authors with whom Lushan Han at UMBC has co-authored” or “list papers that are cited by papers in the conference SIGMOD in the year 2012”.

However, we need a dataset of user queries to evaluate our system and set its parameters. We expect the queries to be in the form of natural language questions. We have written some ourselves, but we still need more questions and, especially, more rephrasings. A question can be expressed in many different ways, and a QA system has to deal with all of them. For example, the citation relation can be queried using “give me paper y that cites the paper x” or “give me paper y that references the paper x” or “give me the citations of paper x” or “give me the references of publication y”.

Moreover, we can also ask “who cites the paper x”, in which case the citation is no longer a direct relation between papers.
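
To make the rephrasing problem concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the toy graph CITES, the regular expressions in PATTERNS, and the function answer are all hypothetical and are not part of the system described above. It maps three of the phrasings to one citation relation, in one of two directions, and leaves the "who cites" form unhandled.

```python
import re

# Hypothetical toy citation graph: CITES[p] is the set of papers that p cites.
CITES = {"y": {"x"}, "z": {"x", "y"}}

# Hypothetical paraphrase patterns. "citing" asks for the papers that cite the
# named paper; "cited" asks for the papers that the named paper references.
PATTERNS = [
    (re.compile(r"^give me paper \w+ that (?:cites|references) the paper (\w+)$"), "citing"),
    (re.compile(r"^give me the citations of paper (\w+)$"), "citing"),
    (re.compile(r"^give me the references of publication (\w+)$"), "cited"),
]


def answer(question):
    """Return a sorted list of paper ids, or None if no pattern matches."""
    q = question.strip().lower()
    for pattern, direction in PATTERNS:
        m = pattern.match(q)
        if m is None:
            continue
        paper = m.group(1)
        if direction == "citing":
            # Look up every paper whose reference set contains the named paper.
            return sorted(p for p, refs in CITES.items() if paper in refs)
        # Otherwise return the named paper's own reference set.
        return sorted(CITES.get(paper, set()))
    return None
```

In this sketch, a question such as "who cites the paper x" matches no pattern and returns None, which is exactly the kind of variation that hand-written patterns miss and that a larger question set would help to cover.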

Does anyone know of any existing datasets of this kind? Any such dataset would be useful, because it can add more variation, in either content or expression, to the questions in our dataset. Your help is highly appreciated.


Lushan Han
