[Corpora-List] WebIsALOD - Large-scale Hypernymy Dataset Released

Koos Wilt kooswilt at gmail.com
Thu May 18 20:00:04 CEST 2017


I promised to submit Python software that amounts to a demo of (1) how ensembles of 'linguistic tricks' cumulatively improve document classification of MEDLINE abstracts (and by extension, I would surmise, summarization, clustering, and IR itself), and (2) how taking hypernyms into account leads to even greater improvement.

To see how SUBJECT PREDICATE OBJECT triples, NOT (yet) in the prescribed URI Semantic Web format but convertible to it, improve classification, run classify.py and 1classify.py and compare the numbers (343/400 correct vs. 363/400 correct). @: https://github.com/Koos12/Cosine_sim_Python-w.n.w.o.Parser
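
For readers who want the general idea without opening the repository, here is a minimal sketch of cosine-similarity classification in which SPO triples are added as extra features on top of a bag of words. It is not the code in classify.py or 1classify.py; the function names, the "SPO:" feature prefix, the nearest-centroid setup, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def features(tokens, triples=()):
    """Bag of words plus one extra feature per (subject, predicate, object) triple."""
    feats = Counter(tokens)
    for s, p, o in triples:
        # Hypothetical feature naming; any unambiguous encoding would do.
        feats["SPO:%s|%s|%s" % (s, p, o)] += 1
    return feats


def classify(doc, centroids):
    """Return the label whose centroid vector is most similar to doc."""
    return max(centroids, key=lambda label: cosine(doc, centroids[label]))


# Toy, made-up class vectors standing in for vectors built from training abstracts.
centroids = {
    "cardiology": features(
        ["aspirin", "reduces", "infarction", "risk"],
        [("aspirin", "reduces", "infarction")],
    ),
    "oncology": features(
        ["tamoxifen", "treats", "carcinoma"],
        [("tamoxifen", "treats", "carcinoma")],
    ),
}

doc = features(["aspirin", "reduces", "risk"], [("aspirin", "reduces", "infarction")])
label = classify(doc, centroids)
```

Calling features() without triples gives the plain bag-of-words baseline, so the same classify() can be run both ways and the counts compared. For scale, 343/400 is 85.75% correct and 363/400 is 90.75%.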

This code, slightly adjusted from the original, shows how POS (linguistics) and SPO Triples (linguistics) enhanced w/ Hypernyms (linguistics/semantics) increase classification performance.

@: https://github.com/Koos12/Cos.-Sim_POS_Hypernyms
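
As a rough illustration of why hypernym expansion can help, the sketch below generalizes the subject and object of each triple so that two documents about different drugs and diseases end up sharing a feature. It is not taken from the repository above: the HYPERNYMS table is a hypothetical toy lookup (a real system might draw on WordNet, UMLS, or WebIsALOD), and the function names are made up.

```python
from collections import Counter

# Hypothetical toy hypernym table, for illustration only.
HYPERNYMS = {
    "aspirin": "drug",
    "ibuprofen": "drug",
    "infarction": "disease",
    "arthritis": "disease",
}


def expand_triple(triple, hypernyms=HYPERNYMS):
    """Return the triple plus variants with subject and/or object generalized."""
    s, p, o = triple
    subjects = [s] + ([hypernyms[s]] if s in hypernyms else [])
    objects = [o] + ([hypernyms[o]] if o in hypernyms else [])
    return [(s2, p, o2) for s2 in subjects for o2 in objects]


def triple_features(triples, hypernyms=HYPERNYMS):
    """Count every original and hypernym-expanded triple as a feature."""
    feats = Counter()
    for triple in triples:
        for variant in expand_triple(triple, hypernyms):
            feats["|".join(variant)] += 1
    return feats


a = triple_features([("aspirin", "treats", "infarction")])
b = triple_features([("ibuprofen", "treats", "arthritis")])

# Without expansion the two documents share no triple feature;
# with expansion they overlap on the generalized triple.
shared = set(a) & set(b)
print(shared)  # {'drug|treats|disease'}
```

The expanded features can be merged into a bag-of-words vector before computing cosine similarity, which is one plausible way such overlap could translate into better classification.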

A draft of a paper discussing similar work on 20newsgroups is at:

https://www.academia.edu/27207951/Linguistics_improves_statistical_classification_with_KLD_NB_TF_IDF_K-NN_the_positive_effects_of_reducing_feature_dimensionality_or_grammatical_feature_selection._Koos_van_der_Wilt

I hope all this makes sense; comments are welcome.

Regards,

-K

2017-05-18 16:35 GMT+02:00 Koos Wilt <kooswilt at gmail.com>:


> Dr Paulheim's post nicely dovetails with the content of the following two
> links:
>
>
> http://www.sciencedirect.com/science/article/pii/S1532046403001175
>
>
> https://semrep.nlm.nih.gov/
>
>
> The first link discusses the role of hypernyms, the subject of Dr
> Paulheim's post. We see this kind of effort has long been in the making.
> My code, which shows how the ensemble of SUBJECT PREDICATE OBJECT triples,
> POS tagging, and a hypernym-expanded version of SUBJECT PREDICATE OBJECT
> increases correct classification from 343/400 to 377/400, is essentially a
> primitive-ish implementation of the ideas in these two links.
>
> The work Dr Paulheim talks about is the stuff in the first link writ
> large, and the potential for transforming the web into a functional
> repository of Linked Data is tremendous, if a headache to organize and keep
> track of.
>
> I will attempt to post all the software yielding the 377/400 correct
> classification to GitHub this evening. If you have more than a passing
> interest, I am sure you will know how to find it.
>
>
> Best regards,
>
> -Koos
>
> 2017-05-18 12:01 GMT+02:00 Koos Wilt <kooswilt at gmail.com>:
>
>> Heiko and others,
>>
>>
>> I hope my response is appropriate. I am conducting a series of
>> experiments to show 'linguistics improves Text Analytics'. (philosophical
>> underpinnings: low-hanging fruit in comparison to tedious Neural Net
>> Studies; linguistics and statistics are complements, as Ken Church asks us
>> to consider in A PENDULUM SWUNG TOO FAR.) Remaining on-topic, one of my
>> experiments concerns hypernyms. In an ensemble with POS tagging, and
>> regular SVO Triples, akin and applicable to Semantic Web stuff, SUBJECT
>> PREDICATE OBJECT triples expanded with hypernyms bring correct
>> classification from 343 out of 400 (baseline) to 377/400, quite an
>> improvement.
>>
>> My point is that studying hypernyms, and semantics in general, is well
>> worth it. And timely: I claim that, with all the parsers now available,
>> we have conquered syntax.
>>
>> I do not have the code for all this ready to present, but here's a taste
>> of what I already have but not uploaded to GitHub yet: the code showing the
>> improvement of just plain SUBJECT PREDICATE OBJECT triples (343/400 -->
>> 363/400). Disclaimer: code not reviewed, written hurriedly for
>> proof-of-concept.
>>
>> https://github.com/Koos12/Cosine_sim_Python-w.n.w.o.Parser
>>
>>
>> Best regards,
>>
>>
>> -Koos
>>
>>
>>
>> 2017-05-18 10:25 GMT+02:00 Heiko Paulheim <heiko at informatik.uni-mannheim.de>:
>>
>>> Dear all,
>>>
>>> the Data and Web Science group at University of Mannheim is happy to
>>> announce the first release of the WebIsA database [1] as a Linked Open Data
>>> endpoint. The dataset contains 11.7 million hypernym or subsumption
>>> relations ("is a") collected from the Web (e.g., "iPhone 4 is a
>>> smartphone"), using a set of Hearst-like patterns (see the paper [2] for
>>> details). We provide the data together with confidence scores, rich
>>> provenance information, as well as interlinks to DBpedia and YAGO. All in
>>> all, the dataset contains more than 470M triples.
>>>
>>> The dataset is available at [3] as a Linked Data endpoint, a SPARQL
>>> endpoint, and downloadable dumps.
>>>
>>> All the best,
>>> Sven Hertling
>>> Heiko Paulheim
>>>
>>> [1] http://webdatacommons.org/isadb
>>> [2] Julian Seitner, Christian Bizer, Kai Eckert, Stefano Faralli, Robert
>>> Meusel, Heiko Paulheim and Simone Paolo Ponzetto: A Large Database of
>>> Hypernymy Relations Extracted from the Web. In: LREC 2016.
>>> [3] http://webisa.webdatacommons.org/
>>>
>>>
>>> --
>>> Prof. Dr. Heiko Paulheim
>>> Data and Web Science Group
>>> University of Mannheim
>>> Phone: +49 621 181 2652
>>> B6, 26, Room B1.16
>>> D-68159 Mannheim
>>>
>>> Mail: heiko at informatik.uni-mannheim.de
>>> Web: www.heikopaulheim.com