[Corpora-List] Geometrical representation of NL phrases for similarity comparison

Fabio Massimo Zanzotto fabio.massimo.zanzotto at uniroma2.it
Tue Oct 23 15:38:10 CEST 2018


Dear Alexander,

There is another approach to encode sentences in vectors that takes their syntactic structure into consideration: the Distributed Tree Kernel <https://icml.cc/Conferences/2012/papers/111.pdf> and its semantic extension <http://aclweb.org/anthology/C14-1068>. The idea stems from the "convolutional conjecture" <https://doi.org/10.1162/COLI_a_00215> that underlies this kind of vector-based representation.
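
To make the flavour of the approach concrete, here is a toy sketch (not the actual DTK encoding from the papers above, just an illustration of turning a parse tree into a single fixed-size vector with random label vectors and circular convolution):

    import numpy as np

    DIM = 1024
    rng = np.random.default_rng(0)
    label_vec = {}

    def vec(label):
        # one random (approximately orthogonal) unit vector per node label
        if label not in label_vec:
            v = rng.standard_normal(DIM)
            label_vec[label] = v / np.linalg.norm(v)
        return label_vec[label]

    def circ_conv(a, b):
        # circular convolution via FFT
        return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

    def encode(tree):
        # tree = (label, [children]); leaves have an empty child list
        label, children = tree
        v = vec(label)
        for child in children:
            v = circ_conv(v, encode(child))
        return v

    t1 = ("S", [("NP", [("girl", [])]), ("VP", [("is", []), ("pretty", [])])])
    t2 = ("S", [("NP", [("chick", [])]), ("VP", [("is", []), ("a", []), ("catch", [])])])
    v1, v2 = encode(t1), encode(t2)
    print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))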

Hope it helps!

Best, Fabio

On Fri, Oct 19, 2018 at 6:41 PM vinicius at open.inf.br <vinicius at open.inf.br> wrote:


> Dear Alexander,
>
> If you use a vector representation of a sentence, you will have more
> than 50 dimensions (e.g. using word2vec vectors). With a 2D/3D
> simplification you will lose a lot of information, which makes it hard to
> represent the sentence faithfully, even for visualization purposes.
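>
> If the goal is mainly visualization, one common workaround is to project the
> high-dimensional sentence vectors down to 2D with something like PCA or
> t-SNE. A minimal sketch with scikit-learn (the matrix X is assumed to hold
> one sentence vector per row, obtained with whatever encoder you choose):
>
>     import numpy as np
>     from sklearn.decomposition import PCA
>
>     X = np.random.rand(10, 300)   # placeholder for real sentence vectors
>     points_2d = PCA(n_components=2).fit_transform(X)
>     print(points_2d.shape)        # (10, 2) -- one point per sentence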
>
> Best,
> Vinicius
>
> On Fri, Oct 19, 2018 at 12:26 PM Alexander Osherenko <osherenko at gmx.de>
> wrote:
>
>> Thanks, guys, for your input -- very interesting, I will evaluate all
>> approaches. Actually, I am primarily looking for a geometric representation
>> of phrases that I can use for different purposes, for example, for
>> comparison. As I saw, some approaches calculate a scalar value that can be
>> represented as a point on a line -- that is a good starting point for my
>> evaluation. I am curious if there are other representations of phrases that
>> can be visualized, for instance, as a point in 2D/3D space.
>>
>>
>> On Fri, 19 Oct 2018 at 17:06, Daniel Cer <cer at google.com> wrote:
>>
>>> Hi Alexander,
>>>
>>> You could try using the Universal Sentence Encoder:
>>> https://tfhub.dev/google/universal-sentence-encoder/2
>>>
>>> It performs well on sentence-level semantic textual similarity (STS). There's
>>> an online demo / notebook available here:
>>> https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/semantic_similarity_with_tf_hub_universal_encoder.ipynb
>>>
>>> The demo includes a simple pairwise similarity visualization as well as
>>> example code for using it for STS. If you want the best results, I
>>> would recommend using the transformer-based model
>>> (universal-sentence-encoder-large).
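>>>
>>> A minimal usage sketch (assuming a TensorFlow 1.x environment with
>>> tensorflow_hub installed; the similarity computation at the end is just
>>> illustrative):
>>>
>>>     import numpy as np
>>>     import tensorflow as tf
>>>     import tensorflow_hub as hub
>>>
>>>     # Load the module from TF Hub (downloaded and cached on first use).
>>>     embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/2")
>>>     sentences = ["Hey man, that chick is such a catch!",
>>>                  "This girl is pretty!"]
>>>     embedding_op = embed(sentences)
>>>
>>>     with tf.Session() as session:
>>>         session.run([tf.global_variables_initializer(),
>>>                      tf.tables_initializer()])
>>>         vectors = session.run(embedding_op)   # numpy array, shape (2, 512)
>>>
>>>     # Cosine similarity between the two sentence vectors.
>>>     sim = np.inner(vectors[0], vectors[1]) / (
>>>         np.linalg.norm(vectors[0]) * np.linalg.norm(vectors[1]))
>>>     print(sim)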
>>>
>>> Disclaimer: I'm one of the authors. We'll be presenting it at the
>>> EMNLP demo session later this month.
>>>
>>> Dan
>>>
>>>
>>>
>>> On Fri, Oct 19, 2018 at 3:03 AM Jindrich Libovicky <
>>> libovicky at ufal.mff.cuni.cz> wrote:
>>>
>>>> Hi Alexander,
>>>>
>>>> I would recommend something like ELMo: https://allennlp.org/elmo,
>>>> https://arxiv.org/abs/1802.05365
>>>>
>>>> It is a large pre-trained language model that works well on most
>>>> semantic tasks (https://gluebenchmark.com). There are a bunch of models
>>>> that perform even better, but I am not sure how easily available they are.
>>>> For ELMo, you just need to install AllenNLP.
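>>>>
>>>> A minimal sketch of getting a phrase vector out of ELMo (assuming an
>>>> AllenNLP 0.x install; the averaging at the end is just one simple pooling
>>>> choice, not the only one):
>>>>
>>>>     import numpy as np
>>>>     from allennlp.commands.elmo import ElmoEmbedder
>>>>
>>>>     elmo = ElmoEmbedder()   # downloads the default pre-trained weights
>>>>
>>>>     # embed_sentence takes a tokenized sentence and returns an array of
>>>>     # shape (3 layers, num_tokens, 1024).
>>>>     layers = elmo.embed_sentence(["This", "girl", "is", "pretty", "!"])
>>>>
>>>>     # Crude phrase vector: average over layers and tokens.
>>>>     phrase_vector = layers.mean(axis=(0, 1))   # shape: (1024,)
>>>>     print(phrase_vector.shape)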
>>>>
>>>> Regards,
>>>> Jindřich
>>>>
>>>> ----- Original Message -----
>>>> From: "Ignacio J. Iacobacci" <iiacobac at gmail.com>
>>>> To: osherenko at gmx.de
>>>> Cc: corpora at uib.no
>>>> Sent: Friday, 19 October, 2018 11:13:42
>>>> Subject: Re: [Corpora-List] Geometrical representation of NL phrases
>>>> for similarity comparison
>>>>
>>>> Hello Alexander,
>>>>
>>>> There are many options, some much better than this one, but doc2vec,
>>>> the extension of word2vec to sentences and documents, should work for you:
>>>> https://radimrehurek.com/gensim/models/doc2vec.html
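>>>>
>>>> A minimal sketch with gensim (the toy corpus and parameters are only
>>>> illustrative; in practice you would train on a much larger collection):
>>>>
>>>>     from gensim.models.doc2vec import Doc2Vec, TaggedDocument
>>>>
>>>>     corpus = [
>>>>         TaggedDocument(["this", "girl", "is", "pretty"], [0]),
>>>>         TaggedDocument(["that", "chick", "is", "such", "a", "catch"], [1]),
>>>>     ]
>>>>     model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)
>>>>
>>>>     # infer_vector maps any new (tokenized) phrase into the same space.
>>>>     v = model.infer_vector(["the", "girl", "is", "beautiful"])
>>>>     print(v.shape)   # (50,)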
>>>>
>>>> All the best!
>>>>
>>>> Ignacio
>>>>
>>>>
>>>> On Fri, 19 Oct 2018 at 10:10, Alexander Osherenko
>>>> (<osherenko at gmx.de>) wrote:
>>>>
>>>>
>>>>
>>>> Thanks, Mohammad. Unfortunately, I am looking for a geometric
>>>> representation of phrases, not of words.
>>>>
>>>> Best, Alexander
>>>>
>>>>
>>>> On Fri, 19 Oct 2018 at 11:01, Mohammad Akbari
>>>> <akbari.ma at gmail.com> wrote:
>>>>
>>>>
>>>>
>>>> Hello Alexander,
>>>>
>>>> Word embedding models such as word2vec and GloVe are common approaches,
>>>> where each word is represented by a numerical vector
>>>> (https://arxiv.org/pdf/1310.4546.pdf,
>>>> https://code.google.com/archive/p/word2vec/). Once you have word
>>>> embeddings, you can do geometric computations on these vectors. A common
>>>> approach is to compute the average embedding of all words in a phrase; you
>>>> can check fastText for this purpose.
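>>>>
>>>> A minimal sketch of the averaging approach (the pre-trained vectors are
>>>> loaded through gensim's downloader here; any word vectors, including
>>>> fastText ones, would work the same way):
>>>>
>>>>     import numpy as np
>>>>     import gensim.downloader as api
>>>>
>>>>     wv = api.load("glove-wiki-gigaword-100")   # any pre-trained vectors
>>>>
>>>>     def phrase_vector(tokens):
>>>>         # average the vectors of the in-vocabulary words
>>>>         return np.mean([wv[t] for t in tokens if t in wv], axis=0)
>>>>
>>>>     v1 = phrase_vector(["this", "girl", "is", "pretty"])
>>>>     v2 = phrase_vector(["that", "chick", "is", "such", "a", "catch"])
>>>>     print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))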
>>>>
>>>>
>>>> Regards,
>>>> Mohammad
>>>>
>>>>
>>>>
>>>>
>>>> On 19 Oct 2018, at 09:41, Alexander Osherenko
>>>> <osherenko at gmx.de> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I wonder if it is possible to represent NL phrases geometrically, for
>>>> example, to compare their similarity. For instance, the phrase "Hey man,
>>>> that chick is such a catch!" and the more formal "..., this girl is pretty!"
>>>> should be placed geometrically close to each other because they are
>>>> semantically similar.
>>>>
>>>> I am aware of LSA vectors that represent individual words, where
>>>> similarity can be evaluated as a distance between these word vectors in
>>>> the LSA space. However, the LSA approach only works for individual words,
>>>> not phrases, and it is IMHO too purely numerical because it doesn't
>>>> consider the semantics of the participating words.
>>>>
>>>> Best, Alexander
>>>> --
>>>> Alexander Osherenko, Dr. rer. nat.
>>>> Senior HCI architect
>>>> Founder and R&D
>>>> Socioware Development: http://www.socioware.de/osherenko_page.html
>>>> Profile (ResearchGate): https://www.researchgate.net/profile/Alexander_Osherenko
>>>> Implementing Social Smart Environments with a Large Number of Believable
>>>> Inhabitants in the Context of Globalization (at Springer):
>>>> https://www.researchgate.net/publication/327425719_Implementing_Social_Smart_Environments_with_a_Large_Number_of_Believable_Inhabitants_in_the_Context_of_Globalization
>>>>
>>>>
>>>> --
>>>> Men who become accustomed to worrying about the needs of machines
>>>> become callous about the needs of men
>>>> (Isaac Asimov)
>>>>
>>>> Ignacio J. Iacobacci
>>>> iiacobac at gmail.com
>>>> iiacobacci at dc.uba.ar
>>>> iacobacci at di.uniroma1.it
>>>>
>>>>
>>>>
>>
>


