[Corpora-List] Geometrical representation of NL phrases for similarity comparison

Stefan Dumitrescu sdumitrescu at racai.ro
Fri Oct 19 18:33:53 CEST 2018


Hi Alexander,

The basic idea is that you encode a phrase in any way you want (average of word embeddings, doc2vec, ELMo, the final state of an LSTM network, etc.) and you get an n-dimensional vector for that phrase. Cosine similarity between two "encoded" phrases is a pretty standard and solid way to measure how close they are.
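As a concrete example, here is a minimal sketch of the averaging-plus-cosine recipe, assuming gensim and a pre-trained word2vec-format vector file (the file name and the helper functions below are placeholders for illustration, not any particular library's API):

    import numpy as np
    from gensim.models import KeyedVectors

    # The path/format of the pre-trained vectors is a placeholder.
    wv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

    def encode(phrase):
        # Average the embeddings of the in-vocabulary words of the phrase.
        words = [w for w in phrase.lower().split() if w in wv]
        return np.mean([wv[w] for w in words], axis=0)

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(cosine(encode("that chick is such a catch"),
                 encode("this girl is pretty")))  # closer to 1.0 = more similar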

Furthermore, as you mentioned earlier, dimensionality reduction is one way to go if you want to visualize phrases in 2D/3D. t-SNE may be what you want. There are a lot of resources out there for t-SNE; here is one picked from the first page of Google results: https://medium.com/@luckylwk/visualising-high-dimensional-datasets-using-pca-and-t-sne-in-python-8ef87e7915b . There are also a lot of pretty neat t-SNE visualizers online.
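If you want to roll your own plot, here is a small sketch assuming scikit-learn and matplotlib, reusing the hypothetical encode() helper from the sketch above (t-SNE needs far more points than this toy example to give a meaningful picture):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    phrases = ["that chick is such a catch", "this girl is pretty",
               "the weather is awful today", "it is raining heavily"]
    vectors = np.array([encode(p) for p in phrases])

    # perplexity must stay below the number of samples
    coords = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(vectors)

    plt.scatter(coords[:, 0], coords[:, 1])
    for (x, y), p in zip(coords, phrases):
        plt.annotate(p, (x, y))
    plt.show()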

I think of it this way: if a word has an n-dimensional representation in a semantic space, then so does a phrase encoded as an n-dimensional vector. So basically, whatever you do with words you can also do with phrases, sentences or even documents.

Stefan Dumitrescu

On 10/19/2018 6:20 PM, Alexander Osherenko wrote:
> Thanks, guys, for your input -- very interesting, I will evaluate all
> approaches. Actually, I am looking primarily for a geometric
> representation of phrases that I can use for different purposes, for
> example, for comparison. As I saw, some approaches calculate a scalar
> value that can be represented as a point on a line -- it is a good
> starting point for my evaluation. I am curious if there are other
> representations of phrases that can be visualized, for instance, as a
> point in 2D/3D space.
>
>
> On Fri, Oct 19, 2018 at 17:06, Daniel Cer <cer at google.com> wrote:
>
> Hi Alexander,
>
> You could try using the Universal Sentence Encoder:
> https://tfhub.dev/google/universal-sentence-encoder/2
>
> It performs well on sentence level semantic textual similarity.
> There's an online demo / notebook available here:
> https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/semantic_similarity_with_tf_hub_universal_encoder.ipynb
>
> The demo includes a simple pairwise similarity visualization as
> well as example code for using it for STS. If you want to get the
> best results, I would recommend using the transformer-based model
> (universal-sentence-encoder-large).
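For reference, a rough sketch of calling the module with the TF1-era tensorflow_hub interface that was current at the time (the inner-product scoring mirrors what the linked demo does, but treat this as a starting point rather than the canonical usage):

    import numpy as np
    import tensorflow as tf
    import tensorflow_hub as hub

    embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/2")
    sentences = ["Hey man, that chick is such a catch!", "This girl is pretty!"]
    embeddings = embed(sentences)

    with tf.Session() as session:
        session.run([tf.global_variables_initializer(), tf.tables_initializer()])
        vectors = session.run(embeddings)

    # The sentence vectors are approximately unit length, so the inner
    # product behaves like cosine similarity.
    print(np.inner(vectors[0], vectors[1]))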
>
> Disclaimer: I'm one of the authors. We'll be presenting it during
> the EMNLP demo session later this month.
>
> Dan
>
> On Fri, Oct 19, 2018 at 3:03 AM Jindrich Libovicky
> <libovicky at ufal.mff.cuni.cz> wrote:
>
> Hi Alexander,
>
> I would recommend something like ELMo:
> https://allennlp.org/elmo, https://arxiv.org/abs/1802.05365
>
> It is a large pre-trained language model that works well on
> most semantic tasks (https://gluebenchmark.com). There are a
> bunch of models that perform even better, but I am not sure
> how easily available they are. For ELMo, you just need to
> install AllenNLP.
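A short sketch of turning ELMo output into a single phrase vector with AllenNLP's ElmoEmbedder (the averaging over layers and tokens is just one simple pooling choice, not something prescribed by ELMo itself):

    import numpy as np
    from allennlp.commands.elmo import ElmoEmbedder

    elmo = ElmoEmbedder()  # downloads the default pre-trained weights

    def encode(tokens):
        layers = elmo.embed_sentence(tokens)  # shape: (3 layers, n_tokens, 1024)
        return layers.mean(axis=(0, 1))       # pool to a single 1024-d vector

    v1 = encode("that chick is such a catch".split())
    v2 = encode("this girl is pretty".split())
    print(float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))))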
>
> Regards,
> Jindřich
>
> ----- Original Message -----
> From: "Ignacio J. Iacobacci" <iiacobac at gmail.com>
> To: osherenko at gmx.de
> Cc: corpora at uib.no
> Sent: Friday, 19 October, 2018 11:13:42
> Subject: Re: [Corpora-List] Geometrical representation of NL
> phrases for similarity comparison
>
> Hello Alexander,
>
> There are many options, much better than this one, but
> doc2vec, the extension of word2vec to sentences and documents,
> will work for you:
> https://radimrehurek.com/gensim/models/doc2vec.html
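A minimal gensim sketch, with a toy corpus standing in for the large document collection you would train on in practice:

    import numpy as np
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    docs = ["that chick is such a catch",
            "this girl is pretty",
            "it is raining heavily"]
    corpus = [TaggedDocument(words=d.split(), tags=[i]) for i, d in enumerate(docs)]

    # Toy hyperparameters; real training needs far more data and tuning.
    model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)

    # infer_vector embeds new, unseen phrases into the same space.
    v1 = model.infer_vector("that chick is such a catch".split())
    v2 = model.infer_vector("this girl is pretty".split())
    print(float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))))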
>
> All the best!
>
> Ignacio
>
>
> On Fri, Oct 19, 2018 at 10:10, Alexander Osherenko
> <osherenko at gmx.de> wrote:
>
>
>
> Thanks, Mohammad. Unfortunately, I am looking for a geometric
> representation of phrases, not of words.
>
> Best, Alexander
>
>
> On Fri, Oct 19, 2018 at 11:01, Mohammad Akbari
> <akbari.ma at gmail.com> wrote:
>
> Hello Alexander,
>
> Word embedding models, such as word2vec and GloVe, are common
> approaches, where words are represented with numerical vectors
> (https://arxiv.org/pdf/1310.4546.pdf,
> https://code.google.com/archive/p/word2vec/). When you have
> word embeddings, you can do geometric computations on these
> vectors. A common approach is to compute the average embedding
> of all words in a phrase; you can check fastText for this
> purpose.
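For example, a sketch with the fastText Python bindings, whose get_sentence_vector roughly averages the normalized word vectors of a phrase for you (the model file name is a placeholder for one of the pre-trained models from fasttext.cc):

    import numpy as np
    import fasttext

    model = fasttext.load_model("cc.en.300.bin")  # placeholder file name

    v1 = model.get_sentence_vector("that chick is such a catch")
    v2 = model.get_sentence_vector("this girl is pretty")
    print(float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))))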
>
>
> Regards,
> Mohammad
>
> On 19 Oct 2018, at 09:41, Alexander Osherenko
> <osherenko at gmx.de> wrote:
>
> Hi,
>
> I wonder if it is possible to represent NL phrases
> geometrically, for example, to compare their similarity. The
> phrase "Hey man, that chick is such a catch!" and the more
> formal "..., this girl is pretty!" should be represented as
> geometrically nearby points because they are semantically
> similar.
>
> I am aware of LSA vectors that represent particular words;
> similarity could be evaluated as a distance between these word
> vectors in the LSA space. However, the LSA approach only works
> for individual words, not phrases, and it is IMHO too
> numerical because it doesn't consider the semantics of the
> participating words.
>
> Best, Alexander
> --
> Alexander Osherenko, Dr. rer. nat.
> Senior HCI architect
> Founder and R&D, Socioware Development
> (http://www.socioware.de/osherenko_page.html)
> Profile: ResearchGate
> (https://www.researchgate.net/profile/Alexander_Osherenko)
> Implementing Social Smart Environments with a Large Number
> of Believable Inhabitants in the Context of Globalization, at Springer:
> https://www.researchgate.net/publication/327425719_Implementing_Social_Smart_Environments_with_a_Large_Number_of_Believable_Inhabitants_in_the_Context_of_Globalization
>
>
> --
> Men who become accustomed to worrying about the needs of
> machines become callous about the needs of men.
> (Isaac Asimov)
>
> Ignacio J. Iacobacci
> iiacobac at gmail.com
> iiacobacci at dc.uba.ar
> iacobacci at di.uniroma1.it
>
>



