TowardsMachineLearning

Find word similarity using PPMI score

Introduction :-

In our last article, we learned how to calculate the PMI score. In this article, we'll talk about how to find word similarity using the PPMI score.

Identify word similarity or Vector similarity :-

In distributional models, every word is a point in n-dimensional space. How do we measure the similarity between two points/vectors?

  • So, let’s assume we have context vectors for two target words [latex] \overrightarrow { V } [/latex] & [latex] \overrightarrow { W } [/latex]
  • Each contains PMI (or PPMI) values for all context words
  • One way to think of these vectors: as points in high-dimensional space

Ex. in 2-dim space: cat = (v1, v2), computer = (w1, w2)
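As a minimal sketch, two such word vectors can be represented as points in 2-dim space (the PPMI values below are made up purely for illustration):

```python
import numpy as np

# Hypothetical 2-dimensional PPMI context vectors for two target words.
cat = np.array([3.2, 0.1])       # high PPMI with an "animal-like" context
computer = np.array([0.2, 2.9])  # high PPMI with a "technology" context

# Each word is simply a point in 2-dim space.
print(cat.shape, computer.shape)
```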

One obvious choice could be Euclidean distance.

Find word similarity using Euclidean Distance:-

We could measure similarity (or rather, distance) using Euclidean distance:

Euclidean Distance [latex]=\sqrt { \sum _{ i=1 }^{ N }{ { \left( { x }_{ i }-{ y }_{ i } \right)  }^{ 2 } }  }[/latex]

But this doesn’t work well if even one dimension has an extreme value.
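A quick sketch of this weakness, using toy vectors chosen only for illustration:

```python
import math

def euclidean(x, y):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Two toy vectors that agree on three of four dimensions...
v = [1.0, 1.0, 1.0, 1.0]
w = [1.0, 1.0, 1.0, 9.0]  # ...but one dimension has an extreme value

# The single outlier dimension dominates the distance: sqrt(8^2) = 8.0
print(euclidean(v, w))
```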

Other possible similarity measures :-

The following are commonly used approaches to calculate distance or similarity –

  • Manhattan distance (taxicab distance, L1 norm) –
    • [latex] { dist }_{ L1 }(\overrightarrow { x } ,\overrightarrow { y } )=\sum _{ i=1 }^{ N }{ \left| { x }_{ i }-{ y }_{ i } \right|  } [/latex]
  • Euclidean distance (L2 norm) –
    • [latex] { dist }_{ L2 }(\overrightarrow { x } ,\overrightarrow { y } )=\sqrt { \sum _{ i=1 }^{ N }{ { \left( { x }_{ i }-{ y }_{ i } \right)  }^{ 2 } }  } [/latex]
  • Jaccard similarity-
    • [latex] { Sim }_{ Jaccard }(\overrightarrow { V } ,\overrightarrow { W } )=\frac { \sum _{ i=1 }^{ N }{ min({ V }_{ i },W_{ i }) }  }{ \sum _{ i=1 }^{ N }{ max({ V }_{ i },W_{ i }) }  } [/latex]
  • Dice similarity-
    • [latex] { Sim }_{ Dice }(\overrightarrow { V } ,\overrightarrow { W } )=\frac { 2\times \sum _{ i=1 }^{ N }{ min({ V }_{ i },W_{ i }) }  }{ \sum _{ i=1 }^{ N }{ ({ V }_{ i }+W_{ i }) }  } [/latex]
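The four measures above can be sketched in a few lines of Python; the toy vectors are made up for illustration:

```python
def manhattan(x, y):
    # L1 norm: sum of absolute coordinate differences
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def euclidean(x, y):
    # L2 norm: square root of the summed squared differences
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y)) ** 0.5

def jaccard(v, w):
    # ratio of elementwise minima to elementwise maxima
    return sum(min(vi, wi) for vi, wi in zip(v, w)) / \
           sum(max(vi, wi) for vi, wi in zip(v, w))

def dice(v, w):
    # twice the elementwise minima over the sum of all values
    return 2 * sum(min(vi, wi) for vi, wi in zip(v, w)) / \
           sum(vi + wi for vi, wi in zip(v, w))

v = [2.0, 0.0, 1.0]
w = [1.0, 1.0, 1.0]
print(manhattan(v, w), euclidean(v, w), jaccard(v, w), dice(v, w))
```

Note that the first two are distances (0 means identical), while Jaccard and Dice are similarities (1 means identical).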

Dot product :-

Another possibility is to take the dot product of [latex] \overrightarrow { V } [/latex] & [latex] \overrightarrow { W } [/latex].

The dot product operator from linear algebra, also called the inner product:

[latex]{ Sim }_{ DP }(\overrightarrow { V } ,\overrightarrow { W } )=\overrightarrow { V } .\overrightarrow { W }[/latex]

[latex]{ Sim }_{ DP }(\overrightarrow { V } ,\overrightarrow { W } )={ V }_{ 1 }{ W }_{ 1 }+{ V }_{ 2 }{ W }_{ 2 }+{ V }_{ 3 }{ W }_{ 3 }+\dots +{ V }_{ n }{ W }_{ n }[/latex]

The dot product is high when two vectors have large values in the same dimensions. Since frequent words co-occur with more contexts, their vectors have larger values, so more frequent words get higher dot products regardless of how similar they really are.

And it is low (in fact, exactly 0) for orthogonal vectors, i.e., vectors whose non-zero values fall in complementary dimensions.

Vector length is given by-

[latex] \left| \overrightarrow { V }  \right| =\sqrt { \sum _{ i=1 }^{ N }{ { V }_{ i }^{ 2 } }  } [/latex]

But we don’t want a similarity metric that’s sensitive to word frequency.
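A small sketch of this frequency sensitivity, with made-up vectors: doubling every count doubles the dot product even though the direction (and hence the context distribution) is unchanged.

```python
import math

def dot(v, w):
    # inner product: sum of elementwise products
    return sum(vi * wi for vi, wi in zip(v, w))

def length(v):
    # vector length |v| = sqrt(sum of squared components)
    return math.sqrt(sum(vi ** 2 for vi in v))

v = [1.0, 2.0, 1.0]
w = [2.0, 4.0, 2.0]  # same direction as v, but "twice as frequent"

print(dot(v, v))   # 6.0
print(dot(v, w))   # 12.0 -- doubles just because w's values are larger
print(dot([1.0, 0.0], [0.0, 1.0]))  # 0.0 for orthogonal vectors
```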

Normalized dot product (or Cosine Similarity ):-

To correct this problem with the dot product, we normalize: divide by the length of each vector:

[latex] { sim }_{ NDP }(\overrightarrow { V } ,\overrightarrow { W } )=\frac { (\overrightarrow { V } .\overrightarrow { W } ) }{ \left| \overrightarrow { V }  \right| \left| \overrightarrow { W }  \right| } [/latex]

[latex] { Sim }_{ NDP }(\overrightarrow { V } ,\overrightarrow { W } )=\frac { \sum _{ i=1 }^{ N }{ { V }_{ i }\times { W }_{ i } }  }{ \sqrt { \sum _{ i=1 }^{ N }{ { V }_{ i }^{ 2 } }  } \sqrt { \sum _{ i=1 }^{ N }{ { W }_{ i }^{ 2 } }  }  } [/latex]

If you look closely, the above formula is exactly the cosine of the angle between the two vectors:

[latex] \overrightarrow { a } .\overrightarrow { b } =\left| \overrightarrow { a } \right| \left| \overrightarrow { b } \right| \cos { \theta } [/latex]

[latex]\cos { \theta =\frac { \overrightarrow { a } .\overrightarrow { b } }{ \left| \overrightarrow { a } \right| \left| \overrightarrow { b } \right| } }[/latex]

The normalized dot product is nothing but the cosine of the angle between vectors.

[latex] sim(\overrightarrow { V } ,\overrightarrow { W } )=cos(\overrightarrow { V } ,\overrightarrow { W } ) [/latex]

So [latex] sim(\overrightarrow { V } ,\overrightarrow { W } )=1 [/latex] ; V & W point in the same direction.

[latex] sim(\overrightarrow { V } ,\overrightarrow { W } )=0 [/latex] ; V & W are orthogonal.

[latex] sim(\overrightarrow { V } ,\overrightarrow { W } )=-1 [/latex] ; V & W point in opposite directions.

Where

  • [latex] { V }_{ i } [/latex] is the PPMI value for word V in context i.
  • [latex] { W }_{ i } [/latex] is the PPMI value for word W in context i.

So cosine ranges from -1 (vectors pointing opposite directions) to 1 (same direction ).

But in our case, since raw frequency and PPMI values are non-negative, the cosine ranges from 0 to 1.

Implement cosine similarity on our example :-
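Here is a minimal sketch in Python. The PPMI values below are hypothetical placeholders (the real values would come from the PPMI matrix built in the previous article), but they illustrate the expected behavior: words with similar context distributions score close to 1, dissimilar ones close to 0.

```python
import math

def cosine(v, w):
    """Normalized dot product (cosine similarity) of two vectors."""
    dot = sum(vi * wi for vi, wi in zip(v, w))
    norm_v = math.sqrt(sum(vi ** 2 for vi in v))
    norm_w = math.sqrt(sum(wi ** 2 for wi in w))
    return dot / (norm_v * norm_w)

# Hypothetical PPMI vectors over three contexts (values made up for illustration).
ppmi = {
    "cat":      [2.1, 0.0, 0.3],
    "dog":      [1.8, 0.1, 0.5],
    "computer": [0.0, 2.5, 0.2],
}

for word in ("dog", "computer"):
    print(word, round(cosine(ppmi["cat"], ppmi[word]), 3))
# "dog" scores near 1 (shared contexts); "computer" scores near 0
```

Since all PPMI values are non-negative, both scores fall in the [0, 1] range discussed above.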


Article Credit:-

Name:  Praveen Kumar Anwla
6+ years of work experience
