## Introduction :-

So in our last article we’ve understood that how to calculate PMI score. Now in this article we’ll talk about how can we find word similarity using PPMI score.

## Identify word similarity or Vector similarity :-

In distributional models, every word is a point in n-dimensional space. How do we measure the similarity between two points/vectors?

• So, let’s assume we have context vectors for two target words $\overrightarrow { V }$ & $\overrightarrow { W }$
• Each contains PMI (or PPMI) values for all context words
• One way to think of these vectors: as points in high-dimensional space

Ex. in 2-dim space: cat = (v1, v2), computer = (w1, w2)

One obvious choice could be Euclidean distance.

## Find word similarity using Euclidean Distance:-

We could measure similarity(distance)  using Euclidean distance

Euclidean Distance $=\sqrt { \sum _{ i=1 }^{ N }{ { \left( { x }_{ i }-{ y }_{ i } \right) }^{ 2 } } }$

But this doesn’t work well if even one dimension has an extreme value

There’re few more possible similarity measures to calculate similarity-

## Other possible similarity measures-

Following are predominantly used approaches to calculate distance or similarity –

• Manhattan Distance ( Levenshtein Distance , L1 norm) –
• ${ dist }_{ L1 }(\overrightarrow { x } ,\overrightarrow { y } )=\sum _{ i=1 }^{ N }{ \left| { x }_{ i }-{ y }_{ i } \right| }$
• Euclidean distance (l2 norm) –
• ${ dist }_{ L2 }(\overrightarrow { x } ,\overrightarrow { y } )=\sqrt { \sum _{ i=1 }^{ N }{ { \left( { x }_{ i }-{ y }_{ i } \right) }^{ 2 } } }$
• Jaccard similarity-
• ${ Sim }_{ Jaccard }(\overrightarrow { V } ,\overrightarrow { w } )=\frac { \sum _{ i=1 }^{ N }{ min({ V }_{ i } } ,W_{ i }) }{ \sum _{ i=1 }^{ N }{ max({ V }_{ i } } ,W_{ i }) }$
• Dice similarity-
• ${ Sim }_{ Dice }(\overrightarrow { V } ,\overrightarrow { w } )=\frac { 2\times \sum _{ i=1 }^{ N }{ min({ V }_{ i } } ,W_{ i }) }{ \sum _{ i=1 }^{ N }{ max({ V }_{ i } } +W_{ i }) }$

## Dot product :-

Another possibility is take the dot product of $\overrightarrow { V }$ & $\overrightarrow { W }$

The dot product operator from linear algebra, also called the inner product:

${ Sim }_{ DP }(\overrightarrow { V } ,\overrightarrow { W } )=\overrightarrow { V } .\overrightarrow { W }$

${ Sim }_{ DP }(\overrightarrow { V } ,\overrightarrow { W } )={ V }_{ 1 }{ W }_{ 1 }+{ V }_{ 2 } W_{ 2 }+{ V }_{ 3 }{ W }_{ 3 }+……{ V }_{ n }W_{ n }$

Dot product gets high when  two vectors have high values in each direction, regardless of similarity. So more frequent words have higher dot products.

And low ( in fact 0) for orthogonal vectors with zeros in complementary distribution.

Vector length is given by-

$\left| \overrightarrow { V } \right| =\sqrt { \sum _{ i=1 }^{ N }{ { V }_{ i }^{ 2 } } }$

But we don’t want a similarity metric that’s sensitive to word frequency.

## Normalized dot product (or Cosine Similarity ):-

To correct problem in dot product method, we normalize: divide by the length of each vector:

${ sim }_{ NDP }(\overrightarrow { V } ,\overrightarrow { W } )=\frac { (\overrightarrow { V } .\overrightarrow { W } ) }{ \left| \overrightarrow { V } \right| \left| \overrightarrow { W } \right| }$

${ Sim }_{ NDP }(\overrightarrow { V } ,\overrightarrow { W } )=\frac { \sum _{ i=1 }^{ N }{ { V }_{ i }\times { W }_{ i } } }{ \sqrt { \sum _{ i=1 }^{ N }{ { V }_{ i }^{ 2 } } } \sqrt { \sum _{ i=1 }^{ N }{ { W }_{ i }^{ 2 } } } }$

$=\frac { \overrightarrow { V } .\overrightarrow { W } }{ \left| \overrightarrow { V } \right| \left| \overrightarrow { W } \right| }$

If you really see , the above formula represents cosine angle between two vectors.

$\overrightarrow { a } .\overrightarrow { b } =\left| \overrightarrow { a } \right| \left| \overrightarrow { b } \right| \cos { \theta }$

$\cos { \theta =\frac { \overrightarrow { a } .\overrightarrow { b } }{ \left| \overrightarrow { a } \right| \left| \overrightarrow { b } \right| } }$

The normalized dot product is nothing but the cosine of the angle between vectors.

$sim(\overrightarrow { V } ,\overrightarrow { W } )=cos(\overrightarrow { V } ,\overrightarrow { W } )$

So $sim(\overrightarrow { V } ,\overrightarrow { W } )\quad =\quad 1\quad$ ; V & W are in the same direction.

$sim(\overrightarrow { V } ,\overrightarrow { W } )\quad =\quad 0\quad$ ; V & W are orthogonal.

$sim(\overrightarrow { V } ,\overrightarrow { W } )\quad =\quad -1\quad$ ; V & W are in the opposite direction.

Where

• ${ V }_{ i }$ is the PPMI value for word V in context i.
• ${ W }_{ i }$ is the PPMI value for word W in context i.

So cosine ranges from -1 (vectors pointing opposite directions) to 1 (same direction ).

So cosine ranges from -1 (vectors pointing opposite directions) to 1 (same direction ).

But in our case, since raw frequency or PPMI values are non-negative hence cosine range from 0 to 1.

## References :-

https://cs.nyu.edu/courses/fall98/V22.0480-003/ambiguity

https://courses.engr.illinois.edu/cs447/fa2018/Slides/Lecture17HO.pdf

https://www.inf.ed.ac.uk/teaching/courses/fnlp/lectures/15_slides-2×2.pdf

https://web.stanford.edu/~jurafsky/li15/lec3.vector.pdf

https://courses.helsinki.fi/sites/default/files/course-material/4680631/handout_09.pdf

https://towardsdatascience.com/art-of-vector-representation-of-words-5e85c59fee5

https://courses.engr.illinois.edu/cs447/fa2018/Slides/Lecture17HO.pdf

https://en.wikipedia.org/wiki/Pointwise_mutual_information

0