Introduction:-
In our last article we talked about the following topics –
- What is vectorization
- Why Vector Semantics model?
- What is Distributional Hypothesis
- Distributional semantic models (vector-space models)
- Intuition behind Distributional semantics
- Variants of co-occurrence matrix
So if you haven’t checked that out yet, I’d highly recommend reading it here.
Point-wise mutual information (PMI):-
In our last article we saw that raw counts are not a great measure of word association, so we use PMI values in lieu of raw frequency counts.
The PMI score tells us how much more (or less) likely the co-occurrence is than it would be if the two words were independent.
PMI(W,C)=\log _{ 2 }{ \frac { p(W,C) }{ p(W)\quad p(C) } }
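As a quick illustration of the formula, here is a minimal Python sketch; the probability values below are made up purely for demonstration.

```python
import math

# Hypothetical probabilities, chosen only to illustrate the formula
p_w  = 0.001     # p(W): probability of the target word
p_c  = 0.002     # p(C): probability of the context word
p_wc = 0.00001   # p(W, C): probability of seeing them together

# PMI compares the observed co-occurrence probability with the
# probability expected if the two words were independent
pmi = math.log2(p_wc / (p_w * p_c))
print(round(pmi, 2))  # ~2.32 -> they co-occur more often than chance
```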
Positive Point-wise mutual information (PPMI):-
The PMI score can range from −∞ to +∞, but the negative values are problematic:
- Things are co-occurring less than we expect by chance.
- Such estimates are unreliable without enormous corpora.
- Imagine w1 and w2 whose probabilities are each 10⁻⁶.
- It is hard to be sure that p(w1, w2) is significantly different from 10⁻¹².
- Plus it’s not clear people are good at judging “un-relatedness”.
So a negative PMI score tells us that two words co-occur less often than we would expect by chance. For infrequent words, however, we do not have enough data to estimate these negative PMI values accurately. To handle this problem we simply replace negative PMI values with 0.
Positive PMI (PPMI) between word1 and word2 can be written as follows-
PPMI(Word1,Word2)=\max \left( \log _{ 2 }{ \frac { p(Word1,Word2) }{ p(Word1)\quad p(Word2) } } ,0 \right)
The above equation can be rewritten in terms of the entries of the co-occurrence matrix, where

{ p(W,C)=\frac { { f }_{ ij } }{ \sum _{ i=1 }^{ W }{ \sum _{ j=1 }^{ C }{ { f }_{ ij } } } } } \qquad { p({ W }_{ i })=\frac { \sum _{ j=1 }^{ C }{ { f }_{ ij } } }{ N } } \qquad { p({ C }_{ j })=\frac { \sum _{ i=1 }^{ W }{ { f }_{ ij } } }{ N } }

and:
- p(W,C) is the probability of seeing the target word W and the context word C together.
- p(W) and p(C) are the probabilities of the target word W and the context word C occurring on their own, i.e. what we would expect if the two words were independent.
- { f }_{ ij } is the number of times word { W }_{ i } occurs in context { C }_{ j }, and N is the total of all the counts in the matrix.
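Substituting these count-based definitions back into the PPMI equation gives a form that can be computed directly from the matrix entries (with N the sum of all counts, as above):

PPMI({ W }_{ i },{ C }_{ j })=\max \left( \log _{ 2 }{ \frac { { f }_{ ij }\quad N }{ \left( \sum _{ j=1 }^{ C }{ { f }_{ ij } } \right) \left( \sum _{ i=1 }^{ W }{ { f }_{ ij } } \right) } } ,0 \right)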
Computing PPMI on a term-context matrix:-
Let’s understand PPMI with a matrix representation that has W rows (target words) and C columns (context words).
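Here is a minimal Python sketch of this computation using NumPy. The words, contexts, and counts are invented purely for illustration; the steps (counts → probabilities → PPMI) follow the equations above.

```python
import numpy as np

# Hypothetical term-context count matrix f_ij (values invented for illustration):
# rows = target words (W), columns = context words (C)
words    = ["computer", "data", "pinch", "result", "sugar"]
contexts = ["apricot", "digital", "information"]
f = np.array([[0., 2., 1.],
              [0., 1., 6.],
              [1., 0., 0.],
              [0., 1., 4.],
              [1., 0., 0.]])

N    = f.sum()               # total number of co-occurrence counts
p_wc = f / N                 # p(W, C): joint probabilities
p_w  = f.sum(axis=1) / N     # p(W): target-word (row) marginals
p_c  = f.sum(axis=0) / N     # p(C): context-word (column) marginals

# PMI = log2( p(w, c) / (p(w) * p(c)) ); zero-count cells give -inf here
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log2(p_wc / np.outer(p_w, p_c))

# PPMI: replace negative (and undefined) values with 0
ppmi = np.nan_to_num(np.maximum(pmi, 0))
print(np.round(ppmi, 2))
```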
Weighting PMI:-
If you look at the matrix above, you’ll notice that PMI is biased toward infrequent events: very rare words end up with very high PMI values. We can improve PMI further with two possible solutions:
- Use add-k smoothing
- Give rare words slightly higher probabilities (which has a similar effect; a formula sketch follows this list)
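A common way to implement the second idea is context-distribution smoothing: raise the context counts to a power α < 1 when estimating p(C) (α = 0.75 is a typical choice, used here purely for illustration). This slightly boosts the probability of rare contexts and therefore lowers their PMI values:

{ p }_{ \alpha }(C)=\frac { count(C)^{ \alpha } }{ \sum _{ C' }{ count(C')^{ \alpha } } } \qquad PPMI_{ \alpha }(W,C)=\max \left( \log _{ 2 }{ \frac { p(W,C) }{ p(W)\quad { p }_{ \alpha }(C) } } ,0 \right)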
K-Smoothing in PMI computation:-
As we’ve seen, PMI is biased toward infrequent events, in our case the chance co-occurrence of two rare words.
So we add 2 to every cell of the co-occurrence matrix, like below –
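A minimal sketch of this step, using the same hypothetical count matrix as in the PPMI example above; the value k = 2 mirrors the text, but any small constant would do.

```python
import numpy as np

# The same hypothetical count matrix as in the PPMI example above
f = np.array([[0., 2., 1.],
              [0., 1., 6.],
              [1., 0., 0.],
              [0., 1., 4.],
              [1., 0., 0.]])

k = 2
f_smooth = f + k                   # add-k smoothing: add k to every cell

N    = f_smooth.sum()
p_wc = f_smooth / N                # joint probabilities from smoothed counts
p_w  = f_smooth.sum(axis=1) / N    # smoothed target-word marginals
p_c  = f_smooth.sum(axis=0) / N    # smoothed context-word marginals

# Same PPMI computation as before; no zero cells remain, so no -inf values
ppmi_smooth = np.maximum(np.log2(p_wc / np.outer(p_w, p_c)), 0)
print(np.round(ppmi_smooth, 2))
```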
PPMI vs. k-smoothed PPMI:-
Comparing the PPMI matrix with its k-smoothed counterpart shows the effect of smoothing: the very high PPMI values of rare co-occurrences are pulled down, while frequent pairs are much less affected.
So in this article we’ve seen how to calculate the PMI score. In the next article, we’ll talk about how we can use these PMI scores to measure word similarity.