Introduction to vector space models


The expressive power of the notations used to represent the vocabulary of a language has long been of great interest in linguistics. In practice, languages exhibit both lexical and semantic ambiguity.

Here are a couple of examples of semantic ambiguity:

“John and Mary are married.” (To each other? or separately?)

“John kissed his wife, and so did Sam.” (Did Sam kiss John’s wife or his own?)

Humans handle text quite intuitively, but what if an enormous amount of text and documents is being generated every single day?

Having humans process all of it is neither scalable nor productive.

So how do we give machines this level of understanding for linguistic modelling tasks? This is where researchers came up with “vector semantics models”, also called “vector space models of meaning” or “distributional models of meaning”, for building knowledge-based systems with a learning capability. These methods have been used in industry for quite a long time.

In this article I’ll introduce you to vector semantics models, which leverage the power of vectorization to represent text.

What is vectorization:-

It’s no secret that machines are better at working with numbers than with text. The process of converting text to numbers is called vectorization. The resulting vectors together form a vector space, a continuous algebraic model in which the rules of vector addition and similarity measures apply.
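As a minimal sketch of this idea, assume the simplest possible scheme: each text becomes a vector that counts how often every vocabulary word occurs in it. The function names `build_vocabulary` and `vectorize` and the toy sentences are made up for illustration; they do not come from any particular library.

```python
def build_vocabulary(texts):
    """Map each unique lowercase token to a vector position (sorted for stability)."""
    tokens = {tok for text in texts for tok in text.lower().split()}
    return {tok: i for i, tok in enumerate(sorted(tokens))}


def vectorize(text, vocab):
    """Return a count vector with one entry per vocabulary word."""
    vec = [0] * len(vocab)
    for tok in text.lower().split():
        if tok in vocab:
            vec[vocab[tok]] += 1
    return vec


# Toy sentences (illustrative only)
texts = ["the cat sat", "the dog sat on the mat"]
vocab = build_vocabulary(texts)
vectors = [vectorize(t, vocab) for t in texts]

# Vector addition works element-wise, as in any vector space.
total = [a + b for a, b in zip(*vectors)]
```

Once every text lives in the same vector space, operations such as addition or a similarity measure can be applied to the vectors directly.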

Before we take a deep dive into the different vectorization approaches we can leverage, let’s first look at a few more real-world use cases where such linguistic models are needed.

Why Vector space model?-

Here are a couple of examples that show why we need vector semantics models to identify word similarity in a corpus.

“gorgeous” is similar to “beautiful”

Question:- What is the height of the Statue of Unity?

Answer:- The Statue of Unity is 182m tall.

Here, “tall” is similar to “height”.

Plagiarism checkers also leverage word similarity methods for plagiarism detection.

One possible solution comes to mind: why not use a thesaurus to check word similarity?

Let’s look at a few problems with relying on a thesaurus for this.

Problems with a thesaurus :-

  • Not every language has a thesaurus, and even where one exists, many words and phrases are likely to be missing from it.
  • New words keep being added to the language, so the thesaurus would need to be updated every year.
  • For historical linguistics, one needs to compare word meanings and their synonyms from time stamp t to time stamp t+1.
  • Thesauri do not work as well for verbs and adjectives.

Distributional Hypothesis:-

Since a thesaurus is not a practical option, we need another solution, one where word similarity is computed automatically. But how do we do it?

You shall know a word by the company it keeps

 (Firth, J. R. 1957:11)

As the famous quote above suggests, what if we could deduce the meaning of a word by looking at its surrounding words? This idea, that “similar contexts imply similar meanings”, is known as the Distributional Hypothesis. Two words are similar if they have similar word contexts.

Vector space models (Distributional semantic models) :-

Let’s look at the snippet of text below:

Corona virus has been declared a global pandemic. Corona virus primarily spreads between people during close contact. Symptoms of Corona virus disease include fever, cough, fatigue, shortness of breath, and loss of sense of smell.

With reference to the above context, if someone asks you what Corona is, you’d be able to tell them that Corona is a virus. (Leave aside the Corona beverage, that’s out of context here :D)

Distributional semantic models, also called vector space models, capture word similarity using co-occurrence matrices (a matrix representation of how often a target word and context words occur together). Let’s understand the basic intuition behind them.

Distributional Semantics : Basic Intuition :-

Represent each word [latex] { W }_{ i } [/latex] as a vector of its contexts. In the matrix representation, if the target word [latex] { W }_{ i } [/latex] co-occurs with a context word, the corresponding cell holds 1; otherwise it holds 0.

Note: In the real world, this matrix would be far more sparse.
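The binary matrix described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the "context" of a word is taken to be the sentence it appears in, tokenization is a crude lowercase regex, and the helper names `tokenize` and `cooccur` are hypothetical.

```python
import re

# Two sentences adapted from the Corona snippet in this article
sentences = [
    "Corona virus has been declared a global pandemic.",
    "Corona virus primarily spreads between people during close contact.",
]


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only (a crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


tokenized = [tokenize(s) for s in sentences]
vocab = sorted({tok for sent in tokenized for tok in sent})
index = {word: i for i, word in enumerate(vocab)}

# matrix[i][j] is 1 if word i and word j share at least one sentence, else 0
matrix = [[0] * len(vocab) for _ in vocab]
for sent in tokenized:
    unique = set(sent)
    for w in unique:
        for c in unique:
            if w != c:
                matrix[index[w]][index[c]] = 1


def cooccur(w, c):
    """Look up the binary co-occurrence entry for target w and context c."""
    return matrix[index[w]][index[c]]
```

Here `cooccur("corona", "virus")` is 1 because the two words share a sentence, while `cooccur("pandemic", "contact")` is 0 because they never do. Even in this tiny example many cells are 0, which hints at how sparse the matrix becomes on a real corpus.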

So how can we implement this idea to capture the semantic meaning of words?

Variants of co-occurrence matrix:-

Let’s look at the different variants of the co-occurrence matrix.

Term-document matrix – It represents how often a word occurs in a document.

Word-context matrix – It represents how often two words co-occur. This is also called a term-term matrix.

Let’s understand each variant and how it can be used to compute similarity.

Term-document matrix :-

Each cell holds the count of how often a word [latex] { W }_{ i } [/latex] occurs in a document.

Each document is represented by a count vector (a column of the matrix, if words are rows and documents are columns) when you use this representation to assess the similarity between two documents.

Each word is represented by a count vector (a row of the matrix in the same layout) when you use this representation to assess the similarity between two words.
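A small sketch may make the two readings of the term-document matrix concrete. It assumes words as rows and documents as columns, whitespace tokenization, and three invented toy documents; `word_vector` and `document_vector` are illustrative helper names.

```python
from collections import Counter

# Toy documents (invented for illustration)
documents = {
    "doc1": "data science uses data",
    "doc2": "sugar and lemon and sugar",
    "doc3": "data about sugar",
}

counts = {name: Counter(text.lower().split()) for name, text in documents.items()}
vocab = sorted({word for c in counts.values() for word in c})
doc_names = list(documents)

# Rows are words, columns are documents; each cell is a raw count.
term_doc = [[counts[d][w] for d in doc_names] for w in vocab]


def word_vector(word):
    """Row of the matrix: how often the word occurs in each document."""
    return term_doc[vocab.index(word)]


def document_vector(name):
    """Column of the matrix: how often each vocabulary word occurs in the document."""
    col = doc_names.index(name)
    return [row[col] for row in term_doc]
```

With this data, `word_vector("data")` is `[2, 0, 1]` and `word_vector("sugar")` is `[0, 2, 1]`, so the two words are described by the documents they appear in.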

In this representation, two documents are considered similar if their vectors are similar.

Likewise, two words are considered similar if their vectors are similar.
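The text above does not say which similarity measure to use. One common choice, assumed here purely for illustration, is cosine similarity, which compares the direction of two count vectors rather than their length. The three vectors below are hypothetical counts over three documents.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical word vectors: counts of each word in three documents
apricot = [3, 0, 1]
pineapple = [2, 0, 1]
digital = [0, 4, 0]

fruit_similarity = cosine_similarity(apricot, pineapple)
cross_similarity = cosine_similarity(apricot, digital)
```

In this toy data "apricot" and "pineapple" occur in the same documents, so `fruit_similarity` is close to 1, while "apricot" and "digital" never share a document, so `cross_similarity` is 0.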

Word-context matrix :-

This approach uses a context window of a certain size (a certain number of words before and after the target word), instead of the entire document, to capture word similarity.

Here each word is defined by a vector of counts over context words.

sugar, a sliced lemon, a tablespoonful of apricot preserve or jam, a pinch each of,

their enjoyment. Cautiously she sampled her first pineapple and another fruit whose taste she likened

well suited to programming on the digital computer. In finding the optimal R-stage policy from

for the purpose of gathering data and information necessary for the study authorized in the

Resulting word-word matrix:

  • f(w, c) = how often word w appears in context c
  • For example, “information” appeared six times in the context of “data”

If the size of the vocabulary (the number of unique words in the corpus) is V, then each vector has length V.

The size of the word-context matrix is therefore V × V.
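The counting function f(w, c) can be sketched as follows, assuming a symmetric window of `window` tokens on each side of the target word and whitespace tokenization. The function name `word_context_counts` is illustrative, and the token sequence is a fragment of one of the example snippets above.

```python
from collections import defaultdict


def word_context_counts(tokens, window=2):
    """Count f(w, c): how often c occurs within `window` tokens of w."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, w in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                counts[w][tokens[j]] += 1
    return counts


tokens = "gathering data and information necessary for the study".split()
counts = word_context_counts(tokens, window=2)

# Dense V x V view of the same counts (rows: target words, columns: context words)
vocab = sorted(set(tokens))
matrix = [[counts[w].get(c, 0) for c in vocab] for w in vocab]
```

With a window of 2, "data" falls inside the context of "information", so `counts["information"]["data"]` is 1; with `window=1` it would not be counted. On a real corpus the dense V × V matrix is rarely materialized because most of its cells are 0.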

The size of the window depends on your goals:

  • The shorter the window, the more syntactic the representation (a window of ±1-3 words is very syntactic)
  • The longer the window, the more semantic the representation (a window of ±4-10 words is more semantic)

Problems with raw counts in co-occurrence matrices:-

We now understand what word similarity is and how a co-occurrence matrix can be used to capture it. Let’s look at a few downsides of using a co-occurrence matrix with raw counts as its entries.

  • With large data sizes, this approach can be computationally expensive.
  • Raw frequency is not a great measure of association between words.

So we would rather have an approach that asks whether a context word is particularly informative about the target word.

Should we use raw counts?

  • For the term-document matrix – We can use tf-idf instead of raw term counts. Check out this article to understand tf-idf in more detail.
  • For the word-context matrix – Positive Pointwise Mutual Information (PPMI) is common.
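As a rough illustration of the first option, the sketch below re-weights a term-document matrix with one common tf-idf variant: raw term count multiplied by log10(N / df), where N is the number of documents and df is the number of documents containing the word. Other variants exist, so treat this formula, the function name `tf_idf`, and the toy counts as assumptions.

```python
import math


def tf_idf(term_doc):
    """Re-weight a term-document count matrix (rows: words, columns: documents)."""
    n_docs = len(term_doc[0])
    weighted = []
    for row in term_doc:
        # df: number of documents in which the word occurs at least once
        df = sum(1 for count in row if count > 0)
        idf = math.log10(n_docs / df) if df else 0.0
        weighted.append([count * idf for count in row])
    return weighted


# Hypothetical raw counts over three documents
term_doc = [
    [5, 4, 6],  # "the": occurs in every document
    [3, 0, 0],  # "apricot": occurs in one document only
]
weights = tf_idf(term_doc)
```

A word such as "the" that occurs in every document gets weight 0 everywhere, whereas a word concentrated in one document keeps a positive weight there, which matches the goal of favoring informative words over merely frequent ones.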

This leads us to the concept of Positive Pointwise Mutual Information (PPMI).

So in our next post we’ll talk about:

  • What Positive Pointwise Mutual Information (PPMI) is.

Article Credit:-

Name:  Sameer Koleshwar
Designation: M.Tech in Signal Processing and Machine Learning, NITK
Research Area: Natural Language Processing
