CBOW is a variant of the word2vec model that predicts the center word from a (bag of) context words. So given all the words in the context window (excluding the middle one), CBOW tells us the most likely word at the center.
For example, say we have a window size of 2 on the following sentence. Given the words (“PM”, “American”, “and”), we want the network to predict “Modi”.
The input of the network needs to change to take in multiple words. Instead of a “one hot” vector as the input, we use a “bag-of-words” vector. It’s the same concept, except that we put 1s in multiple positions (corresponding to the context words).
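As a minimal sketch of this input encoding, here is how a bag-of-words vector could be built with NumPy. The vocabulary and word indices below are toy values chosen just for illustration:

```python
import numpy as np

# Toy vocabulary; the word-to-index mapping here is hypothetical.
vocab = {"indian": 0, "pm": 1, "modi": 2, "american": 3, "and": 4}

def bag_of_words_vector(context_words, vocab):
    """Build a multi-hot vector with a 1 at each context word's index."""
    v = np.zeros(len(vocab))
    for w in context_words:
        v[vocab[w]] = 1.0
    return v

x = bag_of_words_vector(["pm", "american", "and"], vocab)
print(x)  # 1s at positions 1, 3, and 4
```

This is the same idea as a one-hot vector, just with multiple positions set.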
The CBOW architecture then looks like the following:
The training samples for CBOW look different from those generated for skip-gram.
With a window size of 2, skip-gram will generate (up to) four training samples per center word, whereas CBOW only generates one. With skip-gram, we saw that multiplying with a one-hot vector just selects a row from the hidden layer weight matrix. What happens when you multiply with a bag-of-words vector instead? The result is that it selects the corresponding rows and sums them together.
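This row-selection behavior is easy to verify numerically. The sketch below uses a small randomly initialized weight matrix (the sizes are arbitrary, chosen only for the demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 5, 3
W = rng.normal(size=(vocab_size, embed_dim))  # hidden-layer weight matrix

# Bag-of-words input with 1s for context words at indices 1, 3, and 4.
x = np.zeros(vocab_size)
x[[1, 3, 4]] = 1.0

hidden = x @ W                # full matrix product
row_sum = W[1] + W[3] + W[4]  # explicit sum of the selected rows

# The product equals the sum of the rows picked out by the 1s.
matches = np.allclose(hidden, row_sum)
```

The 1s in the input act as a mask: each selects one row of the weight matrix, and the matrix product adds the selected rows together.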
For the CBOW architecture, we also divide this sum by the number of context words to calculate their average word vector. So the output of the hidden layer in the CBOW architecture is the average of all the context word vectors. From there, the output layer is identical to the one in skip-gram.
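Putting the pieces together, a CBOW forward pass could be sketched as follows. The matrix names and sizes are assumptions for illustration; the output layer here uses a plain softmax over the vocabulary, as in the skip-gram description above:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 5, 3
W_in = rng.normal(size=(vocab_size, embed_dim))   # input (context) word vectors
W_out = rng.normal(size=(embed_dim, vocab_size))  # output-layer weights

context_ids = [1, 3, 4]  # indices of the context words

# Hidden layer: the average of the context word vectors.
h = W_in[context_ids].mean(axis=0)

# Output layer, identical to skip-gram: scores followed by softmax.
scores = h @ W_out
probs = np.exp(scores - scores.max())
probs /= probs.sum()

predicted = probs.argmax()  # index of the most probable center word
```

Note the only difference from skip-gram is the hidden layer: summing the selected rows and dividing by the number of context words gives their average.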