My notation for the EM algorithm was a bit sloppy, as Nancy pointed out during the lecture. On pages 6, 7, 8, and 10 of my slides [first part of my lecture, before Professor Eric Kolaczyk's talk], I have now written E(z_{ik}|X, \hat{\Theta}) instead of E(z_{ik}|X, \Theta), where \hat{\Theta} denotes the *current* parameter estimate, which gets updated at every EM iteration [a fact made explicit on page 6].
For students who didn't appreciate the subtlety here, my old notation could have given the impression that w_{ik} was a function of \Theta. But that's not the case. In the M-step, w_{ik} takes on a particular numerical value [computed by the E-step from the current estimate \hat{\Theta}]; if w_{ik} were a function of \Theta, the maximization problem in the M-step would become a lot more complicated [and almost defeat the very purpose of EM].
I have also improved page 13 of my slides [again, first part of my lecture]. When we start to model each word [rather than each document] as a mixture, we would, of course, expect the distribution p(\cdot;\theta_k) to take on a slightly different meaning.
On page 12, x_i = (x_{i1}, x_{i2}, ..., x_{id})^T, where each x_{ij} counts the number of times word j [or the j-th word in the vocabulary] appears in document i. In this case, each component of the mixture, p(x_i;\theta_k), is a "usual" multinomial distribution.
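To spell out what the "usual" multinomial on page 12 looks like numerically, here is a small sketch [the function name and the toy counts are mine, purely for illustration]: the log-pmf of a count vector x_i under word probabilities \theta_k, with n equal to the total number of words in the document.

```python
import math

def multinomial_log_pmf(x, theta):
    """log p(x; theta) for a count vector x over a d-word vocabulary,
    i.e. the 'usual' multinomial with n = sum(x) die tosses."""
    n = sum(x)
    # log of the multinomial coefficient n! / (x_1! ... x_d!)
    log_coef = math.lgamma(n + 1) - sum(math.lgamma(xj + 1) for xj in x)
    return log_coef + sum(xj * math.log(tj)
                          for xj, tj in zip(x, theta) if xj > 0)

# A toy 3-word document over a 3-word vocabulary:
x = [2, 1, 0]
theta = [0.5, 0.3, 0.2]
p = math.exp(multinomial_log_pmf(x, theta))
# 3!/(2! 1! 0!) * 0.5^2 * 0.3^1 * 0.2^0 = 3 * 0.25 * 0.3 = 0.225
```

Working in the log domain avoids underflow when documents are long.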
On page 13, I have now used the notation x_{it} [rather than x_{ij}, to avoid confusion] to denote the t-th word in document i, and it could be the first word in the vocabulary [x_{it}=1], the second word in the vocabulary [x_{it}=2], or the j-th word in the vocabulary [x_{it}=j], for j going all the way up to d. In this case, each component of the mixture, p(x_{it};\theta_k), is equal to \theta_{kj} for x_{it}=j. In other words, it is the probability that the t-th word in document i is the j-th word in the vocabulary. We can see that this distribution is still very much "multinomial" in spirit, except we are now tossing the die only once, rather than multiple times. I have revised page 13 of my slides to make this distinction more explicit.
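The contrast with page 12 becomes obvious in code: for a single word, the "multinomial" pmf collapses to a table lookup, one toss of the d-sided die. A minimal sketch [function name mine, for illustration only]:

```python
import math

def word_log_prob(x_it, theta_k):
    """log p(x_it; theta_k) in the page-13 model: x_it is the vocabulary
    index (1..d) of the t-th word in document i, so the probability is
    simply theta_{k, x_it} -- a single toss of the d-sided die."""
    return math.log(theta_k[x_it - 1])

theta_k = [0.5, 0.3, 0.2]
# The t-th word is the 2nd word in the vocabulary, so p = theta_{k,2} = 0.3.
p = math.exp(word_log_prob(2, theta_k))
```

No multinomial coefficient appears, because with one toss there is only one ordering.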
(MZ)