
ANC/DTC Seminar: Cedric Archambeau, Amazon (Host TBC)


Latent IBP Compound Dirichlet Allocation: Sparse Topic Models Fit for Natural Languages.

When: Jun 10, 2014, from 11:00 AM to 12:00 PM
Where: IF 4.31/4.33

I will introduce the four-parameter Indian buffet process (IBP) compound Dirichlet process (ICDP), a stochastic process that generates sparse non-negative vectors with a potentially unbounded number of entries. Sampling repeatedly from the ICDP generates sparse matrices with an infinite number of columns and power-law characteristics. We apply the four-parameter ICDP to sparse nonparametric topic modelling to account for the very large number of topics present in large text corpora and the power-law distribution of the vocabulary of natural languages. The model, which we call latent IBP compound Dirichlet allocation (LIDA), allows for power-law distributions both in the number of topics summarising the documents and in the number of words defining each topic. It can be interpreted as a sparse variant of the hierarchical Pitman-Yor process when applied to topic modelling. We derive a simple and efficient collapsed Gibbs sampler, closely related to the collapsed Gibbs sampler of latent Dirichlet allocation (LDA), making the model applicable in a wide range of domains.
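
As a rough illustration of the kind of object the abstract describes, the sketch below draws a sparse matrix whose rows are non-negative and sum to one. It is not the four-parameter ICDP from the talk. It assumes a finite truncation to a fixed number of columns and the standard one-parameter beta-Bernoulli approximation to the Indian buffet process, followed by a symmetric Dirichlet draw restricted to the active columns. This simplified version does not reproduce the power-law behaviour the talk targets. The function name and the parameters `alpha`, `gamma` and `num_columns` are hypothetical and chosen only for this example.

```python
import random


def sample_sparse_matrix(num_rows, num_columns=50, alpha=5.0, gamma=1.0, seed=0):
    """Draw a sparse row-stochastic matrix (illustrative sketch only).

    Assumptions (not taken from the talk): a finite truncation to
    `num_columns` columns and a one-parameter beta-Bernoulli prior
    standing in for the four-parameter process.
    """
    rng = random.Random(seed)

    # Column inclusion probabilities, shared by all rows. As the number of
    # columns grows, Beta(alpha / K, 1) gives the usual finite approximation
    # to the one-parameter Indian buffet process.
    inclusion = [rng.betavariate(alpha / num_columns, 1.0)
                 for _ in range(num_columns)]

    rows = []
    for _ in range(num_rows):
        # Binary mask: which columns are active in this row.
        mask = [rng.random() < p for p in inclusion]

        # Convenience of this sketch: force at least one active column so
        # that the normalisation below is well defined.
        if not any(mask):
            mask[rng.randrange(num_columns)] = True

        # Symmetric Dirichlet over the active columns, sampled as
        # normalised Gamma(gamma, 1) variables; inactive columns stay at 0.
        weights = [rng.gammavariate(gamma, 1.0) if on else 0.0 for on in mask]
        total = sum(weights)
        rows.append([w / total for w in weights])
    return rows


if __name__ == "__main__":
    matrix = sample_sparse_matrix(num_rows=5, num_columns=20)
    for row in matrix:
        active = sum(1 for w in row if w > 0.0)
        print(f"{active} active columns, row sum = {sum(row):.6f}")
```

In a topic-modelling reading, each row would play the role of a document's topic proportions (or a topic's word distribution), with the mask deciding which entries may be non-zero at all.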

If time permits, I will discuss how this model can be used in the context of an idea management system.