[Corpora-List] EACL 2014 Tutorial on Structured Sparsity in Natural Language Processing

Andre Martins afm at cs.cmu.edu
Fri Mar 7 00:08:38 CET 2014



EACL 2014 Tutorial on Structured Sparsity in Natural Language Processing:

Models, Algorithms and Applications

André F. T. Martins <http://www.cs.cmu.edu/%7Eafm>, Mário A. T. Figueiredo <http://www.lx.it.pt/%7Emtf>, Noah A. Smith <http://www.cs.cmu.edu/%7Enasmith>, and Dani Yogatama

This tutorial will cover recent advances in sparse modeling with diverse applications in natural language processing (NLP). A sparse model is one that uses a relatively small number of features to map an input to an output, such as a label sequence or parse tree. The advantages of sparsity are, among others, compactness and interpretability; in fact, sparsity is currently a major theme in statistics, machine learning, and signal processing. The goal of sparsity can be seen in terms of earlier goals of feature selection and therefore model selection (Della Pietra et al., 1997; Guyon and Elisseeff, 2003; McCallum, 2003).

This tutorial will focus on methods which embed sparse model selection into the parameter estimation problem. In such methods, learning is carried out by minimizing a regularized empirical risk functional composed of two terms: a "loss term," which controls the goodness of fit to the data (e.g., log loss or hinge loss), and a "regularizer term," which is designed to promote sparsity.

The simplest example is L1-norm regularization (Tibshirani, 1996), which penalizes weight components individually, and has been explored in various NLP applications (Kazama and Tsujii, 2003; Goodman, 2004). More sophisticated regularizers, those that use mixed norms and groups of weights, are able to promote "structured" sparsity: i.e., they promote sparsity patterns that are compatible with a priori knowledge about the structure of the feature space. This kind of regularizers has been proposed in the statistical and signal processing literature (Yuan and Lin, 2006; Zhao et al., 2009; Bach et al., 2011) and is a recent topic of research in NLP (Eisenstein et al., 2011; Martins et al, 2011, Das and Smith, 2012, Nelakanti et al. 2013). Some regularizers are even able to encourage structured sparsity, without prior knowledge about this structure (Bondell et al., 2007; Zeng et al., 2013). Sparsity-inducing regularizers require the use of specialized optimization routines for learning (Bach et al., 2011, Wright et al., 2009; Xiao, 2009; Langford et al., 2009).

The tutorial will consist of three parts: (1) how to formulate the problem, i.e., how to choose the right regularizer for the kind of sparsity pattern intended; (2) how to solve the optimization problem efficiently; and (3) examples of the use of sparsity within natural language processing problems.

Tutorial outline:

1. Introduction (30 minutes):

- What is sparsity?

- Why sparsity is often desirable in NLP

- Feature selection: wrappers, filters, and embedded methods

- What has been done in other areas: the Lasso and group-Lasso, compressive sensing, and recovery guarantees

- Theoretical and practical limitations of previous methods to typical NLP problems

- Beyond cardinality: structured sparsity

2. Group-Lasso and Mixed-Norm Regularizers (45 minutes):

- Selecting columns in a grid-shaped feature space

- Examples: multiple classes, multi-task learning, multiple kernel learning

- Mixed L2/L1 and L?/L1 norms: the group Lasso

- Non-overlapping groups

- Example: feature template selection

- Tree-structured groups

- The general case: a DAG

- Coarse-to-fine regularization

- Unknown structure: feature grouping

- Open problems

Coffee Break (15 minutes)

3. Optimization Algorithms (45 minutes):

- Non-smooth optimization: limitations of subgradient algorithms

- Quasi-Newton methods: OWL-QN

- Proximal gradient algorithms: iterative soft-thresholding, forward-backward and other splittings

- Computing proximal steps

- Other algorithms: FISTA, Sparsa, ADMM, Bregman iterations

- Convergence rates

- Online algorithms: limitations of stochastic subgradient descent

- Online proximal gradient algorithms

- Managing general overlapping groups

- Memory footprint, time/space complexity, etc.

- The "Sparseptron" algorithm and debiasing

- Open problems (e.g., non-convex objectives)

4. Applications (30 minutes):

- Sociolinguistic association discovery

- Sequence problems: named entity recognition, chunking

- Multilingual dependency parsing

- Lexicon expansion

5. Closing Remarks and Discussion (15 minutes)


Looking forward to seeing you in the tutorial!

André, Mário, Noah, Dani

-------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: text/html Size: 27639 bytes Desc: not available URL: <https://mailman.uib.no/public/corpora/attachments/20140306/7b2cca2d/attachment.txt>

More information about the Corpora mailing list