NIPS 2013 reading list
As usual, here is my nips 2013 reading list. It’s incomplete, and only has papers I could find online as of a couple of weeks ago.
Accelerating Stochastic Gradient Descent using Predictive Variance Reduction; R. Johnson, T. Zhang
Another paper proving a linear convergence rate for a variant of stochastic gradient descent. This builds upon the earlier stochastic average gradient approach, which works by storing the gradient of each training example as well as their sum, and then in each iteration sampling a training example and updating the sum of gradients by how much the gradient with respect to this example has changed since it was last computed.
The contribution of this paper is a proof that, instead of keeping the gradient for every example and a running sum of all gradients, you can keep an old weight vector around together with the sum of gradients computed with respect to that weight vector, efficiently "correct" this gradient sum with respect to one training example at a time, and still get the same convergence rate.
A cool thing is that this lets you implement gradient updates for sparse feature vectors efficiently, since you can keep a scalar multiplier on the full gradient sum you store, as the information that varies from sample to sample is sparse.
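To make the update concrete, here is a small sketch of one outer iteration (my own paraphrase of the scheme described above, not the authors' code):

```python
import numpy as np

def svrg_epoch(w, X, y, grad_fn, step=0.01, num_steps=1000, rng=np.random):
    """One outer iteration of the variance-reduced scheme described above.

    grad_fn(w, x_i, y_i) should return the gradient of the loss on one example.
    """
    w_snapshot = w.copy()
    # Full gradient at the snapshot, computed once per outer iteration.
    full_grad = np.mean([grad_fn(w_snapshot, xi, yi) for xi, yi in zip(X, y)], axis=0)
    for _ in range(num_steps):
        i = rng.randint(len(X))
        # Corrected stochastic gradient: same expectation as the plain one,
        # but its variance shrinks as w approaches w_snapshot.
        g = grad_fn(w, X[i], y[i]) - grad_fn(w_snapshot, X[i], y[i]) + full_grad
        w = w - step * g
    return w
```

The sparse-feature trick is then to apply the dense `full_grad` term lazily, via a scalar multiplier, instead of touching every coordinate at each step.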
An Approximate, Efficient LP Solver for LP Rounding; S. Sridhar, S. Wright, C. Re, J. Liu, V. Bittorf, C. Zhang
This paper has two intertwined contributions: first they show how to approximately solve many LPs that show up in machine learning with coordinate descent on a quadratically-smoothed version of the problem (which looks like ADMM), and then they show that approximate solutions to LP relaxations work almost as well as exact solutions when doing rounding to approximate ILPs.
The results are compelling, and I’m curious to try this quadratic approximation in some LPs and see what happens.
A simple example of Dirichlet process mixture inconsistency for the number of components; J. Miller, M. Harrison
I think this paper makes a very helpful point. In many places and casual conversations one hears dismissive remarks that using a Dirichlet process prior is the way to do inference in models when one doesn't know the actual number of components. This is clearly not always the case, as the Dirichlet process makes some quite restrictive assumptions about the sizes of the produced clusters (a few big clusters and many small clusters, essentially required for exchangeable models), but this is the first work I know of that shows that it actually also misrepresents the number of clusters (which is behavior anyone who has tried fitting a DP mixture model to random data has observed).
Better Approximation and Faster Algorithm Using the Proximal Average; Y. Yu
This paper shows the neat idea that if you have many nonsmooth regularizers in an optimization problem you can still do something fista-like by, instead of computing the expensive prox operator of all the regularizers together, simply doing an accelerated gradient step on the smooth part and averaging the prox operators of the individual regularizers. While this won't give you sparsity, it seems to be efficient, and it's very easy to implement.
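Roughly, with $f$ the smooth part, $g_1, \dots, g_m$ the nonsmooth regularizers, and $\eta$ a step size (my notation), the update looks something like

$$x_{k+1} = \frac{1}{m} \sum_{j=1}^{m} \operatorname{prox}_{\eta g_j}\!\big( x_k - \eta \nabla f(x_k) \big),$$

so each regularizer only ever needs its own cheap prox operator.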
Distributed Representations of Words and Phrases and their Compositionality; T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean
This is the second paper from the google team working on word embeddings. As with the earlier work, the results are pretty impressive. It still strikes me as counterintuitive that these local bag-of-words or skip-gram models end up capturing semantics sometimes better than more complex models which take account of things such as word order. The contributions in this one are using phrases, exploring negative-example training instead of the tree structure, subsampling frequent words, and more nice eye candy of the form A - B + C = D, which is always fun to stare at. To be fair, though, some human selection has to be done, because intuitive semantics is not as commutative as these operations. For example, "Berlin - Germany + France = Paris" looks good, but "Berlin - Paris + France = Germany" really doesn't.
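The eye candy itself is easy to reproduce given any table of word vectors; a quick sketch (with a hypothetical `embeddings` dict from words to numpy arrays):

```python
import numpy as np

def analogy(embeddings, a, b, c, topn=1):
    """Return the words whose vectors are closest (by cosine) to vec(a) - vec(b) + vec(c)."""
    target = embeddings[a] - embeddings[b] + embeddings[c]
    target /= np.linalg.norm(target)
    scores = {w: np.dot(v, target) / np.linalg.norm(v)
              for w, v in embeddings.items() if w not in (a, b, c)}
    return sorted(scores, key=scores.get, reverse=True)[:topn]

# e.g. analogy(embeddings, "Berlin", "Germany", "France") will hopefully return ["Paris"]
```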
Dropout Training as Adaptive Regularization; S. Wager, S. Wang, P. Liang
This paper investigates what regularizer is actually implied by dropout on linear models. It turns out to be something suspiciously close to the regularizers from Adagrad and confidence-weighted learning algorithms, in that they depend on the covariance matrix of the data (which is intuitive, since randomly deleting features is going to have an effect proportional to how correlated the features are). They present reasonably good results on training models with the exact regularizer, as well as a variant of adagrad which is closer to dropout.
Estimation, Optimization, and Parallelism when Data is Sparse; J. Duchi, M. Jordan, B. McMahan
This paper analyzes learning in the common (for me) case of very high dimensional, very sparse feature vectors. As has been empirically observed, adagrad does well, and this paper proves matching lower and upper bounds on performance which are achieved by adagrad, which is nice. Moreover, it proposes and proves good properties about a parallel implementation of dual averaging which is essentially what I've been using for the past year in factorie, with good results.
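For reference, the per-coordinate adagrad step that makes this work so well with sparse gradients (my sketch, not the paper's pseudocode) only touches the nonzero coordinates of each example:

```python
import numpy as np

def adagrad_sparse_step(w, grad_sq_sum, sparse_grad, lr=0.1, eps=1e-8):
    """Apply one adagrad update given a gradient represented as {index: value}."""
    for j, g in sparse_grad.items():
        grad_sq_sum[j] += g * g                           # history of squared gradients
        w[j] -= lr * g / (np.sqrt(grad_sq_sum[j]) + eps)  # per-coordinate step size
    return w
```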
Generalized Denoising Auto-Encoders as Generative Models; Y. Bengio, L. Yao, G. Alain, P. Vincent
This is further work on the line of research interpreting autoencoders as generative models. Interestingly enough, the training algorithms (like Algorithm 1 in this paper) are looking a lot like training algorithms for RBMs, but hopefully with more stability and fewer issues with things like gaussian units. Looking at samples from the distributions defined by autoencoders is really interesting.
Learning word embeddings efficiently with noise-contrastive estimation; A. Mnih, K. Kavukcuoglu
This is very similar to the Google paper I cited above. It shares one contribution, noting that using noise-contrastive estimation is faster than a tree of classifiers (and doesn't create the problem of coming up with a good tree structure). I'm puzzled by the fact that the numbers seem a bit worse than the Google numbers.
Projecting Ising Model Parameters for Fast Mixing; J. Domke, X. Liu
The idea is that since it is known that some model structures (involving low coupling) make sampling efficient, how about picking a model which doesn't fit this template and somehow projecting its parameters into it to improve the efficiency of sampling? This paper explores many ways of doing so, and things based on the reverse KL divergence (minimizing the KL between the distributions induced by the approximate and true parameters), which is similar to a mean-field approximation, work well (though they barely outperform loopy BP, as usual).
Reasoning With Neural Tensor Networks for Knowledge Base Completion; R. Socher, D. Chen, C. Manning, A. Ng
This paper is another one attacking the problem of reasoning in knowledge bases as tensor completion. Here the focus is on learning relations: each entity gets a vector, and each relation (in wordnet or freebase) gets a neural network which takes two entities and produces a score. The model is trained to minimize training loss and performs reasonably well on test data. One clever idea is to represent entities with the average representations of the words which compose them (coming from any neural language model, like Collobert & Weston's); then knowledge is automatically shared across entities which share words, which can be good.
Streaming Variational Bayes; T. Broderick, N. Boyd, A. Wibisono, A. Wilson, M. Jordan
The idea of this paper is to pick ideas from the stochastic variational inference algorithms, which approximate the posterior given the prior and all observations by taking stochastic natural gradient steps and use these techniques for a streaming setting, where you want to approximate the posterior given the previous posterior and one observation. The technique is parallelizable, and seems to work pretty well.
Variance Reduction for Stochastic Gradient Optimization; C. Wang, X. Chen, A. Smola, E. Xing
This is another paper which has a variance-reduction strategy to improve stochastic gradient descent. The idea is to use control variates: you define a known quantity, which is also an expectation of something over the data, but whose errors in each sample you can compute exactly, and whose error is correlated with the stochastic gradients of your function. Then instead of using the actual stochastic gradients you apply to each gradient the correction that would have removed the error in the control variate. This then should minimize the negative effect of having too much variance in your stochastic gradients.
Deriving these control variates seems a bit like black magic to me, but that’s because I’m not familiar with them.
They test the technique on logistic regression and stochastic variational inference, and the results look ok for logistic regression but quite impressive for variational inference.
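In symbols (my notation): if $g_i$ is the noisy gradient computed from example $i$ and $h_i$ is a correlated quantity whose expectation $\bar h = \mathbb{E}[h_i]$ is known, the corrected gradient is something like

$$\tilde g_i = g_i - \alpha \,( h_i - \bar h ),$$

which has the same expectation as $g_i$ but lower variance when $\alpha$ is chosen to track the covariance between $g_i$ and $h_i$.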
My thesis: Combinatorial algorithms and linear programming for inference in natural language processing.
I recently defended my phd thesis at UNICAMP, and I've decided to revive this blog (at a new address) with a version of the text I used in my defense.
So the thesis is about combinatorial and linear programming perspectives for inference in NLP, and how to intermingle good things from both approaches to make things faster, not asymptotically (which is not that interesting, as sentences aren’t getting any longer over time) but in practice, on models which are actually used.
The main tool I’ll use throughout is the linear program. A linear program is an optimization problem (maximization or minimization) where you optimize a linear objective subject to linear constraints. The general form of a linear program is
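(writing it in one standard form, with my own notation: an objective vector $c$, a constraint matrix $A$ and a right-hand side $b$)

$$\max_{x} \; c^\top x \quad \text{subject to} \quad Ax \le b, \quad x \ge 0.$$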
Linear programs have really nice properties. The big one for our cases is duality: for any linear program with a well-defined optimum you can construct a second linear program, called the dual program, whose feasible solutions bound the objective of the original (primal) program.
First write the Lagrangian of the LP, leaving the positivity constraints implicit,
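$$L(x, \lambda) = c^\top x + \lambda^\top (b - Ax),$$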
where lambda, coming from the inequality constraints, is constrained to be nonnegative.
We can then collect terms to get
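$$L(x, \lambda) = b^\top \lambda + (c - A^\top \lambda)^\top x,$$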
which is also the Lagrangian of the following optimization problem
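$$\min_{\lambda} \; b^\top \lambda \quad \text{subject to} \quad A^\top \lambda \ge c, \quad \lambda \ge 0.$$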
That is, each variable of the primal problem is a constraint in the dual problem, and vice versa.
Any feasible setting of this dual problem upper-bounds the objective value of the primal problem. We also have strong duality, which says that the optimal objective values of the two problems are the same. So if we can construct a primal setting and a dual setting with the same objective value we know we've solved the problem.
Moreover, if we have primal and dual optimal variables x* and lambda*, we have complementary slackness, which says that
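$$x^\star_j \,\big(c - A^\top \lambda^\star\big)_j = 0 \;\;\text{for every variable } j \qquad \text{and} \qquad \lambda^\star_i \,\big(b - A x^\star\big)_i = 0 \;\;\text{for every constraint } i;$$

that is, every primal variable is either zero or has a tight dual constraint, and every dual variable is either zero or has a tight primal constraint.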
A final interesting property is that if you have more constraints than variables (and you’re feasible) you are guaranteed to have optimal solutions in which most constraints are satisfied with a slack.
Then because of duality, if you have more variables than constraints you must have optimal solutions in which most variables are set to 0.
This leads to cutting planes and column generation. These techniques originated with the ellipsoid algorithm and the Dantzig-Wolfe decomposition, but they are very simple meta algorithms for solving LPs.
For cutting planes, first you initialize an LP with the primal variables and a subset of the constraints, and solve it to optimality. Then, you look for violated constraints, that is, indices i such that
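$$(A x^\star)_i > b_i,$$

where $x^\star$ is the optimal solution of the restricted problem.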
Then you pick one of these violated constraints, add it to the restricted problem, and solve it again. When you find no violated constraints you're done, and hopefully this will happen before you add all constraints to your problem.
The dual of cutting planes is column generation, and it can be weird at first. You again initialize a restricted problem, but this time with only a subset of the variables. You solve it to optimality, and get the dual variables from the optimality certificate. Then you search over primal variables with positive reduced cost
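$$\bar c_j = c_j - \lambda^\top A_j > 0,$$

where $A_j$ is the $j$-th column of $A$; these are exactly the violated constraints of the dual problem above.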
These are variables which, added to your solution, are expected to bring your objective value up. Then you iteratively add them and re-solve the problem until there are no such variables left, which means all your dual constraints are satisfied and you have a primal-dual optimal pair.
The trick for making cutting planes and column generation algorithms efficient is having procedures which can search for violated primal/dual constraints in less time than it would take to enumerate all constraints.
There are also strong connections between dynamic programming and linear programming. For example, here is a linear program for weighted longest paths on directed graphs. We have an indicator variable for each edge, and if a node is in a path and is not the source or target its number of incoming edges equals its number of outgoing edges, and the target has one incoming edge:
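$$\max_{z \ge 0} \; \sum_{(u,v) \in E} w_{uv}\, z_{uv} \quad \text{subject to} \quad \sum_{u} z_{uv} = \sum_{w} z_{vw} \;\;\forall v \notin \{s,t\}, \qquad \sum_{u} z_{ut} = 1,$$

with one indicator $z_{uv}$ per edge, source $s$ and target $t$ (my notation).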
The dual of this problem has a variable per non-source node and constraints for each edge, and is written as
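$$\min_{d} \; d_t \quad \text{subject to} \quad d_v \ge d_u + w_{uv} \;\;\text{for every edge } (u,v) \text{ with } u \ne s, \qquad d_v \ge w_{sv} \;\;\text{for every edge } (s,v),$$

with one variable $d_v$ per non-source node.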
which, if the graph is acyclic, can be solved by processing the edges in topological order, starting from the source, and iteratively setting the value of an edge's endpoint to be the maximum of its current value and the value of the edge's start point plus the weight of the edge; this is the standard dynamic programming algorithm for longest paths.
The reason linear programs are so important to us now is that the problem of MAP inference is usually specifiable as an optimization problem with a linear objective, and in many cases linear constraints as well.
The problem of inference arises in the structured prediction task. This is when you have something which looks like classification, but the set of valid labels for any given example is too large to enumerate; examples are everywhere in NLP and vision.
The idea of structured linear models is to represent each possible label as being defined by the presence / absence of smaller parts. This also allows learning: each label then becomes many binary classification examples, one per part, and learning the parameters of a structured linear model trains all these per-part classifiers jointly.
The most popular example of a structured linear model is a graphical model, but in the thesis we also make reference to hypergraph models, and the literature has still others.
The idea of a structured linear model is to express each label with a binary indicator vector of which parts are active and a set of valid such vectors, which is always representable as a possibly exponentially large set of linear constraints.
As you saw in the thesis, we can define learning, joint inference, and all sorts of other things just by assuming this structure of a linear objective over a set of valid indicator vectors.
The main problem we will concern ourselves in this thesis is the problem of MAP inference in a structured linear model, or finding a maximizer for
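$$\max_{z \in \mathcal{Z}} \; \theta^\top z,$$

where $z$ is the binary indicator vector over parts, $\theta$ is the vector of part scores, and $\mathcal{Z}$ is the set of valid indicator vectors (again my notation, not necessarily the thesis's).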
The first technique we will look at to accelerate algorithms for this problem is column generation, applied to linear chain graphical models (though the technique does generalize beyond that, as seen in the thesis).
First, a review. The viterbi algorithm for MAP inference in a chain model looks like this:
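$$v_1(x_1) = \theta_1(x_1), \qquad v_i(x_i) = \theta_i(x_i) + \max_{x_{i-1}} \big[\, t(x_{i-1}, x_i) + v_{i-1}(x_{i-1}) \,\big], \qquad \text{MAP score} = \max_{x_n} v_n(x_n),$$

writing $\theta_i$ for the local scores, $t$ for the transition scores and $v_i$ for the usual max-product messages (my rendering of the recursion, with back-pointers omitted).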
Note that this is the longest path algorithm on a trellis whose nodes are (variable, value) pairs, where t(x_i-1,x_i) + theta_i(x_i) plays the role of the edge cost. We can then write the primal LP
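$$\max_{z \ge 0} \; \sum_{i} \sum_{x_{i-1}, x_i} z_i(x_{i-1}, x_i)\, \big[\, t(x_{i-1}, x_i) + \theta_i(x_i) \,\big] \quad \text{subject to} \quad \sum_{x_{i-1}} z_i(x_{i-1}, x_i) = \sum_{x_{i+1}} z_{i+1}(x_i, x_{i+1}) \;\;\forall i, x_i, \qquad \sum_{x_{n-1}, x_n} z_n(x_{n-1}, x_n) = 1.$$

(This is just the longest-path LP above specialized to the trellis, with one variable per transition; I'm glossing over the boundary terms, which the thesis handles precisely.)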
Now that we have a linear program let’s look at the main ingredients for a column-generation algorithm.
First, we need a way to define and solve many restricted versions of the original problem, with fewer primal variables.
Note that the Viterbi algorithm is really fast: it solves this LP in linear time, looking at each primal variable only once, and then revisiting only the primal variables which will be in the final solution. Hence feeding this LP into a solver doesn’t make sense, and we should use Viterbi itself to define and solve the restricted problems.
This is easy: removing a primal variable means removing a constraint, which is equivalent to doing the maximizations in the Viterbi algorithm over fewer values of the previous variable in the chain.
That said, we’re still not done. For this LP the reduced cost of a primal variable is
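$$\bar c_i(x_{i-1}, x_i) = v_i(x_i) - v_{i-1}(x_{i-1}) - t(x_{i-1}, x_i) - \theta_i(x_i),$$

up to the paper's exact sign convention, where the $v$'s are the dual variables, playing the role of the Viterbi messages.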
That is, it is negative exactly when the corresponding Viterbi constraint is violated, and by exactly that amount. However, this is a problem, as checking all the reduced costs then becomes as expensive as running Viterbi itself, which is what we're trying to beat.
This comes down to the earlier point: the key to a column generation algorithm is a good reduced-cost oracle.
Fortunately, we can exploit the structure of the problem to construct such an oracle. Note that the terms in its equation decouple into things which depend only on x_i-1, things which depend only on x_i, and things which depend on both. Moreover, in many models the things which depend on both are a constant throughout the chain, and even when they are not it’s often easy to bound their magnitude. With such a bound, one can go and search first for all values of x_i-1 which could conceivably lead to a negative reduced cost, then conditioned on those values find the set of values of x_i which could lead to a negative reduced cost, and finally evaluate only those values.
While this is technically sufficient for an algorithm which is marginally faster than viterbi, it is possible to get a substantial further speedup by using reduced costs from a special-purpose LP which uses forward and backward information in the chain.
A sketch of its derivation is to first write the LP with two variables per edge, one going forwards and one going backwards. The dual of this LP is a dynamic programming algorithm for MAP message passing. But note that if you give me a solution in which the forward and backward variable are set differently I can give you another solution with the same scores in which they are set identically, so we can just use one set of variables in the primal problem. This leads to a dual problem that instead of having paired constraints like
has “merged” constraints like
and it’s easy to see that satisfying the paired constraints is sufficient but not necessary to satisfy the merged constraints, and yet the objective value of the LP is the same.
Using the reduced cost from the merged constraints now has a very attractive property: we can use global information from both sides of the chain around an LP variable when deciding whether to include it in our restricted LP. Experimentally this leads to big wins.
In the results tables (see the thesis), for the two tasks we tried the column generation approach outperforms viterbi, is as fast as approximate inference, and yet is exact.
For more details and extensions please see the thesis.
Now we’re going to talk about the second contribution in the thesis, a new LP relaxation for joint inference.
Joint inference is a term that has been used a lot over the years to mean somewhat different things at different times. Most uses fall into one of the two following camps.
In the first, you have a thing you want to analyze, like a document, and there are many things you want to predict from it, like pos tags, NER, parse trees, role labels, etc. You can then label some data and train a pipeline of structured linear models for all these tasks, each building on the labels of the ones that come before. Or, you could do them jointly, by allowing later stages in the pipeline to revise the decisions made in earlier stages, like a role labeler changing the phrase attachment on a parse tree, or a parser changing POS tags.
The second meaning of joint inference is when you have many things to process, all similar, and you want to exploit this extra data to do better. For example, when POS tagging unknown words, having many occurrences of the word helps you know its contexts, and decide on its reasonable tags better than acting independently each time. This is what I refer to as corpus-wide inference.
In both these scenarios there are two competing desires: we want to reuse the known-good models and algorithms for the individual tasks, and we want to leverage information from across tasks to improve performance.
Simply defining a big joint structured linear model for all tasks is tricky. Even if inference is tractable in the joint model (say if both tasks are sequence segmentation tasks over the same sequence), it might be much more expensive due to the combinatorial explosion over the number of labels.
Moreover, there is the intuition that the jointness is there really only to adjust the answers of already-good models.
A popular form of doing joint inference is dual decomposition. If you have two or more structured linear models which are mostly independent, and can express jointness constraints (they should agree on the things on which they overlap) with linear equality constraints, you can then write the following primal problem
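$$\max_{x \in \mathcal{X},\, y \in \mathcal{Y}} \;\; \theta_1^\top x + \theta_2^\top y \quad \text{subject to} \quad Ax = By,$$

where $x$ and $y$ are the part indicator vectors of the two models and $Ax = By$ encodes the agreement constraints (again in my notation).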
and then dualize the constraint into a Lagrangian and maximize the Lagrangian over the primal variables to get a dual objective
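$$D(\lambda) = \max_{x \in \mathcal{X}} \,(\theta_1 + A^\top \lambda)^\top x \;+\; \max_{y \in \mathcal{Y}} \,(\theta_2 - B^\top \lambda)^\top y,$$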
which is convex and can be minimized with subgradient steps.
Note that to compute the value and subgradient for this optimization problem, which is all that is necessary to solve it, all we need to do is maximize each structured linear model independently, with reweighted costs. Hence we can reuse the combinatorial structure of the known algorithms for those problems to solve the more complex joint problem.
There are also block-coordinate descent methods for solving this optimization problem if some restrictions are placed on the matrices A and B, essentially that they define “variables” which partition the solution space of each structured linear model into mutually exclusive states.
Then you can define a max-marginal as
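$$m^A_i(\lambda) = \max_{x \,:\, Ax = e_i} \,(\theta_1 + A^\top \lambda)^\top x = \lambda_i + \max_{x \,:\, Ax = e_i} \theta_1^\top x$$

(roughly; I'm simplifying the thesis's notation, and the definition for $B$ is symmetric, with $-\lambda$),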
the best possible score attainable by a label in each part of the partition. Since max marginals over A are linear in dual variables premultiplying A, a locally optimal way of setting the dual variables is
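$$\lambda_i \;\leftarrow\; \tfrac{1}{2}\Big( \max_{y \,:\, By = e_i} \theta_2^\top y \;-\; \max_{x \,:\, Ax = e_i} \theta_1^\top x \Big)$$

(again roughly, ignoring the contribution of the other dual variables inside the maximizations; the thesis has the careful version),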
which will equalize the max marginals of the variables defined by A and B, making them agree on the maximizing label and satisfy the primal constraint.
For local joint models this is fine, but for global joint models it’s a little unsatisfying. In that case you really want to just encourage different models to agree, not force them to. The standard way to encourage two models to agree with DD is to create a separate graphical model with one variable for A and one for B, and one factor connecting them, and then add dual decomposition constraints that force each copy to agree with the original variable.
Then the process of joint inference looks like this. Each model makes its independent prediction, the consensus graphical model disagrees with both, so subgradient messages are passed, inference is rerun in both models, there is still disagreement, and so subgradient messages are sent again, and finally each model finds out about the optimal values of the other model.
This is unsatisfying, because it takes two rounds of messages for the actual models to communicate, since their messages are mediated by an intermediary model.
And this brings us to the second main contribution of this thesis, which is a way of doing this style of joint inference, where factors connect different submodels, in a way that communicates between the models with every message, instead of every other message.
The key idea is to fold the factor of the auxiliary model into the optimization problem itself. First we represent the factor scores in terms of disagreement penalties instead of agreement scores, and get a primal problem like
Representing the max(0, x) with constraints and extra auxiliary variables, which we optimize out, we get a dual problem where, essentially, we’re allowed to remove up to c_ij from the score of a setting x such that Ax = e_i as long as By != e_j.
Moreover, the objective is functionally identical to the dual decomposition objective, except the dual variable lambda’s size is bounded by the penalties.
This means we can still reuse the known inference algorithms as black boxes, as in dual decomposition.
So it’s as if we’re trying to enforce a dual decomposition constraint but we give up if setting lambda sufficiently high does not enforce it.
Similarly to the dual decomposition case, it’s possible to derive a block coordinate descent algorithm for this objective, for faster convergence. To do that we need to restrict ourselves to agreement factors, where c_ij is nonzero only if i == j.
The idea is similar: use max-marginals to find conditions for optimality of the lambdas, and find a lambda satisfying these conditions. The actual execution is trickier than the one for dual decomposition for many reasons. First, the boundedness constraints on lambda mean it’s not possible to simply equalize the max marginals, even if this is what’s specified by the penalties. Secondly, it’s not obvious which value the primal maximizer will have.
Our contribution, however, finds such a value by enumeration and then finds lambdas that satisfy the necessary constraints. The algorithm is pretty hairy, and I won’t put it here.
Experimentally, we compared these algorithms to an implementation of dual decomposition in a corpus-wide task from a recent paper of Sasha Rush and Michael Collins.
In representative experiments (the tables are in the thesis), the new approach empirically outperforms the dual decomposition approach that uses auxiliary consensus models.
As I explain in the thesis, the approach is very general, and all forms of pairwise factors can be represented using it. It’s also easy to make a symmetrized version of the objective which penalizes the absolute value distance between Ax and By, which leads to simpler algorithms.
So as we’ve seen there are many connections between linear programming and the combinatorial algorithms for MAP inference, and we can productively use reasoning in terms of LPs to accelerate the known inference algorithms.
So this is mainly it, and I hope the thesis is readable and doesn't have too many mistakes in it. I also hope to have more time to blog now that this is out of the way.
My #NIPS2012 reading list, part 3 of 3
MAP Inference in Chains using Column Generation
D. Belanger, A. Passos, S. Riedel, A. McCallum
(disclaimer: this is my paper!) Usually when doing exact MAP inference in chains and tree graphical models one uses the viterbi algorithm (or max-product BP in trees), which will consider all possible settings of all pairs of adjacent variables before settling on the optimal solution. At the same time, many inference problems are "easy" in that simple techniques like beam search (only considering K possible "parents" for each variable instead of going over all of them) work really well, and reach solutions which are very similar to the optimum. What we do in this paper is use column generation, a meta-algorithm for solving linear programs that looks like cutting planes in the dual LP, to effectively run exact inference in the same time it takes to do beam search by efficiently pruning most "bad" states from even being considered during inference. We also find out that the pruning efficiency is much higher when we send messages forwards and backwards over the chains, instead of only in one direction, so we're doing more work only because it will make doing less work easier.
Practical Bayesian Optimization of Machine Learning Algorithms
J. Snoek, H. Larochelle, R. Adams
One of the best methodological advances in machine learning recently is the trend towards explicitly dealing with hyperparameter optimization, which is setting things like batch sizes, regularization strengths, learning rates, pruning thresholds, etc, which has historically been dealt with by getting really smart people to build highly complex mental models of the learning algorithms, leading to brittleness and a general difficulty in reimplementing and reproducing good results. The special contribution of this paper is to extend the already-known Gaussian process-based Bayesian optimization approach to hyperparameter optimization to dealing explicitly with the covariance parameters of the GP kernel.
Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models
K. Jiang, B. Kulis, M. Jordan
This is an extension (in a way) of the earlier cool paper by a subset of the authors on how one could generalize k-means to work with Dirichlet processes in a very simple way (essentially create new clusters for all points which are sufficiently far from any of the given cluster centers). This time the nonparametric k-means approach is generalized (in pretty much the same way) to work with any Bregman divergence (instead of necessarily the squared distance) and also extended to the hierarchical DP setting, giving a k-means-like algorithm for LDA, by directly clustering tokens.
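For reference, the rule from that earlier paper looks roughly like this in code (my own sketch of the idea, not the authors' algorithm, with a penalty parameter playing the role of the distance threshold):

```python
import numpy as np

def dp_means(X, penalty, num_iters=20):
    """k-means-like clustering where a point whose squared distance to every
    current center exceeds `penalty` is given its own new cluster."""
    centers = [X[0].copy()]
    for _ in range(num_iters):
        assignments = []
        for x in X:
            dists = [np.sum((x - c) ** 2) for c in centers]
            if min(dists) > penalty:        # too far from everything: new cluster
                centers.append(x.copy())
                assignments.append(len(centers) - 1)
            else:
                assignments.append(int(np.argmin(dists)))
        assignments = np.array(assignments)
        # usual k-means re-estimation step on the clusters that got members
        centers = [X[assignments == k].mean(axis=0)
                   for k in range(len(centers)) if np.any(assignments == k)]
    return centers, assignments
```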
Truncation-free Online Variational Inference for Bayesian Nonparametric Models
C. Wang, D. Blei
This proposes a new variational algorithm for Bayesian nonparametric models that doesn’t need an explicit a priori specification of the truncation parameter. This is achieved by first modifying the variational update on the latent variables to depend on the expected likelihood instead of the exponentiated expected log-likelihood, which they claim leads to a better estimation of the variance of the approximating distribution. They then use Gibbs sampling on the untruncated space to approximately compute these expectations. This idea apparently also extends to stochastic variational inference.
Emergence of Object-Selective Features in Unsupervised Feature Learning
A. Coates, A. Karpathy, A. Ng
This achieves similar results to the ICML google paper on deep neural networks and cat detectors, except with a lot less CPU time, working with patches instead of entire images, and learning invariant-ish features. It is worth reading if only to see how to cleverly use well-known efficient algorithms to learn deep representations that capture a lot of the interesting variability in the data.
Learned Prioritization for Trading Off Accuracy and Speed
J. Jiang, A. Teichert, H. Daume III, J. Eisner
The question asked by this paper is whether it is possible to use learning not to do some task (like parsing) better but to do it faster. This is done by applying reinforcement learning to pruning and prioritization in agenda-based parsing, where the reward function is a weighted linear combination of accuracy and time, and learning is done via policy gradient methods. Except this turns out not to work, because credit assignment is really difficult in this problem, so they add reward shaping (eagerly punishing actions which take time and eagerly rewarding actions which are known to lead to good trees), and also use ideas from apprenticeship learning to add an oracle to the policy gradient search by using a mixed policy that follows an oracle at first but slowly backs off to the learned policy after it learns to correctly mimic the oracle. It is clearly a work-in-progress, but has many interesting insightful nuggets.
Tensor Decomposition for Fast Parsing with Latent-Variable PCFGs
S. Cohen, M. Collins
Parsing with latent-variable PCFGs takes a long time mostly because of the tensor contractions necessary for the inside-outside algorithm, which are cubic in the number of states in the PCFG. The idea behind this paper is to represent these tensors with a low-rank representation, effectively computing approximate messages which is much cheaper than computing the exact ones. Interestingly they can also bound the error in the final parse as a function of the error in the individual low-rank approximations.
My #NIPS2012 reading list, part 2 of 3
FastEx: Fast Clustering with Exponential Families
A. Ahmed, S. Ravi, S. Narayanamurthy, A. Smola
Assuming you want to do MCMC for clustering models with arbitrary exponential families in the clusters, the main computational bottleneck is often evaluating the log-likelihood of each data point in each cluster and normalizing and sampling. This scales badly with the number of clusters, and sounds wasteful. What this paper does then is propose a few generic-ish tricks to speed-up sampling. The first trick is to use a taylor expansion of the likelihood to justify approximate computation of the acceptance probabilities. The second trick is to use hashing kernels to speed up the computation of the dot products between natural parameters and sufficient statistics.
Fast Variational Inference in the Conjugate Exponential Family
J. Hensman, M. Rattray, N. Lawrence
Another new variational inference paper. This one generalizes the collapsed variational algorithms in the literature and shows that in the new lower bound the older coordinate ascent updates are actually steepest descent updates. This opens up the possibility of using better updates, like conjugate gradient, which then seems to converge much more quickly. The technique looks general enough, and I’ll still need to read more carefully to understand what is going on in there.
Hamming Distance Metric Learning
M. Norouzi, R. Salakhutdinov, D. Fleet
The idea is to directly learn a mapping from real-valued inputs to binary vectors (not singleton binary vectors) in a way that preserves a known ranking-based similarity. The training data is a list of triples stating that A and B are closer to each other than A and C, from which you can then write down a hinge-like loss with a margin and optimize. The key trick seems to be an efficient way of performing loss-augmented inference over binary codes.
Identifiability and Unmixing of Latent Parse Trees
P. Liang, S. Kakade, D. Hsu
This paper presents two cool results. The first (and more interesting, in my opinion) is a numerical test, based on the Jacobian of the constraints, for whether a set of moment-matching constraints is enough to identify the parameters of a latent variable model. Intuitively, the parameters are identifiable if the Jacobian of the constraint violation vector has the same rank as the dimensionality of your parameter vector (that is, whenever there is no linear subspace in the neighborhood of a constraint-satisfying parameter vector where all vectors are also constraint-satisfying). Then they use these tests to investigate identifiability of some models that look like PCFGs and dependency parsing models, and also they show how to derive some algorithms using this "unmixing" technique I don't quite understand yet.
Interpreting prediction markets: a stochastic approach
N. Della Penna, M. Reid, R. Frongillo
This seems to be augmenting the known deep connections between prediction markets and online learning. So apparently not only (as it was known) are market makers optimally setting prices by looking at the gradient of a strongly convex loss function but the actual process by which the market works looks like mirror descent in the limit of high liquidity.
Learning as MAP Inference in Discrete Graphical Models
T. Caetano, X. Liu, J. Petterson
This is an unusual paper. It tries to learn a linear binary classifier using discrete instead of continuous optimization as its foundation, rethinking some of the tradeoffs that have been taken for granted in much of machine learning. They formulate loss functions such that the loss of making a certain prediction on a given example is a sum of loss functions that operate on possibly overlapping subsets of features (specifying cliques in a graph over the features) and the true label. This ends up looking similar to dropout for neural networks ( http://arxiv.org/pdf/1207.0580.pdf ), in that a large family of predictors that share parameters and operate on subsets of the possible features are all simultaneously trained to achieve low error. The specific way they do so, however, involves discretizing the parameter space and using the 0/1 loss directly, leading to learning being cast as inference in these large discrete graphical models.
Majorization for CRFs and Latent Likelihoods
T. Jebara, A. Choromanska
In the old days the parameters for logistic regression and conditional random fields used to be estimated using fairly specific upper-bound-based techniques, where a sequence of quadratic upper bounds on the log loss was defined in a way that each such bound could be optimized analytically while guaranteeing some improvement in the loss function. These methods, however, were eventually dropped in favor of generic gradient-based optimization techniques like L-BFGS and SGD. This paper presents a new way of optimizing CRFs using quadratic upper bounds, except these bounds are much tighter, and hence the optimization ends up being faster than generic techniques. More interestingly, this bound also applies for latent-variable CRFs and other interesting models, and there the gains are even more impressive.
My #NIPS2012 reading list, part 1 of 3
This is part 1 of my personal, biased, and incomplete list of interesting NIPS 2012 papers. It’s sorted roughly alphabetically, but not quite.
Accuracy at the Top
S. Boyd, C. Cortes, M. Mohri, A. Radovanovic
The key idea is to replace optimizing precision@k with optimizing accuracy at the “topmost” fraction of a results list. The trick is to use the same kind of quantile loss that has been used before in quantile regression, which looks like an absolute value loss where overpredicting and underpredicting are penalized by different amounts, in a way that not ranking a good thing on the top of the list is “less bad” than ranking a bad thing on top of the list. The main novelty here seems to be that they add a hinge-and-margin to this loss and use this to make generalization bounds.
Active Learning of Model Evidence Using Bayesian Quadrature
M. Osborne, D. Duvenaud, R. Garnett, C. Rasmussen, S. Roberts, Z. Ghahramani
The idea here is to use Bayesian methods to solve numerical integrals of hard problems. If you approximate the function being integrated with a Gaussian process, then you can integrate it analytically in an interval (by integrating the mean function of the GP), and also use the predicted variance of the Gaussian process to choose on which points to evaluate your unknown function, much like in Bayesian optimization. These things then turn out to be useful to compute partition functions, in a way responding to concerns by Larry Wasserman and others.
A quasi-Newton proximal splitting method
S. Becker, J. Fadili
The key contribution here is a new quasi-newton (like L-BFGS) algorithm for l1-regularized (and other nonsmooth) optimization problems. This comes from their derivation of a new way of computing a proximal step when the smooth part doesn’t decompose into a sum of independent terms. It seems to be more robust than state-of-the-art methods but doesn’t necessarily beat all of them in their examples.
A Spectral Algorithm for Latent Dirichlet Allocation
A. Anandkumar, D. Foster, D. Hsu, S. Kakade, Y. Liu
This is an “exact” algorithm for learning LDA. It’s a simple-ish extension of their work on mixture and multi-view models. Unfortunately, the actual algorithm in the paper is not particularly stable, and the power iteration from http://arxiv.org/abs/1210.7559 should be used instead. The key idea is to look at the third-order moment tensor of the words in LDA and realize that “decorrelating” its dimensions with PCA leads to a tensor that is the sum of outer products of orthogonal tensors, which then can be decomposed in many ways to get the true topics (assuming, of course, that the data was generated by the model).
A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets
N. Le Roux, M. Schmidt, F. Bach
This is really interesting. Apparently it is possible to make SGD converge much faster in later iterations if instead of approximating the gradient of the loss by the gradient in a single example you approximate it by the sum of the last time you computed the gradients on all examples (and then you compute the gradient on a new example and update this sum). The convergence rates then are as good as batch gradient descent but (ignoring the issue of sparsity) the cost per iteration is as good as the cost of a single stochastic gradient descent iteration.
Continuous Relaxations for Discrete Hamiltonian Monte Carlo
Z. Ghahramani, Y. Zhang, C. Sutton, A. Storkey
This does something I've always wanted to see done, which is using gradient-based samplers (like Hamiltonian Monte Carlo) to sample from discrete distributions with pairwise interactions. The idea is to add an auxiliary variable that depends on the discrete variables in a way that the normalization constant of this auxiliary variable has a term that "cancels out" the interactions between the discrete variables (they call this the gaussian integral trick). Then conditioning on this "mean field" the discrete variables become independent logits, and can even be integrated out entirely. This means then that you can sample directly in the space of this "mean field" with HMC, which mixes much better than steps on the discrete variables directly. This technique looks really promising, as it should often be possible to come up with these auxiliary continuous densities.
Convergence Rate Analysis of MAP Coordinate Minimization Algorithms
O. Meshi, T. Jaakkola, A. Globerson
It's hard to get convergence analyses of coordinate descent algorithms because, intuitively, ill-conditioning and ill-smoothing should make coordinate descent work arbitrarily badly (as the good descent directions will not align with the coordinates). So it's interesting to see that the structure of smoothed MAP problems allows for rates of convergence that are almost as good as Fista's and other algorithms.
EMNLP 2012 Reading List
This year’s EMNLP has a large number of really interesting papers. Here’s a small list with summaries of some that caught my eye (unfortunately I’m not going, so this is all done without the benefit of presentations or conversations).
Improved Parsing and POS Tagging Using Inter-Sentence Consistency Constraints
Sasha Rush, Roi Reichart, Michael Collins, Amir Globerson
http://aclweb.org/anthology-new/D/D12/D12-1131.pdf
This is another paper describing useful improvements from global inference, this time using dual decomposition. The idea is to constrain inference in the test set such that it agrees on “stuff” (edge pos tags in dependency parsing, word type pos tags in pos tagging) in the test set that was not seen in the training set. It should be easy to flip this around as a semi-supervised learning mechanism instead of a global inference one, and I’m curious to see how using dual decomposition during training would compare with generalized expectations or posterior regularization (both of which do constrained marginal instead of MAP inference).
Wiki-ly Supervised Part-of-Speech Tagging
Shen Li, João Graça, Ben Taskar
http://aclweb.org/anthology-new/D/D12/D12-1127.pdf
The title is not the only clever thing about this paper. The idea is to use dictionaries (wiktionary, in this case) as supervision for transferring knowledge of POS tagging from one language to others. The wiktionary has a definition and general-ish part-of-speech tag for each word, together with links between "the same" word across languages. Exploiting these links turns out to be more useful than using parallel text or plain unsupervised learning for many languages. The best part is that this is mostly orthogonal to the state-of-the-art, so presumably even better results can be had by augmenting parallel text + label propagation with wiktionary.
Multi-Domain Learning: When Do Domains Matter?
Mahesh Joshi, Mark Dredze, William Cohen, Carolyn Rosé
http://aclweb.org/anthology-new/D/D12/D12-1119.pdf
This is one of a few very interesting empirical papers in this EMNLP (the other of note is the LDA paper I mention below). The question it tries to answer is when and why domain adaptation methods work, and it does so by finding some confounding factors and seeing their impact. I like that the empirical evaluation is done by messing with the training data (artificially distorting in-domain data to look like out-of-domain data in some particular ways without actually being so). The conclusions are really fuzzy, and cast a healthy amount of doubt on the “reality” of the results in domain adaptation papers. The most puzzling conclusion is that just randomly creating domains and using domain adaptation often helps performance more than using “real” domains, which suggests that maybe rebranding this line of research as “ensemble methods” and moving on would not be a bad idea.
Learning-based multi-sieve coreference resolution with knowledge
Lev Ratinov, Dan Roth
http://aclweb.org/anthology-new/D/D12/D12-1113.pdf
This does something I've wanted to do for a while (but never got around to doing), which is to fold the sieve-based structure of the Stanford coreference resolution system into a learning system (then the sieve idea is really inference folded into learning, which is pretty cool). The key insight (I quote) is that "different types of mention pairs may require a different co-ref model". They argue that sentence structure matters more for within-document coreference while string match matters more for across-document coreference. All the careful text examples help one understand why and how the system works.
Fast Large-Scale Approximate Graph Construction for NLP
I’m mostly bookmarking this for later reference. I’ve wasted far too much time implementing approximate nearest neighbor computation in large sparse datasets (there’s even a metaoptimize question I asked about this). The trick is to use PLEB: permute the coordinates randomly, sort, and use that to find some high-precision near neighbors; repeating this process allows for a better recall.
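A rough sketch of that trick, as I understand it (hypothetical code, assuming you already have a binary fingerprint per item, e.g. from random projections):

```python
import numpy as np

def pleb_candidates(signatures, num_rounds=10, window=2, rng=np.random):
    """Candidate near-neighbor pairs from random-permutation + sort rounds.

    `signatures` is an (n, b) array of binary fingerprints; each round permutes
    the bit positions, sorts the rows lexicographically, and pairs up rows that
    land close to each other in the sorted order.
    """
    n, b = signatures.shape
    candidates = set()
    for _ in range(num_rounds):
        perm = rng.permutation(b)
        permuted = signatures[:, perm]
        order = np.lexsort(permuted.T[::-1])   # lexicographic sort of the rows
        for pos in range(n):
            for off in range(1, window + 1):
                if pos + off < n:
                    a, c = order[pos], order[pos + off]
                    candidates.add((min(a, c), max(a, c)))
    return candidates
```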
Exploring topic coherence over many models and many topics
Keith Stevens, Philip Kegelmeyer, David Andrzejewski, and David Buttler
http://aclweb.org/anthology-new/D/D12/D12-1087.pdf
This is the other really good empirical paper I mentioned earlier. It sets out to compare three broadly distinct classes of "topic" models: matrix factorization (LSA), non-negative matrix factorization (NMF), and matrix factorization into probabilities (LDA), as well as revalidate recently published topic coherence scores (two very similar papers, one from UMass and the other from UCI, proposing almost the same metric). Some conclusions are as expected: unconstrained matrix factorization (LSA) generates better features and compresses better than the other models, which in turn provide "crisper" and more "natural-looking" topics (indeed the eye-candy was probably partially responsible for the success of LDA). Also the metrics are really sensitive to hyperparameter settings, and don't seem to correlate that closely with human judgment (which was visible from their original papers), but computing them based on a larger corpus (as the UCI version does) seems to be more discriminative than computing them on the same corpus in which the topics were fit (as the UMass version does). It's worth reading, and debating the conclusions.
Parse, price, and cut—Delayed column and row generation for graph based parsers
Sebastian Riedel, David Smith, Andrew McCallum
http://aclweb.org/anthology-new/D/D12/D12-1067.pdf
This paper is another clear win for LP relaxations as a way of reasoning about inference. It essentially uses cutting planes in the primal and dual inference problems (cutting planes in the dual being referred to as column generation) with oracle strategies informed by the domain to dramatically cut inference time in grandparent-arc dependency parsing models. One sad thing is that all this clever inference is only used at test time, and the models are trained with loopy BP to approximate the likelihood gradient. I wonder if something can be done about that.
Spectral dependency parsing with latent variables
Paramveer Dhillon, Jordan Rodu, Michael Collins, Dean Foster, Lyle Ungar
http://aclweb.org/anthology-new/D/D12/D12-1019.pdf
A mix of two interesting things (spectral methods and latent variable parsing) that seems to work well enough. The setup of using a generative model to rerank the output of a discriminative one seems odd, and I’d like to see how this model would fare by itself in parsing.
ICML 2012 Reading List
Again, here’s my reading list for ICML 2012. I couldn’t summarize papers I didn’t find online as of last week, so those are missing from this list.
Online Structured Prediction via Coactive Learning
www.cs.cornell.edu/people/tj/publications/shivaswamy_joachims_12a.pdf
Pannaga Shivaswamy, Thorsten Joachims
The idea here is to work in a model that sits in the middle of online learning and bandits: instead of observing the reward of the current action, the system observes a different action, presented as an alternative. It matches really well with all sorts of implicit feedback models in the web, and has strong regret bounds. I really like this direction, and I think this is the way to go for all sorts of problems that rely on user data. The one disadvantage of this with regards to bandits is that there is no clear need for the system to explore if it assumes the user will always report what amounts to a margin violation. www.princeton.edu/~sjgershm/GershmanHoffmanBlei12.pdf Samuel Gershman, Matt Hoffman, David Blei The idea here is to replace the parametric approximations normally used in mean-field algorithms (like approximating random variables with others with the same functional forms but no dependence) with nonparametric approximations based on mixtures of gaussians. The cool thing is that to fit these mixtures of gaussians all you need are the first two derivatives of the log-likelihood, and not expectations, so you can apply these techniques to nonconjugate models, and also capture multi-modality (because mixtures of gaussians are by nature “bumpy”). This seems to take a lot of the pain out of deriving variational algorithms, although it still needs to be extended to a lot of settings (and maybe the pain will come back then). http://arxiv.org/pdf/1112.6209v2.pdf Quoc Le, Marc'Aurelio Ranzato, Rajat Monga, Matthieu Devin, Greg Corrado, Kai Chen, Jeff Dean, Andrew Ng When I first heard about the scale and results of this paper, it occurred to me that I would not have thought it was possible were I not seeing it. Essentially, learning a neural network from scratch on far more data and machines that anything I’ve heard of ca capture all sorts of high-level phenomena that people have been talking about for ages but of which there was no evidence so far. For example, this automatically learns face detectors from many images where faces show up in different positions and orientations. Also because it was trained in internet data it managed to learn a cat detector with no supervision. www.tiberiocaetano.com/papers/2012/DefCae12.pdf Aaron Defazio, Tiberio Caetano In collaborative filtering, a lot of very smart machine learning has been applied to low-rank and latent-space-based models, which makes sense because matrix factorizations are one thing which ML people love. On the other hand, neighborhood methods have been treated as hacky approximations most of the time. This paper uses MRFs to directly regularize the predictions of neighborhood models, and learns these regularizers from data using something that looks a little like pseudolikelihood. Thomas Desautels, Andreas Krause, Joel Burdick Bayesian optimization is a really interesting way of optimizing a very expensive function (you approximate it with a gaussian process and then query the gaussian process to decide where to test your function), but it doesn’t naturally lead itself to a batch setting (where you try many actions at once). Here they find surprisingly that greedily selecting actions in a batch assuming that earlier actions returned what was expected of them works well if you are careful about the variances. I can see this being useful in practice. 
http://www2.math.uu.se/~rmann/papers/optimal_learning.pdf Roman Garnett, Yamuna Krishnamurthy, Xuehan Xiong, Jeff Schneider, Richard Mann The scenario described in this paper is akin to playing battleship: what if you want to run an algorithm to explore an unknown landscape, but you only get points for finding out specific “useful” features of this landscape, instead of generally reducing your uncertainty? This also maps really well to recommendation systems (where you want to recommend items at the same time that are good and that let you recommend other good items in the future). This paper does a bayesian decision theoretical treatment of this problem, and finds useful things such that looking ahead is almost always useful, and also finds efficient algorithms for this lookahead. www.cs.cmu.edu/~sahong/papers/icml2012-yue-hong-guestrin_long.pdf Yisong Yue, Sue Ann Hong, Carlos Guestrin The idea here is as follows: you want to learn something like a recommender, but you want to avoid exploring all possible directions. So you first constrain yourself to exploring only a linear subspace of the parameters, and relax this constraint as appropriate to make your classifier more accurate. It interpolates really nicely between the case where the assumption of the low-rank subspace is verified and the case where it is not, and looks pretty good in the experiments. http://probabilistic-optimization.org/KernelNewtonICML.pdf Philipp Hennig, Martin Kiefel Quasi-newton methods (like lbfgs) keep using gradient information across iterations to build an estimate of the Hessian. The idea in this paper is to try to do this estimation using bayesian inference. First they clarify a lot of the properties of known quasi-newton methods from an inference perspective, and then extend and propose a new nonparametric bayesian quasi-newton method which looks like it’s far more efficient in terms of function evaluations (how will it fare in terms of wall clock time is not so clear). The experiments are a little weak, in that no ML problems are used, but the idea seems really interesting. www.cs.berkeley.edu/~jordan/papers/kulis-jordan-icml12.pdf Brian Kulis, Michael Jordan So it turns out that the hacky rule of “if a point is sufficiently far from any cluster when doing k-means just give it its own cluster” actually turns out to be derivable from the Dirichlet process. I really like this because it lets you turn things around and thing of other uses of the DP as essentially applying this simple rule probabilistically and in more complex settings. I’m not sure why they derived the updates from the sampler, however, as I think something like EM or the collapsed variational DP would behave similarly in the limit if you used hard instead of soft assignments in the variational distribution. Still, this is a must-read for those interested in NPbayes. www.cs.utah.edu/~piyush/recent/mfamtlICML12.pdf Alexandre Passos, Piyush Rai, Jacques Wainer, Hal Daume III My paper (actually Piyush is an equal contributor)! The idea here is to try to come up with a single model that can adapt itself at training time to the many different assumptions that have been succesfully used in transfer learning: things like clustering, low-rank parameter matrices, manifolds, etc. Just for fun we did this using nonparametric bayesian methods, using a nonparametric mixture of nonparametric factor analysers. While the final algorithm is not scalable as it is, I think it might be a basis for something actually useful in this field. 
www.di.ens.fr/~fbach/icml2012_herding_final.pdf Francis Bach, Simon Lacoste-Julien, Guillaume Obozinski I really like the papers on herding, as that seems to be a neat idea that is deeply connected with all sorts of other interesting things. This paper shows that herding is actually something like Frank-Wolfe for the problem of moment matching. Things break down a little when improving it from an optimization perspective leads to a worse learner, which I think indicates that the explanation is not yet complete. www.cs.princeton.edu/~blei/papers/MimnoHoffmanBlei2012.pdf David Mimno, Matt Hoffman, David Blei The idea here seems to do something like Matt Hoffman’s online variational LDA paper but starting from the collapsed representation of LDA (with only z variables for the tokens) rather than the standard representation (with zs but also with phi and theta for documents and topics), and also using a joint (rather than a factorized) variational distribution. The key trick is that while evaluating expectations over this joint q is hard, sampling from it is easy, as it looks a lot like Gibbs sampling for LDA except with digammas in it. Then you can use these samples to approximate the expectations and end up with a stochastic natural gradient algorithm that is sparse and scalable. http://arxiv.org/pdf/1103.4296.pdf John Duchi, Martin Wainwright, Peter Bartlett It is known that SGD and proximal methods converge really quickly to the optimal solution when the problem is smooth (that is, differentiable everywhere), but there are no known results for when the problem is not smooth. Here they propose a really simple algorithm (instead of using one sample of the objective function per iteration to compute the gradient, use many samples, taken not from the current value but from that value plus some added noise, and then this noise, which is annealed down over time, ensures that there are always derivatives) that has good provable bounds. Bayesian Posterior Sampling via Stochastic Gradient Fisher Scoring www.ics.uci.edu/~sungjia/publications/SGFS_ICML12_v12_cameraready.pdf Sungjin Ahn, Anoop Korattikara, Max Welling I think this won the best paper award. It is a follow up on Welling and Teh’s 2011 paper on bayesian SGD with Langevin dynamics, except it fixes some problems with that algorithm. The idea here is to find something that works really well in the limit where the posterior is gaussian (lots of data) when possible, but also degrades smoothly to the langevin approach when not possible. This allows them to use rather large learning rates, and relying on the natural gradient rather than the actual gradient also helps converge more quickly. It seems really interesting, and I haven’t finished digesting it.Nonparametric variational inference
Nonparametric variational inference
Building high-level features using large scale unsupervised learning
A Graphical Model Formulation of Collaborative Filtering Neighbourhood Methods with Fast Maximum Entropy Training
Parallelizing Exploration-Exploitation Tradeoffs with Gaussian Process Bandit Optimization
http://las.ethz.ch/files/desautels12parallelizing.pdf
Passive-aggressive perceptrons and trust-region methods
Recently I’ve been reading up on some optimization things, and I think I saw an interesting connection between trust-region methods for nonlinear optimization (things like Levenberg-Marquardt) and the passive-aggressive (PA) family of online learning algorithms (as well as many proximal-gradient methods). To do so I’ll first explain trust-region methods, then PA, and then the connection.
Trust-region methods, essentially, are a way of making sure that Newton’s method works. In Newton’s method you pick the function f you want to optimize, approximate it locally by a truncated taylor series, and minimize the approximation. So you’re actually minimizing
f(x0) + grad f(x0)^T (x - x0) + (1/2) (x - x0)^T H (x - x0),
where H is the hessian (the matrix of second derivatives) of f. This has an analytic minimum at
x = x0 - H^-1 grad f(x0)
(you could invert the matrix, but it’s faster and arguably at least as stable to just solve the linear system H (x - x0) = -grad f(x0) with least squares).
One problem with newton’s method is that you’re usually not minimizing a quadratic (otherwise the taylor approximation would be exact and a single step would land on the minimum), so there is an error in your taylor approximation (not only that, but in really bad cases your hessian can fail to be positive definite, which will lead to all sorts of weird behavior when you try to solve the system above). There are, then, two strategies you can use to make it practical: line search (that is, trying many step lengths in the direction of H^-1 grad f and choosing the best), which is what is most commonly done, and trust-region methods, which consist of adding a diagonal term to the hessian to make the linear system “nicer” at the expense of a poorer approximation of f. So in a mile-high view all trust-region methods do as above, using
H + lambda I
instead of H, where lambda controls how much you damp the step.
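Here is a minimal numpy sketch of the damped step this describes, assuming you can evaluate gradients and Hessians; it only shows the modified linear solve, not the logic a real trust-region method uses to grow and shrink lambda between iterations.

```python
import numpy as np

def damped_newton_step(grad, hess, x, lam):
    """One damped Newton step: solve (H + lam * I) d = -grad f(x), move by d.

    Larger lam shrinks the step and keeps the linear system well conditioned,
    at the cost of trusting the quadratic model less (the trust-region flavor).
    """
    g = grad(x)
    H = hess(x)
    d = np.linalg.solve(H + lam * np.eye(x.size), -g)
    return x + d

# Toy usage on f(x) = sum(x^4): gradient 4*x^3, hessian diag(12*x^2).
x = np.ones(3)
for _ in range(30):
    x = damped_newton_step(lambda v: 4 * v**3,
                           lambda v: np.diag(12 * v**2),
                           x, lam=1e-3)
```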
Now to online learning. One problem with things like the perceptron is that they update willy-nilly, often taking unnecessarily big steps on examples that were only barely misclassified and taking not nearly big enough steps on the actually hard examples. You can see the perceptron as doing SGD (technically, stochastic subgradient descent) on the following odd-looking loss function
loss(w; x, y) = max(0, -y w^T x)
for y being either 1 or -1. A subgradient of this is -y x if the example is misclassified and 0 if it’s correctly classified, so the SGD step is exactly the familiar perceptron update. The passive-aggressive family of online learning algorithms roughly tries to fix this eagerness of the perceptron by doing conservative updates. That is, rather than doing stochastic gradient steps to minimize the loss function above, take the smallest possible step needed to make it zero. Technically, it’s the following optimization problem:
new w = argmin_w (1/2) ||w - old w||^2 subject to loss(w; x, y) = 0.
This problem is commonly solved by looking at its dual, which has an analytic solution. The lagrangian is
(1/2) ||w - old w||^2 + alpha loss(w; x, y)
for a multiplier alpha >= 0. This is, of course, equivalent (in the sense that it leads to the same solution) to
(1/(2 alpha)) ||w - old w||^2 + loss(w; x, y)
and if you look at the Hessian of that, you’ll see that all the first term does is add a multiple of the identity matrix to the Hessian of the original loss function (which doesn’t even exist, as it’s not quite differentiable). The way you end up tuning the alpha step size in PA is very similar to how you’d choose the lambda in a trust-region method, with the difference that you have a closed-form in PA which you don’t in the trust-region method.
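For concreteness, here is a sketch of the standard PA update with the usual margin-1 hinge loss (slightly different from the margin-less loss I wrote above, but the closed form has the same shape); this follows the textbook PA/PA-I update rather than anything specific to this post.

```python
import numpy as np

def pa_update(w, x, y, C=None):
    """Passive-aggressive update for binary classification with the hinge loss.

    Takes the smallest step that zeroes the loss on (x, y); if C is given,
    the step size is capped (the PA-I variant), much like capping lambda
    would in a trust-region method.
    """
    loss = max(0.0, 1.0 - y * np.dot(w, x))
    if loss == 0.0:
        return w                      # passive: already correct with margin
    tau = loss / np.dot(x, x)         # aggressive: closed-form dual solution
    if C is not None:
        tau = min(tau, C)
    return w + tau * y * x
```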
Hat tip to Michael Wick, with whom I discussed it this morning.
Algorithms versus ideas (or learning reductions versus perceptron updates)
One of the most entertaining parts of NIPS 2011 was sitting in Mark Reid’s talk and the panel session in the Relations between machine learning problems workshop in Sierra Nevada. There was a long discussion on mapping the landscape of machine learning in a useful and compact way. This is clearly necessary, as there are currently hundreds of variations of even simple things such as binary classification that are only loosely related to each other. There are also grand unifying paradigms, such as structural risk minimization, neural networks, bayesian machine learning, etc, which tell incompatible stories as far as assumptions, inputs/outputs, and algorithms go, although some mapping between these approaches is possible (for example you can pick almost any neural network, tack a probability distribution on top of it and call it bayesian, or you can call your empirical risk a “likelihood” and your regularizer a “prior”, and then trade days of computation time for more accuracy).
In the workshop I detected a bias towards the more formal, CS-y, kind of thinking, which is well expressed by the learning reductions formalism. In this paradigm you pick a simple, well-defined, baseline learning problem (such as binary classification with the 0-1 loss, but it could be really anything), and show how being able to build fairly accurate solutions to it allows you to build reasonably accurate solutions to more complex problems (and here SEARN works for structured learning, costing for cost-sensitive classification, etc), where “fairly” and “reasonably” are formally specified. This allows you to build a very CS-y map of the machine learning problems, the structures between them, and even how they relate to each other. In the future this might lead to interesting notions of completeness or hardness, etc.
In this spirit people at the workshop suggested that the way to go is formal specification of the interfaces between learning problems (going as far as suggesting learning-as-a-service), as this interface-specific knowledge would then lead to a better mapping of the relationships, and a better usage of reductions.
I’m not too happy with this view, however, for two reasons: first, because learning problems are social constructs and simplifications of real-world needs, and second, because this is not the pattern of reuse which has naturally developed, and it would be smart to understand this pattern before changing it.
For my first point, I’d like to say that I’ve never seen a binary classification problem. What I’ve seen is many real-world, messy problems mapped into the “binary classification” box so that they can be solved to a reasonable degree of accuracy. There is already a lot of heavy mapping going on between the needs of a practitioner and the abstraction of binary classification: you need to come up with some notion of instances which are reasonably independent, a labeled set of examples (and with these a labeling scheme) to train and test on, heuristic feature functions which are almost but not quite equivalent to the label, and finally a loss function that lets you tell how right/wrong you are (which depends heavily on the labelings, which almost never occur naturally).
So all these issues are rightly swept under the rug when talking about building classifiers, because over time it was discovered that for a lot of problems there exist ways of making these mapping decisions which lead to useful results in the end, at a reasonable cost. At the same time, these mappings create binary classification problems with the same interface that are so different under the hood that entirely different abstractions are often used to attack them (think feature learning/engineering, online things, generative assumptions, etc).
My second point is, I think, even more important. If you look at the way machine learning has developed historically, researchers almost never trusted each other’s results as black boxes to which you can plug things in the input and output slots. Completely opposite to the learning reductions approach is how perceptron updates came to be at the center of way too many facets of machine learning. Instead of looking for clear, well-defined, input-output black boxes, machine learning researchers have pried open problems looking for a place where you can take your parameter vector and add to it something easily computed from the training example and the true label, so that your parameter vector ends up with a higher dot product with the true label than with the wrong labels. This simple, ugly idea, which depends on all sorts of artificial notions such as parameter vectors, true labels, training examples, etc, can actually be found lying around inside most machine learning algorithms, at least the efficient ones. It can be used on everything from binary classification to machine translation, and pretty much every problem in between. The open-box idea of perceptron updates, or more generally gradient/mirror descent in generalized linear models (the fancy way of calling it), has managed to be more useful than any black-box abstraction as far as machine learning goes. This is, of course, not at all unique to the perceptron update, and I could very well have said something like belief propagation (which would then encompass most message-passing and some non-message-passing algorithms used in ML, with a loose enough definition of algorithm).
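As a concrete rendering of what that pried-open update looks like in general (the structured-perceptron form of it), here is a tiny sketch; the helper names features and predict are hypothetical placeholders, since the whole point is that they change from problem to problem.

```python
def perceptron_update(w, x, y_true, features, predict):
    """The generic open-box update: nudge the parameter vector toward the
    features of the true label and away from the current best guess.

    features(x, y) returns a vector and predict(w, x) returns the label
    that currently scores highest under w; both are problem-specific.
    """
    y_hat = predict(w, x)
    if y_hat != y_true:
        w = w + features(x, y_true) - features(x, y_hat)
    return w
```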
So, while thinking in terms of nice taxonomies, clean interfaces, reductions, and complexity is very appealing, is it really the right thing to do? Can a learning reduction make any black-box algorithm as useful as the idea of a perceptron update?
Finally, should we really be looking at reusing black-box algorithms, or should we try to surface as clearly as possible the white-box ideas which can later be reused in countless other algorithms?
I think rather than attempting to reconceive the world in our image it often turns out that the best way to move forward is to do what already works in a better way.
(of course I don’t find reductions a bad idea at all. It is brilliant how some complex problems can be solved using them, and I’d like to work on them at some point. It’s just that maybe a black box view of the field is not ideal)
#NIPS2011 Workshop day 1 Highlights
Today was the first day of the NIPS workshops, and as usual it was really interesting.
In the morning I mostly stayed at the Bayesian Optimization workshop, which was really interesting. Rémi Munos talked about his ideas on optimizing highly nonconvex and nonsmooth functions with stochastic bandit feedback. The idea involves using a given hierarchical partition of the space to do some clever sampling, plus mild assumptions on the smoothness of the function only around the optimum. Even if you don’t know the actual smoothness around the optimum you can still get some pretty interesting results.
I found out about research that is very closely related to my recommendation systems work, by Roman Garnett and Katja Hofmann. I’m really looking forward to reading their papers on the subject, and possibly collaborating in these areas.
There was also Philipp Hennig’s poster about doing bayesian optimization where you try to keep track of explicit beliefs about where the optimum is and what its value is, rather than keeping track of beliefs about the function value everywhere. Pieter Abbeel’s poster about safe exploration was really interesting. Apparently he has a generalization of minimax and expectimax objectives where you can tune one simple intuitive parameter to control how adversarial the world is, and route around that.
After this I watched the contributed talks at the Computational Tradeoffs in Learning workshop. Sameer Singh talked about his approximate parallel BP algorithm, which is really interesting. Suppose you run BP, and in doing so find out that the marginal on some nodes isn’t changing all that much. Then you can fix the distribution of those nodes, condition on them, and decompose your graphical model into many independent parts, on which you can then run inference in parallel. Apparently at 95% accuracy or so you can get almost linear speedup with up to eight cores or so on some realistic models.
Then there was Hal Daumé’s really interesting talk on learning time vs accuracy tradeoffs in complex inference problems. The setting is great, and I have high hopes for this line of research. Essentially, you have these nondeterministic inference problems, such as parsing, in which you can push and pop items into a chart in pretty much any order you want, and if you’re willing to wait forever you will find the optimal value. His objective then is to learn a heuristic function used to sort items popped off of the agenda to minimize time at a small expense in accuracy. The problem ends up being so nondeterministic, however, that it’s really hard to come up with reasonable gradients that lead you towards good solutions, so there is a lot of future work needed to make this approach feasible. Regardless, it’s something I wish more inference techniques tried to do, and even if this fails some other approach should be tried.
Following that there was Jason Eisner’s student Veselin’s talk on something pretty similar, but different. He is following up on his earlier work where you essentially treat your model, inference, and loss function computation as a big black box, which you optimize on your training data using gradients computed by automatic differentiation. This has all sorts of nice consequences, such as not needing to have the artificial distinction between model and inference (which gets really dodgy with approximate inference), getting a nice thing to optimize that leverages properties of your inference, and being able to use literally any structure you want. The problem, unfortunately, ends up highly non-convex (which makes sense). This talk was about extending this idea to account for explicit computational cost tradeoffs by modeling early stopping and factor selection as part of the cost function you can optimize over. It’s pretty impressive, and the results are great.
In the afternoon I mostly stuck around the Relationships Between Machine Learning Problems workshop, which I generally found really interesting. I think machine learning as a field has solidified around some relatively suboptimal designs for problems and solutions, and exploring the relationships between these is a way of breaking out of this cycle. Mark Reid’s talk on possible anatomies for a learning problem was really interesting, not so much for the solutions it proposed (which were arguable and hence endlessly discussed in the panel sessions) but for making a broad sweep of the usual current ways of talking about machine learning problems. Also, for exposing me to the term “nips bingo”, which is writing papers by essentially stringing random buzzwords together (for an innocuous example, think of something like “A copula-based manifold optimization process for semi-supervised identification of latent tree models”). The panel discussion was really enlightening, as was the follow-up discussion I had with Tiberio Caetano about some issues regarding current machine learning.
All in all this was really interesting, and I’m looking forward to tomorrow’s sessions. I’ll have a poster at the Challenges in Learning Hierarchical Models workshop, and I’m also planning on seeing some talks in the Philosophy and Machine Learning, Computational Social Science, Learning Semantics, and Domain Adaptation workshops. It’ll be a lot of running around.