BACH Francis

Affiliations
  • 2012 - 2021
    Apprentissage statistique et parcimonie
  • 2020 - 2021
    Communauté d'universités et établissements Université de Recherche Paris Sciences et Lettres
  • 2012 - 2019
    Département d'Informatique de l'Ecole Normale Supérieure
  • 2013 - 2017
    Institut national de recherche en informatique et en automatique
  • 2016 - 2017
    Ecole normale supérieure Paris
  • 2013 - 2015
    Microsoft Research & Development France SAS
  • 2012 - 2013
    Centre de recherche Inria de Paris
  • A Continuized View on Nesterov Acceleration.

    Raphael BERTHIER, Francis BACH, Nicolas FLAMMARION, Pierre GAILLARD, Adrien TAYLOR
    2021
    We introduce the "continuized" Nesterov acceleration, a close variant of Nesterov acceleration whose variables are indexed by a continuous time parameter. The two variables continuously mix following a linear ordinary differential equation and take gradient steps at random times. This continuized variant benefits from the best of the continuous and the discrete frameworks: as a continuous process, one can use differential calculus to analyze convergence and obtain analytical expressions for the parameters, while a discretization of the continuized process can be computed exactly, with convergence rates similar to those of Nesterov's original acceleration. We show that the discretization has the same structure as Nesterov acceleration, but with random parameters.
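
    For reference, a minimal sketch (in Python; the quadratic objective and known constants L and mu are illustrative assumptions, not the paper's algorithm) of the classical two-variable Nesterov scheme whose structure the continuized discretization shares; in the continuized variant the gradient steps occur at random times, which makes the mixing parameters random:

      import numpy as np

      # Classical Nesterov acceleration on f(x) = 0.5 x'Ax - b'x (a baseline
      # sketch): the continuized variant keeps this two-variable structure but
      # draws the step times, and hence the parameters, at random.
      rng = np.random.default_rng(0)
      A = rng.standard_normal((20, 20)); A = A @ A.T + np.eye(20)  # strongly convex
      b = rng.standard_normal(20)
      eigs = np.linalg.eigvalsh(A); L, mu = eigs.max(), eigs.min()
      beta = (np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))  # momentum weight

      x = np.zeros(20); y = x.copy()
      for _ in range(200):
          x_new = y - (A @ y - b) / L        # gradient step at the mixed point
          y = x_new + beta * (x_new - x)     # linear mixing of the two variables
          x = x_new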
  • Machine learning: free programs (GPLv3) essential to the development of big data solutions.

    Massih reza AMINI, Francis BACH
    2020
    Machine Learning and Artificial Intelligence. Machine learning is one of the fields of artificial intelligence that aims to design programs that are not explicitly coded to perform a particular task. The concepts in this field are based on inferential logic and attempt to derive general rules from a finite number of observations. A reference book. This book presents the scientific foundations of supervised learning theory, the most widely used algorithms developed in this field, and the two frameworks of semi-supervised learning and ranking, at a level accessible to master's students and engineering students. The first edition, known as Machine Learning, was translated into Chinese by iTuring Publishing. In this second edition, a new chapter on artificial neural networks is dedicated to deep learning, and the other chapters have been reorganized into a coherent presentation linking the theory to the algorithms developed in this field. This edition also includes implementations of the classical algorithms, written in Python and C (two simple and popular languages), for readers who want to understand how these models, sometimes called black boxes, work. These free programs (GPLv3), essential to the development of big data solutions, are progressively deposited on this gitlab (https://gricad-gitlab.univ-grenoble-alpes.
  • Accelerating conditional gradient methods.

    Thomas KERDREUX, Alexandre d ASPREMONT, Francis BACH, Mikael JOHANSSON, Zaid HARCHAOUI, Sebastian POKUTTA, Martin JAGGI
    2020
    Frank-Wolfe algorithms are methods for constrained optimization: they decompose a non-linear problem into a series of linear problems. This makes them the methods of choice for high-dimensional optimization and explains their use in many applied fields. Here we propose new Frank-Wolfe algorithms that converge more quickly to the solution of the optimization problem under certain fairly generic structural assumptions. In particular, we show that, unlike other types of algorithms, this family adapts to these assumptions without having to specify the parameters that control them. A minimal sketch of the vanilla iteration appears below.
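
    A sketch (in Python; the quadratic objective and simplex constraint set are illustrative choices, not the thesis's setting) of the vanilla Frank-Wolfe iteration that the proposed accelerated variants build on:

      import numpy as np

      # Vanilla Frank-Wolfe on the probability simplex for f(x) = 0.5||x - c||^2.
      # Each iteration solves a *linear* problem over the constraint set (here,
      # picking the best vertex), then takes a convex combination step.
      rng = np.random.default_rng(0)
      c = rng.standard_normal(50)
      x = np.ones(50) / 50                       # feasible starting point
      for k in range(500):
          g = x - c                              # gradient of f at x
          s = np.zeros(50); s[np.argmin(g)] = 1  # linear minimization oracle
          x += 2.0 / (k + 2) * (s - x)           # standard open-loop step size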
  • Sampling subspaces using determinantal point processes.

    Ayoub BELHADJI, Pierre CHAINAIS, Remi BARDENET, Remi GRIBONVAL, Gersende FORT, Francis BACH, Agnes DESOLNEUX
    2020
    Determinantal point processes are probabilistic models of repulsion. These models have been studied in different fields: random matrices, quantum optics, spatial statistics, image processing, machine learning and, recently, quadrature. In this thesis, we study the sampling of subspaces using determinantal point processes. This problem lies at the intersection of three branches of approximation theory: subset selection in discrete sets, kernel quadrature and kernel interpolation. We study these classical questions through a new interpretation of these random models: a determinantal point process is a natural way to define a random subspace. Beyond giving a unified analysis of numerical integration and interpolation under DPPs, this new approach allows us to develop theoretical guarantees for several DPP-based algorithms, and even to prove their optimality for certain problems.
  • Optimization with sparsity-inducing penalties.

    Francis BACH, Rodolphe JENATTON, Julien MAIRAL, Guillaume OBOZINSKI
    2019
    No summary available.
  • On efficient methods for high-dimensional statistical estimation.

    Dmitry BABICHEV, Francis BACH, Anatoli JUDITSKY, Olivier CAPPE, Arnak s. DALALYAN, Stephane CHRETIEN, Franck IUTZELER
    2019
    In this thesis, we examine several aspects of parameter estimation for statistics and machine learning techniques, as well as optimization methods applicable to these problems. The goal of parameter estimation is to find the unknown hidden parameters that govern the data, for example the parameters of an unknown probability density. Constructing an estimator is only part of the task: finding the optimal value of the parameter is itself an optimization problem that must be solved using various techniques. These optimization problems are often convex for a large class of problems, and we can exploit their structure to obtain fast convergence rates. The first main contribution of the thesis is to develop moment-matching techniques for multi-index non-linear regression problems. We consider the classical non-linear regression problem, which is infeasible in high dimensions due to the curse of dimensionality, and combine two existing techniques, ADE and SIR, into a hybrid method that avoids some of the weaknesses of both. In the second main contribution, we use a particular type of averaging for stochastic gradient descent. We consider conditional exponential families (such as logistic regression), where the objective is to find the unknown value of the parameter. We propose averaging of the moment parameters, which we call prediction functions. For finite-dimensional models, this type of averaging can lead to a negative error, i.e., the approach provides an estimator that is better than any linear estimator can ever be. The third main contribution of this thesis concerns Fenchel-Young losses. We consider multi-class linear classifiers with losses of a certain type, such that their double conjugate has a direct product of simplices as support. The corresponding convex-concave saddle-point formulation has a special form with a bilinear matrix term, and classical approaches suffer from time-consuming matrix multiplication. We show that for multi-class SVM losses, with efficient sampling techniques our approach has sublinear iteration complexity: we only pay O(n + d + k), i.e., linear separately in the number of classes k, the number of features d, and the number of samples n, while all existing techniques are more expensive.
  • Overcomplete Independent Component Analysis via SDP.

    Anastasia PODOSINNIKOVA, Amelia PERRY, Alexander WEIN, Francis BACH, Alexandre D ASPREMONT, David SONTAG
    AISTATS 2019 - 22nd International Conference on Artificial Intelligence and Statistics | 2019
    We present a novel algorithm for overcomplete independent component analysis (ICA), where the number of latent sources k exceeds the dimension p of observed variables. Previous algorithms either suffer from high computational complexity or make strong assumptions about the form of the mixing matrix. Our algorithm does not make any sparsity assumption yet enjoys favorable computational and theoretical properties. Our algorithm consists of two main steps: (a) estimation of the Hessians of the cumulant generating function (as opposed to the fourth and higher order cumulants used by most algorithms) and (b) a novel semi-definite programming (SDP) relaxation for recovering a mixing component. We show that this relaxation can be efficiently solved with a projected accelerated gradient descent method, which makes the whole algorithm computationally practical. Moreover, we conjecture that the proposed program recovers a mixing component at the rate k < p^2/4 and prove that a mixing component can be recovered with high probability when k < (2 - epsilon) p log p if the original components are sampled uniformly at random on the hypersphere. Experiments are provided on synthetic data and the CIFAR-10 dataset of real images.
  • Stochastic algorithms with descent guarantees for ICA.

    Pierre ABLIN, Alexandre GRAMFORT, Jean francois CARDOSO, Francis BACH
    AISTATS 2019 - 22nd International Conference on Artificial Intelligence and Statistics | 2019
    Independent component analysis (ICA) is a widespread data exploration technique, where observed signals are modeled as linear mixtures of independent components. From a machine learning point of view, it amounts to a matrix factorization problem with a statistical independence criterion. Infomax is one of the most used ICA algorithms. It is based on a loss function which is a non-convex log-likelihood. We develop a new majorization-minimization framework adapted to this loss function. We derive an online algorithm for the streaming setting, and an incremental algorithm for the finite sum setting, with the following benefits. First, unlike most algorithms found in the literature, the proposed methods do not rely on any critical hyper-parameter like a step size, nor do they require a line-search technique. Second, the algorithm for the finite sum setting, although stochastic, guarantees a decrease of the loss function at each iteration. Experiments demonstrate progress on the state-of-the-art for large scale datasets, without the necessity for any manual parameter tuning.
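
    For context, a sketch (in Python) of the textbook stochastic relative-gradient Infomax update that such methods improve upon; the Laplace sources and tanh score are illustrative choices, and the step size lr is exactly the critical hyper-parameter that the paper's majorization-minimization algorithms dispense with:

      import numpy as np

      # Textbook stochastic relative-gradient Infomax (a baseline sketch, not the
      # paper's MM algorithm): W is the unmixing matrix, tanh the score function.
      rng = np.random.default_rng(0)
      p, n = 4, 10000
      S = rng.laplace(size=(p, n))            # independent non-Gaussian sources
      X = rng.standard_normal((p, p)) @ S     # observed linear mixture
      W = np.eye(p)
      lr = 0.01                               # the critical step-size hyper-parameter
      for i in range(n):
          y = W @ X[:, i]
          W += lr * (np.eye(p) - np.outer(np.tanh(y), y)) @ W  # relative gradient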
  • Unsupervised Image Matching and Object Discovery as Optimization.

    Huy v. VO, Francis BACH, Minsu CHO, Kai HAN, Yann LECUN, Patrick PEREZ, Jean PONCE
    2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) | 2019
    No summary available.
  • Deep learning.

    Ian j. GOODFELLOW, Yoshua BENGIO, Aaron c. COURVILLE, Francis BACH, Fabien NAVARRO, Salima EL KOLEI, Benjamin GUEDJ, Christophe CHESNEAU, Nicolas BOUSQUET
    2018
    « Hello Dave, you’re looking well today ».
  • Asynchronous optimization for machine learning.

    Remi LEBLOND, Francis BACH, Simon LACOSTE JULIEN, Jean philippe VERT, John c. DUCHI, Hal DAUME III, Alexandre GRAMFORT
    2018
    The combined explosions of computational power and the amount of available data have made algorithms the new limiting factors in machine learning. The objective of this thesis is therefore to introduce new methods capable of taking advantage of large amounts of data and computational resources. We present two independent contributions. First, we develop fast optimization algorithms, adapted to the advances in parallel computing architecture, to handle massive amounts of data. We introduce an analytical framework for asynchronous parallel algorithms, which allows us to write correct and simple proofs. We demonstrate its usefulness by analyzing the convergence and speedup properties of two new algorithms. Asaga is a sparse asynchronous parallel variant of Saga, a variance-reduced algorithm that has a fast linear convergence rate in the case of a smooth and strongly convex objective. Under the right conditions, Asaga is linearly faster than Saga, even in the absence of sparsity. ProxAsaga is an extension of Asaga to the more general case where the regularization term is not smooth; it also obtains a linear speedup. We have performed extensive experiments to compare our algorithms to the state of the art. Second, we present new methods suitable for structured prediction. We focus on recurrent neural networks (RNNs), whose traditional training algorithm, based on maximum likelihood estimation (MLE), has several limitations. The associated cost function ignores the information contained in the structured metrics, and it causes discrepancies between training and prediction. We therefore propose SeaRNN, a new training algorithm for RNNs inspired by the "learning to search" approach. SeaRNN relies on a state-space exploration to define global-local cost functions closer to the evaluation metric than the MLE objective. Models trained with SeaRNN perform better than those learned via MLE on three challenging tasks, including machine translation. Finally, we study the behavior of these models and perform a detailed comparison of our new approach to related research.
  • Relating Leverage Scores and Density using Regularized Christoffel Functions.

    Edouard PAUWELS, Francis BACH, Jean philippe VERT
    Neural Information Processing Systems | 2018
    Statistical leverage scores emerged as a fundamental tool for matrix sketching and column sampling with applications to low rank approximation, regression, random feature learning and quadrature. Yet, the very nature of this quantity is barely understood. Borrowing ideas from the orthogonal polynomial literature, we introduce the regularized Christoffel function associated to a positive definite kernel. This uncovers a variational formulation for leverage scores for kernel methods and allows to elucidate their relationships with the chosen kernel as well as population density. Our main result quantitatively describes a decreasing relation between leverage score and population density for a broad class of kernels on Euclidean spaces. Numerical simulations support our findings.
  • Evaluating automatic speech recognition systems as quantitative models of cross-lingual phonetic category perception.

    Thomas SCHATZ, Francis BACH, Emmanuel DUPOUX
    The Journal of the Acoustical Society of America | 2018
    Theories of cross-linguistic phonetic category perception posit that listeners perceive foreign sounds by mapping them onto their native phonetic categories, but, until now, no way to effectively implement this mapping has been proposed. In this paper, Automatic Speech Recognition systems trained on continuous speech corpora are used to provide a fully specified mapping between foreign sounds and native categories. The authors show how the machine ABX evaluation method can be used to compare predictions from the resulting quantitative models with empirically attested effects in human cross-linguistic phonetic category perception.
  • Acceleration in optimization.

    Damien SCIEUR, Francis BACH, Alexandre d ASPREMONT, Antonin CHAMBOLLE, Joseph SALMON, Yurii NESTEROV
    2018
    In many fields, such as optimization, the performance of a method is often characterized by its convergence rate. However, accelerating an algorithm requires some knowledge of the problem structure and such improvements are the result of a case-by-case study. Many acceleration techniques have been developed in the last decades and are now massively used. Despite their simplicity, these methods are often based on purely algebraic arguments and generally lack intuitive explanations. Recently, a lot of work has been done to make connections between accelerated algorithms and other scientific domains, such as control theory or differential equations. However, these explanations are often based on complex arguments and most of them use unconventional tools in their analysis. One of the contributions of this thesis is an attempt to explain accelerated algorithms using the theory of integration methods, which has been extensively studied and enjoys a strong theoretical analysis. In particular, we show that optimization methods are in fact instances of integration methods, when integrating the gradient flow equation. With standard arguments, we intuitively explain the origin of the acceleration. On the other hand, accelerated methods need additional parameters, compared to slower methods, which are usually difficult to estimate. Moreover, these schemes are built for a particular configuration and cannot be used elsewhere. Here, we explore another approach to accelerating optimization algorithms, which uses generic acceleration arguments. In numerical analysis, these tools have been developed to accelerate sequences of scalars or vectors, by building in parallel another sequence with a better convergence rate. These methods can be combined with an iterative algorithm, accelerating it in most cases. In practice, these extrapolation methods are not so much used, notably because of their lack of convergence guarantees and their instability. We extend these methods by regularizing them, which will allow a deeper theoretical analysis and stronger convergence results, especially when applied to optimization methods.
  • Bridging the Gap between Constant Step Size Stochastic Gradient Descent and Markov Chains.

    Aymeric DIEULEVEUT, Alain DURMUS, Francis BACH
    2018
    We consider the minimization of an objective function given access to unbiased estimates of its gradient through stochastic gradient descent (SGD) with constant step-size. While the detailed analysis was only performed for quadratic functions, we provide an explicit asymptotic expansion of the moments of the averaged SGD iterates that outlines the dependence on initial conditions, the effect of noise and the step-size, as well as the lack of convergence in the general (non-quadratic) case. For this analysis, we bring tools from Markov chain theory into the analysis of stochastic gradient. We then show that Richardson-Romberg extrapolation may be used to get closer to the global optimum and we show empirical improvements of the new extrapolation scheme.
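
    A minimal sketch (in Python; the logistic model, step sizes, and synthetic stream are illustrative choices) of the Richardson-Romberg extrapolation described above: run averaged constant-step-size SGD with steps g and 2g, then combine 2*avg(g) - avg(2g) to cancel the first-order dependence of the stationary bias on the step size:

      import numpy as np

      # Constant-step-size averaged SGD on logistic regression (non-quadratic, so
      # the averaged iterate carries an O(gamma) bias at stationarity); the
      # Richardson-Romberg combination cancels that first-order bias.
      rng = np.random.default_rng(0)
      d, n = 5, 200000
      w_star = rng.standard_normal(d)
      sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

      def averaged_sgd(gamma):
          w, w_bar = np.zeros(d), np.zeros(d)
          for t in range(1, n + 1):
              x = rng.standard_normal(d)
              y = float(rng.random() < sigmoid(x @ w_star))   # Bernoulli label
              w -= gamma * (sigmoid(x @ w) - y) * x           # stochastic gradient
              w_bar += (w - w_bar) / t                        # running average
          return w_bar

      w_rr = 2 * averaged_sgd(0.05) - averaged_sgd(0.10)      # extrapolated estimate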
  • Averaging Stochastic Gradient Descent on Riemannian Manifolds.

    Nilesh TRIPURANENI, Nicolas FLAMMARION, Francis BACH, Michael i. JORDAN
    Computational Learning Theory (COLT) | 2018
    We consider the minimization of a function defined on a Riemannian manifold $\mathcal{M}$ accessible only through unbiased estimates of its gradients. We develop a geometric framework to transform a sequence of slowly converging iterates generated from stochastic gradient descent (SGD) on $\mathcal{M}$ to an averaged iterate sequence with a robust and fast $O(1/n)$ convergence rate. We then present an application of our framework to geodesically-strongly-convex (and possibly Euclidean non-convex) problems. Finally, we demonstrate how these ideas apply to the case of streaming $k$-PCA, where we show how to accelerate the slow rate of the randomized power method (without requiring knowledge of the eigengap) into a robust algorithm achieving the optimal rate of convergence.
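
    For context, a sketch (in Python; the diagonal covariance is an illustrative choice) of the plain randomized power method (Oja's rule) for streaming k-PCA with k = 1, the slowly converging baseline that the paper's Riemannian averaging framework accelerates:

      import numpy as np

      # Oja's rule: a stochastic gradient step on the Rayleigh quotient over the
      # unit sphere, followed by a retraction (renormalization) onto the manifold.
      rng = np.random.default_rng(0)
      d = 20
      scales = np.linspace(1.0, 2.0, d)           # sqrt of the covariance diagonal
      w = rng.standard_normal(d)
      w /= np.linalg.norm(w)
      for t in range(1, 100001):
          x = scales * rng.standard_normal(d)     # sample with covariance diag(scales^2)
          w += (1.0 / t) * x * (x @ w)            # stochastic gradient step
          w /= np.linalg.norm(w)                  # retract back to the unit sphere
      # w now approximates the top eigenvector of the covariance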
  • The minister's lightning rods.

    Francis BACH, Dimitri SPOLIANSKY, Claude RIVELINE
    2017
    No summary available.
  • Integration Methods and Accelerated Optimization Algorithms.

    Damien SCIEUR, Vincent ROULET, Francis BACH, Alexandre D ASPREMONT
    2017
    We show that accelerated optimization methods can be seen as particular instances of multi-step integration schemes from numerical analysis, applied to the gradient flow equation. In comparison with recent advances in this vein, the differential equation considered here is the basic gradient flow and we show that multi-step schemes allow integration of this differential equation using larger step sizes, thus intuitively explaining acceleration results.
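
    The core observation admits a tiny illustration (in Python, on an illustrative one-dimensional objective): integrating the gradient flow x'(t) = -f'(x(t)) with the explicit Euler scheme and step h reproduces gradient descent exactly; the paper's point is that multi-step integration schemes for the same ODE yield accelerated methods:

      # Explicit Euler on the gradient flow x' = -f'(x) is gradient descent.
      f_prime = lambda x: 4 * x**3 - 2 * x   # gradient of f(x) = x^4 - x^2
      x, h = 2.0, 0.01                       # initial point and integration step
      for _ in range(1000):
          x = x - h * f_prime(x)             # Euler step == gradient descent step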
  • On Structured Prediction Theory with Calibrated Convex Surrogate Losses.

    Anton OSOKIN, Francis BACH, Simon LACOSTE JULIEN
    The Thirty-first Annual Conference on Neural Information Processing Systems (NIPS) | 2017
    We provide novel theoretical insights on structured prediction in the context of efficient convex surrogate loss minimization with consistency guarantees. For any task loss, we construct a convex surrogate that can be optimized via stochastic gradient descent and we prove tight bounds on the so-called “calibration function” relating the excess surrogate risk to the actual risk. In contrast to prior related work, we carefully monitor the effect of the exponential number of classes in the learning guarantees as well as on the optimization complexity. As an interesting consequence, we formalize the intuition that some task losses make learning harder than others, and that the classical 0-1 loss is ill-suited for structured prediction.
  • Stochastic approximation in Hilbert spaces.

    Aymeric DIEULEVEUT, Francis BACH, Stephane BOUCHERON, Arnak s. DALALYAN, Lorenzo ROSASCO, Francois GLINEUR
    2017
    The goal of supervised learning is to infer relationships between a phenomenon that we wish to predict and "explanatory" variables. To this end, we have observations of multiple realizations of the phenomenon, from which we propose a prediction rule. The recent emergence of very large-scale data sources, both in terms of the number of observations made (in image analysis, for example) and the large number of explanatory variables (in genetics), has brought to light two difficulties: on the one hand, it becomes difficult to avoid the pitfall of overfitting when the number of explanatory variables is much greater than the number of observations; on the other hand, the algorithmic aspect becomes a determining factor, because solving a linear system in the spaces involved can become a major difficulty. Algorithms derived from stochastic approximation methods offer a simultaneous answer to these two difficulties: the use of a stochastic method drastically reduces the algorithmic cost without degrading the quality of the proposed prediction rule, while avoiding overfitting. In particular, the focus of this thesis will be on stochastic gradient methods. The very popular parametric methods propose, as predictions, linear functions of a chosen set of explanatory variables. However, these methods often result in an imprecise approximation of the underlying statistical structure. In the non-parametric framework, which is one of the central themes of this thesis, the restriction to linear predictors is lifted. The class of functions in which the predictor is constructed depends on the observations. In practice, non-parametric methods are crucial for various applications, in particular for the analysis of non-vector data, which can be associated with a vector in a functional space via the use of a positive definite kernel. This allows the use of algorithms designed for vector data, but requires an understanding of these algorithms in the associated non-parametric space: the reproducing kernel Hilbert space. Moreover, the analysis of non-parametric estimation also provides a revealing insight into the parametric framework, when the number of predictors greatly exceeds the number of observations. The first contribution of this thesis consists in a detailed analysis of stochastic approximation in the non-parametric framework, in particular in the framework of reproducing kernel Hilbert spaces. This analysis makes it possible to obtain optimal convergence rates for the averaged stochastic gradient descent algorithm. The proposed analysis applies to many settings, and particular attention is paid to the use of minimal assumptions, as well as to the study of settings where the number of observations is known in advance or can evolve. The second contribution is to propose an algorithm, based on an acceleration principle, which converges at an optimal speed from both the optimization and the statistical points of view. This allows, in the non-parametric framework, the convergence to be improved to the optimal rate in some regimes for which the first analyzed algorithm remained sub-optimal. Finally, the third contribution of the thesis consists in extending the studied framework beyond the least-squares loss: the stochastic gradient descent algorithm is analyzed as a Markov chain. This approach results in an intuitive interpretation and highlights the differences between the quadratic framework and the general framework. A simple method to substantially improve the convergence is also proposed.
  • A Quantitative Measure of the Impact of Coarticulation on Phone Discriminability.

    Thomas SCHATZ, Rory TURNBULL, Francis BACH, Emmanuel DUPOUX
    Interspeech 2017 | 2017
    No summary available.
  • Harder, Better, Faster, Stronger Convergence Rates for Least-Squares Regression.

    Aymeric DIEULEVEUT, Nicolas FLAMMARION, Francis BACH
    Journal of Machine Learning Research | 2017
    We consider the optimization of a quadratic objective function whose gradients are only accessible through a stochastic oracle that returns the gradient at any given point plus a zero-mean finite variance random error. We present the first algorithm that achieves jointly the optimal prediction error rates for least-squares regression, both in terms of forgetting of initial conditions in O(1/n^2), and in terms of dependence on the noise and dimension d of the problem, as O(d/n). Our new algorithm is based on averaged accelerated regularized gradient descent, and may also be analyzed through finer assumptions on initial conditions and the Hessian matrix, leading to dimension-free quantities that may still be small while the "optimal" terms above are large. In order to characterize the tightness of these new bounds, we consider an application to non-parametric regression and use the known lower bounds on the statistical performance (without computational limits), which happen to match our bounds obtained from a single pass on the data and thus show optimality of our algorithm in a wide variety of particular trade-offs between bias and variance.
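
    Schematically, and with constants omitted, the guarantee combines the two optimal terms quoted above:

      $\mathbb{E}\,[f(\bar{x}_n)] - f(x_\ast) \;=\; O\!\left(\frac{\|x_0 - x_\ast\|^2}{n^2}\right) + O\!\left(\frac{\sigma^2 d}{n}\right),$

    where $x_0$ is the initial point, $\sigma^2$ the gradient noise variance, and $\bar{x}_n$ the averaged iterate after a single pass over n observations.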
  • Qualitative and Descriptive Topic Extraction from Movie Reviews Using LDA.

    Christophe DUPUY, Francis BACH, Christophe DIOT
    Lecture Notes in Computer Science | 2017
    No summary available.
  • New methods for image classification, image retrieval and semantic correspondence.

    Rafael SAMPAIO DE REZENDE, Jean PONCE, Francis BACH, Matthieu CORD, Patrick PEREZ, Frederic JURIE, Florent PERRONNIN
    2017
    The image representation problem is at the heart of the vision domain. The choice of image representation changes depending on the task we want to study. An image retrieval problem in large databases requires a compressed global representation, while a semantic segmentation problem requires a partitioning map of its pixels. Statistical learning techniques are the main tool for building these representations. In this manuscript, we address the learning of visual representations in three different problems: image retrieval, semantic matching and image classification. First, we study the Fisher vector representation and its dependence on the Gaussian mixture model employed. We introduce the use of several Gaussian mixture models for different types of backgrounds, e.g., different scene categories, and analyze the performance of these representations for classification purposes and the impact of the scene category as a latent variable. Our second approach proposes an extension of the exemplar SVM pipeline. We first show that replacing the SVM loss function with the square loss yields similar results at a fraction of the computational cost. We call this model the "square-loss exemplar machine" (SLEM). We introduce a kernelized variant of SLEM that has the same computational advantages but improved performance. We present experiments that establish the performance and efficiency of our methods using a wide variety of base representations and image retrieval datasets. Finally, we propose a deep neural network for the semantic matching problem. We use object boxes as matching elements to build an architecture that simultaneously learns appearance and geometric consistency. We propose new geometric consistency scores adapted to the neural network architecture. Our model is trained on pairs of images obtained from key points of a reference dataset and evaluated on multiple datasets, outperforming recent deep learning architectures and previous methods based on handcrafted features. We conclude the thesis by highlighting our contributions and suggesting possible future research directions.
  • Stochastic approximation and least-squares regression, with applications to machine learning.

    Nicolas FLAMMARION, Francis BACH, Alexandre d ASPREMONT, Eric MOULINES, Jerome BOLTE, Arnak s. DALALYAN
    2017
    Many problems in machine learning involve minimizing a smooth function on a Euclidean space. For supervised learning, this includes least-squares and logistic regression. While small-scale problems are efficiently solved with many optimization algorithms, large-scale problems require first-order methods such as gradient descent. In this manuscript, we consider the special case of the quadratic loss. In a first part, we propose to minimize it with a stochastic oracle. In a second part, we consider two of its applications to machine learning: clustering and shape-constrained estimation. The first contribution is a unified framework for the optimization of non-strongly convex quadratic functions. It includes accelerated gradient descent and averaged gradient descent. This new framework suggests an alternative algorithm that combines the positive aspects of averaging and acceleration. The second contribution is to obtain the optimal prediction error rate for least-squares regression as a function of the noise dependence of the problem and the forgetting of the initial conditions. Our new algorithm is based on accelerated and averaged gradient descent. The third contribution deals with the minimization of composite functions, sums of the expectation of quadratic functions and a convex regularization. We extend the existing results for least squares to any regularization and to the different geometries induced by a Bregman divergence. In a fourth contribution, we consider the discriminative clustering problem. We propose its first theoretical analysis, a sparse extension, an extension to the multi-label case, and a new algorithm with a better complexity than existing methods. The last contribution of this thesis considers the problem of seriation. We adopt a statistical approach where the matrix is observed with noise and we study the minimax estimation rates. We also propose a computationally efficient estimator.
  • Inference and applications for topic models.

    Christophe DUPUY, Francis BACH, Olivier CAPPE, Francois CARON, Michalis TITSIAS, Patrick PEREZ, Christophe DIOT, Alexandre d ASPREMONT
    2017
    Most current recommendation systems rely on ratings (i.e., numbers between 0 and 5) to recommend a content (movie, restaurant, etc.) to a user. The user often has the possibility to comment on this content in the form of text in addition to rating it. It is difficult to extract information from a raw text, while a simple rating contains little information about the content and the user. In this thesis, we attempt to suggest a personalized readable text to the user to help him or her quickly form an opinion about a content. More specifically, we first build a topic model predicting a personalized movie description from textual comments. Our model separates qualitative (i.e., opinionated) topics from descriptive topics by combining textual comments and numerical ratings in a joint probabilistic model. We evaluate our model on an IMDB database and illustrate its performance through topic comparison. We then study parameter inference in large-scale latent variable models, including most topic models. We propose a unified treatment of online inference for latent variable models from non-canonical exponential families and make explicit the links between several previously proposed frequentist and Bayesian methods. We also propose a new inference method for frequentist parameter estimation that adapts MCMC methods to online inference of latent variable models by properly using local Gibbs sampling. For the latent Dirichlet allocation topic model, we provide an extensive set of experiments and comparisons with existing work in which our new approach performs better than previously proposed methods. Finally, we propose a new class of determinantal point processes (DPPs) that can be manipulated for parameter inference and learning in potentially sub-linear time in the number of objects. This class, based on a specific low-rank factorization of the marginal kernel, is particularly suited to a subclass of continuous DPPs and DPPs defined over an exponential number of objects. We apply this class to the modeling of text documents as samples of a DPP on sentences and propose a conditional maximum likelihood formulation for modeling topic proportions, which is made possible without any approximation for our class of DPPs. We present an application to document summarization with a DPP on $2^{500}$ objects, where the summaries are composed of readable sentences.
  • Iterative hard clustering of features.

    Vincent ROULET, Fajwel FOGEL, Alexandre D ASPREMONT, Francis BACH
    2017
    We seek to group features in supervised learning problems by constraining the prediction vector coefficients to take only a small number of values. This problem includes non-convex constraints and is solved using projected gradient descent. We prove exact recovery results using restricted eigenvalue conditions. We then extend these results to combine sparsity and grouping constraints, and develop an efficient projection algorithm on the set of grouped and sparse vectors. Numerical experiments illustrate the performance of our algorithms on both synthetic and real data sets.
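
    A sketch (in Python; the Lloyd iterations are an illustrative heuristic for the one-dimensional k-means step, not necessarily the projection routine of the paper) of the key projection: snapping a prediction vector onto the set of vectors with at most k distinct coefficient values:

      import numpy as np

      # Project w onto vectors taking at most k distinct values: cluster the
      # coefficients with 1-D k-means, then replace each one by its centroid.
      def project_k_values(w, k, iters=50, seed=0):
          rng = np.random.default_rng(seed)
          centers = rng.choice(w, size=k, replace=False)
          for _ in range(iters):
              assign = np.argmin(np.abs(w[:, None] - centers[None, :]), axis=1)
              for j in range(k):
                  if np.any(assign == j):
                      centers[j] = w[assign == j].mean()
          return centers[assign]

      w = np.array([0.11, 0.09, 0.52, 0.48, -0.31])
      print(project_k_values(w, k=3))   # coefficients snapped to 3 shared values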
  • Robust Discriminative Clustering with Sparse Regularizers.

    Nicolas FLAMMARION, Balamurugan PALANIAPPAN, Francis BACH
    Journal of Machine Learning Research | 2017
    Clustering high-dimensional data often requires some form of dimensionality reduction, where clustered variables are separated from "noise-looking" variables. We cast this problem as finding a low-dimensional projection of the data which is well-clustered. This yields a one-dimensional projection in the simplest situation with two clusters, and extends naturally to a multi-label scenario for more than two clusters. In this paper, (a) we first show that this joint clustering and dimension reduction formulation is equivalent to previously proposed discriminative clustering frameworks, thus leading to convex relaxations of the problem; (b) we propose a novel sparse extension, which is still cast as a convex relaxation and allows estimation in higher dimensions; (c) we propose a natural extension for the multi-label scenario; (d) we provide a new theoretical analysis of the performance of these formulations with a simple probabilistic model, leading to scalings of the form d = O(√n) for the affine invariant case and d = O(n) for the sparse case, where n is the number of examples and d the ambient dimension; and finally, (e) we propose an efficient iterative algorithm with running-time complexity proportional to O(nd^2), improving on earlier algorithms which had quadratic complexity in the number of examples.
  • On the Consistency of Ordinal Regression Methods.

    Fabian PEDREGOSA, Francis BACH, Alexandre GRAMFORT
    Journal of Machine Learning Research | 2017
    Many of the ordinal regression models that have been proposed in the literature can be seen as methods that minimize a convex surrogate of the zero-one, absolute, or squared loss functions. A key property that allows one to study the statistical implications of such approximations is that of Fisher consistency. Fisher consistency is a desirable property for surrogate loss functions and implies that in the population setting, i.e., if the probability distribution that generates the data were available, then optimization of the surrogate would yield the best possible model. In this paper we characterize the Fisher consistency of a rich family of surrogate loss functions used in the context of ordinal regression, including support vector ordinal regression, ORBoosting and least absolute deviation. We show that, for a family of surrogate loss functions that subsumes support vector ordinal regression and ORBoosting, consistency can be fully characterized by the derivative of a real-valued function at zero, as happens for convex margin-based surrogates in binary classification. We also derive excess risk bounds for a surrogate of the absolute error that generalize existing risk bounds for binary classification. Finally, our analysis suggests a novel surrogate of the squared error loss. We compare this novel surrogate with competing approaches on 9 different datasets. Our method proves highly competitive in practice, outperforming the least squares loss on 7 out of 9 datasets.
  • Online but Accurate Inference for Latent Variable Models with Local Gibbs Sampling.

    Christophe DUPUY, Francis BACH
    Journal of Machine Learning Research | 2017
    We study parameter inference in large-scale latent variable models. We first propose a unified treatment of online inference for latent variable models from a non-canonical exponential family, and draw explicit links between several previously proposed frequentist or Bayesian methods. We then propose a novel inference method for the frequentist estimation of parameters, that adapts MCMC methods to online inference of latent variable models with the proper use of local Gibbs sampling. Then, for latent Dirichlet allocation, we provide an extensive set of experiments and comparisons with existing work, where our new approach outperforms all previously proposed methods. In particular, using Gibbs sampling for latent variable inference is superior to variational inference in terms of test log-likelihoods. Moreover, Bayesian inference through variational methods performs poorly, sometimes leading to worse fits with latent variables of higher dimensionality.
  • Stochastic Composite Least-Squares Regression with Convergence Rate O(1/n).

    Nicolas FLAMMARION, Francis BACH
    Proceedings of The 30th Conference on Learning Theory, (COLT) | 2017
    We consider the optimization of a quadratic objective function whose gradients are only accessible through a stochastic oracle that returns the gradient at any given point plus a zero-mean finite variance random error. We present the first algorithm that achieves jointly the optimal prediction error rates for least-squares regression, both in terms of forgetting of initial conditions in O(1/n^2), and in terms of dependence on the noise and dimension d of the problem, as O(d/n). Our new algorithm is based on averaged accelerated regularized gradient descent, and may also be analyzed through finer assumptions on initial conditions and the Hessian matrix, leading to dimension-free quantities that may still be small while the "optimal" terms above are large. In order to characterize the tightness of these new bounds, we consider an application to non-parametric regression and use the known lower bounds on the statistical performance (without computational limits), which happen to match our bounds obtained from a single pass on the data and thus show optimality of our algorithm in a wide variety of particular trade-offs between bias and variance.
  • Kernel Square-Loss Exemplar Machines for Image Retrieval.

    Rafael REZENDE, Joaquin ZEPEDA, Jean PONCE, Francis BACH, Patrick PEREZ
    Computer Vision and Pattern Recognition 2017 | 2017
    Zepeda and Pérez have recently demonstrated the promise of the exemplar SVM (ESVM) as a feature encoder for image retrieval. This paper extends this approach in several directions: We first show that replacing the hinge loss by the square loss in the ESVM cost function significantly reduces encoding time with negligible effect on accuracy. We call this model square-loss exemplar machine, or SLEM. We then introduce a kernelized SLEM which can be implemented efficiently through low-rank matrix decomposition, and displays improved performance. Both SLEM variants exploit the fact that the negative examples are fixed, so most of the SLEM computational complexity is relegated to an offline process independent of the positive examples. Our experiments establish the performance and computational advantages of our approach using a large array of base features and standard image retrieval datasets.
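
    A sketch (in Python, ignoring the bias term and any feature centering the paper may use) of the offline/online split that the abstract describes: since the negatives are fixed, the regularized covariance inverse is computed once, and each new positive only costs a rank-one Sherman-Morrison update plus a matrix-vector product:

      import numpy as np

      # Offline: precompute the inverse regularized covariance of the fixed negatives.
      rng = np.random.default_rng(0)
      d, n_neg, lam = 128, 5000, 1e-2
      X_neg = rng.standard_normal((n_neg, d))             # fixed negative pool
      P = np.linalg.inv(X_neg.T @ X_neg + lam * np.eye(d))
      neg_sum = X_neg.sum(axis=0)

      def slem_encode(x_pos):
          # Online: exact ridge weights for labels (+1 for x_pos, -1 for negatives),
          # via a rank-one Sherman-Morrison update of the precomputed inverse.
          Px = P @ x_pos
          P_full = P - np.outer(Px, Px) / (1.0 + x_pos @ Px)
          return P_full @ (x_pos - neg_sum)               # (X'X + lam I)^{-1} X'y

      code = slem_encode(rng.standard_normal(d))          # encoding for one image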
  • A unified perspective on convex structured sparsity: Hierarchical, symmetric, submodular norms and beyond.

    Guillaume OBOZINSKI, Francis BACH
    2016
    In this paper, we propose a unified theory for convex structured sparsity-inducing norms on vectors associated with combinatorial penalty functions. Specifically, we consider the situation of a model simultaneously (a) penalized by a set-function defined on the support of the unknown parameter vector, which represents prior knowledge on supports, and (b) regularized in p-norm. We show that each of the obtained combinatorial optimization problems admits a natural relaxation as an optimization problem regularized by a matching sparsity-inducing norm. To characterize the tightness of the relaxation, we introduce a notion of lower combinatorial envelope of a set-function. Symmetrically, a notion of upper combinatorial envelope produces the most concise norm expression. We show that these relaxations take the form of combinatorial latent group Lassos associated with min-cover penalties, also known as block-coding schemes. For submodular penalty functions, the associated norm, dual norm and the corresponding proximal operator can be computed efficiently using a generic divide-and-conquer algorithm. Our framework obtains constructive derivations for the Lasso, group Lasso, exclusive Lasso, the OWL, OSCAR and SLOPE penalties, the k-support norm, several hierarchical penalties considered in the literature for chains and tree structures, and also produces new norms. It leads to general efficient algorithms for all these norms, recovering as special cases several algorithms proposed in the literature and yielding improved procedures for some cases. For norms associated with submodular penalties, including a large number of non-decomposable norms, we generalize classical support recovery and fast-rate convergence results, based respectively on generalizations of the irrepresentability condition and of the restricted eigenvalue condition.
  • Beyond CCA: Moment Matching for Multi-View Models.

    Anastasia PODOSINNIKOVA, Francis BACH, Simon LACOSTE JULIEN
    2016
    We introduce three novel semi-parametric extensions of probabilistic canonical correlation analysis with identifiability guarantees. We consider moment matching techniques for estimation in these models. For that, by drawing explicit links between the new models and a discrete version of independent component analysis (DICA), we first extend the DICA cumulant tensors to the new discrete version of CCA. By further using a close connection with independent component analysis, we introduce generalized covariance matrices, which can replace the cumulant tensors in the moment matching framework and, therefore, improve sample complexity and simplify derivations and algorithms significantly. As the tensor power method or orthogonal joint diagonalization are not applicable in the new setting, we use non-orthogonal joint diagonalization techniques for matching the cumulants. We demonstrate performance of the proposed models and estimation techniques on experiments with both synthetic and real datasets.
  • Learning Determinantal Point Processes in Sublinear Time.

    Christophe DUPUY, Francis BACH
    2016
    We propose a new class of determinantal point processes (DPPs) which can be manipulated for inference and parameter learning in potentially sublinear time in the number of items. This class, based on a specific low-rank factorization of the marginal kernel, is particularly suited to a subclass of continuous DPPs and DPPs defined on exponentially many items. We apply this new class to modelling text documents as sampling a DPP of sentences, and propose a conditional maximum likelihood formulation to model topic proportions, which is made possible with no approximation for our class of DPPs. We present an application to document summarization with a DPP on $2^{500}$ items.
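
    For reference, a sketch (in Python, with a tiny dense kernel; the paper's low-rank factorization is precisely what avoids these O(n^3) determinants at scale) of the basic L-ensemble identity P(Y = S) = det(L_S) / det(L + I) underlying DPP inference:

      import numpy as np

      # Probability that an L-ensemble DPP over 6 items draws exactly the subset S.
      rng = np.random.default_rng(0)
      B = rng.standard_normal((6, 3))
      L = B @ B.T + 1e-6 * np.eye(6)          # PSD likelihood kernel on 6 items

      def subset_probability(L, S):
          LS = L[np.ix_(S, S)]                # principal submatrix indexed by S
          return np.linalg.det(LS) / np.linalg.det(L + np.eye(len(L)))

      print(subset_probability(L, [0, 2, 5]))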
  • Learning structured models on weighted graphs, with applications to spatial data analysis.

    Loic LANDRIEU, Guillaume OBOZINSKI, Francis BACH, Matthew b. BLASCHKO, Jalal FADILI, Olivier BONIN, Jean christophe PESQUET, Bruno VALLET
    2016
    The modeling of complex processes may involve a large number of variables with a complicated correlation structure between them. For example, spatial phenomena often have a strong spatial regularity, resulting in a correlation between variables that is stronger the closer the corresponding regions are. The formalism of weighted graphs captures these relations between variables in a compact way, allowing the mathematical formalization of many spatial data analysis problems. The first part of the manuscript focuses on the efficient solution of spatial regularization problems, involving penalties such as total variation or total contour length. We present a preconditioning strategy for the generalized forward-backward algorithm, specifically adapted to solving problems structured by weighted graphs with high variability of configurations and weights. We then present a new algorithm called cut pursuit, which exploits the relationships between flow algorithms and total variation through a working-set strategy. These algorithms show performance superior to the state of the art for geostatistical data aggregation tasks. The second part of the manuscript focuses on the development of a new model that extends continuous-time Markov chains to the case of general undirected weighted graphs. This model allows a finer consideration of the interactions between neighboring nodes for structured prediction, as illustrated for the supervised classification of urban fabrics.
  • Exploiting Crowd Sourced Reviews to Explain Movie Recommendation.

    Sara EL AOUAD, Christophe DUPUY, Renata TEIXEIRA, Francis BACH, Christophe DIOT
    Lecture Notes in Computer Science | 2016
    No summary available.
  • ABX-discriminability measures and applications.

    Thomas SCHATZ, Francis BACH, Emmanuel DUPOUX, Daniel SWINGLEY, Jean luc SCHWARTZ, Ludovic DENOYER, Martine ADDA DECKER
    2016
    This thesis is, first of all, an indirect contribution to the problem of modeling the acquisition of phonetic categories in children. The computational models already proposed have not yet been systematically tested to determine whether they are really able to account for a substantial part of the available empirical evidence. We develop an approach allowing a systematic evaluation of models based on ABX discriminability measures. We show the interest of our approach by applying it to two related problems: the processing of phonetic categories at birth and in adulthood. The next step will of course be to apply our approach to models of phonetic category acquisition. The interest of ABX discriminability measures is not restricted to the particular case of evaluating models of phonetic category processing. They are useful in the study of signals other than speech and of categories other than phonetic categories, as well as in disciplinary fields other than the cognitive sciences, such as engineering, data mining or artificial intelligence. We justify this by studying the properties of these measures in a general abstract framework and by presenting three main families of applications: the evaluation of the ability of systems operating in the absence of explicit supervision to represent a categorical structure; the formulation of simple computational models of behavior in discrimination tasks; and the definition of descriptive measures for representations associated with categorical data.
  • A weakly-supervised discriminative model for audio-to-score alignment.

    Remi LAJUGIE, Piotr BOJANOWSKI, Philippe CUVILLIER, Sylvain ARLOT, Francis BACH
    2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) | 2016
    In this paper, we consider a new discriminative approach to the problem of audio-to-score alignment. We consider two distinct pieces of information provided by music scores: (i) an exact ordered list of musical events and (ii) approximate prior information about the relative duration of events. We extend the basic dynamic time warping algorithm to a convex problem that learns optimal classifiers for all events while jointly aligning files, using this weak supervision only. We show that the relative duration between events can easily be used as a penalization of our cost function and allows us to drastically improve the performance of our approach. We demonstrate the validity of our approach on a large and realistic dataset.
  • PAC-Bayesian Theory Meets Bayesian Inference.

    Pascal GERMAIN, Francis BACH, Alexandre LACOSTE, Simon LACOSTE JULIEN
    Neural Information Processing Systems (NIPS 2016) | 2016
    We exhibit a strong link between frequentist PAC-Bayesian risk bounds and the Bayesian marginal likelihood. That is, for the negative log-likelihood loss function, we show that the minimization of PAC-Bayesian generalization risk bounds maximizes the Bayesian marginal likelihood. This provides an alternative explanation of the Bayesian Occam's razor criteria, under the assumption that the data is generated by an i.i.d. distribution.
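
    Schematically, the link rests on the standard variational identity for the Gibbs posterior (a sketch of the argument, with $\hat{L}(h)$ the empirical per-sample negative log-likelihood of hypothesis $h$ and $\pi$ the prior):

      $\min_{\rho}\Big\{\mathbb{E}_{h\sim\rho}\big[\hat{L}(h)\big] + \tfrac{1}{n}\mathrm{KL}(\rho\,\|\,\pi)\Big\} \;=\; -\tfrac{1}{n}\log \mathbb{E}_{h\sim\pi}\big[e^{-n\hat{L}(h)}\big] \;=\; -\tfrac{1}{n}\log p(x_1,\dots,x_n),$

    since $e^{-n\hat{L}(h)} = \prod_i p(x_i\,|\,h)$: minimizing the PAC-Bayesian bound on the left amounts to maximizing the log marginal likelihood on the right.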
  • Learning with Clustering Structure.

    Vincent ROULET, Fajwel FOGEL, Alexandre D ASPREMONT, Francis BACH
    2016
    We study a supervised clustering problem seeking to cluster either features, tasks or sample points using losses extracted from supervised learning problems. We formulate a unified optimization problem handling these three settings and derive algorithms whose core iteration complexity is concentrated in a k-means clustering step, which can be approximated efficiently. We test our methods on both artificial and realistic data sets extracted from movie reviews and 20NewsGroup.
  • Machine learning: from theory to practice.

    Massih reza AMINI, Francis BACH
    2015
    No summary available.
  • Convex and spectral relaxations for phase retrieval, seriation and ranking.

    Fajwel FOGEL, Francis BACH, Alexandre d ASPREMONT, Milan VOJNOVIC, Erwan LE PENNEC, Stephan CLEMENCON, Anatoli JUDITSKY
    2015
    No summary available.
  • Exploiting crowd sourced reviews to explain movie recommendation.

    Sara EL AOUAD, Christophe DUPUY, Renata TEIXEIRA, Christophe DIOT, Francis BACH
    2nd Workshop on Recommendation Systems for TELEVISION and ONLINE VIDEO | 2015
    Streaming services such as Netflix, M-Go, and Hulu use advanced recommender systems to help their customers identify relevant content quickly and easily. These recommenders display the list of recommended movies organized in sublists labeled with the genre or some more specific labels. Unfortunately, existing methods to extract these labeled sublists require human annotators to manually label movies, which is time-consuming and biased by the views of annotators. In this paper, we design a method that relies on crowd sourced reviews to automatically identify groups of similar movies and label these groups. Our method takes the content of movie reviews available online as input for an algorithm based on Latent Dirichlet Allocation (LDA) that identifies groups of similar movies. We separate the set of similar movies that share the same combination of genre in sublists and personalize the movies to show in each sublist using matrix factorization. The results of a side-by-side comparison of our method against Technicolor's M-Go VoD service are encouraging.
  • Non-parametric Stochastic Approximation with Large Step sizes.

    Aymeric DIEULEVEUT, Francis BACH
    2015
    We consider the random-design least-squares regression problem within the reproducing kernel Hilbert space (RKHS) framework. Given a stream of independent and identically distributed input/output data, we aim to learn a regression function within an RKHS $\mathcal{H}$, even if the optimal predictor (i.e., the conditional expectation) is not in $\mathcal{H}$. In a stochastic approximation framework where the estimator is updated after each observation, we show that the averaged unregularized least-mean-square algorithm (a form of stochastic gradient descent), given a sufficiently large step size, attains optimal rates of convergence for a variety of regimes of smoothness of the optimal prediction function and of the functions in $\mathcal{H}$.
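
    A sketch (in Python; the Gaussian kernel, step size, and sine target are illustrative choices) of the averaged unregularized kernel least-mean-squares recursion analyzed above: each observation adds one kernel atom, and the averaged estimator is tracked through its coefficients:

      import numpy as np

      # Online kernel LMS with Polyak-Ruppert averaging. Coefficients are never
      # revisited: step t appends gamma * residual as the coefficient of k(x_t, .).
      rng = np.random.default_rng(0)
      gamma = 0.5
      k = lambda a, b: np.exp(-(a - b) ** 2)               # Gaussian RKHS kernel

      xs, coefs, avg = [], [], []
      f = lambda x, c: sum(ci * k(xi, x) for xi, ci in zip(xs, c))
      for t in range(1, 1001):
          x = rng.uniform(-1, 1)
          y = np.sin(3 * x) + 0.1 * rng.standard_normal()  # target need not lie in H
          coefs.append(gamma * (y - f(x, coefs)))          # SGD step: one new atom
          xs.append(x)
          # averaging the iterates = averaging the (zero-padded) coefficient vectors
          avg = [a + (c - a) / t for a, c in zip(avg, coefs)] + [coefs[-1] / t]
      # f(x, avg) evaluates the averaged estimator at input x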
  • Sample Complexity of Dictionary Learning and Other Matrix Factorizations.

    Remi GRIBONVAL, Rodolphe JENATTON, Francis BACH, Martin KLEINSTEUBER, Matthias SEIBERT
    IEEE Transactions on Information Theory | 2015
    Many modern tools in machine learning and signal processing, such as sparse dictionary learning, principal component analysis (PCA), non-negative matrix factorization (NMF), $K$-means clustering, etc., rely on the factorization of a matrix obtained by concatenating high-dimensional vectors from a training collection. While the idealized task would be to optimize the expected quality of the factors over the underlying distribution of training vectors, it is achieved in practice by minimizing an empirical average over the considered collection. The focus of this paper is to provide sample complexity estimates that uniformly control how much the empirical average deviates from the expected cost function. Standard arguments imply that the performance of the empirical predictor also exhibits such guarantees. The level of genericity of the approach encompasses several possible constraints on the factors (tensor product structure, shift-invariance, sparsity \ldots), thus providing a unified perspective on the sample complexity of several widely used matrix factorization schemes. The derived generalization bounds behave proportionally to $\sqrt{\log(n)/n}$ w.r.t.\ the number of samples $n$ for the considered matrix factorization techniques.
  • Structured prediction for sequential data analysis.

    Remi LAJUGIE, Francis BACH, Sylvain ARLOT
    2015
    In this thesis, we are interested in machine learning problems for structured outputs with a sequential structure. On the one hand, we consider the problem of learning similarity measures for two tasks: (i) change-point detection in multivariate signals and (ii) time warping between pairs of signals. The methods generally used to solve these two problems depend heavily on a similarity measure. We learn a similarity measure from fully labeled data, using standard structured prediction algorithms that are efficient for learning, and we validate our approach on real data from various domains. On the other hand, we address the problem of weak supervision for the task of aligning an audio recording to the played score. We consider the score as a symbolic representation giving (i) complete information on the order of the symbols and (ii) approximate information on the shape of the expected alignment. We learn a classifier for each symbol with this information. We develop a learning method based on the optimization of a convex function and demonstrate the validity of the approach on musical data.
  • Weakly-Supervised Alignment of Video with Text.

    P. BOJANOWSKI, R. LAJUGIE, E. GRAVE, F. BACH, I. LAPTEV, J. PONCE, C. SCHMID
    2015 IEEE International Conference on Computer Vision (ICCV) | 2015
    Suppose that we are given a set of videos, along with natural language descriptions in the form of multiple sentences (e.g.
  • Feature extraction and supervised learning on fMRI : from practice to theory.

    Fabian PEDREGOSA IZQUIERDO, Francis BACH, Alexandre GRAMFORT, Dimitri VAN DE VILLE, Alain RAKOTOMAMONJY, Ludovic DENOYER, Bertrand THIRION, Marcel VAN GERVEN
    2015
    Until the advent of non-invasive neuroimaging methods, knowledge of the brain was acquired through the study of its lesions, post-mortem analyses and invasive experiments. Nowadays, modern imaging techniques such as fMRI are able to reveal many aspects of the human brain at a progressively higher spatio-temporal resolution. However, in order to answer increasingly complex neuroscientific questions, technical improvements in acquisition must be coupled with new methods of data analysis. In this thesis, I propose different applications of statistical learning to fMRI data processing. Often, the data acquired by the fMRI scanner undergo a feature-extraction step in which activation maps are extracted from the fMRI signal. The first contribution of this thesis is the introduction of a model named Rank-1 GLM (R1-GLM) for the joint estimation of activation maps and the hemodynamic response function (HRF). We quantify the improvement of this approach over existing procedures on different fMRI datasets. The second part of this thesis is devoted to the decoding problem in fMRI, i.e., the task of predicting some information about the stimuli from the brain activation maps. From a statistical point of view, this problem is difficult due to the high dimensionality of the data, often thousands of variables, while the number of images available for training is small, typically a few hundred. We consider the case where the target variable takes discrete, ordered values. The second contribution of this thesis is to propose two measures to evaluate the performance of a decoding model: the absolute error and pairwise disagreement. We present several models that optimize a convex approximation of these loss functions and examine their performance on fMRI data sets. Motivated by the success of some ordinal regression models for the fMRI-based decoding task, we turn to the study of some theoretical properties of these methods. The property we study is known as Fisher consistency. The third, and most theoretical, contribution of this thesis is to examine the consistency properties of a rich family of loss functions that are used in ordinal regression models.
  • Sequential Kernel Herding: Frank-Wolfe Optimization for Particle Filtering.

    Simon LACOSTE JULIEN, Fredrik LINDSTEN, Francis BACH
    18th International Conference on Artificial Intelligence and Statistics (AISTATS) | 2015
    Recently, the Frank-Wolfe optimization algorithm was suggested as a procedure to obtain adaptive quadrature rules for integrals of functions in a reproducing kernel Hilbert space (RKHS) with a potentially faster rate of convergence than Monte Carlo integration (and "kernel herding" was shown to be a special case of this procedure). In this paper, we propose to replace the random sampling step in a particle filter by Frank-Wolfe optimization. By optimizing the position of the particles, we can obtain better accuracy than random or quasi-Monte Carlo sampling. In applications where the evaluation of the emission probabilities is expensive (such as in robot localization), the additional computational cost to generate the particles through optimization can be justified. Experiments on standard synthetic examples as well as on a robot localization task indicate indeed an improvement of accuracy over random and quasi-Monte Carlo sampling.
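    The greedy step at the heart of kernel herding is easy to sketch. The following Python fragment performs plain kernel herding over a finite candidate pool (not the sequential particle-filter variant of the paper); the Gaussian kernel, and the use of the pool's empirical mean embedding in place of $E_p[k(x,\cdot)]$, are simplifying assumptions:

```python
import numpy as np

def kernel_herding(candidates, n_points, bandwidth=1.0):
    """Greedy kernel herding over a finite candidate pool (illustrative sketch).

    The pool is assumed to be drawn from the target distribution, so its
    empirical mean embedding stands in for E_p[k(x, .)].
    """
    m = len(candidates)
    # Gram matrix of the candidate pool (Gaussian kernel)
    sq = np.sum((candidates[:, None, :] - candidates[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (2 * bandwidth ** 2))
    mu = K.mean(axis=1)       # approximate mean embedding evaluated at each candidate
    herd_sum = np.zeros(m)    # running sum of k(x_j, .) over selected points
    selected = []
    for t in range(n_points):
        # Frank-Wolfe / herding criterion: pick the point with the largest
        # remaining discrepancy between target and current quadrature rule
        scores = mu - herd_sum / max(t, 1)
        j = int(np.argmax(scores))
        selected.append(j)
        herd_sum += K[j]
    return candidates[selected]
```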
  • Sparse and Spurious: Dictionary Learning With Noise and Outliers.

    Remi GRIBONVAL, Rodolphe JENATTON, Francis BACH
    IEEE Transactions on Information Theory | 2015
    A popular approach within the signal processing and machine learning communities consists in modelling signals as sparse linear combinations of atoms selected from a learned dictionary. While this paradigm has led to numerous empirical successes in various fields ranging from image to audio processing, there have only been a few theoretical arguments supporting this evidence. In particular, sparse coding, or sparse dictionary learning, relies on a non-convex procedure whose local minima have not been fully analyzed yet. In this paper, we consider a probabilistic model of sparse signals, and show that, with high probability, sparse coding admits a local minimum around the reference dictionary generating the signals. Our study takes into account the case of over-complete dictionaries, noisy signals, and possible outliers, thus extending previous work limited to noiseless settings and/or under-complete dictionaries. The analysis we conduct is non-asymptotic and makes it possible to understand how the key quantities of the problem, such as the coherence or the level of noise, can scale with respect to the dimension of the signals, the number of atoms, the sparsity and the number of observations.
  • An online EM algorithm in hidden (semi-)Markov models for audio segmentation and clustering.

    Alberto BIETTI, Francis BACH, Arshia CONT
    2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) | 2015
    Audio segmentation is an essential problem in many audio signal processing tasks, which tries to segment an audio signal into homogeneous chunks. Rather than separately finding change points and computing similarities between segments, we focus on joint segmentation and clustering, using the framework of hidden Markov and semi-Markov models. We introduce a new incremental EM algorithm for hidden Markov models (HMMs) and show that it compares favorably to existing online EM algorithms for HMMs. We present results for real-time segmentation of musical notes and acoustic scenes.
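    For context, such incremental algorithms build on the generic online EM recursion of Cappé and Moulines, in which expected sufficient statistics are updated with a decreasing step size; a sketch, with $\hat{s}_n$ the running statistics, $\bar\theta$ the usual M-step mapping, $Z_n$ the latent state and $(\rho_n)$ a step-size sequence:

    $$\hat{s}_n = (1 - \rho_n)\,\hat{s}_{n-1} + \rho_n\, \mathbb{E}_{\theta_{n-1}}\!\left[\, s(X_n, Z_n) \mid X_n \,\right], \qquad \theta_n = \bar\theta(\hat{s}_n).$$

    The HMM case additionally requires online smoothing of the latent sequence, which is where the paper's contribution lies.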
  • Rethinking LDA: Moment Matching for Discrete ICA.

    Anastasia PODOSINNIKOVA, Francis BACH, Simon LACOSTE JULIEN
    NIPS 2015 - Advances in Neural Information Processing Systems 28 | 2015
    We consider moment matching techniques for estimation in latent Dirichlet allocation (LDA). By drawing explicit links between LDA and discrete versions of independent component analysis (ICA), we first derive a new set of cumulant-based tensors, with an improved sample complexity. Moreover, we reuse standard ICA techniques such as joint diagonalization of tensors to improve over existing methods based on the tensor power method. In an extensive set of experiments on both synthetic and real datasets, we show that our new combination of tensors and orthogonal joint diagonalization techniques outperforms existing moment matching methods.
  • Semidefinite and Spectral Relaxations for Multi-Label Classification.

    Remi LAJUGIE, Piotr BOJANOWSKI, Sylvain ARLOT, Francis BACH
    2015
    In this paper, we address the problem of multi-label classification. We consider linear classifiers and propose to learn a prior over the space of labels to directly improve the performance of such methods. This prior takes the form of a quadratic function of the labels and makes it possible to encode both attractive and repulsive relations between labels. We cast this problem as a structured prediction one, aiming at optimizing either the accuracies of the predictors or the F1-score. This leads to an optimization problem closely related to the max-cut problem, which naturally suggests semidefinite and spectral relaxations. We show on standard datasets how such a general prior can improve the performance of multi-label techniques.
  • From Averaging to Acceleration, There is Only a Step-size.

    Nicolas FLAMMARION, Francis BACH
    Proceedings of The 28th Conference on Learning Theory, (COLT) | 2015
    We show that accelerated gradient descent, averaged gradient descent and the heavy-ball method for non-strongly-convex problems may be reformulated as constant-parameter second-order difference equation algorithms, where stability of the system is equivalent to convergence at rate $O(1/n^2)$, where $n$ is the number of iterations. We provide a detailed analysis of the eigenvalues of the corresponding linear dynamical system, showing various oscillatory and non-oscillatory behaviors, together with a sharp stability result with explicit constants. We also consider the situation where noisy gradients are available, where we extend our general convergence result, which suggests an alternative algorithm (i.e., with different step sizes) that exhibits the good aspects of both averaging and acceleration.
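    As a sketch of the common template (a generic two-parameter form consistent with, but not necessarily identical to, the paper's exact parametrization), all three methods fit a recursion of the type

    $$\theta_{n+1} = \theta_n + \beta\,(\theta_n - \theta_{n-1}) - \gamma\, f'\!\big(\theta_n + \delta\,(\theta_n - \theta_{n-1})\big),$$

    where $\beta = \delta = 0$ recovers plain gradient descent, $\delta = 0$ with $\beta > 0$ gives the heavy-ball method, and $\delta = \beta$ gives Nesterov-type acceleration; particular couplings of $(\beta, \gamma)$ correspond to averaging.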
  • Dictionary learning for parsimonious representations.

    Remi GRIBONVAL, Rodolphe JENATTON, Francis BACH, Martin KLEINSTEUBER, Matthias SEIBERT
    46e Journées de Statistique | 2014
    A popular approach within the signal processing and machine learning communities consists in modelling high-dimensional data as sparse linear combinations of atoms selected from a dictionary. Given the importance of the choice of the dictionary for the operational deployment of these tools, a growing interest in \emph{learned} dictionaries has emerged. The most popular dictionary learning techniques, which are expressed as large-scale matrix factorization through the optimization of a non-convex cost function, have been widely disseminated thanks to extensive empirical evidence of their success and steady algorithmic progress. Yet, until recently they remained essentially heuristic. We will present recent work on statistical aspects of sparse dictionary learning, contributing to the characterization of the excess risk as a function of the number of training samples. The results cover not only sparse dictionary learning but also a much larger class of constrained matrix factorization problems.
  • A Markovian approach to distributional semantics.

    Edouard GRAVE, Francis BACH, BLEI, YVON, GALLINARI, SAGOT, OBOZINSKI
    2014
    This thesis, organized in two independent parts, focuses on distributional semantics and variable selection. In the first part, we introduce a new method for learning word representations from large amounts of raw text. This method is based on a probabilistic model of the sentence, using a hidden Markov model and a dependency tree. We present an efficient algorithm for performing inference and learning in such a model, based on the online EM algorithm and approximate message passing. We evaluate the obtained models on intrinsic tasks, such as predicting human similarity judgments or categorizing words, and on two extrinsic tasks: named entity recognition and supersense tagging. In the second part, we introduce, in the context of linear models, a new penalty for variable selection in the presence of highly correlated predictors. This penalty, called trace Lasso, uses the trace norm of the selected predictors, which is a convex relaxation of their rank, as a complexity criterion. The trace Lasso interpolates between the $\ell_1$ and $\ell_2$ norms: when all predictors are orthogonal, it is equal to the $\ell_1$ norm, while when all predictors are identical, it is equal to the $\ell_2$ norm. We propose two algorithms to compute the solution of the trace-Lasso-regularized least-squares regression problem, and perform experiments on synthetic data.
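    For reference, the published trace Lasso penalty takes a simple closed form; the display below is a sketch in the notation of the corresponding paper, with $X$ the design matrix (assumed to have normalized columns) and $\mathrm{Diag}(w)$ the diagonal matrix built from $w$:

    $$\Omega(w) = \left\| X \,\mathrm{Diag}(w) \right\|_{*},$$

    the trace (nuclear) norm of $X\,\mathrm{Diag}(w)$, which equals $\|w\|_1$ for orthonormal predictors and $\|w\|_2$ when all predictors coincide.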
  • Language modeling using structured penalties.

    Anil kumar NELAKANTI, Francis BACH, ARCHAMBEAU, ARTIERES, AMINI, BOUCHARD
    2014
    Natural language modeling is one of the fundamental challenges of artificial intelligence and interactive system design, with applications in dialogue systems, text generation and machine translation. We propose a discriminative log-linear model that gives the distribution of the words that follow a given context. Because of data sparsity, we propose a penalty term that correctly encodes the structure of the function space, so as to avoid overfitting and improve generalization while appropriately capturing long-range dependencies. The result is an efficient model that captures long dependencies without a large increase in space or time resources. In a log-linear model, the training and test phases become increasingly expensive as the number of classes grows; in a language model, the number of classes is the size of the vocabulary, which is usually very large. A common trick is to apply the model in two steps: the first step identifies the most likely cluster and the second step takes the most likely word from the chosen cluster. This idea can be generalized to a deeper hierarchy with multiple levels of clustering. However, the performance of the resulting hierarchical clustering system depends on the application domain and on the construction of a good hierarchy. We investigate different strategies for constructing the hierarchy of categories from data.
  • Weakly Supervised Action Labeling in Videos under Ordering Constraints.

    Piotr BOJANOWSKI, Remi LAJUGIE, Francis BACH, Ivan LAPTEV, Jean PONCE, Cordelia SCHMID, Josef SIVIC
    Lecture Notes in Computer Science | 2014
    We are given a set of video clips, each one annotated with an ordered list of actions, such as "walk" then "sit" then "answer phone", extracted from, for example, the associated text script. We seek to temporally localize the individual actions in each clip as well as to learn a discriminative classifier for each action. We formulate the problem as a weakly supervised temporal assignment with ordering constraints. Each video clip is divided into small time intervals and each time interval of each video clip is assigned one action label, while respecting the order in which the action labels appear in the given annotations. We show that the action label assignment can be determined together with learning a classifier for each action in a discriminative manner. We evaluate the proposed model on a new and challenging dataset of 937 video clips with a total of 787720 frames containing sequences of 16 different actions from 69 Hollywood movies.
  • Sparse Modeling for Image and Vision Processing.

    Julien MAIRAL, Francis BACH, Jean PONCE
    Foundations and Trends® in Computer Graphics and Vision | 2014
    In recent years, a large amount of multi-disciplinary research has been conducted on sparse models and their applications. In statistics and machine learning, the sparsity principle is used to perform model selection---that is, automatically selecting a simple model among a large collection of them. In signal processing, sparse coding consists of representing data with linear combinations of a few dictionary elements. Subsequently, the corresponding tools have been widely adopted by several scientific communities such as neuroscience, bioinformatics, or computer vision. The goal of this monograph is to offer a self-contained view of sparse modeling for visual recognition and image processing. More specifically, we focus on applications where the dictionary is learned and adapted to data, yielding a compact representation that has been successful in various contexts.
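    As an illustration of the dictionary-learning pipeline described above (not code from the monograph), the following Python snippet learns an over-complete dictionary on synthetic "patches" with scikit-learn and computes their sparse codes; all sizes and the penalty weight are arbitrary choices:

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

# Illustrative only: random "patches"; real uses would extract
# patches from natural images.
rng = np.random.default_rng(0)
patches = rng.standard_normal((500, 64))          # 500 patches of 8x8 pixels, flattened

learner = DictionaryLearning(
    n_components=100,       # over-complete: more atoms than pixel dimensions
    alpha=1.0,              # weight of the l1 penalty on the codes
    max_iter=20,
    transform_algorithm="lasso_lars",
    random_state=0,
)
codes = learner.fit(patches).transform(patches)   # sparse linear combinations of atoms
print(codes.shape, np.mean(codes != 0))           # most coefficients are exactly zero
```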
  • SegAnnDB: interactive Web-based genomic segmentation.

    Toby d HOCKING, Valentina BOEVA, Gudrun SCHLEIERMACHER, Isabelle JANOUEIX LEROSEY, Olivier DELATTRE, Wilfrid RICHER, Franck BOURDEAUT, Miyuki SUGURO, Masao SETO, Francis BACH, Jean philippe VERT, Guillem RIGAILL
    Bioinformatics | 2014
    Motivation: DNA copy number profiles characterize regions of chromosome gains, losses and breakpoints in tumor genomes. Although many models have been proposed to detect these alterations, it is not clear which model is appropriate before visually inspecting the signal, noise and models for a particular profile. Results: We propose SegAnnDB, a Web-based computer vision system for genomic segmentation: first, visually inspect the profiles and manually annotate altered regions, then SegAnnDB determines the precise alteration locations using a mathematical model of the data and annotations. SegAnnDB facilitates collaboration between biologists and bioinformaticians, and uses the University of California, Santa Cruz genome browser to visualize copy number alterations alongside known genes. Availability and implementation: The breakpoints project on INRIA GForge hosts the source code, an Amazon Machine Image can be launched and a demonstration Web site is http://bioviz.rocq.inria.fr. Contact: toby@sg.cs.titech.ac.jp. Supplementary information: Supplementary data are available at Bioinformatics online.
  • SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives.

    Aaron DEFAZIO, Francis BACH, Simon LACOSTE JULIEN
    Advances In Neural Information Processing Systems | 2014
    In this work we introduce a new optimisation method called SAGA in the spirit of SAG, SDCA, MISO and SVRG, a set of recently proposed incremental gradient algorithms with fast linear convergence rates. SAGA improves on the theory behind SAG and SVRG, with better theoretical convergence rates, and has support for composite objectives where a proximal operator is used on the regulariser. Unlike SDCA, SAGA supports non-strongly convex problems directly, and is adaptive to any inherent strong convexity of the problem. We give experimental results showing the effectiveness of our method.
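    A minimal Python sketch of the SAGA update, assuming only access to per-example gradients; the interface (`grad_i`, `prox`) and the step-size choice are illustrative, and the paper's theoretical tuning (step size tied to smoothness constants) is omitted:

```python
import numpy as np

def saga(grad_i, n, dim, step, n_epochs=10, prox=None):
    """Minimal SAGA sketch for minimizing an average of n smooth losses f_i.

    grad_i(w, i) returns the gradient of f_i at w; prox, if given, is the
    proximal operator of the composite regulariser at step size `step`.
    """
    w = np.zeros(dim)
    table = np.zeros((n, dim))           # last stored gradient of each f_i
    table_mean = table.mean(axis=0)      # kept in sync with the table
    for _ in range(n_epochs * n):
        i = np.random.randint(n)
        g = grad_i(w, i)
        # unbiased gradient estimate: new g_i minus stored g_i plus table average
        w = w - step * (g - table[i] + table_mean)
        if prox is not None:
            w = prox(w, step)
        table_mean += (g - table[i]) / n  # update running mean, then the table
        table[i] = g
    return w
```

    For example, for least squares one would pass `grad_i = lambda w, i: (X[i] @ w - y[i]) * X[i]`, and for an $\ell_1$ regulariser the soft-thresholding operator as `prox`.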
  • On the sample complexity of sparse dictionary learning.

    M. SEIBERT, M. KLEINSTEUBER, R. GRIBONVAL, R. JENATTON, F. BACH
    2014 IEEE Workshop on Statistical Signal Processing (SSP) | 2014
    In the synthesis model, signals are represented as sparse combinations of atoms from a dictionary. Dictionary learning describes the acquisition process of the underlying dictionary for a given set of training samples. While ideally this would be achieved by optimizing the expectation of the factors over the underlying distribution of the training data, in practice the necessary information about the distribution is not available. Therefore, in real-world applications it is achieved by minimizing an empirical average over the available samples. The main goal of this paper is to provide a sample complexity estimate that controls to what extent the empirical average deviates from the expected cost function. This estimate then quantifies the accuracy of the representation provided by the learned dictionary. The presented approach exemplifies the general results proposed by the authors in [1] and gives more concrete bounds on the sample complexity of dictionary learning. We cover a variety of sparsity measures employed in the learning procedure.
  • Large-Margin Metric Learning for Constrained Partitioning Problems.

    Remi LAJUGIE, Sylvain ARLOT, Francis BACH
    Proceedings of The 31st International Conference on Machine Learning | 2014
    We consider unsupervised partitioning problems based explicitly or implicitly on the minimization of Euclidean distortions, such as clustering, image or video segmentation, and other change-point detection problems. We emphasize cases with specific structure, which include many practical situations ranging from mean-based change-point detection to image segmentation problems. We aim at learning a Mahalanobis metric for these unsupervised problems, leading to feature weighting and/or selection. This is done in a supervised way by assuming the availability of several (partially) labeled datasets that share the same metric. We cast the metric learning problem as a large-margin structured prediction problem, with proper definition of regularizers and losses, leading to a convex optimization problem which can be solved efficiently. Our experiments show how learning the metric can significantly improve performance on bioinformatics, video or image segmentation problems.
  • A Markovian approach to distributional semantics with application to semantic compositionality.

    Edouard GRAVE, Guillaume OBOZINSKI, Francis BACH
    International Conference on Computational Linguistics (Coling) | 2014
    In this article, we describe a new approach to distributional semantics. This approach relies on a generative model of sentences with latent variables, which takes syntax into account through syntactic dependency trees. Words are then represented as posterior distributions over those latent classes, and the model allows us to naturally obtain in-context and out-of-context word representations, which are comparable. We train our model on a large corpus and demonstrate the compositionality capabilities of our approach on different datasets.
  • Metric Learning for Temporal Sequence Alignment.

    Damien GARREAU, Remi LAJUGIE, Sylvain ARLOT, Francis BACH
    Advances in Neural Information Processing Systems 27 (NIPS 2014) | 2014
    In this paper, we propose to learn a Mahalanobis distance to perform alignment of multivariate time series. The learning examples for this task are time series for which the true alignment is known. We cast the alignment problem as a structured prediction task, and propose realistic losses between alignments for which the optimization is tractable. We provide experiments on real data in the audio-to-audio context, where we show that learning a similarity measure leads to improvements in the performance of the alignment task. We also propose to use this metric learning framework to perform feature selection and, from basic audio features, build a combination of them with better alignment performance.
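    As a sketch of the inner alignment routine that such a learned metric plugs into (the paper's actual losses and learning procedure are not reproduced here), the following Python fragment computes a dynamic-time-warping cost under a diagonal metric, a simplified stand-in for a full Mahalanobis matrix:

```python
import numpy as np

def dtw_cost(A, B, metric_diag):
    """Dynamic time warping between sequences A (n x d) and B (m x d),
    with squared distances weighted by a learned diagonal metric."""
    n, m = len(A), len(B)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = A[i - 1] - B[j - 1]
            cost = np.dot(metric_diag * d, d)   # d^T Diag(metric) d
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```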
  • Learning sparse penalties for change-point detection using max margin interval regression.

    Guillem RIGAILL, Toby d. HOCKING, Francis BACH, Jean philippe VERT
    ICML 2013 | 2013
    In segmentation models, the number of change-points is typically chosen using a penalized cost function. In this work, we propose to learn the penalty and its constants in databases of signals with weak change-point annotations. We propose a convex relaxation for the resulting interval regression problem, and solve it using accelerated proximal gradient methods. We show that this method achieves state-of-the-art change-point detection in a database of annotated DNA copy number profiles from neuroblastoma tumors.
  • Multi-task statistical learning.

    Matthieu SOLNON, Sylvain ARLOT, Francis BACH
    2013
    The purpose of this thesis is to construct, calibrate and study multi-task estimators in a non-parametric and non-asymptotic frequentist framework. We work in the framework of kernel ridge regression and extend existing methods of multi-task regression. The key issue is the calibration of a matrix regularization parameter, which encodes the similarity between tasks. We propose a calibration method for this parameter, based on the estimation of the noise covariance matrix between tasks. We then give optimality guarantees for the obtained estimator, via an oracle inequality, and verify its behavior on simulated examples. We also obtain precise bounds on the risks of the multi-task and single-task oracle estimators in some cases. This allows us to identify several interesting situations where the multi-task oracle is more efficient than the single-task oracle, or vice versa, and to ensure that the oracle inequality forces the multi-task estimator to have a lower risk than the single-task estimator in the studied cases. The behavior of the multi-task and single-task oracles is verified on simulated examples.
  • Finding Actors and Actions in Movies.

    P. BOJANOWSKI, F. BACH, I. LAPTEV, J. PONCE, C. SCHMID, J. SIVIC
    2013 IEEE International Conference on Computer Vision | 2013
    We address the problem of learning a joint model of actors and actions in movies using weak supervision provided by scripts. Specifically, we extract actor/action pairs from the script and use them as constraints in a discriminative clustering framework. The corresponding optimization problem is formulated as a quadratic program under linear constraints. People in video are represented by automatically extracted and tracked faces together with corresponding motion features. First, we apply the proposed framework to the task of learning names of characters in the movie and demonstrate significant improvements over previous methods used for this task. Second, we explore the joint actor/action constraint and show its advantage for weakly supervised action learning. We validate our method in the challenging setting of localizing and recognizing characters and their actions in the feature-length movies Casablanca and American Beauty.
  • Hidden Markov tree models for semantic class induction.

    Edouard GRAVE, Guillaume OBOZINSKI, Francis BACH
    CoNLL - Seventeenth Conference on Computational Natural Language Learning | 2013
    In this paper, we propose a new method for semantic class induction. First, we introduce a generative model of sentences, based on dependency trees, which takes homonymy into account. Our model can thus be seen as a generalization of Brown clustering. Second, we describe an efficient algorithm to perform inference and learning in this model. Third, we apply our proposed method on two large datasets ($10^8$ tokens, $10^5$ word types), and demonstrate that classes induced by our algorithm improve performance over Brown clustering on the tasks of semi-supervised supersense tagging and named entity recognition.
  • Evaluating speech features with the Minimal-Pair ABX task: Analysis of the classical MFC/PLP pipeline.

    Thomas SCHATZ, Vijayaditya PEDDINTI, Francis BACH, Aren JANSEN, Hynek HERMANSKY, Emmanuel DUPOUX
    Proceedings of INTERSPEECH 2013 | 2013
    We present a new framework for the evaluation of speech representations in zero-resource settings, that extends and complements previous work by Carlin, Jansen and Hermansky [1]. In particular, we replace their Same/Different discrimination task by several Minimal-Pair ABX (MP-ABX) tasks. We explain the analytical advantages of this new framework and apply it to decompose the standard signal processing pipelines for computing PLP and MFC coefficients. This method enables us to confirm and quantify a variety of well-known and not-so-well-known results in a single framework.
  • Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n).

    Francis BACH, Eric MOULINES
    Advances in Neural Information Processing Systems (NIPS) | 2013
    We consider the stochastic approximation problem where a convex function has to be minimized, given only the knowledge of unbiased estimates of its gradients at certain points, a framework which includes machine learning methods based on the minimization of the empirical risk. We focus on problems without strong convexity, for which all previously known algorithms achieve a convergence rate for function values of O(1/n^{1/2}). We consider and analyze two algorithms that achieve a rate of O(1/n) for classical supervised learning problems. For least-squares regression, we show that averaged stochastic gradient descent with constant step-size achieves the desired rate. For logistic regression, this is achieved by a simple novel stochastic gradient algorithm that (a) constructs successive local quadratic approximations of the loss functions, while (b) preserving the same running time complexity as stochastic gradient descent. For these algorithms, we provide a non-asymptotic analysis of the generalization error (in expectation, and also in high probability for least-squares), and run extensive experiments on standard machine learning benchmarks showing that they often outperform existing approaches.
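    A minimal Python sketch of the least-squares case, i.e., one pass of constant-step-size stochastic gradient descent with iterate averaging; the step-size heuristic below is an assumption for illustration, not the constant prescribed by the paper's analysis:

```python
import numpy as np

def averaged_sgd_least_squares(X, y, step=None):
    """Single pass of constant-step-size SGD for least squares,
    returning the averaged iterate (illustrative sketch)."""
    n, d = X.shape
    if step is None:
        # heuristic scaling with the data radius -- an assumption,
        # not the paper's exact tuning
        step = 1.0 / (4.0 * np.mean(np.sum(X ** 2, axis=1)))
    w = np.zeros(d)
    w_bar = np.zeros(d)
    for t in range(n):
        grad = (X[t] @ w - y[t]) * X[t]   # gradient of 0.5 * (x_t . w - y_t)^2
        w = w - step * grad
        w_bar += (w - w_bar) / (t + 1)    # online average of the iterates
    return w_bar
```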
  • Learning smoothing models of copy number profiles using breakpoint annotations.

    Toby dylan HOCKING, Gudrun SCHLEIERMACHER, Isabelle JANOUEIX LEROSEY, Valentina BOEVA, Julie CAPPO, Olivier DELATTRE, Francis BACH, Jean philippe VERT
    BMC Bioinformatics | 2013
    Many models have been proposed to detect copy number alterations in chromosomal copy number profiles, but it is usually not obvious which is most effective for a given data set. Furthermore, most methods have a smoothing parameter that determines the number of breakpoints and must be chosen using various heuristics.
  • Intersecting singularities for multi-structured estimation.

    Emile RICHARD, Francis BACH, Jean philippe VERT
    Proceedings of the International Conference on Machine Learning (ICML) | 2013
    We address the problem of designing a convex nonsmooth regularizer encouraging multiple structural effects simultaneously. Focusing on the inference of sparse and low-rank matrices, we suggest a new complexity index and a convex penalty approximating it. The new penalty term can be written as the trace norm of a linear function of the matrix. By analyzing theoretical properties of this family of regularizers we come up with oracle inequalities and compressed sensing results ensuring the quality of our regularized estimator. We also provide algorithms and supporting numerical experiments.
  • Kernel-Based Methods for Hypothesis Testing: A Unified View.

    Zaid HARCHAOUI, Francis BACH, Olivier CAPPE, Eric MOULINES
    IEEE Signal Processing Magazine | 2013
    Kernel-based methods provide a rich and elegant framework for developing nonparametric detection procedures for signal processing. Several recently proposed procedures can be simply described using basic concepts of reproducing kernel Hilbert space embeddings of probability distributions, namely mean elements and covariance operators. We propose a unified view of these tools, and draw relationships with information divergences between distributions.
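    A concrete instance of these tools is the unbiased estimate of the squared maximum mean discrepancy between two samples, computable directly from kernel evaluations; a Python sketch with a Gaussian kernel (the kernel choice and bandwidth are assumptions):

```python
import numpy as np

def mmd2_unbiased(X, Y, bandwidth=1.0):
    """Unbiased estimate of the squared maximum mean discrepancy between
    samples X and Y, one of the kernel test statistics built from mean
    embeddings of distributions."""
    def gram(A, B):
        sq = np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-sq / (2 * bandwidth ** 2))
    Kxx, Kyy, Kxy = gram(X, X), gram(Y, Y), gram(X, Y)
    n, m = len(X), len(Y)
    # drop diagonal terms for the unbiased within-sample averages
    term_x = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    return term_x + term_y - 2 * Kxy.mean()
```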
  • Convex Relaxations for Permutation Problems.

    Fajwel FOGEL, Rodolphe JENATTON, Francis BACH, Alexandre D ASPREMONT
    Neural Information Processing Systems | 2013
    Seriation seeks to reconstruct a linear order between variables using unsorted similarity information. It has direct applications in archeology and shotgun gene sequencing for example. We prove the equivalence between the seriation and the combinatorial 2-sum problem (a quadratic minimization problem over permutations) over a class of similarity matrices. The seriation problem can be solved exactly by a spectral algorithm in the noiseless case and we produce a convex relaxation for the 2-sum problem to improve the robustness of solutions in a noisy setting. This relaxation also allows us to impose additional structural constraints on the solution, to solve semi-supervised seriation problems. We present numerical experiments on archeological data, Markov chains and gene sequences.
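    The noiseless spectral step has a compact form: order the variables by the Fiedler vector of the Laplacian of the similarity matrix. A Python sketch of this standard spectral-seriation step (the convex relaxation for the noisy 2-sum problem is not reproduced here):

```python
import numpy as np

def spectral_seriation(S):
    """Spectral ordering for seriation (noiseless case): sort variables by
    the Fiedler vector of the Laplacian of the similarity matrix S."""
    d = S.sum(axis=1)
    L = np.diag(d) - S                     # unnormalized graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]                # eigenvector of the second-smallest eigenvalue
    return np.argsort(fiedler)             # permutation that orders the variables
```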
  • Domain adaptation for sequence labeling using hidden Markov models.

    Edouard GRAVE, Guillaume OBOZINSKI, Francis BACH
    New Directions in Transfer and Multi-Task: Learning Across Domains and Tasks (NIPS Workshop) | 2013
    Most natural language processing systems based on machine learning are not robust to domain shift. For example, a state-of-the-art syntactic dependency parser trained on Wall Street Journal sentences has an absolute drop in performance of more than ten points when tested on textual data from the Web. An efficient solution to make these methods more robust to domain shift is to first learn a word representation using large amounts of unlabeled data from both domains, and then use this representation as features in a supervised learning algorithm. In this paper, we propose to use hidden Markov models to learn word representations for part-of-speech tagging. In particular, we study the influence of using data from the source, the target or both domains to learn the representation and the different ways to represent words using an HMM.
  • Structured Penalties for Log-linear Language Models.

    Anil NELAKANTI, Cedric ARCHAMBEAU, Julien MAIRAL, Francis BACH, Guillaume BOUCHARD
    EMNLP - Empirical Methods in Natural Language Processing | 2013
    Language models can be formalized as log-linear regression models where the input features represent previously observed contexts up to a certain length m. The complexity of existing algorithms to learn the parameters by maximum likelihood scales linearly in nd, where n is the length of the training corpus and d is the number of observed features. We present a model whose complexity grows logarithmically in d, making it possible to efficiently leverage longer contexts. We account for the sequential structure of natural language using tree-structured penalized objectives to avoid overfitting and achieve better generalization.
  • Variable selection for unsupervised high-dimensional classification.

    Caroline MEYNET, Pascal MASSART, Gilles CELEUX, Francis BACH, Christophe BIERNACKI, Gerard BIAU, Marie anne POURSAT
    2012
    There are statistical modeling situations in which the classical problem of unsupervised classification (that is, without a priori information on the nature or number of classes to be formed) is coupled with the problem of identifying the variables that are really relevant for determining the classification. This problem has become all the more essential as so-called high-dimensional data, comprising many more variables than observations, have multiplied in recent years: gene expression data, curve classification. We propose a variable selection procedure for unsupervised classification adapted to high-dimensional problems. We consider a Gaussian mixture model approach, which allows us to reformulate the problem of variable selection and the choice of the number of classes as a global model selection problem. We exploit the variable selection properties of $\ell_1$ regularization to efficiently build, from the data, a collection of models that remains of reasonable size even in high dimension. We differ from classical $\ell_1$-regularized variable selection procedures with respect to parameter estimation: in each model, instead of considering the Lasso estimator, we compute the maximum likelihood estimator. Then, we select one of these maximum likelihood estimators by a penalized non-asymptotic criterion based on the slope heuristic introduced by Birgé and Massart. From a theoretical point of view, we establish a model selection theorem for the estimation of a density by maximum likelihood for a random collection of models. We apply it in our context to find a minimal penalty form for our penalized criterion. From a practical point of view, simulations are performed to validate our procedure, in particular in the context of unsupervised curve classification. The key idea of our procedure is to use $\ell_1$ regularization only to build a restricted collection of models, and not also to estimate the parameters of the models. This estimation step is performed by maximum likelihood. This hybrid procedure is inspired by a theoretical study carried out in the first part, in which we establish $\ell_1$ oracle inequalities for the Lasso in the frameworks of Gaussian regression and mixtures of Gaussian regressions, which differ from the traditionally established $\ell_0$ oracle inequalities by their complete absence of assumptions.
  • Regularization methods for prediction in dynamic graphs and e-marketing applications.

    Emile RICHARD, Nicolas VAYATIS, Francis BACH, Theodoros EVGENIOU, Stephane GAIFFAS, Michael irwin JORDAN, Thibaut MUNIER, Massimiliano PONTIL, Jean philippe VERT
    2012
    The prediction of connections between objects, based either on a noisy observation or on a sequence of observations, is a problem of interest for a number of applications, ranging from the design of recommendation systems in e-commerce and social networks to network inference in molecular biology. This work presents formulations of the link prediction problem, in both static and temporal settings, as regularized problems. In the static scenario, the combination of two well-known norms, the $\ell_1$-norm and the trace norm, enables link prediction; in the dynamic case, the use of an autoregressive model on linear descriptors improves the quality of prediction. We study the nature of the solutions of the optimization problems in both statistical and algorithmic terms. Encouraging empirical results highlight the contribution of the adopted methodology.
  • Convex optimization for cosegmentation.

    Armand JOULIN, Francis BACH, Michael irwin JORDAN, Jean PONCE, Cordelia SCHMID, Kristen GRAUMAN, Dale SCHUURMANS
    2012
    The apparent ease with which a human perceives his surroundings suggests that the process involved is partly mechanical, and therefore does not require a high degree of reflection. This observation suggests that our visual perception of the world can be simulated on a computer. Computer vision is the field of research devoted to the problem of creating a form of visual perception for computers. The computing power available in the 1950s did not allow the processing and analysis of the visual data necessary to create such a virtual visual perception; only recently have computing power and storage capacity allowed this field to really emerge. In two decades, computer vision has made it possible to address practical and industrial problems such as detecting faces, people behaving suspiciously in a crowd, or manufacturing defects on production lines. On the other hand, little progress has been made towards non-task-specific virtual visual perception, and the community still faces fundamental problems. One of these problems is to segment an optical stimulus or an image into meaningful regions, objects or actions. Scene segmentation is natural for humans, and essential to fully understand one's environment. Unfortunately it is also extremely difficult to reproduce on a computer, because there is no clear definition of a "meaningful" region: depending on the scene or the situation, a region can have different interpretations. In a street scene, distinguishing a pedestrian is important, whereas his clothes may not be; in a scene from a fashion show, a piece of clothing becomes an important element, and thus a significant region. Here, we focus on this segmentation problem and approach it from a particular angle to avoid this fundamental difficulty: we consider segmentation as a weakly supervised learning problem. Instead of segmenting images according to some predefined definition of "significant" regions, we develop methods to simultaneously segment a set of images into regions that appear regularly, and we define a statistically significant region as one that appears regularly across the image set. For this purpose we design models whose scope goes beyond the application to vision. Our approach is rooted in statistical learning, whose goal is to design efficient methods to extract and/or learn recurrent patterns in data sets, a field that has recently become very popular due to the increase in the number and size of available databases. It also draws on numerical optimization, in order to develop efficient algorithms adapted to our problems. In particular, we use and adapt recently developed tools to relax complex combinatorial problems into convex problems for which the optimal solution is guaranteed. We illustrate the quality of our formulations and algorithms on problems from domains other than computer vision, showing in particular that our work can be used in text classification and in cell biology.
  • Dictionary learning methods for single-channel source separation.

    Augustin LEFEVRE, Francis BACH, Olivier CAPPE, Cedric FEVOTTE, Arshia CONT, Pierre antoine ABSIL, Laurent DAUDET, Guillermo SAPIRO
    2012
    In this thesis, we propose three main contributions to dictionary learning methods. The first is a group-sparsity criterion adapted to NMF when the chosen distortion measure is the Itakura-Saito divergence. In most music signals one can find long intervals where only one source is active (solos). The group-sparsity criterion that we propose makes it possible to find such segments automatically and to learn a dictionary adapted to each source. These dictionaries are then used to perform the separation task in intervals where the sources are mixed. The two tasks of identification and separation are performed simultaneously in a single pass of our proposed algorithm. Our second contribution is an online algorithm for learning the dictionary at large scale, on signals several hours long. The memory required by online NMF estimation is constant, whereas in the standard version it grows linearly with the length of the signals, which is impractical for signals longer than one hour. Our third contribution concerns interaction with the user. For short signals, blind learning is particularly difficult, and providing information specific to the processed signal is essential. Our contribution is similar to inpainting and makes it possible to take time-frequency annotations into account. It is based on the observation that almost the entire spectrogram can be divided into regions specifically assigned to each source. We describe an extension of NMF that takes this information into account, and discuss the possibility of inferring it automatically with simple statistical learning tools.
  • Learning algorithms and statistical software, with applications to bioinformatics.

    Toby dylan HOCKING, Francis BACH, Jean philippe VERT, Idris ECKLEY, Isabelle JANOUEIX LEROSEY, Stephane ROBIN, Yves GRANDVALET
    2012
    Statistical learning is the field of mathematics concerned with the development of data analysis algorithms. This thesis is divided into two parts: the introduction of mathematical models and the implementation of software tools. In the first part, I present new algorithms for segmentation and for data partitioning (clustering), two analysis methods that look for structure in data, highlighting applications to bioinformatics. In the second part, I present my contributions to free statistical software used in the daily work of statisticians.
  • Structured sparsity-inducing norms : statistical and algorithmic properties with applications to neuroimaging.

    Rodolphe JENATTON, Jean yves AUDIBERT, Francis BACH, Remi GRIBONVAL, Eric MOULINES, Guillaume OBOZINSKI, Bertrand THIRION, Laurent EL GHAOUI, Massimiliano PONTIL
    2011
    Many areas of industry and applied science have witnessed a digital revolution, accompanied by a growth in the volume of data whose processing has become a technical challenge. In this context, sparsity has emerged as a central concept in statistical learning: it is indeed natural to want to exploit the available data via a reduced number of parameters. This thesis focuses on a particular and more recent form of sparsity, called structured sparsity. As its name indicates, we consider situations where, beyond sparsity alone, we also have a priori knowledge about structural properties of the problem. The objective of this thesis is to analyze the concept of structured sparsity from statistical, algorithmic and applied perspectives. We start by introducing a family of structured sparsity-inducing norms whose statistical aspects are studied in detail. We then consider dictionary learning, where we exploit the previously introduced norms in a matrix factorization framework. Various efficient algorithmic tools, such as proximal methods, are then proposed. Using these tools, we illustrate on many applications why structured sparsity can be beneficial. These examples include restoration tasks in image processing, hierarchical modeling of textual documents, and prediction of object size from functional magnetic resonance imaging signals.
  • Sparse coding for machine learning, image processing and computer vision.

    Julien MAIRAL, Francis BACH, Jean PONCE, Eric MOULINES, Guillermo SAPIRO, Jean philippe VERT, Stephane MALLAT, Bruno a. OLSHAUSEN
    2010
    We study in this thesis a particular representation of signals based on a statistical learning method, which consists in modeling data as linear combinations of a few elements of a learned dictionary. This can be seen as an extension of the classical wavelet framework, whose goal is to build such dictionaries (often orthonormal bases) adapted to natural signals. An important success of this approach has been its ability to model natural image patches, and the performance of image denoising methods based on it. We address several open questions related to this framework: How can a dictionary be learned efficiently? How can this model be enriched by adding underlying structure to the dictionary? Is it possible to improve current image processing methods based on this approach? How should the dictionary be learned when it is used for a task other than signal reconstruction? Are there interesting applications of this method in computer vision? We answer these questions from a multidisciplinary point of view, borrowing tools from statistical learning, convex and stochastic optimization, signal and image processing, computer vision, and optimization on graphs.
  • Graph matching and its application in computer vision and bioinformatics.

    Mikhail ZASLAVSKIY, Francis BACH, Jean philippe VERT
    2010
    The graph matching problem, which plays a central role in various areas of pattern recognition, is one of the biggest challenges in graph processing. We propose an approximate method for the matching of labeled and weighted graphs, based on convex-concave programming. An important application of graph matching is the alignment of protein-protein interaction networks, which plays a central role in the search for signaling pathways conserved through evolution, for protein complexes conserved between species, and for the identification of functional orthologs. We reformulate the interaction network alignment problem as a graph matching problem, and study how existing graph matching algorithms can be used to solve it. In the classical formulation of graph matching, only bijective correspondences between the nodes of two graphs are considered. In many applications, however, it is more interesting to consider correspondences between sets of nodes. We propose a new formulation of this problem as a discrete optimization problem, as well as an approximate algorithm based on a continuous relaxation. We also present two independent results in statistical machine translation and bioinformatics: first, we show how the sentence-based statistical translation problem can be reformulated as a traveling salesman problem; second, we propose a new measure of similarity between protein binding sites, based on the 3D comparison of atom clouds.
Affiliations are detected from the signatures of publications identified in scanR. An author can therefore appear to be affiliated with several structures or supervisors according to these signatures. The dates displayed correspond only to the dates of the publications found. For more information, see https://scanr.enseignementsup-recherche.gouv.fr