The course will consist in three lectures, with contributions from Steven Diamond (Stanford): (i) Convex Optimization Overview; (ii) Constructive Convex Analysis and Disciplined Convex Programming; (iii) Convex Optimization Applications.
Convex optimization has emerged as useful tool for applications that include data analysis and model fitting, resource allocation, engineering design, network design and optimization, finance, and control and signal processing. The first of three lectures in this short course is an overview of optimization in general, focussing on convex optimization in particular. The second lecture covers constructive convex analysis and disciplined convex programming, the basis for several software packages for convex optimization, including CVX and CVXPY. The third lecture addresses several applications in finance, data fitting, and advertising.
Topics in Selective Inference
The big data era has created a new scientific paradigm: collect data first, ask questions later. When the universe of scientific hypotheses that are being examined simultaneously is not taken account, inferences are likely to be false. The consequence is that follow up studies are likely not to be able to reproduce earlier reported findings or discoveries. This reproducibility failure bears a substantial cost and this talk will review recent progress in the field of statistics to address this critical issue. Consider two vignettes: first, imagine that we observe a response variable together with a large number of potential explanatory variables, and would like to be able to discover which variables are truly associated with the response. At the same time, we need to know that the false discovery rate (FDR)---the expected fraction of false discoveries among all discoveries---is not too high, in order to assure the scientist that most of the discoveries are indeed true and replicable. Here, we shall discuss the knockoff filter, a new variable selection procedure, which rigorously controls the FDR in this setting. Second, it is well known that most inferential methods fail if they are preceded by data snooping. Yet, in the big data era, data snooping is the rule more often than not. We introduce several post-selection inference frameworks which provide formally valid inference after looking at the data. Finally, we shall explore connections between inferential issues and estimation issues and explain in what sense the FDR is a powerful concept for building purely predictive models as in machine learning.
Machine Learning for Computer Vision
The lecture will focus on computationally-efficient learning algorithms for large-scale visual recognition of images and videos. We will describe state-of-the-art efficient approaches to scale with respect to the three fundamental dimensions of learning: the number of examples, the number of features (effective dimension of feature representation), and the number of categories. We will present the approaches under a common framework, by highlighting how they can understood as decomposition schemes with respect to each of these dimensions. We will start with stochastic gradient and dual coordinate descent approaches which allow scaling linearly with respect to the number of examples. We will then present recent approaches for learning feature representations, reviewing both supervised and unsupervised ones, with an emphasis on deep-conv-net-type feature representations. Finally, we will present one-vs-rest, multi-class, label-embedding, approaches for multi-class/multi-label classification.
Many problems in machine learning that involve discrete structures or subset selection may be phrased in the language of submodular set functions. The property of submodularity, also referred to as a 'discrete analog of convexity', expresses the notion of diminishing marginal returns, and captures combinatorial versions of rank and dependence. Submodular functions occur in a variety of areas including graph theory, information theory, combinatorial optimization, stochastic processes and game theory. In machine learning, they emerge in different forms as the potential functions of graphical models, as the utility functions in active learning and sensing, in models of diversity, in structured sparse estimation or network inference. The lectures will give an introduction to the theory of submodular functions, their applications in machine learning and related optimization problems. Part I of the lectures will introduce the concept of submodularity along with several examples, as well as associated polyhedra and relations to convexity. Part II will address the ideas underlying algorithms for minimizing and maximizing submodular functions. For example, those algorithms exploit ties to both convexity and concavity.
In the analysis of machine learning algorithms one often faces complicated functions of many independent random variables. In such situations concentration inequalities, that quantify the size of typical deviations of such functions from their expected value, offer an elegant and versatile tool. The theory of concentration inequalities has seen spectacular advance in the last few decades. These inequalities proved to be useful not only in machine learning but also in a wide variety of areas, including combinatorics, graph theory, analysis of algorithms. information theory, geometry, just to name a few. This course offers an introduction to the theory and a summary of some of the most useful results with a sample of illustrations of their use in learning theory.
Probabilistic Programming Concepts
A multitude of different probabilistic programming languages exists today, all extending a traditional programming language with primitives to support modeling of complex, structured probability distributions, advanced inference and learning methods. At the same time, they extend probabilistic graphical models with primitives of programming languages to increase the expressive power of graphical models. I shall provide an overview of the underlying concepts and semantics of these languages as well as sketch their current inference and learning mechanisms. I shall also outline some emerging applications of these languages. During the talk I shall focus on probabilistic extensions of logic programming languages such as Prolog and Datalog, which have been developed since more than 20 years. One advantage of these languages is that they naturally fit both the statistical relational learning / artificial intelligence perspective as they define probability distributions over possible worlds as well as the probabilistic programming language perspective. This talk will be largely based on joint tutorials with Angelika Kimmig and the forthcoming survey paper : Luc De Raedt, Angelika Kimmig, Probabilistic (Logic) Programming Concepts, Machine Learning Journal.
Statistical and Computational Aspects of High-Dimensional Learning
In these lecture, we will study the question of optimality in high-dimensional learning problems. A tentative list of topics covered includes: (i) Minimax optimality in high-dimensional learning; (ii) Convex relaxations and semi-definite programming; (iii) The planted clique problem and average case complexity; (iv) Computational lower bounds for high-dimensional learning.
MIT / Genoa
Machine learning systems have achieved remarkable results, but are still limited by the amount of human supervision needed to perform well. Learning data representation from unsupervised (or weakly supervised) data is widely acknowledged to be key to tackle this issue. Indeed, most supervised learning methods perform well as soon as data are represented by the ``right’’ features. Learning data representation is a young, rapidly evolving subfield of machine learning, in this introduction to the topic we provide a unifying perspective starting from classical signal processing tecniques to then cover more recent data driven approaches, including some discussion of hierarchical (deep) learning methods.
Fast, Cheap and Deep - Scaling Machine Learning to the Cloud
Large scale machine learning requires tools from systems research, statistical modeling, optimization, and a good degree of domain knowledge to succeed. In this tutorial I will explain how algorithms can be designed to scale to hundreds of machines, how to achieve fault tolerance, how to do so cheaply in the cloud, and how to solve problems (and obtain record-breaking scalability) arising in log-linear models, collaborative filtering, sampling inference, and deep learning.
Stochastic optimization is a class of powerful optimization techniques to solve optimization in large scale learning problems. The basic idea of using stochastic optimization in machine learning is to divide the original optimization problem into small sub-problems, and solve a randomly chosen sub-problem at each iteration so that it achieves the global optimization on the original problem. In machine learning settings, usually the samples or the coordinates are divided into small parts. An advantage of stochastic optimization is that, although the computational cost per iteration is small, the overall computational complexity could be improved compared with the usual batch optimization. In this course, I will present methodology, theory and recent advances of stochastic optimization. More specifically, the topics will include (1) basics of convex analysis, (2) online stochastic optimization, (3) "batch" stochastic optimization on fixed size data, and (4) parallel computing techniques for stochastic optimization.
Reinforcement learning (RL) is a learning paradigm concerned with learning a behavior that maximizes a numerical performance metric. Two main differences to other forms of learning are that in RL the learner receives only partial feedback about how well it does and that the learner's actions may have long term effects, representing significant hurdles in the design of efficient and effective learning algorithms. Nevertheless, RL is of great interest thanks to its wide applicability from artificial intelligence, through operations research to control engineering, and as such RL is currently the subject of intense research. The lecture will focus on foundational aspects of RL. We will cover Markovian decision processes, core learning and planning methods, as well as algorithms developed to address the exploration-exploitation dilemma, both for the simplest bandit setting and also the full RL setting. The emphasis will be on developing a solid understanding of core issues, while some time will also be devoted to discussing state-of-the-art methods, such as the recent approaches based on deep neural networks.
In many areas, we face high dimensional data that are not just many features or variables but many aspects, e.g., time, space, frequency, users, movies, actions, etc. Tensor is a natural representation for capturing these different modalities and combine multiple sources of information in a unified way. Although tensor decomposition has traditionally been studied in the numerical linear algebra community, many important advances, e.g., convex-optimization-based algorithms and strong theoretical guarantees have been recently made in the ML community. In this lecture, I will give an overview of tensor models and optimization algorithms both old and new, emphasizing practical as well as theoretical advantages of them. I will also discuss connections to compressed sensing and matrix recovery problems.
Large Scale Deep Learning
The field of deep learning builds on 30 years of neural networks, machine learning, optimization and computer architecture research, spanning several generations of computing systems and several orders of magnitudes in both compute capabilities and data sizes. This lecture will discuss what techniques have stood the test of time and perform well on today's hardware, system architectures, datasets and problems. We'll cover training methodology, model architectures, optimization techniques and distributed learning, and wander into some of the most exciting research themes in the field today.
Structure and approximation in high-dimensional statistical optimization
Machine learning leads to large-scale optimization problems that, even when convex, can be challenging to solve exactly. On the the positive side, exact solutions are typically not required: it is usually adequate to optimize to the "statistical precision". In this tutorial, we discuss the role of structure in high-dimensional statistical optimization problems, as well as various approximate schemes that take advantage of this structure. We provide into insight into various questions, including why algorithms often behave better in high dimensions that might be expected, when rigorous guarantees for nonconvex problems are possible, and the use of randomization and dimensionality reduction for fast optimization.