# Trends in Machine Learning

A Workshop at Kyoto University - 17 & 18 March 2014

schedule

### Schedule

             Mon. 17                         Tue. 18 9:40 - 10:50 Ichiro Takeuchi Ryota Tomioka 11:10 - 12:20 Le Song Amir Globerson 12:20 - 14:00 Lunch Break 14:00 - 15:10 Samy Bengio Mehryar Mohri 15:30 - 16:40 Yoshinobu Kawahara Hiroshi Mamitsuka

#### Samy Bengio

Embeddings for images: from large scale image labeling to zero-shot classification PDF

Image annotation is the task of providing textual semantic to new images, by ranking a large set of possible annotations according to how they correspond to a given image. In the large scale setting, there could be millions of images to process and hundreds of thousands of potential distinct annotations. In order to achieve such a task we propose to build a so-called “embedding space”, into which both images and annotations can be automatically projected. In such a space, one can then find the nearest annotations to a given image, or annotations similar to a given annotation. More interestingly, the embedding space can be built such that it even contains labels for which no images were known at training time, which enables the so-called zero-shot setting, where an image of an unknown concept at training time can still be labeled appropriately at test time.

#### Amir Globerson

Learning with Algebraic Manifolds

High dimensional objects such as images and and speech are commonly assumed to lie on a low dimensional manifold. This has motivated many methods for estimating these manifolds and using them for learning. Here we propose a novel approach to the problem that relies on characterizing data via algebraic manifolds. In the talk I will first present our “Vanishing Component Analysis” algorithm for characterizing a manifold via its set of generators. The algorithm is simple and efficient and in practice describes complex data via a small set of polynomials. I will then discuss applications of the approach to semisupervised learning, and show that algebraic manifolds can result in dramatic gains from using unlabeled data.

Amir GLOBERSON (Hebrew University of Jerusalem)

#### Yoshinobu Kawahara

Parametric submodular optimization for machine learning PDF

Submodularity is known to be an analogous concept for a set function of convexity in a continuous function. Recently, it has been used as the mathematical basis for developments of fast algorithms or analyses of discrete problems in machine learning. In this talk, I address parametric submodular optimization techniques in machine learning, which often appears as separable convex optimization. The applications include structured sparse regularization, (minimum average cost) clustering, densest subgraph problem and so on. I address some common properties (combinatorial structures) and efficient algorithms for these problems from the perspective of parametric submodular optimization.

Yoshinobu KAWAHARA (Osaka University)

#### Hiroshi Mamitsuka

Collaborative Matrix Factorization for Predicting Drug-Target Interactions PDF

Computationally predicting drug-target interactions is useful to discover potential new drugs (or targets). Currently, powerful machine learning approaches for this issue use not only known drug-target interactions but also drug and target similarities. Using similarities is well-accepted pharmacologically, since the two types of similarities correspond to two recently advocated concepts, so-called, the chemical space and the genomic space. In this talk, I will first briefly review the literature of similarity-based machine learning methods for this issue and then present our recent method, which can overcome problems in existing methods. Our method is based on a factor model, named Multiple Similarities Collaborative Matrix Factorization (MSCMF), which has the following two key ideas: 1) MSCMF allows to incorporate more than one similarity matrices over drugs as well as those over targets, where weights over the multiple similarity matrices are estimated from data to automatically select the similarities, which are effective for performance improvement. 2) MSCMF projects drugs and targets into a common low-rank feature space (matrix), which is estimated to be consistent with weighted similarity matrices over drugs and those over targets by an alternating least squares algorithm. I will finally explain experimental results, obtained by using both synthetic and real datasets, by which the performance of MSCMF was extensively evaluated. We note that our approach is general and applicable to any binary relations with similarities over elements, which can be found in many applications, such as recommender systems. In fact, MSCMF can be regarded as an extension/generalization of weighted low-rank approximation for one-class collaborative filtering.

References: 1. Ding, H., Takigawa, I., Mamitsuka, H. and Zhu, S., Similarity-based Machine Learning Methods for Predicting Drug-target Interactions: A Brief Review. To appear in Briefings in Bioinformatics.

2. Zheng, X., Ding, H., Mamitsuka, H. and Zhu, S., Collaborative Matrix Factorization with Multiple Similarities for Predicting Drug-Target Interactions. Proceedings of the Nineteenth ACM SIGKDD International Conference on on Knowledge Discovery and Data Mining (KDD 2013)., pp. 1025-1033, Chicago, IL, USA, Aug. 2013, ACM Press.

Hiroshi MAMITSUKA (Kyoto University)

#### Mehryar Mohri

Deep Boosting PDF

This talk discusses a new ensemble learning algorithm, DeepBoost, which can use as a base classifier set deep decision trees, or other rich families. Extensive experiments show that DeepBoost consistently outperforms AdaBoost and its L1-regularized variant. The key to the success of the algorithm is a capacity-conscious selection criterion for the hypotheses forming the ensemble, which is grounded in a new theoretical foundation with several significant implications.

Joint work with Corinna Cortes (Google Research) an Umar Syed (Google Research).

Mehryar MOHRI (New York University)

#### Le Song

Timing It Just Right: Learning and Optimization of High Dimensional Event Cascades PDF

Dynamical processes, such as information diffusion in social networks, video characterization of human interactions, gene regulation in biological systems and functional collaborations between brain regions, generate a large volume of high dimensional “asynchronous” and “interdependent” time-stamped event data. This type of timing information is rather different from traditional iid. data and discrete-time temporal data, which calls for new models and scalable algorithms for learning, analyzing and utilizing them. In this talk, I will present a new framework based on multivariate point processes, high dimensional sparse recovery, and randomized algorithms for addressing a sequence of problems arising from this context. As a concrete example, I will also present experimental results on learning and optimizing information cascades in web logs, including estimating hidden diffusion networks and influence maximization with the learned networks. With both careful model and algorithm design, the framework is able to handle millions of events and millions of networked entities.

Le SONG (Georgia Tech)

#### Ichiro Takeuchi

Safe Sample Screening for SVM and Its Extension to Validation Error Bound Computation PDF

The SVM is efficient in test-phases because the classifier is characterized only by a subset of the samples called SVs. However, the advantage of the sparsity has not been fully exploited in training phases because it is generally difficult to know which sample turns out to be SV beforehand. Motivated by a recent work on safe feature screening by El Ghaoui et al., we introduce a new approach called safe sample screening that enables us to identify a subset of the non-SVs and screen them out prior to the training phase. We investigate the advantage of the safe sample screening approach through intensive numerical experiments, and demonstrate that it can substantially decrease the computational cost of the state-of-the-art SVM solvers such as LIBSVM. Furthermore, we show that similar technique can be used for computing validation error bound without actually solving the training optimization problem.

Ichiro TAKEUCHI (Nagoya Institute of Technology)

#### Ryota Tomioka

Towards better computation-statistics trade-off in tensor decomposition PDF

New approaches for tensor decomposition with worst case performance guarantee have recently emerged. $O(rn^{d-1})$ is the number of samples required for recovering an $n \times n \times … \times n$ $d$-way tensor that admits a Tucker decomposition with $r \times r \times \dots \times r$ core proved for one of these methods called overlapped trace norm. Although this method is computationally very efficient, the rate is rather disappointing. A newer approach called square norm achieves lower $O(r^{d/2}n^{d/2})$ samples at the cost of higher computational demand. There is also a more theoretical approach that could achieve optimal $O(rnd)$ but seem to be computationally intractable. I will overview these recent developments and discuss how we could achieve a better trade-off between what we can provably achieve in terms of statistical accuracy and how much computation we need to spend.

Ryota TOMIOKA (Toyota Tech. Institute Chicago)

schedule.txt · Last modified: 2014/03/23 01:28 by marco