Schedule for Spring 2014 - Kyoto University Informatics Seminar

Schedule for Spring 2014

Official course name: Perspectives in Informatics 4, Graduate School of Informatics

Unless noted otherwise, all talks 16:30 -18:00, Joho 2, Research Bldg. No.7 (総合7), Main Campus

    • Mikkel Thorup (University of Copenhagen), Bottom-k and Priority Sampling, Set Similarity and Subset Sums with Minimal Independence, hosted by K. Iwama
    • Gabriel Peyré (Université Paris Dauphine), Optimal Transport in Imaging Sciences, hosted by M. Cuturi
    • Sushmita Gupta (Kyoto University), Paging and server problems revisited : New approaches in online algorithms, hosted by K. Iwama
    • Golden Week holiday
    • David Bremner (University of New Brunswick), Computing symmetry groups of convex polyhedra, hosted by D. Avis
    • Christian Nitschke (Kyoto University), Corneal Imaging: An Introduction to Corneal Reflection Analysis and Applications, hosted by A. Jatowt
    • Cathal Gurrin (Dublin City University), Gathering and Organising Lifelogs, a new Information Retrieval Challenge, hosted by A. Jatowt
    • Gena Hahn (Université de Montréal), Cops and Robbers Games on Graphs, hosted by D. Avis
    • Frank Nielsen (Sony CS Labs, Ecole Polytechnique), Information-theoretic clustering with applications, hosted by M. Cuturi
    • Mahito Sugiyama (Osaka University), Distance-based outlier detection via sampling, hosted by A. Yamamoto
  • June 19 Irregular schedule, Thursday, Talk starts at 14:45
    • Koji Tsuda (University of Tokyo), Controlling False Discoveries in Data Mining, hosted by M. Cuturi
    • Kevin Duh (NAIST), Learning Word Representations in Context, hosted by A. Jatowt
    • Yi-Zhe Song (Queen Mary, University of London), Intra-Category Sketch-Based Image Retrieval by Matching Deformable Part Models, hosted by X.F. Liang
    • Dongwon Lee (Pennsylvania State University, USA), Inferring Personal and Group Traits from Social Media, hosted by A. Jatowt

Abstracts and Detailed Information

April 14

16:30 -18:00, Joho 2, Research Bldg. No.7 (総合7), Main Campus
Mikkel Thorup, University of Copenhagen
Title: Bottom-k and Priority Sampling, Set Similarity and Subset Sums with Minimal Independence
Abstract: We consider bottom-k sampling for a set X, picking a sample S_k(X) consisting of the k elements that are smallest according to a given hash function h. With this sample we can estimate the relative size f=|Y|/|X| of any subset Y as |S_k(X) intersect Y|/k. A standard application is the estimation of the Jaccard similarity f=|A intersect B|/|A union B| between sets A and B. Given the bottom-k samples from A and B, we construct the bottom-k sample of their union as S_k(A union B)=S_k(S_k(A) union S_k(B)), and then the similarity is estimated as |S_k(A union B) intersect S_k(A) intersect S_k(B)|/k.

We show here that even if the hash function is only 2-independent, the expected relative error is O(1/sqrt(fk)). For fk=Omega(1) this is within a constant factor of the expected relative error with truly random hashing.

For comparison, consider the classic approach of kxmin-wise where we use k hash independent functions h_1,…,h_k, storing the smallest element with each hash function. For kxmin-wise there is an at least constant bias with constant independence, and it is not reduced with larger k. Recently Feigenblat et al.~showed that bottom-k circumvents the bias if the hash function is 8-independent and k is sufficiently large. We get down to 2-independence for any k. Our result is based on a simply union bound, transferring generic concentration bounds for the hashing scheme to the bottom-k sample, e.g., getting stronger probability error bounds with higher independence.

For weighted sets, we consider priority sampling which adapts efficiently to the concrete input weights, e.g., benefiting strongly from heavy-tailed input. This time, the analysis is much more involved, but again we show that generic concentration bounds can be applied.


April 21

16:30 -18:00, Joho 2, Research Bldg. No.7 (総合7), Main Campus
Gabriel Peyré, Université Paris Dauphine
Title: Optimal Transport in Imaging Sciences
Abstract: In this talk I will present an overview of numerical methods for the computation of optimal transport, and their applications to statistical modeling of signals and images. The optimal transport of measures provides an elegant framework for comparing probability distributions taking into account the spatial location of the modes of the distribution [7] , which is not the case of more traditional metrics such as the Kullback-Leibler divergence. The transport disport as well as the transport itself are valuable tools for solving various imaging problems where one wishes to model the geometry of colors and objects using histogram in high dimension. Applications of optimal transport include: correction of artifacts in films [2], data interpolation in computer graphics [1], medical image registration [3], object recognition [5] and texture synthesis [7]. For more information on optimal transport and its associated numerical schemes, you can have look at [4] and the website (“optimal transportation” section).

[1] N. Bonneel, M. van de Panne, S. Paris, and W. Heidrich, Displacement interpolation using lagrangian mass transport. ACM Transactions on Graphics, 30(6), 2011.
[2] J. Delon, Movie and video scale-time equalization application to flicker reduction. IEEE Transactions on Image Processing, 15(1):241-248, 2006
[3] S. Haker, L. Zhu, A. Tannenbaum, and S. Angenent, Optimal mass transport for registration and warping. International Journal of Computer Vision, 60(3):225-240, December 2004.
[4] N. Papadakis, G. Peyré, E. Oudet, Optimal Transport with Proximal Splitting, SIAM Journal on Imaging Sciences, vol. 7(1), pp. 212-238, 2014.
[5] Y. Rubner, C. Tomasi, and L.J. Guibas, A metric for distributions with applications to image databases. Proc. ICCV'98, pages 59-66, 1998.
[6] C. Villani. Topics in Optimal Transportation. Graduate Studies in Mathematics Series. American Mathematical Society, 2003.
[7] G-S. Xia, S. Ferradans, G. Peyré, J-F. Aujol, Synthesizing and Mixing Stationary Gaussian Texture Models, to appear in SIAM Journal on Imaging Sciences, 2013, 2013.

April 28

16:30 -18:00, Joho 2, Research Bldg. No.7 (総合7), Main Campus
Sushmita Gupta, Kyoto University
Title: Paging and server problems revisited : New approaches in online algorithms
Abstract: For many decades now, competitive analysis has led the way for research in the area of online algorithms. Indeed, the burst of interest in online algorithms can be traced back to the formal introduction and application of competitive analysis (CA) to study the paging and list-accessing problems. Notwithstanding the simplicity and ease of application of competitive analysis, that allowed for the subject of online computation to grow quickly in the early years, its inability to produce results that match empirical data necessitated researchers to look for alternatives. Various models, measuring tools, analytical methods etc. have been proposed in order to meet the challenge of finding a robust theoretical framework against which we can measure and compare online algorithms.

In this ​presentation, we​ will discuss some of these approaches and their​ underlying concepts ​, as applied to the paging and k-server problems, as means to explain the behavior ​and performance ​of online ​algorithms.


May 12

16:30 -18:00, Joho 2, Research Bldg. No.7 (総合7), Main Campus
David Bremner, University of New Brunswick
Title: Computing symmetry groups of convex polyhedra
Abstract: Knowing the symmetries of a polyhedron can be very useful for the analysis of its structure as well as for practical polyhedral computations. In this talk, I discuss how to compute symmetry groups preserving the linear, projective and combinatorial structure of a polyhedron. In pragmatic terms, the linear symmetry group is the most tractable, as its computation can be directly translated into a graph automorphism problem, which is solvable for moderately large instances. Its computation is also a key step in the computation of the projectective and combinatorial symmetry groups.

I will start the talk with brief overview of the necessary theory of convex polyhedra (feasible regions of linear programs) and symmetry groups. Time permitting, I will also discuss how to compute integral subgroups of the linear symmetry group that are used for instance in integer linear programming.

This talk presents joint work with Mathieu Dutour Sikiri\'c, Dmitrii V. Pasechnik, Thomas Rehn and Achill Sch\”urmann.


May 19

16:30 -18:00, Joho 2, Research Bldg. No.7 (総合7), Main Campus
Christian Nitschke, Kyoto University
Title: Corneal Imaging: An Introduction to Corneal Reflection Analysis and Applications
Abstract: Taking a look at a persons eyes, we observe that the cornea of the human eye acts as a mirror that reflects light from the person's environment. These corneal reflections can be extracted from an image by modeling the eye-camera geometry as a catadioptric (mirror+lens) imaging system. As a result, one obtains the visual information of the environment and the relation to the observer (view, gaze), which allows for interesting applications and novel solutions to problems in a number of fields, such as human-machine interfaces/sensors, forensics, psychology and biometrics. Beside its potential, corneal imaging demands for specialized algorithms to handle and exploit the complex geometric and photometric properties of the human eye, leading to various challenges in image capturing, processing and modeling.

This talk will provide a comprehensive overview about the topic. In the first part, we will explain the geometric modeling of corneal reflections, including eye imaging and 3D pose estimation, and the projection of light from the scene. In the second part, we will go through some of our groups works in the field, comprising 3D scene reconstruction, forensics and human-computer-interfaces.


May 26

16:30 -18:00, Joho 2, Research Bldg. No.7 (総合7), Main Campus
Cathal Gurrin, Dublin City University
Title: Gathering and Organising Lifelogs, a new Information Retrieval Challenge
Abstract: Lifelogging is a form of pervasive computing which represents a phenomenon whereby people can digitally record their own daily lives in varying amounts of detail, for a variety of purposes. In a sense it represents the ultimate “black box” of a human’s life activities and it offers huge potential for mining or inferring knowledge about how we live our lives. We are currently observing a convergence of technologies which are creating the conditions for the emergence of lifelogging as a mainstream activity; computer storage has become cheap, there are readily available sensing technologies for both the person as well as location and environment and there is growing interest in the phenomenon of sensing and recording oneself, the quantified-self movement.

In this talk I will introduce and motivate the concept of maintaining lifelogs, which are set to revolutionize our personal lives, healthcare, learning and productivity. I will discuss my own motivation for becoming a lifelogger, as well as what I have learned from this process. With the viewpoint of an Information Retrieval scientist, I will motivate and describe the technical challenges to be addressed and introduce the research to address these challenges, research that points to the potential advances when cognitive science meets computer science. Finally, I will introduce the technologies under development at various locations that attempt to efficiently gather flexible and extensible lifelogs. My experiences motivate this talk, as well as a passion for researching the technical challenges of Lifelogging.

June 2

16:30 -18:00, Joho 2, Research Bldg. No.7 (総合7), Main Campus
Gena Hahn, Université de Montréal
Title: Cops, robbers, in finite graphs and related problems
Abstract: We briefly survey the game of cops-and-robbers on graphs and its variants in the fi nite case and then concentrate on infinite graphs, stressing the diff erence between the finite and the infinite. Along the way we show (time allowing) how to construct infinite vertex transitive graphs from any graphs and point out some strange properties of the construction. We also suggest several open problems, both finite and infinite. The talk is based on work with A. Bonato, C.Tardif and R.E. Woodrow.


June 9

16:30 -18:00, Joho 2, Research Bldg. No.7 (総合7), Main Campus
Frank Nielsen, Sony CS Labs, Ecole Polytechnique
Title: Information-theoretic clustering with applications
Abstract: Clustering is a fundamental and key primitive to discover structural groups of homogeneous data in data sets, called clusters. The most famous clustering technique is the celebrated k-means clustering that seeks to minimize the sum of intra-cluster variances. k-Means is NP-hard as soon as the dimension and the number of clusters are both greater than 1. In the first part of the talk, we first present a generic dynamic programming method to compute the optimal clustering of n scalar elements into k pairwise disjoint intervals. This case includes 1D Euclidean k-means but also other kinds of clustering algorithms like the k-medoids, the k-medians, the k-centers, etc.

We extend the method to incorporate cluster size constraints and show how to choose the appropriate number of clusters using model selection. We then illustrate and refine the method on two case studies: 1D Bregman clustering and univariate statistical mixture learning maximizing the complete likelihood. In the second part of the talk, we introduce a generalization of k-means to cluster sets of histograms that has become an important ingredient of modern information processing due to the success of the bag-of-word modelling paradigm.

Clustering histograms can be performed using the celebrated k-means centroid-based algorithm. We consider the Jeffreys divergence that symmetrizes the Kullback-Leibler divergence, and investigate the computation of Jeffreys centroids. We prove that the Jeffreys centroid can be expressed analytically using the Lambert W function for positive histograms. We then show how to obtain a fast guaranteed approximation when dealing with frequency histograms and conclude with some remarks on the k-means histogram clustering.

References: - Optimal interval clustering: Application to Bregman clustering and statistical mixture learning IEEE ISIT 2014 (recent result poster)

- Jeffreys Centroids: A Closed-Form Expression for Positive Histograms and a Guaranteed Tight Approximation for Frequency Histograms.

IEEE Signal Process. Lett. 20(7): 657-660 (2013)


June 16

16:30 -18:00, Joho 2, Research Bldg. No.7 (総合7), Main Campus
Mahito Sugiyama, Osaka University
Title: Distance-based outlier detection via sampling
Abstract: Distance-based approaches to outlier detection are popular in data mining, as they do not require to model the underlying probability distribution, which is particularly challenging for high-dimensional data. In this talk I will present an empirical comparison of various approaches to distance-based outlier detection across a large number of datasets. I will show the surprising observation that a simple, sampling-based scheme outperforms state-of-the-art techniques in terms of both efficiency and effectiveness. To better understand this phenomenon, I will provide a theoretical analysis why the sampling-based approach outperforms alternative methods based on k-nearest neighbor search.

This talk is based on the following paper: Sugiyama, M., Borgwardt, K. M.: Rapid Distance-Based Outlier Detection via Sampling, Advances in Neural Information Processing Systems 26 (NIPS 2013), 467-475, 2013

June 19

14:45 -16:15 Irregular Schedule, Joho 2, Research Bldg. No.7 (総合7), Main Campus
Koji Tsuda, University of Tokyo
Title: Controlling False Discoveries in Data Mining
Abstract: Recently the Financial Times published an article entitled “Big data: are we making a big mistake?”. It points out that data mining methods produce too many results and it is hard to distinguish real discoveries from false ones. Actually, analyzers tend to hand-pick convincing results and present them to customers/reviewers/bosses, hiding the existence of unwanted results. This is a very bad practice, because it always results in rediscovery of popular knowledge. New and real discoveries are overlooked because they do not look familiar. I will present a new technique called LAMP (Limitless Arity Multiple testing Procedure) to evaluate the reliability of data mining results in terms of statistical significance. Most results are biological but underlying principles are relevant to all kinds of data analysis.

June 30

16:30 -18:00, Joho 2, Research Bldg. No.7 (総合7), Main Campus
Kevin Duh, NAIST
Title: Learning Word Representations in Context
Abstract: Unsupervised learning of word representations provides a simple and effective way to improve the robustness of Natural Language Processing (NLP) systems. Exploiting large amounts of raw text, word representations based on based on clustering and neural language model algorithms can be used as more generalizable features and plugged in to existing NLP systems. Positive results have been demonstrated in dependency parsing (Koo2008), named entity recognition (Turian2010), part-of-speech tagging (Huang2013), and sentiment analysis (Socher2013), among others.

Most word representation learning algorithms are static. I.e., there is only one representation per word type, and this same representation is employed in all occurrences of the word token. However, intuitively we know that word meaning and usage ought to change depending on the context. For example, the idea of co-compositionality in Generative Lexicon Theory states that a verb may change its meaning depending on the object it takes, and vice versa. I will present a neural language model that learns dynamic word representations using inspirations from this idea. I will also survey other recent works and present preliminary results on the application of such dynamic word representations to tagging and parsing tasks.


July 7

16:30 -18:00, Joho 2, Research Bldg. No.7 (総合7), Main Campus
Yi-Zhe Song, Queen Mary, University of London
Title: Intra-Category Sketch-Based Image Retrieval by Matching Deformable Part Models
Abstract: An important characteristic of sketches, compared with text, rests with their ability to intrinsically capture object appearance and structure. Nonetheless, akin to traditional text-based image retrieval, conventional sketch-based image retrieval (SBIR) principally focuses on retrieving photos of the same category, neglecting the fine-grained characteristics of sketches. In this paper, we advocate the expressiveness of sketches and examine their efficacy under a novel intra-category SBIR framework. In particular, we study how sketches enable pose-specific retrieval within object categories. Key to this problem is introducing a mid-level sketch representation that not only captures object pose, but also possess the ability to traverse sketch and photo domains. Specifically, we learn deformable part-based model (DPM) as a mid-level representation to discover and encode the various poses and parts in sketch and image domains independently, after which graph matching is performed on DPMs to establish pose correspondences across the two domains. We further propose an SBIR dataset that covers the unique aspects of fine-grained SBIR. Through in-depth experiments, we demonstrate the superior performance of our SBIR framework, and showcase its unique ability in pose-specific retrieval.


July 14

16:30 -18:00, Joho 2, Research Bldg. No.7 (総合7), Main Campus
Dongwon Lee, Pennsylvania State University
Title: Inferring Personal and Group Traits from Social Media
Abstract: Social Network Services (SNSs) have become an integral part of people’s lives in recent years, facilitating the social relations among people who share similar activities or interests in online and/or offline lives. With an explosively growing number of users, SNSs produce diverse forms of user-generated contents (e.g., tweets, wall posts, uploaded photos, comments) as well as network features (e.g., follower or friend list). One of interesting questions (with practical implications) is to accurately infer the latent traits of individuals and groups in SNSs using diverse forms of social media therein. In this talk, I will present a few recent such attempts developed in the SUM project at Penn State.

Dongwon Lee is currently an associate professor of the Pennsylvania State University, College of Information Sciences and Technology, USA. He obtained his Ph.D. in Computer Science from UCLA in 2002. From 1995 to 1997, he has also worked at AT&T Bell Labs. Working mostly on the issues arising in the management and mining of data, he has (co-)authored over 110+ scholarly articles in selective publication outlets in Databases and Data Mining fields. Further details can be found at the webpages of his research group