Schedule for Spring 2014
Official course name: Perspectives in Informatics 4, Graduate School of Informatics
Unless noted otherwise, all talks 16:30 18:00, Joho 2, Research Bldg. No.7 (総合７), Main Campus

 Mikkel Thorup (University of Copenhagen), Bottomk and Priority Sampling, Set Similarity and Subset Sums with Minimal Independence, hosted by K. Iwama

 Gabriel Peyré (Université Paris Dauphine), Optimal Transport in Imaging Sciences, hosted by M. Cuturi

 Sushmita Gupta (Kyoto University), Paging and server problems revisited : New approaches in online algorithms, hosted by K. Iwama

 Golden Week holiday

 David Bremner (University of New Brunswick), Computing symmetry groups of convex polyhedra, hosted by D. Avis

 Christian Nitschke (Kyoto University), Corneal Imaging: An Introduction to Corneal Reflection Analysis and Applications, hosted by A. Jatowt

 Cathal Gurrin (Dublin City University), Gathering and Organising Lifelogs, a new Information Retrieval Challenge, hosted by A. Jatowt

 Gena Hahn (Université de Montréal), Cops and Robbers Games on Graphs, hosted by D. Avis

 Frank Nielsen (Sony CS Labs, Ecole Polytechnique), Informationtheoretic clustering with applications, hosted by M. Cuturi

 Mahito Sugiyama (Osaka University), Distancebased outlier detection via sampling, hosted by A. Yamamoto
 June 19 Irregular schedule, Thursday, Talk starts at 14:45
 Koji Tsuda (University of Tokyo), Controlling False Discoveries in Data Mining, hosted by M. Cuturi

 Kevin Duh (NAIST), Learning Word Representations in Context, hosted by A. Jatowt

 YiZhe Song (Queen Mary, University of London), IntraCategory SketchBased Image Retrieval by Matching Deformable Part Models, hosted by X.F. Liang

 Dongwon Lee (Pennsylvania State University, USA), Inferring Personal and Group Traits from Social Media, hosted by A. Jatowt
Abstracts and Detailed Information
April 14
16:30 18:00, Joho 2, Research Bldg. No.7 (総合７), Main Campus
Mikkel Thorup, University of Copenhagen
Title: Bottomk and Priority Sampling, Set Similarity and Subset Sums with Minimal Independence
Abstract: We consider bottomk sampling for a set X, picking a sample S_k(X) consisting of the k elements that are smallest according to a given hash function h. With this sample we can estimate the relative size f=Y/X of any subset Y as S_k(X) intersect Y/k. A standard application is the estimation of the Jaccard similarity f=A intersect B/A union B between sets A and B. Given the bottomk samples from A and B, we construct the bottomk sample of their union as S_k(A union B)=S_k(S_k(A) union S_k(B)), and then the similarity is estimated as S_k(A union B) intersect S_k(A) intersect S_k(B)/k.
We show here that even if the hash function is only 2independent, the expected relative error is O(1/sqrt(fk)). For fk=Omega(1) this is within a constant factor of the expected relative error with truly random hashing.
For comparison, consider the classic approach of kxminwise where we use k hash independent functions h_1,…,h_k, storing the smallest element with each hash function. For kxminwise there is an at least constant bias with constant independence, and it is not reduced with larger k. Recently Feigenblat et al.~showed that bottomk circumvents the bias if the hash function is 8independent and k is sufficiently large. We get down to 2independence for any k. Our result is based on a simply union bound, transferring generic concentration bounds for the hashing scheme to the bottomk sample, e.g., getting stronger probability error bounds with higher independence.
For weighted sets, we consider priority sampling which adapts efficiently to the concrete input weights, e.g., benefiting strongly from heavytailed input. This time, the analysis is much more involved, but again we show that generic concentration bounds can be applied.
April 21
16:30 18:00, Joho 2, Research Bldg. No.7 (総合７), Main Campus
Gabriel Peyré, Université Paris Dauphine
Title: Optimal Transport in Imaging Sciences
Abstract:
In this talk I will present an overview of numerical methods for the computation of optimal transport, and their applications to statistical modeling of signals and images. The optimal transport of measures provides an elegant framework for comparing probability distributions taking into account the spatial location of the modes of the distribution [7] , which is not the case of more traditional metrics such as the KullbackLeibler divergence. The transport disport as well as the transport itself are valuable tools for solving various imaging problems where one wishes to model the geometry of colors and objects using histogram in high dimension. Applications of optimal transport include: correction of artifacts in films [2], data interpolation in computer graphics [1], medical image registration [3], object recognition [5] and texture synthesis [7]. For more information on optimal transport and its associated numerical schemes, you can have look at [4] and the website www.numericaltours.com (“optimal transportation” section).
Bibliography:
[1] N. Bonneel, M. van de Panne, S. Paris, and W. Heidrich, Displacement interpolation using lagrangian mass transport. ACM Transactions on Graphics, 30(6), 2011.
[2] J. Delon, Movie and video scaletime equalization application to flicker reduction. IEEE Transactions on Image Processing, 15(1):241248, 2006
[3] S. Haker, L. Zhu, A. Tannenbaum, and S. Angenent, Optimal mass transport for registration and warping. International Journal of Computer Vision, 60(3):225240, December 2004.
[4] N. Papadakis, G. Peyré, E. Oudet, Optimal Transport with Proximal Splitting, SIAM Journal on Imaging Sciences, vol. 7(1), pp. 212238, 2014.
[5] Y. Rubner, C. Tomasi, and L.J. Guibas, A metric for distributions with applications to image databases. Proc. ICCV'98, pages 5966, 1998.
[6] C. Villani. Topics in Optimal Transportation. Graduate Studies in Mathematics Series. American Mathematical Society, 2003.
[7] GS. Xia, S. Ferradans, G. Peyré, JF. Aujol, Synthesizing and Mixing Stationary Gaussian Texture Models, to appear in SIAM Journal on Imaging Sciences, 2013, 2013.
April 28
16:30 18:00, Joho 2, Research Bldg. No.7 (総合７), Main Campus
Sushmita Gupta, Kyoto University
Title: Paging and server problems revisited : New approaches in online algorithms
Abstract: For many decades now, competitive analysis has led the way for research in the area of online algorithms. Indeed, the burst of interest in online algorithms can be traced back to the formal introduction and application of competitive analysis (CA) to study the paging and listaccessing problems. Notwithstanding the simplicity and ease of application of competitive analysis, that allowed for the subject of online computation to grow quickly in the early years, its inability to produce results that match empirical data necessitated researchers to look for alternatives.
Various models, measuring tools, analytical methods etc. have been proposed in order to meet the challenge of finding a robust theoretical framework against which we can measure and compare online algorithms.
In this presentation, we will discuss some of these approaches and their underlying concepts , as applied to the paging and kserver problems, as means to explain the behavior and performance of online algorithms.
May 12
16:30 18:00, Joho 2, Research Bldg. No.7 (総合７), Main Campus
David Bremner, University of New Brunswick
Title: Computing symmetry groups of convex polyhedra
Abstract: Knowing the symmetries of a polyhedron can be very useful for the analysis of its structure as well as for practical polyhedral computations. In this talk, I discuss how to compute symmetry groups preserving the linear, projective and combinatorial structure of a polyhedron. In pragmatic terms, the linear symmetry group is the most tractable, as its computation can be directly translated into a graph automorphism problem, which is solvable for moderately large instances. Its computation is also a key step in the computation of the projectective and combinatorial symmetry groups.
I will start the talk with brief overview of the necessary theory of convex polyhedra (feasible regions of linear programs) and symmetry groups. Time permitting, I will also discuss how to compute integral subgroups of the linear symmetry group that are used for instance in integer linear programming.
This talk presents joint work with Mathieu Dutour Sikiri\'c, Dmitrii V. Pasechnik, Thomas Rehn and Achill Sch\”urmann.
May 19
16:30 18:00, Joho 2, Research Bldg. No.7 (総合７), Main Campus
Christian Nitschke, Kyoto University
Title: Corneal Imaging: An Introduction to Corneal Reflection Analysis and Applications
Abstract: Taking a look at a persons eyes, we observe that the cornea of the human eye acts as a mirror that reflects light from the person's environment. These corneal reflections can be extracted from an image by modeling the eyecamera geometry as a catadioptric (mirror+lens) imaging system. As a result, one obtains the visual information of the environment and the relation to the observer (view, gaze), which allows for interesting applications and novel solutions to problems in a number of fields, such as humanmachine interfaces/sensors, forensics, psychology and biometrics. Beside its potential, corneal imaging demands for specialized algorithms to handle and exploit the complex geometric and photometric properties of the human eye, leading to various challenges in image capturing, processing and modeling.
This talk will provide a comprehensive overview about the topic. In the first part, we will explain the geometric modeling of corneal reflections, including eye imaging and 3D pose estimation, and the projection of light from the scene. In the second part, we will go through some of our groups works in the field, comprising 3D scene reconstruction, forensics and humancomputerinterfaces.
May 26
16:30 18:00, Joho 2, Research Bldg. No.7 (総合７), Main Campus
Cathal Gurrin, Dublin City University
Title: Gathering and Organising Lifelogs, a new Information Retrieval Challenge
Abstract: Lifelogging is a form of pervasive computing which represents a phenomenon whereby people can digitally record their own daily lives in varying amounts of detail, for a variety of purposes. In a sense it represents the ultimate “black box” of a human’s life activities and it offers huge potential for mining or inferring knowledge about how we live our lives. We are currently observing a convergence of technologies which are creating the conditions for the emergence of lifelogging as a mainstream activity; computer storage has become cheap, there are readily available sensing technologies for both the person as well as location and environment and there is growing interest in the phenomenon of sensing and recording oneself, the quantifiedself movement.
In this talk I will introduce and motivate the concept of maintaining lifelogs, which are set to revolutionize our personal lives, healthcare, learning and productivity. I will discuss my own motivation for becoming a lifelogger, as well as what I have learned from this process. With the viewpoint of an Information Retrieval scientist, I will motivate and describe the technical challenges to be addressed and introduce the research to address these challenges, research that points to the potential advances when cognitive science meets computer science. Finally, I will introduce the technologies under development at various locations that attempt to efficiently gather flexible and extensible lifelogs. My experiences motivate this talk, as well as a passion for researching the technical challenges of Lifelogging.
June 2
16:30 18:00, Joho 2, Research Bldg. No.7 (総合７), Main Campus
Gena Hahn, Université de Montréal
Title: Cops, robbers, infinite graphs and related problems
Abstract: We briefly survey the game of copsandrobbers on graphs and its variants in the finite case and then concentrate on infinite graphs, stressing the difference between the finite and the infinite. Along the way we show (time allowing) how to construct infinite vertex transitive graphs from any graphs and point out some strange properties of the construction. We also suggest several open problems, both finite and infinite. The talk is based on work with A. Bonato, C.Tardif and R.E. Woodrow.
June 9
16:30 18:00, Joho 2, Research Bldg. No.7 (総合７), Main Campus
Frank Nielsen, Sony CS Labs, Ecole Polytechnique
Title: Informationtheoretic clustering with applications
Abstract: Clustering is a fundamental and key primitive to discover structural groups of homogeneous
data in data sets, called clusters. The most famous clustering technique is the celebrated
kmeans clustering that seeks to minimize the sum of intracluster variances.
kMeans is NPhard as soon as the dimension and the number of clusters are both greater than 1.
In the first part of the talk, we first present a generic dynamic programming method to compute the optimal clustering of n scalar
elements into k pairwise disjoint intervals. This case includes 1D Euclidean kmeans but also other kinds of clustering algorithms like the kmedoids, the kmedians, the kcenters, etc.
We extend the method to incorporate cluster size constraints and show how to choose the appropriate number of clusters using model selection. We then illustrate and refine the method on two case studies: 1D Bregman clustering and univariate statistical mixture learning maximizing the complete likelihood. In the second part of the talk, we introduce a generalization of kmeans to cluster sets of histograms that has become an important ingredient of modern information processing due to the success of the bagofword modelling paradigm.
Clustering histograms can be performed using the celebrated kmeans centroidbased algorithm. We consider the Jeffreys divergence that symmetrizes the KullbackLeibler divergence, and investigate the computation of Jeffreys centroids. We prove that the Jeffreys centroid can be expressed analytically using the Lambert W function for positive histograms. We then show how to obtain a fast guaranteed approximation when dealing with frequency histograms and conclude with some remarks on the kmeans histogram clustering.
References:  Optimal interval clustering: Application to Bregman clustering and statistical mixture learning IEEE ISIT 2014 (recent result poster) http://arxiv.org/abs/1403.2485
 Jeffreys Centroids: A ClosedForm Expression for Positive Histograms and a Guaranteed Tight Approximation for Frequency Histograms.
IEEE Signal Process. Lett. 20(7): 657660 (2013) http://arxiv.org/abs/1303.7286
June 16
16:30 18:00, Joho 2, Research Bldg. No.7 (総合７), Main Campus
Mahito Sugiyama, Osaka University
Title: Distancebased outlier detection via sampling
Abstract: Distancebased approaches to outlier detection are popular in data mining, as they do not require to model the underlying probability distribution, which is particularly challenging for highdimensional data. In this talk I will present an empirical comparison of various approaches to distancebased outlier detection across a large number of datasets. I will show the surprising observation that a simple, samplingbased scheme outperforms stateoftheart techniques in terms of both efficiency and effectiveness. To better understand this phenomenon, I will provide a theoretical analysis why the samplingbased approach outperforms alternative methods based on knearest neighbor search.
This talk is based on the following paper: Sugiyama, M., Borgwardt, K. M.: Rapid DistanceBased Outlier Detection via Sampling, Advances in Neural Information Processing Systems 26 (NIPS 2013), 467475, 2013
June 19
14:45 16:15 Irregular Schedule, Joho 2, Research Bldg. No.7 (総合７), Main Campus
Koji Tsuda, University of Tokyo
Title: Controlling False Discoveries in Data Mining
Abstract: Recently the Financial Times published an article entitled “Big data: are
we making a big mistake?”. It points out that data mining methods
produce too many results and it is hard to distinguish real
discoveries from false ones. Actually, analyzers tend to handpick
convincing results and present them to customers/reviewers/bosses,
hiding the existence of unwanted results. This is a very bad practice,
because it always results in rediscovery of popular knowledge. New and
real discoveries are overlooked because they do not look familiar. I
will present a new technique called LAMP (Limitless Arity Multiple
testing Procedure) to evaluate the reliability of data mining results
in terms of statistical significance. Most results are biological but
underlying principles are relevant to all kinds of data analysis.
June 30
16:30 18:00, Joho 2, Research Bldg. No.7 (総合７), Main Campus
Kevin Duh, NAIST
Title: Learning Word Representations in Context
Abstract: Unsupervised learning of word representations provides a simple and effective way to improve the robustness of Natural Language Processing (NLP) systems. Exploiting large amounts of raw text, word representations based on based on clustering and neural language model algorithms can be used as more generalizable features and plugged in to existing NLP systems. Positive results have been demonstrated in dependency parsing (Koo2008), named entity recognition (Turian2010), partofspeech tagging (Huang2013), and sentiment analysis (Socher2013), among others.
Most word representation learning algorithms are static. I.e., there is only one representation per word type, and this same representation is employed in all occurrences of the word token. However, intuitively we know that word meaning and usage ought to change depending on the context. For example, the idea of cocompositionality in Generative Lexicon Theory states that a verb may change its meaning depending on the object it takes, and vice versa. I will present a neural language model that learns dynamic word representations using inspirations from this idea. I will also survey other recent works and present preliminary results on the application of such dynamic word representations to tagging and parsing tasks.
July 7
16:30 18:00, Joho 2, Research Bldg. No.7 (総合７), Main Campus
YiZhe Song, Queen Mary, University of London
Title: IntraCategory SketchBased Image Retrieval by Matching Deformable Part Models
Abstract: An important characteristic of sketches, compared with text, rests with their ability to intrinsically capture object appearance and structure. Nonetheless, akin to traditional textbased image retrieval, conventional sketchbased image retrieval (SBIR) principally focuses on retrieving photos of the same category, neglecting the finegrained characteristics of sketches. In this paper, we advocate the expressiveness of sketches and examine their efficacy under a novel intracategory SBIR framework. In particular, we study how sketches enable posespecific retrieval within object categories. Key to this problem is introducing a midlevel sketch representation that not only captures object pose, but also possess the ability to traverse sketch and photo domains. Specifically, we learn deformable partbased model (DPM) as a midlevel representation to discover and encode the various poses and parts in sketch and image domains independently, after which graph matching is performed on DPMs to establish pose correspondences across the two domains. We further propose an SBIR dataset that covers the unique aspects of finegrained SBIR. Through indepth experiments, we demonstrate the superior performance of our SBIR framework, and showcase its unique ability in posespecific retrieval.
July 14
16:30 18:00, Joho 2, Research Bldg. No.7 (総合７), Main Campus
Dongwon Lee, Pennsylvania State University
Title: Inferring Personal and Group Traits from Social Media
Abstract: Social Network Services (SNSs) have become an integral part of people’s lives in recent years, facilitating the social relations among people who share similar activities or interests in online and/or offline lives. With an explosively growing number of users, SNSs produce diverse forms of usergenerated contents (e.g., tweets, wall posts, uploaded photos, comments) as well as network features (e.g., follower or friend list). One of interesting questions (with practical implications) is to accurately infer the latent traits of individuals and groups in SNSs using diverse forms of social media therein. In this talk, I will present a few recent such attempts developed in the SUM project at Penn State.
Dongwon Lee is currently an associate professor of the Pennsylvania State University, College of Information Sciences and Technology, USA. He obtained his Ph.D. in Computer Science from UCLA in 2002. From 1995 to 1997, he has also worked at AT&T Bell Labs. Working mostly on the issues arising in the management and mining of data, he has (co)authored over 110+ scholarly articles in selective publication outlets in Databases and Data Mining fields. Further details can be found at the webpages of his research group