## Tuesday, October 5, 2010

### Partha Niyogi

Below are some notes I wrote as a graduate student while (for a brief time) working with Partha Niyogi at the University of Chicago. Partha recently passed away, and I wanted to put these notes in a more public place.

I never knew him terribly well, but talking with and learning from Partha was really a great pleasure for me. Thank you, Partha.

Below:

- why I like what I think about at work
- Using the geometric structure of data for learning, how heat generalizes Fourier series

#### why I like what I think about at work

I like it because of the balance between questions/concepts that feel 'deep' and concrete problems/applications.

In my first year of college, I took a class about philosophy of science. We talked a lot about 'scientific induction.' How does it work? What justifies it? I think this idea, that similar things will happen in similar situations, is behind all learning, not just 'science.' But it's really hard to say what makes situations similar.

Say you know a bunch of 'situations' (encoded as vectors), and what happened (encoded as function values). Then there's a metric, or distance, on these vectors that you could use to judge 'similarity.' But this is probably not the right thing to do, because the particular encoding you used is fairly arbitrary. So the approach I'm studying tries to think of the collection of 'situations' that arise as a geometric structure, and make use of that structure instead of the encoding. Similarity might be measured by geodesic distance (but maybe better is 'diffusion distance', basically measuring how long it takes heat to flow from one part of the manifold to another).

I also like the way this work connects different mathematical fields for me¹. And it's neat doing something that relates to 'applications'--people building machines that learn². (Or maybe even relates to thinking about how biological systems learn.)

Maybe this was more about why I like the idea of what I think about at work. Hopefully in a few years I'll make a post about how I managed to like it enough to think about it hard enough to get a PhD.

¹ The 'philosophy,' geometry, some topology, operator theory, (spectral) graph theory, stochastic processes. Laplacians are the thing connecting all of these in what we do.

² It was neat looking up a paper and noticing on the author's webpage that he now works for Google.

#### Using the geometric structure of data for learning, how heat generalizes Fourier series

The problem: Suppose you have a whole bunch of points in R^N, and there's some function f which you want (your computer) to learn. For example, maybe the data are a bunch of pictures (represented as vectors of pixel values) of objects, and the function maps each picture to what sort of object it is. You'd like (your computer) to learn this function, so it can recognize objects on its own. In general, the problem is learning a function f: R^N --> R from examples.

Machine learning people have standard techniques for these sorts of things, but they may not work that well when N is very large, as it often is in modern applications (we have lots of tools for collecting lots of data). But maybe we're lucky and there's some geometric structure to the data: even though the 'ambient' space is high-dimensional, maybe the data we see will all come from some lower-dimensional geometric object (a 'manifold') embedded in this space. (In the example above, this is roughly the fact that most vectors of pixel values don't correspond to meaningful images.) And let's assume the data we're trying to use for learning have been sampled uniformly (randomly) from this manifold.

If we have a manifold, we can come up with a collection of real-valued 'basis' functions defined on the manifold, in the same way that the sines and cosines of Fourier analysis are good for representing functions defined on a circle. The theorems say things like (but less vaguely): "Using linear combinations of some number of your basis functions, you can approximate well any nice function (here 'nice' means that the first k derivatives aren't too big) defined on your manifold." So then you just have to find the right representation of the function you're trying to learn in terms of those basis functions, which is much easier.
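As a toy version of that last step (my own illustration, not from the original notes): on the circle, where the basis functions really are the sines and cosines, fitting a smooth function with a truncated Fourier basis is just a least-squares problem.

```python
import numpy as np

# Sample points on the circle and a smooth target function to "learn."
t = np.linspace(0, 2 * np.pi, 200, endpoint=False)
f = np.exp(np.sin(t))  # smooth, so its Fourier coefficients decay quickly

# Truncated Fourier basis: 1, cos(nt), sin(nt) for n = 1..K.
K = 5
columns = [np.ones_like(t)]
for n in range(1, K + 1):
    columns.append(np.cos(n * t))
    columns.append(np.sin(n * t))
B = np.column_stack(columns)

# Least squares: find coefficients c with B @ c close to f.
c, *_ = np.linalg.lstsq(B, f, rcond=None)
max_err = np.max(np.abs(B @ c - f))
print(max_err)  # tiny: a few low-frequency modes suffice for a smooth f
```

On a general manifold, the Laplacian eigenfunctions described below play the role of these sines and cosines, and the same least-squares step finds the representation.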

Notice that we only have the data, not the manifold. So we'll need some way to approximate these basis functions from the data. That works really well, and is probably the whole reason for this approach. But when I get to a few details below, I'll need to be mathier.

Notice also that this approach decides what class of functions is 'natural' for the problem at hand based on the geometry of the underlying manifold. Imagine a path ('1-manifold') looping around in space: the 'intrinsic' distances along the path will be completely different from the 'ambient' distances in R^3. We're making a decision that the former matter more. (Justifying this decision is another matter.)

Notice finally that I've been assuming that to learn our function, we were given a bunch of points together with the corresponding function values at those points. But what if we're given the function values for only some of the points, and the rest of the points come without function values? (This is like seeing a whole bunch of pictures of objects, but only getting labels for some of them.) Since we're trying to approximate the intrinsic geometry of the data, we can still use the 'unlabeled' data for that part of the problem. When people learn, they make lots of use of unlabeled data (we see lots of objects that no one labels for us; when we're learning language, we hear lots of speech that isn't attached to its meaning for us; and so on). And in applications, unlabeled data is a whole lot cheaper. The theme: unlabeled data is for figuring out the geometry of the space the data come from.
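A minimal sketch of that theme (again my own illustration; the circle data and the harmonic-interpolation step are my choices, not a recipe from the notes): build a graph Laplacian from all the points, labeled and unlabeled, then extend the few labels by making the function as smooth as possible with respect to that geometry.

```python
import numpy as np

# Data sampled from a circle in R^2; only every 30th point gets a label.
rng = np.random.default_rng(1)
t = np.sort(rng.uniform(0, 2 * np.pi, 300))
X = np.column_stack([np.cos(t), np.sin(t)])
y_true = np.sin(t)  # the function we'd like to learn

# Graph Laplacian built from ALL points -- this is where unlabeled data helps.
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-d2 / 0.01)          # Gaussian edge weights from ambient distances
L = np.diag(W.sum(1)) - W       # unnormalized graph Laplacian

lab = np.arange(0, 300, 30)              # indices of the labeled points
unl = np.setdiff1d(np.arange(300), lab)  # the rest are unlabeled

# Harmonic interpolation: minimize f^T L f with f fixed to the labels,
# i.e. solve L_uu f_u = -L_ul y_l for the values at unlabeled points.
f = np.empty(300)
f[lab] = y_true[lab]
f[unl] = np.linalg.solve(L[np.ix_(unl, unl)], -L[np.ix_(unl, lab)] @ y_true[lab])

max_err = np.max(np.abs(f - y_true))
print(max_err)  # modest: 10 labels plus the geometry recover the function
```

The unlabeled points never contribute function values, only edges in the graph; that is exactly "using unlabeled data to figure out the geometry."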

Now some math:

On the space of functions on the circle (L^2, let's say), d^2/dx^2 is a 'densely defined' linear operator. Its eigenfunctions are the familiar sin(nt), cos(nt) (eigenvalues -n^2) of Fourier series.
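Those -n^2 eigenvalues are easy to see numerically (a quick sanity check, not part of the original notes): discretize d^2/dx^2 on the circle and look at the top of its spectrum.

```python
import numpy as np

# Discretize d^2/dx^2 on the circle: m grid points, periodic boundary.
m = 400
h = 2 * np.pi / m
L = (np.diag(-2.0 * np.ones(m))
     + np.diag(np.ones(m - 1), 1)
     + np.diag(np.ones(m - 1), -1)) / h**2
L[0, -1] = L[-1, 0] = 1 / h**2  # wrap-around edges: it's a circle

# Eigenvalues, largest first, should approximate 0, -1, -1, -4, -4, -9, -9, ...
# (each -n^2 appears twice, once for sin(nt) and once for cos(nt))
evals = np.sort(np.linalg.eigvalsh(L))[::-1]
print(np.round(evals[:5], 3))
```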

That operator is the Laplace operator, Δ, for the circle. Associated to any Riemannian manifold, we have a Laplace operator. This operator is tied to heat flow: we could 'define' it by starting with a smooth function f (supposed to represent 'heat' distributed on our manifold) and letting H_t f be the distribution of heat after time t. Then Δf is (d/dt)H_t f evaluated at time 0. And the eigenfunctions of Δ are the same as those of H_t.

When the manifold M is compact, these eigenfunctions form a basis for L^2(M). If you do physics, you might have heard of 'spherical harmonics,' which are what those functions are called when M is a sphere. So now we want to approximate them from the data.

We imagine a weighted graph whose vertices are our data points and whose edges have weights that depend on the distance between them in the ambient space. From the graph, we get a matrix called the graph Laplacian, and (because it's sparse) there are good algorithms for finding its eigenvectors. Then these eigenvectors correspond in a nice way to functions, and those functions 'converge' (I'm being vague here) to the right eigenfunctions, as we get more data.
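Here's roughly what that construction looks like (a sketch under my own choices of kernel and data -- points sampled from a circle sitting in R^3 -- not the exact recipe from the notes). For big data sets you'd keep the weight matrix sparse and use a sparse eigensolver such as scipy.sparse.linalg.eigsh; a small dense version shows the idea:

```python
import numpy as np

# Data: points sampled from a circle embedded in R^3 (a 1-manifold).
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 2 * np.pi, 500))
X = np.column_stack([np.cos(t), np.sin(t), np.zeros_like(t)])

# Edge weights from AMBIENT distances, via a Gaussian (heat) kernel.
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-d2 / 0.01)

# Graph Laplacian L = D - W; its low eigenvectors approximate the
# manifold's Laplacian eigenfunctions sampled at the data points.
L = np.diag(W.sum(1)) - W
evals, evecs = np.linalg.eigh(L)

# The first nonconstant eigenvector should be close to a first Fourier
# mode a*cos(t) + b*sin(t) in the INTRINSIC angle t. Check by regression:
v = evecs[:, 1]
B = np.column_stack([np.ones_like(t), np.cos(t), np.sin(t)])
coef, *_ = np.linalg.lstsq(B, v, rcond=None)
rel_resid = np.linalg.norm(v - B @ coef) / np.linalg.norm(v)
print(rel_resid)  # small: the eigenvector recovered the intrinsic angle
```

Note that the code never sees t when building the graph, only the ambient coordinates X; the intrinsic structure comes out of the eigenvectors.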

That's all done, so what am I trying to do? The biggest thing is that now we want to deal with noisy data, so our data don't quite live on the manifold, but only near it. So we're dealing with data drawn from a probability distribution on R^N supported near the manifold, the operator is now a generalization of the classical Laplacian that applies to measures, and, worst of all, R^N is not compact. So I'm thinking about such matters as how to get the above convergence result in this setting. But also such matters as how the eigenfunctions in the 'noisy' case relate to the non-noisy manifold, what they tell us about its geometry, and so on.