Wednesday, March 4, 2015

An Interaction or Not? Understanding a Few ML Algorithms via an Example with No Evidence Either Way

My latest blog post helps explain a few statistical/ML models by looking at whether they learn an interaction in a toy example.

Since Github Pages is way nicer than Blogger, I'm writing over there now. To keep old posts alive, these pages will remain as is.

Tuesday, January 21, 2014

Interactive Lissajous Curves in d3

On a visit to San Francisco's Exploratorium, I saw an oscilloscope they had making Lissajous curves. So I decided to make a version in d3:

You can play with it (and see the code) here

Saturday, October 26, 2013


I was converting a PNG (which represented partial sums of the Weierstrass elliptic function) to a GIF and was confused about why the PNG looked fine at low resolution but the GIF looked bad.

Then I learned that GIFs can only use 256 colors, so we have to map the image's many colors down to a smaller palette. We could just use the closest available color in the new palette, but that results in sharp jumps and "color banding" where the colors change.
So instead we use (and ImageMagick's default is) dithering, which assigns pixels to a mixture of color values that locally average to the appropriate shade. I had a great time reading about this in ImageMagick's documentation! Here's a description of the default method:
The pixel is assigned the color closest to that pixels value, and any difference between the pixels original color and the selected color, is saved and added to the next pixels color values (which is always a neighbouring pixel) before a new color is again selected. In this way any color variations between selected colors and the images original color is distributed to the other pixels in the same area. The result is that while only specific colors will be assigned to the final image, the same basic overall color for that area will closely match the original image.
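The error-diffusion idea in that description can be sketched in one dimension. This is my own toy illustration, not ImageMagick's actual algorithm (which diffuses error to several neighboring pixels in 2D): quantize a grayscale ramp to pure black/white, carrying each pixel's quantization error forward to the next pixel.

```python
# A minimal 1D error-diffusion sketch: quantize grayscale values to
# black (0) or white (255), pushing each pixel's quantization error
# onto the next pixel so that local averages track the original shades.
def diffuse_1d(row):
    out = []
    error = 0.0
    for value in row:
        adjusted = value + error                 # add error carried from the left
        chosen = 255 if adjusted >= 128 else 0   # closest available color
        error = adjusted - chosen                # pass the difference along
        out.append(chosen)
    return out

gradient = [i * 255 // 31 for i in range(32)]    # smooth ramp, 0..255
dithered = diffuse_1d(gradient)
# the output is pure black/white, but local averages track the ramp
```

With only two colors this produces the familiar checkered look; with a 256-color palette the same trick hides the banding.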
When there aren't enough dots, the picture looks blurry.
Comparison grid (images not shown): the original PNG (lots of colors), a GIF with no dither (256 colors), and a GIF with the default dither (256 colors), each at full resolution (800x600) and reduced resolution (200x150).
Here's my code on Github, and here are the ImageMagick commands that produced these:
convert -resize 100% -dither None ./zoomed_in/Weierstrass001.png ./res100no_dither.gif
convert -resize 50% -dither None ./zoomed_in/Weierstrass001.png ./res50no_dither.gif
convert -resize 100%  ./zoomed_in/Weierstrass001.png ./res100default_dither.gif
convert -resize 25%  ./zoomed_in/Weierstrass001.png ./res25default_dither.gif
convert -resize 100%  ./zoomed_in/Weierstrass001.png ./res100many_colors.png
convert -resize 25%  ./zoomed_in/Weierstrass001.png ./res25many_colors.png

Wednesday, October 16, 2013

A Bayesian Model for a Function Increasing by Chi-Squared Jumps (in Stan)

This post will describe a way I came up with of fitting a function that's constrained to be increasing, using Stan. If you want practical help, standard statistical approaches, or expert research, this isn't the place for you (look up “isotonic regression” or “Bayesian isotonic regression” or David Dunson, whose work Andrew Gelman pointed me to). This is the place for you if you want to read about how I thought about setting up a model, implemented the model in Stan, and created graphics to understand what was going on.

Here's the source for this post, including markdown and R code, and here's the Stan model it uses.

Many thanks to Naftali Harris for comments and insights on an earlier version.

I recently read a paper that Andrew Gelman wrote back in 1996 about theoretical considerations that can restrict model choice even “in the absence of data with minimal applied context”. As one example, he describes the problem of fitting a function \( f(t) \) that is known to be increasing and where \( f(0)=0 \) and \( f(1)=1 \) are known. We estimate the function given noisy observations of its value at \( (N-1) \) equally spaced sample points \( \theta_i=f(\frac{i}{N}) \) (for \( i \in \{1,2,3,...,N-1\} \)).

What prior should we use for the \( \{\theta_i\} \)? It might seem like a reasonable “uninformative” prior would make the \( \{\theta_i\} \) independent and uniform on \( [0,1] \), with the restriction that non-increasing sets of \( \{\theta_i\} \) are given zero probability. But this turns out badly! Gelman notes that this prior on the \( \{\theta_i\} \) is the same as the distribution of the order statistics of a sample of size \( N-1 \) from the uniform distribution on \( [0,1] \). As the number of points in the discretization increases, the mass of the prior concentrates around a straight line, which will overwhelm the data.

To illustrate, here are some samples from this distribution, for various \( N \):

[Figure: samples from the restricted-uniform prior for various \( N \)]
This is weird! Under the uniform distribution, each (legal) path is equally likely. That's why, if we fit this model to noisy data, the maximum likelihood estimate (which is the mode of the posterior) would be reasonable. But it turns out that for large \( N \), almost all legal paths are basically straight lines, so the bulk of the posterior would be basically a straight line no matter what the data say. It reminds me a bit of statistical mechanics, which is largely about micro-level randomness determining macro-level properties. (I'm no expert, but I recommend this book.)
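The concentration is easy to check numerically. Here's a quick simulation sketch (my own Python toy, not the post's R code): a draw from the restricted-uniform prior is just a sorted sample of uniforms, and its largest deviation from the diagonal \( f(t)=t \) shrinks as \( N \) grows.

```python
# Check that draws from the restricted-uniform prior concentrate
# around the diagonal as the grid gets finer.
import random

def sample_increasing_path(n):
    """Draw theta_1 < ... < theta_{n-1} from the restricted-uniform prior
    (i.e., the order statistics of n-1 uniforms on [0, 1])."""
    return sorted(random.uniform(0, 1) for _ in range(n - 1))

def max_deviation_from_diagonal(n):
    """Largest |theta_i - i/n| over one sampled path."""
    path = sample_increasing_path(n)
    return max(abs(theta - (i + 1) / n) for i, theta in enumerate(path))

random.seed(0)
# deviations shrink roughly like 1/sqrt(N) as N grows
for n in (10, 100, 10000):
    print(n, round(max_deviation_from_diagonal(n), 3))
```

For \( N=10 \) the paths wander visibly; by \( N=10000 \) a typical draw is pinned within a couple of percent of the straight line, which is exactly the "prior overwhelms the data" problem.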

Tuesday, July 2, 2013

Mine Depth (Make your axes meaningful!)

At the Empire Mine State Historic Park, you can walk 30 feet into the old gold mine, meet a quirky and knowledgeable blacksmith named David, see an awesome scale model of the mine (above--very secret during the years of operation!), and see this chart visualizing the depth of the mine over time:

I usually like axes to be as meaningful as possible, so I redrew the chart (code below) representing the year with the horizontal axis:
The original version probably has its own advantages as well, but some observations and questions come a little more quickly from my version:

  • From 1886 onward, the mine seems to get deeper at a roughly constant rate (averaged over time).
  • Are they digging it deeper smoothly over time, or does the depth increase in spurts?
  • Why do we have so much more data in the period from 1886 to 1892 than the rest?

Before drawing the chart this way, I figured that depth would increase faster in more recent years (with better technology), which isn't true at all. But note that this is just the maximum depth -- maybe they can dig much more easily with better technology, but that's spread across more tunnels, so the depth doesn't increase any faster? Or maybe each bar on the original chart represents a short, concentrated digging effort (meaning they can increase the depth much faster in recent years, but also go for much longer without increasing it at all)?

Wednesday, June 19, 2013

Effective altruists -- lifestyle changes for animals vs. the environment

A good question from the Effective Altruism group on Facebook: "why EAs [people in the effective altruism community] often think that making changes to their personal lifestyles in order to reduce factory farming is worth it while hardly any EAs seem to think that making changes in their personal lifestyles in order to reduce global warming is worth it."

Here are some guesses:

Monday, March 11, 2013

Weierstrass Elliptic Function

\[ \wp\left(z\right)=\frac{1}{z^{2}}+\sum_{\omega\in\Lambda\smallsetminus\left\{ 0\right\} }\left(\frac{1}{\left(z-\omega\right)^{2}}-\frac{1}{\omega^{2}}\right) \]

(where \( \Lambda \) is a lattice in the complex plane and the sum runs over its nonzero points \( \omega \)).

The technique of adding up one term for each lattice point is really neat, and it's extra great that the function turns out to have such amazing properties.

Since John's post showed what pretty pictures come from using phase portraits to visualize complex functions, I decided to make a video showing the partial sums that converge to the Weierstrass function. I'm adding terms starting from the origin and going outward, but you could add the terms in any order.

The picture on the left is a zoomed in version showing two periods in each direction. You can see the function getting more periodic as we add more terms.

The code is here. The code is O(n^2) (where n is the number of terms I want images for) when it really should be O(n), since I'm in effect asking for a lot of repeated computation by making each image separately. Sorry!
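The O(n) version would accumulate the running sum once and snapshot it after each term, instead of recomputing every partial sum from scratch for each frame. Here's a sketch of that idea (my own illustration, not the post's actual code; the integer lattice and the helper names are my assumptions):

```python
# Accumulate Weierstrass partial sums in one pass: each new frame adds
# exactly one lattice term to a running total, instead of re-summing
# all previous terms.

def lattice_points_by_distance(k):
    """First k nonzero integer-lattice points, sorted outward from 0.
    (A toy lattice; the real code would use the period lattice.)"""
    pts = [complex(a, b) for a in range(-10, 11) for b in range(-10, 11)
           if (a, b) != (0, 0)]
    pts.sort(key=abs)
    return pts[:k]

def partial_sums(z, k):
    """All k partial sums of the Weierstrass series at z, in one pass."""
    total = 1 / z**2
    sums = []
    for w in lattice_points_by_distance(k):
        total += 1 / (z - w)**2 - 1 / w**2   # one new term per step
        sums.append(total)
    return sums   # sums[i] uses the i+1 nearest lattice terms
```

Each entry of `partial_sums` is what one video frame would visualize, so n frames cost n term-evaluations total rather than 1 + 2 + ... + n.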

Sunday, August 5, 2012

Visualize a random forest that classifies digits

My last post uses random forest proximity to visualize a set of diamond shapes (the random forest is trained to distinguish diamonds from non-diamonds).

This time I looked at the digits data set that Kaggle is using as the basis of a competition for "getting started". The random forest is trained to classify the digits, and this is an embedding of 1000 digits into 2 dimensions preserving proximities from the random forest as closely as possible:

The colors of the points show the correct label. The larger points are digits classified incorrectly, and you can see that in general those are ones that the random forest has put in the wrong "region". I've shown some of the digits themselves (instead of colored points) -- the red ones are incorrectly classified.

Here's the same but just for the 7's:

The random forest has done a reasonable job putting different types of 7's in different areas, with the most "canonical" 7's toward the middle.

You can see all of the other digits

Note that this random forest is different from the one in my last post -- here it's built to classify the digits, not separate digits from non-digits. I wonder what the results of a random forest trained to distinguish 7's from non-7's would look like?

Code is on Github.

Random forests for visualizing data

Recently I read about using random forests as a way to visualize data. Here's how it works:
  1. Have a data set
  2. Create a set of fake data by permuting the columns randomly -- each column will still have the same distribution, but the relationships are destroyed.
  3. Train a random forest to distinguish the fake data from the original data.
  4. Get a "proximity" measure between points, based on how often the points are in the same leaf node.
  5. Embed the points in 2D in such a way as to distort these proximities as little as possible.
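The five steps above can be sketched in Python with scikit-learn (the original work used R; the synthetic data, parameters, and library choices here are all my own illustration, not the post's code):

```python
# Random-forest proximity visualization: real vs. permuted-column fake
# data, leaf-node proximities, then a 2D embedding of the proximities.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.manifold import MDS

rng = np.random.default_rng(0)

# Step 1: a data set (with one real relationship between columns).
real = rng.normal(size=(200, 5))
real[:, 1] = real[:, 0] + 0.1 * rng.normal(size=200)

# Step 2: fake data -- permute each column independently, which keeps
# the marginal distributions but destroys the relationships.
fake = real.copy()
for j in range(fake.shape[1]):
    fake[:, j] = rng.permutation(fake[:, j])

# Step 3: train a forest to tell real from fake.
X = np.vstack([real, fake])
y = np.array([1] * len(real) + [0] * len(fake))
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Step 4: proximity = fraction of trees placing two points in the same leaf.
leaves = forest.apply(real)            # shape (n_points, n_trees)
prox = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)

# Step 5: embed in 2D, distorting the dissimilarities (1 - proximity)
# as little as possible.
embedding = MDS(n_components=2, dissimilarity="precomputed",
                random_state=0).fit_transform(1 - prox)
print(embedding.shape)   # one 2D point per real data point
```

Points that the forest routinely routes to the same leaves end up near each other in the embedding, which is what makes the diamond pictures below come out so cleanly.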
I decided to try this in a case where I would know what the outcome should be, as a way of thinking about how it works. So I generated 931 images of diamonds varying in two dimensions:
  1. Size
  2. Position (only how far left/right)

Then I followed the above procedure, getting this:

Neat! The random forest even picked up on a feature of this space that I wasn't expecting it to: for the same difference in position, small diamonds need to be closer to each other than large diamonds. None of my diamonds have a diameter smaller than 4 pixels, but imagine if the sizes got so small the diamond wasn't even there -- then position wouldn't matter at all for those diamonds.

I set one column of pixels to random values, and the method still worked just as well. (Which makes sense, as the random forest only cares about pixels that help it distinguish between diamonds and non-diamonds.)

A cool technique that I'd love to try some more! For one, I'd like to understand better how it differs from various manifold learning methods. One nice feature is that you could easily use this with a mix of continuous and categorical variables.

Note that starting with Euclidean distance between images (as vectors in R^2500) and mapping points to 2D doesn't seem to produce anything useful:

Code available on github.