# David Chudzicki's Blog

## Wednesday, March 4, 2015

### An Interaction or Not? Understanding a Few ML Algorithms via an Example with No Evidence Either Way

Since Github Pages is way nicer than Blogger, I'm writing over there at http://davidchudzicki.com now. To keep old posts alive, these pages at http://blog.davidchudzicki.com will remain as is.

## Tuesday, January 21, 2014

### Interactive Lissajous Curves in d3

You can play with it (and see the code) here.

## Saturday, October 26, 2013

### dithering

Then I learned that GIFs can only use 256 colors, so we have to map the original image's many colors down to the fewer available colors. We could just use the closest available color in the new set of colors, but that results in sharp jumps and "color banding" where the colors change.

So instead we use (and ImageMagick's default is) dithering, which assigns pixels to a mixture of color values that locally average to the appropriate shade. I had a great time reading about this in ImageMagick's documentation! Here's a description of the default method:

> The pixel is assigned the color closest to that pixel's value, and any difference between the pixel's original color and the selected color is saved and added to the next pixel's color values (which is always a neighbouring pixel) before a new color is again selected. In this way any color variations between selected colors and the image's original color are distributed to the other pixels in the same area. The result is that while only specific colors will be assigned to the final image, the same basic overall color for that area will closely match the original image.

When there aren't enough dots, the picture looks blurry.
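The error-diffusion idea in that description can be sketched in a few lines. Here's a toy 1-D grayscale version (my own illustration of the principle, not ImageMagick's actual algorithm): each pixel snaps to the nearest palette value, and the leftover error is carried to the next pixel so the local average survives.

```python
# Toy 1-D error-diffusion dither: quantize to a tiny palette while
# preserving the local average brightness. (Illustrative sketch only.)
def dither_1d(pixels, palette=(0, 255)):
    out = []
    error = 0.0
    for p in pixels:
        adjusted = p + error                              # add carried-over error
        nearest = min(palette, key=lambda c: abs(c - adjusted))
        error = adjusted - nearest                        # save difference for the neighbor
        out.append(nearest)
    return out

row = [100] * 10            # a mid-gray row
result = dither_1d(row)     # a mix of 0s and 255s whose average stays near 100
print(result)
```

Quantizing without the error term would map every pixel to 0 (pure black); carrying the error forward makes roughly four in ten pixels white instead, so the row still reads as mid-gray from a distance.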

[Image grid: PNG (lots of colors), no dither (GIF, 256 colors), and default dither (GIF, 256 colors), each shown at resolutions 800x600 and 200x150.]

```shell
convert -resize 100% -dither None ./zoomed_in/Weierstrass001.png ./res100no_dither.gif
convert -resize 50% -dither None ./zoomed_in/Weierstrass001.png ./res50no_dither.gif
convert -resize 100% ./zoomed_in/Weierstrass001.png ./res100default_dither.gif
convert -resize 25% ./zoomed_in/Weierstrass001.png ./res25default_dither.gif
convert -resize 100% ./zoomed_in/Weierstrass001.png ./res100many_colors.png
convert -resize 25% ./zoomed_in/Weierstrass001.png ./res25many_colors.png
```

## Wednesday, October 16, 2013

### A Bayesian Model for a Function Increasing by Chi-Squared Jumps (in Stan)

*This post will describe a way I came up with of fitting a function that's constrained to be increasing, using Stan. If you want practical help, standard statistical approaches, or expert research, this isn't the place for you (look up “isotonic regression” or “Bayesian isotonic regression” or David Dunson, whose work Andrew Gelman pointed me to). This is the place for you if you want to read about how I thought about setting up a model, implemented the model in Stan, and created graphics to understand what was going on.*

*Here's the source for this post, including markdown and R code, and here's the Stan model it uses.*

*Many thanks to Naftali Harris for comments and insights on an earlier version.*

I recently read a paper that Andrew Gelman wrote back in 1996 about theoretical considerations that can restrict model choice even “in the absence of data with minimal applied context”. As one example, he describes the problem of fitting a function \( f(t) \) that is known to be increasing and where \( f(0)=0 \) and \( f(1)=1 \) are known. We estimate the function given noisy observations of its value at \( (N-1) \) equally spaced sample points \( \theta_i=f(\frac{i}{N}) \) (for \( i \in \{1,2,3,...,N-1\} \)).

What prior should we use for the \( \{\theta_i\} \)? It might seem like a reasonable “uninformative” prior would make the \( \{\theta_i\} \) independent and uniform on \( [0,1] \), with the restriction that non-increasing sets of \( \{\theta_i\} \) are given zero probability. But this turns out badly! Gelman notes that this prior on the \( \{\theta_i\} \) is the same as the distribution of the order statistics of a sample of size \( N-1 \) from the uniform distribution on \( [0,1] \). As the number of points in the discretization increases, the mass of the prior concentrates around a straight line, which will overwhelm the data.
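Gelman's observation is easy to check numerically. The sketch below (my own, stdlib-only; not from the paper) draws from the ordered-uniform prior by sorting \( N-1 \) uniform draws, then measures how far a draw strays from the straight line \( \theta_i = i/N \):

```python
# Draws from the "uniform but increasing" prior are order statistics of
# uniforms, so for large N they concentrate around the line theta_i = i/N.
import random

def sample_prior(n, rng):
    """One draw of (theta_1, ..., theta_{n-1}) from the ordered-uniform prior."""
    return sorted(rng.random() for _ in range(n - 1))

def max_deviation_from_line(theta):
    """Largest |theta_i - i/n| over the draw."""
    n = len(theta) + 1
    return max(abs(t - (i + 1) / n) for i, t in enumerate(theta))

rng = random.Random(0)
for n in (10, 100, 1000):
    devs = [max_deviation_from_line(sample_prior(n, rng)) for _ in range(200)]
    print(n, sum(devs) / len(devs))   # average deviation shrinks as N grows
```

The average maximum deviation falls roughly like \( 1/\sqrt{N} \) (this is the same concentration that makes the empirical CDF hug the true CDF), which is exactly the "prior concentrates on a straight line" problem.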

To illustrate, here are some samples from this distribution, for various \( N \):

This is weird! Under the uniform distribution, each (legal) path is equally likely. That's why if we fit this model to noisy data, the maximum likelihood estimate (which is the mode of the posterior) would be reasonable. But it turns out that for large \( N \), almost all legal paths are basically straight lines, so the bulk of the posterior would be basically a straight line no matter what the data say. It reminds me a bit of statistical mechanics, which is largely about micro-level randomness determining macro-level properties. (I'm no expert, but I recommend this book.)

## Tuesday, July 2, 2013

### Mine Depth (Make your axes meaningful!)

- From 1886 onward, the mine seems to get deeper at a roughly constant rate (averaged over time).
- Are they digging it deeper smoothly over time, or does the depth increase in spurts?
- Why do we have so much more data in the period from 1886 to 1892 than the rest?

## Wednesday, June 19, 2013

### Effective altruists -- lifestyle changes for animals vs. the environment

Here are some guesses:

## Monday, March 11, 2013

### Weierstrass Elliptic Function

The code is here. The code is O(n^2) (where n is the number of terms I want images for) when it really should be O(n), since I'm in effect asking for a lot of repeated computation by making each image separately. Sorry!

## Sunday, August 5, 2012

### Visualize a random forest that classifies digits

This time I looked at the digits data set that Kaggle is using as the basis of a competition for "getting started". The random forest is trained to classify the digits, and this is an embedding of 1000 digits into 2 dimensions preserving proximities from the random forest as closely as possible:

The colors of the points show the correct label. The larger points are digits classified incorrectly, and you can see that in general those are ones that the random forest has put in the wrong "region". I've shown some of the digits themselves (instead of colored points) -- the red ones are incorrectly classified.

Here's the same but just for the 7's:

The random forest has done a reasonable job putting different types of 7's in different areas, with the most "canonical" 7's toward the middle.

You can see all of the other digits at http://www.learnfromdata.com/media/blog/digits/.

Note that this random forest is different from the one in my last post -- here it's built to classify the digits, not separate digits from non-digits. I wonder what kind of results a random forest to distinguish 7's from non-7's would look like?

Code is on Github.

### Random forests for visualizing data

- Have a data set
- Create a set of fake data by permuting the columns randomly -- each column will still have the same distribution, but the relationships are destroyed.
- Train a random forest to distinguish the fake data from the original data.
- Get a "proximity" measure between points, based on how often the points are in the same leaf node.
- Embed the points in 2D in such a way as to distort these proximities as little as possible.
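The steps above can be sketched with scikit-learn. This is my own illustrative reconstruction on synthetic data (the post's actual code is on Github), using `RandomForestClassifier`, leaf co-occurrence for proximity, and `MDS` for the embedding:

```python
# Unsupervised random-forest embedding: real vs column-permuted "fake" data,
# proximity = fraction of trees where two points share a leaf, then MDS.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
real = rng.normal(size=(200, 4))
real[:, 1] = real[:, 0] + 0.1 * rng.normal(size=200)   # a relationship to destroy

# Fake data: permute each column independently -- same marginals, no relationships.
fake = np.column_stack([rng.permutation(real[:, j]) for j in range(real.shape[1])])

X = np.vstack([real, fake])
y = np.array([1] * len(real) + [0] * len(fake))
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Proximity between real points: how often each pair lands in the same leaf.
leaves = forest.apply(real)                   # shape (n_points, n_trees)
prox = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)

# Embed so distances 1 - proximity are distorted as little as possible.
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(1 - prox)
print(coords.shape)                           # (200, 2)
```

Points that the forest routes through the same leaves (because they look "real" in the same way) end up near each other in the 2-D plot.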

*should* be, as a way of thinking about how it works. So I generated 931 images of diamonds varying in two dimensions:

- Size
- Position (only how far left/right)

**for the same difference in position, small diamonds need to be closer to each other than large diamonds.** None of my diamonds have a diameter smaller than 4 pixels, but imagine if the sizes got so small the diamond wasn't even there -- then position wouldn't matter at all for those diamonds.