My last post used random forest proximity to visualize a set of diamond shapes (the random forest was trained to distinguish diamonds from non-diamonds).

This time I looked at the digits data set that Kaggle is using as the basis of a competition for "getting started". The random forest is trained to classify the digits, and this is an embedding of 1000 digits into 2 dimensions preserving proximities from the random forest as closely as possible:

The colors of the points show the correct label. The larger points are digits classified incorrectly, and you can see that in general those are ones that the random forest has put in the wrong "region". I've shown some of the digits themselves (instead of colored points) -- the red ones are incorrectly classified.
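For anyone who wants to try something similar, here is a minimal sketch. It is not the code behind the plot above. It assumes scikit-learn's small 8x8 `load_digits` set as a stand-in for the Kaggle data, takes the proximity of two digits to be the fraction of trees in which they land in the same leaf, and uses classical MDS on 1 - proximity as one possible way to get a 2-D embedding. Out-of-bag predictions are used here as one way to flag misclassified digits. The helper names `rf_proximity` and `classical_mds` are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier


def rf_proximity(forest, X):
    """Fraction of trees in which each pair of samples shares a leaf."""
    leaves = forest.apply(X)  # (n_samples, n_trees): leaf index per tree
    n_samples, n_trees = leaves.shape
    prox = np.zeros((n_samples, n_samples))
    for t in range(n_trees):
        col = leaves[:, t]
        prox += col[:, None] == col[None, :]
    return prox / n_trees


def classical_mds(dissimilarity, n_components=2):
    """Classical (Torgerson) MDS via double centering and eigendecomposition."""
    n = dissimilarity.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dissimilarity ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)  # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, top] * np.sqrt(np.maximum(eigvals[top], 0.0))


# Assumption: 1000 digits from scikit-learn's built-in set, not the Kaggle data.
digits = load_digits()
X, y = digits.data[:1000], digits.target[:1000]

forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X, y)

proximity = rf_proximity(forest, X)
embedding = classical_mds(1.0 - proximity)  # one 2-D point per digit

# Assumption: a digit counts as misclassified if its out-of-bag prediction is wrong.
oob_pred = forest.classes_[np.argmax(forest.oob_decision_function_, axis=1)]
misclassified = oob_pred != y
marker_sizes = np.where(misclassified, 80, 15)  # larger points for errors
```

With matplotlib, `plt.scatter(embedding[:, 0], embedding[:, 1], c=y, s=marker_sizes)` would give a rough equivalent of the plot described above. In this sketch, "as closely as possible" has a specific meaning: classical MDS chooses the 2-D coordinates whose inner products best approximate the double-centered squared dissimilarities. Other MDS variants minimize a different stress function.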

Here's the same but just for the 7's:

The random forest has done a reasonable job putting different types of 7's in different areas, with the most "canonical" 7's toward the middle.

You can see all of the other digits at http://www.learnfromdata.com/media/blog/digits/.

Note that this random forest is different from the one in my last post -- here it's built to classify the digits, not separate digits from non-digits. I wonder what kind of results a random forest to distinguish 7's from non-7's would look like?

Code is on GitHub.

Great job!! Can you explain what "2 dimensions preserving proximities from the random forest as closely as possible" means? For example, what metric did you use to determine that it stays as close as possible?
