The Unreasonable Effectiveness of Random Forests (medium.com/rants-on-machine-learning)
111 points by PaulHoule on Sept 19, 2015 | 34 comments


The article mentions that "The main drawback of Random Forests is the model size. You could easily end up with a forest that takes hundreds of megabytes of memory and is slow to evaluate."

A cool trick for speeding up trained random forests/gradient boosted decision trees is to dump the trees to C (or assembly), compile that into a shared library, and dlopen it to get a function pointer. Depending on the model and the architecture, you can get an 8x speedup relative to an optimized C implementation that walks the tree.

There's an implementation for scikit-learn with benchmarks at https://github.com/ajtulloch/sklearn-compiledtrees if you're interested in this technique.
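
To make the trick concrete, here is a minimal sketch of what the generated code and the loading side can look like (this is not the actual sklearn-compiledtrees output; the file names, function name, and feature layout are invented for illustration):

    /* tree_0.c -- hypothetical code emitted for one small decision tree.
       Build with: cc -O2 -shared -fPIC tree_0.c -o tree_0.so            */
    double evaluate_tree_0(const double* f) {
        if (f[2] <= 0.731) {
            if (f[0] <= 12.5) return 0.82;
            return 0.31;
        }
        if (f[5] <= -1.04) return 0.27;
        return 0.64;
    }

    // loader.cpp -- dlopen the compiled tree and grab a function pointer.
    // Build: c++ loader.cpp -ldl -o loader
    #include <dlfcn.h>
    #include <cstdio>
    typedef double (*tree_fn)(const double*);
    int main() {
        void* handle = dlopen("./tree_0.so", RTLD_NOW);
        if (!handle) { std::fprintf(stderr, "%s\n", dlerror()); return 1; }
        tree_fn eval = reinterpret_cast<tree_fn>(dlsym(handle, "evaluate_tree_0"));
        const double features[8] = {10.0, 0, 0.5, 0, 0, -2.0, 0, 0};
        std::printf("prediction: %f\n", eval(features));
        dlclose(handle);
        return 0;
    }

A forest is then just the sum (or vote) over many such functions; the compiler turns the tree structure into straight-line comparisons instead of data-dependent pointer loads.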


(I) Indeed, it is a pretty good trick, and is simple to implement, given that the algorithm for evaluating a random forest is simple. Good on you for implementing it nicely for scikit-learn, benchmarking it, and making the code available.

I did something similar a few years ago when using a random forest as a heuristic to speed-up some minimax game tree search code. I don't think I still have the python scripts used to generate the compiled code (no great loss, they would have been terrible throwaway code) but amusingly enough I still have an example of a compiled random forest:

(warning: link #2 is a ~44k line cpp file of gotos encoding 50 compiled decision trees)

    1. https://raw.githubusercontent.com/fcostin/hangman_cpp/master/forest.h
    2. https://raw.githubusercontent.com/fcostin/hangman_cpp/master/forest.cpp
This probably isn't the fastest possible encoding of a bunch of decision trees, but it was pretty quick.
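
For anyone curious what the goto encoding looks like without opening a 44k-line file, here is a hand-written miniature in the same spirit (labels and thresholds invented for illustration):

    double tree_7(const float* f) {
        node_0: if (f[3] <= 0.25f) goto node_1; else goto node_2;
        node_1: if (f[9] <= 4.50f) goto leaf_a; else goto leaf_b;
        node_2: if (f[1] <= 1.75f) goto leaf_b; else goto leaf_c;
        leaf_a: return 0.12;
        leaf_b: return 0.55;
        leaf_c: return 0.91;
    }

Every internal node becomes a labelled comparison and every leaf a labelled return, so evaluation is a chain of jumps with no memory indirection at all.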

(II) On a different note, the article states "Another point that some might find a concern is that random forest models are black boxes that are very hard to interpret.". Well, yes and no. Decision trees are pretty easy to interpret -- you can always pick out one of the decision trees and look at it.

If one is interested in learning more about random forests I'd recommend taking a look at these notes from Breiman and Cutler (the folks responsible for the algorithm!): https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home...

E.g. they discuss "variable importance" measures, as featured in the original R version of randomForest. There are a bunch of tools built around random forests for extracting some insight beyond the raw predictions, to aid interpretation! Use them! https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home...

(III) Final comment: the article doesn't mention the Bayesian interpretation of ensembling a bunch of statistical models together. That's another potentially insightful way of thinking about why the method might be effective, and isn't limited to decision trees.


Why go through all that, though? The fundamental principle is that you are flattening the tree into one linear section of memory, so that each pointer jump lands in a location that is likely already prefetched and possibly in the same cache line. Dumping to C seems like it is missing the point.

A library like Cereal should be able to do this almost trivially: serialize the data, deserialize it into a flat buffer, then point to the start of that buffer.

Also, I should say that a random forest built by pointer-hopping to each new level is a very naive implementation: it takes a long time to create because of heap allocation and a long time to traverse because of cache misses.
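
To make the flat-buffer version concrete, a minimal sketch (node layout and names are mine, not Cereal's): the whole tree lives in one contiguous array, children are referenced by index rather than by pointer, and traversal is an index walk through memory the prefetcher can keep up with.

    #include <cstdint>
    #include <vector>

    // One node of a flattened decision tree, kept small so several fit per cache line.
    struct FlatNode {
        float    threshold;  // split value
        uint16_t feature;    // feature index to compare against
        uint8_t  is_leaf;    // 1 if this node is a leaf
        int32_t  left;       // index of left child in the same array
        int32_t  right;      // index of right child
        float    value;      // prediction, used only when is_leaf == 1
    };

    float evaluate(const std::vector<FlatNode>& nodes, const float* features) {
        int32_t i = 0;  // root is stored at index 0
        while (!nodes[i].is_leaf) {
            const FlatNode& n = nodes[i];
            i = (features[n.feature] <= n.threshold) ? n.left : n.right;
        }
        return nodes[i].value;
    }

Serializing and deserializing such a vector (with Cereal or anything else) is straightforward precisely because there are no pointers to fix up.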


That's actually not the case - that's one effect, sure, but when you compare the compiled trees to the flattened trees (the strategy just described), the compiled trees are substantially faster across a range of parameters. See http://tullo.ch/articles/decision-tree-evaluation/ for a detailed evaluation.


It's really awesome that you put that together, but I have to think that 200 points and a depth < 2 would basically be a linear search over about 100 elements, right?

Fundamentally I can't think of any reason compiling a tree would be a good approach in a general sense.

Another optimization is to make splits span multiple levels instead of just one. How many levels might depend on how many splits you can squeeze into a cache line. If your split position is a float and the dimension index is a byte, each split is 5 bytes, and you can squeeze 12 of these into a 64-byte cache line. You can go three levels down instead of one by packing 7 splits into a chunk (1 split + 2 splits + 4 splits), which still leaves room for a 16-bit index or pointer to each of the chunk's 8 children; with a two-level chunk of 3 splits and 4 children you can afford full 32-bit indices.
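
A rough sketch of the three-level layout (the field names and the choice of 16-bit child indices are mine, sized to fit a 64-byte line):

    #include <cstdint>

    // A three-level chunk of a tree packed to fit one 64-byte cache line:
    // 7 splits (1 + 2 + 4) at 5 bytes each = 35 bytes, plus 8 two-byte
    // indices to the chunks or leaves below = 51 bytes in total.
    #pragma pack(push, 1)
    struct Split {
        float   threshold;  // split position, 4 bytes
        uint8_t feature;    // dimension index, 1 byte
    };
    struct Chunk {
        Split    split[7];  // split[0] on top, split[1..2] below it, split[3..6] below those
        uint16_t child[8];  // chunk or leaf reached from each of the 8 exits
    };
    #pragma pack(pop)
    static_assert(sizeof(Chunk) <= 64, "a chunk should fit in one cache line");

    // Descend three levels within a chunk using heap indexing; returns the exit taken (0..7).
    inline int descend(const Chunk& c, const float* f) {
        int i = 0;
        for (int level = 0; level < 3; ++level)
            i = 2 * i + 1 + (f[c.split[i].feature] > c.split[i].threshold);
        return i - 7;
    }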


Mmmm,

Yeah, as I understand it, the enthusiasm for deep learning comes from the fact that it scales well to truly huge data sets, whereas most other machine learning algorithms tend to create model artifacts of a size similar to the data in question.

That is to say, deep learning, random forests, and SVMs all do approximately the same thing - non-linear regression, equivalently regression on a feature space: roughly, drawing curves between two sets in a very high-dimensional space and using these curves to distinguish objects.

Also - the linked paper giving all the actual details of the algorithm seems unresponsive.

This link worked for me:

https://www.stat.berkeley.edu/~breiman/randomforest2001.pdf


Is this often done? Can you provide a reference explaining this further? Possibly specific to ML models.


Random forests are great. I use them in real-time silhouette identification of people against backgrounds by porting the random forest to the GPU. This allows blazing fast forest evaluation for every pixel of my input image concurrently. I used this paper to implement the design of it http://www.msr-waypoint.com/pubs/71445/ForestFire.pdf

The main trick is in storing the tree as a binary heap in the rows of a texture, with the left and right subtrees at 2i and 2i+1, respectively, and with the attribute index and split value encoded in the color channels. Your texture width is 2^depth - 1, and your texture height is the number of trees in your forest. The texture ends up looking something like this (scaled up x10) http://i.imgur.com/afH5iFl.png

With a GPU shader, you can navigate these trees very quickly to classify an entire input image at once.
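
A CPU-side sketch of the same traversal (the struct here is a stand-in for the texels; in the shader, the split value and attribute index come out of the color channels of a texture fetch):

    #include <cstddef>
    #include <vector>

    // One texture row = one tree stored as a binary heap: node i has its
    // children at 2i and 2i+1, so no pointers are needed. Index 0 is left
    // unused so that the 1-based heap arithmetic works out.
    struct TexelNode {
        float split;    // split value (one color channel)
        int   feature;  // attribute index (another channel)
        bool  leaf;
        float value;    // class score if leaf
    };

    float evaluate_row(const std::vector<TexelNode>& row, const float* features) {
        std::size_t i = 1;  // root of the heap
        while (!row[i].leaf) {
            i = 2 * i + (features[row[i].feature] <= row[i].split ? 0 : 1);
        }
        return row[i].value;
    }

On the GPU, the same loop runs per pixel in a shader, with the array lookups replaced by texture fetches along one row of the forest texture.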


This is so true. Random forest should be the default choice for most problem sets.

Now we have The Unreasonable Effectiveness of Random Forests and The Unreasonable Effectiveness of Recurrent Neural Networks [1]. We just need The Unreasonable Effectiveness of XGBoost (for winning Kaggle competitions) and we'll have the whole set.

[1] http://karpathy.github.io/2015/05/21/rnn-effectiveness/


I love random forests, but IMO the default should be extremely randomized trees. It's interesting that he advocates random forests when ExtraTrees (sklearn's extremely randomized trees implementation) is also visible in his images.

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.65....


Another one is The Unreasonable Effectiveness of Deep Learning [1] by LeCun.

[1] http://on-demand.gputechconf.com/gtc/2014/webinar/gtc-expres...


All of which are a play on 'The Unreasonable Effectiveness of Mathematics in the Natural Sciences,' I believe.

https://en.wikipedia.org/wiki/The_Unreasonable_Effectiveness...


Here's a poor quality recording of LeCun's presentation https://www.youtube.com/watch?v=sc-KbuZqGkI&list=PL513vUSBNc...


Maybe we need an RNN analyzing titles of classics to come up with other tentative titles (I forget what this process is called).


I'm surprised the author didn't mention model variance reduction, which is what RFs were designed to do. For example, if we generate 100 data sets which are the same except for noise, the 100 decision boundaries drawn by a classification tree will vary much more than the decision boundaries drawn by RFs on the same sets.

Here's what this means visually:

http://i.imgur.com/IjfXFkm.png

Only one of the 100 generated data sets is shown.


Hmm, those CT boundaries aren't just high variance, they have extremely high error, which seems a more significant problem. They are linear!


Each CT has zero error on its training set. The overall structure in each training set is the same (shown by the circles and triangles). Each CT, however, fits to the noise that is specific to each training set. This causes the "linear" regions you mention: because CTs recursively subdivide the feature space, these regions are essentially small subdivision areas fit to noise, which have extended beyond the class region to which they "should" belong.


Nice post. However, I was kind of surprised to see the author list interpretability as one of the drawbacks of random forests:

> Another point that some might find a concern is that random forest models are black boxes that are very hard to interpret.

I generally agree that random forests are more difficult to interpret than linear models such as logistic regression, but I think they're still far more interpretable than more comparable non-linear models such as neural networks or SVMs with non-linear kernels. At the end of the day, a random forest is just a bunch of decision trees, each of which is very straightforward for humans to understand. Additionally, there are a number of straightforward methods available for assessing the importance of each individual feature in a random forest in aggregate ([1], section 15.4). Neural networks, on the other hand, result in models that are too cryptic to be meaningfully inspected by a human, and have more complex variable importance measures [2].

The relative interpretability of these non-linear models played a large factor in our decision to add random forests to our modeling stack at Sift Science, which you can read more about here: http://blog.siftscience.com/blog/2015/large-scale-decision-f...

[1]: http://statweb.stanford.edu/~tibs/ElemStatLearn/

[2]: http://www.massey.ac.nz/~mkjoy/pdf/Olden,Joy&DeathEM.pdf


I wish there were some examples in the article of their effectiveness, mentions of best areas for their use, etc. Currently, as a non-ML expert, my takeaway from the article was "there's a thing called a Random Forest. That's quite cool."

Maybe I'm not the target audience, though.


RFs work well on heterogeneous data (i.e. a mix of data types from different sources, which might include numerical, categorical, and text data, etc.). They also tend to be fairly resistant to noise in the data and to work well on data sets that are wide (i.e. have more features/dimensions than cases/observations). Finally, they are fairly resistant to overfitting (i.e. fitting quirks in the training data instead of generalizable signal) and don't need too much parameter tuning.

This is largely because they are a randomized ensemble of weaker models. Individual decision trees are quite prone to overfitting and other issues, but in an RF you grow a bunch of them on different bootstrap samples of the data and let them vote, and it turns out the combined performance is much better and much less error-prone than a single model.

Specific examples where they work well include genetic data (many more noisy variables than observations) and customer/consumer data. They also get used on image data and in signal processing, but deep neural networks have recently been tending to beat them there and on similar, less heterogeneous data sets.


Random Forest is an algorithm for classification or prediction.

Usually the SciKit-learn algorithm cheat sheet[1] is a good guide for this, but it doesn't include random forest (or any kind of decision tree).

[1] http://scikit-learn.org/stable/tutorial/machine_learning_map...


I share the awe of RFs (especially as they look naive, but turn out to be extremely good for a wide range of problems); however, the main problem is that they are black boxes. Typically, it is easy to get good results, but almost no insight or further pointers.

Many times I ended up using linear regression (with properly engineered variables), or something equally simple, because it gave results almost as good as RF, but I could interpret the results, inputs, etc. (And my task was academic or business analytics, so CV score was not the only figure of merit.)


Random forests being black boxes is something that is widely repeated even among active practitioners, but it is not really true any more. There are methods available for decomposing random forest predictions into feature contributions, so that each prediction is represented as the sum of a bias term and a contribution from each feature (similar to what you get in linear regression, but at the level of each individual prediction instead of one set of coefficients for the whole model); see for example http://blog.datadive.net/interpreting-random-forests/. There are also methods for extracting, pruning, and summarizing random forest rules to make them human-readable; see for example the inTrees package in R.
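
To illustrate the decomposition for a single tree (a toy sketch of the idea, not the actual implementation behind that link; the node fields are mine): walking from root to leaf, each split's feature is credited with the change in the node mean along the path, so the leaf's prediction ends up as the root mean (the bias) plus the sum of the per-feature contributions.

    #include <vector>

    struct Node {
        int   feature;      // feature used at this split, -1 for a leaf
        float threshold;
        int   left, right;  // child indices into the same vector
        float mean;         // mean training target of samples reaching this node
    };

    // Per-feature contributions for one sample x; the tree's prediction equals
    // tree[0].mean (the bias) plus the sum of these contributions.
    std::vector<float> contributions(const std::vector<Node>& tree,
                                     const float* x, int n_features) {
        std::vector<float> contrib(n_features, 0.0f);
        int i = 0;
        while (tree[i].feature >= 0) {
            int next = (x[tree[i].feature] <= tree[i].threshold) ? tree[i].left
                                                                 : tree[i].right;
            contrib[tree[i].feature] += tree[next].mean - tree[i].mean;  // credit the split's feature
            i = next;
        }
        return contrib;
    }

For a forest you average these contributions over the trees, which gives a per-prediction breakdown analogous to the coefficients-times-values view of a linear model.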


This is pretty much exactly the message of a talk I attended at CU Boulder. The Microsoft researcher spoke about how, for classifying whether a given medical patient will be re-admitted to the hospital, linear regression offered much more valuable insight into the relationships different diseases had with the probability of re-admission.

Moreover, not-so-obviously strange rules learned by more complex models were exposed as strange by the information-dense pictorial representation of linear regression.


Here's an interesting paper on learning methods for high-dimensional data: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.149...

The abstract:

In this paper we perform an empirical evaluation of supervised learning on high-dimensional data. We evaluate performance on three metrics: accuracy, AUC, and squared loss and study the effect of increasing dimensionality on the performance of the learning algorithms. Our findings are consistent with previous studies for problems of relatively low dimension, but suggest that as dimensionality increases the relative performance of the learning algorithms changes. To our surprise, the method that performs consistently well across all dimensions is random forests, followed by neural nets, boosted trees, and SVMs.


Another paper comparing classifier methods, that also concludes RF is one of the best: http://jmlr.org/papers/volume15/delgado14a/delgado14a.pdf


Any recommendations for beginner materials about this and other modeling and machine learning topics? I worked for a few years with the folks behind SciPy and NumPy and iPython, and helped package the suite of tools many use for this kind of work, and so I have quite a bit of familiarity with the tools for experimenting with this stuff, but I never dug into the math and I'm not even sure where to start on it.


A bigger-picture "unreasonable effectiveness" article would be "The Unreasonable Effectiveness of Weighted Majority Votes in Machine Learning."

This Monday I'm publishing a post on [1] that gives some rigorous explanation for this, which essentially covers the results of this paper [2] whose main theorem is a claim about majority voting schemes.

[1]: http://jeremykun.com/

[2]: http://cseweb.ucsd.edu/~yfreund/papers/BoostingtheMargin.pdf


Thanks, [2] is a very nice paper with intuitive ideas and insights. To summarize it: when most of the voters agree and only a few disagree, there is good reason to expect that the generalization will be possible and effective.


One thing to watch out for when using RFs is when your test set doesn't have the same factor levels as your training data. This can often lead to a significant bias in the prediction. I can't find a great paper on this, but there is some discussion here: http://stats.stackexchange.com/questions/29446/random-forest...


I really enjoyed the point made in the first paragraph, especially the following phrase:

> Some like SVMs for the elegance of their formulation or the quality of the available implementations, some like decision rules for their simplicity and interpretability, and some are crazy about neural networks for their flexibility.

Nowadays it seems that ANNs (and related algorithms) get all the credit. It's refreshing to see someone point out why some people actually prefer other methods.


aside: There really needs to be a law :) capping cpu cycles / information gained. Energy ain't free. Emissions are not harmless. OP appears to be a gas-guzzler algorithm.


What are you even talking about ??


The article. Did you read it all the way to the end?



