
No, it's not the same as overfitting. It's nothing to do with the algorithm at all. The problem is with the training data, which contains patterns induced by the way the data was collected rather than the reality of what you want to classify.

For example, imagine that you wanted to train an algorithm to distinguish photos of dogs from photos of humans. So you collect a bunch of photos of both dogs and humans and use them to train a classifier. You do all the proper cross-validation, bootstrapping, etc. to ensure that you are not overfitting, and you get really good results. Then, looking at the misclassifications, you notice something: all the photos taken at a downward angle toward the ground are classified as dog photos, and all the photos taken facing straight ahead are classified as human photos. It turns out that in your training set, most of the dog photos are taken at a downward angle while most of the human photos are taken facing straight ahead, because humans are taller than dogs, and your machine learning algorithm identified this feature as the most reliable way to distinguish the two groups of photos in your training set.

In this hypothetical example, no overfitting occurred. The difference in photo angles is a real difference in the training sets that you provided to the algorithm, and the algorithm did its job and correctly identified this difference between the two groups of photos as a reliable predictor. The problem is that your training set has a variable (photo angle) that is highly correlated with what you want to classify (species). This is considered an unwanted bias (and not a reliable indicator) because the correlation is caused by the means of data collection (most photos are taken from human head height) and has nothing to do with the subject of the photos.
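A minimal synthetic sketch of that failure mode (the "photos," angles, and threshold below are all made up for illustration, not a real pipeline):

```python
import random

random.seed(0)

# Hypothetical synthetic "photos": just a camera angle and a label.
# Collection bias: dog photos are mostly shot at a downward angle and
# human photos straight ahead, because photographers are human-height.
def biased_sample():
    if random.random() < 0.5:
        return (random.gauss(-30, 10), "dog")    # tilted down
    return (random.gauss(0, 10), "human")        # straight ahead

# A classifier that only looks at the angle and ignores the subject.
def classify(angle):
    return "dog" if angle < -15 else "human"

def accuracy(data):
    return sum(classify(a) == y for a, y in data) / len(data)

# On a second sample collected the same biased way, the spurious
# feature keeps working, so a held-out test set won't flag anything.
biased_test = [biased_sample() for _ in range(2000)]
print(round(accuracy(biased_test), 2))  # roughly 0.93

# On photos where the angle is unrelated to the subject, it collapses.
unbiased_test = [(random.gauss(-15, 15), y)
                 for y in ["dog", "human"] * 1000]
print(round(accuracy(unbiased_test), 2))  # roughly 0.5
```

Note that the held-out biased test set looks fine, which is why standard cross-validation never catches this.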



I think you’re arguing semantics a bit. What you’re saying checks out, but one could say that overfitting was occurring but the test dataset distribution was not wide enough to catch it.


As I understand it, if it's overfitting, testing on a second random sample gathered in the same way as the training set should degrade performance. That's not the case here.

(Though maybe the term as used in industry is less strict.)


It's not semantics. Overfitting is a completely unrelated problem. Overfitting is an issue with the machine learning algorithm and how you're applying it, and can be fixed by changing how you use the algorithm, while what we're talking about here is a problem with the training data. There is nothing that can fix a biased training data set short of getting new training data that doesn't contain that bias.

It would be like if your car was driving in circles and you called a mechanic to fix your steering, and they told you that the actual problem was that both right wheels were missing. That's not a steering problem, and no repair to the steering system will fix it. The only fix is to put new wheels on.


Overfitting is about fitting too precisely to the sample inputs, such as if "downward angle" + "brown blob" (one specific dog breed) + "leash" + "lots of green" (grass) were all required to identify a dog. GP's example wasn't that; it was just identifying the wrong thing.
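A toy sketch of that distinction, with entirely hypothetical features: a memorized conjunction is perfect on the training examples but breaks on any dog that deviates from them.

```python
# Hypothetical training "photos" as feature dicts.
train = [
    ({"angle": "down", "color": "brown", "leash": True, "grass": True}, "dog"),
    ({"angle": "ahead", "color": "brown", "leash": False, "grass": False}, "human"),
]

def overfit_classify(photo):
    # Overfit rule: demands the exact conjunction seen in training.
    if (photo["angle"] == "down" and photo["color"] == "brown"
            and photo["leash"] and photo["grass"]):
        return "dog"
    return "human"

# Perfect on the training set...
print(all(overfit_classify(x) == y for x, y in train))  # True

# ...but a black dog off its leash is misclassified.
print(overfit_classify({"angle": "down", "color": "black",
                        "leash": False, "grass": True}))  # human
```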


Instead of overfitting, this is more related to exploitation vs. exploration. That we see more men in programming might just be because women are not given opportunities to explore programming as a career.

When AI makes a decision, right now, people only use the probability output. If hiring A has a 0.6 probability while hiring B has 0.4, then we will hire A instead of B. However, if we consider the confidence intervals, the decision might not be that clear. Say it's +/- 0.5 for A but +/- 0.2 for B. If exploration is considered too, it's very likely that we would give B a chance.
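A rough sketch of that idea. The scores and interval widths mirror the numbers above; sampling a plausible score from each interval (a Thompson-sampling-style rule) is just one illustrative way to fold exploration in:

```python
import random

random.seed(1)

# Hypothetical hiring scores: point estimate plus an uncertainty
# half-width standing in for a confidence interval.
candidates = {
    "A": {"score": 0.6, "half_width": 0.5},  # wide interval: little data
    "B": {"score": 0.4, "half_width": 0.2},  # narrow interval: well measured
}

# Point-estimate rule: always pick the higher score, so A always wins.
greedy = max(candidates, key=lambda c: candidates[c]["score"])
print(greedy)  # A

# Exploration-aware rule: draw a plausible score from each interval
# and pick the best draw, so B sometimes wins.
def sample_pick():
    draws = {c: random.uniform(v["score"] - v["half_width"],
                               v["score"] + v["half_width"])
             for c, v in candidates.items()}
    return max(draws, key=draws.get)

picks = [sample_pick() for _ in range(10000)]
print(round(picks.count("B") / len(picks), 2))  # B wins a meaningful fraction
```

With these particular intervals, B wins roughly 30% of the draws, so B genuinely gets a chance instead of never being hired.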

AI is in the realm of probabilistic decision making, which is not how people normally reason. The bias is not from the training side; it's the decision-making process incorporating AI that should change.



