{"id":191,"date":"2018-01-11T04:53:00","date_gmt":"2018-01-11T04:53:00","guid":{"rendered":"https:\/\/kindsonthegenius.com\/blog\/2018\/01\/11\/bias-variance-trade-off-in-classificationmachine-learning\/"},"modified":"2019-04-11T13:42:21","modified_gmt":"2019-04-11T11:42:21","slug":"bias-variance-trade-off-in-classificationmachine-learning","status":"publish","type":"post","link":"https:\/\/kindsonthegenius.com\/blog\/bias-variance-trade-off-in-classificationmachine-learning\/","title":{"rendered":"Bias\/Variance Trade-off in Classification(Machine Learning)"},"content":{"rendered":"<p>In this lesson, you will learn about Bias\/Variance Trade-off in Machine Learning.<br \/>\nThis is a concept in machine learning which refers to the problem of minimizing two error sources at the same time and this prevents the supervised learning algorithms from generalizing to accommodate inputs beyond the original training set.<\/p>\n<div style=\"clear: both; text-align: center;\"><a style=\"margin-left: 1em; margin-right: 1em;\" href=\"https:\/\/2.bp.blogspot.com\/-wf6laCULu9E\/WmRIbnGh69I\/AAAAAAAAA2s\/WOFLFrNRQzs0vzX7yqyjfBv-hNB02J0HACLcBGAs\/s1600\/Bias-Variance-Trade-off%2528Thumbnail%2529.jpg\">\u00a0<\/a><\/div>\n<p><strong>We would discuss the following:<\/strong><\/p>\n<ol>\n<li><a href=\"#t1\">What is Bias\/Variance Tradeoff (Definition)?<\/a><\/li>\n<li><a href=\"#t2\">Sources of Error<\/a><\/li>\n<li><a href=\"#t3\">Bias\/Variance Decomposition of Squared Error<\/a><\/li>\n<li><a href=\"#t4\">Trade-off: Minimizing Error<\/a><\/li>\n<li><a href=\"#t5\">Relationship to Underfitting and Overfitting<\/a><\/li>\n<li><a href=\"#t6\">Illustration of Bias-Variance Trade-off<\/a><\/li>\n<li><a href=\"#t7\">The Bias\/Variance Trade-off Graph<\/a><\/li>\n<li><a href=\"#t8\">Summary and Final Notes <\/a><\/li>\n<\/ol>\n<p>&nbsp;<\/p>\n<div style=\"clear: both; text-align: left;\">\n<p>&nbsp;<\/p>\n<h4><strong id=\"t1\">1. 
Definition of Bias-Variance Trade-off<\/strong><\/h4>\n<p>First, let&#8217;s take a simple definition. The Bias-Variance Trade-off refers to the property of a machine learning model whereby, as the bias of the model increases, the variance decreases, and as the bias decreases, the variance increases.<\/p>\n<\/div>\n<div style=\"clear: both; text-align: left;\">Therefore, the problem is to find the balance of bias and variance that makes the model optimal.<\/div>\n<div style=\"clear: both; text-align: left;\"><\/div>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<h4><strong id=\"t2\">2. Sources of Error<\/strong><\/h4>\n<p>We recall the problem of underfitting and overfitting when trying to fit a regression line through a set of data points.<br \/>\nIn the case of underfitting, bias is an error from a faulty assumption in the learning algorithm: when the bias is too large, the algorithm is unable to correctly model the relationship between the features and the target outputs.<\/p>\n<p>In the case of overfitting, variance is an error resulting from sensitivity to fluctuations in the training dataset. With a high variance, the algorithm may capture most of the training data points but would not be general enough to capture new data points. This is overfitting.<\/p>\n<p>The trade-off means that a model must be chosen carefully to both correctly capture the regularities in the training data and, at the same time, be general enough to correctly categorize new observations.<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<h4><strong id=\"t3\">3. 
Bias-Variance Decomposition of Squared Error<\/strong><\/h4>\n<p>Considering the squared loss function and the conditional distribution of the training data set, we can summarize the formula for the expected loss as:<\/p>\n<div style=\"text-align: center;\"><span style=\"color: black;\"><span style=\"font-family: 'times' , 'times new roman' , serif;\"><b><i>Expected Loss = (bias)<sup>2<\/sup> + variance + noise<\/i><\/b><\/span><\/span><\/div>\n<p>Now, let <i><span style=\"font-family: 'georgia' , 'times new roman' , serif;\"><span style=\"color: black;\">y = f(x)<\/span><\/span><\/i> represent the true relationship between the variables in the training data set.<br \/>\nAlso, let <span style=\"color: black;\"><i><span style=\"font-family: 'times' , 'times new roman' , serif;\">f'(x)<\/span><\/i><\/span> be the approximation of <span style=\"font-family: 'times' , 'times new roman' , serif;\"><span style=\"color: black;\"><i>f(x)<\/i><\/span><\/span> obtained through the learning process.<\/p>\n<p>Then we measure the mean squared error between <span style=\"color: black;\"><span style=\"font-family: 'times' , 'times new roman' , serif;\"><i>y<\/i><\/span><\/span> and <span style=\"color: black;\"><span style=\"font-family: 'times' , 'times new roman' , serif;\"><i>f'(x)<\/i><\/span><\/span>, which is given as:<br \/>\n<span style=\"font-size: large;\"><span style=\"font-family: 'times' , 'times new roman' , serif;\"><i><span style=\"color: black;\">(y &#8211; f'(x))<sup>2<\/sup><\/span><\/i><\/span><\/span><br \/>\nThis error is expected to be minimal.<\/p>\n<p>We can then write the original expected loss equation as:<\/p>\n<p><span style=\"font-size: large;\"><span style=\"font-family: 'times' , 'times new roman' , serif;\"><span style=\"color: black;\"><i>E[(y &#8211; f'(x))<sup>2<\/sup>] = Bias[f'(x)]<sup>2<\/sup> + Var[f'(x)] + \u03c3<sup>2<\/sup><\/i><\/span><\/span><\/span><\/p>\n<p>where:<br \/>\n<span style=\"font-size: large;\"><span style=\"font-family: 'times' , 'times new roman' , serif;\"><span style=\"color: black;\"><i>Bias[f'(x)] = E[f'(x)] &#8211; f(x)<\/i><\/span><\/span><\/span><\/p>\n<p>and<br \/>\n<span style=\"font-size: large;\"><span style=\"font-family: 'times' , 'times new roman' , serif;\"><span style=\"color: black;\"><i>Var[f'(x)] = E[f'(x)<sup>2<\/sup>] &#8211; E[f'(x)]<sup>2<\/sup><\/i><\/span><\/span><\/span><\/p>\n<p>and<br \/>\n<span style=\"font-size: large;\"><span style=\"font-family: 'times' , 'times new roman' , serif;\"><span style=\"color: black;\"><i>\u03c3<sup>2<\/sup><\/i><\/span><\/span><\/span> represents the irreducible noise term in the equation.<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<h4><strong id=\"t4\">4. The Bias\/Variance Tradeoff<\/strong><\/h4>\n<p>The objective is to reduce the expected error to the minimum. This can be done by modifying the terms of the mean squared error. 
From the equation, we see that we can only modify the bias and the variance terms.<br \/>\nBias arises when we approximate the true relationship with a simplified function, while variance arises from sensitivity to the particular training sample used.<br \/>\nOne way to reduce the error is to reduce the bias and the variance terms. However, we cannot reduce both terms simultaneously, since reducing one term leads to an increase in the other. This is the idea of the bias-variance trade-off.<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<h4><strong id=\"t5\">5. Relationship with Underfitting and Overfitting<\/strong><\/h4>\n<p>A good model should do two things:<\/p>\n<ul>\n<li>Capture the patterns in the given training data set<\/li>\n<li>Correctly compute the output for a new instance<\/li>\n<\/ul>\n<p>The model should be complex enough to represent the data: the more complex the model, the better it represents the training data. However, there is a limit to how complex the model can get.<br \/>\nIf the model is too complex, then it will pick up specific random features (noise, for example) in the training data set.<br \/>\nIf the model is not complex enough, then it might miss out on important dynamics of the given data.<\/p>\n<p>The problem where the chosen model is too complex, and becomes specific to the training data set, is called overfitting.<\/p>\n<p>The problem where the model is not complex enough and misses out on the important features of the data is called underfitting.<\/p>\n<p>It is generally impossible to minimize the two errors at the same time, and this trade-off is what is known as the bias\/variance trade-off.<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<h4><strong id=\"t6\">6. 
Illustration of Bias-Variance Trade-off<\/strong><\/h4>\n<p>Assume you have several training data sets drawn from the same population:<\/p>\n<ul>\n<li>Training Data 1<\/li>\n<li>Training Data 2<\/li>\n<li>Training Data 3<\/li>\n<\/ul>\n<table style=\"margin-left: auto; margin-right: auto; text-align: center;\" cellspacing=\"0\" cellpadding=\"0\" align=\"center\">\n<tbody>\n<tr>\n<td style=\"text-align: center;\"><a style=\"margin-left: auto; margin-right: auto;\" href=\"https:\/\/2.bp.blogspot.com\/-aNArb9J77NU\/WmEyslDidbI\/AAAAAAAAA0A\/rLF_SQbRy7UtB7ITzqUqBDh5lSQhnFnwACLcBGAs\/s1600\/Bias-Variance-Trade-of-%2BSL-Algorithm.jpg\"><img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/2.bp.blogspot.com\/-aNArb9J77NU\/WmEyslDidbI\/AAAAAAAAA0A\/rLF_SQbRy7UtB7ITzqUqBDh5lSQhnFnwACLcBGAs\/s640\/Bias-Variance-Trade-of-%2BSL-Algorithm.jpg\" width=\"640\" height=\"185\" border=\"0\" data-original-height=\"284\" data-original-width=\"957\" \/><\/a><\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: center;\">Figure 1: Supervised Learning algorithm<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>These three data sets are passed through the same supervised learning algorithm, which produces three models:<\/p>\n<ul>\n<li>Model 1<\/li>\n<li>Model 2<\/li>\n<li>Model 3<\/li>\n<\/ul>\n<p>Now, let&#8217;s say we want to predict the output for a new input x. The three models should be able to produce the same output for the same new instance. 
But when you pass x into each of the models, instead of getting the same output, you get a different output (y<sub>1<\/sub>, y<sub>2<\/sub> and y<sub>3<\/sub>) for the same x.<br \/>\nThis is illustrated in Figure 2.<\/p>\n<table style=\"margin-left: auto; margin-right: auto; text-align: center;\" cellspacing=\"0\" cellpadding=\"0\" align=\"center\">\n<tbody>\n<tr>\n<td style=\"text-align: center;\"><a style=\"margin-left: auto; margin-right: auto;\" href=\"https:\/\/4.bp.blogspot.com\/-c7o21RqawPg\/WmRE10hhO9I\/AAAAAAAAA2g\/WpEK4tyWe2YC1_FJQgdC_qwCJnlGK93xQCEwYBhgL\/s1600\/Bias-Variance-Trade-off-Overfitting.jpg.png\"><img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/4.bp.blogspot.com\/-c7o21RqawPg\/WmRE10hhO9I\/AAAAAAAAA2g\/WpEK4tyWe2YC1_FJQgdC_qwCJnlGK93xQCEwYBhgL\/s320\/Bias-Variance-Trade-off-Overfitting.jpg.png\" width=\"320\" height=\"130\" border=\"0\" data-original-height=\"286\" data-original-width=\"702\" \/><\/a><\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: center;\">Figure 2: High Variance Error<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The problem here is that the models have become so specific that they cannot produce the correct output for a new value of x.<br \/>\nIn this case, the algorithm is said to have a <i><span style=\"color: #990000;\">high-variance error<\/span><\/i>, which results in the problem of overfitting.<\/p>\n<p>&nbsp;<\/p>\n<p>Let&#8217;s also assume that you pass different values of <i><span style=\"font-family: 'times' , 'times new roman' , serif;\">x (x<sub>1<\/sub>, x<sub>2<\/sub> and x<sub>3<\/sub>)<\/span><\/i> into the same model. Instead of getting different outputs, you get the same output y. In this case, the algorithm is said to have a <span style=\"color: #990000;\"><i>high bias error<\/i><\/span>, which results in the problem of underfitting. 
This is illustrated in Figure 3 below:<\/p>\n<table style=\"margin-left: auto; margin-right: auto; text-align: center;\" cellspacing=\"0\" cellpadding=\"0\" align=\"center\">\n<tbody>\n<tr>\n<td style=\"text-align: center;\"><a style=\"margin-left: auto; margin-right: auto;\" href=\"https:\/\/2.bp.blogspot.com\/-tRSCkBoi18g\/WmE3bbGgAVI\/AAAAAAAAA0Q\/RAvzLLwxKAcX4YOHac6Xz-MjsnRP7V7pgCLcBGAs\/s1600\/Bias-Variance-Trade-off-Underfitting.jpg\"><img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/2.bp.blogspot.com\/-tRSCkBoi18g\/WmE3bbGgAVI\/AAAAAAAAA0Q\/RAvzLLwxKAcX4YOHac6Xz-MjsnRP7V7pgCLcBGAs\/s400\/Bias-Variance-Trade-off-Underfitting.jpg\" width=\"400\" height=\"136\" border=\"0\" data-original-height=\"236\" data-original-width=\"687\" \/><\/a><\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: center;\">Figure 3: High Bias Error<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><i>High variance<\/i> means that the algorithm has become too specific.<br \/>\n<i>High bias <\/i>means that the algorithm has failed to understand the pattern in the input data.<br \/>\nIt&#8217;s generally not possible to minimize both errors simultaneously, since high bias would always mean low variance, whereas low bias would always mean high variance.<br \/>\nFinding a trade-off between the two extremes is known as the <span style=\"color: #990000;\"><i>Bias\/Variance Tradeoff.<\/i><\/span><\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<h4><strong id=\"t7\">7. 
Explanation of the Bias\/Variance Graph<\/strong><\/h4>\n<p>The graph below is a typical plot of the bias\/variance trade-off, which we will briefly examine.<\/p>\n<div style=\"clear: both; text-align: center;\"><a style=\"margin-left: 1em; margin-right: 1em;\" href=\"https:\/\/4.bp.blogspot.com\/-7JpC5qZ9UsU\/WmXKu0n_-KI\/AAAAAAAAA5k\/FkkLGfLGs3chzJ5_FCaK6Lj2JSSulnf4wCLcBGAs\/s1600\/Bias-Variance-Trade-of-Graph.jpg\"><img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/4.bp.blogspot.com\/-7JpC5qZ9UsU\/WmXKu0n_-KI\/AAAAAAAAA5k\/FkkLGfLGs3chzJ5_FCaK6Lj2JSSulnf4wCLcBGAs\/s400\/Bias-Variance-Trade-of-Graph.jpg\" width=\"400\" height=\"256\" border=\"0\" data-original-height=\"311\" data-original-width=\"484\" \/><\/a><\/div>\n<p>The bias\/variance graph shows a plot of Error against Model Complexity. It also shows:<br \/>\n<i>Relationship of variance and Model Complexity:<\/i> As the model complexity increases, the variance increases.<br \/>\n<i>Relationship of bias and Model Complexity<\/i>: As the model complexity increases, the bias decreases.<br \/>\n<i>Relationship of variance and Error<\/i>: As the variance increases, the error increases.<br \/>\n<i>Relationship of bias and Error<\/i>: As the bias increases, the error increases.<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<h4><strong id=\"t8\">8. Final Notes<\/strong><\/h4>\n<p>With the above assumptions, we could go on to derive the bias-variance decomposition of the squared error in full, but that will be covered in a different lesson.<\/p>\n<p>Thank you for reading, and do remember to leave a comment if you have any challenges following the explanation.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In this lesson, you will learn about Bias\/Variance Trade-off in Machine Learning. 
This is a concept in machine learning which refers to the problem of &hellip; <\/p>\n","protected":false},"author":1,"featured_media":878,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_mi_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0},"categories":[16,392],"tags":[548,547,546],"_links":{"self":[{"href":"https:\/\/kindsonthegenius.com\/blog\/wp-json\/wp\/v2\/posts\/191"}],"collection":[{"href":"https:\/\/kindsonthegenius.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/kindsonthegenius.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/kindsonthegenius.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/kindsonthegenius.com\/blog\/wp-json\/wp\/v2\/comments?post=191"}],"version-history":[{"count":2,"href":"https:\/\/kindsonthegenius.com\/blog\/wp-json\/wp\/v2\/posts\/191\/revisions"}],"predecessor-version":[{"id":879,"href":"https:\/\/kindsonthegenius.com\/blog\/wp-json\/wp\/v2\/posts\/191\/revisions\/879"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/kindsonthegenius.com\/blog\/wp-json\/wp\/v2\/media\/878"}],"wp:attachment":[{"href":"https:\/\/kindsonthegenius.com\/blog\/wp-json\/wp\/v2\/media?parent=191"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/kindsonthegenius.com\/blog\/wp-json\/wp\/v2\/categories?post=191"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/kindsonthegenius.com\/blog\/wp-json\/wp\/v2\/tags?post=191"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}