- Maximum a Posteriori estimation is a probabilistic framework for solving the problem of density estimation.
- MAP involves calculating a conditional probability of observing the data given a model weighted by a prior probability or belief about the model.
- MAP provides an alternate probability framework to maximum likelihood estimation for machine learning.
- Density Estimation
- Maximum a Posteriori (MAP)
- MAP and Machine Learning
- Maximum a Posteriori (MAP), a Bayesian method.
- Maximum Likelihood Estimation (MLE), a frequentist method.
- P(X ; theta)
- P(x1, x2, x3, …, xn ; theta)
- maximize P(X ; theta)
- P(A | B) = (P(B | A) * P(A)) / P(B)
- P(A | B) is proportional to P(B | A) * P(A)
- P(A | B) = P(B | A) * P(A)
- P(theta | X) = P(X | theta) * P(theta)
- maximize P(X | theta) * P(theta)
- maximize P(X | h) * P(h)
- Chapter 6 Bayesian Learning, Machine Learning, 1997.
- Chapter 12 Maximum Entropy Models, Foundations of Machine Learning, 2018.
- Chapter 9 Probabilistic methods, Data Mining: Practical Machine Learning Tools and Techniques, 4th edition, 2016.
- Chapter 5 Machine Learning Basics, Deep Learning, 2016.
- Chapter 13 MAP Inference, Probabilistic Graphical Models: Principles and Techniques, 2009.
- Maximum a posteriori estimation, Wikipedia.
- Bayesian statistics, Wikipedia.
- GPU-accelerated search engine
- Intelligent index
- Strong scalability
- High compatibility
- Logistic regression is a linear model for binary classification predictive modeling.
- The linear part of the model predicts the log-odds of an example belonging to class 1, which is converted to a probability via the logistic function.
- The parameters of the model can be estimated by maximizing a likelihood function that predicts the mean of a Bernoulli distribution for each example.
- Logistic Regression
- Logistic Regression and Log-Odds
- Maximum Likelihood Estimation
- Logistic Regression as Maximum Likelihood
- yhat = model(X)
- yhat = beta0 + beta1 * x1 + beta2 * x2 + … + betam * xm
- y = X * Beta
- f(x) = 1 / (1 + exp(-x))
- yhat = 1 / (1 + exp(-(X * Beta)))
- Least Squares Optimization (iteratively reweighted least squares).
- Maximum Likelihood Estimation.
- log-odds = beta0 + beta1 * x1 + beta2 * x2 + … + betam * xm
- odds of success = p / (1 – p)
- log-odds = log(p / (1 – p))
- log-odds = beta0 + beta1 * x1 + beta2 * x2 + … + betam * xm
- odds = exp(log-odds)
- odds = exp(beta0 + beta1 * x1 + beta2 * x2 + … + betam * xm)
- p = odds / (odds + 1)
- p = 1 / (1 + exp(-log-odds))
- P(X ; theta)
- P(x1, x2, x3, …, xn ; theta)
- L(X ; theta)
- sum i to n log(P(xi ; theta))
- minimize -sum i to n log(P(xi ; theta))
- maximize sum i to n log(P(xi ; h))
- P(y | X)
- maximize sum i to n log(P(yi|xi ; h))
- P(y=1) = p
- P(y=0) = 1 – p
- mean = P(y=1) * 1 + P(y=0) * 0
- mean = p * 1 + (1 – p) * 0
- likelihood = yhat * y + (1 – yhat) * (1 – y)
- log-likelihood = log(yhat) * y + log(1 – yhat) * (1 – y)
- maximize sum i to n log(yhat_i) * y_i + log(1 – yhat_i) * (1 – y_i)
- minimize sum i to n -(log(yhat_i) * y_i + log(1 – yhat_i) * (1 – y_i))
- cross entropy = -(log(q(class0)) * p(class0) + log(q(class1)) * p(class1))
- A Gentle Introduction to Maximum Likelihood Estimation for Machine Learning
- How To Implement Logistic Regression From Scratch in Python
- Logistic Regression Tutorial for Machine Learning
- Logistic Regression for Machine Learning
- Section 4.4.1 Fitting Logistic Regression Models, The Elements of Statistical Learning, 2016.
- Section 4.3.2 Logistic regression, Pattern Recognition and Machine Learning, 2006.
- Chapter 8 Logistic regression, Machine Learning: A Probabilistic Perspective, 2012.
- Chapter 4 Algorithms: the basic methods, Data Mining: Practical Machine Learning Tools and Techniques, 4th edition, 2016.
- Section 18.6.4 Linear classification with logistic regression, Artificial Intelligence: A Modern Approach, 3rd edition, 2009.
- Section 12.2 Logistic Regression, Applied Predictive Modeling, 2013.
- Section 4.3 Logistic Regression, An Introduction to Statistical Learning with Applications in R, 2017.
- Maximum likelihood estimation, Wikipedia.
- Likelihood function, Wikipedia.
- Logistic regression, Wikipedia.
- Logistic function, Wikipedia.
- Odds, Wikipedia.
- Linear regression is a model for predicting a numerical quantity and maximum likelihood estimation is a probabilistic framework for estimating model parameters.
- Coefficients of a linear regression model can be estimated using a negative log-likelihood function from maximum likelihood estimation.
- The negative log-likelihood function can be used to derive the least squares solution to linear regression.
- Linear Regression
- Maximum Likelihood Estimation
- Linear Regression as Maximum Likelihood
- Least Squares and Maximum Likelihood
- yhat = model(X)
- yhat = beta0 + beta1 * x1 + beta2 * x2 + … + betam * xm
- y = X * Beta
- Least Squares Optimization.
- Maximum Likelihood Estimation.
- How to Solve Linear Regression Using Linear Algebra
- P(X ; theta)
- P(x1, x2, x3, …, xn ; theta)
- L(X ; theta)
- sum i to n log(P(xi ; theta))
- minimize -sum i to n log(P(xi ; theta))
- maximize sum i to n log(P(xi ; h))
- P(y | X)
- maximize sum i to n log(P(yi|xi ; h))
- f(x) = (1 / sqrt(2 * pi * sigma^2)) * exp(- 1/(2 * sigma^2) * (x – mu)^2 )
- maximize product i to n (1 / sqrt(2 * pi * sigma^2)) * exp(-1/(2 * sigma^2) * (yi – h(xi, Beta))^2)
- maximize sum i to n log (1 / sqrt(2 * pi * sigma^2)) – (1/(2 * sigma^2) * (yi – h(xi, Beta))^2)
- minimize sum i to n (yi – h(xi, Beta))^2
- mse = 1/n * sum i to n (yi – yhat_i)^2
- maximize sum i to n log (1 / sqrt(2 * pi * sigma^2)) – (1/(2 * sigma^2) * (yi – h(xi, Beta))^2)
- maximize sum i to n – (1/(2 * sigma^2) * (yi – h(xi, Beta))^2)
- minimize sum i to n (1/(2 * sigma^2) * (yi – h(xi, Beta))^2)
- minimize sum i to n (yi – h(xi, Beta))^2
- How to Solve Linear Regression Using Linear Algebra
- How to Implement Linear Regression From Scratch in Python
- How To Implement Simple Linear Regression From Scratch With Python
- Linear Regression Tutorial Using Gradient Descent for Machine Learning
- Simple Linear Regression Tutorial for Machine Learning
- Linear Regression for Machine Learning
- Section 15.1 Least Squares as a Maximum Likelihood Estimator, Numerical Recipes in C: The Art of Scientific Computing, Second Edition, 1992.
- Chapter 5 Machine Learning Basics, Deep Learning, 2016.
- Section 2.6.3 Function Approximation, The Elements of Statistical Learning, 2016.
- Section 6.4 Maximum Likelihood and Least-Squares Error Hypotheses, Machine Learning, 1997.
- Section 3.1.1 Maximum likelihood and least squares, Pattern Recognition and Machine Learning, 2006.
- Section 7.3 Maximum likelihood estimation (least squares), Machine Learning: A Probabilistic Perspective, 2012.
- Maximum likelihood estimation, Wikipedia.
- Likelihood function, Wikipedia.
- Linear regression, Wikipedia.
- Maximum Likelihood Estimation is a probabilistic framework for solving the problem of density estimation.
- It involves maximizing a likelihood function in order to find the probability distribution and parameters that best explain the observed data.
- It provides a framework for predictive modeling in machine learning where finding model parameters can be framed as an optimization problem.
- Problem of Probability Density Estimation
- Maximum Likelihood Estimation
- Relationship to Machine Learning
- How do you choose the probability distribution function?
- How do you choose the parameters for the probability distribution function?
- Maximum a Posteriori (MAP), a Bayesian method.
- Maximum Likelihood Estimation (MLE), a frequentist method.
- P(X | theta)
- P(X ; theta)
- P(x1, x2, x3, …, xn ; theta)
- L(X ; theta)
- maximize L(X ; theta)
- L(x1, x2, x3, …, xn ; theta)
- product i to n P(xi ; theta)
- sum i to n log(P(xi ; theta))
- minimize -sum i to n log(P(xi ; theta))
- P(X ; h)
- maximize L(X ; h)
- maximize sum i to n log(P(xi ; h))
- Clustering algorithms.
- maximize L(y|X ; h)
- maximize sum i to n log(P(yi|xi ; h))
- Linear Regression, for predicting a numerical value.
- Logistic Regression, for binary classification.
- Chapter 5 Machine Learning Basics, Deep Learning, 2016.
- Chapter 2 Probability Distributions, Pattern Recognition and Machine Learning, 2006.
- Chapter 8 Model Inference and Averaging, The Elements of Statistical Learning, 2016.
- Chapter 9 Probabilistic methods, Data Mining: Practical Machine Learning Tools and Techniques, 4th edition, 2016.
- Chapter 22 Maximum Likelihood and Clustering, Information Theory, Inference and Learning Algorithms, 2003.
- Chapter 8 Learning distributions, Bayesian Reasoning and Machine Learning, 2011.
- Maximum likelihood estimation, Wikipedia.
- Maximum Likelihood, Wolfram MathWorld.
- Likelihood function, Wikipedia.
- Some problems understanding the definition of a function in a maximum likelihood method, CrossValidated.
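The linear regression likelihood listed above (maximize the Gaussian likelihood; equivalently, minimize the sum of squared errors) can be sanity-checked numerically. The data, the fixed sigma, and the search grid below are assumptions for illustration only:

```python
import math

# Synthetic data (assumed): roughly y = 2 * x plus noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 2.1, 3.9, 6.2, 7.9]
sigma = 0.5  # assumed fixed noise scale

def neg_log_likelihood(beta):
    # minimize -sum i to n log N(yi | beta * xi, sigma^2)
    return sum(
        0.5 * math.log(2 * math.pi * sigma ** 2)
        + (y - beta * x) ** 2 / (2 * sigma ** 2)
        for x, y in zip(xs, ys)
    )

def sum_squared_error(beta):
    # minimize sum i to n (yi - beta * xi)^2
    return sum((y - beta * x) ** 2 for x, y in zip(xs, ys))

# Grid search both objectives over candidate slopes 1.00 .. 3.00.
grid = [b / 100 for b in range(100, 301)]
beta_mle = min(grid, key=neg_log_likelihood)
beta_lsq = min(grid, key=sum_squared_error)
print(beta_mle, beta_lsq)  # both criteria select the same slope
```

Because the negative log-likelihood differs from the squared-error loss only by an additive constant and a positive scale, the two grid searches always agree on the slope.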

## [P] Neural Network based Terrain Generator

https://preview.redd.it/hcibsplhirx31.jpg?width=1280&format=pjpg&auto=webp&s=1094e2e472b64aaedee7bcb275c8cf7e83cf52e7 Official site: https://apseren.com/mlterraform/ submitted by /u/apseren |

## A Gentle Introduction to Maximum a Posteriori (MAP) for Machine Learning

Density estimation is the problem of estimating the probability distribution for a sample of observations from a problem domain.

Typically, estimating the entire distribution is intractable, and instead, we are happy with a point estimate from the distribution, such as the mean or mode. Maximum a Posteriori, or MAP for short, is a Bayesian approach to estimating a distribution and the model parameters that best explain an observed dataset.

This flexible probabilistic framework can be used to provide a Bayesian foundation for many machine learning algorithms, including important methods such as linear regression and logistic regression for predicting numeric values and class labels respectively. Unlike maximum likelihood estimation, it explicitly allows prior beliefs about candidate models to be incorporated systematically.

In this post, you will discover a gentle introduction to Maximum a Posteriori estimation.

After reading this post, you will know:

Discover Bayes optimization, naive Bayes, maximum likelihood, distributions, cross entropy, and much more in my new book, with 28 step-by-step tutorials and full Python source code.

Let’s get started.

## Overview

This tutorial is divided into three parts; they are:

- Density Estimation
- Maximum a Posteriori (MAP)
- MAP and Machine Learning

## Density Estimation

A common modeling problem involves how to estimate a joint probability distribution for a dataset.

For example, consider a sample of observations (*X*) from a domain (*x1, x2, x3, …, xn*), where each observation is drawn independently from the domain with the same probability distribution (so-called independent and identically distributed, i.i.d., or close to it).

Density estimation involves selecting a probability distribution function and the parameters of that distribution that best explain the joint probability distribution of the observed data (*X*).

Often estimating the density is too challenging; instead, we are happy with a point estimate from the target distribution, such as the mean.

There are many techniques for solving this problem, although two common approaches are:

- Maximum a Posteriori (MAP), a Bayesian method.
- Maximum Likelihood Estimation (MLE), a frequentist method.

Both approaches frame the problem as optimization and involve searching for a distribution and set of parameters for the distribution that best describes the observed data.

In Maximum Likelihood Estimation, we wish to maximize the probability of observing the data from the joint probability distribution given a specific probability distribution and its parameters, stated formally as:

- P(X ; theta)

or

- P(x1, x2, x3, …, xn ; theta)

This resulting conditional probability is referred to as the likelihood of observing the data given the model parameters.

The objective of Maximum Likelihood Estimation is to find the set of parameters (*theta*) that maximize the likelihood function, e.g. result in the largest likelihood value.

An alternative and closely related approach is to consider the optimization problem from the perspective of Bayesian probability.

A popular replacement for maximizing the likelihood is maximizing the Bayesian posterior probability density of the parameters instead.

— Page 306, Information Theory, Inference and Learning Algorithms, 2003.
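As a concrete sketch of maximizing the likelihood via a sum of log-probabilities, the snippet below (with assumed toy data and an assumed known sigma) grid-searches the mean of a Gaussian; the maximizer coincides with the sample mean:

```python
import math

data = [4.8, 5.1, 5.3, 4.9, 5.4]  # assumed observations
sigma = 1.0                       # assumed known spread

def log_likelihood(mu):
    # sum i to n log P(xi ; mu) for a Gaussian density
    return sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2)
        - (x - mu) ** 2 / (2 * sigma ** 2)
        for x in data
    )

grid = [m / 100 for m in range(400, 601)]  # candidate means 4.00 .. 6.00
mu_mle = max(grid, key=log_likelihood)
sample_mean = sum(data) / len(data)
print(mu_mle, sample_mean)
```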

### Want to Learn Probability for Machine Learning

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.


## Maximum a Posteriori (MAP)

Recall that the Bayes theorem provides a principled way of calculating a conditional probability.

It involves calculating the conditional probability of one outcome given another outcome, using the inverse of this relationship, stated as follows:

- P(A | B) = (P(B | A) * P(A)) / P(B)

The quantity that we are calculating is typically referred to as the posterior probability of *A* given *B* and *P(A)* is referred to as the prior probability of *A*.

The normalizing constant *P(B)* can be removed, and the posterior can be shown to be proportional to the probability of *B* given *A* multiplied by the prior.

- P(A | B) is proportional to P(B | A) * P(A)

Or, simply:

- P(A | B) = P(B | A) * P(A)

This is a helpful simplification as we are not interested in estimating a probability, but instead in optimizing a quantity. A proportional quantity is good enough for this purpose.

We can now relate this calculation to our desire to estimate a distribution and parameters (*theta*) that best explain our dataset (*X*), as we described in the previous section. This can be stated as:

- P(theta | X) = P(X | theta) * P(theta)

or, framed as an optimization problem:

- maximize P(X | theta) * P(theta)

Maximizing this quantity over a range of theta solves an optimization problem for estimating the central tendency of the posterior probability (e.g. the mode of the distribution). As such, this technique is referred to as “*maximum a posteriori estimation*,” or MAP estimation for short, and sometimes simply “*maximum posterior estimation*.”

We are typically not calculating the full posterior probability distribution, and in fact, this may not be tractable for many problems of interest.

… finding MAP hypotheses is often much easier than Bayesian learning, because it requires solving an optimization problem instead of a large summation (or integration) problem.

— Page 804, Artificial Intelligence: A Modern Approach, 3rd edition, 2009.

Instead, we are calculating a point estimate, such as the mode of the distribution, i.e. the most common value (which, for the normal distribution, is the same as the mean).

One common reason for desiring a point estimate is that most operations involving the Bayesian posterior for most interesting models are intractable, and a point estimate offers a tractable approximation.

— Page 139, Deep Learning, 2016.

**Note**: this is very similar to Maximum Likelihood Estimation, with the addition of the prior probability over the distribution and parameters.

In fact, if we assume that all values of *theta* are equally likely because we don’t have any prior information (e.g. a uniform prior), then both calculations are equivalent.

Because of this equivalence, both MLE and MAP often converge to the same optimization problem for many machine learning algorithms. This is not always the case; if the calculation of the MLE and MAP optimization problem differ, the MLE and MAP solution found for an algorithm may also differ.

… the maximum likelihood hypothesis might not be the MAP hypothesis, but if one assumes uniform prior probabilities over the hypotheses then it is.

— Page 167, Machine Learning, 1997.
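A minimal numeric sketch of this point, using an assumed coin-flip example not from the text: grid-search maximize P(X | theta) * P(theta) for a Bernoulli parameter. With a uniform prior, the MAP estimate equals the MLE; with an informative prior pulling toward 0.5, the MAP estimate is shrunk toward that prior belief:

```python
import math

data = [1, 1, 1, 1, 0]  # assumed sample: 4 successes in 5 trials

def log_likelihood(theta):
    # sum of log P(xi | theta) for Bernoulli outcomes
    return sum(math.log(theta if x == 1 else 1.0 - theta) for x in data)

def log_beta_prior(theta, a=5.0, b=5.0):
    # unnormalized Beta(a, b) prior density; constants do not change the argmax
    return (a - 1.0) * math.log(theta) + (b - 1.0) * math.log(1.0 - theta)

grid = [t / 1000 for t in range(1, 1000)]  # avoid theta = 0 and theta = 1
theta_mle = max(grid, key=log_likelihood)
theta_map_uniform = max(grid, key=lambda t: log_likelihood(t))  # uniform prior adds a constant
theta_map = max(grid, key=lambda t: log_likelihood(t) + log_beta_prior(t))
print(theta_mle, theta_map)
```

With the uniform prior the two calculations are equivalent, as described above; with the Beta prior the estimate sits between the data frequency (0.8) and the prior belief (0.5).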

## MAP and Machine Learning

In machine learning, Maximum a Posteriori optimization provides a Bayesian probability framework for fitting model parameters to training data, and an alternative to the perhaps more common Maximum Likelihood Estimation framework.

Maximum a posteriori (MAP) learning selects a single most likely hypothesis given the data. The hypothesis prior is still used and the method is often more tractable than full Bayesian learning.

— Page 825, Artificial Intelligence: A Modern Approach, 3rd edition, 2009.

One framework is not better than another, and as mentioned, in many cases, both frameworks frame the same optimization problem from different perspectives.

Instead, MAP is appropriate for those problems where there is some prior information, e.g. where a meaningful prior can be set to weigh the choice of different distributions and parameters or model parameters. MLE is more appropriate where there is no such prior.

Bayesian methods can be used to determine the most probable hypothesis given the data – the maximum a posteriori (MAP) hypothesis. This is the optimal hypothesis in the sense that no other hypothesis is more likely.

— Page 197, Machine Learning, 1997.

In fact, the addition of the prior to the MLE can be thought of as a type of regularization of the MLE calculation. This insight allows other regularization methods (e.g. the L2 norm in models that use a weighted sum of inputs) to be interpreted under a framework of MAP Bayesian inference. For example, L2 is a bias or prior that assumes that the set of coefficients or weights has a small sum of squared values.

… in particular, L2 regularization is equivalent to MAP Bayesian inference with a Gaussian prior on the weights.

— Page 236, Deep Learning, 2016.
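The L2-as-Gaussian-prior point can be checked with a small sketch (data and scale values assumed for illustration): minimizing the negative log posterior for a single weight with a zero-mean Gaussian prior selects the same weight as least squares with an L2 penalty of strength sigma^2 / tau^2:

```python
# Assumed toy data and scales, for illustration only.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.2, 1.1, 1.9, 3.2]
sigma = 1.0  # noise scale of the Gaussian likelihood
tau = 0.5    # scale of the zero-mean Gaussian prior on the weight

def neg_log_posterior(w):
    # -log likelihood - log prior, dropping additive constants
    nll = sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / (2 * sigma ** 2)
    neg_log_prior = w ** 2 / (2 * tau ** 2)
    return nll + neg_log_prior

LAM = sigma ** 2 / tau ** 2  # the equivalent L2 penalty strength

def ridge_loss(w):
    # least squares plus an L2 penalty on the weight
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) + LAM * w ** 2

grid = [w / 1000 for w in range(0, 2001)]  # candidate weights 0.000 .. 2.000
w_map = min(grid, key=neg_log_posterior)
w_ridge = min(grid, key=ridge_loss)
print(w_map, w_ridge)  # the two objectives pick the same weight
```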

We can make the relationship between MAP and machine learning clearer by re-framing the optimization problem as being performed over candidate modeling hypotheses (*h* in *H*) instead of the more abstract distribution and parameters (*theta*); for example:

- maximize P(X | h) * P(h)

Here, we can see that we want a model or hypothesis (*h*) that best explains the observed training dataset (*X*) and that the prior (*P(h)*) is our belief about how useful a hypothesis is expected to be, generally, regardless of the training data. The optimization problem involves estimating the posterior probability for each candidate hypothesis.

We can determine the MAP hypotheses by using Bayes theorem to calculate the posterior probability of each candidate hypothesis.

— Page 157, Machine Learning, 1997.

Like MLE, solving the optimization problem depends on the choice of model. For simpler models, like linear regression, there are analytical solutions. For more complex models like logistic regression, numerical optimization is required that makes use of first- and second-order derivatives. For the more prickly problems, stochastic optimization algorithms may be required.

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

### Books

### Articles

## Summary

In this post, you discovered a gentle introduction to Maximum a Posteriori estimation.

Specifically, you learned:

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to Maximum a Posteriori (MAP) for Machine Learning appeared first on Machine Learning Mastery.

## [News] Microsoft's Project Silica succeeds in storing Superman film on a piece of glass.

Zoomed image of the 1978 film "Superman" stored in a piece of glass.

Cloud storage has entered our digital lives so seamlessly that we often neither realize we are using it nor appreciate that all this data is physically stored on hardware whose capacity growth is increasingly flattening. The problem is compounded by the fact that the amount of data each of us generates is increasing exponentially. Even if the infrastructure keeps up with this rising demand, hardware such as disk drives has a lifespan of around five years, which means that to keep data saved it has to be cyclically rewritten onto newer hardware.

A unique solution is to use the same ultrashort optical pulses used in LASIK surgery to store data in glass by permanently changing its structure. This can keep the data saved for centuries. Quartz glass also doesn't need energy-intensive air conditioning to keep the material at a constant temperature, or systems that remove moisture from the air, both of which could lower the environmental footprint of large-scale data storage.

Continue reading: https://latesttechnewswiki.blogspot.com/2019/11/microsoft-stores-superman-in-a-piece-of-glass.html

submitted by /u/Anirban_Hazra

## [D] Thoughts on Quantum Artificial Intelligence / Q Supremacy

"Quantum Computing: The Why and How ǀ Jonathan Baker, UChicago" https://www.youtube.com/watch?v=5kTiB_KDUj0

Hey 🙂 I just wanted to start a discussion on whether people think quantum algorithms will "revolutionize" machine learning. I'm not a quantum expert, so take my stance with a grain of salt. I watched several videos ("Quantum algorithm for solving linear equations" https://www.youtube.com/watch?v=KtIPAPyaPOg, "Seth Lloyd: Quantum Machine Learning" https://www.youtube.com/watch?v=wkBPp9UovVU, etc.) and read the relevant Wikipedia articles, and then came across the diagram above.

According to Jonathan Baker, there are 3 main future trends for QPCs; I just extrapolated his graphs. The green line is the most optimistic, utilising "co-design" (which I don't know the meaning of). The red line is less steep, and the blue is a straight-line continuation of the current qubit-count trend (notice the log10 scale). QAOA, or Quantum Approximate Optimization Algorithms, include quantum linear regression and possibly (I don't know) optimisation methods for backprop. The green line shows that by 2025 QPCs could be used for linear regression; the average case is 2035, and the worst case is 2045. To crack cryptography (i.e. Shor's algorithm), over 10,000 qubits are needed: by the green best case that will be 2032, the average is 2045, and the worst is 2067.

I was also reading the Wikipedia article "Quantum algorithm for linear systems of equations" https://en.wikipedia.org/wiki/Quantum_algorithm_for_linear_systems_of_equations, which highlights how solving X * beta = y or A * x = b takes O( log(P) * K^2 ), where K is the condition number and P is the number of coefficients in beta, while the best conjugate gradient method takes O( P * K ). More concretely, the "exponential speedup" (I think?) applies to sparse matrices. If you include error bounds, you get O( log(P) * K^2 / err ); for dense matrices you get O( sqrt(P) * log(P) * K^2 ).

The issue I see is that since these methods all include error bounds, it isn't necessarily fair to compare direct methods with quantum algorithms this way. A better comparison is against randomized methods, where an "exponential speedup" is also possible by sketching only log(N) rows. It's possible to apply randomized methods within quantum algorithms too, so in total you might get a staggering "exponential-exponential" speedup, but because quantum algorithms inherently have error, this would exaggerate the error a lot.

So what do people think about the potential of quantum algorithms for ML?

PS: The graph above is surprisingly a log10 plot (i.e. x10). This is clearly different from the Moore's Law graph (x2 for the number of transistors), but anyway, I'm guessing qubits don't follow Moore's Law.

submitted by /u/danielhanchen

## [P] Mask_RCNN for blurring advertisements on streets.

https://github.com/WannaFIy/mask_AD https://preview.redd.it/6dv43jvyy8w31.jpg?width=2048&format=pjpg&auto=webp&s=b9feedfeb74912aac43d53eb6fe03cd82cd7b08e submitted by /u/wannafIy |

## [P] Milvus: A big leap to scalable AI search engine

## The challenge with data search

The explosion in unstructured data, such as images, videos, sound recordings, and text, requires effective solutions for computer vision, voice recognition, and natural language processing. How to extract value from unstructured data poses a big challenge for many enterprises. AI, especially deep learning, has proved to be an effective solution. Vectorization of data features enables people to perform content-based search on unstructured data. For example, you can perform content-based image retrieval, including facial recognition and object detection.

https://preview.redd.it/20lpm6iqouv31.png?width=5148&format=png&auto=webp&s=75051c51002f71687a1ff2eae8f6b8690b2b388e

Now the challenge turns into how to search effectively among billions of vectors. That's what Milvus is designed for.

## What is Milvus?

Milvus is an open source distributed vector search engine that provides state-of-the-art similarity search and analysis of feature vectors and unstructured data. Some of its key features are:

- Milvus is designed for the largest scale of vector index. Its CPU/GPU heterogeneous computing architecture allows you to process data at a speed 1000 times faster.
- With a "Decide Your Own Algorithm" approach, you can embed machine learning and advanced algorithms into Milvus without the headache of complex data engineering or migrating data between disparate systems.
- Milvus is built on optimized indexing algorithms based on quantization, tree-based, and graph indexing methods. The data is stored and computed on a distributed architecture, which lets you scale data sizes up and down without redesigning the system.
- Milvus is compatible with major AI/ML models and programming languages such as C++, Java and Python.

https://preview.redd.it/2aadp060puv31.png?width=1275&format=png&auto=webp&s=ce1f18df54bba4744f58421efec4c84374d2ea3a

## Billion-Scale similarity search

You may follow this link for step-by-step procedures to carry out a performance test on 100 million vector search (SIFT1B). If you want, you can also try testing 1 billion vectors with Milvus. Here are the hardware requirements.

## Join us

Milvus has been open sourced lately. We greatly welcome contributors to join us in reinventing data science! Milvus on GitHub. Our Slack channel.

Check the original article: https://medium.com/@milvusio/milvus-a-big-leap-to-scalable-ai-search-engine-e9c5004543f

submitted by /u/rainmanwy

## A Gentle Introduction to Logistic Regression With Maximum Likelihood Estimation

Logistic regression is a model for binary classification predictive modeling.

The parameters of a logistic regression model can be estimated by the probabilistic framework called maximum likelihood estimation. Under this framework, a probability distribution for the target variable (class label) must be assumed and then a likelihood function defined that calculates the probability of observing the outcome given the input data and the model. This function can then be optimized to find the set of parameters that results in the largest sum likelihood over the training dataset.

The maximum likelihood approach to fitting a logistic regression model both aids in better understanding the form of the logistic regression model and provides a template that can be used for fitting classification models more generally. This is particularly true as the negative of the log-likelihood function used in the procedure can be shown to be equivalent to the cross-entropy loss function.
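The equivalence between the Bernoulli negative log-likelihood and cross-entropy can be checked directly. The labels and predicted probabilities below are assumed toy values:

```python
import math

y_true = [1, 0, 1, 1]        # assumed class labels
yhat = [0.9, 0.2, 0.7, 0.6]  # assumed predicted P(y=1) for each example

for y, p in zip(y_true, yhat):
    # Bernoulli likelihood of the observed label: yhat * y + (1 - yhat) * (1 - y)
    likelihood = p * y + (1 - p) * (1 - y)
    # Cross-entropy form: log(yhat) * y + log(1 - yhat) * (1 - y)
    log_lik = math.log(p) * y + math.log(1 - p) * (1 - y)
    assert abs(math.log(likelihood) - log_lik) < 1e-12  # identical for binary y

# The loss minimized in training is the summed negative log-likelihood.
loss = -sum(math.log(p) * y + math.log(1 - p) * (1 - y) for y, p in zip(y_true, yhat))
print(loss)
```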

In this post, you will discover logistic regression with maximum likelihood estimation.

After reading this post, you will know:

Discover Bayes optimization, naive Bayes, maximum likelihood, distributions, cross entropy, and much more in my new book, with 28 step-by-step tutorials and full Python source code.

Let’s get started.

## Overview

This tutorial is divided into four parts; they are:

- Logistic Regression
- Logistic Regression and Log-Odds
- Maximum Likelihood Estimation
- Logistic Regression as Maximum Likelihood

## Logistic Regression

Logistic regression is a classical linear method for binary classification.

Classification predictive modeling problems are those that require the prediction of a class label (e.g. ‘*red*‘, ‘*green*‘, ‘*blue*‘) for a given set of input variables. Binary classification refers to those classification problems that have two class labels, e.g. true/false or 0/1.

Logistic regression has a lot in common with linear regression, although linear regression is a technique for predicting a numerical value, not for classification problems. Both techniques model the target variable with a line (or hyperplane, depending on the number of dimensions of the input). Linear regression fits the line to the data, which can be used to predict a new quantity, whereas logistic regression fits a line to best separate the two classes.

The input data is denoted as *X* with n examples and the output is denoted *y* with one output for each input. The prediction of the model for a given input is denoted as *yhat*.

The model is defined in terms of parameters called coefficients (*beta*), where there is one coefficient per input and an additional coefficient that provides the intercept or bias.

For example, a problem with inputs *X* with m variables *x1, x2, …, xm* will have coefficients *beta1, beta2, …, betam*, and *beta0*. A given input is predicted as the weighted sum of the inputs for the example and the coefficients.

- yhat = beta0 + beta1 * x1 + beta2 * x2 + … + betam * xm

The model can also be described using linear algebra, with a vector for the coefficients (*Beta*), a matrix for the input data (*X*), and a vector for the output (*y*).

- y = X * Beta

So far, this is identical to linear regression and is insufficient as the output will be a real value instead of a class label.

Instead, the model squashes the output of this weighted sum using a nonlinear function to ensure the outputs are a value between 0 and 1.

The logistic function (also called the sigmoid) is used, which is defined as:

- f(x) = 1 / (1 + exp(-x))

Where x is the input value to the function. In the case of logistic regression, x is replaced with the weighted sum.

For example:

- yhat = 1 / (1 + exp(-(X * Beta)))

The output is interpreted as a probability from a Binomial probability distribution function for the class labeled 1, if the two classes in the problem are labeled 0 and 1.

Notice that the output, being a number between 0 and 1, can be interpreted as a probability of belonging to the class labeled 1.

— Page 726, Artificial Intelligence: A Modern Approach, 3rd edition, 2009.
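As a small sketch of the squashed weighted sum (with coefficient and input values assumed for illustration, not from the text):

```python
import math

def logistic(x):
    # f(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + math.exp(-x))

# Assumed coefficients: intercept plus two inputs.
beta0, beta1, beta2 = -1.5, 0.8, 0.4
x1, x2 = 2.0, 1.0  # one example's input values

score = beta0 + beta1 * x1 + beta2 * x2  # the weighted sum (a real value)
yhat = logistic(score)                   # squashed into a probability in (0, 1)
print(score, yhat)
```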

The examples in the training dataset are drawn from a broader population and as such, this sample is known to be incomplete. Additionally, there is expected to be measurement error or statistical noise in the observations.

The parameters of the model (*beta*) must be estimated from the sample of observations drawn from the domain.

There are many ways to estimate the parameters. There are two frameworks that are the most common; they are:

- Least Squares Optimization (iteratively reweighted least squares).
- Maximum Likelihood Estimation.

Both are optimization procedures that involve searching for different model parameters.

Maximum Likelihood Estimation is a frequentist probabilistic framework that seeks a set of parameters for the model that maximizes a likelihood function. We will take a closer look at this second approach in the subsequent sections.

### Want to Learn Probability for Machine Learning

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Download Your FREE Mini-Course

## Logistic Regression and Log-Odds

Before we dive into how the parameters of the model are estimated from data, we need to understand what logistic regression is calculating exactly.

This might be the most confusing part of logistic regression, so we will go over it slowly.

The linear part of the model (the weighted sum of the inputs) calculates the log-odds of a successful event, specifically, the log-odds that a sample belongs to class 1.

In effect, the model estimates the log-odds for class 1 for the input variables at each level (all observed values).

What are odds and log-odds?

Odds may be familiar from the field of gambling. Odds are often stated as wins to losses (wins : losses), e.g. a one to ten chance or ratio of winning is stated as 1 : 10.

Given the probability of success (*p*) predicted by the logistic regression model, we can convert it to odds of success as the probability of success divided by the probability of not success:

odds = p / (1 - p)

The logarithm of the odds is calculated, specifically log base-e or the natural logarithm. This quantity is referred to as the log-odds and may be referred to as the logit (logistic unit), a unit of measure.

Recall that this is what the linear part of the logistic regression is calculating:

log-odds = beta0 + beta1 * x1 + beta2 * x2 + … + betam * xm

The log-odds of success can be converted back into an odds of success by calculating the exponential of the log-odds:

odds = exp(log-odds)

Or

odds = e^log-odds

The odds of success can be converted back into a probability of success as follows:

p = odds / (odds + 1)

And this is close to the form of our logistic regression model, except we want to convert log-odds to odds as part of the calculation.

We can do this and simplify the calculation as follows:

p = 1 / (1 + e^-log-odds)

This shows how we go from log-odds to odds, to a probability of class 1 with the logistic regression model, and that this final functional form matches the logistic function, ensuring that the probability is between 0 and 1.

We can make these calculations of converting between probability, odds and log-odds concrete with some small examples in Python.

First, let’s define the probability of success at 80%, or 0.8, and convert it to odds then back to a probability again.

The complete example is listed below.

```python
# example of converting between probability and odds
# define our probability of success
prob = 0.8
print('Probability %.1f' % prob)
# convert probability to odds
odds = prob / (1 - prob)
print('Odds %.1f' % odds)
# convert back to probability
prob = odds / (odds + 1)
print('Probability %.1f' % prob)
```

Running the example shows that 0.8 is converted to the odds of success 4, and back to the correct probability again.

```
Probability 0.8
Odds 4.0
Probability 0.8
```

Let’s extend this example and convert the odds to log-odds and then convert the log-odds back into the original probability. This final conversion is effectively the form of the logistic regression model, or the logistic function.

The complete example is listed below.

```python
# example of converting between probability and log-odds
from math import log
from math import exp
# define our probability of success
prob = 0.8
print('Probability %.1f' % prob)
# convert probability to odds
odds = prob / (1 - prob)
print('Odds %.1f' % odds)
# convert odds to log-odds
logodds = log(odds)
print('Log-Odds %.1f' % logodds)
# convert log-odds to a probability
prob = 1 / (1 + exp(-logodds))
print('Probability %.1f' % prob)
```

Running the example, we can see that our odds are converted into the log odds of about 1.4 and then correctly converted back into the 0.8 probability of success.

```
Probability 0.8
Odds 4.0
Log-Odds 1.4
Probability 0.8
```

Now that we have a handle on the probability calculated by logistic regression, let’s look at maximum likelihood estimation.

## Maximum Likelihood Estimation

Maximum Likelihood Estimation, or MLE for short, is a probabilistic framework for estimating the parameters of a model.

In Maximum Likelihood Estimation, we wish to maximize the conditional probability of observing the data (*X*) given a specific probability distribution and its parameters (*theta*), stated formally as:

maximize P(X ; theta)

Where *X* is, in fact, the joint probability distribution of all observations from the problem domain from 1 to *n*.

This resulting conditional probability is referred to as the likelihood of observing the data given the model parameters and written using the notation *L()* to denote the likelihood function. For example:

maximize L(X ; theta)

The joint probability distribution can be restated as the multiplication of the conditional probability for observing each example given the distribution parameters. Multiplying many small probabilities together can be unstable; as such, it is common to restate this problem as the sum of the log conditional probability.

Given the frequent use of log in the likelihood function, it is referred to as a log-likelihood function. It is common in optimization problems to prefer to minimize the cost function rather than to maximize it. Therefore, the negative of the log-likelihood function is used, referred to generally as a Negative Log-Likelihood (NLL) function.

The Maximum Likelihood Estimation framework can be used as a basis for estimating the parameters of many different machine learning models for regression and classification predictive modeling. This includes the logistic regression model.

## Logistic Regression as Maximum Likelihood

We can frame the problem of fitting a machine learning model as the problem of probability density estimation.

Specifically, the choice of model and model parameters is referred to as a modeling hypothesis *h*, and the problem involves finding *h* that best explains the data *X*. We can, therefore, find the modeling hypothesis that maximizes the likelihood function.

Supervised learning can be framed as a conditional probability problem of predicting the probability of the output given the input:

P(y | X)

As such, we can define conditional maximum likelihood estimation for supervised machine learning as follows:

maximize sum i to n log(P(yi | xi ; h))

Now we can replace *h* with our logistic regression model.

In order to use maximum likelihood, we need to assume a probability distribution. In the case of logistic regression, a Binomial probability distribution is assumed for the data sample, where each example is one outcome of a Bernoulli trial. The Bernoulli distribution has a single parameter: the probability of a successful outcome (*p*).

The probability distribution that is most often used when there are two classes is the binomial distribution. This distribution has a single parameter, p, that is the probability of an event or a specific class.

— Page 283, Applied Predictive Modeling, 2013.

The expected value (mean) of the Bernoulli distribution can be calculated as follows:

mean = P(y=1) * 1 + P(y=0) * 0

Or, given p:

mean = p * 1 + (1 - p) * 0

This calculation may seem redundant, but it provides the basis for the likelihood function for a specific input, where the probability is given by the model (*yhat*) and the actual label is given from the dataset:

likelihood = yhat * y + (1 - yhat) * (1 - y)

This function will always return a large probability when the model is close to the matching class value, and a small value when it is far away, for both *y=0* and *y=1* cases.

We can demonstrate this with a small worked example for both outcomes and small and large probabilities predicted for each.

The complete example is listed below.

```python
# test of Bernoulli likelihood function

# likelihood function for Bernoulli distribution
def likelihood(y, yhat):
    return yhat * y + (1 - yhat) * (1 - y)

# test for y=1
y, yhat = 1, 0.9
print('y=%.1f, yhat=%.1f, likelihood: %.3f' % (y, yhat, likelihood(y, yhat)))
y, yhat = 1, 0.1
print('y=%.1f, yhat=%.1f, likelihood: %.3f' % (y, yhat, likelihood(y, yhat)))
# test for y=0
y, yhat = 0, 0.1
print('y=%.1f, yhat=%.1f, likelihood: %.3f' % (y, yhat, likelihood(y, yhat)))
y, yhat = 0, 0.9
print('y=%.1f, yhat=%.1f, likelihood: %.3f' % (y, yhat, likelihood(y, yhat)))
```

Running the example prints the class labels (*y*) and predicted probabilities (*yhat*) for cases with close and far probabilities for each case.

We can see that the likelihood function is consistent in returning a probability for how well the model achieves the desired outcome.

```
y=1.0, yhat=0.9, likelihood: 0.900
y=1.0, yhat=0.1, likelihood: 0.100
y=0.0, yhat=0.1, likelihood: 0.900
y=0.0, yhat=0.9, likelihood: 0.100
```

We can update the likelihood function using the log to transform it into a log-likelihood function:

log-likelihood = log(yhat) * y + log(1 - yhat) * (1 - y)

Finally, we can sum the likelihood function across all examples in the dataset to maximize the likelihood:

maximize sum i to n log(yhat_i) * y_i + log(1 - yhat_i) * (1 - y_i)

It is common practice to minimize a cost function for optimization problems; therefore, we can invert the function so that we minimize the negative log-likelihood:

minimize -sum i to n log(yhat_i) * y_i + log(1 - yhat_i) * (1 - y_i)

Calculating the negative of the log-likelihood function for the Bernoulli distribution is equivalent to calculating the cross-entropy function for the Bernoulli distribution, where *p()* represents the probability of class 0 or class 1, and *q()* represents the estimation of the probability distribution, in this case by our logistic regression model.
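As a hedged sketch of this quantity (the function name `neg_log_likelihood` and the label/probability values are mine, chosen for illustration), we can compute the negative log-likelihood for a small set of labels and predicted probabilities:

```python
# negative log-likelihood (equivalently, summed cross-entropy) for a small
# set of labels and predicted probabilities (values here are illustrative)
from math import log

def neg_log_likelihood(ys, yhats):
    # minimize -sum(y * log(yhat) + (1 - y) * log(1 - yhat))
    return -sum(y * log(yhat) + (1 - y) * log(1 - yhat) for y, yhat in zip(ys, yhats))

ys   = [1, 1, 0, 0]
good = [0.9, 0.8, 0.1, 0.2]  # confident, mostly correct predictions
bad  = [0.4, 0.5, 0.6, 0.5]  # uncertain predictions
print('NLL (good model): %.3f' % neg_log_likelihood(ys, good))
print('NLL (bad model): %.3f' % neg_log_likelihood(ys, bad))
```

A model that assigns high probability to the correct class achieves a lower NLL, which is exactly the quantity an optimizer would minimize.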

Unlike linear regression, there is not an analytical solution to solving this optimization problem. As such, an iterative optimization algorithm must be used.

Unlike linear regression, we can no longer write down the MLE in closed form. Instead, we need to use an optimization algorithm to compute it. For this, we need to derive the gradient and Hessian.

— Page 246, Machine Learning: A Probabilistic Perspective, 2012.

The function does provide some information to aid in the optimization (specifically a Hessian matrix can be calculated), meaning that efficient search procedures that exploit this information can be used, such as the BFGS algorithm (and variants).
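The details of BFGS are beyond this post, but the iterative idea can be sketched with a simpler optimizer: plain gradient descent on the negative log-likelihood. The tiny one-input dataset, learning rate, and iteration count below are my own illustrative choices, not part of the original post:

```python
# sketch: fit logistic regression by gradient descent on the negative
# log-likelihood (a simpler iterative optimizer than BFGS); data is made up
from math import exp

# tiny 1-input dataset: class 1 tends to have larger x
X = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
y = [0, 0, 0, 1, 1, 1]

beta0, beta1 = 0.0, 0.0  # intercept and slope
lr = 0.1                 # learning rate

for _ in range(1000):
    # gradient of the NLL with respect to each coefficient
    g0 = g1 = 0.0
    for xi, yi in zip(X, y):
        p = 1 / (1 + exp(-(beta0 + beta1 * xi)))  # predicted probability
        g0 += (p - yi)       # d NLL / d beta0
        g1 += (p - yi) * xi  # d NLL / d beta1
    beta0 -= lr * g0
    beta1 -= lr * g1

# predictions should now separate the two classes
p_neg = 1 / (1 + exp(-(beta0 + beta1 * -2.0)))
p_pos = 1 / (1 + exp(-(beta0 + beta1 * 2.0)))
print('P(class 1 | x=-2) = %.3f' % p_neg)
print('P(class 1 | x=+2) = %.3f' % p_pos)
```

Library implementations replace this loop with quasi-Newton methods like BFGS, which use curvature (Hessian) information to converge in far fewer iterations.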

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

### Posts

### Books

### Articles

## Summary

In this post, you discovered logistic regression with maximum likelihood estimation.

Specifically, you learned:

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to Logistic Regression With Maximum Likelihood Estimation appeared first on Machine Learning Mastery.


## A Gentle Introduction to Linear Regression With Maximum Likelihood Estimation

Linear regression is a classical model for predicting a numerical quantity.

The parameters of a linear regression model can be estimated using a least squares procedure or by a maximum likelihood estimation procedure. Maximum likelihood estimation is a probabilistic framework for automatically finding the probability distribution and parameters that best describe the observed data. Supervised learning can be framed as a conditional probability problem, and maximum likelihood estimation can be used to fit the parameters of a model that best summarizes the conditional probability distribution, so-called conditional maximum likelihood estimation.

A linear regression model can be fit under this framework and can be shown to derive an identical solution to a least squares approach.

In this post, you will discover linear regression with maximum likelihood estimation.

After reading this post, you will know:

Discover Bayes optimization, Naive Bayes, maximum likelihood, distributions, cross-entropy, and much more in my new book, with 28 step-by-step tutorials and full Python source code.

Let’s get started.

## Overview

This tutorial is divided into four parts; they are:

## Linear Regression

Linear regression is a standard modeling method from statistics and machine learning.

Linear regression is the “work horse” of statistics and (supervised) machine learning.

— Page 217, Machine Learning: A Probabilistic Perspective, 2012.

Generally, it is a model that maps one or more numerical inputs to a numerical output. In terms of predictive modeling, it is suited to regression type problems: that is, the prediction of a real-valued quantity.

The input data is denoted as *X* with *n* examples and the output is denoted *y* with one output for each input. The prediction of the model for a given input is denoted as *yhat*.

The model is defined in terms of parameters called coefficients (beta), where there is one coefficient per input and an additional coefficient that provides the intercept or bias.

For example, a problem with inputs *X* with m variables *x1, x2, …, xm* will have coefficients *beta1, beta2, …, betam* and *beta0*. A given input is predicted as the weighted sum of the inputs for the example and the coefficients.

The model can also be described using linear algebra, with a vector for the coefficients (*Beta*) and a matrix for the input data (*X*) and a vector for the output (*y*).

The examples are drawn from a broader population and as such, the sample is known to be incomplete. Additionally, there is expected to be measurement error or statistical noise in the observations.

The parameters of the model (*beta*) must be estimated from the sample of observations drawn from the domain.

There are many ways to estimate the parameters given that the model has been studied for more than 100 years; nevertheless, two frameworks are the most common. They are:

- Least Squares Optimization
- Maximum Likelihood Estimation (MLE)

Both are optimization procedures that involve searching for different model parameters.

Least squares optimization is an approach to estimating the parameters of a model by seeking a set of parameters that results in the smallest squared error between the predictions of the model (*yhat*) and the actual outputs (*y*), averaged over all examples in the dataset, so-called mean squared error.

Maximum Likelihood Estimation is a frequentist probabilistic framework that seeks a set of parameters for the model that maximize a likelihood function. We will take a closer look at this second approach.

Under both frameworks, different optimization algorithms may be used, such as local search methods like the BFGS algorithm (or variants), and general optimization methods like stochastic gradient descent. The linear regression model is special in that an analytical solution also exists, meaning that the coefficients can be calculated directly using linear algebra, a topic that is out of the scope of this tutorial.

For more information, see:

## Maximum Likelihood Estimation

Maximum Likelihood Estimation, or MLE for short, is a probabilistic framework for estimating the parameters of a model.

In Maximum Likelihood Estimation, we wish to maximize the conditional probability of observing the data (*X*) given a specific probability distribution and its parameters (*theta*), stated formally as:

maximize P(X ; theta)

Where *X* is, in fact, the joint probability distribution of all observations from the problem domain from 1 to *n*.

This resulting conditional probability is referred to as the likelihood of observing the data given the model parameters and written using the notation *L()* to denote the likelihood function. For example:

maximize L(X ; theta)

The joint probability distribution can be restated as the multiplication of the conditional probability for observing each example given the distribution parameters. Multiplying many small probabilities together can be unstable; as such, it is common to restate this problem as the sum of the natural log conditional probability.

Given the common use of log in the likelihood function, it is referred to as a log-likelihood function. It is also common in optimization problems to prefer to minimize the cost function rather than to maximize it. Therefore, the negative of the log-likelihood function is used, referred to generally as a Negative Log-Likelihood (NLL) function.

The Maximum Likelihood Estimation framework can be used as a basis for estimating the parameters of many different machine learning models for regression and classification predictive modeling. This includes the linear regression model.

### Want to Learn Probability for Machine Learning

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Download Your FREE Mini-Course

## Linear Regression as Maximum Likelihood

We can frame the problem of fitting a machine learning model as the problem of probability density estimation.

Specifically, the choice of model and model parameters is referred to as a modeling hypothesis *h*, and the problem involves finding *h* that best explains the data *X*. We can, therefore, find the modeling hypothesis that maximizes the likelihood function.

Supervised learning can be framed as a conditional probability problem of predicting the probability of the output given the input:

P(y | X)

As such, we can define conditional maximum likelihood estimation for supervised machine learning as follows:

maximize sum i to n log(P(yi | xi ; h))

Now we can replace *h* with our linear regression model.

We can make some reasonable assumptions, such as the observations in the dataset are independent and drawn from the same probability distribution (i.i.d.), and that the target variable (*y*) has statistical noise with a Gaussian distribution, zero mean, and the same variance for all examples.

With these assumptions, we can frame the problem of estimating *y* given *X* as estimating the mean value for *y* from a Gaussian probability distribution given *X*.

The analytical form of the Gaussian function is as follows:

f(x) = (1 / sqrt(2 * pi * sigma^2)) * exp(-((x - mu)^2) / (2 * sigma^2))

Where *mu* is the mean of the distribution and *sigma^2* is the variance (in squared units).

We can use this function as our likelihood function, where mu is defined as the prediction from the model with a given set of coefficients (*Beta*) and *sigma* is a fixed constant.

First, we can state the problem as the maximization of the product of the probabilities for each example in the dataset:

maximize product i to n (1 / sqrt(2 * pi * sigma^2)) * exp(-((yi - h(xi, Beta))^2) / (2 * sigma^2))

Where *xi* is a given example and *Beta* refers to the coefficients of the linear regression model. We can transform this to a log-likelihood model as follows:

maximize sum i to n log(1 / sqrt(2 * pi * sigma^2)) - ((yi - h(xi, Beta))^2 / (2 * sigma^2))

The calculation can be simplified further, but we will stop there for now.
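The Gaussian log-likelihood of a linear model's residuals can be sketched in code; the function name, dataset, and candidate coefficients below are mine, chosen for illustration, and *sigma* is held at a fixed constant as described above:

```python
# sketch: Gaussian log-likelihood of a linear model's residuals, with sigma
# held at a fixed constant (data and candidate coefficients are made up)
from math import log, pi, sqrt

def gaussian_log_likelihood(X, y, beta0, beta1, sigma=1.0):
    # sum over examples of log N(yi ; mu=beta0 + beta1*xi, sigma^2)
    ll = 0.0
    for xi, yi in zip(X, y):
        mu = beta0 + beta1 * xi  # the model's prediction is the mean
        ll += -log(sqrt(2 * pi * sigma**2)) - (yi - mu)**2 / (2 * sigma**2)
    return ll

X = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]  # roughly y = 2x

good = gaussian_log_likelihood(X, y, beta0=0.0, beta1=2.0)
bad  = gaussian_log_likelihood(X, y, beta0=0.0, beta1=1.0)
print('log-likelihood beta1=2: %.3f' % good)
print('log-likelihood beta1=1: %.3f' % bad)
```

Coefficients whose predictions sit close to the observed targets yield a higher log-likelihood, which is what the search procedure exploits.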

It’s interesting that the prediction is the mean of a distribution. It suggests that we can very reasonably add a bound to the prediction to give a prediction interval based on the standard deviation of the distribution, which is indeed a common practice.

Although the model assumes a Gaussian distribution in the prediction (i.e. Gaussian noise function or error function), there is no such expectation for the inputs to the model (*X*).

[the model] considers noise only in the target value of the training example and does not consider noise in the attributes describing the instances themselves.

— Page 167, Machine Learning, 1997.

We can apply a search procedure to maximize this log likelihood function, or invert it by adding a negative sign to the beginning and minimize the negative log-likelihood function (more common).

This provides a solution to the linear regression model for a given dataset.

This framework is also more general and can be used for curve fitting and provides the basis for fitting other regression models, such as artificial neural networks.

## Least Squares and Maximum Likelihood

Interestingly, the maximum likelihood solution to linear regression presented in the previous section can be shown to be identical to the least squares solution.

After derivation, the least squares equation to be minimized to fit a linear regression to a dataset looks as follows:

minimize sum i to n (yi - h(xi, Beta))^2

Where we are summing the squared errors between each target variable (*yi*) and the prediction from the model for the associated input *h(xi, Beta)*. This is often referred to as ordinary least squares. More generally, if the value is normalized by the number of examples in the dataset (averaged) rather than summed, then the quantity is referred to as the mean squared error.

Starting with the likelihood function defined in the previous section, we can show how we can remove constant elements to give the same equation as the least squares approach to solving linear regression.

Note: this derivation is based on the example given in Chapter 6 of Machine Learning by Tom Mitchell.

Key to removing constants is to focus on what does not change when different models are evaluated, e.g. when *h(xi, Beta)* is evaluated.

The first term of the calculation is independent of the model and can be removed to give:

maximize sum i to n -((yi - h(xi, Beta))^2 / (2 * sigma^2))

We can then remove the negative sign to minimize the positive quantity rather than maximize the negative quantity:

minimize sum i to n (yi - h(xi, Beta))^2 / (2 * sigma^2)

Finally, we can discard the remaining constant factor 1/(2 * sigma^2), which is also independent of the model, to give:

minimize sum i to n (yi - h(xi, Beta))^2

We can see that this is identical to the least squares solution.
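We can check this equivalence numerically: for a fixed *sigma*, ranking candidate coefficients by Gaussian log-likelihood gives the same winner as ranking by sum of squared errors. The dataset and candidate values below are made up for illustration:

```python
# sketch: for a fixed sigma, the coefficient that maximizes the Gaussian
# log-likelihood is the one that minimizes the sum of squared errors
# (dataset and candidate coefficients are made up for illustration)
from math import log, pi, sqrt

X = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]

def sse(beta1):
    # sum of squared errors for the model yhat = beta1 * x
    return sum((yi - beta1 * xi)**2 for xi, yi in zip(X, y))

def log_likelihood(beta1, sigma=1.0):
    # Gaussian log-likelihood with mu = beta1 * x
    return sum(-log(sqrt(2 * pi * sigma**2)) - (yi - beta1 * xi)**2 / (2 * sigma**2)
               for xi, yi in zip(X, y))

candidates = [1.0, 1.5, 2.0, 2.5]
best_by_sse = min(candidates, key=sse)
best_by_ll = max(candidates, key=log_likelihood)
print('best by least squares: beta1 = %.1f' % best_by_sse)
print('best by max likelihood: beta1 = %.1f' % best_by_ll)
```

Both criteria pick the same coefficient, as the derivation above predicts.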

In fact, under reasonable assumptions, an algorithm that minimizes the squared error between the target variable and the model output also performs maximum likelihood estimation.

… under certain assumptions any learning algorithm that minimizes the squared error between the output hypothesis predictions and the training data will output a maximum likelihood hypothesis.

— Page 164, Machine Learning, 1997.

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

### Tutorials

### Books

### Articles

## Summary

In this post, you discovered linear regression with maximum likelihood estimation.

Specifically, you learned:

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to Linear Regression With Maximum Likelihood Estimation appeared first on Machine Learning Mastery.

## A Gentle Introduction to Maximum Likelihood Estimation for Machine Learning

Density estimation is the problem of estimating the probability distribution for a sample of observations from a problem domain.

There are many techniques for solving density estimation, although a common framework used throughout the field of machine learning is maximum likelihood estimation. Maximum likelihood estimation involves defining a likelihood function for calculating the conditional probability of observing the data sample given a probability distribution and distribution parameters. This approach can be used to search a space of possible distributions and parameters.

This flexible probabilistic framework also provides the foundation for many machine learning algorithms, including important methods such as linear regression and logistic regression for predicting numeric values and class labels respectively, but also more generally for deep learning artificial neural networks.

In this post, you will discover a gentle introduction to maximum likelihood estimation.

After reading this post, you will know:

Let’s get started.

## Overview

This tutorial is divided into three parts; they are:

## Problem of Probability Density Estimation

A common modeling problem involves how to estimate a joint probability distribution for a dataset.

For example, consider a sample of observations (*X*) from a domain (*x1, x2, x3, …, xn*), where each observation is drawn independently from the domain with the same probability distribution (so-called independent and identically distributed, i.i.d., or close to it).

Density estimation involves selecting a probability distribution function and the parameters of that distribution that best explain the joint probability distribution of the observed data (*X*).

This problem is made more challenging as the sample (*X*) drawn from the population is small and has noise, meaning that any evaluation of an estimated probability density function and its parameters will have some error.

There are many techniques for solving this problem, although two common approaches are:

- Maximum Likelihood Estimation (MLE), a frequentist method
- Maximum a Posteriori (MAP), a Bayesian method

The main difference is that MLE assumes that all solutions are equally likely beforehand, whereas MAP allows prior information about the form of the solution to be harnessed.

In this post, we will take a closer look at the MLE method and its relationship to applied machine learning.

### Want to Learn Probability for Machine Learning

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Download Your FREE Mini-Course

## Maximum Likelihood Estimation

One solution to probability density estimation is referred to as Maximum Likelihood Estimation, or MLE for short.

Maximum Likelihood Estimation involves treating the problem as an optimization or search problem, where we seek a set of parameters that results in the best fit for the joint probability of the data sample (*X*).

First, it involves defining a parameter called *theta* that defines both the choice of the probability density function and the parameters of that distribution. It may be a vector of numerical values whose values change smoothly and map to different probability distributions and their parameters.

In Maximum Likelihood Estimation, we wish to maximize the probability of observing the data from the joint probability distribution given a specific probability distribution and its parameters, stated formally as:

P(X | theta)

This conditional probability is often stated using the semicolon (;) notation instead of the bar notation (|) because *theta* is not a random variable, but instead an unknown parameter. For example:

P(X ; theta)

or

P(x1, x2, x3, …, xn ; theta)

This resulting conditional probability is referred to as the likelihood of observing the data given the model parameters and written using the notation *L()* to denote the likelihood function. For example:

L(X ; theta)

The objective of Maximum Likelihood Estimation is to find the set of parameters (*theta*) that maximize the likelihood function, e.g. result in the largest likelihood value.

We can unpack the conditional probability calculated by the likelihood function.

Given that the sample is comprised of n examples, we can frame this as the joint probability of the observed data samples *x1, x2, x3, …, xn* in *X* given the probability distribution parameters (*theta*).

The joint probability distribution can be restated as the multiplication of the conditional probability for observing each example given the distribution parameters:

maximize product i to n P(xi ; theta)

Multiplying many small probabilities together can be numerically unstable in practice; therefore, it is common to restate this problem as the sum of the log conditional probabilities of observing each example given the model parameters:

maximize sum i to n log(P(xi ; theta))

Where log with base-e, called the natural logarithm, is commonly used.

This product over many probabilities can be inconvenient […] it is prone to numerical underflow. To obtain a more convenient but equivalent optimization problem, we observe that taking the logarithm of the likelihood does not change its arg max but does conveniently transform a product into a sum

— Page 132, Deep Learning, 2016.
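We can see the underflow problem directly with a short sketch (the probability value and count are made up for illustration): a product of many small probabilities collapses to zero in floating point, while the sum of their logs stays finite:

```python
# demonstrate why the log is used: a product of many small probabilities
# underflows to zero in floating point, while the sum of logs stays finite
from math import log

probs = [0.05] * 1000  # 1000 illustrative small probabilities

product = 1.0
for p in probs:
    product *= p
log_sum = sum(log(p) for p in probs)

print('product of probabilities:', product)
print('sum of log probabilities: %.1f' % log_sum)
```

The product underflows to exactly 0.0, destroying all information, while the log-sum remains a perfectly usable (if large and negative) number with the same arg max.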

Given the frequent use of log in the likelihood function, it is commonly referred to as a log-likelihood function.

It is common in optimization problems to prefer to minimize the cost function, rather than to maximize it. Therefore, the negative of the log-likelihood function is used, referred to generally as a Negative Log-Likelihood (NLL) function.

In software, we often phrase both as minimizing a cost function. Maximum likelihood thus becomes minimization of the negative log-likelihood (NLL) …

— Page 133, Deep Learning, 2016.

## Relationship to Machine Learning

This problem of density estimation is directly related to applied machine learning.

We can frame the problem of fitting a machine learning model as the problem of probability density estimation. Specifically, the choice of model and model parameters is referred to as a modeling hypothesis *h*, and the problem involves finding *h* that best explains the data *X*.

We can, therefore, find the modeling hypothesis that maximizes the likelihood function.

Or, more fully:

maximize sum i to n log(P(xi ; h))

This provides the basis for estimating the probability density of a dataset, typically used in unsupervised machine learning algorithms; for example:

Using the expected log joint probability as a key quantity for learning in a probability model with hidden variables is better known in the context of the celebrated “expectation maximization” or EM algorithm.

— Page 365, Data Mining: Practical Machine Learning Tools and Techniques, 4th edition, 2016.

The Maximum Likelihood Estimation framework is also a useful tool for supervised machine learning.

This applies to data where we have input and output variables, where the output may be a numerical value or a class label in the case of regression and classification predictive modeling, respectively.

We can state this as the conditional probability of the output (*y*) given the input (*X*) given the modeling hypothesis (*h*):

maximize L(y | X ; h)

Or, more fully:

maximize sum i to n log(P(yi | xi ; h))

The maximum likelihood estimator can readily be generalized to the case where our goal is to estimate a conditional probability P(y | x ; theta) in order to predict y given x. This is actually the most common situation because it forms the basis for most supervised learning.

— Page 133, Deep Learning, 2016.

This means that the same Maximum Likelihood Estimation framework that is generally used for density estimation can be used to find a supervised learning model and parameters.

This provides the basis for foundational linear modeling techniques, such as:

- Linear Regression, for predicting a numerical value
- Logistic Regression, for binary classification

In the case of linear regression, the model is constrained to a line and involves finding a set of coefficients for the line that best fits the observed data. Fortunately, this problem can be solved analytically (e.g. directly using linear algebra).
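The analytical solution for simple (one-input) linear regression can be sketched with the closed-form normal-equation result, beta1 = cov(x, y) / var(x); the dataset below is made up for illustration:

```python
# sketch: the analytical (closed-form) solution for simple linear regression,
# beta1 = cov(x, y) / var(x) and beta0 = mean(y) - beta1 * mean(x)
# (dataset is made up: roughly y = 1 + 2x)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 5.0, 6.9, 9.1, 10.9]

mx = sum(xs) / len(xs)
my = sum(ys) / len(ys)
beta1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx)**2 for x in xs)
beta0 = my - beta1 * mx
print('beta0=%.2f, beta1=%.2f' % (beta0, beta1))
```

No iterative search is needed here; the coefficients fall out of the data directly, which is what distinguishes linear regression from logistic regression in the next paragraph.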

In the case of logistic regression, the model defines a line and involves finding a set of coefficients for the line that best separates the classes. This cannot be solved analytically and is often solved by searching the space of possible coefficient values using an efficient optimization algorithm such as the BFGS algorithm or variants.

Both methods can also be solved less efficiently using a more general optimization algorithm such as stochastic gradient descent.

In fact, most machine learning models can be framed under the maximum likelihood estimation framework, providing a useful and consistent way to approach predictive modeling as an optimization problem.

An important benefit of the maximum likelihood estimator in machine learning is that as the size of the dataset increases, the quality of the estimator continues to improve.

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

### Books

### Articles

## Summary

In this post, you discovered a gentle introduction to maximum likelihood estimation.

Specifically, you learned:

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to Maximum Likelihood Estimation for Machine Learning appeared first on Machine Learning Mastery.
