Using Likelihood to Model Probability

Likelihood function

The likelihood function measures how likely it is to observe an output for a given set of model parameters. Mathematically, it is expressed as a conditional probability density:

p (t_{0} ∣ x_{0}, w, β)

where,

$p \to$ probability density function (PDF)
$t_{0} \to$ an output
$x_{0} \to$ an input
$w \to$ model parameters. Example: in a linear regression model relating $x$ and $t$,

t \approx P (x) = w_{0} + w_{1} x

Here, $w = [w_{0}, w_{1}]$ are the parameters of the model. $w_{0}$ is the intercept, and $w_{1}$ is the slope.

$β \to$ precision, the inverse of the variance ( $σ^{2}$ ), i.e. $β = 1 / σ^{2}$.
Note: high precision means low variance, and vice versa.

Assumption

The conditional density is assumed to be Gaussian, with its mean given by the model prediction and variance $β^{- 1} = σ^{2}$ (for the 1-dimensional case)¹

p (t_{0} ∣ x_{0}, w, β) = N (t_{0} ∣ y (x_{0}, w), β^{- 1})

where,

$y \to$ the regression model used.
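As a minimal sketch of the single-point likelihood, the snippet below evaluates the Gaussian density for the straight-line model from the earlier example. The function names and the numeric values are illustrative assumptions, not part of the notes.

```python
import math

def gaussian_pdf(t, mean, beta):
    """Density N(t | mean, beta^{-1}), where beta = 1 / variance is the precision."""
    return math.sqrt(beta / (2.0 * math.pi)) * math.exp(-0.5 * beta * (t - mean) ** 2)

def linear_model(x, w):
    """Straight-line model y(x, w) = w0 + w1 * x."""
    return w[0] + w[1] * x

def point_likelihood(t0, x0, w, beta):
    """p(t0 | x0, w, beta) = N(t0 | y(x0, w), beta^{-1})."""
    return gaussian_pdf(t0, linear_model(x0, w), beta)

# Illustrative values only (assumed for this sketch).
w = (1.0, 2.0)
on_model = point_likelihood(t0=3.0, x0=1.0, w=w, beta=4.0)   # t0 equals y(x0, w) = 3
off_model = point_likelihood(t0=4.0, x0=1.0, w=w, beta=4.0)  # t0 is one unit away
```

An observation that lies on the model prediction gets the highest density; moving it one unit away lowers the density by the factor $\exp (- β / 2)$.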

Measurement to Stochastic Model

For a given parametric model $y (x, w)$ and target $t$, the measurement noise can be modeled by a Gaussian distribution $N$:

t = y (x, w) + N (0, β^{- 1})

The mean is $0$ because we center the Gaussian around the predicted value.

Since the noise term is a random variable, the residual is described by a distribution rather than fixed by an equality:

t - y (x, w) \sim N (0, β^{- 1})

Shifting by the deterministic prediction $y (x, w)$, we finally get

t \sim N (y (x, w), β^{- 1})
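One way to check this reading of the stochastic model is to simulate it: draw many targets for a fixed input and compare their sample mean and variance with $y (x, w)$ and $β^{- 1}$. This is a sketch with assumed values for $w$, $β$ and $x$ and a hypothetical helper `sample_target`.

```python
import math
import random

def sample_target(x, w, beta, rng):
    """Draw t = y(x, w) + nu with nu ~ N(0, beta^{-1})."""
    sigma = 1.0 / math.sqrt(beta)      # standard deviation from the precision
    y = w[0] + w[1] * x                # deterministic model prediction
    return y + rng.gauss(0.0, sigma)   # add zero-mean Gaussian noise

rng = random.Random(0)                 # fixed seed for repeatability
w, beta, x = (1.0, 2.0), 4.0, 1.0      # illustrative values (assumed)
samples = [sample_target(x, w, beta, rng) for _ in range(20000)]

mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
# mean should be close to y(x, w) = 3.0 and var close to 1 / beta = 0.25
```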

Applying this to the whole Data-set

We define the likelihood function (data likelihood) as the joint density of all targets, where the noise on each target value is described by the Gaussian above.

L (w) = P (T ∣ X, w, β) = \prod_{n = 1}^{N} N (t_{n} ∣ y (x_{n}, w), β^{- 1})

Note:

We assume the data points to be independent. This is an inductive bias!
The joint density of independent values is the product of the individual densities.
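The product form can be sketched directly in code. The example below computes the data likelihood as a product of per-point Gaussian densities and, for comparison, the sum of the per-point log densities. The tiny data set and parameter values are made up for illustration.

```python
import math

def log_gaussian(t, mean, beta):
    """ln N(t | mean, beta^{-1})."""
    return 0.5 * math.log(beta / (2.0 * math.pi)) - 0.5 * beta * (t - mean) ** 2

def data_likelihood(xs, ts, w, beta):
    """Product of the per-point densities (independence assumption)."""
    likelihood = 1.0
    for x, t in zip(xs, ts):
        likelihood *= math.exp(log_gaussian(t, w[0] + w[1] * x, beta))
    return likelihood

def data_log_likelihood(xs, ts, w, beta):
    """Sum of the per-point log densities; equals ln of the product."""
    return sum(log_gaussian(t, w[0] + w[1] * x, beta) for x, t in zip(xs, ts))

# Made-up data roughly following t = 1 + 2x (assumed for this sketch).
xs = [0.0, 1.0, 2.0, 3.0]
ts = [1.1, 2.9, 5.2, 6.8]
w, beta = (1.0, 2.0), 4.0

L = data_likelihood(xs, ts, w, beta)
log_L = data_log_likelihood(xs, ts, w, beta)
```

For large data sets the raw product can underflow in floating point, which is one practical reason to work with the logarithm instead.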

Parameter Optimisation

This involves maximising the likelihood with respect to the parameters, i.e. we choose the model parameters ( $w$ ) that maximise the probability density of the measured outputs ( $t$ ) for the given inputs ( $x$ ):

w_{M L} = w^{*} = \arg \max_{w} L (w)

Note: the actual value of $L (w)$ at the maximum is not important; only the $w$ at which the maximum is attained matters.
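To illustrate why only the location of the maximum matters, the following sketch runs a coarse grid search over the slope $w_{1}$ (with the intercept held fixed to keep the search one-dimensional) and compares three criteria: the log-likelihood, the likelihood itself, and the sum of squared errors. The data, the grid and the fixed values are assumptions made for the example; a grid search is used here only for transparency, not as a recommended optimiser.

```python
import math

# Made-up data and fixed quantities (assumed for this sketch).
xs = [0.0, 1.0, 2.0, 3.0]
ts = [1.1, 2.9, 5.2, 6.8]
beta = 4.0
w0 = 1.0   # intercept held fixed

def sum_squared_error(w1):
    return sum((t - (w0 + w1 * x)) ** 2 for x, t in zip(xs, ts))

def log_likelihood(w1):
    n = len(xs)
    return 0.5 * n * math.log(beta / (2.0 * math.pi)) - 0.5 * beta * sum_squared_error(w1)

candidates = [i / 100.0 for i in range(100, 301)]   # slopes 1.00, 1.01, ..., 3.00

best_by_log = max(candidates, key=log_likelihood)
best_by_lik = max(candidates, key=lambda w1: math.exp(log_likelihood(w1)))
best_by_sse = min(candidates, key=sum_squared_error)
```

All three criteria select the same slope, because the logarithm is monotonic and the remaining terms do not depend on $w$.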

Generalisation

Select the parameters $w_{M L}$ and $β_{M L}$ that maximize the likelihood, and represent the resulting distribution over the output as a Gaussian:

p (t ∣ x, w_{M L}, β_{M L}) = N (t ∣ y (x, w_{M L}), β_{M L}^{- 1})
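Once $w_{M L}$ and $β_{M L}$ are available, a prediction for a new input is a full Gaussian rather than a single number. The sketch below returns its mean and standard deviation for the straight-line model; the fitted values are placeholders assumed for the example, and `predictive` is a hypothetical helper name.

```python
import math

def predictive(x, w_ml, beta_ml):
    """Mean and standard deviation of N(t | y(x, w_ml), beta_ml^{-1})."""
    mean = w_ml[0] + w_ml[1] * x
    std = 1.0 / math.sqrt(beta_ml)
    return mean, std

# Placeholder values standing in for a fitted model (assumed, not from the notes).
w_ml, beta_ml = (1.0, 2.0), 25.0
mean, std = predictive(2.0, w_ml, beta_ml)
interval = (mean - 1.96 * std, mean + 1.96 * std)   # central ~95% of a Gaussian
```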

Finding parameters from the probabilistic approach

Start with the stochastic data model:

t = y (x, w) + ν

where $ν \sim N (0, β^{- 1})$

Construct the likelihood function for a single data point:

p (t_{n} ∣ x_{n}, w, β) = N (t_{n} ∣ y (x_{n}, w), β^{- 1})

Form the data likelihood by assuming independence of data points:

p (t ∣ X, w, β) = \prod_{i = 1}^{N} N (t_{i} ∣ y (x_{i}, w), β^{- 1})

Take the negative logarithm to get the error function:

E (w) = - \ln p (t ∣ X, w, β) = \frac{β}{2} \sum_{i = 1}^{N} (t_{i} - y (x_{i}, w))^{2} + \frac{N}{2} \ln \frac{2 π}{β}
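The step from the product of Gaussians to this error function can be written out explicitly. It uses only the definition of the one-dimensional Gaussian density and the fact that the logarithm turns the product into a sum:

```latex
\begin{aligned}
N(t_i \mid y(x_i, w), \beta^{-1})
  &= \sqrt{\frac{\beta}{2\pi}}\,
     \exp\!\left(-\frac{\beta}{2}\,\bigl(t_i - y(x_i, w)\bigr)^2\right) \\
-\ln p(t \mid X, w, \beta)
  &= -\sum_{i=1}^{N} \ln N(t_i \mid y(x_i, w), \beta^{-1}) \\
  &= \frac{\beta}{2}\sum_{i=1}^{N} \bigl(t_i - y(x_i, w)\bigr)^2
     - \frac{N}{2}\ln\frac{\beta}{2\pi} \\
  &= \frac{\beta}{2}\sum_{i=1}^{N} \bigl(t_i - y(x_i, w)\bigr)^2
     + \frac{N}{2}\ln\frac{2\pi}{\beta}
\end{aligned}
```

Only the first term depends on $w$, so minimising $E (w)$ over $w$ is the same as minimising the sum of squared errors.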

Minimize $E (w)$ by setting its gradient with respect to $w$ to zero: $\nabla E (w) = - β \sum_{i = 1}^{N} (t_{i} - y (x_{i}, w)) \nabla y (x_{i}, w) = 0$
For linear models $y (x, w) = w^{T} Φ (x)$, this gives: $w_{M L} = (Φ^{T} Φ)^{- 1} Φ^{T} t$, where $Φ$ on the right-hand side denotes the design matrix whose $i$-th row is $Φ (x_{i})^{T}$.
The optimal precision parameter is the inverse of the mean squared residual: $β_{M L} = \frac{N}{\sum _{i = 1}^{N} ( t _{i} - y ( x _{i} , w _{M L} ) ) ^{2}}$
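For the straight-line model with basis $Φ (x) = (1, x)$, the closed-form solution reduces to a 2x2 system that can be solved by hand. The sketch below does this in plain Python and then estimates $β_{M L}$ from the residuals; the function name `fit_line_ml` and the data are assumptions made for illustration.

```python
def fit_line_ml(xs, ts):
    """Maximum-likelihood fit of y(x, w) = w0 + w1 * x under Gaussian noise.

    Solves the 2x2 normal equations (Phi^T Phi) w = Phi^T t for the basis
    phi(x) = (1, x), then estimates the precision beta_ML from the residuals.
    """
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    st = sum(ts)
    sxt = sum(x * t for x, t in zip(xs, ts))

    det = n * sxx - sx * sx               # determinant of Phi^T Phi
    w0 = (sxx * st - sx * sxt) / det      # first row of (Phi^T Phi)^{-1} Phi^T t
    w1 = (n * sxt - sx * st) / det        # second row of (Phi^T Phi)^{-1} Phi^T t

    sse = sum((t - (w0 + w1 * x)) ** 2 for x, t in zip(xs, ts))
    beta = n / sse                        # beta_ML = N / sum of squared residuals
    return (w0, w1), beta

# Made-up data roughly following t = 1 + 2x (assumed for this sketch).
xs = [0.0, 1.0, 2.0, 3.0]
ts = [1.1, 2.9, 5.2, 6.8]
w_ml, beta_ml = fit_line_ml(xs, ts)
```

Note that $β_{M L}$ divides by $N$, not by the number of degrees of freedom, so the corresponding variance estimate is the biased maximum-likelihood one. The function also assumes at least two distinct inputs and a non-zero residual, otherwise the divisions are undefined.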

¹ Bishop, Pattern Recognition and Machine Learning, p. 29.

Ashu's Online Notes
