[ML] Bayesian Methods Overview
Chain Rule for Probability
\[P(X,Y) = P(X|Y)P(Y)\] \[P(X,Y,Z) = P(X|Y,Z)P(Y|Z)P(Z)\] \[P(X_1, \dots ,X_N) = \prod_{i=1}^{N} P(X_i|X_1, \dots , X_{i-1})\]
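As a quick sanity check, here is a minimal Python sketch (the joint table is a made-up assumption) verifying that \( P(X \vert Y)P(Y) \) recovers the joint:

```python
# A made-up joint distribution P(X, Y) over two binary variables.
P_XY = {
    (0, 0): 0.10, (0, 1): 0.30,
    (1, 0): 0.20, (1, 1): 0.40,
}

# Marginal P(Y) by summing over X.
P_Y = {y: sum(p for (x2, y2), p in P_XY.items() if y2 == y) for y in (0, 1)}

# Conditional P(X | Y) from the definition P(X | Y) = P(X, Y) / P(Y).
P_X_given_Y = {(x, y): P_XY[(x, y)] / P_Y[y] for (x, y) in P_XY}

# Chain rule: P(X, Y) = P(X | Y) P(Y) reproduces the joint exactly.
for (x, y), p in P_XY.items():
    assert abs(P_X_given_Y[(x, y)] * P_Y[y] - p) < 1e-12
print("chain rule holds on this table")
```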
Sum Rule
\[p(X) = \int_{-\infty}^{\infty}p(X,Y)\,dY\]
Bayes' Theorem
Given \( \theta \) denotes parameters and \( X \) denotes observations:
\[P(\theta|X) = \frac{P(X,\theta)}{P(X)} = \frac{P(X|\theta)P(\theta)}{P(X)}\]
And we have the following terms:
\( P(\theta) \) - prior - encodes what we know about the parameters before seeing any data.
\( P(X \vert \theta) \) - likelihood - shows how well the parameters explain our data.
\( P(\theta \vert X) \) - posterior - the probability of the parameters after we observe the data.
\( P(X) \) - evidence - the probability of the data itself, which acts as a normalizing constant.
For a discrete set of parameter values, \( P(X) \) can be calculated by \( P(X) = \sum_i P(X \vert \theta_i) \times P(\theta_i) \).
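A minimal Python sketch of the whole theorem, assuming a hypothetical coin-bias scenario with a discrete grid of \( \theta \) values (all numbers are illustrative assumptions):

```python
from math import comb

thetas = [0.2, 0.5, 0.8]                      # candidate coin biases
prior = {t: 1 / 3 for t in thetas}            # P(theta): uniform prior

def likelihood(theta, heads, flips):
    """P(X | theta): probability of `heads` heads in `flips` tosses."""
    return comb(flips, heads) * theta**heads * (1 - theta) ** (flips - heads)

heads, flips = 7, 10                          # observed data X

# Evidence P(X) = sum_i P(X | theta_i) P(theta_i)
evidence = sum(likelihood(t, heads, flips) * prior[t] for t in thetas)

# Posterior P(theta | X) = P(X | theta) P(theta) / P(X)
posterior = {t: likelihood(t, heads, flips) * prior[t] / evidence
             for t in thetas}
print(posterior)                              # most mass moves to theta = 0.8
```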
A Bayesian approach always requires a prior probability, which is assumed rather than tested. But instead of confirming or falsifying a hypothesis outright, a Bayesian adjusts the prior probability as new evidence arrives.
This theorem is important because it allows prior knowledge about an event to update the probability of that event. It is a game of degrees of belief:
- \( P(A) \), the prior, is the initial degree of belief in A.
- \( P(A \vert B) \), the posterior, is the degree of belief having accounted for B.
- The quotient \( \frac{P(B \vert A)}{P(B)} \) represents the support B provides for A.
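The "adjust the prior" loop can be made concrete. A minimal sketch, assuming a hypothetical sequence of coin flips, where each posterior becomes the prior for the next observation:

```python
# Initial degrees of belief in three hypothetical coin biases.
prior = {0.2: 1 / 3, 0.5: 1 / 3, 0.8: 1 / 3}

def update(prior, flip):
    """One Bayes update for a single coin flip (1 = heads, 0 = tails)."""
    lik = {t: (t if flip == 1 else 1 - t) for t in prior}
    evidence = sum(lik[t] * prior[t] for t in prior)
    return {t: lik[t] * prior[t] / evidence for t in prior}

belief = prior
for flip in [1, 1, 0, 1, 1]:      # a hypothetical observation sequence
    belief = update(belief, flip)
print(belief)                     # belief concentrates on the likelier bias
```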
Bayesian Networks
A model is not set in stone; it is a representation of how we believe the world works.
A Bayesian network is:
- A directed acyclic graph (DAG) G whose nodes represent the random variables \( X_1, \dots, X_n \).
- For each node \( X_i \), a CPD (Conditional Probability Distribution) \( P(X_i \vert Par_G(X_i)) \), where \( Par_G(X_i) \) denotes the parents of \( X_i \) in G.
Chain Rule for BNs
The BN represents a joint distribution via the chain rule for Bayesian networks:
\[ P(X_1, \dots, X_n) = \prod_{i=1}^{n} P(X_i \vert Par_G(X_i)) \]
which we can also say: “P factorizes over G”.
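A minimal sketch of this factorization on a hypothetical 3-node DAG (Rain → Sprinkler, Rain → Wet, Sprinkler → Wet; structure and CPD numbers are assumptions for illustration):

```python
P_rain = {0: 0.8, 1: 0.2}                       # P(Rain)
P_sprinkler = {0: {0: 0.6, 1: 0.4},             # P(Sprinkler | Rain)
               1: {0: 0.99, 1: 0.01}}
P_wet = {(0, 0): {0: 1.0, 1: 0.0},              # P(Wet | Rain, Sprinkler)
         (0, 1): {0: 0.1, 1: 0.9},
         (1, 0): {0: 0.2, 1: 0.8},
         (1, 1): {0: 0.01, 1: 0.99}}

def joint(r, s, w):
    """P(R, S, W) = P(R) P(S | R) P(W | R, S) -- the BN chain rule."""
    return P_rain[r] * P_sprinkler[r][s] * P_wet[(r, s)][w]

# The factorization defines a proper joint distribution: it sums to 1.
total = sum(joint(r, s, w) for r in (0, 1) for s in (0, 1) for w in (0, 1))
print(total)  # 1.0
```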
Flow of Probabilistic Influence
A trail \( X_1 - \dots - X_k \) is active if it has no V-structure \( X_{i-1} \to X_i \gets X_{i+1} \).
A trail \( X_1 - \dots - X_k \) is active given \( Z \) if:
- For any V-structure \( X_{i-1} \to X_i \gets X_{i+1} \), we have that \( X_i \) or one of its descendants \( \in Z \).
- No other \( X_i \) is in \( Z \).
If a non-collider random variable on the trail is observed, we say that it "blocks the trail".
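The V-structure behavior can be seen numerically. A minimal sketch, assuming a hypothetical XOR collider \( X \to Z \gets Y \): marginally X and Y are independent, but conditioning on the collider Z activates the trail:

```python
import random
random.seed(0)

# Z = X XOR Y, a deterministic collider of two fair binary variables.
samples = [(x, y, x ^ y)
           for _ in range(100_000)
           for x, y in [(random.randint(0, 1), random.randint(0, 1))]]

def p(event, given=lambda s: True):
    """Empirical conditional probability of `event` given `given`."""
    pool = [s for s in samples if given(s)]
    return sum(event(s) for s in pool) / len(pool)

# Z unobserved: the V-structure blocks the trail, X and Y look independent.
print(p(lambda s: s[0] == 1))                           # ~0.5
print(p(lambda s: s[0] == 1, lambda s: s[1] == 1))      # ~0.5

# Z observed: the trail is active, X is now determined by Y.
print(p(lambda s: s[0] == 1, lambda s: s[2] == 0 and s[1] == 1))  # 1.0
```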
D-separation
Definition: \( X \) and \( Y \) are d-separated in \( G \) given \( Z \) if there is no active trail in \( G \) between \( X \) and \( Y \) given \( Z \), denoted as follows:
\[ \text{d-sep} _ G(X, Y \vert Z) \]
Any node is d-separated from its non-descendants given its parents.
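A minimal sketch of a d-separation check that follows the definition directly: enumerate simple trails in the skeleton and test whether any is active given Z. The example DAG (D → G ← I, I → S, a student-style network) is an assumption for illustration:

```python
def descendants(G, v):
    """All descendants of v in DAG G (G maps node -> list of children)."""
    out, stack = set(), [v]
    while stack:
        for child in G[stack.pop()]:
            if child not in out:
                out.add(child)
                stack.append(child)
    return out

def trails(G, x, y, path=None):
    """Yield all simple trails from x to y, ignoring edge direction."""
    path = path or [x]
    if x == y:
        yield path
        return
    neighbors = set(G[x]) | {u for u in G if x in G[u]}
    for n in neighbors - set(path):
        yield from trails(G, n, y, path + [n])

def is_active(G, trail, Z):
    """Apply the activity rules above to every middle node of the trail."""
    for i in range(1, len(trail) - 1):
        a, b, c = trail[i - 1], trail[i], trail[i + 1]
        if b in G[a] and b in G[c]:                  # V-structure a -> b <- c
            if b not in Z and not (descendants(G, b) & Z):
                return False                         # unobserved collider blocks
        elif b in Z:
            return False                             # observed non-collider blocks
    return True

def d_separated(G, x, y, Z):
    return not any(is_active(G, t, Z) for t in trails(G, x, y))

# Hypothetical student-style DAG: D -> G <- I, I -> S.
G = {"D": ["G"], "I": ["G", "S"], "G": [], "S": []}
print(d_separated(G, "D", "I", set()))   # True: the collider G blocks D - G - I
print(d_separated(G, "D", "I", {"G"}))   # False: observing G activates the trail
print(d_separated(G, "D", "S", {"G"}))   # False: D - G - I - S is active given G
```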
I-maps
\[ I(G) = \{ (X \perp Y \vert Z) : \text{d-sep} _ G(X, Y \vert Z) \} \]
If P satisfies I(G), we say that G is an I-map (Independency map) of P.
2 Theorems
- If P factorizes over G, and \( \text{d-sep} _ G(X, Y \vert Z) \), then \( P \models (X \perp Y \vert Z) \).
- If G is an I-map for P, then P factorizes over G.
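The first theorem can be checked numerically on a hypothetical collider DAG D → G ← I: D and I are d-separated given the empty set, so any P that factorizes over this graph must satisfy \( D \perp I \). The CPD numbers below are assumptions:

```python
P_D = {0: 0.6, 1: 0.4}                                  # P(D)
P_I = {0: 0.7, 1: 0.3}                                  # P(I)
P_G = {(d, i): {0: 0.9 - 0.3 * d - 0.4 * i,             # P(G | D, I)
                1: 0.1 + 0.3 * d + 0.4 * i}
       for d in (0, 1) for i in (0, 1)}

# Joint built via the BN chain rule: P(D) P(I) P(G | D, I).
joint = {(d, i, g): P_D[d] * P_I[i] * P_G[(d, i)][g]
         for d in (0, 1) for i in (0, 1) for g in (0, 1)}

# Marginalizing out G, P(D, I) equals P(D) P(I): the d-separated
# pair is indeed independent in the factorized distribution.
for d in (0, 1):
    for i in (0, 1):
        marg = sum(joint[(d, i, g)] for g in (0, 1))
        assert abs(marg - P_D[d] * P_I[i]) < 1e-12
print("d-separation implies independence here")
```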
Factorization & I-Map Summary
There are 2 equivalent ways of viewing the graph:
- Factorization: G allows P to be represented compactly as a product of local CPDs.
- I-map: Independencies encoded by G hold in P.