[ML] Bayesian Methods Overview
Chain Rule for Probability
\[P(X,Y) = P(X|Y)P(Y)\] \[P(X,Y,Z) = P(X|Y,Z)P(Y|Z)P(Z)\] \[P(X_1, \dots ,X_N) = \prod_{i=1}^{N} P(X_i|X_1, \dots , X_{i-1})\]
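As a quick sanity check, here is a minimal Python sketch (the joint table is a made-up assumption) verifying that \( P(X \vert Y)P(Y) \) recovers the joint:

```python
# A made-up joint distribution P(X, Y) over two binary variables.
P_XY = {
    (0, 0): 0.10, (0, 1): 0.30,
    (1, 0): 0.20, (1, 1): 0.40,
}

# Marginal P(Y) by summing over X.
P_Y = {y: sum(p for (x2, y2), p in P_XY.items() if y2 == y) for y in (0, 1)}

# Conditional P(X | Y) from the definition P(X | Y) = P(X, Y) / P(Y).
P_X_given_Y = {(x, y): P_XY[(x, y)] / P_Y[y] for (x, y) in P_XY}

# Chain rule: P(X, Y) = P(X | Y) P(Y) reproduces the joint exactly.
for (x, y), p in P_XY.items():
    assert abs(P_X_given_Y[(x, y)] * P_Y[y] - p) < 1e-12
print("chain rule holds on this table")
```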
Sum Rule
\[p(X) = \int_{-\infty}^{\infty}p(X,Y)\,dY\]
Bayes' Theorem
Given \( \theta \) denotes parameters and \( X \) denotes observations:
\[P(\theta|X) = \frac{P(X,\theta)}{P(X)} = \frac{P(X|\theta)P(\theta)}{P(X)}\]
And we have the following terms:
\( P(\theta) \) - prior - encodes what we know about the parameters before seeing any data.
\( P(X \vert \theta) \) - likelihood - shows how well the parameters explain our data.
\( P(\theta \vert X) \) - posterior - the probability of the parameters after we observe the data.
\( P(X) \) - evidence - the probability of the data itself, which acts as a normalizing constant.
For a discrete set of parameter values, \( P(X) \) can be calculated by \( P(X) = \sum_i P(X \vert \theta_i) \times P(\theta_i) \).
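A minimal Python sketch of the whole theorem, assuming a hypothetical coin-bias scenario with a discrete grid of \( \theta \) values (all numbers are illustrative assumptions):

```python
from math import comb

thetas = [0.2, 0.5, 0.8]                      # candidate coin biases
prior = {t: 1 / 3 for t in thetas}            # P(theta): uniform prior

def likelihood(theta, heads, flips):
    """P(X | theta): probability of `heads` heads in `flips` tosses."""
    return comb(flips, heads) * theta**heads * (1 - theta) ** (flips - heads)

heads, flips = 7, 10                          # observed data X

# Evidence P(X) = sum_i P(X | theta_i) P(theta_i)
evidence = sum(likelihood(t, heads, flips) * prior[t] for t in thetas)

# Posterior P(theta | X) = P(X | theta) P(theta) / P(X)
posterior = {t: likelihood(t, heads, flips) * prior[t] / evidence
             for t in thetas}
print(posterior)                              # most mass moves to theta = 0.8
```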
A Bayesian approach always requires a prior probability, which is assumed rather than tested. But instead of confirming or falsifying a hypothesis outright, a Bayesian adjusts the prior probability as new evidence arrives.
This theorem is important because it allows prior knowledge about an event to update the probability of that event. It is a game of degrees of belief:
- \( P(A) \), the prior, is the initial degree of belief in A.
- \( P(A \vert B) \), the posterior, is the degree of belief having accounted for B.
- The quotient \( \frac{P(B \vert A)}{P(B)} \) represents the support B provides for A.
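The "adjust the prior" loop can be made concrete. A minimal sketch, assuming a hypothetical sequence of coin flips, where each posterior becomes the prior for the next observation:

```python
# Initial degrees of belief in three hypothetical coin biases.
prior = {0.2: 1 / 3, 0.5: 1 / 3, 0.8: 1 / 3}

def update(prior, flip):
    """One Bayes update for a single coin flip (1 = heads, 0 = tails)."""
    lik = {t: (t if flip == 1 else 1 - t) for t in prior}
    evidence = sum(lik[t] * prior[t] for t in prior)
    return {t: lik[t] * prior[t] / evidence for t in prior}

belief = prior
for flip in [1, 1, 0, 1, 1]:      # a hypothetical observation sequence
    belief = update(belief, flip)
print(belief)                     # belief concentrates on the likelier bias
```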
Bayesian Networks
A model is not set in stone; it is a representation of how we believe the world works.
A Bayesian network is:
- A directed acyclic graph (DAG) G whose nodes represent the random variables \( X_1, \dots, X_n \).
- For each node \( X_i \), a CPD (Conditional Probability Distribution) \( P(X_i \vert Par_G(X_i)) \), where \( Par_G(X_i) \) denotes the parents of \( X_i \) in G.
Chain Rule for BNs
The BN represents a joint distribution via the chain rule for Bayesian networks:
\[ P(X_1, \dots, X_n) = \prod_{i=1}^{n} P(X_i \vert Par_G(X_i)) \]
which we can also say: “P factorizes over G”.
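A minimal sketch of this factorization on a hypothetical 3-node DAG (Rain → Sprinkler, Rain → Wet, Sprinkler → Wet; structure and CPD numbers are assumptions for illustration):

```python
P_rain = {0: 0.8, 1: 0.2}                       # P(Rain)
P_sprinkler = {0: {0: 0.6, 1: 0.4},             # P(Sprinkler | Rain)
               1: {0: 0.99, 1: 0.01}}
P_wet = {(0, 0): {0: 1.0, 1: 0.0},              # P(Wet | Rain, Sprinkler)
         (0, 1): {0: 0.1, 1: 0.9},
         (1, 0): {0: 0.2, 1: 0.8},
         (1, 1): {0: 0.01, 1: 0.99}}

def joint(r, s, w):
    """P(R, S, W) = P(R) P(S | R) P(W | R, S) -- the BN chain rule."""
    return P_rain[r] * P_sprinkler[r][s] * P_wet[(r, s)][w]

# The factorization defines a proper joint distribution: it sums to 1.
total = sum(joint(r, s, w) for r in (0, 1) for s in (0, 1) for w in (0, 1))
print(total)  # 1.0
```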
Flow of Probabilistic Influence
A trail \( X_1 - \dots - X_k \) is active if it has no V-structure \( X_{i-1} \to X_i \gets X_{i+1} \).
A trail \( X_1 - \dots - X_k \) is active given \( Z \) if:
- For any V-structure \( X_{i-1} \to X_i \gets X_{i+1} \), we have that \( X_i \) or one of its descendants \( \in Z \).
- No other \( X_i \) is in \( Z \).
If a non-collider random variable on the trail is observed, we say that it "blocks the trail".
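The V-structure behavior can be seen numerically. A minimal sketch, assuming a hypothetical XOR collider \( X \to Z \gets Y \): marginally X and Y are independent, but conditioning on the collider Z activates the trail:

```python
import random
random.seed(0)

# Z = X XOR Y, a deterministic collider of two fair binary variables.
samples = [(x, y, x ^ y)
           for _ in range(100_000)
           for x, y in [(random.randint(0, 1), random.randint(0, 1))]]

def p(event, given=lambda s: True):
    """Empirical conditional probability of `event` given `given`."""
    pool = [s for s in samples if given(s)]
    return sum(event(s) for s in pool) / len(pool)

# Z unobserved: the V-structure blocks the trail, X and Y look independent.
print(p(lambda s: s[0] == 1))                           # ~0.5
print(p(lambda s: s[0] == 1, lambda s: s[1] == 1))      # ~0.5

# Z observed: the trail is active, X is now determined by Y.
print(p(lambda s: s[0] == 1, lambda s: s[2] == 0 and s[1] == 1))  # 1.0
```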
D-separation
Definition: \( X \) and \( Y \) are d-separated in \( G \) given \( Z \) if there is no active trail in \( G \) between \( X \) and \( Y \) given \( Z \), denoted as follows:
\[ \text{d-sep} _ G(X, Y \vert Z) \]
Any node is d-separated from its non-descendants given its parents.
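A minimal sketch of a d-separation check that follows the definition directly: enumerate simple trails in the skeleton and test whether any is active given Z. The example DAG (D → G ← I, I → S, a student-style network) is an assumption for illustration:

```python
def descendants(G, v):
    """All descendants of v in DAG G (G maps node -> list of children)."""
    out, stack = set(), [v]
    while stack:
        for child in G[stack.pop()]:
            if child not in out:
                out.add(child)
                stack.append(child)
    return out

def trails(G, x, y, path=None):
    """Yield all simple trails from x to y, ignoring edge direction."""
    path = path or [x]
    if x == y:
        yield path
        return
    neighbors = set(G[x]) | {u for u in G if x in G[u]}
    for n in neighbors - set(path):
        yield from trails(G, n, y, path + [n])

def is_active(G, trail, Z):
    """Apply the activity rules above to every middle node of the trail."""
    for i in range(1, len(trail) - 1):
        a, b, c = trail[i - 1], trail[i], trail[i + 1]
        if b in G[a] and b in G[c]:                  # V-structure a -> b <- c
            if b not in Z and not (descendants(G, b) & Z):
                return False                         # unobserved collider blocks
        elif b in Z:
            return False                             # observed non-collider blocks
    return True

def d_separated(G, x, y, Z):
    return not any(is_active(G, t, Z) for t in trails(G, x, y))

# Hypothetical student-style DAG: D -> G <- I, I -> S.
G = {"D": ["G"], "I": ["G", "S"], "G": [], "S": []}
print(d_separated(G, "D", "I", set()))   # True: the collider G blocks D - G - I
print(d_separated(G, "D", "I", {"G"}))   # False: observing G activates the trail
print(d_separated(G, "D", "S", {"G"}))   # False: D - G - I - S is active given G
```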
I-maps
\[ I(G) = \{ (X \perp Y \vert Z) : \text{d-sep} _ G(X, Y \vert Z) \} \]
If P satisfies I(G), we say that G is an I-map (Independency map) of P.
2 Theorems
- If P factorizes over G, and \( \text{d-sep} _ G(X, Y \vert Z) \), then \( P \models (X \perp Y \vert Z) \).
- If G is an I-map for P, then P factorizes over G.
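The first theorem can be checked numerically on a hypothetical collider DAG D → G ← I: D and I are d-separated given the empty set, so any P that factorizes over this graph must satisfy \( D \perp I \). The CPD numbers below are assumptions:

```python
P_D = {0: 0.6, 1: 0.4}                                  # P(D)
P_I = {0: 0.7, 1: 0.3}                                  # P(I)
P_G = {(d, i): {0: 0.9 - 0.3 * d - 0.4 * i,             # P(G | D, I)
                1: 0.1 + 0.3 * d + 0.4 * i}
       for d in (0, 1) for i in (0, 1)}

# Joint built via the BN chain rule: P(D) P(I) P(G | D, I).
joint = {(d, i, g): P_D[d] * P_I[i] * P_G[(d, i)][g]
         for d in (0, 1) for i in (0, 1) for g in (0, 1)}

# Marginalizing out G, P(D, I) equals P(D) P(I): the d-separated
# pair is indeed independent in the factorized distribution.
for d in (0, 1):
    for i in (0, 1):
        marg = sum(joint[(d, i, g)] for g in (0, 1))
        assert abs(marg - P_D[d] * P_I[i]) < 1e-12
print("d-separation implies independence here")
```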
Factorization & I-Map Summary
There are 2 equivalent ways of viewing the graph:
- Factorization: G allows P to be represented compactly as a product of local CPDs.
- I-map: Independencies encoded by G hold in P.