The Backpropagation Algorithm (Closed Form)
Intro
In this post, I would like to write about the Backpropagation Algorithm used in training deep neural networks. When I first started studying AI, no matter how many explanations of backpropagation I looked up, I could not find a closed form that felt intuitively clear, so out of frustration I wrote down and proved the formula myself. I am leaving that derivation here as a record. I suspect the algorithm can be stated more elegantly from a purely algorithmic point of view, and that many libraries are written from that perspective, but I have not yet found such an elegant exposition; if I do, I will write it up in a new post.
The network is the simplest possible kind, a fully connected neural network, with the sigmoid used as the activation everywhere. We assume a classification problem and set things up to minimize the cross-entropy loss.
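For reference, the sigmoid activation $g$ and its derivative, which the derivation uses implicitly whenever $g'$ appears, are
\[g(z) = \frac{1}{1+e^{-z}}, \qquad g'(z) = g(z)\big(1-g(z)\big).\]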
Basic Notation and Equations
Since there are already plenty of symbols, the superscript $(i)$ designating a particular example is omitted when expanding equations.
- $n$ : number of features.
- $m$ : number of examples.
- $L$ : total number of layers in the network.
- $S_l$ : number of units in layer $l$ (not counting the bias unit).
- $\theta^{(l)}_{ij}$ : parameter from $j$-th unit of $l$-th layer to $i$-th unit of $(l+1)$-th layer.
- $z^{(l)(i)}_x$ $(= z^{(l)}_x)$ : $x$-th unit of $l$-th layer for the $i$-th example.
- $a^{(l)(i)}_x$ $(= a^{(l)}_x)$ : activation of $z^{(l)(i)}_x$.
- $\delta^{(l)(i)}_x$ $(= \delta^{(l)}_x)$ : the error term associated with $z^{(l)(i)}_x$, defined in the Algorithm section below.
- $h_{\theta}(x)$ : the Hypothesis function of data $x$ with the parameter $\theta$.
- $J(\theta)$ : the Cost function.
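To make the notation concrete, the following is a minimal NumPy sketch of forward propagation under this setup. It is an illustration only: the names (`forward`, `thetas`) are my own, bias units are omitted for brevity, and the Python list index $l$ corresponds to the 1-indexed layer $l+1$, so `thetas[l]` has shape $(S_{l+2}, S_{l+1})$ and `thetas[l][k, j]` plays the role of $\theta^{(l+1)}_{kj}$.

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, thetas):
    """Forward pass of a fully connected sigmoid network.

    x      : input vector a^{(1)}, shape (S_1,)
    thetas : list of weight matrices, thetas[l] of shape (S_{l+2}, S_{l+1})
    Returns the pre-activations z^{(2)}, ..., z^{(L)} and the
    activations a^{(1)}, ..., a^{(L)}.
    """
    a = x
    zs, activations = [], [a]
    for theta in thetas:
        z = theta @ a       # z_k = sum_j theta_{kj} a_j
        a = sigmoid(z)      # a = g(z)
        zs.append(z)
        activations.append(a)
    return zs, activations
```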
Architecture
Algorithm
\[\frac{\partial}{\partial \theta^{(l)}_{kj}} J(\theta) = \frac{1}{m} \sum^m_{i=1}\Big[a^{(l)(i)}_j \delta^{(l+1)(i)}_k \Big] \text{,}\]
where
\[\begin{aligned} \delta^{(L)}_k &= a^{(L)}_k - y_k, \quad k=1 \cdots S_L ,\\ \delta^{(l)}_k &= g'(z^{(l)}_k) \sum^{S_{l+1}}_{j=1} \Big[ \delta^{(l+1)}_j \theta^{(l)}_{jk} \Big], \quad k = 1 \cdots S_l, l = 1 \cdots L-1 \end{aligned}\]
To prove the Backpropagation Algorithm, we make use of the following two lemmas.
- Lemma 1 : for every $l$, $k$, $j$, $\displaystyle \frac{\partial}{\partial \theta^{(l)}_{kj}} J(\theta) = \frac{1}{m} \sum^m_{i=1}\sum^{S_L}_{t=1}\Big[ \frac{\partial z_t^{(L)}}{\partial \theta^{(l)}_{kj}} \delta_t^{(L)} \Big]$.
- Lemma 2 : for $l = 1 \cdots L-1$ and $x = 1 \cdots S_l$, $\displaystyle \sum^{S_L}_{t=1} \frac{\partial z_t^{(L)}}{\partial a^{(l)}_{x}} \delta^{(L)}_t = \sum^{S_{l+1}}_{j=1} \theta^{(l)}_{jx}\delta^{(l+1)}_j$.
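Before turning to the proofs, it may help to see the theorem as code. Continuing the hypothetical `forward` sketch above (same assumed shapes, one example at a time), the recursion for $\delta$ and the gradient formula become the familiar backward pass:

```python
def backward(y, thetas, activations):
    """Per-example closed-form gradients (before the 1/m average).

    grads[l][k, j] is the theorem's term a_j * delta_k for the
    weight matrix thetas[l].
    """
    delta = activations[-1] - y                      # delta^{(L)} = a^{(L)} - y
    grads = [None] * len(thetas)
    for l in reversed(range(len(thetas))):
        grads[l] = np.outer(delta, activations[l])   # a_j * delta_k
        if l > 0:
            a = activations[l]                       # a = g(z), so g'(z) = a(1-a)
            delta = (a * (1 - a)) * (thetas[l].T @ delta)
    return grads
```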
Proof
Proof of Theorem (Backpropagation Algorithm)
When $l = L-1$,
\[\begin{aligned} \frac{\partial}{\partial \theta^{(L-1)}_{kj}} J(\theta) &= \frac{1}{m} \sum^m_{i=1}\sum^{S_L}_{t=1}\Big[ \frac{\partial z_t^{(L)}}{\partial \theta^{(L-1)}_{kj}} \delta_t^{(L)} \Big], \text{ by Lemma 1} \\ &= \frac{1}{m} \sum^m_{i=1} \Big[a_j^{(L-1)} \delta_k^{(L)} \Big] \end{aligned}\]
where the second equality holds because $z_t^{(L)} = \sum_{s} \theta^{(L-1)}_{ts} a^{(L-1)}_s$ depends on $\theta^{(L-1)}_{kj}$ only when $t = k$, with derivative $a^{(L-1)}_j$.
When $l < L-1$,
\[\begin{aligned} \frac{\partial}{\partial \theta^{(l)}_{kj}} J(\theta) &= \frac{1}{m} \sum^m_{i=1}\sum^{S_L}_{t=1}\Big[ \frac{\partial z_t^{(L)}}{\partial \theta^{(l)}_{kj}} \delta_t^{(L)} \Big], \text{ by Lemma 1} \\ &= \frac{1}{m} \sum^m_{i=1}\sum^{S_L}_{t=1}\Big[ \frac{\partial a_k^{(l+1)}}{\partial \theta^{(l)}_{kj}} \frac{\partial z_t^{(L)}}{\partial a_k^{(l+1)}} \delta_t^{(L)} \Big], \text{ by Chain rule} \\ &= \frac{1}{m} \sum^m_{i=1}\frac{\partial a_k^{(l+1)}}{\partial \theta^{(l)}_{kj}} \Big[\sum^{S_L}_{t=1}\Big[ \frac{\partial z_t^{(L)}}{\partial a_k^{(l+1)}} \delta_t^{(L)} \Big]\Big] \\ &= \frac{1}{m} \sum^m_{i=1}\frac{\partial a_k^{(l+1)}}{\partial \theta^{(l)}_{kj}} \Big[\theta_{1k}^{(l+1)}\delta_1^{(l+2)}+\theta_{2k}^{(l+1)}\delta_2^{(l+2)} + \cdots + \theta_{S_{l+2}k}^{(l+1)}\delta_{S_{l+2}}^{(l+2)} \Big], \text{ by Lemma 2} \\ &= \frac{1}{m} \sum^m_{i=1} a_j^{(l)}g'(z^{(l+1)}_k) \Big[\theta_{1k}^{(l+1)}\delta_1^{(l+2)}+\theta_{2k}^{(l+1)}\delta_2^{(l+2)} + \cdots + \theta_{S_{l+2}k}^{(l+1)}\delta_{S_{l+2}}^{(l+2)} \Big] \\ &= \frac{1}{m} \sum^m_{i=1} \Big[a_j^{(l)} \delta_k^{(l+1)} \Big] , \text{ by Definition of delta} \end{aligned}\]
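As a sanity check on the theorem (my own addition, not part of the proof), the closed-form gradient can be compared against a central-difference approximation of $J(\theta)$ for a single example ($m = 1$), reusing the hypothetical `forward` and `backward` sketches above:

```python
def loss(x, y, thetas):
    # J(theta) for a single example: cross-entropy over the output units
    _, acts = forward(x, thetas)
    h = acts[-1]
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

rng = np.random.default_rng(0)
thetas = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]  # S = (3, 4, 2)
x, y = rng.normal(size=3), np.array([1.0, 0.0])

_, acts = forward(x, thetas)
analytic = backward(y, thetas, acts)

eps, (l, k, j) = 1e-6, (0, 2, 1)    # probe one arbitrary parameter
thetas[l][k, j] += eps
J_plus = loss(x, y, thetas)
thetas[l][k, j] -= 2 * eps
J_minus = loss(x, y, thetas)
thetas[l][k, j] += eps              # restore the original value

print((J_plus - J_minus) / (2 * eps), analytic[l][k, j])  # should match closely
```

Proof of Lemma 1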
Computing from the definition of $J(\theta)$, we obtain the following.
\[\begin{aligned} J(\theta) &= - \frac{1}{m} \sum^m_{i=1}\sum^{S_L}_{t=1}\Big[ y_t \log(h_{\theta}(x)_t) + (1-y_t)\log(1-h_{\theta}(x)_t) \Big] \\ &=- \frac{1}{m} \sum^m_{i=1}\sum^{S_L}_{t=1}\Big[ y_t \log( \frac{1}{1+e^{-z_t^{(L)}}} ) + (1-y_t)\log(1-\frac{1}{1+e^{-z_t^{(L)}}}) \Big] \\ &=- \frac{1}{m} \sum^m_{i=1}\sum^{S_L}_{t=1}\Big[ y_t \log( \frac{1}{1+e^{-z_t^{(L)}}} ) + (1-y_t)\log(\frac{e^{-z_t^{(L)}}}{1+e^{-z_t^{(L)}}}) \Big] \\ &=- \frac{1}{m} \sum^m_{i=1}\sum^{S_L}_{t=1}\Big[ -y_t \log( 1+e^{-z_t^{(L)}} ) + (1-y_t) (-z_t^{(L)} - \log(1+e^{-z_t^{(L)}})) \Big] \\ &=- \frac{1}{m} \sum^m_{i=1}\sum^{S_L}_{t=1}\Big[ y_t z_t^{(L)} - z_t^{(L)} - \log(1+e^{-z_t^{(L)}}) \Big] \\ &=- \frac{1}{m} \sum^m_{i=1}\sum^{S_L}_{t=1}\Big[ y_t z_t^{(L)} - \log(1+e^{z_t^{(L)}}) \Big] \\ \end{aligned}\]
Therefore,
\[\begin{aligned} \frac{\partial}{\partial \theta^{(l)}_{kj}} J(\theta) &= - \frac{\partial}{\partial \theta^{(l)}_{kj}}\frac{1}{m} \sum^m_{i=1}\sum^{S_L}_{t=1}\Big[ y_t z_t^{(L)} - \log(1+e^{z_t^{(L)}}) \Big] \\ &= - \frac{1}{m} \sum^m_{i=1}\sum^{S_L}_{t=1}\Big[ y_t \frac{\partial}{\partial \theta^{(l)}_{kj}}z_t^{(L)} - \frac{\partial}{\partial \theta^{(l)}_{kj}}\log(1+e^{z_t^{(L)}}) \Big] \\ &= - \frac{1}{m} \sum^m_{i=1}\sum^{S_L}_{t=1}\Big[ y_t \frac{\partial}{\partial \theta^{(l)}_{kj}}z_t^{(L)} - \frac{\partial}{\partial \theta^{(l)}_{kj}} [z_t^{(L)}] \frac{e^{z_t^{(L)}}}{1+e^{z_t^{(L)}}} \Big] \\ &= - \frac{1}{m} \sum^m_{i=1}\sum^{S_L}_{t=1}\Big[ y_t \frac{\partial}{\partial \theta^{(l)}_{kj}}z_t^{(L)} - \frac{\partial}{\partial \theta^{(l)}_{kj}} [z_t^{(L)}] \frac{1}{1+e^{-z_t^{(L)}}} \Big] \\ &= - \frac{1}{m} \sum^m_{i=1}\sum^{S_L}_{t=1}\Big[ y_t \frac{\partial}{\partial \theta^{(l)}_{kj}}z_t^{(L)} - \frac{\partial}{\partial \theta^{(l)}_{kj}} [z_t^{(L)}] a^{(L)}_t \Big] \\ &= - \frac{1}{m} \sum^m_{i=1}\sum^{S_L}_{t=1}\Big[ \frac{\partial}{\partial \theta^{(l)}_{kj}} [z_t^{(L)}](y_t - a_t^{(L)})\Big] \\ &= \frac{1}{m} \sum^m_{i=1}\sum^{S_L}_{t=1}\Big[ \frac{\partial z_t^{(L)}}{\partial \theta^{(l)}_{kj}} \delta_t^{(L)} \Big] \end{aligned}\]
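A side note, my own observation rather than part of the argument: the intermediate form $J(\theta) = -\frac{1}{m}\sum_i\sum_t \big[ y_t z_t^{(L)} - \log(1+e^{z_t^{(L)}}) \big]$ reached in this proof is also the numerically safer way to evaluate the loss, since it never takes $\log$ of a sigmoid output that has rounded to $0$ or $1$. A small sketch of the equivalence (function names are hypothetical):

```python
def loss_naive(z, y):
    # -[y log g(z) + (1-y) log(1 - g(z))], via the sigmoid output
    h = sigmoid(z)
    return -(y * np.log(h) + (1 - y) * np.log(1 - h))

def loss_from_logits(z, y):
    # -[y z - log(1 + e^z)], with log(1 + e^z) evaluated stably
    return -(y * z - np.logaddexp(0.0, z))

z = np.array([-2.0, 0.5, 40.0])
y = np.array([0.0, 1.0, 0.0])
print(loss_naive(z, y))        # [0.1269, 0.4741, inf]: log(1 - g(40)) blows up
print(loss_from_logits(z, y))  # [0.1269, 0.4741, 40.0]: same loss, no blow-up
```

Proof of Lemma 2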
We prove the lemma by induction on $l$, working downward from the output layer.
When $l = L-1$, a simple computation confirms the claim: since $z_t^{(L)} = \sum_{s} \theta^{(L-1)}_{ts} a^{(L-1)}_s$, each derivative $\frac{\partial z_t^{(L)}}{\partial a^{(L-1)}_{x}}$ is simply $\theta^{(L-1)}_{tx}$. Therefore,
\[\sum_{t=1}^{S_L} \frac{\partial z_t^{(L)}}{\partial a^{(L-1)}_{x}} \delta^{(L)}_t = \theta_{1x}^{(L-1)}\delta_1^{(L)}+\theta_{2x}^{(L-1)}\delta_2^{(L)} + \cdots + \theta_{S_{L}x}^{(L-1)}\delta_{S_{L}}^{(L)}.\]
Now suppose the statement holds for $l = p$:
\[\sum_{t=1}^{S_L} \frac{\partial z_t^{(L)}}{\partial a^{(p)}_{x}} \delta^{(L)}_t = \theta_{1x}^{(p)}\delta_1^{(p+1)}+\theta_{2x}^{(p)}\delta_2^{(p+1)} + \cdots + \theta_{S_{p+1}x}^{(p)}\delta_{S_{p+1}}^{(p+1)}.\]
Then the statement also holds for $l = p-1$, as the following computation shows.
\[\begin{aligned} \sum_{t=1}^{S_L} \frac{\partial z_t^{(L)}}{\partial a^{(p-1)}_{x}} \delta^{(L)}_t &= \sum_{t=1}^{S_L}\Big[ \frac{\partial a_1^{(p)}}{\partial a^{(p-1)}_{x}} \frac{\partial z_t^{(L)}}{\partial a^{(p)}_{1}}\delta^{(L)}_t \Big] + \sum_{t=1}^{S_L}\Big[ \frac{\partial a_2^{(p)}}{\partial a^{(p-1)}_{x}} \frac{\partial z_t^{(L)}}{\partial a^{(p)}_{2}}\delta^{(L)}_t \Big] + \cdots + \sum_{t=1}^{S_L}\Big[ \frac{\partial a_{S_p}^{(p)}}{\partial a^{(p-1)}_{x}} \frac{\partial z_t^{(L)}}{\partial a^{(p)}_{S_p}}\delta^{(L)}_t \Big] \\ &= \frac{\partial a_1^{(p)}}{\partial a^{(p-1)}_{x}} \Big[ \theta_{11}^{(p)}\delta_1^{(p+1)}+\theta_{21}^{(p)}\delta_2^{(p+1)} + \cdots + \theta_{S_{p+1}1}^{(p)}\delta_{S_{p+1}}^{(p+1)} \Big] + \\ &\phantom{000}\frac{\partial a_2^{(p)}}{\partial a^{(p-1)}_{x}} \Big[ \theta_{12}^{(p)}\delta_1^{(p+1)}+\theta_{22}^{(p)}\delta_2^{(p+1)} + \cdots + \theta_{S_{p+1}2}^{(p)}\delta_{S_{p+1}}^{(p+1)} \Big] + \\ &\phantom{000} \cdots + \frac{\partial a_{S_p}^{(p)}}{\partial a^{(p-1)}_{x}} \Big[ \theta_{1S_p}^{(p)}\delta_1^{(p+1)}+\theta_{2S_p}^{(p)}\delta_2^{(p+1)} + \cdots + \theta_{S_{p+1}S_p}^{(p)}\delta_{S_{p+1}}^{(p+1)} \Big] \\ &= \theta^{(p-1)}_{1x} g'(z_1^{(p)}) \Big[ \theta_{11}^{(p)}\delta_1^{(p+1)}+\theta_{21}^{(p)}\delta_2^{(p+1)} + \cdots + \theta_{S_{p+1}1}^{(p)}\delta_{S_{p+1}}^{(p+1)} \Big] + \\ &\phantom{000} \theta^{(p-1)}_{2x} g'(z_2^{(p)}) \Big[ \theta_{12}^{(p)}\delta_1^{(p+1)}+\theta_{22}^{(p)}\delta_2^{(p+1)} + \cdots + \theta_{S_{p+1}2}^{(p)}\delta_{S_{p+1}}^{(p+1)} \Big] + \\ &\phantom{000} \cdots + \theta^{(p-1)}_{S_px} g'(z_{S_p}^{(p)}) \Big[ \theta_{1S_p}^{(p)}\delta_1^{(p+1)}+\theta_{2S_p}^{(p)}\delta_2^{(p+1)} + \cdots + \theta_{S_{p+1}S_p}^{(p)}\delta_{S_{p+1}}^{(p+1)} \Big] \\ &= \theta_{1x}^{(p-1)}\delta_1^{(p)}+\theta_{2x}^{(p-1)}\delta_2^{(p)} + \cdots + \theta_{S_{p}x}^{(p-1)}\delta_{S_{p}}^{(p)}. \end{aligned}\]
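Finally, although the derivation above is written element-wise, the entire algorithm collapses into matrix form, which is presumably closer to how libraries implement it. This restatement is my own, with $\odot$ denoting the element-wise product:
\[\begin{aligned} \delta^{(L)} &= a^{(L)} - y, \\ \delta^{(l)} &= g'(z^{(l)}) \odot \Big( \big(\theta^{(l)}\big)^{\top} \delta^{(l+1)} \Big), \\ \frac{\partial J}{\partial \theta^{(l)}} &= \frac{1}{m} \sum^m_{i=1} \delta^{(l+1)(i)} \big( a^{(l)(i)} \big)^{\top}. \end{aligned}\]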