The Backpropagation Algorithm(closed form)

Published: by Creative Commons Licence

Table of Contents




    Intro

    이번 포스트에는 딥러닝의 학습과정에 사용되는 Backpropagation Algorithm에 대해 적어보려 합니다. 예전에 처음으로 AI분야를 공부 시작했을때 아무리 B.P. 에 관한 설명을 찾아보아도 직관적으로 확 와닿는 closed form을 찾지 못해 답답하던 중, 직접 식을 서술하고 증명했었습니다. 이 내용을 기록차 남기려 합니다. 아무래도 algorithm 의 관점에서 조금 더 우아하게 서술할 수 있을것이라 생각이 들고 또 그런 관점에서 여러 library들이 작성되어 있을것이라 생각이 됩니다만, 아직 그런 우아한 설명은 찾아내지 못하였고.., 찾아 낸다면 새로운 포스트에 또 정리해볼 예정입니다.

    가장 간단한 형태의 nueral network, fully connected neural network를 구성하였고, activation은 sigmoid로 통일하였습니다. classification의 문제를 해결하는것으로 가정하고 entropy loss를 최소화 하는 세팅으로 구성하였습니다.

    기본 notation 및 기본 식

    충분히 기호가 많기 때문에 식을 전개할때, 특정 example을 지칭하는 $(i)$ 는 생략하도록 하였습니다.

    • $n$ : number of features.
    • $m$ : number of examples.
    • $L$ : total number of layers in the network.
    • $S_l$ : number of units in layer $l$ (not containing bias unit).
    • $\theta^{(l)}_{ij}$ : parameter from $j$-th unit of $l$-th layer to $i$-th unit of $(l+1)$-th layer.
    • $z^{(l)(i)}_x$ $(= z^{(l)}_x)$ : $x$-th unit of $l$-th layer for the $i$-th example.
    • $a^{(l)(i)}_x$ $(= a^{(l)}_x)$ : activation of $z^{(l)(i)}_x$.
    • $\delta^{(l)(i)}_x$ $(= \delta^{(l)}_x)$.
    • $h_{\theta}(x)$ : the Hypothesis function of data x with the parameter $\theta$.
    • $J(\theta)$ : the Cost function.
    \[\begin{aligned} z^{(l)}_x &= \theta^{(l-1)}_{x0}a^{(l-1)}_0 + \theta^{(l-1)}_{x1}a^{(l-1)}_1 + \cdots + \theta^{(l-1)}_{xS_{L-1}}a^{(l-1)}_{S_{l-1}}.\\ a^{(l)}_x &= g(z_x^{(l)}). \\ g(z) &= \frac{1}{1+e^{-z}}. \\ h_{\theta}(x) &= \Big(g(z_1^{(L)}),g(z_2^{(L)}),\cdots,g(z_{S_L}^{(L)})\Big). \\ (h_{\theta}(x))_t &= g(z_t^{(L)}). \\ J(\theta) &= - \frac{1}{m} \sum^m_{i=1}\sum^{S_L}_{t=1}\Big[ y_t \log(h_{\theta}(x)_t) + (1-y_t)\log(1-h_{\theta}(x)_t) \Big]. \\ y^{(i)}&= y \\ &= S_L \times 1 \text{ vector of i-th example that represent class which belongs to}. \\ y_t &= \begin{cases} 1 \text{ if i-th example is belongs to t-th class}\\ 0 \text{ otherwise.} \end{cases} \end{aligned}\]

    Architecture

    img

    Algorithm

    The Backpropagation Algorithm : objective function인 J의 미분을 각 neuron과 delta와의 간단 연산으로 구해낼 수 있다. 
    \[\frac{\partial}{\partial \theta^{(l)}_{kj}} J(\theta) = \frac{1}{m} \sum^m_{i=1}\Big[a^{(l)(i)}_j \delta^{(l+1)(i)}_k \Big] \text{,}\]

    where

    \[\begin{aligned} \delta^{(L)}_k &= a^{(L)}_k - y_k, \quad k=1 \cdots S_L ,\\ \delta^{(l)}_k &= g'(z^{(l)}_k) \sum^{S_{l+1}}_{j=1} \Big[ \delta^{(l+1)}_j \theta^{(l)}_{jk} \Big], \quad k = 1 \cdots S_l, l = 1 \cdots L-1 \end{aligned}\]

    Backpropagation Algorithm을 증명하기위해 두개의 Lemma를 활용하였습니다.

    • Lemma 1
    \[\frac{\partial}{\partial \theta^{(l)}_{kj}} J(\theta) = \frac{1}{m} \sum^m_{i=1}\sum^{S_L}_{t=1}\Big[ \frac{\partial z_t^{(L)}}{\partial \theta^{(l)}_{kj}} \delta_t^{(L)} \Big] ,\quad l = 1,\cdots,L-1\]
    • Lemma 2
    \[\sum_{t=1}^{S_L} \frac{\partial z_t^{(L)}}{\partial a^{(l)}_{x}} \delta^{(L)}_t = \theta_{1x}^{(l)}\delta_1^{(l+1)}+\theta_{2x}^{(l)}\delta_2^{(l+1)} + \cdots + \theta_{S_{l+1}x}^{(l)}\delta_{S_{l+1}}^{(l+1)},\quad l = 1,\cdots,L-1\]

    Proof

    Proof of Theorem(Backpropagation Algorithm)

    $l = L-1$ 일때,

    \[\begin{aligned} \frac{\partial}{\partial \theta^{(L-1)}_{kj}} J(\theta) &= \frac{1}{m} \sum^m_{i=1}\sum^{S_L}_{t=1}\Big[ \frac{\partial z_t^{(L)}}{\partial \theta^{(L-1)}_{kj}} \delta_t^{(L)} \Big], \text{ by Lemma 1} \\ &= \frac{1}{m} \sum^m_{i=1} \Big[a_j^{(L-1)} \delta_k^{(L)} \Big] \end{aligned}\]

    $l < L-1$ 일때,

    \[\begin{aligned} \frac{\partial}{\partial \theta^{(l)}_{kj}} J(\theta) &= \frac{1}{m} \sum^m_{i=1}\sum^{S_L}_{t=1}\Big[ \frac{\partial z_t^{(L)}}{\partial \theta^{(l)}_{kj}} \delta_t^{(L)} \Big], \text{ by Lemma 1} \\ &= \frac{1}{m} \sum^m_{i=1}\sum^{S_L}_{t=1}\Big[ \frac{\partial a_k^{(l+1)}}{\partial \theta^{(l)}_{kj}} \frac{\partial z_t^{(L)}}{\partial a_k^{(l+1)}} \delta_t^{(L)} \Big], \text{ by Chain rule} \\ &= \frac{1}{m} \sum^m_{i=1}\frac{\partial a_k^{(l+1)}}{\partial \theta^{(l)}_{kj}} \Big[\sum^{S_L}_{t=1}\Big[ \frac{\partial z_t^{(L)}}{\partial a_k^{(l+1)}} \delta_t^{(L)} \Big]\Big] \\ &= \frac{1}{m} \sum^m_{i=1}\frac{\partial a_k^{(l+1)}}{\partial \theta^{(l)}_{kj}} \Big[\theta_{1k}^{(l+1)}\delta_1^{(l+2)}+\theta_{2k}^{(l+1)}\delta_2^{(l+2)} + \cdots + \theta_{S_{l+2}k}^{(l+1)}\delta_{S_{l+2}}^{(l+2)} \Big], \text{ by Lemma 2} \\ &= \frac{1}{m} \sum^m_{i=1} a_j^{(l)}g'(z^{(l+1)}_k) \Big[\theta_{1k}^{(l+1)}\delta_1^{(l+2)}+\theta_{2k}^{(l+1)}\delta_2^{(l+2)} + \cdots + \theta_{S_{l+2}k}^{(l+1)}\delta_{S_{l+2}}^{(l+2)} \Big] \\ &= \frac{1}{m} \sum^m_{i=1} \Big[a_j^{(l)} \delta_k^{(l+1)} \Big] , \text{ by Definition of delta} \end{aligned}\]

    Proof of Lemma 1

    $J(θ)$의 정의로부터 계산을 해 나가면 아래와 같이 된다.

    \[\begin{aligned} J(\theta) &= - \frac{1}{m} \sum^m_{i=1}\sum^{S_L}_{t=1}\Big[ y_t \log(h_{\theta}(x)_t) + (1-y_t)\log(1-h_{\theta}(x)_t) \Big] \\ &=- \frac{1}{m} \sum^m_{i=1}\sum^{S_L}_{t=1}\Big[ y_t \log( \frac{1}{1+e^{-z_t^{(L)}}} ) + (1-y_t)\log(1-\frac{1}{1+e^{-z_t^{(L)}}}) \Big] \\ &=- \frac{1}{m} \sum^m_{i=1}\sum^{S_L}_{t=1}\Big[ y_t \log( \frac{1}{1+e^{-z_t^{(L)}}} ) + (1-y_t)\log(\frac{e^{-z_t^{(L)}}}{1+e^{-z_t^{(L)}}}) \Big] \\ &=- \frac{1}{m} \sum^m_{i=1}\sum^{S_L}_{t=1}\Big[ -y_t \log( 1+e^{-z_t^{(L)}} ) + (1-y_t) (-z_t^{(L)} - \log(1+e^{-z_t^{(L)}})) \Big] \\ &=- \frac{1}{m} \sum^m_{i=1}\sum^{S_L}_{t=1}\Big[ y_t z_t^{(L)} - z_t^{(L)} - \log(1+e^{-z_t^{(L)}}) \Big] \\ &=- \frac{1}{m} \sum^m_{i=1}\sum^{S_L}_{t=1}\Big[ y_t z_t^{(L)} - \log(1+e^{z_t^{(L)}}) \Big] \\ \end{aligned}\]

    따라서,

    \[\begin{aligned} \frac{\partial}{\partial \theta^{(l)}_{kj}} J(\theta) &= - \frac{\partial}{\partial \theta^{(l)}_{kj}}\frac{1}{m} \sum^m_{i=1}\sum^{S_L}_{t=1}\Big[ y_t z_t^{(L)} - \log(1+e^{z_t^{(L)}}) \Big] \\ &= - \frac{1}{m} \sum^m_{i=1}\sum^{S_L}_{t=1}\Big[ y_t \frac{\partial}{\partial \theta^{(l)}_{kj}}z_t^{(L)} - \frac{\partial}{\partial \theta^{(l)}_{kj}}\log(1+e^{z_t^{(L)}}) \Big] \\ &= - \frac{1}{m} \sum^m_{i=1}\sum^{S_L}_{t=1}\Big[ y_t \frac{\partial}{\partial \theta^{(l)}_{kj}}z_t^{(L)} - \frac{\partial}{\partial \theta^{(l)}_{kj}} [z_t^{(L)}] \frac{e^{z_t^{(L)}}}{1+e^{z_t^{(L)}}} \Big] \\ &= - \frac{1}{m} \sum^m_{i=1}\sum^{S_L}_{t=1}\Big[ y_t \frac{\partial}{\partial \theta^{(l)}_{kj}}z_t^{(L)} - \frac{\partial}{\partial \theta^{(l)}_{kj}} [z_t^{(L)}] \frac{1}{1+e^{-z_t^{(L)}}} \Big] \\ &= - \frac{1}{m} \sum^m_{i=1}\sum^{S_L}_{t=1}\Big[ y_t \frac{\partial}{\partial \theta^{(l)}_{kj}}z_t^{(L)} - \frac{\partial}{\partial \theta^{(l)}_{kj}} [z_t^{(L)}] a^{(L)}_t \Big] \\ &= - \frac{1}{m} \sum^m_{i=1}\sum^{S_L}_{t=1}\Big[ \frac{\partial}{\partial \theta^{(l)}_{kj}} [z_t^{(L)}](y_t - a_t^{(L)})\Big] \\ &= \frac{1}{m} \sum^m_{i=1}\sum^{S_L}_{t=1}\Big[ \frac{\partial z_t^{(L)}}{\partial \theta^{(l)}_{kj}} \delta_t^{(L)} \Big] \end{aligned}\]

    Proof of Lemma 2

    $l$에 대한 귀납법을 이용하여 위 Lemma를 증명하도록 하자.
    우리는 간단한 계산을 통해 $l = L-1$ 일때 위 식이 성립함을 확인 할 수 있다.

    \[\begin{aligned} \frac{\partial z_t^{(L)}}{\partial a^{(L-1)}_{x}} \delta^{(L)}_t &= \frac{\partial}{\partial a^{(L-1)}_{x}} \Big(z_t^{(L)}\Big) \delta^{(L)}_t\\ &=\frac{\partial}{\partial a^{(L-1)}_{x}} \Big( a_0^{(L-1)}\theta_{t0}^{(L-1)} + a_1^{(L-1)}\theta_{t1}^{(L-1)} + \cdots + a_{S_{L-1}}^{(L-1)}\theta_{tS_{L-1}}^{(L-1)} \Big) \delta^{(L)}_t\\ &=\theta_{tx}^{(L-1)} \delta^{(L)}_t.\\ \end{aligned}\]

    따라서,

    \[\sum_{t=1}^{S_L} \frac{\partial z_t^{(L)}}{\partial a^{(L-1)}_{x}} \delta^{(L)}_t = \theta_{1x}^{(L-1)}\delta_1^{(L)}+\theta_{2x}^{(L-1)}\delta_2^{(L)} + \cdots + \theta_{S_{L}x}^{(L-1)}\delta_{S_{L}}^{(L)}.\]

    $l=p$일 때 성립함을 가정하면,

    \[\sum_{t=1}^{S_L} \frac{\partial z_t^{(L)}}{\partial a^{(p)}_{x}} \delta^{(L)}_t = \theta_{1x}^{(p)}\delta_1^{(p+1)}+\theta_{2x}^{(p)}\delta_2^{(p+1)} + \cdots + \theta_{S_{p+1}x}^{(p)}\delta_{S_{p+1}}^{(p+1)}.\]

    다음과 같이 $l=p-1$ 일 때 성립함을 보일 수 있다.

    \[\begin{aligned} \sum_{t=1}^{S_L} \frac{\partial z_t^{(L)}}{\partial a^{(p-1)}_{x}} \delta^{(L)}_t &= \sum_{t=1}^{S_L}\Big[ \frac{\partial a_1^{(p)}}{\partial a^{(p-1)}_{x}} \frac{\partial z_t^{(L)}}{\partial a^{(p)}_{1}}\delta^{(L)}_t \Big] + \sum_{t=1}^{S_L}\Big[ \frac{\partial a_2^{(p)}}{\partial a^{(p-1)}_{x}} \frac{\partial z_t^{(L)}}{\partial a^{(p)}_{2}}\delta^{(L)}_t \Big] + \cdots + \sum_{t=1}^{S_L}\Big[ \frac{\partial a_{S_p}^{(p)}}{\partial a^{(p-1)}_{x}} \frac{\partial z_t^{(L)}}{\partial a^{(p)}_{S_p}}\delta^{(L)}_t \Big] \\ &= \frac{\partial a_1^{(p)}}{\partial a^{(p-1)}_{x}} \Big[ \theta_{11}^{(p)}\delta_1^{(p+1)}+\theta_{21}^{(p)}\delta_2^{(p+1)} + \cdots + \theta_{S_{p+1}1}^{(p)}\delta_{S_{p+1}}^{(p+1)} \Big] + \\ &\phantom{000}\frac{\partial a_2^{(p)}}{\partial a^{(p-1)}_{x}} \Big[ \theta_{12}^{(p)}\delta_1^{(p+1)}+\theta_{22}^{(p)}\delta_2^{(p+1)} + \cdots + \theta_{S_{p+1}2}^{(p)}\delta_{S_{p+1}}^{(p+1)} \Big] + \\ &\phantom{000} \cdots + \frac{\partial a_{S_p}^{(p)}}{\partial a^{(p-1)}_{x}} \Big[ \theta_{1S_p}^{(p)}\delta_1^{(p+1)}+\theta_{2S_p}^{(p)}\delta_2^{(p+1)} + \cdots + \theta_{S_{p+1}S_p}^{(p)}\delta_{S_{p+1}}^{(p+1)} \Big] \\ &= \theta^{(p-1)}_{1x} g'(z_1^{(p)}) \Big[ \theta_{11}^{(p)}\delta_1^{(p+1)}+\theta_{21}^{(p)}\delta_2^{(p+1)} + \cdots + \theta_{S_{p+1}1}^{(p)}\delta_{S_{p+1}}^{(p+1)} \Big] + \\ &\phantom{000} \theta^{(p-1)}_{2x} g'(z_2^{(p)}) \Big[ \theta_{12}^{(p)}\delta_1^{(p+1)}+\theta_{22}^{(p)}\delta_2^{(p+1)} + \cdots + \theta_{S_{p+1}2}^{(p)}\delta_{S_{p+1}}^{(p+1)} \Big] + \\ &\phantom{000} \cdots + \theta^{(p-1)}_{S_px} g'(z_{S_p}^{(p)}) \Big[ \theta_{1S_p}^{(p)}\delta_1^{(p+1)}+\theta_{2S_p}^{(p)}\delta_2^{(p+1)} + \cdots + \theta_{S_{p+1}S_p}^{(p)}\delta_{S_{p+1}}^{(p+1)} \Big] \\ &= \theta_{1x}^{(p-1)}\delta_1^{(p)}+\theta_{2x}^{(p-1)}\delta_2^{(p)} + \cdots + \theta_{S_{p}x}^{(p-1)}\delta_{S_{p}}^{(p)}. \end{aligned}\]