Paper Review

Understanding the difficulty of training deep feedforward neural networks

2022. 12. 28. 16:00
Table of Contents
  1. [Paper Review]
  2. ABSTRACT
  3. Deep Neural Networks
  4. Experimental Setting and Datasets
  5. Effect of Activation Functions and Saturation During Training
  6. Studying Gradients and their Propagation

[Paper Review]


ABSTRACT

  • Why does standard gradient descent from random initialization perform poorly on deep neural networks?
  • The logistic sigmoid activation, combined with random initialization, is ill-suited to deep networks because of its non-zero mean.
    • It drives the top hidden layers into saturation.
  • The paper introduces a new initialization scheme that brings substantially faster convergence.

Deep Neural Networks

  • Deep learning aims to learn feature hierarchies from the data.
    • Features at higher levels of the hierarchy are formed by composing lower-level features.
  • Learning complicated functions requires representing high-level abstractions, and deep architectures are one route to such representations.
  • Most recent deep architectures work much better with unsupervised pre-training than with plain random initialization and gradient-based optimization.
    • Unsupervised pre-training acts as a kind of regularizer that places the parameters in a better region for the optimization procedure.
    • The local minima reached from this initialization are associated with better generalization.
  • Rather than focusing on bringing unsupervised pre-training or semi-supervised criteria into deep architectures, the paper analyzes what can go wrong when training classical but deep multi-layer neural networks.
  • Activations and gradients at each layer are monitored and analyzed during training, and the choices of activation function and initialization are then evaluated (a monitoring sketch follows below).
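
As an illustration of this kind of per-layer monitoring (my own sketch, not the paper's code), forward hooks and gradient hooks in PyTorch can record activation and gradient statistics; the toy model, the layer selection, and the `stats` container below are all illustrative assumptions.

```python
import torch
import torch.nn as nn

# Toy 2-hidden-layer tanh model, only to demonstrate the hooks.
model = nn.Sequential(
    nn.Linear(784, 1000), nn.Tanh(),
    nn.Linear(1000, 1000), nn.Tanh(),
    nn.Linear(1000, 10),
)

stats = {}  # hypothetical container: {layer_name: {"act_mean", "act_std", "grad_std"}}

def make_hook(name):
    def hook(module, inputs, output):
        stats[name] = {"act_mean": output.mean().item(),
                       "act_std": output.std().item()}
        # Runs when the gradient w.r.t. this activation arrives during backprop.
        output.register_hook(lambda g: stats[name].update(grad_std=g.std().item()))
    return hook

for name, module in model.named_modules():
    if isinstance(module, nn.Tanh):
        module.register_forward_hook(make_hook(name))

x = torch.randn(10, 784)                       # mini-batch of 10 random inputs
model(x).logsumexp(dim=1).mean().backward()    # dummy loss, only to trigger backprop
print(stats)
```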

Experimental Setting and Datasets

 

Online Learning on an Infinite Dataset

  • Recent results with deep architectures show that, with very large training sets or online learning, initialization from unsupervised pre-training yields a substantial improvement that does not vanish as the number of training examples grows.
    • The online setting emphasizes the optimization problem rather than the regularization effect of a small sample.

Fig 1. Dataset

  • Images were sampled from the dataset, each containing one or two objects subject to placement constraints.
  • The task has 9 classes in total.

Finite Datasets

  • MNIST digits
  • CIFAR-10, with 10,000 of its images held out as a validation set
  • Small-ImageNet

Experimental Setting

  • Feed-forward neural networks with 1 to 5 hidden layers were optimized.
    • Each hidden layer has 1,000 hidden units.
    • The output layer is a softmax logistic regression.
    • The cost function is the negative log-likelihood ($-\log P(y|x)$).
    • Optimization uses stochastic back-propagation with mini-batches of size 10.
  • The non-linear activation function of the hidden layers is varied:
    • sigmoid, tanh, and the newly proposed softsign ($x / (1 + |x|)$).
  • The best hyperparameters were searched separately for each model. The best depth was always 5 for every activation function except the sigmoid, for which it was 4.
  • Biases are initialized to 0, and the weights of each layer with the commonly used heuristic $W \sim U[-1/\sqrt{n}, 1/\sqrt{n}]$, where $n$ is the size of the previous layer (see the sketch below).
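
A rough PyTorch sketch of this setup (not the authors' code): the input size of 784 is an MNIST-like placeholder, the learning rate is arbitrary, and the training loop is reduced to a single step.

```python
import math
import torch
import torch.nn as nn

# Activations compared in the paper: nn.Sigmoid, nn.Tanh, and softsign
# x / (1 + |x|), which PyTorch exposes as nn.Softsign.

def standard_init_(linear: nn.Linear) -> None:
    """Commonly used heuristic: W ~ U[-1/sqrt(n), 1/sqrt(n)], biases = 0,
    where n is the fan-in (size of the previous layer)."""
    bound = 1.0 / math.sqrt(linear.in_features)
    nn.init.uniform_(linear.weight, -bound, bound)
    nn.init.zeros_(linear.bias)

def build_mlp(n_in=784, n_hidden=1000, depth=5, n_classes=10, act=nn.Tanh):
    """Feed-forward net with `depth` hidden layers of 1000 units each."""
    layers, prev = [], n_in
    for _ in range(depth):
        lin = nn.Linear(prev, n_hidden)
        standard_init_(lin)
        layers += [lin, act()]
        prev = n_hidden
    out = nn.Linear(prev, n_classes)     # softmax is folded into the loss below
    standard_init_(out)
    layers.append(out)
    return nn.Sequential(*layers)

model = build_mlp(act=nn.Tanh)           # or nn.Sigmoid / nn.Softsign
criterion = nn.CrossEntropyLoss()        # softmax + negative log-likelihood
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x, y = torch.randn(10, 784), torch.randint(0, 10, (10,))   # mini-batch of size 10
loss = criterion(model(x), y)
optimizer.zero_grad(); loss.backward(); optimizer.step()
```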

Effect of Activation Functions and Saturation During Training

Experiments with the Sigmoid

  • The sigmoid non-linearity is already known to slow down learning, because its non-zero mean induces important singular values in the Hessian.

Fig 2.

  • Observations
    • Layer 1 denotes the output of the first hidden layer; there are four hidden layers in total.
    • The plot shows the mean and standard deviation of the activation values of each layer.
    • All sigmoid activations of layer 4 (the last hidden layer) are quickly pushed toward 0. In contrast, the other layers have a mean activation above 0.5, and this value decreases as we go from the output layer toward the input layer.
    • With the sigmoid activation function, this kind of saturation can last much longer in deeper models; here, the network with four hidden layers did eventually escape it. While the top hidden layer moved out of saturation, the lower layers began to saturate and thereby stabilize.
  • Results
    • This behaviour follows from the combination of random initialization and the fact that a sigmoid outputting 0 corresponds to a saturated unit.
    • Early in training, the softmax output layer softmax(b + Wh) relies more on its biases b than on the top hidden activations h derived from the input image: h varies in a way that is not yet predictive of y, being dominated by the input x. The error gradient therefore tends to push Wh toward 0, which can be achieved by pushing h toward 0.
    • Symmetric activation functions such as tanh or softsign are better off in this respect, because an output of 0 still lets gradients flow backward; pushing the sigmoid's output toward 0, however, drives it into its saturation regime (a small numerical check follows below).
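
To make the last point concrete (a small check added here, not taken from the paper): a sigmoid whose output is near 0 sits deep in its saturated tail and passes almost no gradient, while tanh and softsign have their maximal slope exactly where their output is 0.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# A sigmoid output near 0 means a very negative pre-activation, e.g. x = -6:
s = sigmoid(-6.0)                   # ~0.0025
print(s * (1 - s))                  # sigmoid'(-6) = s(1 - s) ~ 0.0025 -> almost no gradient

# tanh and softsign output 0 at pre-activation 0, where their slope is largest:
print(1 - np.tanh(0.0) ** 2)        # tanh'(0) = 1
print(1.0 / (1.0 + abs(0.0)) ** 2)  # softsign'(0) = 1
```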

Experiments with the Hyperbolic tangent

Fig 3

  • A network with the hyperbolic tangent as activation function does not suffer the same top-hidden-layer saturation problem, because tanh is symmetric around 0.
  • However, with the standard weight initialization $U[-1/\sqrt{n}, 1/\sqrt{n}]$, a saturation phenomenon appears that proceeds sequentially, starting from layer 1 (Fig 3).
  • Top plot: percentiles (markers) and standard deviation (solid line) of the distribution of activation values during training for a tanh network. The first hidden layer saturates first, then the second, and so on.
  • Bottom plot: the same for a softsign network. Here the layers saturate less and move together.

Experiments with the Softsign

Fig 4

  • Softsign is similar to the hyperbolic tangent, but because its tails are polynomial rather than exponential it can behave differently with respect to saturation. As Fig 3 shows, saturation does not proceed from one layer to the next.
  • It is fast at the beginning and then slow, and all layers move together toward larger weights.
  • The region the activations settle in is non-linear, yet gradients can still flow through it well.
  • Top plot: hyperbolic tangent; the lower layers can be seen in a saturated state.
  • Bottom plot: softsign; many activation values are not saturated but are distributed around (-0.6, -0.8) and (0.6, 0.8).

Studying Gradients and their Propagation

  • The logistic regression (conditional log-likelihood) cost function works much better than the quadratic cost traditionally used to train feed-forward networks.
  • The variance of the back-propagated gradients decreases as they are propagated backward through the network.

Fig 5

  • Fig 5: cross-entropy (black) and quadratic (red) cost as a function of two weights of a two-layer network: W1 in the first layer and W2 in the second layer.
  • The normalization factor matters when initializing a deep network because of the multiplicative effect through the layers; see the variance relations below.
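
In the paper's linear-regime analysis, both the forward activation variance and the back-propagated gradient variance are products over the layers, which is why the per-layer scaling factor matters so much. With activations $z^i$, pre-activations $s^i$, layer widths $n_i$, depth $d$, and independent zero-mean weights $W^{i}$:

$$
\mathrm{Var}[z^i] \;=\; \mathrm{Var}[x]\prod_{i'=0}^{i-1} n_{i'}\,\mathrm{Var}[W^{i'}],
\qquad
\mathrm{Var}\!\left[\frac{\partial \text{Cost}}{\partial s^i}\right] \;=\; \mathrm{Var}\!\left[\frac{\partial \text{Cost}}{\partial s^d}\right]\prod_{i'=i}^{d} n_{i'+1}\,\mathrm{Var}[W^{i'}]
$$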

Normalized Initialization

  • They propose an initialization procedure that keeps the activation variances and the back-propagated gradient variances roughly constant as one moves up or down the network; they call this the normalized initialization (its form is given below).
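
Concretely, asking for $n_i\,\mathrm{Var}[W^i]=1$ (to preserve activation variance forward) and $n_{i+1}\,\mathrm{Var}[W^i]=1$ (to preserve gradient variance backward) leads to the compromise $\mathrm{Var}[W^i]=\tfrac{2}{n_i+n_{i+1}}$, which the paper realizes with a uniform distribution:

$$
W \sim U\!\left[-\frac{\sqrt{6}}{\sqrt{n_j+n_{j+1}}},\; \frac{\sqrt{6}}{\sqrt{n_j+n_{j+1}}}\right]
$$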

Fig 6

  • Top plot: normalized histogram of activation values with hyperbolic tangent activation, using the standard initialization.
  • Bottom plot: the same with the normalized initialization.

Back-propagated Gradients During Learning

  • Top plot: normalized histogram of back-propagated gradients with hyperbolic tangent activation, using the standard initialization.
  • Bottom plot: the same with the normalized initialization.
  • As Fig 7 shows, early in training with the standard initialization the variance of the back-propagated gradients becomes smaller as it is propagated downward; this tendency, however, changes very quickly during training.
  • Using the normalized initialization avoids this problem (see the small simulation below).
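
As an illustrative check at initialization time (my own numpy sketch, not the paper's experiment), propagating a random batch through a deep tanh network shows the per-layer activation standard deviation shrinking under the standard heuristic and staying roughly constant under the normalized initialization; the backward (gradient) analysis is symmetric, with the fan-out replacing the fan-in.

```python
import numpy as np

rng = np.random.default_rng(0)
depth, n = 5, 1000                      # 5 hidden layers of 1000 units, as in the paper

def layer_stds(init):
    """Per-layer std of tanh activations for a random batch under `init`."""
    h = rng.standard_normal((10, n))    # mini-batch of 10, unit-variance input
    stds = []
    for _ in range(depth):
        if init == "standard":          # W ~ U[-1/sqrt(n), 1/sqrt(n)]
            bound = 1.0 / np.sqrt(n)
        else:                           # normalized: W ~ U[-sqrt(6)/sqrt(n_in+n_out), +...]
            bound = np.sqrt(6.0) / np.sqrt(n + n)
        W = rng.uniform(-bound, bound, size=(n, n))
        h = np.tanh(h @ W)
        stds.append(round(float(h.std()), 3))
    return stds

print("standard  :", layer_stds("standard"))    # shrinks layer by layer
print("normalized:", layer_stds("normalized"))  # stays roughly constant
```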

Error Curves and Conclusions

Fig 8

  • Top plot: normalized histogram of weight gradients with hyperbolic tangent activation, using the standard initialization.
  • Bottom plot: the same with the normalized initialization, shown for the different layers.

Table 1

  • The results show the effect of the choice of activation and initialization on test error; "N" indicates the use of the normalized initialization.
  • Conclusions
    • Classical neural networks with sigmoid or hyperbolic tangent activations and the standard initialization fare poorly: they converge slowly and appear to end up in poorer local minima.
    • Networks with the softsign activation are more robust to the initialization procedure than tanh networks, presumably because of their gentler non-linearity.
    • For tanh networks, the proposed normalized initialization is quite helpful, presumably because the layer-to-layer transformations then maintain the magnitudes of activations (flowing upward) and of gradients (flowing backward).
