Modern Information Technologies / Computer Engineering and Programming
Kryuchin O.V., Vyazovova
E.V., Rakitin A.A.
Tambov State University named after G.R. Derzhavin
DEVELOPMENT OF PARALLEL GRADIENT AND QUASI-NEWTON ALGORITHMS FOR SOLVING ECONOMIC PROBLEMS
This paper proposes parallel methods for training artificial neural networks based on the decomposition of the target function into a Taylor series. Parallel algorithms are suggested for several methods: steepest descent, QuickProp, RPROP, BFGS and DFP.
Keywords: parallel algorithms, artificial neural networks
Introduction
As is well known, artificial neural networks (ANNs) implement a mathematical model of the human brain. The human brain can solve various economic tasks, for example building models or forecasting time series (such as currency pairs). However, the computing power of modern computers is far below that of the human brain, which is why ANN training very often requires a large amount of time. Sometimes this is not a problem, but in other situations such delays are unacceptable. A solution to this problem is clustering: an ANN trained on a computer cluster requires less time than a network trained on a personal computer. However, training on a computer cluster requires special algorithms that take the specifics of parallel computing into account.
Since ANN training essentially amounts to minimizing a target function, we can use gradient algorithms, which are very effective in optimization theory. These methods use the decomposition of the target function into a Taylor series in the neighbourhood of the current minimum. In papers [1, 2] this decomposition is described by the equation

$E(\mathbf{w} + \mathbf{p}) = E(\mathbf{w}) + [\mathbf{g}(\mathbf{w})]^T \mathbf{p} + \frac{1}{2}\,\mathbf{p}^T \mathbf{H}(\mathbf{w})\,\mathbf{p} + \ldots,$   (1)

where $\mathbf{g}(\mathbf{w})$ is the gradient, $\mathbf{H}(\mathbf{w})$ is the Hessian (the symmetric matrix of second partial derivatives) and $\mathbf{p}$ is the vector defining the direction and size of the step from the current weight vector $\mathbf{w}$. In practice only the first three terms are usually computed and the rest are ignored.
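As a quick numerical illustration of the truncated decomposition (1), the following Python sketch compares the exact value of a target function with its second-order Taylor approximation; the target $E(\mathbf{w}) = \sum_i w_i^4$ and the chosen points are illustrative assumptions, not taken from the paper.

import numpy as np

# E(w) = sum(w_i^4) is an illustrative target function, not one from the paper.
def E(w):
    return np.sum(w**4)

def g(w):                      # gradient of E
    return 4.0 * w**3

def H(w):                      # Hessian of E
    return np.diag(12.0 * w**2)

w = np.array([1.0, -2.0])      # current weights
p = np.array([0.01, 0.02])     # small step

taylor = E(w) + g(w) @ p + 0.5 * p @ H(w) @ p
print(E(w + p), taylor)        # the exact and approximated values nearly coincide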
The aim of this paper is to develop parallel algorithms for several methods: steepest descent, QuickProp, RPROP, BFGS and DFP.
Gradient methods
The gradient descent is a
first-order optimization algorithm. To find a local minimum of a function using
gradient descent, one takes steps proportional to the negative of the gradient
(or of the approximate gradient) of the function at the current point. If instead one takes steps proportional to the positive of the gradient, one approaches a local maximum of that function; the procedure is then known as gradient ascent.
Gradient descent is also known as
steepest descent, or the method of steepest descent. When known as the latter,
gradient descent should not be confused with the method of steepest descent for
approximating integrals.
Gradient descent is based on the observation that if the multivariable function $E(\mathbf{w})$ is defined and differentiable in a neighbourhood of a point $\mathbf{w}(k)$, then $E$ decreases fastest if one goes from $\mathbf{w}(k)$ in the direction of the negative gradient of the function at $\mathbf{w}(k)$, which is $-\mathbf{g}(\mathbf{w}(k))$. It follows that, if

$\mathbf{w}(k+1) = \mathbf{w}(k) - \eta\,\mathbf{g}(\mathbf{w}(k))$   (2)

(for a small enough step $\eta > 0$), then $E(\mathbf{w}(k+1)) \le E(\mathbf{w}(k))$ [3, 4].
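A minimal Python sketch of the iteration (2); the quadratic target, the step size $\eta = 0.1$ and the iteration count are illustrative assumptions.

import numpy as np

def gradient_descent(grad, w0, eta=0.1, n_iters=100):
    """Iterate w <- w - eta * g(w), i.e. formula (2)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_iters):
        w = w - eta * grad(w)
    return w

# Example target E(w) = ||w - 3||^2 with gradient 2*(w - 3).
print(gradient_descent(lambda w: 2.0 * (w - 3.0), w0=[0.0, 0.0]))  # ~[3. 3.]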
In the QuickProp algorithm the change of the $i$-th weight coefficient at the $k$-th iteration uses the formula

$\Delta w_i(k) = -\eta\left(\alpha\,w_i(k) + \frac{\partial E(k)}{\partial w_i}\right) + \gamma_i(k)\,\Delta w_i(k-1),$   (3)

where $\alpha$ is the coefficient minimizing the weight coefficient values and $\gamma_i(k)$ is the moment factor. There are two differences from the steepest descent method: the added weight minimizer and the moment factor [5]. The minimizer is usually assigned a small value and serves to weaken weight links, while the moment factor is necessary to adapt the algorithm to the current training results. The coefficient $\gamma_i(k)$ differs for each weight and its calculation consists of two steps. The first step computes

$\beta_i(k) = \frac{\partial E(k)/\partial w_i}{\partial E(k-1)/\partial w_i - \partial E(k)/\partial w_i},$   (4)

and the second step takes the minimum of $\beta_i(k)$ and $\gamma_{\max}$. In paper [6] Fahlman suggests using 1.75 as the value of $\gamma_{\max}$.
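The following Python sketch implements one element-wise QuickProp update as reconstructed in formulas (3)-(4); the default values of $\eta$ and $\alpha$ are illustrative assumptions, not values prescribed by the paper.

import numpy as np

def quickprop_step(w, grad, prev_grad, prev_dw,
                   eta=0.01, alpha=1e-4, gamma_max=1.75):
    """One element-wise QuickProp update following formulas (3)-(4)."""
    diff = prev_grad - grad
    safe = np.where(np.abs(diff) > 1e-12, diff, 1.0)     # avoid division by zero
    beta = np.where(np.abs(diff) > 1e-12, grad / safe, gamma_max)
    gamma = np.minimum(beta, gamma_max)                  # second step of (4)
    dw = -eta * (alpha * w + grad) + gamma * prev_dw     # formula (3)
    return w + dw, dw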
The idea of the RPROP method is to ignore the gradient value and to use only its sign. In papers [4, 5] the change of the weights is calculated by the formulas

$\Delta w_i(k) = -\eta_i(k)\,\mathrm{sgn}\!\left(\frac{\partial E(k)}{\partial w_i}\right),$   (5)

$\eta_i(k) = \begin{cases} \min\big(a\,\eta_i(k-1),\ \eta_{\max}\big), & S_i(k) > 0, \\ \max\big(b\,\eta_i(k-1),\ \eta_{\min}\big), & S_i(k) < 0, \\ \eta_i(k-1), & \text{otherwise}, \end{cases}$   (6)

$S_i(k) = \frac{\partial E(k)}{\partial w_i}\cdot\frac{\partial E(k-1)}{\partial w_i}.$   (7)
In the book of Osowski [5] the values $a = 1.2$, $b = 0.5$, $\eta_{\min} = 10^{-6}$, $\eta_{\max} = 50$ are suggested.
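A Python sketch of one RPROP update following formulas (5)-(7); the default parameter values repeat those quoted above, and the element-wise treatment of the weight vector is an implementation assumption.

import numpy as np

def rprop_step(w, grad, prev_grad, eta,
               a=1.2, b=0.5, eta_min=1e-6, eta_max=50.0):
    """One RPROP update: only the sign of the gradient is used (formulas (5)-(7))."""
    s = grad * prev_grad                                        # indicator S_i(k), formula (7)
    eta = np.where(s > 0, np.minimum(eta * a, eta_max), eta)    # step adaptation, formula (6)
    eta = np.where(s < 0, np.maximum(eta * b, eta_min), eta)
    return w - eta * np.sign(grad), eta                         # weight update, formula (5)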
Quasi-Newton methods
In optimization, quasi-Newton
methods (a special case of variable metric methods) are algorithms for finding
local maxima and minima of functions. Quasi-Newton methods are based on Newton's
method to find the stationary point of a function, where the gradient is 0.
Newton's method assumes that the function can be locally approximated as a
quadratic in the region around the optimum, and uses the first and second
derivatives to find the stationary point. In higher dimensions, Newton's method
uses the gradient and the Hessian matrix of second derivatives of the function
to be minimized.
In quasi-Newton methods the Hessian
matrix does not need to be computed. The Hessian is updated by analyzing
successive gradient vectors instead. Quasi-Newton methods are a generalization
of the secant method to find the root of the first derivative for
multidimensional problems. In multi-dimensions the secant equation is
under-determined, and quasi-Newton methods differ in how they constrain the solution,
typically by adding a simple low-rank update to the current estimate of the Hessian [7].
Quasi-Newton algorithms calculate the new weights by the formula

$\mathbf{w}(k+1) = \mathbf{w}(k) - [\mathbf{H}(\mathbf{w}(k))]^{-1}\,\mathbf{g}(\mathbf{w}(k)),$   (8)

which is the basis of the Newton optimization algorithm. This equation is theoretical because it requires a positive definite Hessian at each step. Since this is not realistic, an approximation of the Hessian is used instead, and this approximation is modified at every step.
We will denote the matrix approximating the inverse of the Hessian by $\mathbf{V}$ and the change of the gradient by $\Delta\mathbf{g}$. As written in papers [9, 10], in this situation the new weights can be calculated by the formula

$\mathbf{w}(k+1) = \mathbf{w}(k) - \eta\,\mathbf{V}(k)\,\mathbf{g}(\mathbf{w}(k)).$   (9)

The initial matrix is the identity ($\mathbf{V}(0) = \mathbf{I}$) and the next values are usually calculated using the Broyden-Fletcher-Goldfarb-Shanno (BFGS) formula
$\mathbf{V}(k) = \mathbf{V}(k-1) + \left(1 + \frac{[\Delta\mathbf{g}]^T\mathbf{V}(k-1)\,\Delta\mathbf{g}}{[\Delta\mathbf{w}]^T\Delta\mathbf{g}}\right)\frac{\Delta\mathbf{w}\,[\Delta\mathbf{w}]^T}{[\Delta\mathbf{w}]^T\Delta\mathbf{g}} - \frac{\Delta\mathbf{w}\,[\Delta\mathbf{g}]^T\mathbf{V}(k-1) + \mathbf{V}(k-1)\,\Delta\mathbf{g}\,[\Delta\mathbf{w}]^T}{[\Delta\mathbf{w}]^T\Delta\mathbf{g}},$   (10)

$\Delta\mathbf{w}(k) = \mathbf{w}(k) - \mathbf{w}(k-1),$   (11)

$\Delta\mathbf{g}(k) = \mathbf{g}(k) - \mathbf{g}(k-1).$   (12)
We will denote the product of the gradient change and the transposed gradient change by

$\mathbf{G} = \Delta\mathbf{g}\,[\Delta\mathbf{g}]^T,$   (13)

the product of the weight change vector and the transposed weight change vector by

$\mathbf{W} = \Delta\mathbf{w}\,[\Delta\mathbf{w}]^T,$   (14)

the product of the transposed weight change vector and the gradient change by

$c = [\Delta\mathbf{w}]^T\Delta\mathbf{g},$   (15)

the product of the transposed gradient change and the matrix $\mathbf{V}$ by

$\mathbf{d} = [\Delta\mathbf{g}]^T\mathbf{V}(k-1),$   (16)

and the product of the vector $\mathbf{d}$ and the gradient change vector by

$e = \mathbf{d}\,\Delta\mathbf{g} = [\Delta\mathbf{g}]^T\mathbf{V}(k-1)\,\Delta\mathbf{g}.$   (17)

Thus we can write the update (10) as

$\mathbf{V}(k) = \mathbf{V}(k-1) + \left(1 + \frac{e}{c}\right)\frac{\mathbf{W}}{c} - \frac{\Delta\mathbf{w}\,\mathbf{d} + [\Delta\mathbf{w}\,\mathbf{d}]^T}{c}.$   (18)
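A Python sketch of the BFGS inverse-Hessian update in the form (18) together with the weight update (9); the variable names mirror definitions (13)-(17), and the sample vectors in the usage fragment are arbitrary illustrative values.

import numpy as np

def bfgs_update(V, dw, dg):
    """BFGS update of the inverse-Hessian approximation V, formula (18)."""
    c = dw @ dg                    # scalar (15)
    W = np.outer(dw, dw)           # matrix (14)
    d = dg @ V                     # vector (16)
    e = d @ dg                     # scalar (17)
    M = np.outer(dw, d)            # the product dw * d
    return V + (1.0 + e / c) * W / c - (M + M.T) / c

def quasinewton_step(w, g, V, eta=1.0):
    """Weight update of formula (9): w <- w - eta * V * g."""
    return w - eta * (V @ g)

# One illustrative update with arbitrary sample values.
V = bfgs_update(np.eye(2), np.array([0.1, -0.2]), np.array([0.4, -0.1]))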
In another well-known algorithm, Davidon-Fletcher-Powell (DFP) [8], the new value of the matrix is calculated by the formula

$\mathbf{V}(k) = \mathbf{V}(k-1) + \frac{\Delta\mathbf{w}\,[\Delta\mathbf{w}]^T}{[\Delta\mathbf{w}]^T\Delta\mathbf{g}} - \frac{\mathbf{V}(k-1)\,\Delta\mathbf{g}\,[\Delta\mathbf{g}]^T\,\mathbf{V}(k-1)}{[\Delta\mathbf{g}]^T\mathbf{V}(k-1)\,\Delta\mathbf{g}}.$   (19)

If we use definitions (13)-(17), we obtain the equation

$\mathbf{V}(k) = \mathbf{V}(k-1) + \frac{\mathbf{W}}{c} - \frac{\mathbf{V}(k-1)\,\mathbf{G}\,\mathbf{V}(k-1)}{e}.$   (20)
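The corresponding Python sketch of the DFP update (20), written with the same conventions as the BFGS sketch above.

import numpy as np

def dfp_update(V, dw, dg):
    """DFP update of the inverse-Hessian approximation V, formula (20)."""
    c = dw @ dg          # scalar (15)
    Vdg = V @ dg         # the product V * dg
    e = dg @ Vdg         # scalar (17)
    return V + np.outer(dw, dw) / c - np.outer(Vdg, Vdg) / e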
Parallel algorithms
Gradient methods are based on the calculation of the gradient vector and change the values of the weight coefficients in the opposite direction:

$\mathbf{w}(k+1) = \mathbf{w}(k) - \eta\,\mathbf{g}(\mathbf{w}(k)).$   (21)

Here $\mathbf{g}$ is the gradient. In these algorithms the vector of weight coefficients is divided into $n$ parts and each part is located on a different processor. The lead processor calculates $\mu_0$ weight coefficients, and each of the other processors calculates $\mu$ weights. The values of $\mu_0$ and $\mu$ are calculated by formulas

(22)

and

(23)

This idea is fully described in papers [11, 12].
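Below is a minimal Python sketch of the parallel gradient step described above. It uses mpi4py (an assumed dependency; papers [11, 12] do not prescribe a library), an even block split of the weights that stands in for formulas (22)-(23), and an illustrative target function.

# Run with: mpiexec -n 4 python <script name>
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

N, eta = 16, 0.1                          # number of weights, learning rate
counts = [N // size + (1 if r < N % size else 0) for r in range(size)]
lo = sum(counts[:rank])
hi = lo + counts[rank]

w = np.zeros(N)                           # every processor keeps the full weight vector

def local_gradient(w):
    # Illustrative target E(w) = ||w - 1||^2; each processor differentiates
    # only its own block of weight coefficients.
    return 2.0 * (w[lo:hi] - 1.0)

for _ in range(100):
    parts = comm.allgather(local_gradient(w))   # exchange the gradient parts
    g = np.concatenate(parts)                   # assemble the full gradient
    w -= eta * g                                # formula (21) on every processor

if rank == 0:
    print(w)                                    # close to a vector of ones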
Parallel quasi-Newton algorithms
Parallel quasi-Newton algorithms can be developed in two ways. The first way is to calculate the values of the gradient $\mathbf{g}$ (and of the gradient change $\Delta\mathbf{g}$) as in the typical parallel gradient algorithm, to pass the calculated values from the non-lead processors to the lead processor (processor zero), to calculate the new value of $\mathbf{V}$ and the weight change $\Delta\mathbf{w}$ on the lead processor, and to send the calculated weight change to all processors. After that all processors calculate the new weight values themselves.
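A Python sketch of this first variant under the same assumptions as the previous sketch (mpi4py, an even block split of the weights, an illustrative target function); only the lead processor keeps the matrix $\mathbf{V}$ and computes the weight change, which it then broadcasts.

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

N, eta = 16, 0.5
counts = [N // size + (1 if r < N % size else 0) for r in range(size)]
lo = sum(counts[:rank]); hi = lo + counts[rank]

def local_gradient(w):                    # illustrative target E(w) = ||w - 1||^2
    return 2.0 * (w[lo:hi] - 1.0)

def bfgs_update(V, dw, dg):               # formula (18), as in the earlier sketch
    c, d = dw @ dg, dg @ V
    M = np.outer(dw, d)
    return V + (1.0 + (d @ dg) / c) * np.outer(dw, dw) / c - (M + M.T) / c

w, V = np.zeros(N), np.eye(N)
g_prev, dw = np.zeros(N), np.zeros(N)

for k in range(30):
    parts = comm.gather(local_gradient(w), root=0)   # gradient parts go to the lead
    if rank == 0:
        g = np.concatenate(parts)
        if k > 0 and abs(dw @ (g - g_prev)) > 1e-12:
            V = bfgs_update(V, dw, g - g_prev)       # V is updated on the lead only
        dw, g_prev = -eta * (V @ g), g
    dw = comm.bcast(dw if rank == 0 else None, root=0)   # lead broadcasts the weight change
    w = w + dw                                           # all processors update the weights

if rank == 0:
    print(w)                                             # close to a vector of ones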
The second parallel variants of the BFGS and DFP algorithms are different. The parallel BFGS algorithm consists of fourteen steps:
1. the gradient $\mathbf{g}$ is calculated by all processors as in the other gradient methods (the gradient change $\Delta\mathbf{g}$ is calculated by all processors too);
2. all non-lead processors send their calculated gradient parts to the lead processor;
3. the lead processor assembles the full gradient and sends its new values to all processors;
4. each processor except the lead calculates its rows of the matrix and its elements of the vector; at the same time the lead processor calculates the new scalar value;
5. the non-lead processors pass the calculated matrix rows and vector elements to the lead processor;
6. the lead processor calculates the new value of the vector;
7. all processors except the lead calculate their matrix rows; at the same time the lead processor calculates the matrix
(24)
8. the non-lead processors pass the calculated elements of the matrix
(25)
to the lead processor;
9. the lead processor sends the matrix, the scalar and the vector to all other processors;
10. the lead processor calculates the matrix
(26)
and all other processors calculate their rows of the matrices
(27)
(28)
and
(29)
11. the non-lead processors pass the calculated matrix rows to the lead processor;
12. the lead processor calculates the matrix $\mathbf{V}(k)$ and the weight change $\Delta\mathbf{w}$;
13. the lead processor sends the matrix $\mathbf{V}(k)$ and the weight change $\Delta\mathbf{w}$ to all other processors;
14. all processors calculate the new values of the weights $\mathbf{w}(k+1)$.
The first steps of the parallel DFP algorithm coincide with the steps of the BFGS algorithm, but the last steps are different:
7. the non-lead processors calculate their matrix rows;
8. the non-lead processors pass the calculated rows of the matrix to the lead processor;
9. the lead processor assembles the full matrix and sends it, together with the matrix and the scalar, to all processors;
10. the non-lead processors calculate their rows of the matrices
(30)
(31)
and
(32)
and the lead processor at the same time calculates
(33)
11. the matrix rows are passed from the non-lead processors to the lead;
12. the lead processor calculates the matrix
(34)
and the weight change $\Delta\mathbf{w}$;
13. the lead processor sends the matrix $\mathbf{V}(k)$ and the weight change $\Delta\mathbf{w}$ to all other processors;
14. all processors calculate the new values of the weights $\mathbf{w}(k+1)$.
Conclusion
Thus, we have developed parallel variants of several gradient and quasi-Newton algorithms which can be used for training artificial neural networks.
References
1. Gill P., Murray W., Wright M. Practical Optimization. - N.Y.: Academic Press, 1981.
2. Widrow B., Stearns S. Adaptive Signal Processing. - N.Y.: Prentice Hall, 1985.
3. Avriel M. Nonlinear Programming: Analysis and Methods. - Dover Publishing, 2003.
4. Snyman J.A. Practical Mathematical Optimization: An Introduction to Basic Optimization Theory and Classical and New Gradient-Based Algorithms. - Springer Publishing, 2005.
5. Осовский С. Нейронные сети для обработки информации; пер. с пол. И.Д. Рудинского. - М.: Финансы и статистика, 2002. - 344 с. (Osowski S. Neural Networks for Information Processing. - Moscow: Finance and Statistics, 2002. - 344 p.)
6. Riedmiller M., Braun H. RPROP - a Fast Adaptive Learning Algorithm. Technical Report. - Karlsruhe: University of Karlsruhe, 1992.
7. Zadeh L.A. The concept of a linguistic variable and its application to approximate reasoning. Parts 1-3 // Information Sciences. 1975. - Pp. 199-249.
8. Riedmiller M. Untersuchungen zu Konvergenz und Generalisierungsverhalten überwachter Lernverfahren mit dem SNNS // Proceedings of the SNNS, 1993.
9. Bonnans J.F., Gilbert J.Ch., Lemarechal C., Sagastizabal C.A. Numerical Optimization: Theoretical and Practical Aspects. - Springer, 2nd edition, 2006. - 437 p.
10. Davidon W.C. Variable Metric Method for Minimization // SIAM Journal on Optimization, 1991. - Vol. 1, Iss. 1. - Pp. 1-17.
11. Крючин О.В. Разработка параллельных градиентных алгоритмов обучения искусственной нейронной сети // Электронный журнал "Исследовано в России", 096, стр. 1208-1221, 2009. URL: http://zhurnal.ape.relarn.ru/articles/2009/096.pdf (Kryuchin O.V. Development of parallel gradient algorithms for artificial neural network training // Electronic journal "Investigated in Russia", 2009. - Pp. 1208-1221.)
12. Крючин О.В. Параллельные алгоритмы обучения искусственных нейронных сетей с использованием градиентных методов // Актуальные вопросы современной науки, техники и технологий: материалы II Всероссийской научно-практической (заочной) конференции. - М.: НИИРРР, 2010. - 160 с. - С. 81-86. (Kryuchin O.V. Parallel algorithms for artificial neural network training using gradient methods // Actual Problems of Modern Science, Engineering and Technology: Proceedings of the II All-Russian Scientific-Practical Conference. - Moscow: NIIRRR, 2010. - 160 p. - Pp. 81-86.)