INTRODUCTION
Analyzing data for clinical decision support is a task of great importance:
it helps save the time of both patients and doctors and minimizes the risk
of a wrong diagnosis. With the advent of new technology and the development
of a broad variety of computerized analytical techniques, this process
is increasingly performed by computers rather than through human intervention. In the past
a number of algorithms for cardiovascular risk assessment have been proposed
to the medical community (Wilson et al., 1998; Assmann et al.,
2002; Mennotti et al., 2000; Mennotti et al., 2002; Conroy
et al., 2003). These algorithms were based mainly on statistical analyses
performed on longitudinal study cohorts. Classical statistical approaches
have drawbacks when dealing with the nonlinear and complex data that arise
in heart disease data sets (Enzo, 2006), chiefly an inability to capture
the complexity of the disease and the dynamics of the process.
In this study we use a data mining technique, namely an Artificial Neural
Network (ANN), to diagnose whether a patient has heart disease. The analysis
serves as clinical decision support, that is, it confirms the presence
or absence of heart disease in a patient. The novelty of this study is
the new network presented, trained using the Levenberg-Marquardt algorithm
and resilient backpropagation. The analysis shows that our ANN is
fast and reliable in prediction. Since ANNs can handle a very
large number of variables simultaneously and can, furthermore, capture
nonlinearity in a data set, they offer a great advantage over classical
statistical techniques. ANNs are concerned more with the number
of variables than with their nature. Owing to their particular mathematical
structure, ANNs can handle large numbers of variables, which constitutes
the basis for developing recursive algorithms. In our model we first check
whether the variables in the data set are highly correlated, using a
principal component analysis. The output is then fed into the
neural network.
MATERIALS AND METHODS
Owing to the rapid innovation of computer technology, we use the ANN technique,
which is a powerful tool for nonlinear modeling compared with the classical
ARIMA model. Two main advantages are that an ANN can learn
a complex nonlinear relationship with limited prior knowledge of the system
structure and that it can perform inference for an unknown combination
of input variables.
The ANN method imitates the way by which the brain processes information.
Given an input vector p = (p_{1}, p_{2}, ..., p_{n})^{T},
the network produces an output vector ŷ = (ŷ_{1}, ŷ_{2}, ..., ŷ_{m})^{T},
where n indicates the number of inputs and m the number of output units.
A neural network is typically organized into several layers of nodes.
The first layer is the input layer, the number of nodes corresponding
to the number of variables and the last layer is the output. The input
and output layer can be separated by one or more hidden layers. The nodes
in adjacent layers are fully connected. Each neuron receives information
from the preceding layer and transmits to the following layer only. The
neuron then performs a weighted summation of its inputs; if the sum passes
a threshold the neuron transmits; otherwise it remains inactive. An example
of a fully connected ANN model with one hidden layer is shown in Fig.
1, where p_{1}, ..., p_{n} are the inputs at time t and b_{1}, ..., b_{q}
are the hidden outputs. The variables r_{t} and r̂_{t}
are the actual and the ANN model output, respectively. The vector p represents
the input to the ANN model, where p_{i}
is the level of activity at the ith input. Associated with the vector
is a series of weight vectors W_{j} = (w_{1j}, w_{2j},
..., w_{nj}), so that w_{ij} represents the strength of
the connection between the input p_{i}
and the unit b_{j}. There may also be an input bias p_{0}
modulated with the weight w_{0j} associated with the inputs. The
total input of the node b_{j} is the dot product between the vectors
p and W_{j}, less the weighted bias. It is then passed through
a nonlinear activation function to produce the output value of the processing
unit b_{j}, defined as:

b_{j} = f(Σ_{i=1}^{n} w_{ij}p_{i} − w_{0j}p_{0})
The activation function introduces a degree of nonlinearity to the model
and prevents the output from reaching very large values that can paralyze
ANN models and inhibit training. In this study we choose f(x) = (e^{x} − e^{−x})/(e^{x} + e^{−x})
as the activation function. The modeling process begins by assigning random
values to the weights. The output value of the processing unit is passed
on to the output layer. If the output is optimal, the process is halted;
otherwise the weights are adjusted using an appropriate algorithm, which
in our work is the backpropagation algorithm. The process
continues until an optimal solution is found, that is, when the output or
optimisation error, the difference between the actual value and
the ANN model output, is minimized.
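The forward pass described above can be sketched as follows. This is a minimal illustration in Python; the layer sizes, random weights and function names are our own choices, not the paper's Matlab implementation:

```python
import numpy as np

def forward(p, W, w0, V, v0):
    """Forward pass through a fully connected network with one hidden layer.

    p : input vector (n,)
    W : input-to-hidden weights (n, q);  w0 : weighted hidden biases (q,)
    V : hidden-to-output weights (q, m); v0 : weighted output biases (m,)
    """
    # Total input of unit b_j: dot product of p and W_j less the weighted bias,
    # passed through the tanh activation f(x) = (e^x - e^-x)/(e^x + e^-x).
    b = np.tanh(p @ W - w0)
    # Linear output layer, as used later in the paper's experiments.
    return b @ V - v0

rng = np.random.default_rng(0)
n, q, m = 13, 50, 1                      # 13 inputs, 50 hidden units, 1 output
W, w0 = rng.normal(size=(n, q)), rng.normal(size=q)
V, v0 = rng.normal(size=(q, m)), rng.normal(size=m)
y = forward(rng.normal(size=n), W, w0, V, v0)
print(y.shape)                           # (1,)
```

Training then consists of adjusting W, w0, V and v0 to reduce the error between y and the target, which is what the two algorithms below do.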
Here we propose a two-layer network based on two algorithms, namely the Levenberg-Marquardt
algorithm and the resilient backpropagation method. We first describe
the Levenberg-Marquardt algorithm, which is a combination of the method
of steepest descent and the Gauss-Newton method. We next study the resilient
backpropagation (Riedmiller, 1994) algorithm, which performs supervised
batch learning in multilayer networks.
Levenberg-Marquardt Method (LM)
The Levenberg-Marquardt method is an iterative technique which locates
the minimum of a function expressed as the sum of squares of nonlinear
real-valued functions. Whenever the current solution is far from the correct
one, the algorithm behaves like a steepest descent method: slow, but
guaranteed to converge. On the other hand, if the current solution
is close to the correct one, it becomes a Gauss-Newton method. We next
give a short description of the steepest descent and Gauss-Newton
methods.
Gradient Descent Method
Gradient descent is an optimisation algorithm. To find a local minimum
of a function, the steps are taken proportional to the negative of the
gradient (or the approximate gradient) of the function at the current
point. Gradient descent is based on the following observation: if the
real-valued function F(x) is defined and differentiable in a neighborhood
of a point a, then F(x) decreases fastest if one goes from a in the direction
opposite to the gradient of F at a, denoted by ∇F(a). It follows
that if b = a − γ∇F(a) for γ > 0, then F(b) ≤ F(a). With
this observation in mind, one starts with a guess x_{0} for a
local minimum of F and considers the sequence x_{0}, x_{1},
x_{2}, ... such that x_{n+1} = x_{n} − γ∇F(x_{n}),
n ≥ 0.
We have F(x_{0}) ≥ F(x_{1}) ≥ F(x_{2}) ≥ ...,
so that the sequence (x_{n}) converges to the desired local minimum.
The value of the step size γ is allowed to change at each iteration.
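Applied to a simple quadratic F(x, y) = x^{2} + 2y^{2} (our illustrative choice, not a function from the study), the iteration x_{n+1} = x_{n} − γ∇F(x_{n}) looks like:

```python
import numpy as np

def gradient_descent(grad, x0, gamma=0.1, steps=100):
    """Fixed-step gradient descent: x_{n+1} = x_n - gamma * grad F(x_n)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - gamma * grad(x)
    return x

# grad F for F(x, y) = x^2 + 2y^2 is (2x, 4y); the minimiser is (0, 0).
grad = lambda x: np.array([2.0 * x[0], 4.0 * x[1]])
x_min = gradient_descent(grad, [3.0, -2.0])
print(x_min)   # very close to (0, 0)
```

With a fixed γ = 0.1 each coordinate shrinks geometrically, illustrating the slow but steady convergence noted above.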
Gauss-Newton Algorithm
The Gauss-Newton algorithm is used to solve nonlinear least squares
problems. It is a modification of Newton's method that does not use
second derivatives. Given m functions f_{1}, f_{2}, ..., f_{m}
of n parameters p_{1}, p_{2}, ..., p_{n} with m ≥ n,
the aim is to minimize the sum

S(p) = Σ_{i=1}^{m} f_{i}(p)^{2}

where p = (p_{1}, p_{2}, ..., p_{n}).
The Gauss-Newton algorithm is an iterative procedure and hence an initial
guess for the parameter vector p has to be provided, which is denoted
p^{0}. Subsequent guesses p^{k} are then obtained by the
recurrence relation

p^{k+1} = p^{k} − (J_{f}(p^{k})^{T}J_{f}(p^{k}))^{−1}J_{f}(p^{k})^{T}f(p^{k})

where f = (f_{1}, f_{2}, ..., f_{m}) and J_{f}(p) is the Jacobian
of f at p. The matrix inverse is never computed explicitly in practice.
Instead, we use p^{k+1} = p^{k} + δ^{k} and
we find the update δ^{k} by solving the linear system
J_{f}(p^{k})^{T}J_{f}(p^{k})δ^{k} = −J_{f}(p^{k})^{T}f(p^{k}).
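A numeric sketch of this iteration, using our own illustrative model f_{i}(p) = p_{0}e^{p_{1}t_{i}} − x_{i} and synthetic noiseless data (none of this comes from the paper):

```python
import numpy as np

def gauss_newton(f, jac, p0, x, iters=15):
    """Gauss-Newton: solve J^T J delta = -J^T f(p), then set p <- p + delta."""
    p = np.asarray(p0, dtype=float)
    for _ in range(iters):
        r = f(p) - x                               # residual vector f(p)
        J = jac(p)
        # Solve the linear system instead of inverting J^T J explicitly.
        delta = np.linalg.solve(J.T @ J, -J.T @ r)
        p = p + delta
    return p

# Fit x_i = p0 * exp(p1 * t_i) to synthetic data generated with (2.0, 1.5).
t = np.linspace(0.0, 1.0, 20)
x = 2.0 * np.exp(1.5 * t)
f = lambda p: p[0] * np.exp(p[1] * t)
jac = lambda p: np.column_stack([np.exp(p[1] * t),            # df/dp0
                                 p[0] * t * np.exp(p[1] * t)]) # df/dp1
p = gauss_newton(f, jac, [1.0, 1.0], x)
print(p)   # approaches (2.0, 1.5)
```

Because the data are noiseless, the residual vanishes at the optimum and the iteration converges rapidly, as expected for Gauss-Newton near a zero-residual solution.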
The Levenberg-Marquardt Algorithm
We assume f is a functional relation which maps a parameter vector
p ∈ R^{m} to an estimated measurement vector x̂ = f(p), x̂ ∈ R^{n}.
An initial parameter estimate p^{0} and a measured vector x are
provided and the aim is to find the vector p^{+} that best
satisfies the functional relation f, that is, the one that minimises the
squared distance ε^{T}ε with ε = x − f(p).
The Levenberg-Marquardt algorithm (Lourakis, 2005) is an iterative
algorithm based on a linear approximation to f in the neighborhood
of p.
The method, initiated at the starting point p_{0}, produces a
series of vectors p_{1}, p_{2}, ... that converge towards
a local minimiser p^{+} for f. At each iteration, we look
for the δ_{p} that minimises the quantity
||x − f(p + δ_{p})|| ≈ ||x − f(p) − Jδ_{p}|| = ||ε − Jδ_{p}||.
The sought δ_{p} is therefore the solution to a linear least-squares
problem: the minimum is reached when Jδ_{p} − ε is orthogonal to the column
space of J. This leads to J^{T}(Jδ_{p} − ε) = 0,
which yields the Gauss-Newton step δ_{p} as the solution
of the so-called normal equations:

J^{T}Jδ_{p} = J^{T}ε     (1)
Ignoring second derivative terms, the matrix J^{T}J on the
left-hand side of Eq. 1 is the approximate Hessian,
that is, an approximation to the matrix of second-order derivatives. However,
the Levenberg-Marquardt method actually solves a slight variation of Eq.
1, known as the augmented normal equations:

Nδ_{p} = J^{T}ε

where the off-diagonal elements of N are identical to the corresponding
elements of J^{T}J and the diagonal elements are given by
N = J^{T}J + μI for some μ > 0, where I is the
identity matrix. This strategy of altering the diagonal elements of J^{T}J
is called damping and μ is called the damping term.
The algorithm, which is given in Appendix A, terminates when at least
one of the following conditions is met:

• The magnitude of the gradient of ε^{T}ε, that is, J^{T}ε in the
right-hand side of Eq. 1, drops below a threshold ε_{1}
• The relative change in the magnitude of δ_{p} drops below a
threshold ε_{2}
• The error ε^{T}ε drops below a threshold ε_{3}
• A maximum number of iterations k_{max} is completed
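A compact sketch of the damped iteration and an error-based termination test follows. The residual model, damping schedule and thresholds here are illustrative choices of ours, not the settings used in the study:

```python
import numpy as np

def levenberg_marquardt(f, jac, p0, x, mu=1e-3, k_max=100, eps3=1e-12):
    """Minimise ||x - f(p)||^2 by solving the augmented normal equations
    (J^T J + mu*I) delta = J^T eps at each iteration."""
    p = np.asarray(p0, dtype=float)
    for _ in range(k_max):
        eps = x - f(p)
        if eps @ eps < eps3:                  # error drops below threshold
            break
        J = jac(p)
        N = J.T @ J + mu * np.eye(p.size)     # damped (augmented) normal matrix
        delta = np.linalg.solve(N, J.T @ eps)
        eps_new = x - f(p + delta)
        if eps_new @ eps_new < eps @ eps:
            p, mu = p + delta, mu * 0.5       # accept: behave more like Gauss-Newton
        else:
            mu *= 2.0                         # reject: behave more like steepest descent
    return p

# Fit the line x_i = 3*t_i + 1 (constant Jacobian, for a simple check).
t = np.linspace(0.0, 1.0, 10)
x = 3.0 * t + 1.0
f = lambda p: p[0] * t + p[1]
jac = lambda p: np.column_stack([t, np.ones_like(t)])
p = levenberg_marquardt(f, jac, [0.0, 0.0], x)
print(p)   # approaches (3.0, 1.0)
```

Shrinking μ after a successful step and growing it after a failed one is what moves the method between its Gauss-Newton and steepest-descent behaviours.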
Resilient Backpropagation Algorithm (RP)
Resilient backpropagation (Riedmiller, 1994) is a local adaptive learning
scheme which performs supervised batch learning in multilayer networks.
The basic principle is to eliminate the harmful influence of the size
of the partial derivative on the weight step. Consequently, only
the sign of the derivative is considered, to indicate the direction of the
weight update. The size of the weight change is exclusively determined by a
weight-specific, so-called update-value Δ_{ij}(t):

Δw_{ij}(t) = −sign(∂E/∂w_{ij}(t)) · Δ_{ij}(t)

where ∂E/∂w_{ij}(t) denotes the summed gradient information over all
patterns of the data set (batch learning).
The second step of resilient backpropagation learning is to determine
the new update-value Δ_{ij}(t). This procedure is based on
a sign-dependent adaptation process.
The adaptation rule works as follows: whenever the partial derivative
of the corresponding weight w_{ij} changes sign, indicating
that the previous update was too large and the algorithm has jumped over
a local minimum, the update-value Δ_{ij}(t) is
decreased by a factor η^{−} < 1. If the sign of the derivative remains
unchanged, the update-value is slightly increased, by a factor η^{+} > 1,
in order to accelerate convergence in shallow regions. Moreover, if there is
a change in sign, there should be no adaptation in the succeeding learning
step. In practice, this can be achieved by setting ∂E/∂w_{ij}(t) := 0
in the above adaptation rule.
The resilient backpropagation algorithm follows the principle of batch
learning, or learning by epoch, since it tries to adapt its learning process
to the topology of the error function. Weight-update and adaptation are
performed after the gradient of the whole pattern set is computed.
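One such update for a vector of weights can be sketched as follows. The values η^{−} = 0.5 and η^{+} = 1.2 and the step bounds are Riedmiller's published defaults; the function and variable names are ours, and zeroing the stored gradient on a sign change is how we realise "no adaptation in the succeeding step":

```python
import numpy as np

def rprop_step(w, grad, prev_grad, delta,
               eta_minus=0.5, eta_plus=1.2, d_min=1e-6, d_max=50.0):
    """One RPROP update for an array of weights, using only gradient signs."""
    sign_change = grad * prev_grad
    # Same sign: grow the update-value; opposite sign: shrink it (bounded).
    delta = np.where(sign_change > 0, np.minimum(delta * eta_plus, d_max), delta)
    delta = np.where(sign_change < 0, np.maximum(delta * eta_minus, d_min), delta)
    # On a sign change, zero the stored gradient so the next step adapts
    # normally and this weight takes no step now (sign(0) = 0).
    grad = np.where(sign_change < 0, 0.0, grad)
    w = w - np.sign(grad) * delta          # step opposite the gradient's sign
    return w, grad, delta

w = np.array([1.0, -2.0])
prev_grad = np.array([0.5, -0.1])
grad = np.array([0.2, 0.3])                # second component changes sign
delta = np.array([0.1, 0.1])
w, grad, delta = rprop_step(w, grad, prev_grad, delta)
print(w, delta)                            # [0.88 -2.] [0.12 0.05]
```

Note that the magnitude of the gradient never enters the weight step, only its sign, which is the defining feature of the method.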
RESULTS AND DISCUSSION
The heart disease data set problem is a pattern recognition problem.
The objective of the network is to decide whether an individual has cardiac
pathologies, based on personal data (age, sex) and the results of medical
examinations (e.g., blood pressure, cholesterol, maximum heart rate, etc.).
The data was obtained from the Long Beach data set (more specifically,
the Cleveland Clinic Foundation data set [Heart Disease Databases]) with fourteen
attributes (including the target) and 303 records, and the network was
implemented in Matlab 7.0. Table 1 shows an extract
of the data set.
Table 1: 
An extract of the data set showing the records of the
patients 

Before training, it is useful to scale the inputs and targets so that
they always fall within a specified range. One approach is to normalise
the mean and standard deviation of the training set. This procedure is
implemented in the function prestd in Matlab. It normalises the inputs
and targets so that they have zero mean and unit standard deviation.
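A numpy equivalent of this normalisation step is sketched below. Like prestd, it returns the means and standard deviations so the same transform can be reapplied to new inputs; the sample data and function name are ours:

```python
import numpy as np

def prestd_like(x):
    """Scale columns of x to zero mean and unit standard deviation."""
    mean = x.mean(axis=0)
    std = x.std(axis=0, ddof=1)            # sample standard deviation
    return (x - mean) / std, mean, std

x = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
xn, mean, std = prestd_like(x)
print(xn.mean(axis=0), xn.std(axis=0, ddof=1))   # ~[0 0] [1 1]
```

New patient records would be scaled with the stored mean and std before being fed to the trained network.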
Principal Component Analysis
In some situations, when the input vector is large, the components
of the vectors are highly correlated (redundant). It is useful in this
situation to reduce the dimension of the input vectors and one way is
to perform a principal component analysis. The aim of the principal component
analysis is to eliminate those components that contribute the least to
the variation in the data set.
The matrix ptrans contains the transformed input vectors and the matrix
transMat contains the principal component transformation matrix. After
the network has been trained, this matrix should be used to transform any
future inputs applied to the network. Note that pn *
transMat = ptrans.
For the data set we find that all the components are needed to account
for up to 99% of the variation.
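In the same spirit as Matlab's prepca, the following sketch keeps the leading principal components that together explain a chosen fraction of the variance. The 99% threshold follows the text; the random data, function name and eigendecomposition route are our own choices:

```python
import numpy as np

def pca_reduce(xn, keep=0.99):
    """Project standardised inputs onto the components explaining `keep`
    of the total variance; returns (ptrans, trans_mat) analogues."""
    cov = np.cov(xn, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)          # ascending order
    order = np.argsort(eigvals)[::-1]               # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(explained, keep) + 1)   # smallest k reaching 99%
    trans_mat = eigvecs[:, :k]                      # like transMat
    return xn @ trans_mat, trans_mat                # like pn * transMat = ptrans

rng = np.random.default_rng(1)
xn = rng.normal(size=(303, 13))                     # 303 records, 13 inputs
ptrans, trans_mat = pca_reduce(xn)
print(ptrans.shape)
```

With uncorrelated inputs, as here, nearly all components are retained; strong correlations in real data would shrink trans_mat's column count.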
Training Session
The next step is to divide the data into training, validation and
test subsets. One fourth of the data was taken for the validation
set, one fourth for the test set and one half for the training set. We can
now create a network and train it. For this purpose we used a two-layer
feedforward network, with a tan-sigmoid transfer function in the hidden
layer and a linear transfer function in the output layer. First, we start
with the resilient backpropagation algorithm:
net=newff(minmax(p),[50,1],{'logsig','purelin'},'trainrp');
net.trainParam.show=2;
net.trainParam.epochs=100;
net.trainParam.goal=0;
net.trainParam.lr=0.01;
[net,tr]=train(net,ptr,ttr,[],[],val,test);
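The one-half / one-fourth / one-fourth partition described above can be sketched in Python. The shuffling, seed and placeholder data are our own; in the Matlab session the resulting subsets are what the train call receives:

```python
import numpy as np

def split_data(x, y, seed=0):
    """Shuffle the records, then split 1/2 train, 1/4 validation, 1/4 test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    half, three_q = len(x) // 2, (3 * len(x)) // 4
    tr, val, te = idx[:half], idx[half:three_q], idx[three_q:]
    return (x[tr], y[tr]), (x[val], y[val]), (x[te], y[te])

x = np.arange(303 * 13, dtype=float).reshape(303, 13)   # placeholder records
y = np.zeros(303)                                        # placeholder targets
train_set, val_set, test_set = split_data(x, y)
print(len(train_set[0]), len(val_set[0]), len(test_set[0]))   # 151 76 76
```

The validation subset is what triggers early stopping when its error starts to rise, as seen in the training runs below.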
The performance of the network during training is shown in Table
2. The training stopped after 22 iterations because the validation
error increased. The training, validation and test errors were plotted
to check the progress of the training. The network's performance is shown
in Fig. 2 and 3. The result here is
reasonable because the test set error and the validation set error have
similar characteristics and no significant overfitting appears to have
occurred.
Table 2: 
Performance of the resilient backpropagation method 


Fig. 2: 
Minimum square error per epoch with RP algorithm 

Fig. 3: 
Minimum square error per epoch using the LM algorithm 
Now, we train the same set of data using the Levenberg-Marquardt algorithm.
The same commands were used with the same parameters, except for the
training function's parameter, which was set to 'trainlm'
(Table 1).
The training stopped after eight iterations since the minimum gradient
had been reached (Fig. 2, 3).
For each approach, the technique for calculating the weight changes,
so as to minimize the mean square error between the targeted output and
the network's output, has been analyzed. When applied to our data, the
training with the resilient backpropagation algorithm started with a mean
square error of 27.378 and reached a minimum of 0.03465 after 22 epochs,
as shown in Table 2. On the other hand, Table
3 shows that for the Levenberg-Marquardt algorithm the training
started with a mean square error of 32.7958 and attained a minimum
of 1.18597×10^{−29} after eight epochs. With these levels of
error, we can consider the results to be reliable.
Table 3: 
Performance of the LevenbergMarquardt method 

Table 4: 
Network`s output and targets with the LevenbergMarquardt
and the resilient backpropagation algorithms 

Table 4 shows a sample of the network's output compared
to the targeted output. When analyzing the network's output, we can see
that it is not limited to 0 and 1 only, since we used the linear
transfer function in the output layer. In this way, the severity
of the disease can be analyzed. For example, a patient for whom the network's
output is −0.8 is not likely to have heart disease, whereas a network's
output of 1.5 indicates the presence of the disease, which can be considered
severe. Also, the cases where the doctor may have made a wrong
diagnosis are in bold.
CONCLUSIONS
In this study, we have considered two algorithms for training an artificial
neural network on the heart disease database. From this experiment we obtain
a well-trained network with relatively good accuracy by using the LM
method, which is faster than the RP method. Most of the diagnoses made
by the cardiologists were accurate, but there were still a few cases where
the diagnosis of heart disease might have been wrong. Based on this analysis,
cardiologists can reconsider the records of those patients to see if there
has been any error in the diagnosis. The same approach may be used in
other medical problems where prediction based on subsets of attributes
is required on several target attributes at the same time.
APPENDIX A
We present the Levenberg-Marquardt algorithm.
Input: A vector function f : R^{m} → R^{n}
with n ≥ m, a measurement vector x ∈ R^{n} and an initial
parameter estimate p_{0} ∈ R^{m}
Output: A vector p^{+} ∈ R^{m} minimising
||x − f(p)||^{2}
Algorithm:
Next, we consider the resilient backpropagation algorithm. The minimum
(maximum) operator is supposed to deliver the minimum (maximum) of two
numbers; the sign operator returns +1 if the argument is positive,
−1 if the argument is negative and 0 otherwise.
Repeat
For all weights and biases {
The algorithm takes two parameters: the initial update-value Δ_{0}
and a limit for the maximum step size, Δ_{max}. All the update-values
are set to an initial value Δ_{0} when learning starts. Since
the size of the first weight step is directly determined by Δ_{0},
it should be chosen according to the initial values of the weights themselves,
for example Δ_{0} = 0.1 (default setting). The choice of
this value is rather uncritical since it is adapted as learning proceeds.
To prevent the weights from becoming too large, the maximum weight step
determined by the update-value is limited. The upper bound is set by the
second parameter Δ_{max} and its default value is
Δ_{max} = 50.0. The minimum step size is constantly fixed
at Δ_{min} = 10^{−6}.