Masich I.S.
Siberian State Aerospace University
named after academician M. F. Reshetnev, Russia
Search algorithms for combinatorial
optimization in diagnosis
and prognosis
problems
A data classification method based on
combinatorics and optimization is considered in the paper. The decision rule
bases on the model received by solving combinatorial optimization problems.
Search algorithms of pseudo-Boolean optimization are designed and investigated
for solving these problems.
The main
object of study in this work is the use of algorithms for combinatorial
optimization problems in data classification. One of the most promising
directions in classification are logical classification algorithms, the
principle of which is to identify patterns in data and formalize them into a
set of logical rules. The process of formation of decision rules is implemented through solving problems of choosing the best
alternative according to some criterion. In some logic algorithms, including
decision trees and decision lists, this is done implicitly, using some
heuristic procedures. Formalization of this process as a series of
combinatorial optimization problems forms a flexible and efficient algorithm of
logical analysis for data classification. Construction of effective rules and
classification models is a difficult combinatorial problem. The results of the
solution are determined by the type of the criteria and constraints that are
formed, and by used optimization algorithms. Their effectiveness depends on
accuracy and laboriousness of the method of
classification.
The investigated data classification algorithm consists of steps, each
of which requires decision a series of combinatorial optimization problems [1].
Criteria and constraints in the problems
are defined by pseudo-Boolean
functions, characterized by the properties of unimodality and
monotonicity that distinguish them
in a special class of problems where the feasible set is connected. In the general
case these functions are defined algorithmically,
i.e. evaluated through a specific
sequence of operations.
Therefore,
the most suitable for solving this problem are the so-called search
optimization algorithms that do not require the assignment of functions in
explicit form, using algebraic expressions, and use calculations of the
functions in points.
For solving
this problem a regular pseudo-Boolean optimization algorithm was previously
developed [2], which finds the exact optimal solution for a limited time; it is
shown that this algorithm implements the informational complexity of this class
of problems.
At the same
time for multiple solutions of this problem with a large number of variables
the most appropriate is using of approximation algorithms [3], designed
specifically for this class of problems and based on the behavior of monotone
functions, which are the criteria and constraints, in the space of Boolean
variables. These optimization algorithms are based on finding the boundary
points of the feasible region.
Experimental
study of the developed procedure of classification and its comparison with
other methods of data classification was carried out on the problem of
predicting complications of myocardial infarction (MI) [4]. The challenge is to
predict for patients at the hospital the number of complications: atrial fibrillation
(AF), ventricular fibrillation (VF), pulmonary edema (PE), heart failure (HF).
For this purpose the data sample consisting of 1700 objects, data of which lies
in 117 characters, is used.
Comparison of
the developed logical classification algorithm (LCA) was carried out with
different data classification algorithms: naïve Bayesian classifier,
nearest neighbor (IB1), support vector machine (SMO), decision trees (J48), algorithm for constructing 1-rule
(1-R) and other classification
algorithms. Solving by these algorithms was carried out in the data analysis
system WEKA. The results for VF are presented in Table 1.
Table 1
Method Indicator |
NaiveBayes |
IB1 |
SMO |
1-R |
J48 |
LCA |
The number of correctly
classified negative objects (%) |
89 |
89 |
78 |
78 |
78 |
94 |
The number of correctly
classified posirive objects (%) |
69 |
46 |
77 |
92 |
92 |
100 |
Realized
classification procedure is not inferior in accuracy to other algorithms for
data classification on the practical problems of forecasting. But it has several
important advantages in practical use. First of all, explicitly known the rules
by which a decision is made about belonging to a class. This makes the method
is interpreted and provides an opportunity to apply it to solve those problems
in which the loss from making the wrong decisions can be large, and the
decision itself must be justified.
Many methods of pattern recognition (e.g., neural networks, support
vector machine), although find a solution with good accuracy, but do not give
any explanation about them. However, when using systems of diagnosis and
prognosis in practical problems (e.g., problems of medical diagnosis and
prognosis), sometimes you want to know why some new observation belongs to a
class, how far it is from the “boundary” between the classes, how this decision is justified.
In such cases, the method of data classification, which in addition to the
solutions of the problem provides an explicit decision rule, that is, detects
knowledge from available data.
Literature
1. Hammer P.L., Bonates
T. Logical Analysis of Data: From Combinatorial Optimization to Medical
Applications. RUTCOR Research Report 10-2005, 2005.
2. A.N. Antamoshkin,
I.S. Masich. Unimprovable algorithm for monotone pseudo-Boolean function
conditional optimization. Engineering & automation problems. V. 6, N. 1,
2008, 71-75.
3. A.N. Antamoshkin,
I.S. Masich. Heuristic search algorithms for monotone pseudo-boolean function
conditional optimization. Engineering
& automation problems.
N. 3, 2007, 41-45.
4. Complications of
myocardial infarction: a database for testing systems of recognition and
forecasting / S.E. Golovenkin, A.N. Gorban, V.A. Shulman and others.
Krasnoyarsk, Computing Center of RAS, Preprint 6, 1997.