Modern information technologies / 2. Computer engineering and programming
Zhumanov Zh.
Al-Farabi Kazakh National University, Almaty, Kazakhstan
Model and algorithm for determining the meaning of ambiguous words with a parallel corpus of Kazakh and English languages
Introduction
This article deals with the statistical approach to machine translation for languages that do not have linguistic corpora. Statistical machine translation is currently a leading technology for building translation software. However, it requires analysis of large volumes of pre-processed text data (corpora): one for each of the languages involved and a joint (parallel) one. This article shows how one can take advantage of this approach even when such data is unavailable. In this case, the technology is used to improve the quality of translation by determining the meaning of ambiguous words.
Statistical approach to machine translation
Statistical machine translation is an approach to machine translation in which translation is based on statistical models. The parameters of these models are determined by analyzing bilingual parallel corpora. The idea of statistical machine translation was suggested by Warren Weaver in 1949. The approach was revived in 1991 by researchers at one of IBM's research centers. Currently, it is one of the most studied methods of machine translation [1].
Let us present the main points of this approach. P(k) is the probability that the translation software will be presented with a sentence k. P(e|k) is the conditional probability that the sentence e in the target language corresponds to the sentence k in the input language.
Assume that the translation software is given a sentence k in the Kazakh language. The program's objective is to find the English sentence e that maximizes P(e|k). This sentence is usually called the most likely translation; below it is denoted by e'. With this in mind, we can write:
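$$e' = \arg\max_{e} P(e \mid k) \qquad (1)$$
Using Bayes' formula:
$$P(e \mid k) = \frac{P(e)\, P(k \mid e)}{P(k)}, \qquad (2)$$
and noting that P(k) does not depend on e, we get:
$$e' = \arg\max_{e} P(e)\, P(k \mid e) \qquad (3)$$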
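As a rough illustration of the selection rule in (3), the sketch below picks, from a fixed list of candidate translations, the one with the largest product P(e) P(k|e). The Kazakh sentence, the candidate list, the function name and all probability values are made-up assumptions for the example; a real system would estimate both models from corpora and search a much larger candidate space.

```python
# Toy illustration of rule (3): choose the English candidate e that
# maximizes P(e) * P(k|e). All numbers below are invented for the example.

def most_likely_translation(k, candidates, language_model, translation_model):
    """Return the candidate e with the largest P(e) * P(k|e)."""
    return max(
        candidates,
        key=lambda e: language_model.get(e, 0.0) * translation_model.get((k, e), 0.0),
    )

k = "мен үйге бардым"  # hypothetical Kazakh input sentence
candidates = ["I went home", "I home went", "I went to the house"]

# P(e): the ungrammatical word order gets a very small probability.
language_model = {
    "I went home": 0.02,
    "I home went": 0.0001,
    "I went to the house": 0.01,
}

# P(k|e): how well each candidate corresponds to the input sentence.
translation_model = {
    (k, "I went home"): 0.3,
    (k, "I home went"): 0.4,
    (k, "I went to the house"): 0.2,
}

best = most_likely_translation(k, candidates, language_model, translation_model)
print(best)
```

In this toy setting the word-for-word candidate has the highest translation-model score, but the language model penalizes its word order, so a grammatical candidate wins.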
The component P(e) is called the language model. In practice, it is determined by how frequently the sentence e is used in the language, and it serves to "control" the correctness of this sentence. As a rule, sentences with incorrect grammatical structure or semantics are rarely used in any language. Consequently, the probability of such sentences will be very small.
The component P(k|e) is called the translation model. It gives the probability that the sentences k and e in different languages correspond to each other.
Natural languages contain valid sentences that differ only very slightly (e.g. in the gender or number of their members), so it is more efficient for software to analyze not complete sentences but the groups of words from which they are composed. This is done by assuming that a correct sentence consists of correct phrases.
A group consisting of n words is called an n-gram. Accordingly, a group consisting of 1 word is called a unigram, of 2 words a bigram, and of 3 words a trigram. Thus, we get one of the fundamental propositions of the statistical approach: if a sentence consists of valid n-grams, it is likely to be correct.
For the bigram model, we define P(y|x), the conditional bigram probability, as the probability that the word «y» follows the word «x». This probability is estimated as the number of occurrences of the group «xy» divided by the total number of occurrences of the word «x».
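The count-based estimate of P(y|x) can be sketched in a few lines. The function name and the toy English token list are assumptions made for illustration; the sketch works on one flat token list and does not treat sentence boundaries specially.

```python
from collections import Counter

def bigram_probability(tokens, x, y):
    """Estimate P(y|x) as count of the pair "x y" divided by count of "x"."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    if unigram_counts[x] == 0:
        return 0.0  # x never occurs, so no estimate is available
    return bigram_counts[(x, y)] / unigram_counts[x]

# Toy token sequence: "the" occurs 3 times, "the cat" occurs 2 times.
tokens = "the cat sat on the mat and the cat slept".split()
p = bigram_probability(tokens, "the", "cat")
print(p)
```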
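When constructing the language model with bigrams, the probability of a sentence e = w_1 w_2 … w_n is defined as follows:
$$P(e) = \prod_{i=1}^{n} P(w_i \mid w_{i-1}), \qquad (4)$$
where w_0 denotes the start of the sentence.
The trigram model has become more widely used: P(z|xy) is the number of occurrences of the group «xyz» divided by the number of occurrences of «xy».
$$P(e) = \prod_{i=1}^{n} P(w_i \mid w_{i-2}\, w_{i-1}) \qquad (5)$$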
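A minimal sketch of how trigram estimates can be chained to score a whole sentence is given below. It assumes the usual product-of-conditionals reading of the trigram language model. The boundary markers `<s>` and `</s>`, the function names and the three-sentence toy corpus are illustrative assumptions, and no smoothing is applied, so any unseen trigram gives probability 0.

```python
from collections import Counter

START, END = "<s>", "</s>"  # hypothetical sentence-boundary markers

def pad(sentence):
    """Split a sentence into words and add two start markers and one end marker."""
    return [START, START] + sentence.split() + [END]

def train_trigram_counts(sentences):
    """Count trigrams "x y z" and their two-word contexts "x y"."""
    trigram_counts, context_counts = Counter(), Counter()
    for sentence in sentences:
        padded = pad(sentence)
        for x, y, z in zip(padded, padded[1:], padded[2:]):
            trigram_counts[(x, y, z)] += 1
            context_counts[(x, y)] += 1
    return trigram_counts, context_counts

def sentence_probability(sentence, trigram_counts, context_counts):
    """Multiply P(z|xy) = count(xyz) / count(xy) over all trigrams of the sentence."""
    padded = pad(sentence)
    probability = 1.0
    for x, y, z in zip(padded, padded[1:], padded[2:]):
        if context_counts[(x, y)] == 0:
            return 0.0  # context never seen in training
        probability *= trigram_counts[(x, y, z)] / context_counts[(x, y)]
    return probability

corpus = ["the cat sat", "the cat slept", "the dog sat"]
tri, ctx = train_trigram_counts(corpus)
p_seen = sentence_probability("the cat sat", tri, ctx)
p_unseen = sentence_probability("sat cat the", tri, ctx)
print(p_seen, p_unseen)
```

A sentence built from trigrams seen in the corpus gets a non-zero score, while a scrambled word order gets 0, which is the "control of correctness" role of the language model.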
The trigram model is also used to determine the translation model P(k|e). A multi-step analysis is conducted on an aligned parallel corpus split into trigrams: at each stage, the correspondences between elements of the trigrams in the two languages are successively refined and assigned probabilities [2].
The main difficulty in applying the statistical approach is the need for a parallel bilingual corpus and, sometimes, a corpus for each of the languages involved. Creating such corpora is a separate task, far removed from translation itself, with its own specific difficulties. If no ready-made corpus-linguistics resources exist for the languages involved in translation, the statistical approach to machine translation becomes ineffective.
One of the advantages of the statistical approach is that it allows dealing with the problem of ambiguous words. As can be seen from the overview above, the statistical approach associates each ambiguous word not with a literal translation but with the most likely one, which is determined on the basis of the linguistic corpus used. If, instead of a full parallel corpus, we build a corpus from sentences in which ambiguous words are used in different contexts, and modify the mathematical model described above, the resulting model will solve the problem of translating ambiguous words.
The task of creating a corpus whose sentences contain ambiguous words appears simpler, in the sense that its quality (representativeness, balance of topics, genres, etc.) is easier to control. The requirements on the volume of the corpus also decrease.
Model for determining the meaning of ambiguous words based on the statistical approach
Let x be an ambiguous word in the Kazakh language, y its English translation (depending on the context), and P(y|x) the probability that y is the translation of x in this context. As shown in [3], there are 5 types of context: co-text, rel-text, chron-text, bi-text, and non-text. This problem involves co-text: the words of the sentence directly related to the ambiguous word x. The related word z can precede x or follow it. Two cases must therefore be considered: P(y|zx), where y is a translation of x in the context «zx», and P(y|xz), where y is a translation of x in the context «xz». (Note: x and z are words of the Kazakh language, y is a word of English.)
By analogy with (3) we obtain:
$$y' = \arg\max_{y} P(y \mid zx) = \arg\max_{y} P(y)\, P(zx \mid y)$$
$$y' = \arg\max_{y} P(y \mid xz) = \arg\max_{y} P(y)\, P(xz \mid y)$$
where y' is the required translation of the ambiguous word x.
P(y) reflects how correct the word is in English. Since all the translations of the ambiguous word are known in advance (taken from the dictionary), P(y) is taken to be equal to 1 for every candidate. The choice of y' in this case depends on P(zx|y) or P(xz|y). These values are estimated as follows: the number of sentences in the English part of the corpus that contain y and whose aligned sentences in the Kazakh part contain the group zx (or xz, respectively) is divided by the total number of sentences with y in the English part of the corpus. When the context of x is determined by the word in front of it (the case zx), the expression P(xz|y) will be very small. When the context of x is determined by the word after it (the case xz), the expression P(zx|y) will be very small.
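The corpus-based estimate can be sketched as follows, reading P(zx|y) as the share of sentence pairs containing y on the English side whose Kazakh side contains the adjacent word pair. This reading, the function names, and the representation of the corpus as a list of (Kazakh sentence, English sentence) string pairs are assumptions of the sketch. The tokens `x`, `z1`, `z2`, `y1`, `y2` and the filler words are schematic placeholders, not real Kazakh or English words.

```python
def has_bigram(tokens, bigram):
    """True if the two words of `bigram` occur next to each other in `tokens`."""
    return any(pair == bigram for pair in zip(tokens, tokens[1:]))

def context_probability(aligned_pairs, bigram, y):
    """Estimate P(bigram | y) from a sentence-aligned corpus.

    Among the pairs whose English sentence contains y, return the fraction
    whose Kazakh sentence contains the given adjacent word pair.
    """
    kazakh_with_y = [kk.split() for kk, en in aligned_pairs if y in en.split()]
    if not kazakh_with_y:
        return 0.0  # y never occurs in the English part
    hits = sum(1 for tokens in kazakh_with_y if has_bigram(tokens, bigram))
    return hits / len(kazakh_with_y)

# Schematic corpus: x is the ambiguous word, z1 and z2 are context words,
# y1 and y2 are the two dictionary translations of x.
aligned_pairs = [
    ("a z1 x b", "c y1 d"),
    ("z1 x e", "y1 f"),
    ("x z2 g", "h y2"),
    ("q x z2", "y2 r"),
    ("x z1 w", "y1 s"),
]

p_zx_y1 = context_probability(aligned_pairs, ("z1", "x"), "y1")
p_zx_y2 = context_probability(aligned_pairs, ("z1", "x"), "y2")
p_xz_y2 = context_probability(aligned_pairs, ("x", "z2"), "y2")
print(p_zx_y1, p_zx_y2, p_xz_y2)
```

With this data the context «z1 x» supports only y1, while the context «x z2» supports only y2, which is the behaviour the model relies on.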
Algorithm for determining the meaning of ambiguous words
The model described above can be implemented by the following algorithm for determining the meaning of ambiguous words with a parallel corpus of Kazakh and English languages:
1. Find all possible translations of x using a dictionary.
2. For each translation y, calculate the values of P(zx|y) and P(xz|y) using the corpus.
3. As the translation of the word x in the given context, take the value y that maximizes the probability calculated in the previous step.
4. If all the calculated probabilities are equal to 0, the corpus is incomplete. In this case, take as the most likely translation the value y that is most often used in the English part of the corpus together with x.
5. If several translations have the same probability, take the one among them that most commonly occurs in the corpus together with x, without regard to context.
The last two cases are possible when the corpus does not cover all possible uses of x. When handling such exceptions, these events must be recorded so that the corpus can later be supplemented.
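The five steps can be put together in a short sketch. Several details are assumptions rather than part of the description above: the corpus is a list of (Kazakh sentence, English sentence) string pairs, each candidate is scored by the larger of P(zx|y) and P(xz|y), "used together with x" is counted as the number of aligned pairs with x on the Kazakh side and y on the English side, and fallback events are recorded in a plain list. All names and tokens are hypothetical placeholders.

```python
def has_bigram(tokens, bigram):
    return any(pair == bigram for pair in zip(tokens, tokens[1:]))

def context_probability(aligned_pairs, bigram, y):
    """Share of pairs with y on the English side whose Kazakh side has the bigram."""
    with_y = [kk.split() for kk, en in aligned_pairs if y in en.split()]
    if not with_y:
        return 0.0
    return sum(has_bigram(tokens, bigram) for tokens in with_y) / len(with_y)

def cooccurrence_count(aligned_pairs, x, y):
    """Number of pairs with x on the Kazakh side and y on the English side."""
    return sum(1 for kk, en in aligned_pairs if x in kk.split() and y in en.split())

def disambiguate(x, z_before, z_after, translations, aligned_pairs, log=None):
    """Choose a translation of x given its neighbours z_before and z_after.

    Step 1 is assumed done: `translations` comes from a dictionary.
    """
    # Step 2: compute P(zx|y) and P(xz|y) for every candidate y.
    scores = {}
    for y in translations:
        p_zx = context_probability(aligned_pairs, (z_before, x), y) if z_before else 0.0
        p_xz = context_probability(aligned_pairs, (x, z_after), y) if z_after else 0.0
        scores[y] = max(p_zx, p_xz)  # assumption: keep the stronger context
    # Step 3: a unique non-zero maximum decides the translation.
    best_score = max(scores.values())
    winners = [y for y in translations if scores[y] == best_score]
    if best_score > 0 and len(winners) == 1:
        return winners[0]
    # Steps 4 and 5: all scores are zero or several are tied. Record the event
    # and fall back to the tied candidate seen most often together with x.
    if log is not None:
        log.append((x, z_before, z_after))
    return max(winners, key=lambda y: cooccurrence_count(aligned_pairs, x, y))

aligned_pairs = [
    ("a z1 x b", "c y1 d"),
    ("z1 x e", "y1 f"),
    ("x z2 g", "h y2"),
    ("q x z2", "y2 r"),
    ("x z1 w", "y1 s"),
]
translations = ["y1", "y2"]
events = []

by_left_context = disambiguate("x", "z1", None, translations, aligned_pairs, events)
by_right_context = disambiguate("x", None, "z2", translations, aligned_pairs, events)
by_fallback = disambiguate("x", "unseen", None, translations, aligned_pairs, events)
print(by_left_context, by_right_context, by_fallback, events)
```

When all scores are zero every candidate is tied, so steps 4 and 5 share one fallback branch here. The recorded events mark contexts that the corpus does not yet cover.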
Conclusion
This article describes a model and an algorithm for determining the meaning of ambiguous words with a parallel corpus of Kazakh and English languages. The advantages of statistical machine translation can thus be used to improve the quality of translation without having large linguistic corpora.
References
1. Manning, Christopher D. & Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press.
2. Knight, Kevin. 1999. A Statistical MT Tutorial Workbook. JHU summer workshop, April 30, 1999.
3. Melby, Alan K. & Christopher Foster. 2010. Context in translation: Definition, access and teamwork. The International Journal for Translation & Interpreting Research, Vol 2, No 2.