Khaiii Github - Key terms & Concepts

flow123 2021. 7. 12. 11:41

This post contains key terms that appear frequently in Korean morphological analysis and natural language processing (NLP). It covers concepts used in the Khaiii project and should help readers who are not familiar with the structure of Korean. I recommend reviewing this glossary before reading the translations.

Terms

(1) Korean Unit

A. Jaso: a Korean letter, that is, a single consonant or vowel (e.g. ㄱ, ㄴ, ㄷ, ㅏ, ㅑ, ㅓ, ㅕ). Two or three Jaso combine to form a syllable.

B. Eojeol (Word): An eojeol is similar to a word in English. It is a linguistic unit delimited by white space. In this document, "word" refers to a Korean eojeol.

C. Eumjeol (Syllable): An eumjeol is similar to a syllable in English. As most theses on this topic call an eumjeol a syllable, I've done the same here. A sequence of eumjeols forms a word.

D. Josa: Korean postpositions, e.g. 는 (neun, topic marker), 를 (reul, object marker).

E. Eomi: verb endings

F. Morpheme: the smallest linguistic unit that can have meaning.

G. Free morpheme (or unbound morpheme): One that can stand alone.

H. Bound morpheme (the elementary unit of morphosyntax): One that can appear only as part of a larger expression.
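The units above can be illustrated with a minimal Python sketch. It relies only on the standard Unicode layout of precomposed Hangul syllables (U+AC00 to U+D7A3). The helper names `split_eojeols`, `split_eumjeols`, and `decompose_jaso` are hypothetical and are not part of Khaiii.

```python
# Jaso tables in Unicode order for precomposed Hangul syllables.
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"        # 19 initial consonants
JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"  # 21 vowels
# 27 final consonants, plus "" at index 0 for "no final consonant".
JONGSEONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")

HANGUL_BASE = 0xAC00  # first precomposed syllable (가)
HANGUL_LAST = 0xD7A3  # last precomposed syllable (힣)


def split_eojeols(sentence):
    """Split a sentence into eojeols (white-space-delimited words)."""
    return sentence.split()


def split_eumjeols(eojeol):
    """Split an eojeol into eumjeols (one character per syllable)."""
    return list(eojeol)


def decompose_jaso(syllable):
    """Decompose one precomposed Hangul syllable into its 2 or 3 Jaso."""
    code = ord(syllable)
    if not HANGUL_BASE <= code <= HANGUL_LAST:
        return (syllable,)  # not a Hangul syllable: return it unchanged
    offset = code - HANGUL_BASE
    initial, rest = divmod(offset, 21 * 28)
    vowel, final = divmod(rest, 28)
    jaso = (CHOSEONG[initial], JUNGSEONG[vowel])
    if final:
        jaso += (JONGSEONG[final],)
    return jaso


print(split_eojeols("나는 학교에 간다"))  # three eojeols
print(split_eumjeols("학교에"))           # three eumjeols
print(decompose_jaso("간"))               # ㄱ + ㅏ + ㄴ
```

The arithmetic works because Unicode orders the syllables by initial consonant, then vowel, then final consonant, so integer division recovers each index.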

(2) Morphological analysis

-Part-of-speech (POS) tagging: Each part of speech has its own tag. POS tagging assigns a tag to each unit based on the context in which it appears.

-Complex tagging: To form a complete Korean word, a free morpheme must combine with a bound morpheme such as a postposition (Josa) or an ending (Eomi), so a single word can consist of several morphemes. Complex tagging combines the tags of multiple morphemes into a single tag.

Concepts

(1) Morpheme analysis

As a morpheme is the smallest meaningful part of a word, it cannot be further analyzed. Morphological analysis is a type of analysis that aims to segment words or sentences into morphemes, and a morphological analyzer refers to software which separates words into morphemes.

(2) Differences in morphological analysis between Korean and English

For Korean, morphological analysis involves segmenting words into morphemes, restoring their original forms, and assigning accurate POS tags. In English POS tagging, words and tags map one-to-one, so the input sequence and the output sequence have the same length. In Korean, by contrast, the lengths usually differ because a word consists of several morphemes. For example, 져줄래 is decomposed into 지/VV, 어/EC, 주/VX, ㄹ래/EF.
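The length mismatch can be made concrete with a small sketch. The Korean analysis is the one given above; the English sentence and its Penn Treebank style tags are my own illustrative assumption, and `parse_analysis` is a hypothetical helper rather than a Khaiii function.

```python
def parse_analysis(text):
    """Parse 'morph/TAG + morph/TAG' into a list of (morpheme, tag) pairs."""
    pairs = []
    for item in text.split("+"):
        morpheme, tag = item.strip().rsplit("/", 1)
        pairs.append((morpheme, tag))
    return pairs


# English (illustrative): one tag per word, so the lengths match.
english_words = ["She", "runs"]
english_tags = ["PRP", "VBZ"]

# Korean: one word (eojeol) expands into several tagged morphemes.
korean_words = ["져줄래"]
korean_morphemes = parse_analysis("지/VV + 어/EC + 주/VX + ㄹ래/EF")

print(len(english_words), len(english_tags))     # 2 2
print(len(korean_words), len(korean_morphemes))  # 1 4
```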

#Why is morphological analysis more complicated in Korean?

Korean is an agglutinative language, a form of synthetic language in which each affix typically represents one unit of meaning (such as "diminutive," "past tense," "plural," etc.), and bound morphemes are expressed by affixes (and not by internal changes of the root of the word, or changes in stress or tone). There are more than 1,000 unique combinations of free and bound morphemes. A single word can have several possible analyses, which easily leads to analysis errors.

When a Korean word is used in a sentence, its form often changes through irregular conjugation of verbs and adjectives or through dropped vowels. A word often changes form depending on its surroundings. In the example below, '나는' (flying) is decomposed into 날 (fly) + 는 (verb ending). The morphemes colored in orange are the original syllables restored from the input sequence.

[Figure: example of restoring original syllables. Image source: https://aclanthology.org/D19-1150.pdf]

Resource: https://arxiv.org/pdf/1708.01766.pdf

(3) Syllable Restoration

Because Korean is an agglutinative language, words conjugate in many different ways. Once a morpheme has changed form, it is difficult for an analyzer to interpret it correctly. That's why syllables are restored to their original form as recorded in the restoration dictionary: looking up a key in the dictionary returns the original syllables. For example, the complex tag key 했/I-VX:I-EP:0 restores to 하/I-VX, 였/I-EP.
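The lookup can be pictured as a plain key-value search. The sketch below is only an illustration: the 했 entry comes from the example above, while the Python dict layout, the name `RESTORE_DICT`, and the `restore` helper are my assumptions and do not reflect how Khaiii stores or loads its dictionary.

```python
# Toy restoration dictionary.
# Key:   "syllable/complex-tag"
# Value: the restored (syllable, tag) pairs.
# Only this single entry is taken from the text; the layout is an assumption.
RESTORE_DICT = {
    "했/I-VX:I-EP:0": [("하", "I-VX"), ("였", "I-EP")],
}


def restore(syllable, tag):
    """Return the original syllables for a tagged syllable.

    A syllable whose key is not in the dictionary is returned unchanged.
    """
    key = f"{syllable}/{tag}"
    return RESTORE_DICT.get(key, [(syllable, tag)])


print(restore("했", "I-VX:I-EP:0"))  # the two restored syllables 하 and 였
print(restore("다", "I-EF"))         # no entry, so the input comes back as-is
```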

(4) Syllable-Based POS Tagging

Syllable-based POS tagging segments a sentence into syllables and assigns a POS tag to each syllable. After tagging, the original syllables are restored, the final morpheme sequence is built from the syllable-level results, and a POS tag is attached to each morpheme.
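A minimal sketch of these steps for a single word follows. The per-syllable tags are written by hand rather than produced by a model, the one-entry dictionary and the `build_morphemes` helper are hypothetical, and the merging rule (join consecutive syllables that share a POS unless a B- prefix starts a new morpheme) is a simplification, not Khaiii's actual implementation.

```python
# One-entry toy restoration dictionary (layout is an assumption).
RESTORE_DICT = {
    "했/I-VX:I-EP:0": [("하", "I-VX"), ("였", "I-EP")],
}


def build_morphemes(tagged_syllables):
    """Turn (syllable, tag) pairs for one word into (morpheme, POS) pairs."""
    morphemes = []  # each item is [surface, pos]
    for syllable, tag in tagged_syllables:
        # Step 1: restore original syllables when a dictionary entry exists.
        restored = RESTORE_DICT.get(f"{syllable}/{tag}", [(syllable, tag)])
        for char, iob_tag in restored:
            prefix, pos = iob_tag.split("-", 1)
            # Step 2: extend the current morpheme while the POS continues,
            # otherwise start a new morpheme.
            if morphemes and prefix == "I" and morphemes[-1][1] == pos:
                morphemes[-1][0] += char
            else:
                morphemes.append([char, pos])
    return [(surface, pos) for surface, pos in morphemes]


# Hand-tagged syllables for the word 했다 (illustrative input).
print(build_morphemes([("했", "I-VX:I-EP:0"), ("다", "I-EF")]))
# Two syllables with the same POS merge into one morpheme.
print(build_morphemes([("학", "I-NNG"), ("교", "I-NNG"), ("에", "I-JKB")]))
```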

(5) What is the IOB1 format?

The IOB format (short for inside, outside, beginning) is a common format for tagging tokens in chunking tasks in computational linguistics. The I, O, and B prefixes before a tag have the following meanings:

I-prefix: the tag is inside a chunk

O-prefix: a token belongs to no chunk

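B-prefix: the tag is at the beginning of a chunk. In the IOB1 variant, the B-prefix is used only when a chunk immediately follows another chunk of the same type.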
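To see where chunks begin under IOB1, it can help to rewrite the tags so that every chunk start is marked explicitly (the convention usually called IOB2). In IOB1 a chunk normally starts with an I- tag, and B- appears only when a chunk directly follows another chunk of the same type. The sketch below assumes well-formed input tags; `iob1_to_iob2` is a hypothetical helper and the NP/VP labels are illustrative.

```python
def iob1_to_iob2(tags):
    """Rewrite IOB1 tags so that every chunk starts with an explicit B- tag."""
    converted = []
    previous = "O"
    for tag in tags:
        if tag == "O":
            converted.append(tag)
        else:
            prefix, label = tag.split("-", 1)
            # A chunk starts if it is marked B-, follows an O tag,
            # or follows a chunk of a different type.
            starts_chunk = (
                prefix == "B"
                or previous == "O"
                or previous.split("-", 1)[1] != label
            )
            converted.append(("B-" if starts_chunk else "I-") + label)
        previous = tag
    return converted


iob1 = ["I-NP", "I-NP", "O", "I-NP", "B-NP", "I-VP"]
print(iob1_to_iob2(iob1))
# ['B-NP', 'I-NP', 'O', 'B-NP', 'B-NP', 'B-VP']
```

Here the only B- tag in the IOB1 input separates two adjacent NP chunks; every other chunk boundary is implied by an O tag or a change of type.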

(6) What is chunking?

Chunking identifies chunks (such as phrases) in a sequence of tokens (such as words) and classifies them into grammatical classes. Part-of-speech tagging can be regarded as a form of chunking.

*Translator's notes are italicized.

*The original document can be found at https://github.com/kakao/khaiii. Please note that this translation has not been reviewed by the Kakao team; it is a personal project. Please feel free to give feedback on any errors that may have occurred during the translation process.

Translator's Note

Introduction to the Khaiii Github Translation Project: Link

[Khaiii Github] Key terms & Concepts: Link

Other Khaiii Translation

[Khaiii Github] Read Me.md: Link

[Khaiii Github] Pre Analysis Dictionary: Link

[Khaiii Github] CNN Model: Link

[Khaiii Github] Test for Specialized Spacing Error Model: Link

[Khaiii Github] CNN Model Training Process: Link

[Khaiii Github] Analysis Error Patch: Link
