Khaiii Github - Read Me.md (Translated in Eng)

flow123 2021. 6. 24. 18:16

Khaiii

Khaiii (Kakao Hangul Analyzer III) is the third morphological analyzer developed by Kakao. The name also carries on from DHA2 (Daumkakao Hangul Analyzer 2), Kakao’s second morphological analyzer.

A morpheme is the smallest linguistic unit that carries meaning, so it cannot be divided into smaller meaningful parts. A morphological analyzer is software that separates words into morphemes. Morphological analysis is the most basic step in natural language processing and comes before syntactic or semantic analysis. (Source: https://ko.wikipedia.org/wiki/형태소)

Data-Based Algorithm

While the previous version (DHA2) analyzed text using dictionaries and rules, Khaiii uses a data-driven (machine learning) algorithm. The training corpus is based on the 21st Century Sejong Plan Final Results distributed by the National Institute of Korean Language, in which the Kakao team corrected some errors and added some data.

After excluding sentences that caused errors during pre-processing, a corpus of about 850,000 sentences and 10 million eojeols (translated as 'words' from here on) was used to train the analyzer. Please refer to the Corpus document for details about the corpus and its part-of-speech tag set.

* An eojeol is composed of one or more morphemes and is separated from other eojeols by a space and/or a punctuation mark. It is similar to a word in English. (Source: A Korean Morphological Analyzer for Speech Translation System)

Algorithm

Among neural network algorithms, a Convolutional Neural Network (CNN) was chosen for training. In Korean, morphological analysis is the most basic pre-processing step in natural language processing, so speed is essential. We expected that a Recurrent Neural Network (RNN) such as Long Short-Term Memory (LSTM) would be too slow, and therefore excluded it.

For more information about the model, please refer to the CNN Model document.
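The Performance section describes a `win` hyperparameter as the size of the left and right context around a target syllable. As a rough illustration of that idea only, the sketch below builds such a context window for each character of an input string. The padding symbol, the function name, and the treatment of spaces as ordinary characters are assumptions made for this example and are not taken from the Khaiii source.

```python
PAD = "<pad>"  # hypothetical padding symbol, not Khaiii's actual one


def context_windows(text, win):
    """Return, for each character, win characters to the left, the target, and win to the right."""
    chars = list(text)
    padded = [PAD] * win + chars + [PAD] * win
    # Each window has 2 * win + 1 items with the target character in the middle.
    return [padded[i:i + 2 * win + 1] for i in range(len(chars))]


windows = context_windows("다른 것", win=2)
for target, window in zip("다른 것", windows):
    print(repr(target), window)
```

A larger `win` gives the model more context per syllable but also more input to process, which fits the accuracy and speed trade-off discussed later in the document.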

Performance

Accuracy

v0.3

The main hyperparameters of the CNN model are win, the size of the context to the left and right of a target syllable, and emb, the dimension of the syllable embedding. win takes values in {2, 3, 4, 5, 7, 10} and emb takes values in {20, 30, 40, 50, 70, 100, 150, 200, 300, 500}, which gives 60 combinations (6 x 10). All 60 were tested, resulting in the performance graph shown below. The performance metric is the F-score, which is the harmonic mean of precision and recall.
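The search space and the metric described above can be sketched in a few lines. This is only an illustration: the value lists come from the text, while the variable and function names are made up for the example, and the F-score shown is the standard harmonic mean of precision and recall rather than Khaiii's own evaluation script.

```python
from itertools import product

WIN_VALUES = [2, 3, 4, 5, 7, 10]
EMB_VALUES = [20, 30, 40, 50, 70, 100, 150, 200, 300, 500]

# Every (win, emb) pair that would be trained and evaluated.
grid = list(product(WIN_VALUES, EMB_VALUES))


def f_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(len(grid))            # number of combinations: 6 * 10
print((3, 150) in grid)     # the pair later named the large model
print(f_score(0.5, 1.0))
```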

img

Performance is best when win is 3 or 4 and drops at higher values. The F-score rises with emb up to 150, after which it stays roughly constant. Among the top five models, a relatively small one has win=3 and emb=150, with an F-score of 97.11. This model will be called the ‘large model’ from now on.

v0.4

This version was improved through the experiments on a model specialized for spacing errors. While the v0.4 model performs better on inputs that are not properly spaced, its accuracy on the Sejong Corpus is slightly lower. To compensate for this drop, we slightly changed the parameters of the base and large models as shown below.

  • Base Model: win=4, emb=35, F-Score: 94.96
  • Large Model: win=4, emb=180, F-Score: 96.71

Speed

v0.3

As the model grows, accuracy increases but speed decreases because of the additional computation. Therefore, among the models with reasonable accuracy, a smaller and faster one was chosen as the base model. Among the models with an F-score above 95, the small one has win=3 and emb=30, with an F-score of 95.30.

To compare the speed of the two models, we analyzed 10,000 sentences of text (903 KB in total, a per-sentence average of 91). The base model takes about 10.5 seconds and the large model takes about 78.8 seconds.

v0.4

Because the models became larger, we re-measured the speed of the base and large models as shown below. Both are slower in v0.4 than in v0.3.

  • Base Model: 10.8 -> 14.4 (seconds, v0.3 -> v0.4)
  • Large Model: 87.3 -> 165 (seconds, v0.3 -> v0.4)
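To put the re-measured figures in perspective, the short calculation below turns them into slowdown ratios. The numbers are the ones listed above; the dictionary layout and names are just for this example. Since only ratios are computed, the result does not depend on the unit of the measurements.

```python
# (v0.3 time, v0.4 time) for each model, as listed above.
timings = {"base": (10.8, 14.4), "large": (87.3, 165.0)}

slowdowns = {name: v04 / v03 for name, (v03, v04) in timings.items()}

for name, ratio in slowdowns.items():
    print(f"{name}: {ratio:.2f}x the v0.3 time")

# How much slower the large model is than the base model within v0.4.
large_vs_base_v04 = timings["large"][1] / timings["base"][1]
print(f"large vs base in v0.4: {large_vs_base_v04:.1f}x")
```

By this arithmetic the base model takes about a third longer in v0.4, while the large model takes nearly twice as long.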

User Dictionary

A neural network is a so-called black-box algorithm, and it is difficult for a human to follow how it arrives at its results. Hence, when an analysis error occurs, it is very difficult to revise the model’s parameters to obtain the correct result. To assist users, Khaiii provides two types of user dictionaries: a pre-analyzed dictionary applied before the neural network, and an analysis error patch applied after it.

Pre-analyzed dictionary

The pre-analyzed dictionary is used when a word is always analyzed the same way regardless of its context. For example, given the entry below,

Input word: Ethereum*
Analysis result: Ethereum/NNP (proper noun)

every word in a sentence that starts with Ethereum will be analyzed as Ethereum/NNP without using the neural network algorithm.
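The lookup behaviour described above can be sketched as a simple prefix match. This is a minimal illustration, not Khaiii's implementation: the in-memory dictionary, the function name, and the rule that entries without a trailing `*` must match exactly are assumptions made for the example, and the real dictionary file format is not reproduced here.

```python
# Hypothetical in-memory form of the entry shown above.
PREANALYZED = {"Ethereum*": "Ethereum/NNP"}


def lookup(word, entries):
    """Return the pre-analyzed result for word, or None to fall back to the model."""
    for pattern, analysis in entries.items():
        if pattern.endswith("*"):
            # Trailing '*' means: match any word that starts with the prefix.
            if word.startswith(pattern[:-1]):
                return analysis
        elif word == pattern:
            # Assumed rule for entries without '*': exact match only.
            return analysis
    return None


print(lookup("Ethereum", PREANALYZED))
print(lookup("Ethereums", PREANALYZED))
print(lookup("Bitcoin", PREANALYZED))
```

A word that returns `None` here would be passed on to the neural network, which is why a dictionary hit saves time.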

When a pre-analyzed dictionary is automatically extracted from the words in the Sejong Corpus that have no analysis ambiguity, about 80,000 entries are created. (A pre-analyzed dictionary does not interpret a word in the context of its sentence. Please note that if a word can have multiple interpretations, the dictionary may produce inaccurate results.) Applying it to the base model improved speed slightly, by about 10% (to approximately 9.2 seconds).

For more details about the pre-analyzed dictionary and its technical guide, please refer to the pre-analyzed dictionary document.

Analysis Error Patch

The analysis error patch is used when an analysis must be corrected based on context that spans multiple words. For example, take the entry for 이 다른 것 below.

Input text: 이 다른 것
Incorrect analysis: 이/JKS + _ + 다/VA + 른/MM + _ + 것/NNB
Correct analysis: 이/JKS + _ + 다르/VA + ㄴ/ETM + _ + 것/NNB

If Khaiii outputs the incorrect analysis above, it is replaced with the correct analysis. “_” here refers to the space between words. For more details about the analysis error patch and its technical guide, please refer to the analysis error patch document.
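The correction step can be pictured as replacing one morpheme sequence with another. The sketch below is a simplified illustration under that assumption: the list representation and the function name are invented for the example, and it matches only on the analyzed output, whereas the entry above also records the input text.

```python
# The incorrect and correct analyses from the entry above, as token lists.
WRONG = ["이/JKS", "_", "다/VA", "른/MM", "_", "것/NNB"]
RIGHT = ["이/JKS", "_", "다르/VA", "ㄴ/ETM", "_", "것/NNB"]


def apply_patch(morphs, wrong, right):
    """Replace every occurrence of the wrong sequence in morphs with the right one."""
    out, i = [], 0
    while i < len(morphs):
        if wrong and morphs[i:i + len(wrong)] == wrong:
            out.extend(right)
            i += len(wrong)
        else:
            out.append(morphs[i])
            i += 1
    return out


analysis = ["말/NNG", "_"] + WRONG
print(" + ".join(apply_patch(analysis, WRONG, RIGHT)))
```

Because the match spans the "_" tokens, a patch like this can fix errors that depend on neighbouring words, which a single-word dictionary entry cannot do.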

Build and install

Please refer to the Build and Installation document for details.

*You can find translator's notes italicized at the end of each paragraph

*The original document can be found at https://github.com/kakao/khaiii. Please note that this document has not been reviewed by the Kakao team; it is a personal project. Please feel free to send feedback on any errors introduced during translation.

Translator's Note

Introduce Khaiii Github Translation Project: Link

[Khaiii Github] Key terms & Concepts: Link

Other Khaiii Translation

[Khaiii Github] Read Me.md: Link

[Khaiii Github] Pre Analysis Dictionary: Link

[Khaiii Github] CNN Model: Link

[Khaiii Github] Test for Specialized Spacing Error Model: Link

[Khaiii Github] CNN Model Training Process: Link

[Khaiii Github] Analysis Error Patch: Link
