Khaiii Github - Read Me.md (Translated in Eng)
flow123 2021. 6. 24. 18:16
Khaiii
Khaiii (Kakao Hangul Analyzer III) is the third morphological analyzer developed by Kakao. Its name also follows that of DHA2 (Daumkakao Hangul Analyzer 2), the second version of Kakao's morphological analyzer.
A morpheme is the smallest linguistic unit that carries meaning. Because it is the smallest meaningful part of a word, a morpheme cannot be analyzed any further. A morphological analyzer is software that separates words into morphemes. Morphological analysis is the most basic process in natural language processing and the first step before syntactic or semantic analysis. (Source: https://ko.wikipedia.org/wiki/형태소)
Data-Based Algorithm
While the prior version (DHA2) performs analysis based on dictionaries and rules, Khaiii uses a data-based (machine learning) algorithm. The training corpus is built on the 21st Century Sejong Plan Final Results distributed by the National Institute of Korean Language; the Kakao team corrected errors in it and added some data.
After excluding sentences that caused errors in pre-processing, a corpus of about 850,000 sentences and 10 million eojeols (translated as 'words' from here on) was used to train the analyzer. Please refer to the Corpus document for details about the corpus and the part-of-speech tag set.
* An eojeol is composed of one or more morphemes and is separated from other eojeols by a space and/or a punctuation mark. It is similar to a word in English. (Source: A Korean Morphological Analyzer for Speech Translation System)
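To make the eojeol/morpheme distinction concrete, here is a minimal sketch in Python. It assumes eojeols are separated by whitespace only (punctuation handling is left out), and the `Morph` type and `split_eojeols` helper are hypothetical names used for illustration, not part of Khaiii. The morpheme breakdown is copied by hand from the corrected example in the Analysis Error Patch section.

```python
from typing import Dict, List, NamedTuple


class Morph(NamedTuple):
    """One morpheme: its form and its part-of-speech tag (hypothetical type)."""
    lex: str
    tag: str


def split_eojeols(sentence: str) -> List[str]:
    """Split a sentence into eojeols, assuming whitespace is the only separator."""
    return sentence.split()


eojeols = split_eojeols("이 다른 것")

# Hand-written analysis taken from the corrected example in this document;
# no analyzer is run here.
analysis: Dict[str, List[Morph]] = {
    "이": [Morph("이", "JKS")],
    "다른": [Morph("다르", "VA"), Morph("ㄴ", "ETM")],
    "것": [Morph("것", "NNB")],
}

for eojeol in eojeols:
    print(eojeol, "->", " + ".join(f"{m.lex}/{m.tag}" for m in analysis[eojeol]))
```

One eojeol can hold several morphemes: 다른 is a single space-delimited unit but is made of two morphemes.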
Algorithm
Among neural network algorithms, a Convolutional Neural Network (CNN) was used for machine learning. In Korean, morphological analysis is the most basic pre-processing step for natural language processing, which means speed is essential. We expected that Recurrent Neural Networks (RNNs) such as Long Short-Term Memory (LSTM) would be too slow, and therefore excluded them.
Please find more information about the CNN Model in the CNN Model page below.
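As a rough illustration of the kind of input a window-based CNN tagger works on, the sketch below builds, for each syllable, a context of `win` syllables to its left and right. This is a simplification under stated assumptions, not Khaiii's actual feature extraction: the `PAD` symbol and the `context_windows` function are hypothetical, and spaces are treated like any other character.

```python
from typing import List

PAD = "<pad>"  # hypothetical padding symbol for positions outside the text


def context_windows(text: str, win: int) -> List[List[str]]:
    """Return one window of 2 * win + 1 syllables per input syllable."""
    syllables = list(text)
    padded = [PAD] * win + syllables + [PAD] * win
    # The window for syllable i covers padded[i] .. padded[i + 2 * win],
    # so the target syllable sits in the middle of the window.
    return [padded[i:i + 2 * win + 1] for i in range(len(syllables))]


windows = context_windows("형태소", win=1)
for window in windows:
    print(window)
```

Because every window has a fixed size and is independent of the others, the windows can be processed in parallel, which is one reason a convolutional model can be faster than a recurrent one.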
Performance
Accuracy
v0.3
The main hyperparameters of the CNN model are win, the number of syllables taken as context to the left and right of a target syllable, and emb, the dimension of the syllable embedding. win takes a value from {2, 3, 4, 5, 7, 10} and emb takes a value from {20, 30, 40, 50, 70, 100, 150, 200, 300, 500}. Combining the two gives 60 experiments (6 x 10), which resulted in the performance graph shown below. The performance metric is the F-score, the harmonic mean of precision and recall.
The win parameter performs best at 3 or 4, and performance drops at higher values. The F-score rises with the emb parameter up to 150, after which it levels off. Among the top 5 models, a relatively small one has win=3 and emb=150, with an F-score of 97.11. This model will be called the 'Large model' from now on.
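The experiment count and the metric can be checked with a short sketch. The value lists come from the text above; the `f_score` helper is a generic implementation of the harmonic mean of precision and recall (F1), not code from Khaiii, and the inputs in the usage line are made-up numbers for illustration.

```python
from itertools import product

WIN_VALUES = [2, 3, 4, 5, 7, 10]
EMB_VALUES = [20, 30, 40, 50, 70, 100, 150, 200, 300, 500]

# Every (win, emb) combination is one training run.
grid = list(product(WIN_VALUES, EMB_VALUES))


def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(len(grid))                      # number of experiments
print(round(f_score(0.5, 1.0), 4))    # illustrative inputs only
```

The harmonic mean penalizes an imbalance between the two values: with precision 0.5 and recall 1.0 the F-score is about 0.667, lower than the arithmetic mean of 0.75.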
v0.4
This version was improved through the experiments on a model specialized for spacing errors. While the v0.4 model performs better on input that is not properly spaced, its accuracy on the Sejong Corpus is somewhat lower. To compensate, we slightly changed the parameters of the base and large models, as shown below.
- Base Model: win=4, emb=35, F-Score: 94.96
- Large Model: win=4, emb=180, F-Score: 96.71
Speed
v0.3
As model size increases, accuracy improves but speed drops because of the additional computation. Therefore, among the models with reasonable accuracy, a smaller and faster model was chosen as the base model. Specifically, the model with win=3 and emb=30 is small and still reaches an F-score of 95.30, above 95.
To compare the speed of the two models, we analyzed 10,000 sentences of text (903 KB in total, sentence average 91). The base model takes about 10.5 seconds and the large model takes about 78.8 seconds.
v0.4
Because the model sizes increased, we re-measured the speed of the base and large models, as shown below. Both are slower in v0.4 than in v0.3.
- Base Model: 10.8 -> 14.4 (seconds)
- Large Model: 87.3 -> 165 (seconds)
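The size of the slowdown is easier to read as a ratio. The sketch below only does arithmetic on the figures listed above, assuming they are seconds measured on the same test text; the variable names are for illustration.

```python
# (v0.3 time, v0.4 time) per model, taken from the list above.
timings = {
    "base": (10.8, 14.4),
    "large": (87.3, 165.0),
}

# Ratio of the v0.4 time to the v0.3 time; values above 1 mean slower.
slowdown = {name: after / before for name, (before, after) in timings.items()}

for name, ratio in slowdown.items():
    print(f"{name}: {ratio:.2f}x slower")
```

By these figures the base model takes roughly a third longer in v0.4, while the large model takes close to twice as long.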
User Dictionary
A neural network is a so-called black-box algorithm, and it is difficult for a person to follow how it arrives at its results. Hence, when an analysis error occurs, it is very difficult to adjust the model's parameters to obtain the correct result. To assist users, Khaiii provides two types of user dictionary: a pre-analyzed dictionary applied before the neural network algorithm, and an analysis error patch applied after it.
Pre-analyzed dictionary
The pre-analyzed dictionary is used when a word’s analysis shows a consistent result regardless of its context. For example, if we have a data entry as written below,
Input Word | Result of Analysis |
---|---|
Ethereum* | Ethereum/NNP (proper noun) |
every word starting with Ethereum will be analyzed as Ethereum/NNP without using the neural network algorithm.
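The lookup described above can be sketched as follows. This is an illustrative simplification, not Khaiii's implementation or file format: `lookup_preanalyzed`, `analyze`, and the `model` callable are hypothetical names, a trailing `*` is assumed to mean "any word starting with this prefix", and what happens to the rest of a prefix-matched word is left out.

```python
from typing import Callable, Dict, List, Optional, Tuple

Analysis = List[Tuple[str, str]]  # list of (morpheme, tag) pairs


def lookup_preanalyzed(word: str, entries: Dict[str, Analysis]) -> Optional[Analysis]:
    """Return the stored analysis for a word, or None if no entry matches."""
    if word in entries:  # exact entries take priority
        return entries[word]
    for pattern, result in entries.items():
        # A trailing "*" marks a prefix entry (assumed convention).
        if pattern.endswith("*") and word.startswith(pattern[:-1]):
            return result
    return None


def analyze(word: str, entries: Dict[str, Analysis],
            model: Callable[[str], Analysis]) -> Analysis:
    """Use the dictionary first; call the (hypothetical) model only on a miss."""
    hit = lookup_preanalyzed(word, entries)
    return hit if hit is not None else model(word)


ENTRIES = {"Ethereum*": [("Ethereum", "NNP")]}
print(lookup_preanalyzed("Ethereum", ENTRIES))
print(lookup_preanalyzed("Bitcoin", ENTRIES))
```

Skipping the model on a dictionary hit is what makes the lookup a potential speed gain as well as a way to pin down a result.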
When a pre-analyzed dictionary is automatically extracted from the words in the Sejong Corpus that have no ambiguity in their analysis, about 80K entries are created. (Please note that a pre-analyzed dictionary does not interpret a word in the context of its sentence, so an entry for a word that can be analyzed in more than one way may lead to inaccurate results.) Applied to the base model, this dictionary gave a slight speed improvement of about 10% (approx. 9.2 seconds).
For more details about the pre-analyzed dictionary and its technical guide, please refer to the pre-analyzed dictionary document.
Analysis Error Patch
The analysis error patch is used when an analysis should be corrected based on the context across multiple words. For example, with the data entry (이 다른 것) below,
Input Text | Incorrect analysis result | Correct analysis result |
---|---|---|
이 다른 것 | 이/JKS + _ + 다/VA + 른/MM + _ + 것/NNB | 이/JKS + _ + 다르/VA + ㄴ/ETM + _ + 것/NNB |
If Khaiii outputs the incorrect analysis result shown above, it will be corrected to the right analysis result. "_" here refers to the space between words. For more details about the analysis error patch and its technical guide, please refer to the analysis error patch document.
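One way to picture the patch step is as a sequence replacement over the analyzer's output. The sketch below is a simplified illustration under stated assumptions, not Khaiii's patch format or matching logic: morphemes are plain `form/TAG` strings, `_` marks a space, the input text is not consulted, and `apply_patch` is a hypothetical helper.

```python
from typing import List

WRONG = ["이/JKS", "_", "다/VA", "른/MM", "_", "것/NNB"]
RIGHT = ["이/JKS", "_", "다르/VA", "ㄴ/ETM", "_", "것/NNB"]


def apply_patch(morphs: List[str], wrong: List[str], right: List[str]) -> List[str]:
    """Replace every occurrence of the `wrong` sequence with `right`."""
    out: List[str] = []
    n = len(wrong)
    i = 0
    while i < len(morphs):
        if n and morphs[i:i + n] == wrong:
            out.extend(right)  # swap in the corrected analysis
            i += n
        else:
            out.append(morphs[i])
            i += 1
    return out


patched = apply_patch(WRONG, WRONG, RIGHT)
print(" + ".join(patched))
```

Because the pattern spans several words, the correction applies only when the whole context matches; output that does not contain the pattern passes through unchanged.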
Build and install
Please refer to the Build and Installation document for details.
*You can find translator's notes italicized at the end of each paragraph
*The original document can be found at https://github.com/kakao/khaiii. Please note that this document has not been reviewed by the Kakao team; it is my personal project. Please feel free to give feedback on any errors that may have occurred during translation.
Translator's Note
Introduce Khaiii Github Translation Project: Link
[Khaiii Github] Key terms & Concepts: Link
Other Khaiii Translations
[Khaiii Github] Read Me.md: Link
[Khaiii Github] Pre Analysis Dictionary: Link
[Khaiii Github] CNN Model: Link
[Khaiii Github] Test for Specialized Spacing Error Model: Link
[Khaiii Github] CNN Model Training Process: Link
[Khaiii Github] Analysis Error Patch: Link