Khaiii Github - Analysis Error Patch

flow123 2021. 7. 16. 10:07

Analysis Error Patch

Any machine learning model can make analysis errors, and no morphological analyzer is 100% accurate on every input. The analysis error patch is a user dictionary that corrects the model's analysis errors.

Pre-analysis Dictionary vs Analysis Error Patch

The differences between the pre-analysis dictionary and the analysis error patch are summarized below.

Pre-analysis Dictionary | Analysis Error Patch
Applied before running the machine learning model | Applied to the result of the machine learning model
Speeds up analysis | Slows down analysis
Can only be used for a single word | Can be used for multiple words and morphemes

Dictionary files

There are four files under the rsc/src directory:

  • base.errpatch.auto: The base model's analysis error patch entries, extracted automatically from the corpus.
  • base.errpatch.manual: Entries added manually by users to correct the base model's analysis errors.
  • large.errpatch.auto: The large model's analysis error patch entries, extracted automatically from the corpus.
  • large.errpatch.manual: Entries added manually by users to correct the large model's analysis errors.

The base and large models each have their own analysis error patch. As with the pre-analysis dictionary, all the files starting with base.errpatch are used to build the base model's patch. Also, whenever you change the dictionary, make sure to run the make resource command to rebuild it.
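As a small illustration of the file layout above, the following sketch collects the patch source files for one model size. The helper name errpatch_sources is hypothetical and not part of Khaiii, and the glob pattern is only an assumption based on the file names listed above.

```python
from pathlib import Path
from typing import List


def errpatch_sources(rsc_src: str, model_size: str) -> List[Path]:
    """Return the patch source files for one model size, sorted by name.

    Hypothetical helper: it only mirrors the naming scheme
    <model_size>.errpatch.auto / <model_size>.errpatch.manual.
    """
    if model_size not in ("base", "large"):
        raise ValueError(f"unknown model size: {model_size}")
    return sorted(Path(rsc_src).glob(f"{model_size}.errpatch.*"))
```

For example, errpatch_sources("rsc/src", "base") would return the paths of base.errpatch.auto and base.errpatch.manual if both files exist.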

Dictionary Format

Each entry of the analysis error patch is a single row with three columns: the input, the incorrect analysis, and the correct analysis. Below is part of the dictionary.

Input	Incorrect Analysis	Correct Analysis
중증급성호흡기증후군	중증급/NNG + 성호흡기/NNG + 증후군/NNG	중증/NNG + 급성/NNG + 호흡기/NNG + 증후군/NNG
된다는 것	되/XSV + ㄴ다/EF + 는/ETM + _ + 것/NNB	되/XSV + ㄴ다는/ETM + _ + 것/NNB
하지만,	| + 하지/MAJ + 만/EC + ,/SP	| + 하지만/MAJ + ,/SP
# The entries below are hypothetical and are given only for explanation.
복잡하다.	_ + 복잡/XR + 하/XSV + 다/EC + ./SF + |	_ + 복잡/XR + 하/XSA + 다/EF + ./SF + |
검색질의	| + 검색/NNG + 질/XSN + 의/JKG + |	| + 검색/NNG + 질의/NNG + |
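To make the three-column layout concrete, here is a minimal parsing sketch. It assumes that the columns are separated by tab characters and that lines starting with '#' are comments. Both points are assumptions on my part, and the names PatchEntry, parse_analysis and parse_line are hypothetical rather than Khaiii's own code. Tokens without a tag, such as '_' and '|', are the special markers explained in the following paragraphs.

```python
from typing import List, NamedTuple, Optional, Tuple

Morph = Tuple[str, str]  # (surface form, POS tag); special markers get an empty tag
SPECIAL_MARKERS = {"_", "|"}


class PatchEntry(NamedTuple):
    raw: str             # input text (first column)
    wrong: List[Morph]   # incorrect analysis (second column)
    right: List[Morph]   # correct analysis (third column)


def parse_analysis(column: str) -> List[Morph]:
    """Split 'lex/TAG + lex/TAG' into (lex, tag) pairs."""
    morphs = []
    for token in column.split(" + "):
        token = token.strip()
        if token in SPECIAL_MARKERS:
            morphs.append((token, ""))
        else:
            # Split on the last '/', so a surface form may itself contain '/'.
            lex, tag = token.rsplit("/", 1)
            morphs.append((lex, tag))
    return morphs


def parse_line(line: str) -> Optional[PatchEntry]:
    """Parse one dictionary line; return None for blank and comment lines."""
    line = line.rstrip("\n")
    if not line.strip() or line.startswith("#"):
        return None
    raw, wrong, right = line.split("\t")
    return PatchEntry(raw, parse_analysis(wrong), parse_analysis(right))
```

Keeping the special markers in the morpheme list, instead of dropping them, preserves the position information that the matching step needs.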

There are two special morphemes: the space marker '_' and the boundary marker '|'. A space marks the boundary between two words, and the input column may contain spaces. If the input contains a space, use '_' as the corresponding syllable in the incorrect-analysis and correct-analysis columns. A space can also appear at the very beginning or end of an entry, so please make sure not to omit it.

'|' indicates the beginning or end of the input sentence. In the example above, the entry "하지만," applies only when it is at the very beginning of the input sentence. The fourth entry, "복잡하다.", applies only when it is at the end of the sentence.

As in the last entry, if you use '|' at both the beginning and the end, the entry applies only when the whole input matches it exactly, which helps prevent unintended corrections. Also, analysis error patch entries can be created automatically from the errors that the machine learning model makes on the corpus.
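The effect of the boundary marker ('|' in the dictionary) can be illustrated with a simplified sketch. It matches on the raw text only, whereas the real patch is applied to the model's analysis result, so treat it as an illustration of the anchoring idea rather than Khaiii's matching algorithm. The function name match_positions and its parameters are hypothetical.

```python
from typing import List


def match_positions(sentence: str, raw: str,
                    at_start: bool = False, at_end: bool = False) -> List[int]:
    """Return the start offsets at which an entry would be allowed to apply.

    at_start / at_end stand for a boundary marker at the beginning / end
    of the entry's analysis columns.
    """
    positions = []
    start = sentence.find(raw)
    while start != -1:
        end = start + len(raw)
        start_ok = not at_start or start == 0
        end_ok = not at_end or end == len(sentence)
        if start_ok and end_ok:
            positions.append(start)
        start = sentence.find(raw, start + 1)
    return positions
```

With both flags set, the only possible match is a sentence that consists of the entry's input and nothing else, which corresponds to the exact-match behavior described above.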

Build the Dictionary

You can build all resources into the build directory with the make resource command, or you can build only the analysis error patch with the following commands.

cd rsc
mkdir -p ../build/share/khaiii
PYTHONPATH=$(pwd)/lib ./bin/compile_errpatch.py --model-size=base --rsc-src ./src --rsc-dir=../build/share/khaiii

Choose "base" or "large" for the --model-size option. The --rsc-src option specifies the rsc/src directory, which contains the base.errpatch.auto and base.errpatch.manual files. The --rsc-dir option specifies the directory where the binary dictionary is written, ../build/share/khaiii in this example. Once the build succeeds, the following three files are created.

errpatch.tri
errpatch.len
errpatch.val
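After building, a quick check like the one below can confirm that the three files listed above are present. This is only a convenience sketch: the function name missing_errpatch_files is hypothetical, and the check looks at file names only, not at file contents.

```python
from pathlib import Path
from typing import List

EXPECTED_FILES = ("errpatch.tri", "errpatch.len", "errpatch.val")


def missing_errpatch_files(rsc_dir: str) -> List[str]:
    """Return the names of expected binary patch files not found in rsc_dir."""
    directory = Path(rsc_dir)
    return [name for name in EXPECTED_FILES if not (directory / name).is_file()]
```

An empty list from missing_errpatch_files("../build/share/khaiii") would mean that all three files exist in the output directory.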

Load the Dictionary

As with the pre-analysis dictionary, after adding a new entry to the analysis error patch you need to copy the rebuilt dictionary over the installed one, under /usr/local/share/khaiii or the Python site-packages path. For the installation path of the dictionary, please refer to 'about install location'.

Alternatively, you can pass the dictionary path as an argument when you create the API object, as follows.

from khaiii import KhaiiiApi

api = KhaiiiApi(rsc_dir='/path/to/custom/khaiii/dictionary')
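To check whether a patch entry takes effect, it helps to print the analysis in the same "lex/TAG + lex/TAG" notation that the dictionary uses. The sketch below assumes that Khaiii is installed and that the objects returned by analyze expose lex, morphs and tag attributes. The helper names format_words and analyze_with_custom_dictionary are hypothetical.

```python
from typing import Iterable, List


def format_words(words: Iterable) -> List[str]:
    """Render analyzed words as 'word<TAB>lex/TAG + lex/TAG' strings."""
    return [
        f"{word.lex}\t" + " + ".join(f"{m.lex}/{m.tag}" for m in word.morphs)
        for word in words
    ]


def analyze_with_custom_dictionary(rsc_dir: str, text: str) -> List[str]:
    """Analyze text with the resources in rsc_dir and format the result."""
    # Imported here so that format_words can be used without Khaiii installed.
    from khaiii import KhaiiiApi

    api = KhaiiiApi(rsc_dir=rsc_dir)
    return format_words(api.analyze(text))
```

Running the same sentence before and after rebuilding the patch and comparing the two outputs is a simple way to verify a new entry.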

Arrangement per syllable

As with the pre-analysis dictionary, the analysis error patch needs to arrange (align) the syllables of the input with the analysis result so that a POS tag can be assigned to each syllable. For details about the arrangement per syllable, please refer to the CNN Model and pre-analysis dictionary documents.
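The idea of one tag per syllable can be sketched as follows. This is a deliberately simplified illustration: it assumes that every morpheme appears unchanged in the input, and it uses a generic B-/I- labelling scheme that I chose for the example. Khaiii's actual tag set and its handling of spelling changes (through char_align.map) are more involved. The function name syllable_tags is hypothetical.

```python
from typing import List, Tuple


def syllable_tags(raw: str, morphs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Assign a B-/I- prefixed tag to each syllable of raw.

    Raises ValueError when the morphemes do not line up with the input,
    which is the situation where an alignment rule would be needed.
    """
    tagged = []
    pos = 0
    for lex, tag in morphs:
        if not lex or raw[pos:pos + len(lex)] != lex:
            raise ValueError(f"cannot align {lex!r} at offset {pos}")
        # First syllable of the morpheme gets B-, the rest get I-.
        labels = [f"B-{tag}"] + [f"I-{tag}"] * (len(lex) - 1)
        tagged.extend(zip(lex, labels))
        pos += len(lex)
    if pos != len(raw):
        raise ValueError("analysis does not cover the whole input")
    return tagged
```

An analysis such as 되/XSV + ㄴ다/EF for the input "된다" cannot be aligned by this naive version, because the syllable "된" merges parts of two morphemes. Cases like this are where the extra alignment rules come in.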

A known issue

The analysis error patch performs the comparison twice for each entry: once between the input and the incorrect analysis, and once between the input and the correct analysis. If the latter fails, you should add a new rule to the rsc/src/char_align.map file, as described in the pre-analysis dictionary document.

The input and the incorrect analysis also need to be compared, in order to speed things up. However, if a new rule for the incorrect analysis is added to rsc/src/char_align.map, it affects the comparison between the correct analysis and the input, so you must not add one. Currently, an entry cannot be added when the comparison between the input and the incorrect analysis fails. This is a known issue that Khaiii still needs to address, so please wait for an update. (The Khaiii team welcomes contributions to the project.)
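The decision described in this section can be summarized in a short sketch. The alignment check is passed in as a function, and the default naive_aligns is only a stand-in that compares concatenated surface forms. It is much stricter than Khaiii's real aligner. The names diagnose and naive_aligns, the returned labels, and the order of the two checks are all my own choices for illustration.

```python
from typing import Callable, List, Tuple

Morph = Tuple[str, str]  # (surface form, POS tag); special markers use an empty tag


def naive_aligns(raw: str, morphs: List[Morph]) -> bool:
    """Stand-in check: the surface forms, minus markers, must spell the input."""
    surface = "".join(lex for lex, _tag in morphs if lex not in ("_", "|"))
    return surface == raw.replace(" ", "")


def diagnose(raw: str, wrong: List[Morph], right: List[Morph],
             aligns: Callable[[str, List[Morph]], bool] = naive_aligns) -> str:
    """Classify an entry according to the two comparisons described above."""
    if not aligns(raw, wrong):
        # Input vs incorrect analysis fails: the entry cannot be added for now.
        return "blocked"
    if not aligns(raw, right):
        # Input vs correct analysis fails: a char_align.map rule is needed.
        return "needs_rule"
    return "ok"
```

Passing the aligner as a parameter keeps the decision logic separate from the alignment algorithm, so a more faithful check could be swapped in later.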

Translator's notes are italicized.

*The original document can be found at https://github.com/kakao/khaiii. Please note that this translation has not been reviewed by the Kakao team; it is a personal project. Please feel free to give feedback on any errors introduced during translation.

Translator's Note

Introduce Khaiii Github Translation Project: Link

[Khaiii Github] Key terms & Concepts: Link

Other Khaiii Translation

[Khaiii Github] Read Me.md: Link

[Khaiii Github] Pre Analysis Dictionary: Link

[Khaiii Github] CNN Model: Link

[Khaiii Github] Test for Specialized Spacing Error Model: Link

[Khaiii Github] CNN Model Training Process: Link

[Khaiii Github] Analysis Error Patch: Link
