Dev Diary
Khaiii Github - Analysis Error Patch
flow123 2021. 7. 16. 10:07
Analysis Error Patch
Any machine learning model can make analysis errors, and no morphological analyzer is 100% accurate on every input. The analysis error patch is a user dictionary that corrects the model's analysis errors.
Pre-Analysis Dictionary vs Analysis Error Patch
The table below summarizes the differences between the pre-analysis dictionary and the analysis error patch.
Pre-Analysis Dictionary | Analysis Error Patch |
---|---|
Applied before running the machine learning model | Applied to the result of the machine learning model |
Speeds up analysis | Slows down analysis |
Can only be used for a single word | Can be used for multiple words and morphemes |
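To make the "applied to the result of the machine learning model" row concrete, here is a toy sketch of a post-processing patch. It is not Khaiii's actual implementation: the function name `apply_patch` and the list-of-tuples representation are assumptions made for illustration, and the real patch also checks the matching input text.

```python
from typing import List, Tuple

Morph = Tuple[str, str]  # (lexical form, POS tag)


def apply_patch(result: List[Morph], wrong: List[Morph], right: List[Morph]) -> List[Morph]:
    """Replace every run of `wrong` morphemes in the model output with `right`."""
    if not wrong:
        return list(result)
    patched: List[Morph] = []
    i = 0
    while i < len(result):
        if result[i:i + len(wrong)] == wrong:
            # The erroneous run was found: emit the corrected morphemes instead.
            patched.extend(right)
            i += len(wrong)
        else:
            patched.append(result[i])
            i += 1
    return patched


# Hypothetical model output for "중증급성호흡기증후군".
model_output = [("중증급", "NNG"), ("성호흡기", "NNG"), ("증후군", "NNG")]
wrong = [("중증급", "NNG"), ("성호흡기", "NNG")]
right = [("중증", "NNG"), ("급성", "NNG"), ("호흡기", "NNG")]
print(apply_patch(model_output, wrong, right))
```

Because the patch runs as an extra pass over output the model has already produced, it adds work to every analysis, which is consistent with the "slows down analysis" row above.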
Dictionary files
There are four files under the directory rsc/src:
- base.errpatch.auto: The base model's analysis error patch entries, which were automatically extracted from the corpus.
- base.errpatch.manual: The user's entries, entered manually to correct the base model's analysis errors.
- large.errpatch.auto: The large model's analysis error patch entries, which were automatically extracted from the corpus.
- large.errpatch.manual: The user's entries, entered manually to correct the large model's analysis errors.
The base and large models each have their own analysis error patch. As with the pre-analysis dictionary, all the files starting with base.errpatch are used to build the base model's patch. Whenever you change the dictionary, make sure to run the make resource command to rebuild it.
Dictionary Format
Each entry in the analysis error patch takes one line and consists of three columns. Below is part of the dictionary.
Input | Erroneous Analysis | Correct Analysis |
---|---|---|
중증급성호흡기증후군 | 중증급/NNG + 성호흡기/NNG + 증후군/NNG | 중증/NNG + 급성/NNG + 호흡기/NNG + 증후군/NNG |
된다는 것 | 되/XSV + ㄴ다/EF + 는/ETM + _ + 것/NNB | 되/XSV + ㄴ다는/ETM + _ + 것/NNB |
하지만, | \| + 하지/MAJ + 만/EC + ,/SP | \| + 하지만/MAJ + ,/SP |
# The entries below are hypothetical and for illustration only. | | |
복잡하다. | _ + 복잡/XR + 하/XSV + 다/EC + ./SF + \| | _ + 복잡/XR + 하/XSA + 다/EF + ./SF + \| |
검색질의 | \| + 검색/NNG + 질/XSN + 의/JKG + \| | \| + 검색/NNG + 질의/NNG + \| |
There are two special morphemes: '_' and '|'. '_' marks the boundary between two words, that is, spacing, and the input itself may contain spaces. If the input contains a space, the erroneous analysis and correct analysis columns must have '_' at the corresponding position. Spacing can also appear at the very beginning or end of an entry, and it must not be omitted.
'|' marks the beginning or end of the input word or sentence. In the example above, the entry "하지만," applies only when it is at the very beginning of the input sentence. The fourth entry, "복잡하다.", applies only when it is at the very end.
As in the last entry, if you put '|' at both the beginning and the end, the entry applies only when the whole input matches it exactly, which keeps the patch from introducing new analysis errors. In addition, analysis error patch entries can be generated automatically from corpus sentences that the machine learning model analyzes incorrectly.
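The entry format described above can be read with a few lines of Python. This is a minimal sketch, not Khaiii's own parser: it assumes the three columns are tab-separated in the source files and that morphemes are joined by " + ". The names `parse_morphs` and `parse_entry` are hypothetical.

```python
from typing import List, NamedTuple, Optional, Tuple

WORD_BOUNDARY = "_"  # spacing between words
SENT_BOUNDARY = "|"  # beginning or end of the input

Morph = Tuple[str, Optional[str]]  # (lexical form, POS tag); tag is None for boundaries


class Entry(NamedTuple):
    raw: str
    wrong: List[Morph]
    right: List[Morph]


def parse_morphs(column: str) -> List[Morph]:
    """Split 'lex/TAG + lex/TAG + ...' into (lex, tag) pairs."""
    morphs: List[Morph] = []
    for token in column.split(" + "):
        token = token.strip()
        if token in (WORD_BOUNDARY, SENT_BOUNDARY):
            morphs.append((token, None))
        else:
            # rsplit keeps a '/' inside the lexical form intact.
            lex, tag = token.rsplit("/", 1)
            morphs.append((lex, tag))
    return morphs


def parse_entry(line: str) -> Optional[Entry]:
    """Parse one line; return None for comments and blank lines."""
    line = line.rstrip("\n")
    if not line.strip() or line.startswith("#"):
        return None
    columns = line.split("\t")  # assumption: tab-separated columns
    if len(columns) != 3:
        raise ValueError(f"expected 3 columns, got {len(columns)}: {line!r}")
    raw, wrong, right = columns
    return Entry(raw, parse_morphs(wrong), parse_morphs(right))


entry = parse_entry("하지만,\t| + 하지/MAJ + 만/EC + ,/SP\t| + 하지만/MAJ + ,/SP")
# The leading '|' means this entry applies only at the start of the input.
print(entry.wrong[0][0] == SENT_BOUNDARY, entry.right)
```

Keeping '_' and '|' as ordinary items in the morpheme list makes it easy to check whether an entry is anchored to the start or end of the input.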
Build the Dictionary
You can build all resources into the build directory with the make resource command, but you can also build only the analysis error patch with the following commands.
cd rsc
mkdir -p ../build/share/khaiii
PYTHONPATH=$(pwd)/lib ./bin/compile_errpatch.py --model-size=base --rsc-src ./src --rsc-dir=../build/share/khaiii
Choose "base" or "large" for the --model-size option. The --rsc-src option points to the rsc/src directory, which contains the base.errpatch.auto and base.errpatch.manual files. The --rsc-dir option points to the ../build/share/khaiii directory, where the binary dictionary is written. After a successful build, the following three files are created.
errpatch.tri
errpatch.len
errpatch.val
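After a build, it can be handy to confirm that all three files exist before loading the dictionary. The helper below is a small sketch that is not part of Khaiii; the name `missing_errpatch_files` is hypothetical, and the directory passed in is the --rsc-dir value used in the build command.

```python
from pathlib import Path
from typing import List

# The three binary files named above.
ERRPATCH_FILES = ("errpatch.tri", "errpatch.len", "errpatch.val")


def missing_errpatch_files(rsc_dir: str) -> List[str]:
    """Return the names of the expected patch files that are absent from rsc_dir."""
    root = Path(rsc_dir)
    return [name for name in ERRPATCH_FILES if not (root / name).is_file()]


missing = missing_errpatch_files("../build/share/khaiii")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All analysis error patch files are present.")
```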
Load the Dictionary
As with the pre-analysis dictionary, after you add a new entry to the analysis error patch, you should copy the entire dictionary again to /usr/local/share/khaiii or to the Python site-packages path. For the installation path of the dictionary, please refer to 'about install location'.
Alternatively, you can pass the dictionary path as an argument when creating the API object, as follows.
from khaiii import KhaiiiApi
api = KhaiiiApi(rsc_dir='/path/to/custom/khaiii/dictionary')
Alignment per syllable
As with the pre-analysis dictionary, the analysis error patch needs to align the syllables of the input with the analysis result and assign a POS tag to each syllable. For details about alignment per syllable, please refer to the CNN Model and Pre-Analysis Dictionary documents.
A known issue
The analysis error patch performs alignment twice per entry: once between the input and the erroneous analysis, and once between the input and the correct analysis. If the latter fails, you should add a new rule to the rsc/src/char_align.map file, as described in the Pre-Analysis Dictionary document.
To keep things fast, the input and the erroneous analysis also need to be aligned. However, a rule added to rsc/src/char_align.map for an erroneous analysis would also affect the alignment between the input and the correct analysis, so you must not do this. Currently, an entry cannot be added when the alignment between the input and the erroneous analysis fails. This is a known issue that Khaiii still needs to address, so please wait until it is updated. (The Khaiii team welcomes contributions to the project.)
Translator's notes are italicized.
*The original document can be found at https://github.com/kakao/khaiii. Please note that this translation has not been reviewed by the Kakao team; it is a personal project. Please feel free to give feedback on any errors introduced during translation.
Translator's Note
Introducing the Khaiii Github Translation Project: Link
[Khaiii Github] Key terms & Concepts: Link
Other Khaiii Translation
[Khaiii Github] Read Me.md: Link
[Khaiii Github] Pre Analysis Dictionary: Link
[Khaiii Github] CNN Model: Link
[Khaiii Github] Test for Specialized Spacing Error Model: Link
[Khaiii Github] CNN Model Training Process: Link
[Khaiii Github] Analysis Error Patch: Link