Dev Diary
Khaiii Github - Pre Analysis Dictionary (Translated in Eng)
flow123 2021. 6. 24. 18:42
Pre-analyzed Dictionary
The pre-analyzed dictionary is used for words whose analysis result is always the same, regardless of context.
Type of Dictionary Entry
There are two types of entries in the pre-analyzed dictionary.
- Exact Match: The entry applies only when it matches the whole word
- Partial Match: The entry applies when it matches the beginning of the word, even if the word continues after it
Below are examples of pre-analyzed dictionary entries.
Number | Input word | Analysis Result |
---|---|---|
1 | 이더리움* | 이더리움/NNP |
2 | 이러쿵저러쿵 | 이러쿵저러쿵/MAG |
3 | 고통스러워하* | 고통/NNG + 스럽/XSA + 어/EC + 하/VX |
4 | 고통스러웠* | 고통/NNG + 스럽/XSA + 었/EP |
5 | 고통스러웠다. | 고통/NNG + 스럽/XSA + 었/EP + 다/EF + ./SF |
Entries 1, 3, and 4 have “*” at the end of the input word, which marks them as partial matches. For example, entry 1 `이더리움*` also applies to words like “이더리움이 (이 is a subject marker)” and “이더리움을 (을 is an object marker)”.
Entries 2 and 5, on the other hand, are exact matches. They apply only when a word matches the entry exactly. For example, entry 5 “고통스러웠다.” does not apply to “고통스러웠다” because the word lacks the final punctuation “.”.
Entries 4 and 5 share a common part (고통스러웠), so a word such as “고통스러웠다.” matches both. In this case the longer entry is applied, and “고통스러웠다.” is fully analyzed by the pre-analyzed dictionary.
By contrast, “고통스러웠다” has no final punctuation, so only entry 4 applies. The first 5 syllables, “고통스러웠”, are analyzed by the pre-analyzed dictionary, and the last syllable “다” is evaluated by the machine learning classifier.
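The matching rules above can be illustrated with a small Python sketch. This is not Khaiii's implementation (the built dictionary is a binary file and the real lookup is not shown in this document). The `lookup` function and the `ENTRIES` dict are hypothetical names used only to show how exact entries, “*” entries, and the longest-match rule interact.

```python
from typing import Optional, Tuple

# Toy entries copied from the example table; a trailing "*" marks a partial match.
ENTRIES = {
    "이더리움*": "이더리움/NNP",
    "이러쿵저러쿵": "이러쿵저러쿵/MAG",
    "고통스러워하*": "고통/NNG + 스럽/XSA + 어/EC + 하/VX",
    "고통스러웠*": "고통/NNG + 스럽/XSA + 었/EP",
    "고통스러웠다.": "고통/NNG + 스럽/XSA + 었/EP + 다/EF + ./SF",
}


def lookup(word: str) -> Optional[Tuple[str, str, str]]:
    """Return (pattern, analysis, remainder) for the longest matching entry.

    The remainder is the part of the word the dictionary did not cover,
    which would be left to the classifier. Returns None if nothing matches.
    """
    best = None
    for pattern, analysis in ENTRIES.items():
        if pattern.endswith("*"):
            prefix = pattern[:-1]
            matched = word.startswith(prefix)  # partial match
        else:
            prefix = pattern
            matched = word == pattern  # exact match
        if matched and (best is None or len(prefix) > len(best[0])):
            best = (prefix, pattern, analysis)
    if best is None:
        return None
    prefix, pattern, analysis = best
    return pattern, analysis, word[len(prefix):]
```

With these toy entries, `lookup("고통스러웠다.")` picks entry 5 with an empty remainder, while `lookup("고통스러웠다")` picks entry 4 and leaves “다” as the remainder.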
Dictionary Files
There are two files under the `rsc/src` directory.
- preanal.auto: File with entries automatically extracted from the Sejong Corpus
- preanal.manual: File with entries manually added by users
You may also add a new file such as `preanal.my`. The builder program reads every file whose name starts with `preanal.` and builds the dictionary from them.
After adding a new entry, run the `make resource` command to rebuild the dictionary. The binary dictionary is built from the dictionary sources under `rsc/src` and written to `build/share/khaiii`.
Dictionary Format
Each line of the pre-analyzed dictionary has the format `<word (pattern)> <tab> <analysis result>`.
Below is an example.
이더리움* 이더리움/NNP
이러쿵저러쿵 이러쿵저러쿵/MAG
# Below entries are much the same~
고통스러워하* 고통/NNG + 스럽/XSA + 어/EC + 하/VX
고통스러웠* 고통/NNG + 스럽/XSA + 었/EP
고통스러웠다. 고통/NNG + 스럽/XSA + 었/EP + 다/EF + ./SF
Lines starting with “#” are ignored when the dictionary is built. A “#” works as a comment marker only at the very beginning of a line. If it appears in the middle of a line, the text after it is not treated as a comment.
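The format and comment rule can be sketched as a small parser. This is an illustrative assumption, not the code in `compile_preanal.py`: the function name `parse_preanal` is hypothetical, and skipping blank lines is a convenience of this sketch rather than documented behavior. The `C#*` entry in the usage note is also made up, to show that a mid-line “#” is kept.

```python
from typing import Dict, Iterable


def parse_preanal(lines: Iterable[str]) -> Dict[str, str]:
    """Parse '<pattern><TAB><analysis>' lines into a dict.

    Only lines whose first character is '#' are treated as comments.
    Blank lines are skipped (an assumption of this sketch).
    """
    entries: Dict[str, str] = {}
    for line_no, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue
        parts = line.split("\t", 1)
        if len(parts) != 2:
            raise ValueError(f"line {line_no}: expected <pattern><TAB><analysis>")
        pattern, analysis = parts
        entries[pattern] = analysis.strip()
    return entries
```

For example, `parse_preanal(["# note\n", "C#*\tC#/NNP\n"])` drops the first line and keeps the second, including its “#”.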
Build Dictionary
You can build all the resources with the `make resource` command in the `build` directory. You can also build only the pre-analyzed dictionary with the commands below.
cd rsc
mkdir -p ../build/share/khaiii
PYTHONPATH=$(pwd)/lib ./bin/compile_preanal.py --rsc-src=./src --rsc-dir=../build/share/khaiii
The `--rsc-src` flag specifies the location of the `rsc/src` directory, which contains the `preanal.auto` and `preanal.manual` files. The `--rsc-dir` flag specifies the output directory, here `../build/share/khaiii`, where the binary dictionary is written. Once the build succeeds, the two files below are created.
preanal.tri
preanal.val
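Before pointing Khaiii at a custom resource directory, it can help to confirm that both output files exist. The helper below is a minimal sketch using only the Python standard library. `missing_preanal_files` is a hypothetical name and not part of Khaiii, and it checks only these two files, not the other resources a complete dictionary directory needs.

```python
from pathlib import Path
from typing import List

# The two files the pre-analyzed dictionary build is expected to produce.
EXPECTED_FILES = ("preanal.tri", "preanal.val")


def missing_preanal_files(rsc_dir: str) -> List[str]:
    """Return the expected pre-analyzed dictionary files absent from rsc_dir."""
    root = Path(rsc_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty list means both files are present in the given directory.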
Load Dictionary
After you add a new entry and build the pre-analyzed dictionary, copy the built dictionary into the existing installation path, such as `/usr/local/share/khaiii` or the corresponding path under Python site-packages. For more details about where the dictionary is installed, refer to Installation path. Alternatively, you can pass the dictionary path as an API parameter.
from khaiii import KhaiiiApi
api = KhaiiiApi(rsc_dir='/path/to/custom/khaiii/dictionary')
Per-Syllable Alignment
When the dictionary is built, the builder aligns each input word with its analysis result syllable by syllable and creates an output tag for each syllable, as described in the CNN Model document. If this alignment fails, you will see an error like the one below.
INFO:root:preanal.manual
INFO:root:preanal.auto
ERROR:root:{M:N} [이더] [리움] []
{M:N} [이/I-NNP 더/I-NNP] [륨/I-NNP] []
ERROR:root:preanal.manual:2: fail to align: "이더리움 이더륨/NNP"
ERROR:root:1 errors
To demonstrate, we deliberately put “이더륨 (Etherum)” in the analysis result although the input word is “이더리움 (Ethereum)”. Because “리움 (Reum)” in the input does not match “륨 (Rum)” in the analysis result, the build fails. Most such failures are caused by errors in the analysis result, and the build succeeds once the data is corrected.
The example below, on the other hand, fails to align even though both the input and the analysis result are correct.
INFO:root:preanal.manual
INFO:root:preanal.auto
ERROR:root:{M:N} [] [해야] [겠습니다.]
{M:N} [] [하/I-VV 여/I-EC 야/I-EC] [하/I-VX 겠/I-EP 습/I-EF 니/I-EF 다/I-EF ./I-SF]
ERROR:root:preanal.auto:79823: fail to align: "해야겠습니다. 하/VV + 여야/EC + 하/VX + 겠/EP + 습니다/EF + ./SF"
ERROR:root:1 errors
This error occurs because “해야” cannot be aligned automatically with the analysis result "하/I-VV 여/I-EC 야/I-EC". To resolve it, add a manual alignment rule like the one below to the `rsc/src/char_align.map` file. The dictionary then builds successfully.
해야 하/I-VV 여/I-EC 야/I-EC 21
The format of the rule above is `<syllables> <Tab> <syllables and tags of analysis result> <Tab> <alignment info>`. The first two columns are the input and the analysis result that failed to align. The ‘2’ in “21” means that two items in the second column correspond to the first syllable in the first column, and the ‘1’ means that one item in the second column corresponds to the second syllable in the first column.
syllables | syllables and tags of analysis result | alignment info |
---|---|---|
해 | 하/I-VV, 여/I-EC | 2 |
야 | 야/I-EC | 1 |
As shown above, the syllable “해” is aligned to two items of the analysis result and “야” to one. The alignment info is therefore “21”. It is always written in Arabic numerals, with one digit per input syllable.
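How the digits group the analysis result can be sketched in a few lines of Python. This is only an illustration of the rule format described above, not the builder's code. `apply_alignment` is a hypothetical name, and the sketch assumes each Hangul syllable is a single precomposed character.

```python
from typing import List, Tuple


def apply_alignment(syllables: str, outputs: List[str], info: str) -> List[Tuple[str, List[str]]]:
    """Group output items per input syllable according to the digit string."""
    if len(info) != len(syllables):
        raise ValueError("alignment info needs one digit per input syllable")
    counts = [int(digit) for digit in info]
    if sum(counts) != len(outputs):
        raise ValueError("digits must add up to the number of output items")
    aligned = []
    start = 0
    for syllable, count in zip(syllables, counts):
        # Take the next `count` output items for this syllable.
        aligned.append((syllable, outputs[start:start + count]))
        start += count
    return aligned
```

Calling `apply_alignment("해야", ["하/I-VV", "여/I-EC", "야/I-EC"], "21")` pairs “해” with the first two items and “야” with the last one, as in the table.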
Once the alignment succeeds, “해” gets a complex tag such as “I-VV: I-EC:1”, which requires syllable restoration. If the syllable restoration dictionary has no rule for it, one is automatically added to the `rsc/src/restore.dic` file. Likewise, if the complex tag is new, its tag information is automatically added to the `rsc/src/vocab.out.more` file.
Guidelines for Adding Dictionary Entries
To tag a syllable, the CNN model reads a context window of 3 positions to the left and right of that syllable (the window can contain syllables, spaces, zero vectors, etc.) and predicts the appropriate tag. As a result, the analysis of a word longer than 4 syllables is hard to predict, because no single window covers the whole word. The pre-analyzed dictionary can remedy this issue, so we recommend using it for cases such as long proper nouns to avoid analysis errors.
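The effect of the window size can be seen in a simplified sketch. It only collects the characters around one position. The real model's input handling (spaces, zero vectors, embeddings) is more involved and is not reproduced here. `context_window` and the `PAD` symbol are hypothetical names used for illustration.

```python
from typing import List

PAD = "<pad>"  # hypothetical placeholder for positions outside the word


def context_window(syllables: List[str], index: int, size: int = 3) -> List[str]:
    """Return the syllable at `index` with `size` neighbors on each side."""
    window = []
    for pos in range(index - size, index + size + 1):
        if 0 <= pos < len(syllables):
            window.append(syllables[pos])
        else:
            window.append(PAD)
    return window
```

For the first syllable of “이더리움을”, the window holds three pads followed by “이”, “더”, “리”, “움”, so the final “을” is not visible from that position.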
The pre-analyzed dictionary looks only at the word itself. If the analysis of a word depends on the words to its left or right, putting that word in the pre-analyzed dictionary can cause analysis errors.
When a word or its analysis result is changed, the two may no longer align. Adding a rule to `rsc/src/char_align.map` to fix this can affect the alignment of other entries, so add a new rule only when you fully understand syllable alignment and are confident in the change.
*Translator's notes are italicized.
*The original document can be found at https://github.com/kakao/khaiii. Please note that this translation is a personal project and has not been reviewed by the Kakao team. Please feel free to give feedback on any errors introduced during translation.
Translator's Note
Introduce Khaiii Github Translation Project: Link
[Khaiii Github] Key terms & Concepts: Link
Other Khaiii Translation
[Khaiii Github] Read Me.md: Link
[Khaiii Github] Pre Analysis Dictionary: Link
[Khaiii Github] CNN Model: Link
[Khaiii Github] Test for Specialized Spacing Error Model: Link
[Khaiii Github] CNN Model Training Process: Link
[Khaiii Github] Analysis Error Patch: Link