Khaiii Github - Pre Analysis Dictionary (Translated in Eng)

flow123 2021. 6. 24. 18:42

Pre-analyzed Dictionary

The pre-analyzed dictionary is used for words whose analysis is always the same, regardless of context.

Types of Dictionary Entries

There are two types of entries in the pre-analyzed dictionary.

  • Exact Match: the entry applies only when the whole word is identical to the entry
  • Partial Match: the entry applies when the word begins with the entry, even if more syllables follow

Below are examples of pre-analyzed dictionary entries.

Number   Input word       Analysis result
1        이더리움*        이더리움/NNP
2        이러쿵저러쿵     이러쿵저러쿵/MAG
3        고통스러워하*    고통/NNG + 스럽/XSA + 어/EC + 하/VX
4        고통스러웠*      고통/NNG + 스럽/XSA + 었/EP
5        고통스러웠다.    고통/NNG + 스럽/XSA + 었/EP + 다/EF + ./SF

Entries 1, 3, and 4 end with “*”, which marks them as partial matches. For example, entry 1, 이더리움*, applies to words such as “이더리움이” (이 is a subject marker) and “이더리움을” (을 is an object marker).

Entries 2 and 5, by contrast, are exact matches. They apply only when a word is identical to the entry. For example, entry 5, 고통스러웠다., does not apply to "고통스러웠다" because that word lacks the final period “.”.

Entries 4 and 5 share the beginning 고통스러웠, so the word “고통스러웠다.” matches both. In that case the longer entry is applied, and “고통스러웠다.” is analyzed entirely by the pre-analyzed dictionary.

The word "고통스러웠다", on the other hand, has no final period, so only entry 4 applies. The first 5 syllables, “고통스러웠”, are analyzed by the pre-analyzed dictionary, and the last syllable “다” is tagged by the machine learning classifier.
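The matching rules above can be sketched in a few lines of Python. This is only an illustration of exact match, partial match, and the longest-entry rule; the function name `lookup` and the in-memory `ENTRIES` table are hypothetical and are not how Khaiii stores or searches its binary dictionary. It also assumes that a partial entry matches a word identical to the entry itself.

```python
# Hypothetical illustration of the matching rules, not Khaiii's implementation.
ENTRIES = {
    "이더리움*": "이더리움/NNP",
    "이러쿵저러쿵": "이러쿵저러쿵/MAG",
    "고통스러워하*": "고통/NNG + 스럽/XSA + 어/EC + 하/VX",
    "고통스러웠*": "고통/NNG + 스럽/XSA + 었/EP",
    "고통스러웠다.": "고통/NNG + 스럽/XSA + 었/EP + 다/EF + ./SF",
}


def lookup(word, entries=ENTRIES):
    """Return (matched_part, analysis, remainder), or None if no entry applies."""
    best = None
    for pattern, analysis in entries.items():
        if pattern.endswith("*"):
            # Partial match: the word only has to start with the entry.
            key = pattern[:-1]
            matched = word.startswith(key)
        else:
            # Exact match: the whole word must be identical to the entry.
            key = pattern
            matched = word == key
        # When several entries match, keep the longest one.
        if matched and (best is None or len(key) > len(best[0])):
            best = (key, analysis)
    if best is None:
        return None
    key, analysis = best
    # The remainder is what would be left for the classifier to tag.
    return key, analysis, word[len(key):]
```

With these entries, `lookup("고통스러웠다.")` selects entry 5 and leaves no remainder, while `lookup("고통스러웠다")` selects entry 4 and leaves “다”.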

Dictionary Files

There are two files under the rsc/src directory.

  • preanal.auto: File with entries automatically extracted from the Sejong Corpus
  • preanal.manual: File with entries added manually by users

You may also add a new file such as preanal.my. The builder uses every file whose name starts with "preanal." to build the dictionary.

Whenever you add a new entry, run the make resource command to rebuild the dictionary. The binary dictionary is built from the dictionary sources under rsc/src and written to build/share/khaiii.
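As a rough sketch of the file selection rule described above, the snippet below lists every source file whose name starts with "preanal." in a given directory. The helper name `find_preanal_sources` is hypothetical; the real builder may collect its inputs differently.

```python
from pathlib import Path


def find_preanal_sources(rsc_src):
    """List files named preanal.* in rsc_src, sorted by path (illustration only)."""
    return sorted(p for p in Path(rsc_src).glob("preanal.*") if p.is_file())
```

Called on a directory holding preanal.auto, preanal.manual, and a user-added preanal.my, it returns all three and skips unrelated files.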

Dictionary Format

The format of a pre-analyzed dictionary entry is <word (pattern)> <tab> <analysis result>. Below is an example.

이더리움*	이더리움/NNP
이러쿵저러쿵	이러쿵저러쿵/MAG
# The entries below are much the same
고통스러워하*	고통/NNG + 스럽/XSA + 어/EC + 하/VX
고통스러웠*	고통/NNG + 스럽/XSA + 었/EP
고통스러웠다.	고통/NNG + 스럽/XSA + 었/EP + 다/EF + ./SF

A line that starts with “#” is treated as a comment and ignored. This works only when “#” is the very first character of the line. If “#” appears in the middle of a line, the text after it is not treated as a comment.
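To make the format concrete, here is a minimal parser sketch. It assumes that morphemes in the analysis result are separated by " + " and that the tag follows the last "/" of each morpheme, as in the examples above. Skipping blank lines is an additional assumption, and `parse_preanal` is a hypothetical name, not part of Khaiii.

```python
def parse_preanal(lines):
    """Parse dictionary source lines into (word, is_partial, morphemes) tuples."""
    entries = []
    for line_no, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        # "#" starts a comment only at the very beginning of a line.
        # Skipping blank lines is an assumption of this sketch.
        if not line.strip() or line.startswith("#"):
            continue
        pattern, sep, result = line.partition("\t")
        if not sep:
            raise ValueError(f"line {line_no}: expected <word><tab><analysis result>")
        # A trailing "*" marks a partial-match entry.
        is_partial = pattern.endswith("*")
        word = pattern[:-1] if is_partial else pattern
        morphemes = []
        for item in result.split(" + "):
            # Split on the last "/" so a lexeme such as "." keeps its text.
            lex, _, tag = item.strip().rpartition("/")
            morphemes.append((lex, tag))
        entries.append((word, is_partial, morphemes))
    return entries
```

For the line `고통스러웠*<tab>고통/NNG + 스럽/XSA + 었/EP` this yields the word 고통스러웠, the partial flag set, and three (lexeme, tag) pairs.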

Build Dictionary

You can build all the resources with the make resource command in the build directory. You can also build the pre-analyzed dictionary separately with the command below.

cd rsc
mkdir -p ../build/share/khaiii
PYTHONPATH=$(pwd)/lib ./bin/compile_preanal.py --rsc-src=./src --rsc-dir=../build/share/khaiii

The --rsc-src flag specifies the source directory (rsc/src), which contains the preanal.auto and preanal.manual files. The --rsc-dir flag specifies the output directory (../build/share/khaiii), where the binary dictionary is written. Once the build succeeds, the two files below are created.

preanal.tri
preanal.val

Load Dictionary

After you add a new entry and rebuild the pre-analyzed dictionary, copy the built dictionary over the one in the existing installation path, such as /usr/local/share/khaiii or the Python site-packages path. For more details about where the dictionary is installed, refer to Installation path. Alternatively, you can pass the dictionary path as an API parameter when loading.

from khaiii import KhaiiiApi
api = KhaiiiApi(rsc_dir='/path/to/custom/khaiii/dictionary')
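Before pointing rsc_dir at a custom directory, it can help to confirm that the build actually produced the pre-analyzed dictionary there. The sketch below checks only for the two files named in the build step (preanal.tri and preanal.val); a complete resource directory holds other files as well, which this hypothetical helper does not verify.

```python
from pathlib import Path

# The two files the pre-analyzed dictionary build is described as producing.
PREANAL_FILES = ("preanal.tri", "preanal.val")


def missing_preanal_files(rsc_dir):
    """Return the names of expected pre-analyzed dictionary files not found in rsc_dir."""
    base = Path(rsc_dir)
    return [name for name in PREANAL_FILES if not (base / name).is_file()]
```

An empty list means both files are present; anything else names what still needs to be built or copied.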

Alignment Per Syllable

When the dictionary is built, the builder aligns each input word with its analysis result, as described in the CNN model document, and produces an output tag for each syllable of the word. If this alignment fails, you will see an error like the one below.

INFO:root:preanal.manual
INFO:root:preanal.auto
ERROR:root:{M:N} [이더] [리움] []
{M:N} [이/I-NNP 더/I-NNP] [륨/I-NNP] []
ERROR:root:preanal.manual:2: fail to align: "이더리움 이더륨/NNP"
ERROR:root:1 errors

To demonstrate, we deliberately wrote “이더륨” (Etherum) in the analysis result although the input word is “이더리움” (Ethereum). Because “리움” in the input does not match “륨” in the analysis result, the build fails. Most such errors come from mistakes in the analysis result, and the build succeeds once the data is corrected.

The example below, on the other hand, fails to align even though both the input and the analysis result are correct.

INFO:root:preanal.manual
INFO:root:preanal.auto
ERROR:root:{M:N} [] [해야] [겠습니다.]
{M:N} [] [하/I-VV 여/I-EC 야/I-EC] [하/I-VX 겠/I-EP 습/I-EF 니/I-EF 다/I-EF ./I-SF]
ERROR:root:preanal.auto:79823: fail to align: "해야겠습니다. 하/VV + 여야/EC + 하/VX + 겠/EP + 습니다/EF + ./SF"
ERROR:root:1 errors

This error occurs because the builder cannot align “해야” with the analysis result "하/I-VV 여/I-EC 야/I-EC" on its own. To fix it, specify the alignment manually by adding a rule like the one below to the rsc/src/char_align.map file. The dictionary then builds successfully.

해야 하/I-VV 여/I-EC 야/I-EC 21

The format of the rule above is <syllables> <tab> <syllables and tags of analysis result> <tab> <alignment info>. The first two columns are the input and the analysis result that could not be aligned. In “21”, the ‘2’ means that the first 2 items in the second column correspond to the first syllable in the first column, and the ‘1’ means that the next 1 item corresponds to the second syllable.

Input syllable   Syllables and tags of analysis result   Alignment info
해               하/I-VV, 여/I-EC                        2
야               야/I-EC                                 1

As shown above, “해” aligns with two items of the analysis result and “야” aligns with one, so the alignment info is “21”. It is always written in Arabic numerals, and the number of digits equals the number of input syllables.
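The way the digits distribute the analysis result over the input syllables can be sketched as follows. The function names are hypothetical, and the check that the digits add up to the number of result items is an assumption of this sketch rather than documented behavior of the builder.

```python
def apply_alignment(syllables, result_items, align_info):
    """Group result items under each input syllable according to the digits."""
    if len(align_info) != len(syllables):
        raise ValueError("alignment info needs one digit per input syllable")
    counts = [int(digit) for digit in align_info]
    # Assumption: every result item must be assigned to exactly one syllable.
    if sum(counts) != len(result_items):
        raise ValueError("digits must add up to the number of result items")
    groups, start = [], 0
    for syllable, count in zip(syllables, counts):
        groups.append((syllable, result_items[start:start + count]))
        start += count
    return groups


def parse_rule(line):
    """Parse one tab-separated rule line in the char_align.map format shown above."""
    syllables, results, align_info = line.rstrip("\n").split("\t")
    return apply_alignment(syllables, results.split(" "), align_info)
```

For the rule above, `parse_rule` pairs “해” with 하/I-VV and 여/I-EC, and “야” with 야/I-EC.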

Once the alignment succeeds, “해” receives a complex tag such as “I-VV: I-EC:1”, which requires syllable restoration. If the syllable restoration dictionary has no matching rule, one is added automatically to the rsc/src/restore.dic file. If the complex tag is new, its tag information is also added automatically to the rsc/src/vocab.out.more file.

Guidelines for Adding Dictionary Entries

The CNN model tags each syllable by reading a context window of 3 positions to its left and right (these can hold syllables, spaces, zero vectors, and so on). As a result, the analysis of a word longer than 4 syllables is difficult to predict. The pre-analyzed dictionary can remedy this issue, so we recommend using it for cases such as long proper nouns to avoid analysis errors.
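The effect of the window can be seen with a small sketch that extracts the context around one syllable. The padding token `PAD` and the function `context_window` are hypothetical; the actual model works on embeddings and handles word boundaries in its own way.

```python
PAD = "<pad>"  # hypothetical stand-in for positions outside the input


def context_window(syllables, index, size=3):
    """Return the syllable at index with `size` positions of context on each side."""
    padded = [PAD] * size + list(syllables) + [PAD] * size
    # In the padded list the target sits at index + size, so the window
    # starts at index and spans 2 * size + 1 positions.
    return padded[index:index + 2 * size + 1]
```

For the first syllable of “고통스러웠다”, the window covers only “고통스러”, so the fifth and sixth syllables are invisible to the model when it tags “고”.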

The pre-analyzed dictionary looks only at the word itself. If the correct analysis of a word changes depending on the words to its left or right, adding that word to the pre-analyzed dictionary can cause analysis errors.

When a word and its analysis result cannot be aligned, you can add a rule to rsc/src/char_align.map, but a new rule may affect how other entries are aligned. Add one only when you fully understand syllable alignment and are confident in the change.

 

 

*Translator's notes are italicized.

*The original document can be found at https://github.com/kakao/khaiii. Please note that this document has not been reviewed by the Kakao team and is just my personal project. Please feel free to provide feedback on any errors that may have been introduced during translation.

 
