
Khaiii Github - CNN Model Training Process

flow123 2021. 7. 14. 17:32

CNN Model Training Process

Training Corpus

Khaiii was trained on the Sejong Corpus. The corpus format is shown below.

<text>
<group>
<text>
<body>
<source>
<date>
BTAA0001-00000001       1993/06/08      1993/SN + //SP + 06/SN + //SP + 08/SN
</date>
<page>
BTAA0001-00000002       19      19/SN
</page>
</source>
<head>
BTAA0001-00000003       엠마누엘        엠마누엘/NNP
BTAA0001-00000004       웅가로  웅가로/NNP
BTAA0001-00000005       /       //SP
BTAA0001-00000006       의상서  의상/NNG + 서/JKB
BTAA0001-00000007       실내    실내/NNG
BTAA0001-00000008       장식품으로…     장식품/NNG + 으로/JKB + …/SE
BTAA0001-00000009       디자인  디자인/NNG
BTAA0001-00000010       세계    세계/NNG
BTAA0001-00000011       넓혀    넓히/VV + 어/EC
</head>
<p>
BTAA0001-00000012       프랑스의        프랑스/NNP + 의/JKG
BTAA0001-00000013       세계적인        세계/NNG + 적/XSN + 이/VCP + ㄴ/ETM
BTAA0001-00000014       의상    의상/NNG
BTAA0001-00000015       디자이너        디자이너/NNG
BTAA0001-00000016       엠마누엘        엠마누엘/NNP
BTAA0001-00000017       웅가로가        웅가로/NNP + 가/JKS
BTAA0001-00000018       실내    실내/NNG
BTAA0001-00000019       장식용  장식/NNG + 용/XSN
BTAA0001-00000020       직물    직물/NNG
BTAA0001-00000021       디자이너로      디자이너/NNG + 로/JKB
BTAA0001-00000022       나섰다. 나서/VV + 었/EP + 다/EF + ./SF
</p>

Headers and other metadata are ignored. Only the text enclosed in <p> and </p>, <head> and </head>, or <l> and </l> is recognized and used for training.
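The selection rule above can be sketched in Python. This is a minimal illustration and not Khaiii's own parser: the name iter_sentences is hypothetical, and the sketch assumes that each tag sits on its own line and that the three columns (ID, original word, analysis) are separated by whitespace, as in the sample.

```python
# Tags whose contents are used for training, per the description above.
SENTENCE_TAGS = ("p", "head", "l")


def iter_sentences(lines):
    """Yield each block as a list of (word, analysis) pairs."""
    open_tags = {"<%s>" % tag for tag in SENTENCE_TAGS}
    close_tags = {"</%s>" % tag for tag in SENTENCE_TAGS}
    sentence = None  # None means we are outside a <p>, <head>, or <l> block
    for raw in lines:
        line = raw.strip()
        if line in open_tags:
            sentence = []
        elif line in close_tags:
            if sentence:
                yield sentence
            sentence = None
        elif sentence is not None and line and not line.startswith("<"):
            # Assumed columns: ID, original word, morpheme analysis.
            parts = line.split(None, 2)
            if len(parts) == 3:
                sentence.append((parts[1], parts[2]))


sample = """<date>
BTAA0001-00000002       19      19/SN
</date>
<head>
BTAA0001-00000010       세계    세계/NNG
BTAA0001-00000011       넓혀    넓히/VV + 어/EC
</head>""".splitlines()

sentences = list(iter_sentences(sample))
print(sentences)
```

The <date> block is skipped because it is metadata, so only the two words inside <head> are returned.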

Execution Environment

The scripts required for training are in the train directory. Run them from inside that directory, for example ./map_char_to_tag.py. The Python modules that the scripts depend on are under src/main/python, so export the PYTHONPATH environment variable as shown below.

export PYTHONPATH=/path/to/khaiii/src/main/python

You will need the following packages to train the model.

  • tensorboardX
  • tqdm

You can install them with pip as shown below.

pip install tensorboardX tqdm

Training works with PyTorch 0.4.1 but not with version 1.0 (we will fix this in the future). Please use an environment management tool such as virtualenv, install PyTorch 0.4.1, and then train the model. For details about installing PyTorch, please refer to the PyTorch website.
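Before starting a long training run, it can help to confirm that the environment matches the requirements above. The following is a small sketch, not part of Khaiii: module_version and check_environment are hypothetical helper names, and the check relies on each package exposing a __version__ attribute, which is an assumption.

```python
import importlib
import os


def module_version(name):
    """Return a module's __version__, or None if it cannot be imported."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    return getattr(module, "__version__", "unknown")


def check_environment(expected_torch="0.4.1"):
    """Return a list of human-readable problems; empty means all checks passed."""
    problems = []
    if not os.environ.get("PYTHONPATH"):
        problems.append("PYTHONPATH is not set")
    for name in ("tensorboardX", "tqdm"):
        if module_version(name) is None:
            problems.append("%s is not installed" % name)
    torch_version = module_version("torch")
    if torch_version is None:
        problems.append("torch is not installed")
    elif not torch_version.startswith(expected_torch):
        problems.append("torch %s found, expected %s" % (torch_version, expected_torch))
    return problems
```

Calling check_environment() inside the virtualenv and printing the returned list shows what still needs to be installed or exported.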

Syllable-Based Alignment

Use the following command to align each original word with its morpheme analysis result, syllable by syllable.

./map_char_to_tag.py -c corpus --output corpus.txt --restore-dic restore.dic

The -c corpus option specifies the directory that contains the corpus files. The Sejong Corpus is distributed in UTF-16 encoding, so you must convert it to UTF-8 before running the script above.
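One way to do the UTF-16 to UTF-8 conversion is a short Python script. This is a sketch under stated assumptions: convert_to_utf8 is a hypothetical helper, the directory names are placeholders, and it assumes every regular file in the source directory is a UTF-16 corpus file with a byte order mark.

```python
from pathlib import Path


def convert_to_utf8(src_dir, dst_dir):
    """Re-encode every file in src_dir from UTF-16 to UTF-8 under dst_dir."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    converted = []
    for src_file in sorted(Path(src_dir).iterdir()):
        if not src_file.is_file():
            continue
        # The "utf-16" codec reads the byte order mark to pick the byte order.
        text = src_file.read_text(encoding="utf-16")
        out_file = dst / src_file.name
        out_file.write_text(text, encoding="utf-8")
        converted.append(out_file)
    return converted
```

For example, convert_to_utf8("sejong_utf16", "corpus") would write UTF-8 copies into the corpus directory that is then passed to -c.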

--output corpus.txt is the file that receives the syllable alignment result. If the command succeeds, a file in the format below is created.

엠마누엘    I-NNP I-NNP I-NNP I-NNP
웅가로    I-NNP I-NNP I-NNP
/    I-SP
의상서    I-NNG I-NNG I-JKB
실내    I-NNG I-NNG
장식품으로…    I-NNG I-NNG I-NNG I-JKB I-JKB I-SE
디자인    I-NNG I-NNG I-NNG
세계    I-NNG I-NNG
넓혀    I-VV I-VV:I-EC:0

프랑스의    I-NNP I-NNP I-NNP I-JKG
세계적인    I-NNG I-NNG I-XSN I-VCP:I-ETM:0
의상    I-NNG I-NNG
디자이너    I-NNG I-NNG I-NNG I-NNG
엠마누엘    I-NNP I-NNP I-NNP I-NNP
웅가로가    I-NNP I-NNP I-NNP I-JKS
실내    I-NNG I-NNG
장식용    I-NNG I-NNG I-XSN
직물    I-NNG I-NNG
디자이너로    I-NNG I-NNG I-NNG I-NNG I-JKB
나섰다.    I-VV I-VV:I-EP:0 I-EF I-SF
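In the output above, every word is followed by exactly one tag per syllable, and a composite tag such as I-VV:I-EC:0 marks a syllable that covers more than one morpheme. The sketch below reads one such line. The name parse_aligned_line is hypothetical, and the sketch assumes the text uses precomposed Hangul, so that one character equals one syllable.

```python
def parse_aligned_line(line):
    """Split an aligned line into the word and its per-syllable tags."""
    word, tag_str = line.split(None, 1)
    tags = tag_str.split()
    if len(word) != len(tags):
        raise ValueError("expected one tag per syllable: %r" % line)
    return word, tags


word, tags = parse_aligned_line("넓혀    I-VV I-VV:I-EC:0")
for syllable, tag in zip(word, tags):
    print(syllable, tag)
```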

--restore-dic restore.dic is the syllable restoration dictionary file. If the command succeeds, a file in the format below is created.

혀/I-VV:I-EC:0    히/I-VV 어/I-EC
혀/I-VV:I-EC:1    히/I-VV 여/I-EC
혀/I-VV:I-EC:2    허/I-VV 어/I-EC
혀/I-VV:I-EC:3    하/I-VV 여/I-EC
혀/I-VV:I-EC:4    혀/I-VV 어/I-EC
혀/I-VV:I-EC:5    치/I-VV 어/I-EC
혀/I-VV:I-EC:6    히/I-VV 아/I-EC
인/I-VCP:I-ETM:0    이/I-VCP ㄴ/I-ETM
인/I-VCP:I-ETM:1    이/I-VCP 은/I-ETM
섰/I-VV:I-EP:0    서/I-VV 었/I-EP
섰/I-VV:I-EP:1    시/I-VV 었/I-EP
섰/I-VV:I-EP:2    스/I-VV 었/I-EP

The restore.dic file is required for training, so copy it to rsc/src.
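To see how the restoration dictionary connects the aligned tags back to the original analysis, here is a rough sketch. It is not Khaiii's implementation: load_restore_dic and restore_word are hypothetical names, and the rule for grouping syllables into morphemes (an "I-" tag with the same POS continues the previous morpheme, anything else starts a new one) is an assumption.

```python
def load_restore_dic(lines):
    """Map (syllable, composite tag) to a list of (syllable, tag) pairs."""
    restore = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        key, value = line.split(None, 1)
        # rsplit keeps a literal "/" syllable intact, since tags contain no "/".
        syllable, tag = key.rsplit("/", 1)
        restore[(syllable, tag)] = [tuple(item.rsplit("/", 1)) for item in value.split()]
    return restore


def restore_word(word, tags, restore):
    """Expand composite tags, then group syllables into morphemes."""
    expanded = []
    for syllable, tag in zip(word, tags):
        expanded.extend(restore.get((syllable, tag), [(syllable, tag)]))
    morphemes = []
    for syllable, tag in expanded:
        prefix, pos = tag.split("-", 1)
        # Assumption: "I-" with an unchanged POS continues the current morpheme.
        if morphemes and prefix == "I" and morphemes[-1][1] == pos:
            morphemes[-1][0] += syllable
        else:
            morphemes.append([syllable, pos])
    return " + ".join("%s/%s" % (text, pos) for text, pos in morphemes)


restore = load_restore_dic([
    "혀/I-VV:I-EC:0    히/I-VV 어/I-EC",
    "섰/I-VV:I-EP:0    서/I-VV 었/I-EP",
])
print(restore_word("넓혀", ["I-VV", "I-VV:I-EC:0"], restore))
```

With these two entries, the aligned forms of 넓혀 and 나섰다. map back to the analyses shown in the corpus sample.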

Splitting the training corpus into three sets

Use the following command to split the training corpus (created by map_char_to_tag.py) into three sets (dev/test/train).

./split_corpus.py --input corpus.txt -o corpus

The --input corpus.txt option specifies the corpus to be split.

-o corpus is the prefix of the output files. For example, if you specify corpus, the output is split into corpus.dev, corpus.test, and corpus.train.
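The idea of the split can be illustrated as follows. This sketch does not reproduce split_corpus.py: the function names are hypothetical, the shuffling and the set sizes are assumptions, and sentences are assumed to be separated by blank lines as in the corpus.txt sample.

```python
import random


def read_sentences(lines):
    """Group lines into sentences separated by blank lines."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip():
            sentence.append(line)
        elif sentence:
            yield sentence
            sentence = []
    if sentence:
        yield sentence


def split_sentences(sentences, dev_size, test_size, seed=0):
    """Shuffle, then return (dev, test, train) lists of sentences."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    dev = shuffled[:dev_size]
    test = shuffled[dev_size:dev_size + test_size]
    train = shuffled[dev_size + test_size:]
    return dev, test, train
```

Each returned list would then be written to its own file, for example corpus.dev, corpus.test, and corpus.train, with a blank line after every sentence.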

Create Vocab

The following command creates the vocab files from the training set (corpus.train).

./make_vocab.py --input corpus.train

If you don't specify a directory option, ../rsc/src is used by default, and the vocab.in and vocab.out files are created in that directory. The vocab.in file lists each syllable with its frequency, as shown below.

齒  25
齡  8
龍  300
龕  8
龜  16
가  499305
각  58237
간  77133
갇  478
갈  15383
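Counting syllable frequencies like those listed above takes only a few lines. This is an illustrative sketch rather than the logic of make_vocab.py: count_syllables and format_vocab are hypothetical names, and the tab separator and the sort order by syllable are assumptions based on the sample.

```python
from collections import Counter


def count_syllables(aligned_lines):
    """Count how often each syllable occurs in the word column."""
    counts = Counter()
    for line in aligned_lines:
        if not line.strip():
            continue  # blank lines separate sentences
        word = line.split(None, 1)[0]
        counts.update(word)  # a string updates the counter per character
    return counts


def format_vocab(counts):
    """Render 'syllable<TAB>frequency' lines sorted by syllable."""
    return ["%s\t%d" % (syllable, freq) for syllable, freq in sorted(counts.items())]


counts = count_syllables(["실내    I-NNG I-NNG", "", "실내    I-NNG I-NNG", "내    I-NNG"])
print("\n".join(format_vocab(counts)))
```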

Unlike vocab.in, the vocab.out file lists only the POS tags, without frequencies.

I-XSN
I-XSV
I-ZN
I-ZV
I-ZZ
B-EP:I-EC:0
B-EP:I-EF:0
B-EP:I-ETM:0
B-JKB:I-JKG:0
B-JKB:I-JX:0

Training the Model

Once the corpus and vocab files are ready, you can start training the model with the following command.

./train.py -i corpus

-i corpus is the common prefix of the three files corpus.train, corpus.dev, and corpus.test.

To train the large model, add the --embed-dim 150 option.

When training runs successfully, output like the following is printed.

INFO:root:vocab.in: 5109 entries, 512 cutoff
INFO:root:vocab.out: 500 entries, 0 cutoff
INFO:root:restore.dic: 4303 entries
munjong.dev: 100%|█████████████████████████████████████████| 64444/64444 [00:01<00:00, 43958.83it/s]
INFO:root:munjong.dev: 5000 sentences
munjong.test: 100%|████████████████████████████████████████| 64589/64589 [00:01<00:00, 39999.19it/s]
INFO:root:munjong.test: 5000 sentences
munjong.train: 100%|█████████████████████████████████| 10763939/10763939 [04:51<00:00, 36971.95it/s]
INFO:root:munjong.train: 844614 sentences
INFO:root:config: {'batch_size': 500,
 'best_epoch': 0,
 'context_len': 7,
 'cutoff': 2,
 'debug': False,
 'embed_dim': 30,
 'epoch': 0,
 'gpu_num': 5,
 'hidden_dim': 310,
 'in_pfx': 'corpus',
 'learning_rate': 0.001,
 'logdir': './logdir5',
 'lr_decay': 0.9,
 'model_id': 'corpus.cut2.win3.sdo0.0.emb30.lr0.001.lrd0.9.bs500',
 'out_dir': './logdir/corpus.cut2.win3.sdo0.0.emb30.lr0.001.lrd0.9.bs500',
 'patience': 10,
 'rsc_src': '../rsc/src',
 'spc_dropout': 0.0,
 'window': 3}
INFO:root:{{{{ training begin: 02/01 13:49:10 {{{{
EPOCH[0]: 100%|████████████████████████████████████████████| 844614/844614 [51:17<00:00, 274.42it/s]
INFO:root:[Los trn]  [Los dev]  [Acc chr]  [Acc wrd]  [F-score]           [LR]
INFO:root:   0.2512     0.1866     0.9448     0.8876     0.9269 BEST      0.00100000
EPOCH[1]: 100%|████████████████████████████████████████████| 844614/844614 [49:59<00:00, 281.55it/s]
INFO:root:[Los trn]  [Los dev]  [Acc chr]  [Acc wrd]  [F-score]           [LR]
INFO:root:   0.1654     0.1675     0.9496     0.8968     0.9333 BEST      0.00100000
EPOCH[2]: 100%|████████████████████████████████████████████| 844614/844614 [50:39<00:00, 277.84it/s]
INFO:root:[Los trn]  [Los dev]  [Acc chr]  [Acc wrd]  [F-score]           [LR]
INFO:root:   0.1530     0.1638     0.9515     0.8989     0.9348 BEST      0.00100000

...

EPOCH[90]: 100%|███████████████████████████████████████████| 844614/844614 [49:24<00:00, 284.94it/s]
INFO:root:[Los trn]  [Los dev]  [Acc chr]  [Acc wrd]  [F-score]           [LR]
INFO:root:   0.1058     0.1237     0.9651     0.9259     0.9524 < 0.9525  0.00000247
INFO:root:}}}} training end: 02/04 15:25:15, elapsed: 73:36:05, epoch: 90 }}}}
INFO:root:==== test loss: 0.1241, char acc: 0.9651, word acc: 0.9258, f-score: 0.9526 ====

To view graphs of the training progress, point TensorBoard at the ./logdir directory. Training continues until performance on the dev corpus stops improving, at which point it finishes automatically. With the whole corpus, training takes about three days on an NVIDIA P40 GPU.
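The automatic stop described above is a form of early stopping. The sketch below replays a list of dev scores to show the idea. It is not taken from train.py: replay_training is a hypothetical name, and both the stopping condition (no new best score for patience epochs) and the learning rate decay after each non-improving epoch are assumptions.

```python
def replay_training(dev_scores, learning_rate=0.001, lr_decay=0.9, patience=10):
    """Replay per-epoch dev scores and report when training would stop.

    Returns (best_epoch, last_epoch, final_learning_rate).
    """
    best_score = float("-inf")
    best_epoch = -1
    last_epoch = -1
    for epoch, score in enumerate(dev_scores):
        last_epoch = epoch
        if score > best_score:
            best_score, best_epoch = score, epoch
            continue
        # Assumption: shrink the learning rate after a non-improving epoch.
        learning_rate *= lr_decay
        if epoch - best_epoch >= patience:
            break  # no new best for `patience` epochs
    return best_epoch, last_epoch, learning_rate


print(replay_training([0.90, 0.92, 0.91, 0.91, 0.91], patience=2))
```

In this toy run the best score is reached at epoch 1, and training stops at epoch 3 after two epochs without improvement.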

The training results are stored in the following five files under ./logdir/corpus.cut2.win3.sdo0.0.emb30.lr0.001.lrd0.9.bs500

  • config.json
  • events.out.tfevents.0000000000.hostname
  • log.tsv
  • model.state
  • optim.state

Please don't delete config.json and model.state, as they are required to build resources later. With the script below, you can load the model and run a simple morpheme analysis.

$ ./tag.py -m ./logdir/corpus.cut2.win3.sdo0.0.emb30.lr0.001.lrd0.9.bs500
INFO:root:vocab.in: 5109 entries, 512 cutoff
INFO:root:vocab.out: 500 entries, 0 cutoff
INFO:root:restore.dic: 4303 entries
안녕? 세상.
안녕?    안녕/IC + ?/SF
세상.    세상/NNG + ./SF
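The analysis lines printed above use the same surface/TAG + surface/TAG notation as the corpus. If you want to process them programmatically, a small parser like the following may help. The function names are hypothetical, and the sketch assumes morphemes are always joined by " + ".

```python
def parse_analysis(analysis):
    """Split 'surface/TAG + surface/TAG' into (surface, tag) pairs."""
    morphemes = []
    for item in analysis.split(" + "):
        # rsplit handles surfaces that are themselves "/", as in "//SP".
        surface, tag = item.rsplit("/", 1)
        morphemes.append((surface, tag))
    return morphemes


def parse_output_line(line):
    """Split an output line into the word and its list of morphemes."""
    word, analysis = line.split(None, 1)
    return word, parse_analysis(analysis.strip())


print(parse_output_line("안녕?    안녕/IC + ?/SF"))
print(parse_analysis("1993/SN + //SP + 06/SN"))
```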

The training script accepts the following options.

| Option | Description | Default Value |
| --- | --- | --- |
| -i, --in-pfx | Prefix of the training corpus | |
| --rsc-src | Resources and source directory | ../rsc/src |
| --logdir | Log directory | ./logdir |
| --window | Size of the window to the left and right of each syllable | 3 |
| --spc-dropout | Spacing dropout rate | 0.0 |
| --cutoff | Minimum frequency of an input vocab entry | 2 |
| --embed-dim | Embedding dimension | 30 |
| --learning-rate | Learning rate | 0.001 |
| --lr-decay | Decay rate of the learning rate | 0.9 |
| --batch-size | Batch size | 500 |
| --patience | Number of epochs* to keep training without reaching a new best performance | 10 |
| --gpu-num | GPU number to use | 0 |
| --debug | Debug information | |

*An epoch is one complete pass of the entire dataset forward and backward through the neural network.
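The result directory name seen in the log, corpus.cut2.win3.sdo0.0.emb30.lr0.001.lrd0.9.bs500, appears to be composed from these option values. The sketch below reproduces that pattern so you can predict the directory for other settings. The function make_model_id is hypothetical and the format is inferred from the single example in the log, so treat it as an assumption rather than the script's actual rule.

```python
def make_model_id(in_pfx, cutoff=2, window=3, spc_dropout=0.0, embed_dim=30,
                  learning_rate=0.001, lr_decay=0.9, batch_size=500):
    """Build a model ID in the pattern seen in the training log (inferred)."""
    return "%s.cut%d.win%d.sdo%s.emb%d.lr%s.lrd%s.bs%d" % (
        in_pfx, cutoff, window, spc_dropout, embed_dim,
        learning_rate, lr_decay, batch_size)


print(make_model_id("corpus"))
print(make_model_id("corpus", embed_dim=150))
```

The second call shows the name you would expect, under the same assumption, when training the large model with --embed-dim 150.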

Create a Pickle file

Of the five training result files mentioned above, model.state depends on the PyTorch version. To remove this dependency, create a pickle file once training is complete so that you can build resources without PyTorch. You can create the pickle file with the following command.

./pickle_model.py -i ./logdir/corpus.cut2.win3.sdo0.0.emb30.lr0.001.lrd0.9.bs500

The -i ./logdir/corpus.cut2.win3.sdo0.0.emb30.lr0.001.lrd0.9.bs500 option specifies the training result directory.

Running this script converts the files in the left column into the base model files in the right column, as listed below.

| ./logdir/corpus.cut2.win3.sdo0.0.emb30.lr0.001.lrd0.9.bs500 | ../rsc/src |
| --- | --- |
| config.json | base.config.json |
| model.state | base.model.pickle |

 

Translator's notes are italicized.

*The original document can be found at https://github.com/kakao/khaiii. Please note that this translation is a personal project and has not been reviewed by the Kakao team. Feel free to send feedback on any errors introduced in translation.

 
