Khaiii Github - CNN Model Training Process
flow123 · 2021. 7. 14. 17:32
CNN Model Training Process
Learning Corpus
Khaiii uses the Sejong Corpus to train the model. The corpus format is shown below.
<text>
<group>
<text>
<body>
<source>
<date>
BTAA0001-00000001 1993/06/08 1993/SN + //SP + 06/SN + //SP + 08/SN
</date>
<page>
BTAA0001-00000002 19 19/SN
</page>
</source>
<head>
BTAA0001-00000003 엠마누엘 엠마누엘/NNP
BTAA0001-00000004 웅가로 웅가로/NNP
BTAA0001-00000005 / //SP
BTAA0001-00000006 의상서 의상/NNG + 서/JKB
BTAA0001-00000007 실내 실내/NNG
BTAA0001-00000008 장식품으로… 장식품/NNG + 으로/JKB + …/SE
BTAA0001-00000009 디자인 디자인/NNG
BTAA0001-00000010 세계 세계/NNG
BTAA0001-00000011 넓혀 넓히/VV + 어/EC
</head>
<p>
BTAA0001-00000012 프랑스의 프랑스/NNP + 의/JKG
BTAA0001-00000013 세계적인 세계/NNG + 적/XSN + 이/VCP + ㄴ/ETM
BTAA0001-00000014 의상 의상/NNG
BTAA0001-00000015 디자이너 디자이너/NNG
BTAA0001-00000016 엠마누엘 엠마누엘/NNP
BTAA0001-00000017 웅가로가 웅가로/NNP + 가/JKS
BTAA0001-00000018 실내 실내/NNG
BTAA0001-00000019 장식용 장식/NNG + 용/XSN
BTAA0001-00000020 직물 직물/NNG
BTAA0001-00000021 디자이너로 디자이너/NNG + 로/JKB
BTAA0001-00000022 나섰다. 나서/VV + 었/EP + 다/EF + ./SF
</p>
Apart from headers and metadata, the text enclosed between <p> and </p>, <head> and </head>, or <l> and </l> tags is recognized and used for training the model.
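To make this concrete, here is a minimal hypothetical sketch (not khaiii's actual parser) of how lines inside those regions could be collected; the tag-stack logic and the function name are illustrative assumptions.
TRAIN_TAGS = {"p", "head", "l"}        # regions used for training
def extract_training_lines(lines):
    open_tags = []                     # stack of currently open tag names
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("</") and stripped.endswith(">"):
            if open_tags and open_tags[-1] == stripped[2:-1]:
                open_tags.pop()        # closing tag: leave the region
        elif stripped.startswith("<") and stripped.endswith(">"):
            open_tags.append(stripped[1:-1])   # opening tag: enter a region
        elif open_tags and open_tags[-1] in TRAIN_TAGS:
            yield line.rstrip("\n")    # a morpheme-annotated line like the samples above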
Execution Environment
You can find the scripts required for training under the train directory, and you should run them from inside that directory (for example, ./map_char_to_tag.py, described below). The Python modules that these scripts use are under src/main/python, so you should export the PYTHONPATH environment variable as written below.
export PYTHONPATH=/path/to/khaiii/src/main/python
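Alternatively, a script can extend the module search path itself instead of relying on the environment variable; a minimal equivalent sketch, reusing the placeholder path above:
import sys
sys.path.insert(0, "/path/to/khaiii/src/main/python")  # same effect as exporting PYTHONPATH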
You will need the following packages to train the model.
- tensorboardX
- tqdm
You can install them with pip as shown below.
pip install tensorboardX tqdm
When it comes to PyTorch, training works well in version 0.4.1 but not in version 1.0 (this will be fixed in the future). Hence, please use an environment management tool such as virtualenv, install PyTorch version 0.4.1, and proceed with training the model. For details about installing PyTorch, please refer to the official PyTorch installation guide.
Syllable-Based Arrangement
You can use the following command to arrange original words and their morpheme analysis results by syllables.
./map_char_to_tag.py -c corpus --output corpus.txt --restore-dic restore.dic
The -c corpus option specifies the directory containing the corpus files. The Sejong Corpus is distributed in UTF-16 encoding, so you should convert it to UTF-8 before running the script above.
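For example, the re-encoding could be done with a throwaway script like the sketch below; the corpus directory name mirrors the -c option above, and reading each file fully into memory is an assumption that suits the modestly sized Sejong files.
import pathlib
for path in pathlib.Path("corpus").iterdir():
    if path.is_file():
        text = path.read_text(encoding="utf-16")   # encoding of the Sejong distribution
        path.write_text(text, encoding="utf-8")    # re-save as UTF-8 in place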
The --output corpus.txt option specifies the syllable arrangement result file. If the command is successfully executed, a file in the format below will be created.
엠마누엘 I-NNP I-NNP I-NNP I-NNP
웅가로 I-NNP I-NNP I-NNP
/ I-SP
의상서 I-NNG I-NNG I-JKB
실내 I-NNG I-NNG
장식품으로… I-NNG I-NNG I-NNG I-JKB I-JKB I-SE
디자인 I-NNG I-NNG I-NNG
세계 I-NNG I-NNG
넓혀 I-VV I-VV:I-EC:0
프랑스의 I-NNP I-NNP I-NNP I-JKG
세계적인 I-NNG I-NNG I-XSN I-VCP:I-ETM:0
의상 I-NNG I-NNG
디자이너 I-NNG I-NNG I-NNG I-NNG
엠마누엘 I-NNP I-NNP I-NNP I-NNP
웅가로가 I-NNP I-NNP I-NNP I-JKS
실내 I-NNG I-NNG
장식용 I-NNG I-NNG I-XSN
직물 I-NNG I-NNG
디자이너로 I-NNG I-NNG I-NNG I-NNG I-JKB
나섰다. I-VV I-VV:I-EP:0 I-EF I-SF
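In this format, each line pairs an original word with exactly one tag per syllable; complex tags such as I-VV:I-EP:0 mark syllables whose original morphemes must be restored later. A minimal sketch of reading one line, taken from the sample above:
line = "나섰다. I-VV I-VV:I-EP:0 I-EF I-SF"
word, *tags = line.split()
assert len(word) == len(tags)          # exactly one tag per syllable
for syllable, tag in zip(word, tags):
    print(syllable, tag)               # e.g. 섰 I-VV:I-EP:0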
The --restore-dic restore.dic option specifies the syllable restoration dictionary file. If the command is successfully executed, a file in the format below will be created.
혀/I-VV:I-EC:0 히/I-VV 어/I-EC
혀/I-VV:I-EC:1 히/I-VV 여/I-EC
혀/I-VV:I-EC:2 허/I-VV 어/I-EC
혀/I-VV:I-EC:3 하/I-VV 여/I-EC
혀/I-VV:I-EC:4 혀/I-VV 어/I-EC
혀/I-VV:I-EC:5 치/I-VV 어/I-EC
혀/I-VV:I-EC:6 히/I-VV 아/I-EC
인/I-VCP:I-ETM:0 이/I-VCP ㄴ/I-ETM
인/I-VCP:I-ETM:1 이/I-VCP 은/I-ETM
섰/I-VV:I-EP:0 서/I-VV 었/I-EP
섰/I-VV:I-EP:1 시/I-VV 었/I-EP
섰/I-VV:I-EP:2 스/I-VV 었/I-EP
The restore.dic file is required during the training process, and you should copy it under rsc/src.
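Each entry maps a syllable with a complex tag (the trailing :0, :1, ... index distinguishes alternative restorations) to its original morpheme/tag sequence. A small illustrative loader, assuming the whitespace-separated layout shown above:
restore = {}
with open("restore.dic", encoding="utf-8") as fin:
    for entry in fin:
        key, *morphs = entry.split()
        restore[key] = [tuple(m.rsplit("/", 1)) for m in morphs]
# e.g. restore["혀/I-VV:I-EC:0"] == [("히", "I-VV"), ("어", "I-EC")]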
Splitting the training corpus into three sets
With the following command, split the training corpus (created by map_char_to_tag.py) into three sets (dev/test/train).
./split_corpus.py --input corpus.txt -o corpus
The --input corpus.txt option specifies the corpus to be split. The -o corpus option is the prefix of the output files. For example, if you pass corpus, the corpus will be split into corpus.dev, corpus.test, and corpus.train.
Create Vocab
The following command creates the vocabulary files from the corpus's training set (corpus.train).
./make_vocab.py --input corpus.train
If you don't specify a directory option, ../rsc/src will be used automatically. The vocab.in and vocab.out files will be created in that directory. The vocab.in file lists each syllable and its frequency, as seen below.
齒 25
齡 8
龍 300
龕 8
龜 16
가 499305
각 58237
간 77133
갇 478
갈 15383
The vocab.out file lists only the output POS tags, without their frequencies, as seen below.
I-XSN
I-XSV
I-ZN
I-ZV
I-ZZ
B-EP:I-EC:0
B-EP:I-EF:0
B-EP:I-ETM:0
B-JKB:I-JKG:0
B-JKB:I-JX:0
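To illustrate how the frequency column in vocab.in interacts with the --cutoff option listed in the table further below, here is a hypothetical loader (not khaiii's actual code) that drops rare syllables; the <unk> placeholder and the strict threshold are assumptions.
cutoff = 2                    # matches the --cutoff default below
vocab = {"<unk>": 0}          # assumed index for rare or unseen syllables
with open("../rsc/src/vocab.in", encoding="utf-8") as fin:
    for entry in fin:
        syllable, freq = entry.split()
        if int(freq) > cutoff:
            vocab[syllable] = len(vocab)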
Model Training
Once corpus and vocab files are ready, you can start training the model with the following command.
./train.py -i corpus
-i corpus refers to the prefix of the three files corpus.train, corpus.dev, and corpus.test. When training the large model, add the --embed-dim 150 option.
Once training runs successfully, output like the following will be printed.
INFO:root:vocab.in: 5109 entries, 512 cutoff
INFO:root:vocab.out: 500 entries, 0 cutoff
INFO:root:restore.dic: 4303 entries
munjong.dev: 100%|█████████████████████████████████████████| 64444/64444 [00:01<00:00, 43958.83it/s]
INFO:root:munjong.dev: 5000 sentences
munjong.test: 100%|████████████████████████████████████████| 64589/64589 [00:01<00:00, 39999.19it/s]
INFO:root:munjong.test: 5000 sentences
munjong.train: 100%|█████████████████████████████████| 10763939/10763939 [04:51<00:00, 36971.95it/s]
INFO:root:munjong.train: 844614 sentences
INFO:root:config: {'batch_size': 500,
'best_epoch': 0,
'context_len': 7,
'cutoff': 2,
'debug': False,
'embed_dim': 30,
'epoch': 0,
'gpu_num': 5,
'hidden_dim': 310,
'in_pfx': 'corpus',
'learning_rate': 0.001,
'logdir': './logdir5',
'lr_decay': 0.9,
'model_id': 'corpus.cut2.win3.sdo0.0.emb30.lr0.001.lrd0.9.bs500',
'out_dir': './logdir/corpus.cut2.win3.sdo0.0.emb30.lr0.001.lrd0.9.bs500',
'patience': 10,
'rsc_src': '../rsc/src',
'spc_dropout': 0.0,
'window': 3}
INFO:root:{{{{ training begin: 02/01 13:49:10 {{{{
EPOCH[0]: 100%|████████████████████████████████████████████| 844614/844614 [51:17<00:00, 274.42it/s]
INFO:root:[Los trn] [Los dev] [Acc chr] [Acc wrd] [F-score] [LR]
INFO:root: 0.2512 0.1866 0.9448 0.8876 0.9269 BEST 0.00100000
EPOCH[1]: 100%|████████████████████████████████████████████| 844614/844614 [49:59<00:00, 281.55it/s]
INFO:root:[Los trn] [Los dev] [Acc chr] [Acc wrd] [F-score] [LR]
INFO:root: 0.1654 0.1675 0.9496 0.8968 0.9333 BEST 0.00100000
EPOCH[2]: 100%|████████████████████████████████████████████| 844614/844614 [50:39<00:00, 277.84it/s]
INFO:root:[Los trn] [Los dev] [Acc chr] [Acc wrd] [F-score] [LR]
INFO:root: 0.1530 0.1638 0.9515 0.8989 0.9348 BEST 0.00100000
...
EPOCH[90]: 100%|███████████████████████████████████████████| 844614/844614 [49:24<00:00, 284.94it/s]
INFO:root:[Los trn] [Los dev] [Acc chr] [Acc wrd] [F-score] [LR]
INFO:root: 0.1058 0.1237 0.9651 0.9259 0.9524 < 0.9525 0.00000247
INFO:root:}}}} training end: 02/04 15:25:15, elapsed: 73:36:05, epoch: 90 }}}}
INFO:root:==== test loss: 0.1241, char acc: 0.9651, word acc: 0.9258, f-score: 0.9526 ====
For graphs of the training progress, point TensorBoard at ./logdir. Training continues until performance on the dev corpus stops improving, at which point the process finishes automatically. If you use the whole corpus, training takes about 3 days on an NVIDIA P40 GPU.
You can find more details about the training results in the 5 files below, under ./logdir/corpus.cut2.win3.sdo0.0.emb30.lr0.001.lrd0.9.bs500:
- config.json
- events.out.tfevents.0000000000.hostname
- log.tsv
- model.state
- optim.state
Please don't delete the config.json and model.state files, as they are required to build resources later. With the script below, you can load the model and run a simple morpheme analysis.
$ ./tag.py -m ./logdir/corpus.cut2.win3.sdo0.0.emb30.lr0.001.lrd0.9.bs500
INFO:root:vocab.in: 5109 entries, 512 cutoff
INFO:root:vocab.out: 500 entries, 0 cutoff
INFO:root:restore.dic: 4303 entries
안녕? 세상.
안녕? 안녕/IC + ?/SF
세상. 세상/NNG + ./SF
The training script supports the following additional options.
Option | Description | Default Value |
---|---|---|
-i, --in-pfx | Prefix of training corpus | |
--rsc-src | Resources and source directory | ../rsc/src |
--logdir | Log Directory | ./logdir |
--window | Size of the window to the left/right of each syllable | 3 |
--spc-dropout | Spacing dropout rate | 0.0 |
--cutoff | Minimal frequency of input vocab entry | 2 |
--embed-dim | Embedding dimensions | 30 |
--learning-rate | Learning rate | 0.001 |
--lr-decay | Learning rate decay | 0.9 |
--batch-size | Batch size | 500 |
--patience | Number of epochs to keep training without improving on the best performance (*an epoch is one full pass of the entire dataset forward and backward through the network) | 10 |
--gpu-num | GPU Number to use | 0 |
--debug | Debug information |
Create a Pickle file
Out of the 5 files produced by the training run mentioned above, the model.state file depends on the PyTorch version. To remove this dependency, you should create a pickle file once training is complete so that resources can be built without PyTorch. You can create a pickle file with the following command.
./pickle_model.py -i ./logdir/corpus.cut2.win3.sdo0.0.emb30.lr0.001.lrd0.9.bs500
The training result directory is passed with the -i option, i.e. -i ./logdir/corpus.cut2.win3.sdo0.0.emb30.lr0.001.lrd0.9.bs500.
By running this script, the files in the left column below will be converted to the base model files in the right column.
./logdir/corpus.cut2.win3.sdo0.0.emb30.lr0.001.lrd0.9.bs500 | ../rsc/src |
---|---|
config.json | base.config.json |
model.state | base.model.pickle |
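The idea behind this conversion, sketched below on the assumption that model.state holds an ordinary PyTorch state dict of named tensors, is to replace version-bound tensors with plain numpy arrays before pickling; this illustrates the technique and is not the actual pickle_model.py.
import pickle
import torch  # assumed: the PyTorch 0.4.1 environment used for training
state = torch.load("model.state", map_location="cpu")             # version-bound tensors
plain = {name: tensor.numpy() for name, tensor in state.items()}  # plain numpy arrays
with open("base.model.pickle", "wb") as fout:
    pickle.dump(plain, fout)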
Translator's notes appear in italics.
*The original document can be found at https://github.com/kakao/khaiii. Please note that this translation has not been reviewed by the Kakao team; it is simply my personal project. Please feel free to provide feedback on any errors that may have occurred during translation.
Translator's Note
Introducing the Khaiii Github Translation Project: Link
[Khaiii Github] Key Terms & Concepts: Link
Other Khaiii Translations
[Khaiii Github] README.md: Link
[Khaiii Github] Pre Analysis Dictionary: Link
[Khaiii Github] CNN Model: Link
[Khaiii Github] Test for Specialized Spacing Error Model: Link
[Khaiii Github] CNN Model Training Process: Link
[Khaiii Github] Analysis Error Patch: Link