
Khaiii Github - Test for Specialized Spacing Error Model

flow123 2021. 7. 12. 14:28


Overfitting: if a model is overfitted, it achieves high accuracy on the training data but does not work properly on validation or test data. An overfitted model struggles to evaluate new data, because it has adapted too closely to the training data and has even learned its noise.

Users often forget to put a space where one belongs, especially on mobile devices. Khaiii was trained on the Sejong Corpus, which contains no spacing errors, so it became vulnerable to them. As you saw in the CNN Model document, in order to analyze by syllable, Khaiii builds a context the size of the window and adds the virtual syllables <w> and </w>, which mark the left and right borders of the current word. This makes the v0.3 model highly dependent on spacing.
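
To make the window-and-boundary description concrete, below is a minimal sketch in Python (my own illustration, not Khaiii's actual code) of building a syllable context with virtual boundary syllables. The function name, the window size, and the <p> padding symbol are all hypothetical.

```python
# A minimal sketch (hypothetical, not Khaiii's code) of building syllable
# contexts. '<w>' / '</w>' mark the left/right borders of each word, and
# '<p>' is a made-up padding symbol for positions outside the sentence.

def build_contexts(words, window=3, pad="<p>"):
    """Yield (syllable, context) pairs for every real syllable."""
    seq = []
    for word in words:
        seq.append("<w>")          # virtual left border of the word
        seq.extend(list(word))     # each Hangul syllable is one unit
        seq.append("</w>")         # virtual right border of the word

    padded = [pad] * window + seq + [pad] * window
    for i, syl in enumerate(seq):
        if syl in ("<w>", "</w>"):
            continue               # only real syllables get a context
        center = i + window        # position of syl inside `padded`
        yield syl, padded[center - window:center + window + 1]

for syl, ctx in build_contexts(["아버지가", "방에"]):
    print(syl, ctx)
```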

Spacing Dropout Test

Decoder: unfolds a vector representing the sequence state and returns something meaningful to us, such as text or tags. (Baeldung.com)

The Khaiii team borrowed the dropout technique used for training neural networks. When adding the virtual syllables <w> and </w>, which represent spaces at the left and right word boundaries, we trained the new model to randomly omit them. We also changed the decoder so that it always uses a spacing syllable whenever there is an actual space in the input data.
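
Here is a minimal sketch of what spacing dropout could look like (again my own illustration under the assumptions above, not Khaiii's training code): each virtual spacing syllable is kept only with probability 1 − dropout during training, while at decoding time every real space keeps its spacing syllable.

```python
import random

def spaced_sequence(words, dropout=0.5, training=True):
    """Build the syllable sequence, randomly dropping spacing syllables
    during training; at decoding time (training=False) the spacing
    syllables are always kept when the input contains a real space."""
    seq = []
    for word in words:
        if not training or random.random() >= dropout:
            seq.append("<w>")       # left-border spacing syllable
        seq.extend(list(word))
        if not training or random.random() >= dropout:
            seq.append("</w>")      # right-border spacing syllable
    return seq

random.seed(42)
print(spaced_sequence(["아버지가", "방에"], dropout=0.5))
```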

Test Result

[Figure: spacing dropout test results]

  • Blue: v0.3 model (dropout 0.0)

  • Red: Spacing was not added (dropout 1.0)

  • Orange: spacing dropout, with randomly selected spacing syllables ignored (dropout 0.5)

Since the Sejong Corpus has accurate spacing, it's no surprise that the v0.3 model, trained on that corpus, performs best. Comparing the other two, however, the orange model (dropout 0.5) performs worse than the red one (dropout 1.0), which has no spacing at all.

Several factors may explain the orange model's low performance. When the virtual spacing syllables are added at random, the positional information of the surrounding syllables shifts, which can make the input to the convolution filter uneven.

Element-Wise Sum Operation Model

Element-wise: a binary operation that takes two matrices of the same dimensions and produces another matrix of the same dimension as the operands

Embedding: a relatively low-dimensional space into which you can translate high-dimensional vectors

Padding: the number of pixels added to an image when it is being processed by the kernel of a CNN.

Masking: a way to tell sequence-processing layers that certain timesteps in an input are missing, and thus should be skipped when processing the data.

Resources: Wikipedia, Google Developers, Google Colaboratory, kdnuggets.com

In the v0.3 model, spacing syllables occupy positions in the syllable context and are treated just like any other syllable. In the new model, at the position of a word's left/right border, we perform an element-wise sum with the embeddings of the virtual syllables <w> and </w>. If this model performs similarly to v0.3, we will apply dropout when adding the spacing syllables. With the element-wise method, the embedding values are added at the corresponding positions, which should keep the positional information from becoming uneven.

The following changes have been made for this test.

  • The v0.3 model embeds the left/right padding as well as the virtual syllables. In the new version, we changed the setting so that the padding has a zero embedding and can be skipped. The positional encoding still covers the padding, and we have not fully masked it.
  • For spaces, we use a <w> at the left border of the word and a </w> at the right, adding them through an element-wise sum (see the sketch after this list).
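
Below is a minimal PyTorch sketch of the element-wise variant (Khaiii's training code is PyTorch-based, but this sketch and its vocabulary indices are my own assumptions). The spacing marks no longer occupy context positions; their embeddings are added onto the embedding of the border syllable, and the padding row is fixed at zero via padding_idx.

```python
import torch
import torch.nn as nn

DIM = 35
PAD, LEFT_W, RIGHT_W = 0, 1, 2            # hypothetical ids: pad, <w>, </w>
emb = nn.Embedding(1000, DIM, padding_idx=PAD)   # padding row stays zero

def embed_context(syllable_ids, left_pos=None, right_pos=None):
    """syllable_ids: LongTensor (context_len,). left_pos/right_pos are the
    context positions of the word's first/last syllable, or None when
    spacing dropout removed that spacing mark."""
    x = emb(syllable_ids)                 # (context_len, DIM)
    extra = torch.zeros_like(x)
    if left_pos is not None:
        extra[left_pos] = emb.weight[LEFT_W]
    if right_pos is not None:
        extra[right_pos] = emb.weight[RIGHT_W]
    return x + extra                      # element-wise sum at the borders

ctx = torch.tensor([PAD, PAD, 17, 42, 99])        # toy context, left-padded
print(embed_context(ctx, left_pos=2, right_pos=4).shape)  # torch.Size([5, 35])
```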

Test Result

[Figure: element-wise sum model test results]

  • Blue: dropout = 0
  • Pink / brown / green: dropout = 0.25, 0.5, 0.75, respectively
  • Baby blue: no spacing syllables used (dropout = 1.0)

For the v0.3-style model, where spacing syllables occupy positions like other syllables, the dropout = 0.5 model has the lowest performance. For the element-wise models, the dropout = 1.0 model has the lowest performance.

In the element-wise model, the accuracy (F-score) gradually decreases as the dropout ratio increases.

The dropout = 0.5 model performs 1%p worse than the blue one (dropout = 0). The dropout = 0.25 model shows only a 0.5%p difference from the dropout = 0 model, which seems acceptable.

The Khaiii team decided to perform a multi-task test before choosing the model.

Spacing Model & Multi-Task learning

Fully-connected layers: fully connected layers in neural networks are those layers where all the inputs from one layer are connected to every activation unit of the next layer. In most popular machine learning models, the last few layers are fully connected layers, which compile the data extracted by previous layers to form the final output. (https://iq.opengenus.org/)

The v0.3 model assigns appropriate POS tags but is not specialized for spacing, while the spacing-specialized model cannot classify POS tags. Hence, multi-task learning was introduced so that a single model can learn both spacing and POS tagging.

The Sejong Corpus has accurate spacing, sentence splitting, and POS information, so we used this corpus to train the POS tagging and spacing models. The two models share the syllable-based embedding and convolution layers but have separate fully connected layers: one head classifies spacing and the other assigns POS tags. We trained these two models together with multi-task learning (a sketch of this architecture follows).
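
Below is a minimal PyTorch sketch of the shared-trunk, two-head architecture described above (not the actual Khaiii model; the vocabulary size, hidden size, and tag count are hypothetical).

```python
import torch
import torch.nn as nn

class MultiTaskTagger(nn.Module):
    """Shared syllable embedding + convolution, with separate fully
    connected heads for POS tagging and spacing classification."""

    def __init__(self, vocab=1000, dim=35, hidden=128, n_pos_tags=45):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim, padding_idx=0)
        self.conv = nn.Conv1d(dim, hidden, kernel_size=3, padding=1)  # shared
        self.pos_head = nn.Linear(hidden, n_pos_tags)  # POS tag of the syllable
        self.space_head = nn.Linear(hidden, 2)         # space after it or not

    def forward(self, ids):                  # ids: (batch, context_len)
        x = self.emb(ids).transpose(1, 2)    # (batch, dim, context_len)
        h = torch.relu(self.conv(x)).max(dim=2).values  # pool over the context
        return self.pos_head(h), self.space_head(h)
```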

The loss is as written below. #58

  • POS loss = cross-entropy loss that evaluates the POS tag of each syllable
  • Spacing loss = cross-entropy loss that evaluates whether a space follows each syllable
  • Total loss = POS loss + spacing loss (see the sketch after this list)
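
In code, the combined loss could look like the following minimal sketch (assuming logits from the two heads of the MultiTaskTagger sketch above):

```python
import torch.nn.functional as F

def total_loss(pos_logits, pos_labels, space_logits, space_labels):
    pos_loss = F.cross_entropy(pos_logits, pos_labels)        # POS per syllable
    space_loss = F.cross_entropy(space_logits, space_labels)  # space after it
    return pos_loss + space_loss
```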

Forward/backward steps are as written below.

  • Forward without the spacing-syllable embeddings and calculate the loss
  • Forward with the spacing syllables (dropout applied) and calculate the loss
  • Sum the two losses and backpropagate once (see the sketch after this list)
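
A minimal sketch of this two-pass procedure, reusing the MultiTaskTagger and total_loss sketches above (the function and argument names are hypothetical):

```python
def train_step(model, optimizer, ids_no_space, ids_with_space,
               pos_labels, space_labels):
    optimizer.zero_grad()

    pos1, sp1 = model(ids_no_space)          # 1) forward without spacing
    loss1 = total_loss(pos1, pos_labels, sp1, space_labels)

    pos2, sp2 = model(ids_with_space)        # 2) forward with spacing
    loss2 = total_loss(pos2, pos_labels, sp2, space_labels)

    loss = loss1 + loss2                     # 3) backward once on the total
    loss.backward()
    optimizer.step()
    return loss.item()
```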

[Figure: single-task vs. multi-task learning results]

  • Brown: existing single task learning
  • Baby blue: multi-task learning (MTL)
  • Pink: MTL dropout 0.25
  • Green: MTL dropout 0.5
  • Gray: MTL dropout 0.75

You will find a slight performance decrease in multi-task learning (baby blue) compared to single-task learning (brown). Comparing the MTL models with different dropout ratios, performance decreases as the dropout ratio increases.

Does this mean the v0.3 model works best? Let's take a look at how the two models analyze the sentence 아버지가 방에 들어가신다 ("Father enters the room").

Model | Result
v0.3  | 아버지가방/NNG + 에/JKB + 들/VV + 어/EC + 가/VX + 시/EP + ㄴ다/EF + ./SF
MTL   | 아버지/NNG + 가/JKS + 방/NNG + 에/JKB + 들어가/VV + 시/EP + ㄴ다/EF + ./SF

We looked through the models with dropout = 0.1 and an F-score around 95%, and chose the brown model with a window size of 4 and an embedding size of 35.

[Figure: model selection results]

In order to compare the v0.3 and MTL models in a more quantitative way, we randomly deleted spaces from the test set and tested the models, as seen below.

[Figure: accuracy by ratio of deleted spaces]

The v0.3 model's performance decreases drastically as more spaces are deleted, while the MTL model shows a moderately decreasing curve. When all spaces are deleted, the MTL model still has an accuracy of 80.76%, while the v0.3 model drops to 45.51%.
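
The random space deletion used for this comparison can be illustrated with a minimal sketch (a hypothetical helper, not Khaiii's evaluation script):

```python
import random

def delete_spaces(sentence, ratio, seed=None):
    """Delete each space in `sentence` with probability `ratio`."""
    rng = random.Random(seed)
    kept = [ch for ch in sentence if ch != " " or rng.random() >= ratio]
    return "".join(kept)

print(delete_spaces("아버지가 방에 들어가신다.", ratio=1.0))
# -> 아버지가방에들어가신다.
```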

Conclusion

The multi-task learning model performs slightly worse than v0.3 but is robust to spacing errors. We can see that the v0.3 model is overfitted to a corpus that has no spacing errors at all. As Sebastian Ruder's paper also mentions, multi-task learning appears to help generalize the model.

Translator's notes are italicized.

*The original document can be found at https://github.com/kakao/khaiii. Please note that this document has not been reviewed by the Kakao team; it is just my personal project. Please feel free to provide feedback on any errors that may have occurred during the translation process.

Translator's Note

Introduce Khaiii Github Translation Project: Link

[Khaiii Github] Key terms & Concepts: Link

Other Khaiii Translation

[Khaiii Github] Read Me.md: Link

[Khaiii Github] Pre Analysis Dictionary: Link

[Khaiii Github] CNN Model: Link

[Khaiii Github] Test for Specialized Spacing Error Model: Link

[Khaiii Github] CNN Model Training Process: Link

[Khaiii Github] Analysis Error Patch: Link
