-
I could split every two characters which would be a bit more efficient, but it's just added complexity and it sounds like in some cases it might mess things up?
What do you guys mean by every two characters?
A Chinese word = a Chinese character. In the past, one Chinese character took up twice as many bytes as an ASCII character.
E.g. in the number of bytes it takes to store "AB",
you can store only one Chinese character, "中". Thanks
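
To make the byte comparison concrete, here is a small sketch. The post does not name an encoding, so GBK is only an assumed example of a legacy double-byte encoding, UTF-8 is shown for contrast, and `byte_len` is a hypothetical helper name.

```python
def byte_len(text: str, encoding: str) -> int:
    """Return how many bytes `text` occupies in the given encoding."""
    return len(text.encode(encoding))


# GBK is an assumed stand-in for "in the past"; the post names no encoding.
for encoding in ("gbk", "utf-8"):
    print(encoding, byte_len("AB", encoding), byte_len("中", encoding))
```

Under GBK both "AB" and "中" take 2 bytes, which is the case described above; under UTF-8 "中" takes 3 bytes, so the two-to-one rule no longer holds.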
-
While in Classical Chinese it was certainly the case that one character most often wrote one word, in modern Mandarin it is more often the case that a word consists of two syllables, represented by two characters. This change can actually start to be seen as early as Xunzi, who uses a noticeably larger number of disyllabic words than Mencius. Let's look again at the example sentence from the first article of the Universal Declaration of Human Rights, this time adding spaces between words*:
人人 生 而 自由﹐在 尊嚴 和 權利 上 一律 平等。他們 賦有 理性 和 良心﹐並 應 以 兄弟 關係 的 精神 互相 對待。
Sum of monosyllabic words (one graph, one word): 10
Sum of disyllabic words (two graphs, one word): 15
*Note that for the purposes of this conversation, a word should be understood as something one would encounter in daily life and also something that one would find in a dictionary of modern Mandarin.
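
The two sums can be checked mechanically. This is only a sketch: it assumes the spacing in the example sentence is the word segmentation, and it treats whitespace, ﹐ and 。 as separators.

```python
import re

sentence = (
    "人人 生 而 自由﹐在 尊嚴 和 權利 上 一律 平等。"
    "他們 賦有 理性 和 良心﹐並 應 以 兄弟 關係 的 精神 互相 對待。"
)

# Split on whitespace and the punctuation used in the sentence,
# dropping the empty string left after the final full stop.
words = [w for w in re.split(r"[\s﹐，。]+", sentence) if w]

mono = sum(1 for w in words if len(w) == 1)  # one graph, one word
di = sum(1 for w in words if len(w) == 2)    # two graphs, one word

print(mono, di)
```

This gives 10 monosyllabic and 15 disyllabic words, matching the sums above.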
Thanks - well, that's promising. I'd propose detecting characters in the range U+4E00..U+9FFF (? This includes Japanese too), then encoding each such character as its own image. This will use a bit more memory, but will have the effect of wrapping each character whenever there is a new line (otherwise they will still be placed inline).

I could split every two characters which would be a bit more efficient, but it's just added complexity and it sounds like in some cases it might mess things up?
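
A rough sketch of the per-character split, assuming the U+4E00..U+9FFF test is the only criterion. The names `is_cjk_ideograph` and `split_for_wrapping` are made up for illustration, and the image-encoding step itself is not shown.

```python
from typing import List


def is_cjk_ideograph(ch: str) -> bool:
    """True if `ch` lies in U+4E00..U+9FFF, the range proposed above."""
    return "\u4e00" <= ch <= "\u9fff"


def split_for_wrapping(text: str) -> List[str]:
    """Split text so each character in the range is its own chunk.

    Runs of any other characters are kept together as a single chunk.
    """
    chunks: List[str] = []
    run: List[str] = []
    for ch in text:
        if is_cjk_ideograph(ch):
            if run:  # flush the pending run of other characters
                chunks.append("".join(run))
                run = []
            chunks.append(ch)
        else:
            run.append(ch)
    if run:
        chunks.append("".join(run))
    return chunks


print(split_for_wrapping("abc 中文 def"))
```

Each chunk would then be encoded as its own image, so a line can break between any two characters in the range. Note that this range covers kanji used in Japanese but not kana, which live in other blocks.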