Table of Contents:
0. Justification
1. Historical Background
2. Comparative Analysis
3. Solution Plan
0. Justification
From the standpoint of programming, computational science and combinatorics, Hangul Mathematics solves an algorithmic problem inherent to Hangul. The question is whether the 33 elements of Hangul - a script designed centuries ago by scholars working as engineers - can be assembled linearly according to clear rules and integer-based logic; anything more complex is unlikely given the time and place of the script's origin. The answer is unequivocally yes. Every element can be assigned a concrete, unique numeric index; every syllable can be expressed as a unique sum of indices; every rule can be expressed as a numeric range; and every element is equal to itself regardless of its hierarchical position.
From a business and development perspective, this replaces a table of roughly 11 200 unique elements - related to one another only by sort order, which has led to endless re-checks and lookup tables - with just 33 atoms, while font and system rendering is performed by HarfBuzz or a similar shaping engine. Such engines are already the standard for joining elements in Arabic and Hindi (Devanagari) text. HarfBuzz can already join Hangul letters, but from the code's viewpoint they remain separate elements, which is why the technique never achieved systematic adoption.
Mathematics provides the justification for completely abandoning the Hangul combination map. Once the exact formulation and the rules of the elements are known and incorporated mathematically, the syllable is computed algorithmically; there is no need to store maps and databases.
Brief description of the approach:
Hangul Mathematics Formula: L1 + L2 + L3(ㅗ|ㅜ|ㅣ) + L4×20 + L5×20(ㄹ|ㄴ|ㅅ)
Equality of letters to themselves and to their functions is the foundation of the method:
ㄱ (initial, 초성) = ㄱ (final, 종성) = ㄱ (in ㄺ, ㄳ) | ㅗ + ㅏ = ㅘ
33 numeric tags, exactly as many as there are elements in Hangul (including ㅔ, ㅖ, ㅐ, ㅒ), without any need for re-checks.
The Complete Hangul Map and its formula: (19 consonants + empty initial slot) × ((14 vowels + empty medial slot) + 7 vowel combinations (ㅘ ㅙ ㅝ ㅞ …)) × ((16 consonants + empty final slot + 3 (ㄸ, ㅃ, ㅉ)) + 11 consonant clusters (ㅄ ㄳ ㄵ ㄶ ㅀ …)) = 20 × 22 × 31 = 13 640 elements
Element identity: the principle for all letters is ㄱ (initial) = ㄱ (final) = ㄱ (in ㄺ, ㄳ).
Methodology: vowels ≤ 21 and consonants ≥ 22 is a mandatory condition.
Interconnection: all computations are based on a single core formula and its path-ranges.
Zero exceptions: path-ranges describe every function and every link between letters within a syllable.
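The size of the Complete Hangul Map can be checked with plain integer arithmetic. The sketch below only multiplies the slot counts listed in the map formula; the variable names are illustrative, and the 19 × 21 × 28 figure is the count of precomposed modern syllables in Unicode, added here for comparison.

```python
# Slot counts taken from the Complete Hangul Map formula in the text.
initial_slots = 19 + 1             # 19 consonants + empty initial slot
medial_slots = (14 + 1) + 7        # 14 vowels + empty medial slot + 7 vowel combinations
final_slots = (16 + 1 + 3) + 11    # 16 consonants + empty final slot + 3 (ㄸ, ㅃ, ㅉ) + 11 clusters

full_map = initial_slots * medial_slots * final_slots
print(full_map)                    # 13640

# For comparison: Unicode precomposes 19 initials x 21 medials x 28 finals (27 + none).
unicode_syllables = 19 * 21 * 28
print(unicode_syllables)           # 11172

# The atoms themselves: 19 consonants + 14 vowels.
atoms = 19 + 14
print(atoms)                       # 33
```

The difference between 13 640 and 11 172 comes from the extra slots the map admits: an empty initial, an empty medial, and ㄸ, ㅃ, ㅉ as finals.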
1. Historical Background
The history of Hangul is the story of an invention created by engineers - an invention that ran into functional and combinatorial constraints when a new challenge of the era, the typewriter, appeared. Later it failed to make a proper, high-quality transition into the digital world because no combinatorics specialists were brought in, even though such specialists already existed at the time: Alexey Pajitnov (creator of Tetris, the combinatorial assembly of elements based on pure mathematics), Wang Yongmin (creator of Wubi, a unique, logical, digital approach to Chinese characters that is still in use today), and Japanese developers of early consoles and PCs, who were adept at solving software, game-design and language-support problems under tight hardware constraints.
Detailed chronology from the invention to the present day:
1443: The mathematical foundation
The creators of Hangul, the scholars of King Sejong’s era, laid a foundation that today would be called an engineering principle. The treatise Hunmin Jeongeum (훈민정음) states verbatim: 종성부용초성 - “The final sound re-uses the initial consonant.” This meant that one and the same letter, for example ㄱ, is the same entity regardless of whether it appears at the beginning of a syllable, at the end, or as part of a double final consonant (batchim). The same applies to vowels: they combine with each other without changing their essence. The alphabet was designed from the outset as a combinatorial system: a minimal set of atomic elements capable of generating every possible syllable according to clear geometric rules. No separation into “initial” and “final” or “single” and “double” versions of the same letter was ever intended - it would have contradicted the very engineering logic of Hangul.
1933: The first compromise - mechanical typewriters
In 1933 the Korean Linguistic Society adopted the Unified Hangul Orthography Rules (한글 맞춤법 통일안). This document, created to standardize writing in the age of mechanical typewriters, introduced the first serious departure from the original principle 종성부용초성. It was administratively decreed that the consonants ㄸ, ㅃ, ㅉ could not be used as final consonants (batchim). This decision had neither a phonetic nor a mathematical basis - it was dictated exclusively by the limitations of mechanical typewriters, for which manufacturing separate slugs for rarely used syllables was economically unfeasible. The mathematical principle was sacrificed for the convenience of the technology of the time.
1987: The fatal fork - the KS C 5601 standard
By the mid-1980s South Korea faced the task of digitizing its national alphabet. A committee was convened to develop the national encoding standard KS C 5601. It included government officials, engineers from major corporations (above all Samsung, GoldStar/LG, Daewoo), and computer manufacturers such as TriGem Computer (Sambo). The committee faced a choice between two approaches:
Algorithmic (dynamic assembly): encode only the basic letters (Jamo), and form the syllable in software. Preliminary calculations suggested it would require more computational power, and the answer to whether Hangul possesses a purely engineering essence or only a phonetic and linguistic one was unknown.
Hieroglyphic (ready‑made syllables): encode every possible syllable as a separate symbol. This was extremely simple to implement, but very costly and complicated for any downstream processing, and it stripped Hangul of its simplicity.
The engineers of TriGem and other manufacturers, basing their decision on memory and performance constraints, chose the second path, which was closer to the typewriter paradigm that heavily influenced early computer manufacturers. The standard was frozen with 2 350 ready‑made syllables selected by frequency. All other possible syllables were simply ignored. The letter ㄱ at the start of a syllable and ㄱ at the end of a syllable became different code‑points. The principle 종성부용초성 was not taken into account during the design. It is also important to note that the problem was not unsolvable even in the 1980s, because Pajitnov solved the Tetris problem in the same years, using far weaker equipment and relying on mathematics and combinatorics.
1993: The error is cemented - the Unicode standard
When the Unicode Consortium began creating a single worldwide standard in the early 1990s, it was confronted with the existing data ecosystem in the KS C 5601 encoding (known as EUC-KR). South Korea imposed a strict requirement: ensure full round-trip compatibility with the already existing data.
The Consortium chose a compromise that at the time seemed wise and exhaustive: Unicode 1.1 (1993) included both approaches. A block of ready-made Hangul Syllables was created for compatibility with legacy data (it reached its present size of 11 172 syllables in Unicode 2.0, 1996), and the Hangul Jamo block of individual letters was created for the theoretical possibility of dynamic assembly. However, the latter method never gained practical adoption and, more than three decades later, still lacks full-fledged tools for practical work. This decision also gave rise to a dual representation of the same text (NFC and NFD), in which the cruder precomposed Syllables representation became dominant over the Jamo (grapheme) representation, whose correct use would have required studying it and discovering its regularities.
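The dual representation can be observed with Python's standard unicodedata module. The sketch below also splits a precomposed syllable into Jamo by integer arithmetic alone, using the constants of the Hangul syllable algorithm in the Unicode Standard; this is Unicode's own arithmetic, not the SAMSEGI formula, and the function name decompose is illustrative.

```python
import unicodedata

# Constants of the Unicode Hangul syllable algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28

def decompose(syllable: str) -> str:
    """Split one precomposed syllable into conjoining Jamo without any table."""
    s_index = ord(syllable) - S_BASE
    if not 0 <= s_index < L_COUNT * V_COUNT * T_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    l_index, rest = divmod(s_index, V_COUNT * T_COUNT)
    v_index, t_index = divmod(rest, T_COUNT)
    jamo = chr(L_BASE + l_index) + chr(V_BASE + v_index)
    if t_index:                      # index 0 means "no final consonant"
        jamo += chr(T_BASE + t_index)
    return jamo

nfc = "한"                                   # one code point, U+D55C
nfd = unicodedata.normalize("NFD", nfc)      # three code points: ㅎ + ㅏ + ㄴ
print(len(nfc), len(nfd))                    # 1 3
print(nfc == nfd)                            # False: same text, different data
print(decompose(nfc) == nfd)                 # True: arithmetic agrees with NFD
```

The two strings render identically but compare unequal, which is the practical meaning of "dual representation" for any code that stores or compares Korean text.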
2026: Absolute technological dead-end
No Jamo support: there are no genuine, full-fledged Jamo-based APIs, support packages or repositories - everything is tied solely to the 11 172 ready-made syllables. It is as if, instead of the Arabic letters, a system contained every possible combination of letters. All systems suffer from the same architectural bug, and everyone ends up fighting symptoms. (This aspect is examined in detail in Chapter 2, Comparative Analysis.)
As part of an experiment, checks were performed on the operability and display of Jamo‑based syllables. For 100% verification, old (obsolete) Hangul was chosen - it is not currently in use and cannot become a syllable via NFC normalization (the technique that converts a native letter sequence into a syllable). For a moderate check, an ordinary syllable from the list was chosen, also without normalization. Both were produced using HTML (font-family: "Noto Sans CJK KR", "Malgun Gothic", sans-serif;) from Jamo (graphemes). In essence it is a simulation: the graphemes remain separate, and for the system they are three distinct characters.
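A test page of this kind could be generated roughly as follows. This is a sketch under stated assumptions: the source does not name the syllables it used, so the two Jamo sequences below are hypothetical examples, and the helper names as_char_refs and build_page are illustrative. Only the font stack is taken from the text.

```python
import unicodedata

# Hypothetical samples: the source does not say which syllables were tested.
old_hangul = "\u1100\u119E\u11B7"   # initial ㄱ + archaic vowel arae-a + final ㅁ
modern = "\u1112\u1161\u11AB"       # initial ㅎ + ㅏ + final ㄴ, left un-normalized

# NFC has no precomposed target for the archaic vowel, so the sequence survives.
print(unicodedata.normalize("NFC", old_hangul) == old_hangul)   # True
# The modern sequence collapses into the single code point U+D55C.
print(unicodedata.normalize("NFC", modern) == "\uD55C")          # True

FONT_STACK = '"Noto Sans CJK KR", "Malgun Gothic", sans-serif'   # from the text

def as_char_refs(text: str) -> str:
    """Write each code point as a numeric reference so no tool re-normalizes it."""
    return "".join("&#x{:04X};".format(ord(ch)) for ch in text)

def build_page(*samples: str) -> str:
    """Return a minimal HTML page showing each sample in the given font stack."""
    paragraphs = "\n".join("<p>" + as_char_refs(s) + "</p>" for s in samples)
    return (
        '<!DOCTYPE html>\n<html lang="ko"><head><meta charset="utf-8">\n'
        "<style>p { font-family: " + FONT_STACK + "; font-size: 48px; }</style>\n"
        "</head><body>\n" + paragraphs + "\n</body></html>\n"
    )

html = build_page(old_hangul, modern)
```

Writing html to a file and opening it in a browser shows whether the shaping engine joins the three Jamo visually; for the system each sample is still three separate characters.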
Results:
Microsoft
Visual Studio Insiders 2026 Win UI 3 C++: zero support, neither for old Hangul nor for ordinary Hangul (without normalization) - it cannot even display the syllable. The experiment was halted due to the absence of developer tooling.
File search and display: old‑Hangul syllables are displayed but cannot be searched; new syllables are automatically replaced in any system by the syllable from the database. That is, the format of gluing letters into a syllable is normal and possible within the Microsoft ecosystem, but it is not workable.
Android
Old syllables are displayed only, without the ability to select or copy them. An ordinary syllable automatically undergoes normalization and becomes the syllable from the list. The experiment in Android Studio was not started due to the lack of technology support.
iOS/macOS
Old Hangul is not displayed at all - only graphemes, and not all of them. An ordinary syllable automatically undergoes normalization and becomes the syllable from the list. The experiment in Xcode was not started due to the lack of technology support.
Summary
Project SAMSEGI is a return to the fork that the industry shot past in 1987. It restores, within the digital medium, the original principle 종성부용초성, recorded in Hunmin Jeongeum almost 600 years ago. The examples created to demonstrate the technology use a mapping (the 13 640-element map) because current architectural constraints make any other kind of example impossible. In truth, however, the method consists of 33 atoms, each a concrete numeric tag, from which syllables are computed by sums and ranges rather than by re-checks. The map as such is therefore not required by the technology.
2. Comparative Analysis
This section provides a concrete breakdown of IT solutions for various languages in several specific sectors. In addition to the direct comparison, a new variable is introduced - the Coat Multiplier, an indicator of how many unique, language-specific external workarounds, patches and bypasses are required to ensure basic functionality. The more workarounds a language needs, the lower its score.
The main criterion is “closedness of the question.” The more fresh publications, patents, bug‑reports and active discussions on GitHub and developer forums there are, the lower the score. English (10) is the ideal where everything has already been solved.
Summary Table
Sector | English | Tiếng Việt | 中文 | हिन्दी | العربية | 한국어 |
LLM tokenization & API cost | 10 | 9 | 7 | 5 | 4 | 3 |
Search (FTS, Regexp, SQLite) | 10 | 9 | 7 | 6 | 6 | 4 |
Input systems (IME) | 10 | 8 | 7 | 7 | 6 | 2 |
Embedded systems & IoT | 10 | 9 | 6 | 6 | 5 | 4 |
Electronic commerce | 10 | 8 | 7 | 6 | 6 | 5 |
Coat Multiplier | 10 | 8 | 5 | 5 | 4 | 1 |
Total weighted score | 10 | 8.5 | 6.5 | 5.8 | 5.2 | 3.2 |
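The totals in the last row are consistent with a plain arithmetic mean of the six rows above it. The sketch below reproduces them; the equal weighting is an inference from the numbers, not something the table states, and the names are illustrative.

```python
# Scores copied from the Summary Table; columns follow the table's language order.
LANGUAGES = ["English", "Tiếng Việt", "中文", "हिन्दी", "العربية", "한국어"]
SECTOR_ROWS = {
    "LLM tokenization & API cost": [10, 9, 7, 5, 4, 3],
    "Search (FTS, Regexp, SQLite)": [10, 9, 7, 6, 6, 4],
    "Input systems (IME)": [10, 8, 7, 7, 6, 2],
    "Embedded systems & IoT": [10, 9, 6, 6, 5, 4],
    "Electronic commerce": [10, 8, 7, 6, 6, 5],
}
COAT_MULTIPLIER = [10, 8, 5, 5, 4, 1]

def column_means(rows):
    """Unweighted mean of each language column, rounded to one decimal."""
    return [round(sum(col) / len(col), 1) for col in zip(*rows)]

sector_only = column_means(list(SECTOR_ROWS.values()))
with_coat = column_means(list(SECTOR_ROWS.values()) + [COAT_MULTIPLIER])

for name, total in zip(LANGUAGES, with_coat):
    print(f"{name}: {total}")
```

Leaving out the Coat Multiplier row gives 10.0, 8.6, 6.8, 6.0, 5.4 and 3.6, the figures that appear next to each language in the per-language reference list further down.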
Detailed justification (based on search data):
1. Coat Multiplier
Korean (1/10): The absolute leader in the number of “crutches.” New patents continued to be registered in 2024-2025 (Samsung, KEPCO, Naver). Dedicated libraries exist (libhangul), but they come in variants and adaptations for different systems and are not a standard, only a commonly accepted, accessible solution. Dual representation (NFC/NFD) and the EUC-KR legacy require constant ad hoc workarounds.
Arabic (4/10): Requires a whole set of tools: arabic_reshaper for correct display, python-bidi for bidirectional text, special tokenizers, and mandatory CTL/RTL support everywhere.
Hindi (5/10): Needs special fonts with support for complex ligatures (Devanagari), special processing of grapheme clusters in rendering engines, and adapted NLP tools.
Chinese (5/10): Requires separate word‑segmentation libraries and specific tokenizers for search (ICU/trigrams).
Vietnamese (8/10): After the adoption of Unicode, the problem almost disappeared. Only a few tools like Unikey remain.
English (10/10): Zero crutches.
2. Input Systems (IME)
Here Korean exhibits a catastrophic lag.
Korean (2/10): This is the most problematic sector. Bug‑trackers are overflowing: claude‑code (#12528, #59426), warp (#6891), OpenCode, Ghostty - everywhere there are dozens of open bugs related to continuous syllable composition.
Arabic (6/10): Requires manual fixes in VS Code terminal, Matplotlib and other environments due to RTL/Arabic shaping. The problem is known, but its solution comes down to implementation complexity, not an architectural defect.
Hindi (7/10): Mostly stable, but there are specific rendering bugs with composite symbols (matras) in certain libraries such as PyMuPDF.
Chinese (7/10): Pinyin is stable, but entry of rare characters still requires improvements.
Vietnamese (8/10): Rare bugs in terminals, not directly related to the language.
English (10/10): Efficiency standard.
3. LLM Tokenization and API Cost
The hottest sector for many languages other than English.
Korean (3/10): Token fertility ≈ 2.36× relative to English. This problem is actively researched (“Korean Penalty”, 2024), and dozens of patents are appearing.
Arabic (4/10): By raw token count the situation is worse than for Korean: the “token tax” reaches 230%, i.e. about 3.3× the English token count. Special tools such as Tokenizer Lab are being created.
Hindi (5/10): Standard tokenizers are inefficient because of complex conjuncts. New papers regularly appear (WWHO architecture).
Chinese (7/10): Active area of research, but there are fewer breakthrough problems.
Vietnamese (9/10): Diacritics slightly increase token consumption, but this is not a systemic problem.
English (10/10): Efficiency standard.
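One low-level contributor to these differences can be illustrated without any tokenizer: the number of UTF-8 bytes a byte-level vocabulary has to cover. The sketch below is not a measurement of token fertility, and the sample words are arbitrary; it only shows how the same Korean word grows when stored as separate Jamo (NFD) instead of ready-made syllables (NFC).

```python
import unicodedata

def utf8_bytes(text: str, form: str = "NFC") -> int:
    """Number of UTF-8 bytes after normalizing to the given Unicode form."""
    return len(unicodedata.normalize(form, text).encode("utf-8"))

print(utf8_bytes("Korean"))           # 6  (one byte per Latin letter)
print(utf8_bytes("한국어"))            # 9  (3 syllables x 3 bytes)
print(utf8_bytes("한국어", "NFD"))     # 24 (8 Jamo x 3 bytes)
```

Actual token counts depend on the vocabulary a given model was trained with, so these byte counts should be read only as a rough lower-level indicator.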
4. Search (FTS, Regexp, SQLite)
Korean (4/10): Dual representation (NFC/NFD) breaks search. Bug reports in SQLite FTS (issue #4, 2026) confirm that the problem is current.
Arabic (6/10): Requires processing of diacritics and special symbols, but basic solutions are known.
Hindi (6/10): Similar to Arabic - requires understanding of grapheme clusters.
Chinese (7/10): The main problem is word segmentation. Solved with ICU.
Vietnamese (9/10): No problems.
English (10/10): Efficiency standard.
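The way dual representation breaks search can be reproduced with Python's built-in sqlite3 module. The sketch below uses plain equality rather than FTS, and the table name docs and SQL function name nfc are illustrative; it shows a stored NFD string that an NFC query cannot find until both sides are normalized.

```python
import sqlite3
import unicodedata

nfc = "한글"                                   # precomposed syllables
nfd = unicodedata.normalize("NFD", nfc)        # the same word as separate Jamo

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE docs (body TEXT)")
con.execute("INSERT INTO docs VALUES (?)", (nfd,))   # data arrived decomposed

def count_matches(query: str) -> int:
    """Count rows whose body equals the query under SQLite's default comparison."""
    row = con.execute("SELECT COUNT(*) FROM docs WHERE body = ?", (query,)).fetchone()
    return row[0]

print(count_matches(nfc))   # 0: the visually identical query misses the row
print(count_matches(nfd))   # 1

# Workaround: register a normalizing SQL function and compare normalized values.
con.create_function("nfc", 1, lambda s: unicodedata.normalize("NFC", s))
normalized = con.execute(
    "SELECT COUNT(*) FROM docs WHERE nfc(body) = nfc(?)", (nfc,)
).fetchone()[0]
print(normalized)           # 1
```

The workaround is exactly the kind of external re-check the text describes: it has to be added by every application, because the database itself compares code points.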
5. Embedded Systems and IoT
Korean (4/10): Storing 11 172 ready-made syllables strains memory and requires lookup tables or other external aids if data manipulation is needed. Discussions are active on forums such as SEGGER.
Arabic (5/10): Mandatory RTL/CTL support on low‑power devices is a hard problem.
Hindi (6/10): Complex ligatures need more memory than Latin script.
Chinese (6/10): Requires storage of thousands of characters.
Vietnamese (9/10): Latin + diacritics, few problems.
English (10/10): 26 letters in kilobytes of memory.
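The memory gap can be estimated with simple arithmetic. The sketch below assumes a 16 × 16 one-bit bitmap cell per glyph, which is a hypothetical font format chosen only for illustration; it ignores the positional glyph variants and shaping rules that a composing renderer would also need.

```python
# Assumed glyph format: 16 x 16 pixels, 1 bit per pixel (hypothetical).
GLYPH_WIDTH = 16
GLYPH_HEIGHT = 16
BYTES_PER_GLYPH = GLYPH_WIDTH * GLYPH_HEIGHT // 8   # 32 bytes

def table_bytes(glyph_count: int) -> int:
    """Raw bitmap storage for a glyph table of the given size."""
    return glyph_count * BYTES_PER_GLYPH

print(table_bytes(11172))   # 357504 bytes for every ready-made syllable
print(table_bytes(33))      # 1056 bytes for the 33 atoms
print(table_bytes(26))      # 832 bytes for 26 Latin letters
print(11172 // 33)          # 338: ratio of the two Hangul tables
```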
6. Electronic Commerce
Korean (5/10): Difficulties with input on local platforms (Naver, Coupang) and with foreign names lead to abandoned carts. A separate UX cost item.
Arabic (6/10): RTL localization - a mature but more complex process.
Hindi (6/10): Problems are solved within the overall multilingual support of India.
Chinese (7/10): Adaptation to local platforms is well‑tuned.
Vietnamese (8/10): Localization is stable.
English (10/10): Global standard without problems.
Conclusion
The table clearly shows that Korean (3.2) is in the worst position not because of intrinsic complexity, but because of an architectural defect. Its lag is especially noticeable in the critically important sectors of LLM (3/10) and IME (2/10), as well as in the Coat Multiplier (1/10).
Specific references:
English - Standard (10)
Almost no links to unsolved problems. Everything works.
LLM tokenization: no specific problems.
Search: no specific problems.
IME: not required.
IoT: a 26‑letter font, no issues.
E‑commerce: global standard.
Coat Multiplier: 0 external workarounds. Score: 10.
Vietnamese (Tiếng Việt) - Question closed (8.6)
LLM tokenization (9): isolated diacritic studies.
Search (9): no problems since 2003 (standard TCVN 6909:2001).
IME (8): rare bugs in CLI environments (not language‑specific).
IoT (9): Latin + diacritics, few problems.
E‑commerce (8): localization stable.
Coat Multiplier: Unikey (stable). Score: 8.
Chinese (中文) - Objective complexity (6.8)
LLM tokenization (7):
Jieba (Python): https://github.com/fxsjy/jieba
PKUSeg: https://github.com/lancopku/pkuseg-python
Search (7):
SQLite FTS3/4/5 + ICU: https://sqlite.org/fts5.html
IME (7):
Wubi (Pinyin): https://github.com/rust-wubi/wubi
IoT (6):
Noto CJK: https://github.com/notofonts/noto-cjk
E‑commerce (7): adaptation to local platforms is well‑tuned.
Coat Multiplier: Jieba, PKUSeg, Wubi, ICU. Score: 5.
Hindi (हिन्दी) - Visual complexity (6.0)
LLM tokenization (5):
WWHO Architecture (2024): https://arxiv.org/abs/2410.04281
Devanagari Tokenizer Issue (Hugging Face): https://huggingface.co/ai4bharat/indic-bert/issues/8
Search (6): grapheme clusters.
IME (7):
PyMuPDF (matra bug): https://github.com/pymupdf/PyMuPDF/issues/2350
IoT (6):
Noto Devanagari: https://github.com/notofonts/devanagari
E‑commerce (6): multilingual support.
Coat Multiplier: Indic‑BERT, Noto Devanagari, PyMuPDF. Score: 5.
Arabic (العربية) - Second-level script (5.4)
LLM tokenization (4):
Tokenizer Lab (2025): https://tokenizerlab.com/arabic
Arabic Token Tax (230%): https://blog.premai.io/arabic-nlp/
Search (6): diacritics and left half‑ring.
IME (6):
VS Code Terminal RTL: https://github.com/microsoft/vscode/issues/147358
Matplotlib Arabic: https://matplotlib.org/stable/users/explain/text/rendering.html
IoT (5):
Embedded RTL (SEGGER): https://forum.segger.com/index.php/Thread/9063-Arabic-TTF-rendering/
E‑commerce (6): RTL localization.
Coat Multiplier: arabic_reshaper, python‑bidi. Score: 4.
Korean (한국어) - Architectural failure (3.6)
LLM tokenization (3):
Korean Token Penalty (2024): https://arxiv.org/abs/2405.09137
TianPan Research (2026): https://tianpan.co
Search (4):
SQLite FTS5 CJK (2026): https://github.com/arjunkmrm/recall/issues/4
IME (2):
Warp Terminal: https://github.com/warpdotdev/warp/issues/6891
Claude Code #12528: https://github.com/anthropics/claude-code/issues/12528
Claude Code #59426: https://github.com/anthropics/claude-code/issues/59426
Claude Code #18291: https://github.com/anthropics/claude-code/issues/18291
Ghostty: https://github.com/ghostty-org/ghostty/issues/5404
OpenCode: https://github.com/anthropics/opencode/issues/303
Gemini CLI: https://github.com/google-gemini/gemini-cli/issues/1479
BossTerm: https://github.com/steveasi/bossterm/issues/87
libhangul: https://github.com/libhangul/libhangul
IoT (4):
SEGGER Korean TTF: https://forum.segger.com/index.php/Thread/9063-Arabic-TTF-rendering/
E‑commerce (5):
SIR KCP Encoding (2025): https://sir.kr/bbs/board.php?bo_table=yc5_plugin&wr_id=515
EUC‑KR Issue (2024): https://sir.kr/bbs/board.php?bo_table=yc_issue&wr_id=789
Coat Multiplier: libhangul, EUC‑KR patches, IME fixes. Score: 1.
3. Solution Plan
Established Facts
The problem of the digital representation of Hangul runs deeper than was previously assumed. Thirty years of industry effort have been directed at fighting symptoms, not the root cause - an architectural defect embedded in the 1993 Unicode standard. This defect gave rise to a fragmented system of 11 200+ elements that requires constant checks and external crutches.
It has been experimentally confirmed that a mechanism for dynamically assembling syllables from individual graphemes physically exists in modern operating systems (DirectWrite in Windows, HarfBuzz in Android and browsers). It is capable of correctly displaying both modern and old (obsolete) syllables that are absent from Unicode. However, this mechanism operates blindly, creating only a visual illusion of a complete symbol and failing to form a full‑fledged structural unit for search, analysis and data processing; moreover, it does not properly handle the graphemes.
Project SAMSEGI fills this critical gap by providing the missing mathematical map. The model of 33 numeric tags and the formula L1+L2+L3+L4×20+L5×20 endows the assembly mechanism with precise, computable addressing, for the first time turning the visual illusion into rigorous engineering reality - where every syllable receives a unique mathematical index.
Allocation of Responsibilities
Guaranteed result from adopting SAMSEGI
The SAMSEGI mathematical model, on its own and without any modification to existing legacy code, eliminates the fundamental architectural defect. Its adoption provides:
An increase in the Coat Multiplier from 1 to 8. This reaches the level of Vietnamese, where the digital standardization problem was successfully solved.
Replacement of 11 200+ elements. All processing collapses to 33 tags and formulas, radically cutting storage, verification and data‑transmission costs.
A mathematical index for every syllable. This index is suitable for search, sorting and analysis, without requiring the loss of the original atoms. The method is universal and applicable to any syllable, including those that are absent from the modern Unicode standard.
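What a computable, reversible index over the full map looks like can be sketched as follows. This is explicitly not the SAMSEGI sum formula: the source does not list the 33 tag values, so the sketch substitutes a positional (mixed-radix) index over the 20 × 22 × 31 slots of the Complete Hangul Map. The ordering of letters inside each slot and the function names are assumptions made for illustration.

```python
# Slot contents follow the Complete Hangul Map; the ordering is an assumption.
INITIALS = [""] + list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")
MEDIALS = [""] + list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅛㅜㅠㅡㅣ") + list("ㅘㅙㅚㅝㅞㅟㅢ")
FINALS = (
    [""]
    + list("ㄱㄲㄴㄷㄹㅁㅂㅅㅆㅇㅈㅊㅋㅌㅍㅎ")
    + list("ㄸㅃㅉ")                      # finals absent from modern Unicode syllables
    + list("ㄳㄵㄶㄺㄻㄼㄽㄾㄿㅀㅄ")
)

def encode(initial: str, medial: str, final: str = "") -> int:
    """Map (initial, medial, final) to one integer in range(13640)."""
    i = INITIALS.index(initial)
    m = MEDIALS.index(medial)
    f = FINALS.index(final)
    return (i * len(MEDIALS) + m) * len(FINALS) + f

def decode(index: int) -> tuple:
    """Recover the original atoms from the integer, so nothing is lost."""
    rest, f = divmod(index, len(FINALS))
    i, m = divmod(rest, len(MEDIALS))
    return INITIALS[i], MEDIALS[m], FINALS[f]

total = len(INITIALS) * len(MEDIALS) * len(FINALS)
print(total)                                   # 13640
print(decode(encode("ㅎ", "ㅏ", "ㄴ")))         # ('ㅎ', 'ㅏ', 'ㄴ')
print(decode(encode("ㄱ", "ㅏ", "ㄸ")))         # ('ㄱ', 'ㅏ', 'ㄸ'), not in Unicode's block
```

Because the same character ㄱ appears in both the initial and the final list, the sketch treats a letter as one entity regardless of position, and the index sorts and round-trips without a stored table of syllables.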
Directions for further development
Turning SAMSEGI into a full-fledged systemic standard requires integrating the provided mathematical map into the key components of software infrastructure. This work lies within the responsibility of corporate engineering teams and can follow the directions described below:
At the OS level: replacing the “blind” addressing in rendering engines (DirectWrite, CoreText) with the computable SAMSEGI formula.
At the font level: moving from 11 172 ready‑made syllables to 33 atomic glyphs. This will reduce font size by hundreds of times and completely eliminate the problem of damaged or missing symbols.
At the input‑system (IME) level: abandoning continuous syllable composition in favor of its instant computation, thereby eradicating an entire class of bugs in terminals, IDEs and all text fields.
At the search and NLP level: abandoning heavy lookup tables (ICU, trigrams) and switching to direct access of the L1–L5 mathematical tags.
SAMSEGI provides an irreproachable, mathematically verified foundation that did not previously exist.
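The idea of computing a syllable instantly from its letters, as in the IME direction above, can be illustrated with arithmetic that already exists in the Unicode Standard. The sketch below is that standard composition formula, not the SAMSEGI formula, and it covers only the 11 172 modern syllables; the function name compose is illustrative.

```python
# Letter orders as defined by the Unicode Hangul syllable algorithm.
CHOSEONG = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")            # 19 initials
JUNGSEONG = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")        # 21 medials
JONGSEONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # none + 27 finals

def compose(initial: str, medial: str, final: str = "") -> str:
    """Compute the precomposed syllable from three letters in one expression."""
    l_index = CHOSEONG.index(initial)
    v_index = JUNGSEONG.index(medial)
    t_index = JONGSEONG.index(final)
    return chr(0xAC00 + (l_index * 21 + v_index) * 28 + t_index)

word = compose("ㅎ", "ㅏ", "ㄴ") + compose("ㄱ", "ㅡ", "ㄹ")
print(word)                     # 한글
print(compose("ㄱ", "ㅏ"))       # 가
```

An input method built this way keeps the typed letters as its state and derives the displayed syllable on demand, instead of mutating a half-composed syllable after every keystroke.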
Economic Impact
Direct savings: a two-fold reduction in token fertility, elimination of 11 172 stored ready-made syllables in fonts, removal of duplicated infrastructure (EUC-KR, ICU) and the associated overheads.
Strategic gain: Korean ceases to be a “digital exception” and reaches the efficiency of Vietnamese, and in the long term approaches the performance of English.
New markets: opportunities open up for the mass adoption of Hangul in niche and fast‑growing segments such as the Internet of Things, embedded systems and AI services, which were previously blocked by the high cost of processing.
Conclusion
Mathematics is a more fundamental and more universal language for IT systems than any natural language. SAMSEGI does not propose an improvement of the existing mechanism - the project restores the original mathematical harmony of Hangul, laid down by Sejong but lost in a chain of technological compromises. The proposed solution is not another “patch,” but a foundation for a new technological era for the Korean language. What has been created is not a tool for eliminating symptoms, but a basis for building a fundamentally different, mathematically verified architecture.
For inquiries and collaborations, please contact:
info@samsegi.com