AN APPROACH TO BOOTSTRAPPING THE DEVELOPMENT OF MULTILINGUAL RULE-BASED GRAMMARS FOR UNDERRESOURCED LANGUAGES USING CROSS-LINGUISTIC SIMILARITIES: A CASE STUDY OF A SUB-SET OF KENYAN BANTU LANGUAGES
Developing grammars with the traditional rule-based method remains a challenge because the method is slow, expensive, knowledge-intensive, and laborious, particularly for under-resourced languages and even more so for the spoken Bantu languages. However, there is high demand for these grammars in deep natural language processing, generation of well-formed output (or both), controlled natural language applications, and high-precision machine translation. An in-depth review of previous research on improving grammar development reveals that these studies concentrated on rich-resourced languages and neglected under-resourced ones, and that they addressed only the syntax of the shareable grammar, ignoring its morphology. Therefore, there is an urgent need for cost-efficient methodologies that can accelerate grammar development, enabling these languages to thrive in the digital ecosystem and narrowing the language-technology divide with the rich-resourced languages. Consequently, this research investigated an approach to reducing grammar development effort for under-resourced languages in a rule-based multilingual environment: leveraging cross-linguistic similarities to develop a congruent Bantu parameterized grammar, and then leveraging that shared parameterized grammar to bootstrap a Swahili grammar. The descriptive analysis method was used to analyze the descriptive grammar of each of the geolinguistically and purposively chosen Bantu languages in order to empirically identify the points of generalization for parameters, regular expressions and grammar rules. Furthermore, universal and individual comparative analyses were used to produce a generalized descriptive grammar for the subset of Bantu languages. Then, quasi-experiments were set up in Grammatical Framework (GF), using the morphology-driven approach, to develop the Bantu parameterized grammar and to bootstrap the Swahili grammar from it.
The GF regression method was used to test each grammar during development. Reusability was evaluated using shared-rule and modified-rule metrics for shareability and portability respectively, while accuracy was evaluated on a test suite of 100 English sentences. The Bantu parameterized grammar had a shareability of 68.75% for parameters and 65.3% for paradigms at the morphology level and 89.57% at the syntax level, and a portability of 18.75% for parameters and 14.29% for paradigms at the morphology level and 10.43% at the syntax level. The bootstrapped Swahili grammar had a shareability of 68.75% for parameters and 71.11% for paradigms at the morphology level and 91.41% at the syntax level, and a portability of 18.75% for parameters and 15.55% for paradigms at the morphology level and 8.59% at the syntax level. In terms of accuracy, the grammars had 4-gram BLEU scores of 83.05%, 77.95% and 55.95%, WER of 12.82%, 13.39% and 23.90%, and PER of 10.96%, 9.46% and 19.49% for Kikamba, Swahili and Ekegusii, in that order. The research draws two conclusions: leveraging the cross-linguistic similarities of principles and parameters significantly reduces the effort of developing multilingual grammars, and leveraging a congruent grammar to bootstrap a similar grammar takes less effort because most of the rule base is inherited from the congruent grammar. The study makes several contributions. First, it provides an approach to bootstrapping the development of multilingual grammars that significantly reduces the effort. Second, it extends GF reusability by providing standardized Swahili, Kikamba and Ekegusii grammars as open resources. Third, it created a 100-sentence test suite for the evaluation of grammars. Finally, it contributes to the descriptive grammar by supplying missing parts through elicitation, mainly in the numerals, preposition fusion, and the subject-marker morpheme of the verb.
Keywords: parameterized grammar, grammar engineering, bootstrapping, grammar sharing, grammar porting, complex morphology, under-resourced languages
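The WER and PER figures reported in the abstract can be illustrated with a minimal Python sketch. It assumes the standard definitions: WER as the word-level edit distance divided by the reference length, and PER in one common bag-of-words formulation that ignores word order. The function names and the example sentences are hypothetical, and the study's own evaluation scripts may differ in tokenization and normalization.

```python
from collections import Counter


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref prefix seen so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def position_independent_error_rate(reference: str, hypothesis: str) -> float:
    """Bag-of-words error rate: word order is ignored (assumed formulation)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Multiset intersection counts words shared by both sentences.
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return (max(len(ref), len(hyp)) - matches) / len(ref)


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat on the mat sat"
    print(f"WER = {word_error_rate(ref, hyp):.4f}")
    print(f"PER = {position_independent_error_rate(ref, hyp):.4f}")
```

Because PER ignores word order, it is never higher than WER under these definitions. The example reordering is penalized by WER but not by PER.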