Journal: International Journal of Computer Science & Information Technology (IJCSIT)
Print ISSN: 0975-4660
Electronic ISSN: 0975-3826
Year of publication: 2017
Volume: 9
Issue: 1
Pages: 1
Publisher: Academy & Industry Research Collaboration Center (AIRCC)
Abstract: In this paper, we apply grammar-based pre-processing prior to using the Prediction by Partial Matching (PPM) compression algorithm. This achieves significantly better compression for different natural language texts compared to other well-known compression methods. Our method first generates a grammar based on the most common two-character sequences (bigraphs) or three-character sequences (trigraphs) in the text being compressed and then, in a pre-processing phase prior to compression, substitutes these sequences with the respective non-terminal symbols defined by the grammar. This leads to significantly improved compression results for various natural languages (a 5% improvement for American English, 10% for British English, 29% for Welsh, 10% for Arabic, 3% for Persian and 35% for Chinese). We describe further improvements using a two-pass scheme in which the grammar-based pre-processing is applied again in a second pass through the text. We then apply the algorithms to the files in the Calgary Corpus and also achieve significantly improved compression results, between 11% and 20%, when compared with other compression algorithms, including a grammar-based approach, the Sequitur algorithm.
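
The pre-processing step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`build_grammar`, `substitute`, `expand`), the `top_k` cutoff, the greedy left-to-right matching, and the use of Unicode Private Use Area code points as non-terminal symbols are all assumptions made for illustration, and the sketch assumes the input text contains no Private Use Area characters. The PPM compression stage itself is omitted.

```python
from collections import Counter


def build_grammar(text, n=2, top_k=100, first_symbol=0xE000):
    """Map the top_k most frequent n-character sequences to fresh symbols.

    n=2 gives bigraphs, n=3 gives trigraphs. Symbols are taken from the
    Unicode Private Use Area (an assumption of this sketch).
    """
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    grammar = {}
    for rank, (seq, _count) in enumerate(counts.most_common(top_k)):
        grammar[seq] = chr(first_symbol + rank)
    return grammar


def substitute(text, grammar, n=2):
    """Replace grammar sequences with their symbols, scanning left to right."""
    out = []
    i = 0
    while i < len(text):
        seq = text[i:i + n]
        if seq in grammar:
            out.append(grammar[seq])
            i += n  # skip the whole replaced sequence
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


def expand(encoded, grammar):
    """Invert substitute(): turn each symbol back into its sequence."""
    reverse = {symbol: seq for seq, symbol in grammar.items()}
    return "".join(reverse.get(ch, ch) for ch in encoded)


if __name__ == "__main__":
    sample = "the cat sat on the mat"
    grammar = build_grammar(sample, n=2, top_k=3)
    encoded = substitute(sample, grammar, n=2)
    # The substitution is lossless: expanding restores the original text.
    assert expand(encoded, grammar) == sample
    print(len(sample), "->", len(encoded), "symbols before compression")
```

In a full pipeline the encoded text would then be passed to a PPM compressor, and the grammar would have to be available to the decoder. One plausible reading of the two-pass scheme is to run `build_grammar` and `substitute` again on the encoded output with a disjoint symbol range, but the abstract does not specify the details.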