Compression techniques for Hindi, Hinglish, and Multilingual SMS


Hindi is widely used for sending text messages. At present, a single SMS unit holds only about seventy Hindi characters (8-10 words). Longer messages are automatically split into message units of appropriate size, but the user is still charged for every unit sent. Since 8-10 words are often not enough for a full Hindi sentence, there is a need for encoding/compression schemes that allow 160 or more Hindi characters in a single message unit.
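The gap between the two limits can be illustrated with a small calculation. The following is a minimal sketch, not part of the original project. It assumes the standard SMS limits of 160 characters for the GSM 7-bit alphabet and 70 characters for UCS-2 (which Hindi requires), and 153 and 67 characters per unit once a message is split into concatenated units. The function name `sms_units` is hypothetical, and treating every ASCII-only message as GSM 7-bit is a simplification.

```python
import math

# Assumed per-unit limits (single message / each part of a split message).
GSM7_SINGLE, GSM7_MULTI = 160, 153   # GSM 7-bit default alphabet
UCS2_SINGLE, UCS2_MULTI = 70, 67     # UCS-2, needed for Devanagari


def sms_units(text: str) -> int:
    """Estimate how many SMS units a message is billed as (hypothetical helper)."""
    if text.isascii():
        # Simplification: every ASCII-only message is treated as GSM 7-bit.
        length = len(text)
        single, multi = GSM7_SINGLE, GSM7_MULTI
    else:
        # UCS-2 counts 16-bit code units, not Unicode code points.
        length = len(text.encode("utf-16-be")) // 2
        single, multi = UCS2_SINGLE, UCS2_MULTI
    if length <= single:
        return 1
    return math.ceil(length / multi)


if __name__ == "__main__":
    print(sms_units("a" * 100))   # English-only message of 100 characters
    print(sms_units("क" * 100))   # Hindi message of 100 characters
```

Under these assumptions, a 100-character Hindi message occupies two units, while an English message of the same length fits in one.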


To develop an encoding scheme that fits more Hindi characters into an SMS text message.
To test standard lossless compression algorithms for this purpose.
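As one illustration of the second objective, the sketch below measures how a general-purpose lossless compressor behaves on a Hindi message. DEFLATE via Python's `zlib` module is used here only as a stand-in: the source does not say which standard algorithms were tested, and the helper name `compression_ratio` is hypothetical.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Return compressed size / original size for the UTF-8 bytes of text."""
    raw = text.encode("utf-8")
    if not raw:
        raise ValueError("text must not be empty")
    packed = zlib.compress(raw, 9)  # 9 = strongest compression level
    # Lossless check: decompressing must give back the exact original bytes.
    assert zlib.decompress(packed) == raw
    return len(packed) / len(raw)


if __name__ == "__main__":
    short_msg = "आप कैसे हैं?"
    repetitive_msg = short_msg * 20
    print(f"short message ratio: {compression_ratio(short_msg):.2f}")
    print(f"repetitive message ratio: {compression_ratio(repetitive_msg):.2f}")
```

A ratio above 1 means the output is larger than the input, which general-purpose compressors can produce on very short inputs because of their fixed header overhead.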

Database creation:

SMS text messages are short: a single message unit allows 160 characters. Since no SMS text database is publicly available, a database of tweets from Twitter was built instead. Twitter imposes a similar limit of 140 characters per tweet. The database consists of tweets in Hindi, English, or a mix of both. Because the algorithms are meant to be applied to short, meaningful messages, tweets serve this purpose well.
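One simple way to sort such tweets into Hindi, English, or mixed is to look at the Unicode script of their letters. The sketch below is an assumption about how this could be done, not the procedure used to build the database. The function name `classify_script` and its labels are hypothetical. It treats code points U+0900 to U+097F as Devanagari and ASCII letters as English, so Hinglish written in Latin letters would be labelled `english`.

```python
def classify_script(text: str) -> str:
    """Label a message by the scripts of its letters (hypothetical helper)."""
    # Devanagari Unicode block: U+0900 to U+097F.
    has_devanagari = any("\u0900" <= ch <= "\u097f" for ch in text)
    # ASCII letters only; digits, punctuation, and spaces are ignored.
    has_latin = any(ch.isascii() and ch.isalpha() for ch in text)
    if has_devanagari and has_latin:
        return "mixed"
    if has_devanagari:
        return "hindi"
    if has_latin:
        return "english"
    return "other"


if __name__ == "__main__":
    for tweet in ["नमस्ते दोस्तों", "Good morning", "आज का match देखा?"]:
        print(classify_script(tweet), "|", tweet)
```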



For the encoding algorithms, see **Standard encoding schemes** and **Developed encoding schemes**.



1. Ankit Jalan, Ketan Rajawat, Rajesh M. Hegde, “New Encoding Schemes for Efficient Multilingual Text Messaging”, Proceedings of the Twentieth National Conference on Communications, NCC 2014, Kanpur, March 2014.

2. Manu Seth, “Pairwise Encoding” (Internal File).