Improving Chinese Message Search in Telegram

Conclusion

Chinese message search in Telegram can be improved in two ways: manually inserting invisible delimiters between characters, or developing a custom tokenizer. Beyond that, embedding-based semantic search can find relevant messages even when the query shares no exact keywords with them.

Key Points

Telegram Database: Telegram uses SQLite as its database.
Full-text Search Mechanism: Telegram's full-text search uses a tokenizer to split each string into tokens and hashes them; at search time, the hashes of the query's tokens are compared against a hash table.
Token Generator: The token generator (tokenizer) splits strings at separator, or delimiter, characters.
Token Definition: Any character that is not a separator is a token character. Token characters fall into three Unicode general categories: letters (L*), numbers (N*), and private-use characters (Co).
CJK Character Handling: Most Unicode CJK (Chinese, Japanese, Korean) characters are classified as token characters, not as separators.

Because Chinese text has no delimiters between characters, Telegram hashes an entire run of Chinese characters as a single token. A query for a word inside that run produces a different hash and generally fails to match, which is why search performs poorly. This article examines the limitations of Chinese message search in Telegram from a code perspective.
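The behaviour described above can be reproduced with a small model. The following is a minimal sketch, not Telegram's or SQLite's actual code: it assumes that token characters are exactly the Unicode categories L*, N*, and Co, that everything else is a separator, and that lookup is by exact token hash. The names `tokenize`, `build_index`, and `search` are hypothetical.

```python
import unicodedata


def is_token_char(ch: str) -> bool:
    """Assumed rule: categories L*, N*, and Co are token characters."""
    cat = unicodedata.category(ch)
    return cat[0] in ("L", "N") or cat == "Co"


def tokenize(text: str) -> list[str]:
    """Split text into maximal runs of token characters."""
    tokens, current = [], []
    for ch in text:
        if is_token_char(ch):
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


def build_index(messages: list[str]) -> dict[int, set[int]]:
    """Map the hash of each token to the ids of the messages containing it."""
    index: dict[int, set[int]] = {}
    for msg_id, text in enumerate(messages):
        for token in tokenize(text):
            index.setdefault(hash(token.casefold()), set()).add(msg_id)
    return index


def search(index: dict[int, set[int]], query: str) -> set[int]:
    """Return ids of messages that contain every token of the query."""
    hits = None
    for token in tokenize(query):
        ids = index.get(hash(token.casefold()), set())
        hits = ids if hits is None else hits & ids
    return hits or set()


messages = ["Hello world", "今天晚上一起吃饭吧"]
index = build_index(messages)

english_tokens = tokenize(messages[0])  # two tokens, split at the space
chinese_tokens = tokenize(messages[1])  # one token: the whole sentence
hit_english = search(index, "hello")    # finds message 0
miss_chinese = search(index, "吃饭")     # empty: "吃饭" was never indexed on its own
```

In this model the English message is indexed word by word, while the Chinese message is indexed only as one nine-character token, so a search for "吃饭" finds nothing.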

Improvement Suggestions

Manually Insert Delimiters: Add invisible delimiters between Chinese characters so that the tokenizer no longer indexes the whole string as a single token.
Custom Tokenizer: Develop a custom tokenizer that segments Chinese text, and modify the Telegram client to use it.
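Both suggestions can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the invisible delimiter is taken to be ZERO WIDTH SPACE (U+200B), which is assumed to count as a separator because its category is Cf; CJK detection is limited to the basic CJK Unified Ideographs block; and the custom tokenizer simply emits one token per Chinese character. The function names are hypothetical, and a real client change would have to hook into the database's tokenizer interface instead.

```python
import unicodedata

ZWSP = "\u200b"  # ZERO WIDTH SPACE; category Cf, assumed to act as a separator


def is_token_char(ch: str) -> bool:
    """Assumed rule: categories L*, N*, and Co are token characters."""
    cat = unicodedata.category(ch)
    return cat[0] in ("L", "N") or cat == "Co"


def tokenize(text: str) -> list[str]:
    """Baseline tokenizer: maximal runs of token characters."""
    tokens, current = [], []
    for ch in text:
        if is_token_char(ch):
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


def is_cjk(ch: str) -> bool:
    """Simplification: only the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def insert_delimiters(text: str) -> str:
    """Suggestion 1: append an invisible delimiter after every CJK character."""
    return "".join(ch + ZWSP if is_cjk(ch) else ch for ch in text)


def cjk_tokenize(text: str) -> list[str]:
    """Suggestion 2: a custom tokenizer that emits each CJK character as a token."""
    tokens = []
    for token in tokenize(text):
        current = []
        for ch in token:
            if is_cjk(ch):
                if current:
                    tokens.append("".join(current))
                    current = []
                tokens.append(ch)
            else:
                current.append(ch)
        if current:
            tokens.append("".join(current))
    return tokens


before = tokenize("一起吃饭")                      # a single four-character token
after = tokenize(insert_delimiters("一起吃饭"))    # one token per character
custom = cjk_tokenize("Telegram搜索v2")            # mixed text, no message changes
```

With one token per character, a query such as "吃饭" is split into "吃" and "饭" and matches any message containing both. Requiring the characters to be adjacent would need a phrase query or a bigram tokenizer, which this sketch does not cover.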

AI Semantic Search

Beyond keyword matching, AI makes semantic search possible. The telegram-search project uses an embedding model, which lets users find content even when the query contains no exact keyword match. For example, searching for "昨晚吃饭的那个人" ("that person from dinner last night") can retrieve results like "昨天晚上和我们一起吃饭的男的" ("the guy who had dinner with us last night").
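The retrieval step of such a search can be sketched as "embed the query, embed each message, rank by cosine similarity". The code below is not taken from telegram-search. In particular, `embed` is a deliberately crude stand-in (a bag-of-characters count vector) used only so the example runs without a model; a real embedding model captures meaning, which character counts do not. The two extra messages are made up for illustration.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Stand-in for an embedding model: character counts. NOT semantic."""
    return Counter(ch for ch in text if not ch.isspace())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query: str, messages: list[str]) -> list[str]:
    """Order messages from most to least similar to the query."""
    q = embed(query)
    return sorted(messages, key=lambda m: cosine(q, embed(m)), reverse=True)


messages = [
    "明天开会记得带电脑",            # hypothetical unrelated message
    "昨天晚上和我们一起吃饭的男的",  # the example result from the text
    "这个文件发给我",                # hypothetical unrelated message
]
query = "昨晚吃饭的那个人"

ranked = rank(query, messages)
best = ranked[0]
exact_match = query in best  # False: the query is not a substring of the result
```

The query is not a substring of the top-ranked message, so exact matching would miss it, yet similarity ranking still surfaces it. Replacing `embed` with a real embedding model keeps the same structure while comparing meaning instead of shared characters.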

Taken together, delimiter insertion, a custom tokenizer, and semantic search can make Chinese messages in Telegram much easier to find.