Improving Chinese Message Search in Telegram

2025-06-24

Conclusion

Chinese message search in Telegram can be improved in two ways: by manually inserting invisible delimiters between characters, or by developing a custom Tokenizer. In addition, AI-based semantic search can significantly improve search accuracy.

Key Points

  • Telegram Database: Telegram uses SQLite as its database.
  • Full-Text Search Mechanism: Telegram's full-text search uses a Tokenizer to split each string into tokens and generates a hash value for each one. At search time, the query is compared against this hash table.
  • Tokenizer: The tokenizer relies on separators and delimiters to break down strings.
  • Token Definition: Any character that is not a separator or delimiter is treated as part of a "token." Token characters fall into three Unicode categories: letters (L*), numbers (N*), and other, private-use characters (Co).
  • CJK Character Handling: Most Chinese, Japanese, and Korean (CJK) characters belong to the Unicode letter categories, so they are recognized as token characters rather than as separators.

Because there are no delimiters between Chinese characters, Telegram treats each unbroken run of Chinese characters as a single token and hashes it as a whole. A query for a shorter word inside that run generally does not match the stored hash, which is why search results are poor. This article examines the limitations of Telegram's Chinese message search from a code perspective.
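The behavior described above can be sketched in a few lines of Python. This is a simplified model, not Telegram's actual code: the function names (`is_token_char`, `tokenize`, `build_index`) are hypothetical, the token rule is assumed to be "Unicode category L*, N*, or Co," and a plain dictionary stands in for the hash table.

```python
import unicodedata

def is_token_char(ch: str) -> bool:
    """Assumed rule: token characters are in Unicode categories L*, N*, or Co."""
    cat = unicodedata.category(ch)
    return cat[0] in ("L", "N") or cat == "Co"

def tokenize(text: str) -> list[str]:
    """Split text into runs of token characters; everything else separates."""
    tokens, current = [], []
    for ch in text:
        if is_token_char(ch):
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens

def build_index(messages: list[str]) -> dict[str, set[int]]:
    """Map each token to the set of message indexes that contain it."""
    index: dict[str, set[int]] = {}
    for i, msg in enumerate(messages):
        for tok in tokenize(msg):
            index.setdefault(tok, set()).add(i)
    return index

messages = ["see you at dinner", "昨天晚上和我们一起吃饭的男的"]
index = build_index(messages)

print(index.get("dinner"))  # {0}: English words are split on spaces
print(index.get("吃饭"))     # None: the whole Chinese sentence is one token
```

In this model the English message yields four separate tokens, while the Chinese message yields exactly one, so only a query equal to the entire sentence would hit the index.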

Improvement Suggestions

  1. Manually Inserting Delimiters: Add invisible delimiters between Chinese characters so that the Tokenizer splits them into separate tokens.
  2. Custom Tokenizer: Develop a custom Tokenizer and modify the Telegram client to enhance search functionality.
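As a rough illustration of the first suggestion, the sketch below inserts a zero-width space (U+200B) between adjacent Chinese characters before a message is sent. The choice of U+200B is an assumption: its Unicode category is Cf, which is outside the L*, N*, and Co categories listed above, so a tokenizer following that rule should treat it as a separator. The helper names `is_cjk` and `insert_delimiters` are hypothetical, and only the basic CJK Unified Ideographs block is covered.

```python
import unicodedata

ZWSP = "\u200b"  # zero-width space: invisible when rendered

def is_cjk(ch: str) -> bool:
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"

def insert_delimiters(text: str) -> str:
    """Insert a zero-width space between every pair of adjacent CJK characters."""
    out = []
    prev_cjk = False
    for ch in text:
        cur_cjk = is_cjk(ch)
        if cur_cjk and prev_cjk:
            out.append(ZWSP)
        out.append(ch)
        prev_cjk = cur_cjk
    return "".join(out)

# Category Cf is not L*, N*, or Co, so it is not a token character
# under the rule assumed in this article.
print(unicodedata.category(ZWSP))            # Cf
print(insert_delimiters("吃饭") == "吃\u200b饭")  # True
```

With this approach each Chinese character would presumably be indexed as its own token, so a search query would likely need to be split the same way to match.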

AI Semantic Search

Beyond keyword-based search, AI makes semantic search possible. The telegram-search project uses an embedding model, which lets users find the content they want even without an exact keyword match. For example, typing "昨晚吃饭的那个人" (that person from dinner last night) can retrieve "昨天晚上和我们一起吃饭的男的" (the guy who had dinner with us last night).
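The ranking step behind this kind of search can be sketched as follows. This is not the telegram-search implementation: `toy_embed` is a hypothetical, deliberately crude stand-in (character-bigram counts) for a real embedding model, used only so the example runs without external dependencies. What carries over is the overall shape: turn the query and each message into a vector, then rank messages by cosine similarity.

```python
import math
from collections import Counter

def toy_embed(text: str) -> Counter:
    """Stand-in for an embedding model: counts of character bigrams."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def semantic_search(query: str, messages: list[str], top_k: int = 1) -> list[str]:
    """Return the top_k messages most similar to the query."""
    q = toy_embed(query)
    ranked = sorted(messages, key=lambda m: cosine(q, toy_embed(m)), reverse=True)
    return ranked[:top_k]

messages = [
    "明天开会别迟到",
    "昨天晚上和我们一起吃饭的男的",
    "周末去爬山吗",
]
print(semantic_search("昨晚吃饭的那个人", messages))
# ['昨天晚上和我们一起吃饭的男的']
```

A real embedding model would also match messages that share no characters with the query at all, which bigram counts cannot do; swapping `toy_embed` for such a model would leave the ranking code unchanged.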

By using the methods described above, the search experience for Chinese messages in Telegram can be significantly improved.