Enhancing Chinese Message Search Capability in Telegram
Conclusion
Chinese message search in Telegram can be improved by manually inserting invisible delimiters between characters or by developing a custom Tokenizer. AI-based semantic search can improve results further, because it does not depend on exact keyword matches.
Key Points
- Telegram Database: Telegram uses SQLite as its database.
- Full-text Search Mechanism: Telegram's full-text search uses a Tokenizer to cut each string into tokens and generates a hash value for each one. During a search, the hashes of the query's tokens are compared against this hash table.
- Token Generator: The Tokenizer relies on separators and delimiters to decide where to cut a string.
- Token Definition: Characters that are not separators or delimiters are treated as token characters. They fall into three Unicode categories: letters (L*), numbers (N*), and other private-use characters (Co).
- CJK Character Handling: Most CJK (Chinese, Japanese, Korean) characters fall within the Unicode CJK blocks and are therefore treated as token characters, not as delimiters.
Because there are no delimiters between Chinese characters, Telegram treats an unbroken run of Chinese characters as a single token and hashes it as a whole. A query for a word inside that run produces a different hash and does not match, which is why search works poorly for Chinese messages. This article examines these limitations of Chinese message search in Telegram at the code level.
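The behaviour described above can be illustrated with a minimal Python sketch. It is not Telegram's code: it assumes token characters are exactly those in Unicode categories L*, N*, and Co, and it uses `zlib.crc32` as a stand-in hash. The names `tokenize`, `build_index`, and `search` are hypothetical.

```python
import unicodedata
import zlib


def is_token_char(ch):
    # Assumption: letters (L*), numbers (N*) and private-use (Co) are token characters.
    cat = unicodedata.category(ch)
    return cat[0] in ("L", "N") or cat == "Co"


def tokenize(text):
    # Cut the text at every character that is not a token character.
    tokens, current = [], []
    for ch in text:
        if is_token_char(ch):
            current.append(ch.lower())
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


def token_hash(token):
    # Stand-in hash for illustration only.
    return zlib.crc32(token.encode("utf-8"))


def build_index(messages):
    # Map each token hash to the set of message ids that contain the token.
    index = {}
    for msg_id, text in enumerate(messages):
        for tok in tokenize(text):
            index.setdefault(token_hash(tok), set()).add(msg_id)
    return index


def search(index, query):
    # A message matches only if every query token hash is present for it.
    result = None
    for tok in tokenize(query):
        ids = index.get(token_hash(tok), set())
        result = ids if result is None else result & ids
    return result or set()


messages = ["hello world", "我们昨天一起吃晚饭"]
index = build_index(messages)
print(tokenize(messages[1]))   # the whole Chinese sentence is one token
print(search(index, "world"))  # matches message 0
print(search(index, "晚饭"))    # no match: "晚饭" hashes differently from the full run
```

In this sketch the English message is split at the space, so either word can be found, while the Chinese message can only be found by searching for the entire sentence.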
Improvement Suggestions
- Manually Insert Delimiters: Add invisible delimiters between Chinese characters so that the Tokenizer cuts the text into smaller tokens that can be searched individually.
- Custom Tokenizer: Develop a custom Tokenizer and modify the Telegram client to enhance search functionality.
AI Semantic Search
Beyond traditional keyword search, AI offers a better approach: semantic search. The project telegram-search uses an embedding model, which lets users find the content they want even without an exact keyword match. For instance, a query such as "The person who ate dinner last night" can return the message "The man who had dinner with us last night."
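The general idea of embedding-based search can be sketched as follows. This is not the telegram-search implementation: the three-dimensional vectors are hand-written placeholders standing in for the output of a real embedding model, ranking by cosine similarity is an assumption, and `semantic_search` is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    # Cosine of the angle between two vectors; 0.0 if either vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_search(query_vec, indexed, top_k=3):
    # Rank stored messages by similarity between their embedding and the query embedding.
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Placeholder embeddings: a real system would obtain these from an embedding model.
indexed = [
    ("The man who had dinner with us last night", [0.9, 0.8, 0.1]),
    ("Server deployment finished", [0.1, 0.0, 0.95]),
]
query_vec = [0.85, 0.9, 0.05]  # stands in for "The person who ate dinner last night"

for score, text in semantic_search(query_vec, indexed, top_k=2):
    print(round(score, 3), text)
```

Because messages are compared by vector similarity and not by token hashes, this approach does not depend on how the Tokenizer cuts Chinese text.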
By using the methods mentioned above, the search experience for Chinese messages in Telegram can be significantly improved.