Online Demo
=========
http://209.222.69.242:9000/
jieba
========
"Jieba" (Chinese for "to stutter") Chinese text segmentation: built to be the best Python Chinese word segmentation module.
Features
========
* Supports two segmentation modes (a short usage sketch follows this list):
* 1) Default mode, which attempts to cut the sentence into the most accurate segmentation; suitable for text analysis;
* 2) Full mode, which scans out all the words that can possibly be formed from the sentence; suitable for search engines.
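A minimal sketch of the two modes (the sample sentence is illustrative; the exact output depends on the bundled dictionary):

```python
# -*- coding: utf-8 -*-
import jieba

sentence = u"我来到北京清华大学"  # illustrative sample sentence

# Full mode: scan out every word that can be formed from the sentence
print("Full Mode: " + "/ ".join(jieba.cut(sentence, cut_all=True)))

# Default mode: the most accurate segmentation, suitable for text analysis
print("Default Mode: " + "/ ".join(jieba.cut(sentence, cut_all=False)))
```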
Usage
========
* Fully automatic installation: `easy_install jieba` or `pip install jieba`
* Semi-automatic installation: download http://pypi.python.org/pypi/jieba/ , extract it, then run `python setup.py install`
* Manual installation: place the `jieba` directory in the current directory or in the Python `site-packages` directory.
* Use `import jieba` to import; the Trie tree is built on first import only, which takes a few seconds.
Algorithm
========
* A Trie tree structure enables efficient word graph scanning, building a directed acyclic graph (DAG) of all possible word combinations of the Chinese characters in the sentence
* Memoized search is used to compute the maximum-probability path, finding the most probable segmentation based on word frequency (see the sketch after this list)
* For unknown words, a model based on Chinese character position probabilities is used, decoded with the Viterbi algorithm
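To make the second point concrete, here is an illustrative sketch of the memoized-search / dynamic-programming step over the word DAG. This is not jieba's actual code; `dag`, `logfreq`, and `max_prob_route` are assumed names for this sketch.

```python
# Illustrative sketch only, not jieba's implementation. Assumes these are already built:
#   dag[i]         -- list of end indices j such that sentence[i:j+1] is a known word
#   logfreq(word)  -- log probability of `word` based on its dictionary frequency

def max_prob_route(sentence, dag, logfreq):
    n = len(sentence)
    # route[i] = (best log probability of segmenting sentence[i:], end index of the word chosen at i)
    route = {n: (0.0, 0)}
    for i in range(n - 1, -1, -1):  # fill the table right-to-left
        route[i] = max(
            (logfreq(sentence[i:j + 1]) + route[j + 1][0], j)
            for j in dag[i]
        )
    return route
```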
Function 1): cut
==========
* The `jieba.cut` method accepts two input parameters: 1) the first parameter is the string to be segmented; 2) the second parameter, `cut_all`, controls the segmentation mode.
* `jieba.cut` returns a generator: you can use a `for` loop to obtain each segmented word (in unicode), or call `list(jieba.cut( ... ))` to get a list. A short example follows.
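A minimal sketch that produces the sample output on the next line (output may vary with the dictionary version):

```python
# -*- coding: utf-8 -*-
import jieba

# Default mode; "杭研" is not in the dictionary, but the Viterbi step can recover it.
seg_list = jieba.cut(u"他来到了网易杭研大厦")
print(", ".join(seg_list))
```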
Output: 他, 来到, 了, 网易, 杭研, 大厦 (here, "杭研" is not in the dictionary, but is recognized by the Viterbi algorithm)
Function 2): Add a custom dictionary
==========
* Developers can specify their own custom dictionary to supplement the jieba default dictionary. jieba can identify new words on its own, but adding your own entries ensures a higher rate of correct segmentation.
* Usage: `jieba.load_userdict(file_name)  # file_name is the path to the custom dictionary` (see the sketch below)
* The dictionary format is the same as that of `dict.txt`: one word per line; each line has two parts separated by a space, the first being the word itself and the second its word frequency
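A minimal sketch, assuming a custom dictionary file named `userdict.txt` with illustrative entries:

```python
# -*- coding: utf-8 -*-
import jieba

# userdict.txt (example content, one "word frequency" pair per line):
#   云计算 5
#   创新办 3
jieba.load_userdict("userdict.txt")  # load the custom dictionary before cutting

# Words from the custom dictionary are now kept as single tokens.
print("/ ".join(jieba.cut(u"他是创新办主任，也是云计算方面的专家")))
```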