From 22d1e3c043f00709f0e069be82d9d8dc759c2fa1 Mon Sep 17 00:00:00 2001
From: Sun Junyi
Date: Mon, 26 Nov 2012 11:55:49 +0800
Subject: [PATCH 1/7] Update README.md

---
 README.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index 8d5175f..d466616 100644
--- a/README.md
+++ b/README.md
@@ -19,9 +19,9 @@ Usage
 
 Algorithm
 ========
-* 基于Trie树结构实现高效的词图扫描,生成句子中汉字构成的有向无环图(DAG)
-* 采用了记忆化搜索实现最大概率路径的计算, 找出基于词频的最大切分组合
-* 对于未登录词,采用了基于汉字位置概率的模型,使用了Viterbi算法
+* 基于Trie树结构实现高效的词图扫描,生成句子中汉字所有可能成词情况所构成的有向无环图(DAG)
+* 采用了动态规划查找最大概率路径, 找出基于词频的最大切分组合
+* 对于未登录词,采用了基于汉字成词能力的HMM模型,使用了Viterbi算法
 
 功能 1):分词
 ==========
@@ -45,7 +45,7 @@ Algorithm
 
 Output:
 
- Full Mode: 我/ 来/ 来到/ 到/ 北/ 北京/ 京/ 清/ 清华/ 清华大学/ 华/ 华大/ 大/ 大学/ 学
+ Full Mode: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学
 
  Default Mode: 我/ 来到/ 北京/ 清华大学
 

From 1d4c0445c6d5fd971dbdd681321876fb131ce1c4 Mon Sep 17 00:00:00 2001
From: Sun Junyi
Date: Mon, 26 Nov 2012 11:57:00 +0800
Subject: [PATCH 2/7] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index d466616..abf6fee 100644
--- a/README.md
+++ b/README.md
@@ -152,7 +152,7 @@ Code example: segmentation
 
 Output:
 
- Full Mode: 我/ 来/ 来到/ 到/ 北/ 北京/ 京/ 清/ 清华/ 清华大学/ 华/ 华大/ 大/ 大学/ 学
+ Full Mode: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学
 
  Default Mode: 我/ 来到/ 北京/ 清华大学
 

From 3a0887cdbc669189a24483e9425a3c6d05fe3ebd Mon Sep 17 00:00:00 2001
From: Richard Wong
Date: Mon, 26 Nov 2012 15:58:58 +0800
Subject: [PATCH 3/7] Correct dict.txt to analyse/idf.txt

---
 README.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index abf6fee..3feb15b 100644
--- a/README.md
+++ b/README.md
@@ -23,7 +23,7 @@ Algorithm
 * 采用了动态规划查找最大概率路径, 找出基于词频的最大切分组合
 * 对于未登录词,采用了基于汉字成词能力的HMM模型,使用了Viterbi算法
 
-功能 1):分词
+功能 1):分词
 ==========
 * jieba.cut方法接受两个输入参数: 1) 第一个参数为需要分词的字符串 2)cut_all参数用来控制分词模式
 * 待分词的字符串可以是gbk字符串、utf-8字符串或者unicode
@@ -56,7 +56,7 @@ Output:
 
 * 开发者可以指定自己自定义的词典,以便包含jieba词库里没有的词。虽然jieba有新词识别能力,但是自行添加新词可以保证更高的正确率
 * 用法: jieba.load_userdict(file_name) # file_name为自定义词典的路径
-* 词典格式和dict.txt一样,一个词占一行;每一行分为两部分,一部分为词语,另一部分为词频,用空格隔开
+* 词典格式和`analyse/idf.txt`一样,一个词占一行;每一行分为两部分,一部分为词语,另一部分为词频,用空格隔开
 * 范例:
 
 云计算 5
@@ -116,7 +116,7 @@ Features
 * 1) Default mode, attempt to cut the sentence into the most accurate segmentation, which is suitable for text analysis;
 * 2) Full mode, break the words of the sentence into words scanned, which is suitable for search engines.
 
-Usage
+Usage
 ========
 * Fully automatic installation: `easy_install jieba` or `pip install jieba`
 * Semi-automatic installation: Download http://pypi.python.org/pypi/jieba/ , after extracting run `python setup.py install`
@@ -163,7 +163,7 @@ Function 2): Add a custom dictionary
 
 * Developers can specify their own custom dictionary to include in the jieba thesaurus. jieba has the ability to identify new words, but adding your own new words can ensure a higher rate of correct segmentation.
 * Usage: `jieba.load_userdict(file_name) # file_name is a custom dictionary path`
-* The dictionary format is the same as that of `dict.txt`: one word per line; each line is divided into two parts, the first is the word itself, the other is the word frequency, separated by a space
+* The dictionary format is the same as that of `analyse/idf.txt`: one word per line; each line is divided into two parts, the first is the word itself, the other is the word frequency, separated by a space
 * Example:
 
 云计算 5

From 04cfe2fb0605d2b1490e9d2db8d067ee8ab30b6d Mon Sep 17 00:00:00 2001
From: Sun Junyi
Date: Tue, 27 Nov 2012 10:33:38 +0800
Subject: [PATCH 4/7] Update README.md

---
 README.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 3feb15b..baf8268 100644
--- a/README.md
+++ b/README.md
@@ -66,7 +66,8 @@ Output:
 之前: 李小福 / 是 / 创新 / 办 / 主任 / 也 / 是 / 云 / 计算 / 方面 / 的 / 专家 /
 
 加载自定义词库后: 李小福 / 是 / 创新办 / 主任 / 也 / 是 / 云计算 / 方面 / 的 / 专家 /
-
+
+* 通过用户自定义词典来增强歧义纠错能力: https://github.com/fxsjy/jieba/issues/14
 
 功能 3) :关键词提取
 ================

From ed81a62a565c73b5f918d7086e3e04367237a46e Mon Sep 17 00:00:00 2001
From: Sun Junyi
Date: Tue, 27 Nov 2012 10:36:42 +0800
Subject: [PATCH 5/7] Update README.md

---
 README.md | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/README.md b/README.md
index baf8268..cf70c53 100644
--- a/README.md
+++ b/README.md
@@ -106,6 +106,11 @@ Output:
 =========
 http://209.222.69.242:9000/
 
+常见问题
+=========
+1)模型的数据是如何生成的?https://github.com/fxsjy/jieba/issues/7
+2)这个库的授权是? https://github.com/fxsjy/jieba/issues/2
+
 
 jieba
 ========

From dd8a649bf36dd73d85827dfbcf76cc8817725bf9 Mon Sep 17 00:00:00 2001
From: Sun Junyi
Date: Tue, 27 Nov 2012 10:37:12 +0800
Subject: [PATCH 6/7] Update README.md

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index cf70c53..bb9f10b 100644
--- a/README.md
+++ b/README.md
@@ -108,8 +108,8 @@ http://209.222.69.242:9000/
 
 常见问题
 =========
-1)模型的数据是如何生成的?https://github.com/fxsjy/jieba/issues/7
-2)这个库的授权是? https://github.com/fxsjy/jieba/issues/2
+ 1)模型的数据是如何生成的?https://github.com/fxsjy/jieba/issues/7
+ 2)这个库的授权是? https://github.com/fxsjy/jieba/issues/2
 
 
 jieba

From 3f7e88711586c2c365b9b918d62bfcb6eaacfccb Mon Sep 17 00:00:00 2001
From: Sun Junyi
Date: Tue, 27 Nov 2012 10:37:29 +0800
Subject: [PATCH 7/7] Update README.md

---
 README.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/README.md b/README.md
index bb9f10b..a6ae536 100644
--- a/README.md
+++ b/README.md
@@ -109,6 +109,7 @@ http://209.222.69.242:9000/
 常见问题
 =========
  1)模型的数据是如何生成的?https://github.com/fxsjy/jieba/issues/7
+
  2)这个库的授权是? https://github.com/fxsjy/jieba/issues/2
 
 
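
Review note on PATCH 1/7: the reworded Algorithm bullets describe building a DAG of every possible word in the sentence and then using dynamic programming to find the maximum-probability path based on word frequency. A minimal sketch of that idea follows, for readers of the series. It is not jieba's code: the `FREQ` table and the `build_dag`, `best_path`, and `cut` helpers are hypothetical, the dictionary is a plain `dict` rather than a Trie, and the HMM/Viterbi step for unknown words is left out.

```python
import math

# Hypothetical toy dictionary: word -> frequency (illustrative, not jieba's data).
FREQ = {
    "我": 5, "来": 3, "到": 3, "来到": 4,
    "北": 1, "京": 1, "北京": 6,
    "清": 1, "华": 1, "大": 2, "学": 2,
    "清华": 3, "华大": 1, "大学": 4, "清华大学": 5,
}
TOTAL = sum(FREQ.values())


def build_dag(sentence):
    """Map each start index to every inclusive end index that forms a dictionary word."""
    dag = {}
    n = len(sentence)
    for start in range(n):
        ends = [end for end in range(start, n) if sentence[start:end + 1] in FREQ]
        # Fall back to the single character so every position has an outgoing edge.
        dag[start] = ends or [start]
    return dag


def best_path(sentence, dag):
    """Right-to-left dynamic programming: best (log-probability, end index) per start."""
    n = len(sentence)
    route = {n: (0.0, n)}
    for start in range(n - 1, -1, -1):
        route[start] = max(
            (math.log(FREQ.get(sentence[start:end + 1], 1) / TOTAL) + route[end + 1][0], end)
            for end in dag[start]
        )
    return route


def cut(sentence):
    """Follow the best route from index 0 and collect the words along it."""
    route = best_path(sentence, build_dag(sentence))
    words, i = [], 0
    while i < len(sentence):
        end = route[i][1]
        words.append(sentence[i:end + 1])
        i = end + 1
    return words


print("/ ".join(cut("我来到北京清华大学")))
```

With this toy table the result matches the Default Mode line quoted in the patch (我/ 来到/ 北京/ 清华大学). The multi-character edges in the DAG are the kind of candidates that the new Full Mode output lists.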