Merge pull request #183 from gumblex/jieba3k

Jieba3k update to v0.33
Sun Junyi 11 years ago
commit 8f52419386

@ -1,3 +1,9 @@
2014-08-31: version 0.33
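1. Support custom stop words; by @fukuball
2. Support a custom IDF dictionary; by @fukuball
3. Fix a bug where part-of-speech tags from a custom dictionary were not displayed correctly; by @ShuraChow
2014-02-07: version 0.32
1. Add a segmentation option that turns off new-word discovery; see https://github.com/fxsjy/jieba/blob/master/test/test_no_hmm.py#L8
2. Fix bugs in the posseg submodule; see https://github.com/fxsjy/jieba/issues/111 https://github.com/fxsjy/jieba/issues/132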

@ -1,6 +1,6 @@
jieba
========
"结巴"中文分词做最好的Python中文分词组件
"结巴"中文分词:做最好的 Python 中文分词组件
"Jieba" (Chinese for "to stutter") Chinese text segmentation: built to be the best Python Chinese word segmentation module.
- _Scroll down for English documentation._
@ -8,7 +8,6 @@ jieba
Note!
========
This branch, `jieba3k`, is the version dedicated to Python 3.x
=======
Feature
@ -36,52 +35,54 @@ http://jiebademo.ap01.aws.af.cm/
Installation on Python 2.x
===================
* Fully automatic install: `easy_install jieba` or `pip install jieba`
* Semi-automatic install: download http://pypi.python.org/pypi/jieba/ , unpack it, and run python setup.py install
* Manual install: place the jieba directory in the current directory or in the site-packages directory
* Import it with import jieba (the first import has to build the Trie, which takes a few seconds)
* Import it with import jieba
Installation on Python 3.x
====================
* The master branch currently supports only Python 2.x
* The Python 3.x branch is also basically usable: https://github.com/fxsjy/jieba/tree/jieba3k
git clone https://github.com/fxsjy/jieba.git
git checkout jieba3k
python setup.py install
* Or install with pip3: pip3 install jieba3k
Jieba for Java
================
Author: piaolingxue
URL: https://github.com/huaban/jieba-analysis
Jieba for C++
================
Author: Aszxqw
URL: https://github.com/aszxqw/cppjieba
Jieba for Node.js
================
Author: Aszxqw
URL: https://github.com/aszxqw/nodejieba
Jieba for Erlang
================
Author: falood
https://github.com/falood/exjieba
Algorithm
========
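* Efficient word-graph scanning based on a Trie structure, which generates a directed acyclic graph (DAG) of every possible word the characters in a sentence can form
* Dynamic programming finds the maximum-probability path, which gives the best segmentation based on word frequency
* For out-of-vocabulary words, an HMM based on the word-forming capability of Chinese characters is used, decoded with the Viterbi algorithm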
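The first two steps above can be sketched in a few lines. This is a minimal illustration, not jieba's implementation: it assumes a tiny hypothetical frequency table (`FREQ`), uses plain dictionary lookups in place of a Trie, and leaves out the HMM step for out-of-vocabulary words.

```python
import math

# Hypothetical toy frequency table standing in for jieba's dict.txt.
FREQ = {
    "我": 5, "来": 2, "到": 2, "来到": 4, "北": 1, "京": 1, "北京": 6,
    "清": 1, "华": 1, "清华": 3, "大": 1, "学": 1, "大学": 4, "清华大学": 5,
}
TOTAL = sum(FREQ.values())


def build_dag(sentence):
    """Map each start index to the end indices of every known word starting there."""
    dag = {}
    n = len(sentence)
    for i in range(n):
        ends = [j for j in range(i, n) if sentence[i:j + 1] in FREQ]
        dag[i] = ends or [i]  # an unknown character still forms a one-character edge
    return dag


def best_route(sentence, dag):
    """Right-to-left dynamic programming: best (log probability, end index) per start."""
    n = len(sentence)
    route = {n: (0.0, 0)}
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(FREQ.get(sentence[i:j + 1], 1) / TOTAL) + route[j + 1][0], j)
            for j in dag[i]
        )
    return route


def segment(sentence):
    """Follow the best route from the start of the sentence and collect the words."""
    route = best_route(sentence, build_dag(sentence))
    words, i = [], 0
    while i < len(sentence):
        j = route[i][1]
        words.append(sentence[i:j + 1])
        i = j + 1
    return words


print(segment("我来到北京清华大学"))
```

With this toy table, the single word 清华大学 scores higher than 清华 followed by 大学, so the longer word is kept.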
Function 1): Segmentation
==========
* The `jieba.cut` method accepts two input parameters: 1) the first is the string to segment; 2) `cut_all` controls whether full mode is used
* The `jieba.cut_for_search` method accepts one parameter, the string to segment; it suits building inverted indexes for search engines, and its granularity is finer
* Note: the string to segment can be a GBK string, a UTF-8 string, or unicode
* Both `jieba.cut` and `jieba.cut_for_search` return an iterable generator; use a for loop to get each segmented word (unicode), or convert the result to a list with list(jieba.cut(...))
Code sample (segmentation)
@ -89,17 +90,15 @@ Algorithm
import jieba
seg_list = jieba.cut("我来到北京清华大学",cut_all=True)
print("Full Mode:", "/ ".join(seg_list)) #Full mode
print("Full Mode:", "/ ".join(seg_list)) # Full mode
seg_list = jieba.cut("我来到北京清华大学",cut_all=False)
print("Default Mode:", "/ ".join(seg_list)) #Accurate mode
print("Default Mode:", "/ ".join(seg_list)) # Accurate mode
seg_list = jieba.cut("他来到了网易杭研大厦") #Accurate mode is the default
seg_list = jieba.cut("他来到了网易杭研大厦") # Accurate mode is the default
print(", ".join(seg_list))
seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所,后在日本京都大学深造") #Search engine mode
seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所,后在日本京都大学深造") # Search engine mode
print(", ".join(seg_list))
Output:
@ -115,36 +114,48 @@ Output:
Function 2): Add a custom dictionary
================
* Developers can specify their own custom dictionary to include words that are not in the jieba vocabulary. Although jieba can recognize new words, adding them yourself ensures higher accuracy
* Usage: jieba.load_userdict(file_name) # file_name is the path of the custom dictionary
* The dictionary format is the same as `dict.txt`: one word per line; each line has three parts separated by spaces: the word, the word frequency, and the part-of-speech tag (optional)
* Example:
* Custom dictionary: https://github.com/fxsjy/jieba/blob/master/test/userdict.txt
* Usage example: https://github.com/fxsjy/jieba/blob/master/test/test_userdict.py
* Before: 李小福 / 是 / 创新 / 办 / 主任 / 也 / 是 / 云 / 计算 / 方面 / 的 / 专家 /
* After loading the custom dictionary: 李小福 / 是 / 创新办 / 主任 / 也 / 是 / 云计算 / 方面 / 的 / 专家 /
* "Using a user-defined dictionary to improve ambiguity correction" --- https://github.com/fxsjy/jieba/issues/14
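To make the dictionary format concrete, here is a small parsing sketch. It is not jieba's loader: `parse_userdict_line` and `parse_userdict` are hypothetical helper names, and the sketch only assumes the "word frequency [tag]" layout described above.

```python
def parse_userdict_line(line):
    """Split one dictionary line into (word, freq, tag); tag is None when omitted."""
    parts = line.split()
    if len(parts) < 2:
        raise ValueError("expected 'word freq [tag]', got %r" % line)
    word, freq = parts[0], int(parts[1])
    tag = parts[2] if len(parts) > 2 else None
    return word, freq, tag


def parse_userdict(text):
    """Parse a whole dictionary, skipping blank lines."""
    return [parse_userdict_line(line) for line in text.splitlines() if line.strip()]


sample = "李小福 2\n创新办 3 i\n"
for entry in parse_userdict(sample):
    print(entry)
```

Skipping blank lines is a choice made for this sketch so that a trailing newline in the file does not raise an error.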
Function 3): Keyword extraction
================
* jieba.analyse.extract_tags(sentence,topK) # import jieba.analyse first
* sentence is the text to extract keywords from
* topK is the number of keywords with the highest TF/IDF weights to return; the default is 20
Code sample (keyword extraction)
https://github.com/fxsjy/jieba/blob/master/test/extract_tags.py
The inverse document frequency (IDF) corpus used for keyword extraction can be switched to a custom corpus path
* Usage: jieba.analyse.set_idf_path(file_name) # file_name is the path of the custom corpus
* Custom corpus sample: https://github.com/fxsjy/jieba/blob/master/extra_dict/idf.txt.big
* Usage example: https://github.com/fxsjy/jieba/blob/master/test/extract_tags_idfpath.py
The stop words corpus used for keyword extraction can be switched to a custom corpus path
* Usage: jieba.analyse.set_stop_words(file_name) # file_name is the path of the custom corpus
* Custom corpus sample: https://github.com/fxsjy/jieba/blob/master/extra_dict/stop_words.txt
* Usage example: https://github.com/fxsjy/jieba/blob/master/test/extract_tags_stop_words.py
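The way an IDF table and a stop word set feed into the ranking can be sketched as follows. This is an illustration under assumptions rather than the library's code: `extract_tags_sketch` is a hypothetical name, the input is an already segmented word list, and falling back to the median IDF for words missing from the table is an assumption suggested by the `median_idf` value computed in this commit.

```python
def extract_tags_sketch(words, idf_freq, stop_words, top_k=20):
    """Rank words by term frequency times IDF and return the top_k words."""
    # Assumption: words absent from the IDF table get the median IDF value.
    median_idf = sorted(idf_freq.values())[len(idf_freq) // 2]
    freq = {}
    for w in words:
        if len(w.strip()) < 2:          # skip single characters and blanks
            continue
        if w.lower() in stop_words:     # skip stop words
            continue
        freq[w] = freq.get(w, 0.0) + 1.0
    total = sum(freq.values())
    scored = [(count / total * idf_freq.get(w, median_idf), w)
              for w, count in freq.items()]
    # Highest weight first; ties broken alphabetically for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [w for _, w in scored[:top_k]]


idf = {"北京": 2.0, "大学": 3.0, "天安门": 5.0}
words = ["北京", "the", "北京", "大学", "天安门", "爱"]
print(extract_tags_sketch(words, idf, {"the"}, top_k=2))
```

The IDF table must be non-empty for the median lookup to work; a custom corpus loaded through `set_idf_path` would play the role of `idf` here.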
Function 4): Part-of-speech tagging
================
* Tags the part of speech of each word after segmentation, using ICTCLAS-compatible notation
* Usage example
>>> import jieba.posseg as pseg
@ -156,11 +167,11 @@ Output:
爱 v
北京 ns
天安门 ns
Function 5): Parallel segmentation
==================
* How it works: the target text is split by line, the lines are distributed to multiple Python processes and segmented in parallel, and the results are merged, which gives a considerable speedup
* Built on Python's multiprocessing module; Windows is not currently supported
* Usage:
* `jieba.enable_parallel(4)` # Enable parallel segmentation; the argument is the number of worker processes
* `jieba.disable_parallel()` # Disable parallel segmentation
@ -168,12 +179,12 @@ Output:
* Example:
https://github.com/fxsjy/jieba/blob/master/test/parallel/test_file.py
* Result: on a 4-core 3.4 GHz Linux machine, accurate-mode segmentation of the complete works of Jin Yong ran at 1 MB/s, 3.3 times as fast as the single-process version.
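The split, map and merge idea can be sketched without any process pool. The following is a simplified illustration: `cut_by_lines` and `toy_cut` are hypothetical names, the `mapper` argument defaults to the built-in `map`, and the line-break pattern is the one that appears in `pcut` in this commit.

```python
import re

# Same pattern as pcut in this commit: the capture group keeps the line breaks.
_LINE_BREAKS = re.compile("([\r\n]+)")


def toy_cut(line):
    """Hypothetical stand-in segmenter: split on spaces, keep non-empty tokens."""
    return [tok for tok in line.split(" ") if tok]


def cut_by_lines(text, cut_line, mapper=map):
    """Split text by line, segment each part via mapper, then merge in order."""
    parts = _LINE_BREAKS.split(text)
    merged = []
    for words in mapper(cut_line, parts):
        merged.extend(words)
    return merged


print(cut_by_lines("a b\nc d", toy_cut))
```

To run the parts in parallel, one could pass the `map` method of a `multiprocessing.Pool` as `mapper`, provided `cut_line` is a top-level function that can be pickled.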
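Function 6): Tokenize: return the start position of each word in the original text
============================================
* Note: the input parameter accepts only unicode
* Note: the input parameter accepts only str
* Default mode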
```python
@ -206,9 +217,9 @@ word 有限 start: 6 end:8
word 公司 start: 8 end:10
word 有限公司 start: 6 end:10
```
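The start and end offsets shown in output like the above can be derived from a segmented word list by keeping a running position. This is a minimal sketch of that bookkeeping, assuming the words are contiguous and in order; `tokenize_sketch` is a hypothetical name, and the sub-word entries that search mode adds (such as 有限 and 公司 inside 有限公司) are not covered.

```python
def tokenize_sketch(words):
    """Yield (word, start, end) for contiguous words, with end exclusive."""
    start = 0
    for w in words:
        end = start + len(w)
        yield (w, start, end)
        start = end


for word, start, end in tokenize_sketch(["永和", "服装", "饰品", "有限公司"]):
    print("word %s\t\t start: %d \t\t end:%d" % (word, start, end))
```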
Function 7): ChineseAnalyzer for the Whoosh search engine
============================================
* Import: `from jieba.analyse import ChineseAnalyzer `
* Usage example: https://github.com/fxsjy/jieba/blob/master/test/test_whoosh.py
@ -222,19 +233,19 @@ https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.small
2. A dictionary file with better support for Traditional Chinese segmentation
https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.big
Download the dictionary you need and overwrite jieba/dict.txt with it, or use `jieba.set_dictionary('data/dict.txt.big')`
Change in module initialization: lazy load (since version 0.28)
================================================
jieba uses lazy loading: "import jieba" does not load the dictionary immediately; the dictionary is loaded and the trie built only when needed. If you prefer, you can also initialize jieba manually.
import jieba
jieba.initialize() # Manual initialization (optional)
Before version 0.28, the path of the main dictionary could not be specified; with lazy loading, you can change the main dictionary path:
jieba.set_dictionary('data/dict.txt.big')
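The lazy-loading pattern described here can be illustrated with a small class. This is a generic sketch, not jieba's module code: `LazyDictionary`, its `loader` callable and the `load_count` counter are all hypothetical and exist only to show that nothing is loaded until the first lookup, and that the source can be swapped before loading.

```python
class LazyDictionary:
    """Hold a loader callable and run it only when data is first needed."""

    def __init__(self, loader):
        self._loader = loader
        self._data = None
        self.load_count = 0  # how many times the loader has actually run

    def set_loader(self, loader):
        """Switch the data source; the next lookup reloads from it."""
        self._loader = loader
        self._data = None

    def initialize(self):
        """Load eagerly if nothing is loaded yet (optional, like a manual init)."""
        if self._data is None:
            self._data = self._loader()
            self.load_count += 1

    def lookup(self, word):
        self.initialize()
        return self._data.get(word)


d = LazyDictionary(lambda: {"北京": 6})
print(d.load_count)      # nothing loaded yet
print(d.lookup("北京"))   # first lookup triggers the load
print(d.load_count)
```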
@ -335,9 +346,9 @@ Function 2): Add a custom dictionary
李小福 2
创新办 3
之前 李小福 / 是 / 创新 / 办 / 主任 / 也 / 是 / 云 / 计算 / 方面 / 的 / 专家 /
[Before] 李小福 / 是 / 创新 / 办 / 主任 / 也 / 是 / 云 / 计算 / 方面 / 的 / 专家 /
加载自定义词库后: 李小福 / 是 / 创新办 / 主任 / 也 / 是 / 云计算 / 方面 / 的 / 专家 /
[After]: 李小福 / 是 / 创新办 / 主任 / 也 / 是 / 云计算 / 方面 / 的 / 专家 /
Function 3): Keyword Extraction
================
@ -349,6 +360,18 @@ Code sample (keyword extraction)
https://github.com/fxsjy/jieba/blob/master/test/extract_tags.py
Developers can specify their own custom IDF corpus in jieba keyword extraction
* Usage `jieba.analyse.set_idf_path(file_name) # file_name is a custom corpus path`
* Custom Corpus Sample: https://github.com/fxsjy/jieba/blob/master/extra_dict/idf.txt.big
* Sample Code: https://github.com/fxsjy/jieba/blob/master/test/extract_tags_idfpath.py
Developers can specify their own custom stop words corpus in jieba keyword extraction
* Usage `jieba.analyse.set_stop_words(file_name) # file_name is a custom corpus path`
* Custom Corpus Sample: https://github.com/fxsjy/jieba/blob/master/extra_dict/stop_words.txt
* Sample Code: https://github.com/fxsjy/jieba/blob/master/test/extract_tags_stop_words.py
Using Other Dictionaries
========
It is possible to supply Jieba with your own custom dictionary, and there are also two dictionaries readily available for download:


@ -0,0 +1,51 @@
the
of
is
and
to
in
that
we
for
an
are
by
be
as
on
with
can
if
from
which
you
it
this
then
at
have
all
not
one
has
or
that
一個
沒有
我們
你們
妳們
他們
她們
是否

@ -195,7 +195,7 @@ def __cut_DAG_NO_HMM(sentence):
if len(buf)>0:
yield buf
buf = ''
yield l_word
yield l_word
x =y
if len(buf)>0:
yield buf
@ -243,8 +243,8 @@ def __cut_DAG(sentence):
yield elem
def cut(sentence,cut_all=False,HMM=True):
'''The main function that segments an entire sentence that contains
Chinese characters into separated words.
'''The main function that segments an entire sentence that contains
Chinese characters into separated words.
Parameter:
- sentence: The String to be segmented
- cut_all: Model. True means full pattern, false means accurate pattern.
@ -257,8 +257,8 @@ def cut(sentence,cut_all=False,HMM=True):
sentence = sentence.decode('gbk','ignore')
'''
\\u4E00-\\u9FA5a-zA-Z0-9+#&\._ : All non-space characters. Will be handled with re_han
\r\n|\s : whitespace characters. Will not be Handled.
'''
\r\n|\s : whitespace characters. Will not be Handled.
'''
re_han, re_skip = re.compile(r"([\u4E00-\u9FA5a-zA-Z0-9+#&\._]+)", re.U), re.compile(r"(\r\n|\s)")
if cut_all:
re_han, re_skip = re.compile(r"([\u4E00-\u9FA5]+)", re.U), re.compile(r"[^a-zA-Z0-9+#\n]")
@ -306,7 +306,7 @@ def load_userdict(f):
''' Load personalized dict to improve detect rate.
Parameter:
- f : A plain text file that contains words and their occurrences.
Structure of dict file:
Structure of dict file:
word1 freq1 word_type1
word2 freq2 word_type2
...
@ -372,7 +372,7 @@ def enable_parallel(processnum=None):
def pcut(sentence,cut_all=False,HMM=True):
parts = re.compile('([\r\n]+)').split(sentence)
if cut_all:
result = pool.map(__lcut_all,parts)
result = pool.map(__lcut_all,parts)
else:
if HMM:
result = pool.map(__lcut,parts)

@ -1,3 +1,4 @@
#encoding=utf-8
import jieba
import os
try:
@ -5,27 +6,54 @@ try:
except ImportError:
pass
_curpath=os.path.normpath( os.path.join( os.getcwd(), os.path.dirname(__file__) ) )
f_name = os.path.join(_curpath,"idf.txt")
content = open(f_name,'rb').read().decode('utf-8')
_curpath = os.path.normpath( os.path.join( os.getcwd(), os.path.dirname(__file__) ) )
abs_path = os.path.join(_curpath, "idf.txt")
idf_freq = {}
lines = content.split('\n')
for line in lines:
word,freq = line.split(' ')
idf_freq[word] = float(freq)
median_idf = sorted(idf_freq.values())[int(len(idf_freq)/2)]
stop_words= set([
"the","of","is","and","to","in","that","we","for","an","are","by","be","as","on","with","can","if","from","which","you","it","this","then","at","have","all","not","one","has","or","that"
IDF_DICTIONARY = abs_path
STOP_WORDS = set([
"the","of","is","and","to","in","that","we","for","an","are","by","be","as","on","with","can","if","from","which","you","it","this","then","at","have","all","not","one","has","or","that"
])
def set_idf_path(idf_path):
    global IDF_DICTIONARY
    abs_path = os.path.normpath( os.path.join( os.getcwd(), idf_path ) )
    if not os.path.exists(abs_path):
        raise Exception("jieba: path does not exist:" + abs_path)
    IDF_DICTIONARY = abs_path
    return
def get_idf(abs_path):
    content = open(abs_path,'rb').read().decode('utf-8')
    idf_freq = {}
    lines = content.split('\n')
    for line in lines:
        word,freq = line.split(' ')
        idf_freq[word] = float(freq)
    median_idf = sorted(idf_freq.values())[len(idf_freq)//2]
    return idf_freq, median_idf
def set_stop_words(stop_words_path):
    global STOP_WORDS
    abs_path = os.path.normpath( os.path.join( os.getcwd(), stop_words_path ) )
    if not os.path.exists(abs_path):
        raise Exception("jieba: path does not exist:" + abs_path)
    content = open(abs_path,'rb').read().decode('utf-8')
    lines = content.split('\n')
    for line in lines:
        STOP_WORDS.add(line)
    return
def extract_tags(sentence,topK=20):
    global IDF_DICTIONARY
    global STOP_WORDS
    idf_freq, median_idf = get_idf(IDF_DICTIONARY)
    words = jieba.cut(sentence)
    freq = {}
    for w in words:
        if len(w.strip())<2: continue
        if w.lower() in stop_words: continue
        if w.lower() in STOP_WORDS: continue
        freq[w]=freq.get(w,0.0)+1.0
    total = sum(freq.values())
    freq = [(k,v/total) for k,v in freq.items()]

@ -1,6 +1,6 @@
from distutils.core import setup
setup(name='jieba3k',
version='0.32',
version='0.33',
description='Chinese Words Segementation Utilities',
author='Sun, Junyi',
author_email='ccnusjy@gmail.com',

@ -4,14 +4,14 @@ sys.path.append("../")
import jieba
seg_list = jieba.cut("我来到北京清华大学",cut_all=True)
print("Full Mode:", "/ ".join(seg_list)) #Full mode
seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print("Full Mode:", "/ ".join(seg_list)) # Full mode
seg_list = jieba.cut("我来到北京清华大学",cut_all=False)
print("Default Mode:", "/ ".join(seg_list)) #Default mode
seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode:", "/ ".join(seg_list)) # Default mode
seg_list = jieba.cut("他来到了网易杭研大厦")
print(", ".join(seg_list))
seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所,后在日本京都大学深造") #Search engine mode
seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所,后在日本京都大学深造") # Search engine mode
print(", ".join(seg_list))

@ -12,7 +12,7 @@ parser.add_option("-k", dest="topK")
opt, args = parser.parse_args()
if len(args) <1:
if len(args) < 1:
print(USAGE)
sys.exit(1)

@ -0,0 +1,32 @@
import sys
sys.path.append('../')
import jieba
import jieba.analyse
from optparse import OptionParser
USAGE = "usage: python extract_tags_idfpath.py [file name] -k [top k]"
parser = OptionParser(USAGE)
parser.add_option("-k", dest="topK")
opt, args = parser.parse_args()
if len(args) < 1:
    print(USAGE)
    sys.exit(1)
file_name = args[0]
if opt.topK is None:
    topK = 10
else:
    topK = int(opt.topK)
content = open(file_name, 'rb').read()
jieba.analyse.set_idf_path("../extra_dict/idf.txt.big");
tags = jieba.analyse.extract_tags(content, topK=topK)
print(",".join(tags))

@ -0,0 +1,33 @@
import sys
sys.path.append('../')
import jieba
import jieba.analyse
from optparse import OptionParser
USAGE = "usage: python extract_tags_stop_words.py [file name] -k [top k]"
parser = OptionParser(USAGE)
parser.add_option("-k", dest="topK")
opt, args = parser.parse_args()
if len(args) < 1:
    print(USAGE)
    sys.exit(1)
file_name = args[0]
if opt.topK is None:
    topK = 10
else:
    topK = int(opt.topK)
content = open(file_name, 'rb').read()
jieba.analyse.set_stop_words("../extra_dict/stop_words.txt")
jieba.analyse.set_idf_path("../extra_dict/idf.txt.big");
tags = jieba.analyse.extract_tags(content, topK=topK)
print(",".join(tags))

@ -12,7 +12,7 @@ import os
import random
if len(sys.argv)<2:
print "usage: extract_topic.py directory [n_topic] [n_top_words]"
print("usage: extract_topic.py directory [n_topic] [n_top_words]")
sys.exit(0)
n_topic = 10
@ -28,27 +28,27 @@ count_vect = CountVectorizer()
docs = []
pattern = os.path.join(sys.argv[1],"*.txt")
print "read "+pattern
print("read "+pattern)
for f_name in glob.glob(pattern):
with open(f_name) as f:
print "read file:", f_name
print("read file:", f_name)
for line in f: #one line as a document
words = " ".join(jieba.cut(line))
docs.append(words)
random.shuffle(docs)
print "read done."
print("read done.")
print "transform"
print("transform")
counts = count_vect.fit_transform(docs)
tfidf = TfidfTransformer().fit_transform(counts)
print tfidf.shape
print(tfidf.shape)
t0 = time.time()
print "training..."
print("training...")
nmf = decomposition.NMF(n_components=n_topic).fit(tfidf)
print("done in %0.3fs." % (time.time() - t0))

@ -1,7 +1,7 @@
#-*-coding: utf-8 -*-
import sys
import imp
sys.path.append("../")
from imp import reload
import unittest
import types
import jieba
@ -98,7 +98,7 @@ test_contents = [
class JiebaTestCase(unittest.TestCase):
def setUp(self):
reload(jieba)
imp.reload(jieba)
def tearDown(self):
pass

@ -23,6 +23,6 @@ while True:
break
line = line.strip()
for word in jieba.cut(line):
print(word.encode(default_encoding))
print(word)

@ -29,6 +29,6 @@ content = open(file_name,'rb').read()
tags = jieba.analyse.extract_tags(content,topK=topK)
print(",".join(tags) )
print(",".join(tags))

@ -6,7 +6,9 @@ jieba.enable_parallel(4)
def cuttest(test_sent):
result = jieba.cut(test_sent)
print( "/ ".join(result) )
for word in result:
print(word, "/", end=' ')
print("")
if __name__ == "__main__":

@ -6,7 +6,9 @@ jieba.enable_parallel(4)
def cuttest(test_sent):
result = jieba.cut(test_sent,cut_all=True)
print("/ ".join(result))
for word in result:
print(word, "/", end=' ')
print("")
if __name__ == "__main__":

@ -6,7 +6,9 @@ jieba.enable_parallel(4)
def cuttest(test_sent):
result = jieba.cut_for_search(test_sent)
print("/ ".join(result))
for word in result:
print(word, "/", end=' ')
print("")
if __name__ == "__main__":

@ -1,3 +1,4 @@
import urllib.request, urllib.error, urllib.parse
import sys,time
import sys
sys.path.append("../../")
@ -6,16 +7,15 @@ import jieba
jieba.enable_parallel()
url = sys.argv[1]
with open(url,"rb") as content:
content = content.read()
t1 = time.time()
words = "/ ".join(jieba.cut(content))
t2 = time.time()
tm_cost = t2-t1
print('cost',tm_cost)
print('speed' , len(content)/tm_cost, " bytes/second")
content = open(url,"rb").read()
t1 = time.time()
words = "/ ".join(jieba.cut(content))
with open("1.log","wb") as log_f:
log_f.write(words.encode('utf-8'))
t2 = time.time()
tm_cost = t2-t1
log_f = open("1.log","wb")
log_f.write(words.encode('utf-8'))
print('speed' , len(content)/tm_cost, " bytes/second")

@ -8,7 +8,7 @@ import jieba.posseg as pseg
def cuttest(test_sent):
result = pseg.cut(test_sent)
for w in result:
sys.stdout.write(w.word+ "/"+ w.flag + ", ")
print(w.word, "/", w.flag, ", ", end=' ')
print("")

@ -1,4 +1,4 @@
import urllib2
import urllib.request, urllib.error, urllib.parse
import sys,time
import sys
sys.path.append("../../")
@ -16,7 +16,7 @@ tm_cost = t2-t1
log_f = open("1.log","wb")
for w in words:
print >> log_f, w.encode("utf-8"), "/" ,
print(w.encode("utf-8"), "/", end=' ', file=log_f)
print 'speed' , len(content)/tm_cost, " bytes/second"
print('speed' , len(content)/tm_cost, " bytes/second")

@ -3,9 +3,10 @@ import sys
sys.path.append("../")
import jieba
def cuttest(test_sent):
result = jieba.cut(test_sent)
print("/ ".join(result))
print(" / ".join(result))
if __name__ == "__main__":

@ -5,7 +5,7 @@ import jieba
def cuttest(test_sent):
result = jieba.cut(test_sent)
print(" ".join(result) )
print(" ".join(result))
def testcase():
cuttest("这是一个伸手不见五指的黑夜。我叫孙悟空我爱北京我爱Python和C++。")

@ -5,8 +5,9 @@ import jieba
def cuttest(test_sent):
result = jieba.cut_for_search(test_sent)
print("/ ".join(result))
for word in result:
print(word, "/", end=' ')
print("")
if __name__ == "__main__":
@ -93,4 +94,4 @@ if __name__ == "__main__":
cuttest('张晓梅去人民医院做了个B超然后去买了件T恤')
cuttest('AT&T是一件不错的公司给你发offer了吗')
cuttest('C++和c#是什么关系11+122=133是吗PI=3.14159')
cuttest('你认识那个和主席握手的的哥吗?他开一辆黑色的士。')
cuttest('你认识那个和主席握手的的哥吗?他开一辆黑色的士。')

@ -5,7 +5,9 @@ import jieba
def cuttest(test_sent):
result = jieba.cut(test_sent,cut_all=True)
print("/ ".join(result))
for word in result:
print(word, "/", end=' ')
print("")
if __name__ == "__main__":
@ -92,4 +94,4 @@ if __name__ == "__main__":
cuttest('张晓梅去人民医院做了个B超然后去买了件T恤')
cuttest('AT&T是一件不错的公司给你发offer了吗')
cuttest('C++和c#是什么关系11+122=133是吗PI=3.14159')
cuttest('你认识那个和主席握手的的哥吗?他开一辆黑色的士。')
cuttest('你认识那个和主席握手的的哥吗?他开一辆黑色的士。')

@ -1,3 +1,4 @@
import urllib.request, urllib.error, urllib.parse
import sys,time
import sys
sys.path.append("../")
@ -5,15 +6,17 @@ import jieba
jieba.initialize()
url = sys.argv[1]
with open(url,"rb") as content:
content = content.read()
t1 = time.time()
words = "/ ".join(jieba.cut(content))
t2 = time.time()
tm_cost = t2-t1
print('cost',tm_cost)
print('speed' , len(content)/tm_cost, " bytes/second")
content = open(url,"rb").read()
t1 = time.time()
words = "/ ".join(jieba.cut(content))
t2 = time.time()
tm_cost = t2-t1
log_f = open("1.log","wb")
log_f.write(words.encode('utf-8'))
log_f.close()
print('cost',tm_cost)
print('speed' , len(content)/tm_cost, " bytes/second")
with open("1.log","wb") as log_f:
log_f.write(words.encode('utf-8'))
log_f.write(bytes("/ ".join(words),'utf-8'))

@ -6,7 +6,7 @@ import jieba.posseg as pseg
def cuttest(test_sent):
result = pseg.cut(test_sent)
for w in result:
sys.stdout.write(w.word+ "/"+ w.flag + ", ")
print(w.word, "/", w.flag, ", ", end=' ')
print("")

@ -1,3 +1,4 @@
import urllib.request, urllib.error, urllib.parse
import sys,time
import sys
sys.path.append("../")
@ -15,7 +16,7 @@ tm_cost = t2-t1
log_f = open("1.log","wb")
for w in words:
log_f.write(bytes(w.word+"/"+w.flag+" ",'utf-8'))
print(w.encode("utf-8"), "/", end=' ', file=log_f)
print('speed' , len(content)/tm_cost, " bytes/second")

@ -14,7 +14,7 @@ for w in words:
result = pseg.cut(test_sent)
for w in result:
print(w.word, "/", w.flag, ", ")
print(w.word, "/", w.flag, ", ", end=' ')
print("\n========")

@ -59,5 +59,5 @@ for keyword in ("水果世博园","你","first","中文","交换机","交换"):
print(hit.highlights("content"))
print("="*10)
for t in analyzer("我的好朋友是李明;我爱北京天安门;IBM和Microsoft; I have a dream."):
for t in analyzer("我的好朋友是李明;我爱北京天安门;IBM和Microsoft; I have a dream. this is intetesting and interested me a lot"):
print(t.text)

@ -23,8 +23,8 @@ with open(file_name,"rb") as inf:
for line in inf:
i+=1
writer.add_document(
title=u"line"+str(i),
path=u"/a",
title="line"+str(i),
path="/a",
content=line.decode('gbk','ignore')
)
writer.commit()
@ -32,10 +32,10 @@ writer.commit()
searcher = ix.searcher()
parser = QueryParser("content", schema=ix.schema)
for keyword in (u"水果小姐",u"",u"first",u"中文",u"交换机",u"交换"):
print "result of ",keyword
for keyword in ("水果小姐","","first","中文","交换机","交换"):
print("result of ",keyword)
q = parser.parse(keyword)
results = searcher.search(q)
for hit in results:
print hit.highlights("content")
print "="*10
print(hit.highlights("content"))
print("="*10)

@ -18,10 +18,10 @@ ix = open_dir("tmp")
searcher = ix.searcher()
parser = QueryParser("content", schema=ix.schema)
for keyword in (u"水果小姐",u"",u"first",u"中文",u"交换机",u"交换",u"少林",u"乔峰"):
print "result of ",keyword
for keyword in ("水果小姐","","first","中文","交换机","交换","少林","乔峰"):
print("result of ",keyword)
q = parser.parse(keyword)
results = searcher.search(q)
for hit in results:
print hit.highlights("content")
print "="*10
print(hit.highlights("content"))
print("="*10)
