Commit Graph

104 Commits (093980647b0070eb29ce0cadeb2cd1f6fa448565)

Author SHA1 Message Date
Dingyuan Wang 99d0fb1a8a use regex and fix encoding related issues in load_userdict
Dingyuan Wang ceb5c26be4 fix self.FREQ in cut_for_search; make pair object iterable
Dingyuan Wang 3b76328f2a allow ignoring word frequency while providing pos tag
Dingyuan Wang 94840a734c wraps most globals in classes
API changes:
* class jieba.Tokenizer, jieba.posseg.POSTokenizer
* class jieba.analyse.TFIDF, jieba.analyse.TextRank
* global functions are mapped to jieba.(posseg.)dt, the default (POS)Tokenizer
* multiprocessing only works with jieba.(posseg.)dt
* new lcut, lcut_for_search functions that return a list
* jieba.analyse.textrank now returns 20 items by default

Tests:
* added test_lock.py to test multithread locking
* demo.py now contains most of the examples in README
Dingyuan Wang 4a552ca94f suggest word frequency, support passing str to add_word
Dingyuan Wang 872a7039f2 Merge branch 'master' of https://github.com/fxsjy/jieba
Dingyuan Wang f808ea0ebb use only one dict to store words and prefixes
fxsjy 5bfa43a781 fix test scripts
Dingyuan Wang f3a53dd2da fix print() in tests
fxsjy 8cbb26a7b6 fix test_file.py
Dingyuan Wang 22bcf8be7a Merge master and jieba3k, make the code Python 2/3 compatible
Dingyuan Wang 3dad899ec8 backport 2to3 scripts and changelog
Dingyuan Wang c6b386f65b update jieba3k
Dingyuan Wang a5ecf70f71 update to v0.35
Dingyuan Wang 4a6140081e fix problems in auto2to3
Dingyuan Wang 7a6caa0c3c port extract_tags, etc to jieba3k; add auto2to3 script
walkskyer 6772f0282e fix wrong call order when printing results in the weighted test script
Dingyuan Wang fd9f1f2c0e update README, textrank, etc.
fxsjy f5ca87e088 merge change of @fukuball
Dingyuan Wang bb1e6000c6 fix version; fix spaces at end of line
Dingyuan Wang 51df77831b use prefix dict instead of trie, add a command line interface, and a few small improvements
Dingyuan Wang 6fad5fbb2c update to v0.33
Fukuball Lin b658ee69cb allow jieba to use a custom stop words corpus
1. add a sample stop words corpus
2. add a set_stop_words method and rewrite extract_tags so that jieba can switch stop words corpora
3. add the extract_tags_stop_words.py test example to test
Fukuball Lin 7198d562f1 allow jieba to switch idf corpora
1. add a Traditional Chinese idf corpus
2. add get_idf and set_idf_path methods and rewrite extract_tags so that jieba can switch idf corpora
3. add extract_tags_idfpath to test
Dingyuan Wang c04ccd0d12 Update to v0.32 according to the master branch.
fxsjy 18678d50c6 fix bug issue
gan 31d5845535 add better support for English, e.g. input 'this is interesting and interested me' --> output 'this interest interest', where 'interest' matches both 'interesting' and 'interested'
Sun Junyi 7e7fcc1184 add an option to disable HMM
ZoeyYoung d49542c06e fix bug
ZoeyYoung dce353f88b merge from master
ZoeyYoung 2857ae45cc Merge branch 'master' into jieba3k
Conflicts:
	Changelog
	jieba/__init__.py
	jieba/finalseg/__init__.py
	jieba/posseg/__init__.py
	setup.py
	test/parallel/test_file.py
	test/test_file.py
Sun Junyi 81390a2d23 test_file.py: close the file object
fxsjy b77645b3aa modify test_file.py; use less memory
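Linker Lin 5d83855088 automatically detect the number of CPUs and start a suitable number of processes
Linker Lin 2ceb981da0 automatically detect the number of CPUs and start a suitable number of processes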
Sun Junyi 6549deabbd merge change from master
Cheng wei 6035bb6320 fix invalid syntax for python3
Sun Junyi 9d0ea771a5 fix bug; decimals & digit-english mixed
Sun Junyi ba5114dc95 update whoosh example
Sun Junyi f424862222 clean the files in tmp
Sun Junyi b18d56d2a3 Merge pull request from linkerlin/master
add a tmp directory so that test_whoosh.py can run
Sun Junyi b9b1f1a418 fix conflict of merging
miao.lin becd32b178 made test_whoosh.py happy.
add a tmp directory so that test_whoosh.py can run
Sun Junyi c01680c6a8 merge the new file
Sun Junyi b62f052927 PEP8
Sun Junyi 45daf561c7 follow PEP8: change tab to 4 white spaces
Sun Junyi dbec3ad9df add some comments
Sun Junyi efc784312c add ChineseAnalyzer for whoosh search engine
Sun Junyi f08690a2df add 'search mode' for jieba.tokenize
Sun Junyi cb1b0499f7 unittest for jieba.tokenize