Add paddle-based segmentation and part-of-speech tagging (#788)

* paddle cut release

* Update README.md to tell users to install paddlepaddle-tiny

* Remove the utf coding header lines from the two init.py files

* Polish README details
JesseyXujin 5 years ago committed by Sun Junyi
parent 38134ee20f
commit 5b3bb4b7f2

@@ -9,11 +9,11 @@ jieba
Features
========
* Supports three segmentation modes:
* Supports four segmentation modes:
* Accurate mode: tries to cut the sentence into the most precise segmentation, suitable for text analysis;
* Full mode: scans out every fragment of the sentence that can form a word; very fast, but cannot resolve ambiguity;
* Search-engine mode: on top of accurate mode, long words are cut again to improve recall, suitable for search-engine tokenization.
* paddle mode: uses the PaddlePaddle deep learning framework to train a sequence-labeling bidirectional GRU model for segmentation, and also supports POS tagging. To use it, first install paddlepaddle-tiny: `pip install paddlepaddle-tiny==1.6.1`.
* Supports Traditional Chinese segmentation
* Supports custom dictionaries
* MIT License
@@ -33,6 +33,7 @@ jieba
* Semi-automatic installation: download http://pypi.python.org/pypi/jieba/ , unpack it, then run `python setup.py install`
* Manual installation: place the jieba directory in the current directory or in the site-packages directory
* Use it via `import jieba`
* To use segmentation and POS tagging in paddle mode, first install paddlepaddle-tiny: `pip install paddlepaddle-tiny==1.6.1` (a quick check is sketched below).
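A minimal sketch for verifying the optional dependency; the sample sentence is arbitrary:

```python
# quick check of the optional paddle dependency (sketch)
import paddle
print(paddle.__version__)  # the paddle branch in this PR only activates for '1.6.1'

import jieba
print("/".join(jieba.cut("我来到北京清华大学", use_paddle=True)))
```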
Algorithm
========
@@ -44,7 +45,7 @@ jieba
=======
1. Segmentation
--------
* The `jieba.cut` method takes three input parameters: the string to be segmented; the cut_all parameter, which controls whether full mode is used; and the HMM parameter, which controls whether the HMM model is used
* The `jieba.cut` method takes four input parameters: the string to be segmented; the cut_all parameter, which controls whether full mode is used; the HMM parameter, which controls whether the HMM model is used; and the use_paddle parameter, which controls whether paddle-mode segmentation is used. To use paddle mode, install paddlepaddle-tiny: `pip install paddlepaddle-tiny==1.6.1`
* The `jieba.cut_for_search` method takes two parameters: the string to be segmented, and whether to use the HMM model. This method is suited to building inverted indexes for search engines and produces a finer-grained segmentation
* The string to be segmented can be a unicode, UTF-8, or GBK string. Note: passing a GBK string directly is not recommended, because it may be unpredictably mis-decoded as UTF-8
* Both `jieba.cut` and `jieba.cut_for_search` return an iterable generator; you can use a for loop to obtain each word (unicode) produced by the segmentation, or use
@@ -57,6 +58,9 @@ jieba
# encoding=utf-8
import jieba
seg_list = jieba.cut("我来到北京清华大学", use_paddle=True)
print("Paddle Mode: " + "/ ".join(seg_list))  # paddle mode
seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print("Full Mode: " + "/ ".join(seg_list))  # full mode
@@ -192,11 +196,13 @@ https://github.com/fxsjy/jieba/blob/master/test/extract_tags.py
-----------
* `jieba.posseg.POSTokenizer(tokenizer=None)` creates a custom tokenizer; the `tokenizer` parameter specifies the internal `jieba.Tokenizer` to use. `jieba.posseg.dt` is the default POS-tagging tokenizer.
* Tags the part of speech of every word after segmentation, using labels compatible with ictclas.
* In addition to jieba's default mode, POS tagging is also provided in paddle mode. To use it, first install paddlepaddle-tiny: `pip install paddlepaddle-tiny==1.6.1`.
* Usage example
```pycon
>>> import jieba.posseg as pseg
>>> words = pseg.cut("我爱北京天安门")
>>> words = pseg.cut("我爱北京天安门") # jieba default mode
>>> words = pseg.cut("我爱北京天安门", use_paddle=True) # paddle mode
>>> for word, flag in words:
... print('%s %s' % (word, flag))
...
@@ -206,6 +212,21 @@ https://github.com/fxsjy/jieba/blob/master/test/extract_tags.py
天安门 ns
```
The POS tag mapping for paddle mode is given below.
The set of POS and named-entity labels used in paddle mode is listed in the following table: 24 POS labels (lowercase) and 4 named-entity labels (uppercase). A short usage sketch follows the table.

| Label | Meaning | Label | Meaning | Label | Meaning | Label | Meaning |
| ---- | -------- | ---- | -------- | ---- | -------- | ---- | -------- |
| n | common noun | f | localizer | s | place noun | t | time |
| nr | person name | ns | place name | nt | organization name | nw | work title |
| nz | other proper noun | v | common verb | vd | verb-adverb | vn | gerund (noun-verb) |
| a | adjective | ad | adverb-adjective | an | noun-adjective | d | adverb |
| m | numeral | q | measure word | r | pronoun | p | preposition |
| c | conjunction | u | particle | xc | other function word | w | punctuation |
| PER | person name | LOC | place name | ORG | organization name | TIME | time |
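Below is a small sketch of consuming these labels in paddle mode, e.g. to pull out the uppercase named-entity tags; the printed output is illustrative only:

```python
import jieba.posseg as pseg

words = pseg.cut("我爱北京天安门", use_paddle=True)
entities = [(word, flag) for word, flag in words if flag in ("PER", "LOC", "ORG", "TIME")]
print(entities)  # illustrative, e.g. [('北京', 'LOC'), ('天安门', 'LOC')]
```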
5. Parallel segmentation
-----------
* Principle: split the target text by line, distribute the lines across multiple Python processes for parallel segmentation, then merge the results, yielding a considerable speed-up

@@ -20,6 +20,7 @@ if os.name == 'nt':
else:
_replace_file = os.rename
_get_abs_path = lambda path: os.path.normpath(os.path.join(os.getcwd(), path))
DEFAULT_DICT = None
@@ -272,7 +273,7 @@ class Tokenizer(object):
for elem in buf:
yield elem
    def cut(self, sentence, cut_all=False, HMM=True):
    def cut(self, sentence, cut_all=False, HMM=True, use_paddle=False):
'''
The main function that segments an entire sentence that contains
Chinese characters into separated words.
@@ -282,8 +283,18 @@ class Tokenizer(object):
- cut_all: Model type. True for full pattern, False for accurate pattern.
- HMM: Whether to use the Hidden Markov Model.
'''
is_paddle_installed = False
if use_paddle == True:
import_paddle_check = import_paddle()
is_paddle_installed = check_paddle_install()
sentence = strdecode(sentence)
if use_paddle == True and is_paddle_installed == True and import_paddle_check == True:
results = predict.get_sent(sentence)
for sent in results:
if sent is None:
continue
yield sent
return
if cut_all:
re_han = re_han_cut_all
re_skip = re_skip_cut_all
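In short, the paddle branch only runs when both checks succeed; otherwise the method falls through to the original segmentation path below. A hedged usage sketch:

```python
import jieba

# if paddlepaddle-tiny==1.6.1 is not installed, a debug hint is logged and this
# call behaves exactly like the default jieba.cut
for word in jieba.cut("我来到北京清华大学", use_paddle=True):
    print(word)
```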

@@ -1,6 +1,16 @@
# -*- coding: utf-8 -*-
import os
import sys
import imp
import logging
log_console = logging.StreamHandler(sys.stderr)
default_logger = logging.getLogger(__name__)
default_logger.setLevel(logging.DEBUG)
def setLogLevel(log_level):
    global default_logger
default_logger.setLevel(log_level)
try:
import pkg_resources
@@ -10,6 +20,29 @@ except ImportError:
get_module_res = lambda *res: open(os.path.normpath(os.path.join(
os.getcwd(), os.path.dirname(__file__), *res)), 'rb')
try:
import paddle
if paddle.__version__ == '1.6.1':
import paddle.fluid as fluid
import jieba.lac_small.predict as predict
except ImportError:
pass
def import_paddle():
import_paddle_check = False
try:
import paddle
if paddle.__version__ == '1.6.1':
import paddle.fluid as fluid
import jieba.lac_small.predict as predict
import_paddle_check = True
except ImportError:
default_logger.debug("Import paddle error, please use command to install: pip install paddlepaddle-tiny==1.6.1. Now, back to jieba basic cut......")
return False
return import_paddle_check
PY2 = sys.version_info[0] == 2
default_encoding = sys.getfilesystemencoding()
@@ -44,3 +77,19 @@ def resolve_filename(f):
return f.name
except AttributeError:
return repr(f)
def check_paddle_install():
is_paddle_installed = False
try:
if imp.find_module('paddle') and paddle.__version__ == '1.6.1':
is_paddle_installed = True
elif paddle.__version__ != '1.6.1':
is_paddle_installed = False
default_logger.debug("Check the paddle version is not correct, please\
use command to install paddle: pip uninstall paddlepaddle(-gpu), \
pip install paddlepaddle-tiny==1.6.1. Now, back to jieba basic cut......")
except ImportError:
default_logger.debug("import paddle error, back to jieba basic cut......")
is_paddle_installed = False
return is_paddle_installed
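For reference, a minimal sketch of how these two helpers are intended to be combined, mirroring their use in `Tokenizer.cut` elsewhere in this PR:

```python
from jieba._compat import import_paddle, check_paddle_install

import_paddle_check = import_paddle()          # logs a hint if paddle is missing
is_paddle_installed = check_paddle_install()   # verifies the 1.6.1 version requirement
if import_paddle_check and is_paddle_installed:
    import jieba.lac_small.predict as predict  # safe: paddle 1.6.1 is available
```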

@@ -0,0 +1,46 @@
# -*- coding: UTF-8 -*-
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Define the function to create lexical analysis model and model's data reader
"""
import sys
import os
import math
import paddle
import paddle.fluid as fluid
from paddle.fluid.initializer import NormalInitializer
import jieba.lac_small.nets as nets
def create_model(vocab_size, num_labels, mode='train'):
"""create lac model"""
# model's input data
words = fluid.data(name='words', shape=[-1, 1], dtype='int64', lod_level=1)
targets = fluid.data(
name='targets', shape=[-1, 1], dtype='int64', lod_level=1)
# for inference process
if mode == 'infer':
crf_decode = nets.lex_net(
words, vocab_size, num_labels, for_infer=True, target=None)
return {
"feed_list": [words],
"words": words,
"crf_decode": crf_decode,
}
return ret
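For context, `predict.py` in this same PR builds its inference program from `create_model` roughly like this (sketch trimmed from that file):

```python
import paddle.fluid as fluid
import jieba.lac_small.creator as creator
import jieba.lac_small.reader_small as reader_small

dataset = reader_small.Dataset()
infer_program = fluid.Program()
with fluid.program_guard(infer_program, fluid.default_startup_program()):
    with fluid.unique_name.guard():
        infer_ret = creator.create_model(dataset.vocab_size, dataset.num_labels, mode='infer')
infer_program = infer_program.clone(for_test=True)

exe = fluid.Executor(fluid.CPUPlace())
exe.run(fluid.default_startup_program())
```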

@@ -0,0 +1,122 @@
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
The function lex_net(args) define the lexical analysis network structure
"""
import sys
import os
import math
import paddle.fluid as fluid
from paddle.fluid.initializer import NormalInitializer
def lex_net(word, vocab_size, num_labels, for_infer=True, target=None):
"""
define the lexical analysis network structure
word: stores the input of the model
for_infer: a boolean value, indicating if the model to be created is for training or predicting.
    return:
        the CRF decoding result (the predicted tag sequence)
"""
word_emb_dim=128
grnn_hidden_dim=128
bigru_num=2
emb_lr = 1.0
crf_lr = 1.0
init_bound = 0.1
IS_SPARSE = True
def _bigru_layer(input_feature):
"""
define the bidirectional gru layer
"""
pre_gru = fluid.layers.fc(
input=input_feature,
size=grnn_hidden_dim * 3,
param_attr=fluid.ParamAttr(
initializer=fluid.initializer.Uniform(
low=-init_bound, high=init_bound),
regularizer=fluid.regularizer.L2DecayRegularizer(
regularization_coeff=1e-4)))
gru = fluid.layers.dynamic_gru(
input=pre_gru,
size=grnn_hidden_dim,
param_attr=fluid.ParamAttr(
initializer=fluid.initializer.Uniform(
low=-init_bound, high=init_bound),
regularizer=fluid.regularizer.L2DecayRegularizer(
regularization_coeff=1e-4)))
pre_gru_r = fluid.layers.fc(
input=input_feature,
size=grnn_hidden_dim * 3,
param_attr=fluid.ParamAttr(
initializer=fluid.initializer.Uniform(
low=-init_bound, high=init_bound),
regularizer=fluid.regularizer.L2DecayRegularizer(
regularization_coeff=1e-4)))
gru_r = fluid.layers.dynamic_gru(
input=pre_gru_r,
size=grnn_hidden_dim,
is_reverse=True,
param_attr=fluid.ParamAttr(
initializer=fluid.initializer.Uniform(
low=-init_bound, high=init_bound),
regularizer=fluid.regularizer.L2DecayRegularizer(
regularization_coeff=1e-4)))
bi_merge = fluid.layers.concat(input=[gru, gru_r], axis=1)
return bi_merge
def _net_conf(word, target=None):
"""
Configure the network
"""
word_embedding = fluid.embedding(
input=word,
size=[vocab_size, word_emb_dim],
dtype='float32',
is_sparse=IS_SPARSE,
param_attr=fluid.ParamAttr(
learning_rate=emb_lr,
name="word_emb",
initializer=fluid.initializer.Uniform(
low=-init_bound, high=init_bound)))
input_feature = word_embedding
for i in range(bigru_num):
bigru_output = _bigru_layer(input_feature)
input_feature = bigru_output
emission = fluid.layers.fc(
size=num_labels,
input=bigru_output,
param_attr=fluid.ParamAttr(
initializer=fluid.initializer.Uniform(
low=-init_bound, high=init_bound),
regularizer=fluid.regularizer.L2DecayRegularizer(
regularization_coeff=1e-4)))
size = emission.shape[1]
fluid.layers.create_parameter(
shape=[size + 2, size], dtype=emission.dtype, name='crfw')
crf_decode = fluid.layers.crf_decoding(
input=emission, param_attr=fluid.ParamAttr(name='crfw'))
return crf_decode
return _net_conf(word)
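The network is: word embedding -> `bigru_num` stacked bidirectional GRU layers -> fully connected emission layer -> CRF decoding. A minimal sketch of calling it directly, following `creator.create_model` above (the sizes below are placeholders; real values come from `reader_small.Dataset`):

```python
import paddle.fluid as fluid
import jieba.lac_small.nets as nets

vocab_size, num_labels = 20940, 57   # placeholders; use Dataset.vocab_size / Dataset.num_labels
words = fluid.data(name='words', shape=[-1, 1], dtype='int64', lod_level=1)
crf_decode = nets.lex_net(words, vocab_size, num_labels, for_infer=True, target=None)
```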

@@ -0,0 +1,82 @@
# -*- coding: UTF-8 -*-
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import os
import time
import sys
import paddle.fluid as fluid
import paddle
import jieba.lac_small.utils as utils
import jieba.lac_small.creator as creator
import jieba.lac_small.reader_small as reader_small
import numpy
word_emb_dim=128
grnn_hidden_dim=128
bigru_num=2
use_cuda=False
basepath = os.path.abspath(__file__)
folder = os.path.dirname(basepath)
init_checkpoint = os.path.join(folder, "model_baseline")
batch_size=1
dataset = reader_small.Dataset()
infer_program = fluid.Program()
with fluid.program_guard(infer_program, fluid.default_startup_program()):
with fluid.unique_name.guard():
infer_ret = creator.create_model(dataset.vocab_size, dataset.num_labels, mode='infer')
infer_program = infer_program.clone(for_test=True)
place = fluid.CPUPlace()
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
utils.init_checkpoint(exe, init_checkpoint, infer_program)
results = []
def get_sent(str1):
feed_data=dataset.get_vars(str1)
a = numpy.array(feed_data).astype(numpy.int64)
a=a.reshape(-1,1)
c = fluid.create_lod_tensor(a, [[a.shape[0]]], place)
words, crf_decode = exe.run(
infer_program,
fetch_list=[infer_ret['words'], infer_ret['crf_decode']],
feed={"words":c, },
return_numpy=False,
use_program_cache=True)
sents=[]
sent,tag = utils.parse_result(words, crf_decode, dataset)
sents = sents + sent
return sents
def get_result(str1):
feed_data=dataset.get_vars(str1)
a = numpy.array(feed_data).astype(numpy.int64)
a=a.reshape(-1,1)
c = fluid.create_lod_tensor(a, [[a.shape[0]]], place)
words, crf_decode = exe.run(
infer_program,
fetch_list=[infer_ret['words'], infer_ret['crf_decode']],
feed={"words":c, },
return_numpy=False,
use_program_cache=True)
results=[]
results += utils.parse_result(words, crf_decode, dataset)
return results
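Importing this module builds the inference program and loads the checkpoint at import time; a short usage sketch of the two entry points, which is how `jieba.cut` and `jieba.posseg.cut` call into it elsewhere in this PR:

```python
import jieba.lac_small.predict as predict

words = predict.get_sent(u"我来到北京清华大学")          # list of segmented words
words, tags = predict.get_result(u"我来到北京清华大学")  # parallel lists: words and their labels
```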

@@ -0,0 +1,100 @@
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
The file_reader converts raw corpus to input.
"""
import os
import __future__
import io
import paddle
import paddle.fluid as fluid
def load_kv_dict(dict_path,
reverse=False,
delimiter="\t",
key_func=None,
value_func=None):
"""
Load key-value dict from file
"""
result_dict = {}
for line in io.open(dict_path, "r", encoding='utf8'):
terms = line.strip("\n").split(delimiter)
if len(terms) != 2:
continue
if reverse:
value, key = terms
else:
key, value = terms
if key in result_dict:
raise KeyError("key duplicated with [%s]" % (key))
if key_func:
key = key_func(key)
if value_func:
value = value_func(value)
result_dict[key] = value
return result_dict
class Dataset(object):
"""data reader"""
def __init__(self):
# read dict
basepath = os.path.abspath(__file__)
folder = os.path.dirname(basepath)
word_dict_path = os.path.join(folder, "word.dic")
label_dict_path = os.path.join(folder, "tag.dic")
self.word2id_dict = load_kv_dict(
word_dict_path, reverse=True, value_func=int)
self.id2word_dict = load_kv_dict(word_dict_path)
self.label2id_dict = load_kv_dict(
label_dict_path, reverse=True, value_func=int)
self.id2label_dict = load_kv_dict(label_dict_path)
@property
def vocab_size(self):
"""vocabuary size"""
return max(self.word2id_dict.values()) + 1
@property
def num_labels(self):
"""num_labels"""
return max(self.label2id_dict.values()) + 1
def word_to_ids(self, words):
"""convert word to word index"""
word_ids = []
for word in words:
if word not in self.word2id_dict:
word = "OOV"
word_id = self.word2id_dict[word]
word_ids.append(word_id)
return word_ids
def label_to_ids(self, labels):
"""convert label to label index"""
label_ids = []
for label in labels:
if label not in self.label2id_dict:
label = "O"
label_id = self.label2id_dict[label]
label_ids.append(label_id)
return label_ids
def get_vars(self,str1):
words = str1.strip()
word_ids = self.word_to_ids(words)
return word_ids
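A quick sketch of the reader on its own (the dictionary files are resolved relative to the module, as in `__init__` above):

```python
import jieba.lac_small.reader_small as reader_small

dataset = reader_small.Dataset()
print(dataset.vocab_size, dataset.num_labels)   # num_labels should be 57, matching tag.dic
word_ids = dataset.get_vars(u"我爱北京")          # one id per character; unknown characters map to "OOV"
```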

@@ -0,0 +1,57 @@
0 a-B
1 a-I
2 ad-B
3 ad-I
4 an-B
5 an-I
6 c-B
7 c-I
8 d-B
9 d-I
10 f-B
11 f-I
12 m-B
13 m-I
14 n-B
15 n-I
16 nr-B
17 nr-I
18 ns-B
19 ns-I
20 nt-B
21 nt-I
22 nw-B
23 nw-I
24 nz-B
25 nz-I
26 p-B
27 p-I
28 q-B
29 q-I
30 r-B
31 r-I
32 s-B
33 s-I
34 t-B
35 t-I
36 u-B
37 u-I
38 v-B
39 v-I
40 vd-B
41 vd-I
42 vn-B
43 vn-I
44 w-B
45 w-I
46 xc-B
47 xc-I
48 PER-B
49 PER-I
50 LOC-B
51 LOC-I
52 ORG-B
53 ORG-I
54 TIME-B
55 TIME-I
56 O
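Every label appears with a `-B` (begin) and `-I` (inside) variant, plus a single `O` tag; `utils.parse_result` in this PR merges the per-character predictions back into words using that scheme. An illustrative walk-through of the merge rule (the tag sequence here is hypothetical; the logic mirrors `parse_result`):

```python
chars = [u"我", u"爱", u"北", u"京", u"天", u"安", u"门"]
tags = ["r-B", "v-B", "ns-B", "ns-I", "ns-B", "ns-I", "ns-I"]  # hypothetical model output

words, word_tags, partial = [], [], ""
for ind, tag in enumerate(tags):
    if partial == "":                      # first character of the sentence
        partial = chars[ind]
        word_tags.append(tag.split('-')[0])
        continue
    if tag.endswith("-B") or (tag == "O" and tags[ind - 1] != "O"):
        words.append(partial)              # a new word starts here
        word_tags.append(tag.split('-')[0])
        partial = chars[ind]
        continue
    partial += chars[ind]                  # still inside the current word
if len(words) < len(word_tags):
    words.append(partial)                  # flush the last word

print(words)      # ['我', '爱', '北京', '天安门']
print(word_tags)  # ['r', 'v', 'ns', 'ns']
```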

@@ -0,0 +1,142 @@
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
util tools
"""
from __future__ import print_function
import os
import sys
import numpy as np
import paddle.fluid as fluid
import io
def str2bool(v):
"""
argparse does not support True or False in python
"""
return v.lower() in ("true", "t", "1")
def parse_result(words, crf_decode, dataset):
""" parse result """
offset_list = (crf_decode.lod())[0]
words = np.array(words)
crf_decode = np.array(crf_decode)
batch_size = len(offset_list) - 1
for sent_index in range(batch_size):
begin, end = offset_list[sent_index], offset_list[sent_index + 1]
sent=[]
for id in words[begin:end]:
if dataset.id2word_dict[str(id[0])]=='OOV':
sent.append(' ')
else:
sent.append(dataset.id2word_dict[str(id[0])])
tags = [
dataset.id2label_dict[str(id[0])] for id in crf_decode[begin:end]
]
sent_out = []
tags_out = []
parital_word = ""
for ind, tag in enumerate(tags):
# for the first word
if parital_word == "":
parital_word = sent[ind]
tags_out.append(tag.split('-')[0])
continue
# for the beginning of word
if tag.endswith("-B") or (tag == "O" and tags[ind - 1] != "O"):
sent_out.append(parital_word)
tags_out.append(tag.split('-')[0])
parital_word = sent[ind]
continue
parital_word += sent[ind]
# append the last word, except for len(tags)=0
if len(sent_out) < len(tags_out):
sent_out.append(parital_word)
return sent_out,tags_out
def parse_padding_result(words, crf_decode, seq_lens, dataset):
""" parse padding result """
words = np.squeeze(words)
batch_size = len(seq_lens)
batch_out = []
for sent_index in range(batch_size):
sent=[]
for id in words[begin:end]:
if dataset.id2word_dict[str(id[0])]=='OOV':
sent.append(' ')
else:
sent.append(dataset.id2word_dict[str(id[0])])
tags = [
dataset.id2label_dict[str(id)]
for id in crf_decode[sent_index][1:seq_lens[sent_index] - 1]
]
sent_out = []
tags_out = []
parital_word = ""
for ind, tag in enumerate(tags):
# for the first word
if parital_word == "":
parital_word = sent[ind]
tags_out.append(tag.split('-')[0])
continue
# for the beginning of word
if tag.endswith("-B") or (tag == "O" and tags[ind - 1] != "O"):
sent_out.append(parital_word)
tags_out.append(tag.split('-')[0])
parital_word = sent[ind]
continue
parital_word += sent[ind]
# append the last word, except for len(tags)=0
if len(sent_out) < len(tags_out):
sent_out.append(parital_word)
batch_out.append([sent_out, tags_out])
return batch_out
def init_checkpoint(exe, init_checkpoint_path, main_program):
"""
Init CheckPoint
"""
    assert os.path.exists(
        init_checkpoint_path), "[%s] cannot be found." % init_checkpoint_path
    def existed_persistables(var):
        """
        Whether the persistable variable exists in the checkpoint
        """
if not fluid.io.is_persistable(var):
return False
return os.path.exists(os.path.join(init_checkpoint_path, var.name))
fluid.io.load_vars(
exe,
init_checkpoint_path,
main_program=main_program,
        predicate=existed_persistables)

File diff suppressed because it is too large

@@ -269,13 +269,24 @@ def _lcut_internal_no_hmm(s):
return dt._lcut_internal_no_hmm(s)
def cut(sentence, HMM=True):
def cut(sentence, HMM=True, use_paddle=False):
"""
Global `cut` function that supports parallel processing.
Note that this only works using dt, custom POSTokenizer
instances are not supported.
"""
    is_paddle_installed = False
    if use_paddle == True:
        import_paddle_check = import_paddle()
        is_paddle_installed = check_paddle_install()
    if use_paddle == True and is_paddle_installed == True and import_paddle_check == True:
        sents, tags = predict.get_result(strdecode(sentence))
        for i, sent in enumerate(sents):
            if sent is None or tags[i] is None:
                continue
            yield pair(sent, tags[i])
        return
global dt
if jieba.pool is None:
for w in dt.cut(sentence, HMM=HMM):

@@ -71,5 +71,5 @@ setup(name='jieba',
      keywords='NLP,tokenizing,Chinese word segmentation',
packages=['jieba'],
package_dir={'jieba':'jieba'},
package_data={'jieba':['*.*','finalseg/*','analyse/*','posseg/*']}
package_data={'jieba':['*.*','finalseg/*','analyse/*','posseg/*', 'lac_small/*','lac_small/model_baseline/*']}
)
