A Perceptron-Based Chinese Word Segmentation System Written in Python

Tags: python, word segmentation, perceptron • File type: .rar • File size: 4.92MB • Downloads: 1 • 2023-09-18

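A character-based Chinese word segmentation system implemented with a perceptron. When fully trained, it reaches an accuracy of slightly over 96% on the Microsoft (MSR) test set. The version uploaded here is the complete code, covering both training and segmentation, in a single source file; you can train it yourself with the bundled Microsoft training data. Overall the code is written clearly, which makes it easy to read for me and for others. Discussion is welcome: xiatian@ict.ac.cn.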
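
As a rough illustration of what "character-based" means here: each character receives one of four position tags, b (begin), m (middle), e (end) or s (single), and segmentation reduces to predicting that tag sequence. The tag set is the one named in the code comments; everything else in the sketch below is an assumption. It is written in Python 3, is not taken from the uploaded file, and the function names `words_to_tags` and `tags_to_words` are hypothetical.

```python
def words_to_tags(words):
    """Turn a segmented sentence (list of words) into one tag per character."""
    tags = []
    for w in words:
        if not w:
            continue  # ignore empty strings
        if len(w) == 1:
            tags.append("s")  # single-character word
        else:
            # first character, any middle characters, last character
            tags.extend(["b"] + ["m"] * (len(w) - 2) + ["e"])
    return tags


def tags_to_words(chars, tags):
    """Rebuild words from characters and their b/m/e/s tags."""
    words, cur = [], ""
    for ch, tag in zip(chars, tags):
        cur += ch
        if tag in ("e", "s"):  # a word ends on this character
            words.append(cur)
            cur = ""
    if cur:  # keep a trailing partial word if the tags end on b or m
        words.append(cur)
    return words


sentence = ["我", "喜欢", "自然语言"]
tags = words_to_tags(sentence)
restored = tags_to_words("".join(sentence), tags)
```

Training data in segmented form can be converted with the first function, and a predicted tag sequence can be turned back into words with the second.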

Code snippet and file information

# -*- coding: cp936 -*-

import os
import time
import random
import cPickle

__author__ = "summer rain"
__email__ = "xiatian@ict.ac.cn"

class CPTTrain:
    def __init__(self, segment, train):
        self.__char_type = {}
        data_path = "PTData"
        for ind, name in enumerate(["punc", "alph", "date", "num"]):
            fn = data_path + "/" + name
            if os.path.isfile(fn):
                for line in file(fn, "rU"):
                    self.__char_type[line.strip().decode("cp936")] = ind
            else:
                print "can't open", fn
                exit()

        self.__train_insts = None           # all instances for training.
        self.__feats_weight = None          # ["b", "m", "e", "s"][all the features] --> weight.
        self.__words_num = None             # total number of words in all the instances.
        self.__insts_num = None             # namely the number of sentences.
        self.__cur_ite_ID = None            # current iteration index.
        self.__cur_inst_ID = None           # index of the current instance.
        self.__real_inst_ID = None          # the actual index in the training instances after randomizing.
        self.__last_update = None           # ["b".."s"][feature] --> [last_update_ite_ID, last_update_inst_ID]
        self.__feats_weight_sum = None      # sum of ["b".."s"][feature] from beginning to end.

        if segment and train or not segment and not train:
            print "there is only a True and False in segment and train"
            exit()
        elif train:
            self.Train = self.__Train
        else:
            self.__LoadModel()
            self.Segment = self.__Segment

    def __LoadModel(self):
        model = "PTData/avgmodel"
        print "load", model, "..."
        self.__feats_weight = {}
        if os.path.isfile(model):
            start = time.clock()
            self.__feats_weight = cPickle.load(file(model, "rb"))
            end = time.clock()
            print "It takes %d seconds" % (end - start)
        else:
            print "can't open", model

    def __Train(self, corp_file_name, max_train_num, max_ite_num):
        if not self.__LoadCorp(corp_file_name, max_train_num):
            return False

        starttime = time.clock()

        self.__feats_weight = {}
        self.__last_update = {}
        self.__feats_weight_sum = {}

        for self.__cur_ite_ID in xrange(max_ite_num):
            if self.__Iterate():
                break

        self.__SaveModel()
        endtime = time.clock()
        print "total iteration times is %d seconds" % (endtime - starttime)

        return True

    def __GenerateFeats(self, inst):
        inst_feat = []
        for ind, [c, tag, t] in enumerate(inst):
            inst_feat.append([])
            if t == -1:
                continue
            # Cn
            for n in xrange(-2, 3):
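
The attribute comments in the snippet above (`__last_update`, `__feats_weight_sum`) and the model file name `avgmodel` suggest an averaged perceptron whose weight sums are brought up to date lazily, only when a feature is actually touched. The following is a minimal Python 3 sketch of that general technique, not the author's implementation: the class name `AveragedWeights`, its methods, and the feature string `"C0=中"` are hypothetical, and a single step counter stands in for the (iteration, instance) pair the original appears to store.

```python
from collections import defaultdict


class AveragedWeights:
    """Perceptron weights with lazily maintained running sums (illustrative only)."""

    def __init__(self):
        self.weight = defaultdict(float)  # (tag, feature) -> current weight
        self.total = defaultdict(float)   # (tag, feature) -> weight summed over past steps
        self.stamp = defaultdict(int)     # (tag, feature) -> step at which total was last refreshed
        self.step = 0                     # number of training instances processed so far

    def tick(self):
        """Call once after each training instance."""
        self.step += 1

    def update(self, key, delta):
        """Change one weight, first crediting the old value for the steps it was in force."""
        self.total[key] += (self.step - self.stamp[key]) * self.weight[key]
        self.stamp[key] = self.step
        self.weight[key] += delta

    def average(self):
        """Return the averaged weights over all processed steps."""
        avg = {}
        for key, w in self.weight.items():
            total = self.total[key] + (self.step - self.stamp[key]) * w
            avg[key] = total / self.step if self.step else w
        return avg


w = AveragedWeights()
key = ("b", "C0=中")          # hypothetical (tag, feature) pair
w.update(key, 1.0); w.tick()  # instance 1: weight raised
w.tick()                      # instance 2: no mistake, nothing touched
w.update(key, -1.0); w.tick() # instance 3: weight lowered
averaged = w.average()
```

In this usage the weight has the values 1, 1, 0 after the three instances, so its average is 2/3. The point of the lazy bookkeeping is that an instance costs work only for the features it updates, instead of adding every weight to its sum after every instance.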

Attribute        Size  Date       Time   Name
----------- ---------  ---------- -----  ----
File            11158  2008-05-26 19:11  PTTrain.py
File              260  2007-12-04 13:00  PTData\alph
File           976127  2008-05-26 10:58  PTData\avgmodel
File               17  2007-12-04 13:00  PTData\date
File              110  2007-12-04 13:00  PTData\num
File              270  2007-12-04 13:00  PTData\punc
Directory           0  2008-05-23 14:28  PTData
File         24476617  2007-12-17 10:53  msr_train.txt
----------- ---------  ---------- -----  ----
             25464559                    8
