CCKS2017 Medical Record Annotation


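CCKS2017 medical record annotation. CCKS2017 Task 2 data format: each medical record is divided into 4 sections, stored in 4 separate folders: 一般项目 (general items), 病史特征 (medical history features), 诊疗过程 (diagnosis and treatment process), and 出院情况 (discharge status). Each directory stores two kinds of files; judging from the script below, these are the `<section>-<i>.txtoriginal.txt` source texts and the `<section>-<i>.txt` annotation files. Note that the script refers to two of these folders as 病史特点 and 诊疗经过.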
Code snippet and file information
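# coding:utf-8

# fio is not a standard-library module; presumably a helper shipped with the
# project that provides ReadFileUTF8, AddTrain and AddTest.
import fio
import codecs
import sys
import os
import jieba.posseg as pseg

datadir = "../data2/training dataset v4"
# One folder per record section.
area = ["病史特点", "出院情况", "一般项目", "诊疗经过"]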

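class CRF_unit:
    def __init__(self):
        # Either single characters (test data) or
        # [character, POS flag, label] rows (training data).
        self.features = []

    def test_into_aline(self, filename):
        # Flatten a test file into a list of single characters.
        self.features = []
        sentences = fio.ReadFileUTF8(filename)
        for sentence in sentences:
            for token in sentence:
                self.features.append(token)

    def get_posTag(self, sentence):
        # jieba yields pairs exposing .word and .flag (part of speech).
        words = pseg.cut(sentence)
        return words

    def get_token(self, filename):
        # Build one [character, POS flag, label] row per character;
        # "N" is the default label for characters outside any entity.
        self.features = []
        sentences = fio.ReadFileUTF8(filename)
        for sentence in sentences:
            words = self.get_posTag(sentence)
            for w in words:
                for token in w.word:
                    feature = [token, w.flag, "N"]
                    self.features.append(feature)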
                
    def read_type(self, itype):
        # Map a Chinese entity category to its tag name (None if unknown).
        # The category is compared as text, so it is not encoded to bytes.
        type_tags = {
            u"症状和体征": "SIGNS",
            u"检查和检验": "CHECK",
            u"疾病和诊断": "DISEASE",
            u"治疗": "TREATMENT",
            u"身体部位": "BODY",
        }
        return type_tags.get(itype)


    def get_type(self, filename):
        # Each annotation line ends with: start offset, end offset, category.
        # Offsets index into self.features; the end offset is inclusive.
        sentences = fio.ReadFileUTF8(filename)
        for sentence in sentences:
            words = sentence.split()
            if len(words) < 3:
                continue  # skip blank or malformed lines
            print(words[-3] + words[-2])
            x = int(words[-3])
            y = int(words[-2])

            itype = self.read_type(words[-1])
            self.features[x][2] = "B-" + itype
            for j in range(x + 1, y + 1):
                self.features[j][2] = "I-" + itype



if __name__ == '__main__':
    extractor = CRF_unit()
    x = 0  # index into area: which record section to process

    # Training-data generation for records 1-240, currently disabled.
    """
    for i in range(1, 241):
        filename = datadir + '/' + area[x] + '/' + area[x] + '-' + str(i) + '.txtoriginal.txt'
        extractor.get_token(filename)

        filename = datadir + '/' + area[x] + '/' + area[x] + '-' + str(i) + '.txt'
        extractor.get_type(filename)

        filename = datadir + '/result/' + area[x] + "/" + '1-240_train.txt'
        fio.AddTrain(extractor.features, filename)
    """

    # Test-data generation for records 241-300.
    for i in range(241, 301):
        filename = datadir + '/' + area[x] + '/' + area[x] + '-' + str(i) + '.txtoriginal.txt'
        extractor.test_into_aline(filename)

        filename = datadir + '/result/' + area[x] + '.testt-' + str(i) + '.txt'
        fio.AddTest(extractor.features, filename)
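
The labelling step in get_type can be hard to follow without the data files at hand, so here is a minimal, self-contained sketch of the same B-/I-/N scheme. It does not use fio or jieba. The names TYPE_TAGS, parse_annotation and bio_tags, the sample sentence, and the sample annotation line are all hypothetical; the only assumptions taken from the script are that an annotation line ends with a start offset, an inclusive end offset, and a category.

```python
# coding: utf-8

# Category-to-tag mapping, mirroring read_type in the script.
TYPE_TAGS = {
    "症状和体征": "SIGNS",
    "检查和检验": "CHECK",
    "疾病和诊断": "DISEASE",
    "治疗": "TREATMENT",
    "身体部位": "BODY",
}


def parse_annotation(line):
    """Return (start, end, category) from the last three fields of a line."""
    fields = line.split()
    return int(fields[-3]), int(fields[-2]), fields[-1]


def bio_tags(text, annotations):
    """Return one (character, tag) pair per character of text.

    annotations holds (start, end, category) tuples with inclusive
    character offsets (an assumption based on get_type).
    """
    tags = ["N"] * len(text)  # "N" marks characters outside any entity
    for start, end, category in annotations:
        label = TYPE_TAGS[category]
        tags[start] = "B-" + label
        for j in range(start + 1, end + 1):
            tags[j] = "I-" + label
    return list(zip(text, tags))


if __name__ == "__main__":
    text = "患者右下腹疼痛"          # hypothetical sentence
    line = "右下腹 2 4 身体部位"     # hypothetical annotation line
    for char, tag in bio_tags(text, [parse_annotation(line)]):
        print(char, tag)
```

Reading the offsets from the end of the line, as the script does, keeps the parse working even if the entity text itself contains whitespace.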
    




  Attribute      Size  Date       Time   Name
----------- ---------  ---------- -----  ----
        dir         0  2017-11-22 09:15  CCKS2017
        dir         0  2017-08-09 10:18  CCKS2017\CCKS2017_dataset
        dir         0  2017-08-09 10:18  CCKS2017\CCKS2017_dataset\.git
       file        23  2017-08-09 10:18  CCKS2017\CCKS2017_dataset\.git\HEAD
        dir         0  2017-08-09 10:14  CCKS2017\CCKS2017_dataset\.git\branches
       file       268  2017-08-09 10:18  CCKS2017\CCKS2017_dataset\.git\config
       file        73  2017-08-09 10:14  CCKS2017\CCKS2017_dataset\.git\description
        dir         0  2017-08-09 10:14  CCKS2017\CCKS2017_dataset\.git\hooks
       file       478  2017-08-09 10:14  CCKS2017\CCKS2017_dataset\.git\hooks\applypatch-msg.sample
       file       896  2017-08-09 10:14  CCKS2017\CCKS2017_dataset\.git\hooks\commit-msg.sample
       file       189  2017-08-09 10:14  CCKS2017\CCKS2017_dataset\.git\hooks\post-update.sample
       file       424  2017-08-09 10:14  CCKS2017\CCKS2017_dataset\.git\hooks\pre-applypatch.sample
       file      1642  2017-08-09 10:14  CCKS2017\CCKS2017_dataset\.git\hooks\pre-commit.sample
       file      1348  2017-08-09 10:14  CCKS2017\CCKS2017_dataset\.git\hooks\pre-push.sample
       file      4898  2017-08-09 10:14  CCKS2017\CCKS2017_dataset\.git\hooks\pre-rebase.sample
       file      1239  2017-08-09 10:14  CCKS2017\CCKS2017_dataset\.git\hooks\prepare-commit-msg.sample
       file      3610  2017-08-09 10:14  CCKS2017\CCKS2017_dataset\.git\hooks\update.sample
       file   1960281  2017-08-09 10:18  CCKS2017\CCKS2017_dataset\.git\index
        dir         0  2017-08-09 10:14  CCKS2017\CCKS2017_dataset\.git\info
       file       240  2017-08-09 10:14  CCKS2017\CCKS2017_dataset\.git\info\exclude
        dir         0  2017-08-09 10:18  CCKS2017\CCKS2017_dataset\.git\logs
       file       187  2017-08-09 10:18  CCKS2017\CCKS2017_dataset\.git\logs\HEAD
        dir         0  2017-08-09 10:18  CCKS2017\CCKS2017_dataset\.git\logs\refs
        dir         0  2017-08-09 10:18  CCKS2017\CCKS2017_dataset\.git\logs\refs\heads
       file       187  2017-08-09 10:18  CCKS2017\CCKS2017_dataset\.git\logs\refs\heads\master
        dir         0  2017-08-09 10:18  CCKS2017\CCKS2017_dataset\.git\logs\refs\remotes
        dir         0  2017-08-09 10:18  CCKS2017\CCKS2017_dataset\.git\logs\refs\remotes\origin
       file       187  2017-08-09 10:18  CCKS2017\CCKS2017_dataset\.git\logs\refs\remotes\origin\HEAD
        dir         0  2017-08-09 10:14  CCKS2017\CCKS2017_dataset\.git\objects
        dir         0  2017-08-09 10:14  CCKS2017\CCKS2017_dataset\.git\objects\info
        dir         0  2017-08-09 10:18  CCKS2017\CCKS2017_dataset\.git\objects\pack
............ (13886 more file entries omitted)
