A Word-Vector-Based Method and Apparatus for English-Chinese Word Sense Mapping

Document No.: 17555230 | Published: 2019-04-30 18:34 | Source: China National Intellectual Property Administration
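The present invention relates to the technical field of natural language processing, and in particular to a word-vector-based method and apparatus for English-Chinese word sense mapping.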
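Background

Word sense mapping is the process of mapping the senses in a knowledge base from their description in one language to a description in another language. It is an important part of building basic language resources in natural language processing and, as a foundational task, strongly affects applications such as word sense disambiguation, semantic analysis, and machine translation.

Word sense mapping was at first done by hand, with annotators mapping the senses in a knowledge base one by one. This guarantees accuracy, but because a knowledge base contains a very large number of sense concepts, manual mapping is too time-consuming and labor-intensive to complete. As machine translation developed, researchers began to feed the sense to be mapped into a machine translation system and take its output as the mapping result. This is automatic and saves time and labor, but machine translation quality is unreliable, so mapping accuracy is hard to guarantee. Neither manual mapping nor machine-translation-based mapping meets the needs of sense mapping for large-scale knowledge bases.

To address these problems, the present invention proposes a word-vector-based method and apparatus for English-Chinese word sense mapping. The method considers both the gloss and the example sentences of a sense, uses word vectors to generate sentence vectors for the glosses and example sentences, compares the similarity of different senses through those sentence vectors, and thereby determines the target Chinese sense of the English sense to be mapped. The method overcomes the shortcomings of existing mapping methods and improves mapping accuracy.

Summary of the Invention

The present invention discloses a word-vector-based method and apparatus for English-Chinese word sense mapping, so that sense mapping can be performed more effectively. To this end, the invention provides the following technical solution.

A word-vector-based English-Chinese word sense mapping method comprises the following steps:

Step 1: extract the synset of the sense to be mapped from an English knowledge base, then look up the candidate Chinese senses of each synonym in the synset in an English-Chinese dictionary.
Step 2: extract the English gloss and example sentences of the sense to be mapped from the English knowledge base, and look up in the English-Chinese dictionary the English gloss and example sentences of each candidate Chinese sense obtained in step 1.
Step 3: train word vectors on a large-scale English corpus, then generate a sentence vector for each English gloss and example sentence obtained in step 2.
Step 4: compute the similarity between the sentence vectors of the gloss and example sentences of the sense to be mapped and those of the candidate Chinese senses obtained in step 3, then compute the composite similarity between the sense to be mapped and each candidate Chinese sense.
Step 5: select the candidate Chinese sense with the highest composite similarity as the target sense of the sense to be mapped.

Further, extracting the synset and looking up candidate Chinese senses in step 1 specifically comprises:
Step 1-1) extract the synset of the sense to be mapped from the English knowledge base.
Step 1-2) look up the candidate Chinese senses of each synonym in the synset in the English-Chinese dictionary.

Further, extracting the English glosses and example sentences in step 2 specifically comprises:
Step 2-1) extract the English gloss and example sentences of the sense to be mapped from the English knowledge base.
Step 2-2) look up in the English-Chinese dictionary the English gloss and example sentences of each candidate Chinese sense obtained in step 1-2).

Further, training word vectors and generating sentence vectors in step 3 specifically comprises:
Step 3-1) train word vectors on a large-scale English corpus.
Step 3-2) preprocess the English glosses and example sentences obtained in step 2, including lemmatization and content-word extraction.
Step 3-3) using the word vectors from step 3-1), generate a sentence vector for each English gloss and example sentence processed in step 3-2). Let s denote an English gloss or example sentence and w a content word in it; the sentence vector of s is given by formula (1):

$$\vec{v}(s) = \sum_{k=1}^{|s|} \vec{v}(w_k) \tag{1}$$

where |s| is the number of content words in s and \vec{v}(w_k) is the word vector of content word w_k.

Further, computing sense similarity in step 4 specifically comprises:
Step 4-1) compute the similarity between the sentence vectors of the gloss and example sentences of the sense to be mapped and those of the candidate Chinese senses obtained in step 3. Let s denote an English gloss or example sentence; the sentence-vector similarity of any two sentences s_i and s_j is given by formula (2):

$$\mathrm{sim}(s_i, s_j) = \frac{\vec{v}(s_i) \cdot \vec{v}(s_j)}{\lVert \vec{v}(s_i) \rVert \, \lVert \vec{v}(s_j) \rVert} \tag{2}$$

where \vec{v}(s_i) and \vec{v}(s_j) are the sentence vectors of s_i and s_j, and \lVert \vec{v}(s_i) \rVert and \lVert \vec{v}(s_j) \rVert are their norms. Substituting formula (1) into formula (2) gives formula (3):

$$\mathrm{sim}(s_i, s_j) = \frac{\left(\sum_{k=1}^{|s_i|} \vec{v}(w_{ik})\right) \cdot \left(\sum_{l=1}^{|s_j|} \vec{v}(w_{jl})\right)}{\left\lVert \sum_{k=1}^{|s_i|} \vec{v}(w_{ik}) \right\rVert \, \left\lVert \sum_{l=1}^{|s_j|} \vec{v}(w_{jl}) \right\rVert} \tag{3}$$

So that similarity scores fall in a fixed range (at most 1 in magnitude) and are easy to compare later, the sentence vectors in formula (3) are normalized with the function norm(·), which turns formula (3) into formula (4):

$$\mathrm{sim}(s_i, s_j) = \mathrm{norm}\!\left(\sum_{k=1}^{|s_i|} \vec{v}(w_{ik})\right) \cdot \mathrm{norm}\!\left(\sum_{l=1}^{|s_j|} \vec{v}(w_{jl})\right) \tag{4}$$

Here norm(·) converts a vector into a unit vector. It changes only the length of the vector, not its direction, so the cosine similarity is unaffected.

Step 4-2) from the sentence-vector similarities of the glosses and example sentences obtained in step 4-1), compute the composite similarity between the sense to be mapped and a candidate Chinese sense. Let Bs denote the sense to be mapped in the English knowledge base and Ds a candidate Chinese sense; their composite similarity is given by formula (5):

$$\mathrm{score}(Bs, Ds) = \alpha \cdot \mathrm{sim}(Bs_{gl}, Ds_{gl}) + (1-\alpha) \cdot \max_{Bs_{ex} \in Bs_{exs},\; Ds_{ex} \in Ds_{exs}} \mathrm{sim}(Bs_{ex}, Ds_{ex}) \tag{5}$$

where Bs_gl is the English gloss of Bs, Ds_gl is the English gloss of Ds, Bs_exs is the set of English example sentences of Bs, Ds_exs is the set of English example sentences of Ds, Bs_ex is one example sentence in Bs_exs, Ds_ex is one example sentence in Ds_exs, α and (1-α) are the weights of the gloss and of the example sentences respectively, and sim(Bs_gl, Ds_gl) and sim(Bs_ex, Ds_ex) are computed by formula (4).

Further, selecting the candidate Chinese sense with the highest composite similarity in step 5 is specifically as follows. Let Bs denote the sense to be mapped in the English knowledge base and Ds a candidate Chinese sense; the target sense Ts to which Bs is mapped is given by formula (6):

$$Ts = \arg\max_{Ds_i \in Dss} \mathrm{score}(Bs, Ds_i) \tag{6}$$

where Dss is the set of candidate Chinese senses of Bs, Ds_i is the i-th candidate Chinese sense in Dss, and score(Bs, Ds_i) is computed by formula (5).

A word-vector-based English-Chinese word sense mapping apparatus comprises:

a candidate sense query unit, configured to extract the synset of the sense to be mapped from the English knowledge base and then look up the candidate Chinese senses of each synonym in the synset in the English-Chinese dictionary;
a gloss and example sentence extraction unit, configured to extract the English gloss and example sentences of the sense to be mapped from the English knowledge base, and to look up in the English-Chinese dictionary the English gloss and example sentences of each candidate Chinese sense obtained by the candidate sense query unit;
a sentence vector generation unit, configured to train word vectors on a large-scale English corpus and then generate a sentence vector for each English gloss and example sentence obtained by the gloss and example sentence extraction unit;
a sense similarity calculation unit, configured to compute the similarity between the sentence vectors of the gloss and example sentences of the sense to be mapped and those of the candidate Chinese senses obtained by the sentence vector generation unit, and then compute the composite similarity between the sense to be mapped and each candidate Chinese sense;
a target sense selection unit, configured to select the candidate Chinese sense with the highest composite similarity as the target sense of the sense to be mapped.

Further, the candidate sense query unit comprises: a synset extraction unit, configured to extract the synset of the sense to be mapped; and a candidate Chinese sense query unit, configured to look up the candidate Chinese senses of each synonym in the synset.

Further, the gloss and example sentence extraction unit comprises: a to-be-mapped sense information extraction unit, configured to extract the English gloss and example sentences of the sense to be mapped; and a candidate sense information extraction unit, configured to extract the English gloss and example sentences of each candidate Chinese sense obtained by the candidate Chinese sense query unit.

Further, the sentence vector generation unit comprises: a word vector training unit, configured to train word vectors on a large-scale English corpus; a sense information preprocessing unit, configured to preprocess the English glosses and example sentences obtained by the gloss and example sentence extraction unit, including lemmatization and content-word extraction; and a sentence vector generation unit, configured to generate, from the word vectors obtained by the word vector training unit, a sentence vector for each English gloss and example sentence produced by the sense information preprocessing unit.

Further, the sense similarity calculation unit comprises: a sentence vector similarity calculation unit, configured to compute the similarity between the sentence vectors of the gloss and example sentences of the sense to be mapped and those of the candidate Chinese senses obtained by the sentence vector generation unit; and a composite similarity calculation unit, configured to compute the composite similarity between the sense to be mapped and each candidate Chinese sense from the sentence-vector similarities obtained by the sentence vector similarity calculation unit.

Beneficial effects of the invention:

1. The proposed method and apparatus are fully automatic and avoid the tedious human labor of traditional manual mapping.
2. The proposed method and apparatus exploit the strengths of deep learning: sentence vectors are generated from word vectors, so the target sense can be selected fairly accurately, avoiding the low accuracy of traditional machine-translation-based mapping.
3. The proposed method and apparatus consider both the gloss and the example sentences of a sense, compute their similarities with deep-learning word vectors, and take a weighted sum of the two as the composite similarity used to select the target sense, giving high mapping accuracy.
4. When computing sentence similarity, the proposed method and apparatus keep only the content words of a sentence, which avoids interference from irrelevant function words and improves mapping accuracy.

Brief Description of the Drawings

Fig. 1 is a flowchart of the word-vector-based English-Chinese word sense mapping method according to an embodiment of the invention.
Fig. 2 is a schematic structural diagram of the word-vector-based English-Chinese word sense mapping apparatus according to an embodiment of the invention.
Fig. 3 is a schematic structural diagram of the candidate sense query unit according to an embodiment of the invention.
Fig. 4 is a schematic structural diagram of the gloss and example sentence extraction unit according to an embodiment of the invention.
Fig. 5 is a schematic structural diagram of the sentence vector generation unit according to an embodiment of the invention.
Fig. 6 is a schematic structural diagram of the sense similarity calculation unit according to an embodiment of the invention.

Detailed Description

To help those skilled in the art better understand the solution of the embodiments, the embodiments are described in further detail below with reference to the drawings.

BabelNet is a multilingual knowledge base. Its English sense knowledge base is fairly complete, but its Chinese sense knowledge base is not, and there is currently no effective method for automatically mapping its English senses to Chinese. This patent proposes a word-vector-based method and apparatus for English-Chinese word sense mapping to solve sense mapping problems of this kind. One sense to be mapped, "measure; mensurate; measure_out", is extracted from BabelNet; its semantic description is shown in Table 1. This sense is used as the example for describing the embodiment.

Table 1. Semantic description of the sense to be mapped.

The flowchart of the method according to this embodiment is shown in Fig. 1 and comprises the following steps.

Step 101: query candidate senses. Extract the synset of the sense to be mapped from the English knowledge base, then look up the candidate Chinese senses of each synonym in the synset in the English-Chinese dictionary, specifically:

Step 1-1) extract the synset of the sense to be mapped from the English knowledge base. This embodiment targets English-Chinese sense mapping for BabelNet, so the English knowledge base used is BabelNet. As in WordNet, a sense in BabelNet is represented as a synset. From Table 1, the synset of the current sense to be mapped is {measure, mensurate, measure_out}.

Step 1-2) look up the candidate Chinese senses of each synonym in the synset in the English-Chinese dictionary. This embodiment uses the Collins COBUILD Advanced English-Chinese dictionary (柯林斯高阶英汉词典). It gives detailed English and Chinese descriptions for every sense, including glosses and example sentences in both languages. In this dictionary, each English word has one or more corresponding Chinese senses, and each Chinese sense has one English gloss and one or more English-Chinese parallel example sentences; this English information is a good resource for implementing the patent. Looking up each synonym in {measure, mensurate, measure_out} in the dictionary yields the candidate Chinese senses shown in Table 2.

Table 2. Candidate Chinese senses.

No.   Chinese sense description
1     衡量;估量;评估;判定 (to weigh; estimate; assess; judge)
2     测量;度量;计量 (to measure; gauge; meter)
3     距离(或长度、宽度、数量等)为… ((of a distance, length, width, quantity, etc.) to be ...)
4     (按所需)量出,量取 (to measure out (as required))

Step 102: extract glosses and example sentences. Extract the English gloss and example sentences of the sense to be mapped from the English knowledge base, and look up in the English-Chinese dictionary the English gloss and example sentences of each candidate Chinese sense obtained in step 101, specifically:

Step 2-1) extract the English gloss and example sentences of the sense to be mapped from the English knowledge base. In this embodiment they are extracted from BabelNet. Based on Table 1, the English gloss and example sentence of the sense to be mapped are shown in Table 3.

Table 3. English gloss and example sentence of the sense to be mapped.

Step 2-2) look up in the English-Chinese dictionary the English gloss and example sentences of each candidate Chinese sense obtained in step 1-2). In this embodiment, the English glosses and example sentences of candidate Chinese senses 1, 2, 3, and 4 from step 1-2) are extracted in turn from the Collins dictionary, as shown in Table 4. For ease of understanding, Table 4 also gives the corresponding Chinese senses.

Table 4. English glosses and example sentences of the candidate Chinese senses.

Step 103: generate sentence vectors. Train word vectors on a large-scale English corpus, then generate a sentence vector for each English gloss and example sentence obtained in step 102, specifically:

Step 3-1) train word vectors on a large-scale English corpus. This embodiment uses Google's word2vec toolkit to train word vectors on the fifth edition of the English Gigaword dataset provided by the University of Pennsylvania. The vector dimension is 200, and all other training parameters use their default values. English Gigaword is an English newswire text collection covering seven international English news sources, with 9,876,086 documents and 26,348 MB in total, compiled over several years by the Linguistic Data Consortium at the University of Pennsylvania.

Step 3-2) preprocess the English glosses and example sentences obtained in step 102, including lemmatization and content-word extraction. This embodiment uses the Stanford CoreNLP toolkit to lemmatize the English sentences and then extracts the content words. Taking the gloss of the sense to be mapped as an example: lemmatizing the gloss "Determine the measurements of something or somebody, take measurements of" gives "determine the measurement of something or somebody, take measurement of"; extracting the content words then gives "determine measurement something somebody take measurement".

Step 3-3) using the word vectors from step 3-1), generate a sentence vector for each English gloss and example sentence processed in step 3-2). Let s denote an English gloss or example sentence and w a content word in it; the sentence vector of s is given by formula (1), where |s| is the number of content words in s and \vec{v}(w_k) is the word vector of content word w_k. In this embodiment, the gloss of the sense to be mapped obtained in step 3-2), "determine measurement something somebody take measurement", is used to illustrate how a sentence vector is generated.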
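First, the word vector of each content word in the sentence is taken from the word vectors trained in step 3-1). For example, the word vector of determine is:

[-0.060966704,-0.06865787,-0.13976261,0.052583452,0.02309357,-0.015850635,0.0057524024,0.004298664,0.07135361,-0.004907789,-0.0073844297,-0.0660588,-0.09741554,-0.0826721,0.0020558392,0.0019447851,-0.044812344,0.1433886,0.107519455,-0.013067925,0.055411655,0.098691314,-0.11813014,0.028893137,-0.10136866,0.024213811,-0.021338113,-0.006830832,-0.01115726,0.023671253,0.022735655,-0.106075086,-0.0060708467,-0.06795107,-0.024008093,-0.10278628,0.110742025,0.06967174,-0.026281023,-0.1304829,-0.18443915,-0.01603829,0.024118813,-0.02448944,0.08606661,0.04368876,-0.027071448,0.06927168,-0.16086423,-0.09339183,0.048664782,-0.0037259995,-0.19597004,-0.05804217,-0.042547442,-0.105807476,0.013699462,0.09974968,-0.038489617,-0.0507417,0.08751733,0.03520148,0.062430475,0.011540262,-0.12392134,0.10225074,-0.04389849,-0.053057443,-0.014595923,0.15838726,-0.036213677,-0.022729969,0.12135271,0.053754877,0.0653142,-0.11217302,-0.032784045,-0.02645095,-0.0058537563,-0.037233904,-0.091778874,-0.017529158,0.03335303,-0.11941094,0.12519278,0.045954995,-0.07207713,-0.040876612,-0.093257025,0.06504259,0.005461387,0.06069275,0.030098341,-0.007988872,-0.027645452,-0.032660615,-0.062259212,-0.020880515,0.076618314,0.046356063,-0.07308063,0.03509143,-0.08876938,-0.02635127,-0.012593604,0.14288785,0.045763995,-0.024156947,0.04318199,-0.012540084,-0.10338905,-0.031343687,-0.04143757,-0.024850031,0.12515464,0.13902804,0.045706462,0.094424434,0.06911446,-0.042245053,-0.01119372,0.07074649,-0.06615113,0.059482194,0.06079544,-0.0073646945,0.05371373,0.07749403,0.09774167,0.04614667,0.080500856,0.06686461,-0.1371806,0.059351735,-0.11971834,-0.024769751,0.005559396,-0.004569609,0.025109604,-0.010085186,0.06588754,-0.021475257,-0.12877394,-0.011472024,0.019178912,-0.022502841,0.049072206,-0.07339941,-0.06519345,-0.023635125,0.05878342,-0.041036837,0.016565796,0.13539337,-0.024638291,-0.08239346,-0.00374239,0.0033550384,0.01374094,0.0065936707,-0.030307738,0.009063287,-0.021692682,-0.09899706,0.04887318,0.037609883,-0.045150857,-0.09769283,-0.06568951,-0.13722141,0.018394174,0.03404645,-0.08603616,-0.07023705,0.14471957,-0.059314273,0.0674724,-0.07376034,0.041695137,-0.03897431,-0.12877795,-0.057006553,-0.018086433,0.022128537,-0.08181979,-0.08615692,0.029183147,-0.090377316,0.069178686,-0.015696429,-0.0043464974,0.0035500522,0.1526469,0.09442544,0.012619695,0.09376681,0.06574002,0.032735877,-0.06054757,0.031108197]

The word vector of measurement is:

[-0.030921048,0.040468287,0.07367502,-0.036431145,0.09001577,-0.10851831,0.031571753,-0.0076946556,-0.025466012,0.08239048,-0.033852145,0.023865981,-0.06640976,0.09898748,-0.060916066,-0.12299272,-0.10123717,0.018511012,-0.017379025,0.11183538,-0.032644443,0.061155915,-0.046167403,-0.02107625,-0.054799207,-0.003215416,-0.022842003,-0.07484936,-0.016040549,-6.718859E-4,0.09849985,0.10686533,-0.027949711,-0.014089485,0.08666428,-0.055681817,0.12596299,-0.081768885,-0.023240687,-0.040215734,0.009278273,-0.072330184,0.011064145,-0.046390835,0.009363516,0.07663736,-0.046891708,0.120461896,-0.024577046,-0.065430254,-0.060996015,-0.031411856,-0.024597166,-0.022857357,-0.019988738,-0.02650852,-0.046675686,-0.072701864,-0.06415478,-0.012159599,-0.019452924,-0.007099012,-0.035306044,-0.046926122,-0.060533796,-0.069201075,0.029004399,-0.024853425,-0.08013603,-0.040774312,0.10615162,0.036688466,0.0055641048,-0.005188717,0.0027881414,0.061590068,-0.057311498,-0.0018721737,0.032288115,-0.12578985,-0.1902009,-0.056136098,-4.728086E-4,-0.061017197,0.04288104,0.01388723,-0.038211193,-0.043795947,-0.04814441,0.1526314,0.033593766,0.078088604,0.005799715,0.03464157,-0.0035865682,-0.20270306,-0.111725785,-0.09797781,-0.09489581,-0.054468293,-0.0015290832,-0.16072103,0.056969997,0.013535669,-0.17215633,0.20882045,0.04354922,-0.0025980647,0.08676594,0.0429361,0.029175945,-0.039518964,0.03309713,0.027989952,-0.029852066,0.028658131,0.037572138,-0.064470336,0.0275685,-0.094821155,0.14544079,-0.049508303,0.05595343,0.04108511,0.022339016,-0.007031241,0.06387787,-0.051717743,0.035961512,0.0034367307,0.073031195,-0.097252965,-0.060861535,0.12593704,-0.024983672,0.07234978,-0.04727927,-0.19234574,0.11479137,0.013784515,-0.012358148,0.02151782,0.014949858,0.03911975,-0.01054792,-0.07922059,0.036444385,0.025766745,-0.12601435,0.047032543,-0.02278641,-0.13189878,0.111353576,-0.06969082,0.020863937,0.01676644,0.009361927,0.039854113,-0.060249478,0.027769696,-0.27008596,0.05944734,0.039832402,-0.026858494,-0.020013094,0.025406713,-2.128433E-4,-0.05612445,0.04703572,-0.024139712,0.06555838,0.07517604,0.09585466,-0.005991909,-0.0397101,-0.042226095,0.06041255,0.02176508,-0.027269356,-0.038427215,-0.09381253,0.22008736,0.105541155,0.071456574,-0.016034195,0.02069451,0.017009461,-0.07982682,-0.010532036,0.08931265,0.042708967,0.018712737,-0.07463705,0.052128073,0.06920637,0.022202944,0.022940483,0.05133759,-0.038717363,-0.013162929]

The word vectors of the remaining content words in the sentence are obtained in the same way. Then, by formula (1), the word vectors of all content words in the sentence are added together, which gives the sentence vector of this gloss:

[-0.12244331,0.23284505,-0.125848,-0.09857595,0.15176383,-0.21165508,-0.06935414,0.17774323,-0.0481385,0.27167976,-0.23219745,-0.31177434,-0.237795,0.20023781,-0.2208232,-0.25496095,-0.050965287,0.19869018,0.14223932,0.054064974,0.14445543,0.3649017,-0.06972199,-0.0942207,-0.4732177,-0.002447103,-0.11354132,-0.23180336,-0.032030072,0.11646948,0.068802774,0.24477573,0.074090265,-0.30747676,0.28410295,-0.3153889,0.48259473,0.0018074736,-0.2570166,-0.065705955,-0.29293522,0.1187244,0.08923024,-0.023698367,0.078454815,0.2028578,-0.36501467,0.40085053,-0.0051737167,-0.25175425,-0.11989543,-0.09693016,-0.095989406,0.0065662824,0.01091335,-0.03598065,-0.12002948,-0.10372059,-0.28191066,0.033649035,0.3604529,-0.047989205,-0.1641263,-0.21081169,-0.13621823,0.33522972,-0.050793078,-0.0373758,-0.22907057,0.109199345,0.37030825,-0.11889391,0.24283075,0.07673705,0.318008,-0.22766817,-0.42850304,-0.071055345,0.1914971,-0.28046763,-0.6080315,-0.017843004,0.2313133,-0.2477001,0.26103482,0.14874645,-0.09291037,-0.0409794,-0.23852225,0.41014478,-0.17998967,0.31087965,0.11493398,-0.0023042597,-0.09591526,-0.28730935,-0.49623907,-0.30990297,-0.22764425,-0.06879938,-6.009942E-4,-0.25748277,0.00649539,0.21129256,-0.4945098,0.82365096,0.3147551,0.0121324705,0.29460865,-0.13176502,-0.1077477,-0.19233456,0.08242655,0.16084583,-0.13618916,0.11765827,0.23201033,-0.14476305,0.3566257,-0.33154497,0.32010967,0.017003909,0.0983599,0.28363377,0.17411232,-0.31067532,0.21472177,-0.18492793,0.09781431,0.060426474,0.3050918,-0.12334619,-0.23786914,0.27095866,0.023499401,-0.07610657,-0.0463394,-0.48189855,0.44204056,-0.030785767,0.046995677,-0.11442133,-0.32249418,-0.13742244,-0.1368755,-0.21778521,0.061512135,-0.31345803,-0.19940937,0.09265008,-0.02924196,-0.15277626,0.30612707,0.41078234,0.099931955,-0.14431237,0.16773543,-0.14954714,-0.044322092,-0.020516273,-0.52509534,0.10045516,0.13150021,-0.1684227,0.059403583,0.3293987,0.24298555,-0.3315874,-0.057996165,-0.34279677,0.24292094,0.2758336,-0.16648525,-0.13480023,-0.18450123,-0.1112635,0.15073343,0.20073035,-0.097931616,-0.2827055,-0.24364212,0.17794128,0.35367286,-0.012077071,-0.17940772,0.08209381,0.08326046,-0.12982222,0.35156035,0.11034558,-0.0971424,0.01952859,-0.070994884,0.22338426,0.10498668,-0.22422943,-0.04826733,0.046616875,-0.326965,0.05593993]

The sentence vectors of all other English glosses and example sentences are obtained in the same way.

Step 104: compute sense similarity. Compute the similarity between the sentence vectors of the gloss and example sentences of the sense to be mapped and those of the candidate Chinese senses obtained in step 103, then compute the composite similarity between the sense to be mapped and each candidate Chinese sense, specifically:

Step 4-1) compute the sentence-vector similarities. Let s denote an English gloss or example sentence; the sentence-vector similarity of any two sentences s_i and s_j is given by formula (2):

$$\mathrm{sim}(s_i, s_j) = \frac{\vec{v}(s_i) \cdot \vec{v}(s_j)}{\lVert \vec{v}(s_i) \rVert \, \lVert \vec{v}(s_j) \rVert} \tag{2}$$

where \vec{v}(s_i) and \vec{v}(s_j) are the sentence vectors of s_i and s_j, and \lVert \vec{v}(s_i) \rVert and \lVert \vec{v}(s_j) \rVert are their norms. Substituting formula (1), in which a sentence vector is the sum of the word vectors \vec{v}(w_k) of its content words, into formula (2) gives formula (3):

$$\mathrm{sim}(s_i, s_j) = \frac{\left(\sum_{k=1}^{|s_i|} \vec{v}(w_{ik})\right) \cdot \left(\sum_{l=1}^{|s_j|} \vec{v}(w_{jl})\right)}{\left\lVert \sum_{k=1}^{|s_i|} \vec{v}(w_{ik}) \right\rVert \, \left\lVert \sum_{l=1}^{|s_j|} \vec{v}(w_{jl}) \right\rVert} \tag{3}$$

So that similarity scores fall in a fixed range (at most 1 in magnitude) and are easy to compare later, the sentence vectors in formula (3) are normalized with the function norm(·), which turns formula (3) into formula (4):

$$\mathrm{sim}(s_i, s_j) = \mathrm{norm}\!\left(\sum_{k=1}^{|s_i|} \vec{v}(w_{ik})\right) \cdot \mathrm{norm}\!\left(\sum_{l=1}^{|s_j|} \vec{v}(w_{jl})\right) \tag{4}$$

Here norm(·) converts a vector into a unit vector. It changes only the length of the vector, not its direction, so the cosine similarity is unaffected.

In this embodiment, the similarity of two sentence vectors is illustrated with the unit-vector similarity between the gloss of the sense to be mapped, "Determine the measurements of something or somebody, take measurements of", and the English gloss of candidate Chinese sense 1 in Table 4, "If you measure the quality, value, or effect of something, you discover or judge how great it is."

First, the sentence vectors are normalized. Taking the sentence vector obtained in step 103 for the gloss "Determine the measurements of something or somebody, take measurements of" as an example, converting it into a unit vector gives:

[-0.03826203,0.072761215,-0.03932595,-0.030803772,0.047424328,-0.06613961,-0.021672316,0.055542573,-0.015042689,0.08489659,-0.07255885,-0.09742565,-0.07430801,0.06257185,-0.069004536,-0.079672165,-0.015926026,0.062088236,0.044448037,0.016894639,0.045140546,0.11402729,-0.021787263,-0.02944281,-0.14787471,-7.646896E-4,-0.035480265,-0.0724357,-0.010009004,0.03639528,0.021500021,0.076489404,0.023152297,-0.0960827,0.088778675,-0.09855515,0.15080492,5.6481326E-4,-0.080314524,-0.020532303,-0.09153865,0.037099913,0.027883353,-0.0074054482,0.024516165,0.06339057,-0.11406259,0.12526086,-0.0016167228,-0.07867011,-0.037465848,-0.030289482,-0.029995508,0.0020518824,0.0034102874,-0.01124351,-0.037507735,-0.032411408,-0.088093616,0.010514909,0.112637095,-0.014996036,-0.051287454,-0.06587606,-0.042566523,0.10475516,-0.015872212,-0.011679478,-0.071581736,0.03412345,0.11571677,-0.037152883,0.07588162,0.023979386,0.099373594,-0.0711435,-0.13390192,-0.022203922,0.05984049,-0.087642685,-0.19000237,-0.0055757193,0.07228256,-0.07740323,0.08157017,0.046481434,-0.029033348,-0.012805559,-0.074535266,0.1281652,-0.056244556,0.097146064,0.035915457,-7.200528E-4,-0.029972339,-0.089780636,-0.1550686,-0.096840866,-0.07113603,-0.02149896,-1.878033E-4,-0.0804602,0.0020297295,0.06602633,-0.15452823,0.25738078,0.098357104,0.0037912477,0.09206158,-0.04117495,-0.03366983,-0.060102183,0.025757283,0.050262343,-0.042557437,0.036766764,0.07250038,-0.045236673,0.11144115,-0.10360372,0.10003033,0.0053135124,0.030736258,0.08863206,0.054407958,-0.09708222,0.06709791,-0.0577877,0.030565768,0.01888253,0.095337436,-0.038544167,-0.07433118,0.08467125,0.007343274,-0.023782367,-0.014480493,-0.15058737,0.13813224,-0.009620174,0.014685571,-0.035755258,-0.100775465,-0.042942822,-0.04277191,-0.0680552,0.019221786,-0.09795178,-0.062312976,0.02895201,-0.009137753,-0.0477407,0.09566095,0.12836443,0.031227507,-0.04509584,0.05241526,-0.046731643,-0.013850109,-0.006411083,-0.16408584,0.031391002,0.0410922,-0.052630022,0.018562889,0.10293304,0.07592999,-0.10361698,-0.018123088,-0.10711978,0.0759098,0.08619461,-0.05202459,-0.04212341,-0.057654366,-0.034768473,0.047102343,0.06272577,-0.030602425,-0.08834199,-0.076135166,0.05560446,0.11051842,-0.003773936,-0.056062706,0.025653306,0.02601787,-0.04056785,0.10985829,0.034481637,-0.030355806,0.006102444,-0.022185028,0.06980483,0.03280705,-0.07006894,-0.015082947,0.0145672,-0.10217254,0.017480541]

The unit vectors of the sentence vectors of all other glosses and example sentences are obtained in the same way.

The similarity between the gloss of the sense to be mapped and the English gloss of candidate Chinese sense 1 in Table 4 is computed by formula (4) and equals 0.3879761. In the same way, the similarities between the English gloss of the sense to be mapped and the English glosses of candidate Chinese senses 2, 3, and 4 are 0.4196734, 0.3625376, and 0.41536587 respectively. The similarities between the example sentence of the sense to be mapped and the example sentences of the candidate Chinese senses are likewise computed and shown in Table 5. In Table 5, the sense to be mapped has only one example sentence, numbered ex; for the candidate Chinese senses, the first example sentence of the first sense is numbered 1_ex1, the second example sentence of the first sense is numbered 1_ex2, and so on for the other senses and example sentences.

Table 5. Example sentence similarities.

Example of sense to be mapped   Example of candidate sense   Similarity
ex                              1_ex1                        0.33322173
ex                              1_ex2                        0.3466332
ex                              1_ex3                        0.34800234
ex                              2_ex1                        0.7905501
ex                              2_ex2                        0.40629613
ex                              3_ex1                        0.5284378
ex                              3_ex2                        0.5624604
ex                              3_ex3                        0.5684977
ex                              4_ex1                        0.35761255
ex                              4_ex2                        0.3466332

Step 4-2) from the sentence-vector similarities of the glosses and example sentences obtained in step 4-1), compute the composite similarity between the sense to be mapped and a candidate Chinese sense. Let Bs denote the sense to be mapped in the English knowledge base and Ds a candidate Chinese sense; their composite similarity is given by formula (5):

$$\mathrm{score}(Bs, Ds) = \alpha \cdot \mathrm{sim}(Bs_{gl}, Ds_{gl}) + (1-\alpha) \cdot \max_{Bs_{ex} \in Bs_{exs},\; Ds_{ex} \in Ds_{exs}} \mathrm{sim}(Bs_{ex}, Ds_{ex}) \tag{5}$$

where Bs_gl is the English gloss of Bs, Ds_gl is the English gloss of Ds, Bs_exs is the set of English example sentences of Bs, Ds_exs is the set of English example sentences of Ds, Bs_ex is one example sentence in Bs_exs, Ds_ex is one example sentence in Ds_exs, α and (1-α) are the weights of the gloss and of the example sentences respectively, and sim(Bs_gl, Ds_gl) and sim(Bs_ex, Ds_ex) are computed by formula (4).

In this embodiment, the composite similarity is illustrated with the sense to be mapped in Table 1 and candidate Chinese sense 1 in Table 4. From step 4-1), the similarity between the English gloss of the sense to be mapped and the English gloss of candidate Chinese sense 1 is sim(Bs_gl, Ds_gl) = 0.3879761. The max term in formula (5) takes the largest similarity between any Bs_ex and any Ds_ex. From step 4-1), the similarities between the English example sentence of the sense to be mapped and the example sentences of candidate Chinese sense 1 are 0.33322173, 0.3466332, and 0.34800234, of which 0.34800234 is the largest, so the max term equals 0.34800234. Based on extensive experiments, this embodiment sets the weight α in formula (5) to 0.4. By formula (5), the composite similarity between the sense to be mapped Bs and candidate Chinese sense 1 is score(Bs, Ds) = 0.4 × 0.3879761 + (1 - 0.4) × 0.34800234 ≈ 0.3639918. The composite similarities between the sense to be mapped and the other candidate Chinese senses in Table 4 are obtained in the same way, as shown in Table 6.

Table 6. Composite similarity between the sense to be mapped and each candidate Chinese sense.

Step 105: select the target sense according to sense similarity. The candidate Chinese sense with the highest composite similarity is selected as the target sense of the sense to be mapped. Let Bs denote the sense to be mapped in the English knowledge base and Ds a candidate Chinese sense; the target sense Ts to which Bs is mapped is given by formula (6):

$$Ts = \arg\max_{Ds_i \in Dss} \mathrm{score}(Bs, Ds_i) \tag{6}$$

where Dss is the set of candidate Chinese senses of Bs, Ds_i is the i-th candidate Chinese sense in Dss, and score(Bs, Ds_i) is computed by formula (5). In this embodiment, Table 6 shows that candidate Chinese sense 2 has the highest composite similarity, so it is taken as the target sense of the sense to be mapped.

These steps complete the mapping of the sense to be mapped.

Correspondingly, an embodiment of the invention also provides a word-vector-based English-Chinese word sense mapping apparatus, whose structure is shown in Fig. 2. In this embodiment the apparatus comprises:

a candidate sense query unit 201, configured to extract the synset of the sense to be mapped from the English knowledge base and then look up the candidate Chinese senses of each synonym in the synset in the English-Chinese dictionary;
a gloss and example sentence extraction unit 202, configured to extract the English gloss and example sentences of the sense to be mapped from the English knowledge base, and to look up in the English-Chinese dictionary the English gloss and example sentences of each candidate Chinese sense obtained by the candidate sense query unit;
a sentence vector generation unit 203, configured to train word vectors on a large-scale English corpus and then generate a sentence vector for each English gloss and example sentence obtained by the gloss and example sentence extraction unit;
a sense similarity calculation unit 204, configured to compute the similarity between the sentence vectors of the gloss and example sentences of the sense to be mapped and those of the candidate Chinese senses obtained by the sentence vector generation unit, and then compute the composite similarity between the sense to be mapped and each candidate Chinese sense;
a target sense selection unit 205, configured to select the candidate Chinese sense with the highest composite similarity as the target sense of the sense to be mapped.

The structure of the candidate sense query unit 201 of the apparatus in Fig. 2 is shown in Fig. 3. It comprises: a synset extraction unit 301, configured to extract the synset of the sense to be mapped; and a candidate Chinese sense query unit 302, configured to look up the candidate Chinese senses of each synonym in the synset.

The structure of the gloss and example sentence extraction unit 202 of the apparatus in Fig. 2 is shown in Fig. 4. It comprises: a to-be-mapped sense information extraction unit 401, configured to extract the English gloss and example sentences of the sense to be mapped; and a candidate sense information extraction unit 402, configured to extract the English gloss and example sentences of each candidate Chinese sense obtained by the candidate Chinese sense query unit.

The structure of the sentence vector generation unit 203 of the apparatus in Fig. 2 is shown in Fig. 5. It comprises: a word vector training unit 501, configured to train word vectors on a large-scale English corpus; a sense information preprocessing unit 502, configured to preprocess the English glosses and example sentences obtained by the gloss and example sentence extraction unit, including lemmatization and content-word extraction; and a sentence vector generation unit 503, configured to generate, from the word vectors obtained by the word vector training unit, a sentence vector for each English gloss and example sentence produced by the sense information preprocessing unit.

The structure of the sense similarity calculation unit 204 of the apparatus in Fig. 2 is shown in Fig. 6. It comprises: a sentence vector similarity calculation unit 601, configured to compute the similarity between the sentence vectors of the gloss and example sentences of the sense to be mapped and those of the candidate Chinese senses obtained by the sentence vector generation unit; and a composite similarity calculation unit 602, configured to compute the composite similarity between the sense to be mapped and each candidate Chinese sense from the sentence-vector similarities obtained by the sentence vector similarity calculation unit.

The apparatus shown in Figs. 2 to 6 can be integrated into a variety of hardware devices, for example PCs, smartphones, and workstations. The method proposed in the embodiments can be stored, as instructions or instruction sets, on a variety of storage media, including but not limited to optical discs, hard disks, memory, and USB flash drives.

In summary, in the embodiments of the invention, the synset of the sense to be mapped is extracted from the English knowledge base and the candidate Chinese senses of each synonym in the synset are looked up in the English-Chinese dictionary; the English gloss and example sentences of the sense to be mapped are extracted from the English knowledge base and those of each candidate Chinese sense are looked up in the English-Chinese dictionary; word vectors are trained on a large-scale English corpus and a sentence vector is generated for each English gloss and example sentence; the similarity between the sentence vectors of the sense to be mapped and those of the candidate Chinese senses is computed, followed by the composite similarity between the sense to be mapped and each candidate Chinese sense; and the candidate Chinese sense with the highest composite similarity is selected as the target sense. Applying the embodiments therefore achieves word-vector-based English-Chinese word sense mapping.

The embodiments use word vectors from deep learning for sense mapping and can thus take the semantic relations between the words of a sentence into account. In view of the characteristics of English sentences, the invention extracts content words, which removes interference from function words. The proposed sentence similarity computation takes into account the glosses and example sentences of both the sense to be mapped and the candidate Chinese senses. The proposed method and apparatus complete the sense mapping of a knowledge base automatically and with high accuracy. They are fully automatic and avoid the tedious human labor of traditional manual mapping. They exploit the strengths of deep learning by generating sentence vectors from word vectors, so the target sense can be selected fairly accurately, avoiding the low accuracy of traditional machine-translation-based mapping.

The embodiments in this specification are described progressively, and identical or similar parts may be cross-referenced. In particular, the apparatus embodiment is described briefly because it is essentially similar to the method embodiment; for related details, see the corresponding description of the method embodiment.

The embodiments of the invention have been described in detail above, and specific implementations have been used to explain the invention. The description of the embodiments is intended only to help in understanding the method and apparatus of the invention. A person of ordinary skill in the art may, following the idea of the invention, make changes to the specific implementations and the scope of application, so this specification should not be construed as limiting the invention.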
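The scoring pipeline of formulas (1), (4), (5), and (6) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patent's implementation: the function names are hypothetical, out-of-vocabulary words are simply skipped, and a candidate with no example sentences contributes 0 to the example term. The gloss and example similarities at the bottom are the values reported in the embodiment, with α = 0.4.

```python
import math

def sentence_vector(content_words, word_vectors):
    """Formula (1): sum the word vectors of a sentence's content words."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for word in content_words:
        vec = word_vectors.get(word)
        if vec is None:  # assumption: words without a vector are skipped
            continue
        total = [a + b for a, b in zip(total, vec)]
    return total

def normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec] if length else list(vec)

def sim(u, v):
    """Formula (4): dot product of the two unit vectors (cosine similarity)."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))

def composite_score(gloss_sim, example_sims, alpha=0.4):
    """Formula (5): weighted gloss similarity plus best example similarity."""
    best_example = max(example_sims) if example_sims else 0.0  # assumption
    return alpha * gloss_sim + (1 - alpha) * best_example

def select_target(scores):
    """Formula (6): the candidate with the highest composite score."""
    return max(scores, key=scores.get)

# Similarities reported in the embodiment, keyed by candidate sense number.
gloss_sims = {1: 0.3879761, 2: 0.4196734, 3: 0.3625376, 4: 0.41536587}
example_sims = {
    1: [0.33322173, 0.3466332, 0.34800234],
    2: [0.7905501, 0.40629613],
    3: [0.5284378, 0.5624604, 0.5684977],
    4: [0.35761255, 0.3466332],
}
scores = {k: composite_score(gloss_sims[k], example_sims[k]) for k in gloss_sims}
target = select_target(scores)
print(scores, target)
```

With these inputs the highest-scoring candidate is sense 2, which agrees with the mapping result stated in the embodiment.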