An Unsupervised Method and Process for Analyzing Off-Topic Sentences in Short English Essays

Document No.: 18796921; Published: 2019-09-29 19:49; Source: China National Intellectual Property Administration (CNIPA)

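The present invention relates to natural language processing and provides an unsupervised method for determining whether sentences in a short English essay are off-topic. The method applies only to English essays and is not suitable for Chinese essays.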



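Background:

Conventional methods for detecting off-topic sentences in short English essays fall into two categories: supervised and unsupervised. Supervised methods need a large amount of English text written on the same title as the essay under analysis in order to train an analysis model before they can decide whether a sentence is off-topic; unsupervised methods need only the essay title. In practice, collecting a large number of training essays on the same title is difficult, so the practical feasibility of supervised methods is limited. Unsupervised methods need no such training essays and are therefore more feasible in practice. However, conventional unsupervised methods represent the essay and its title as term frequency-document frequency vectors and then decide whether a sentence is off-topic by computing similarity. Because this representation ignores the semantic relatedness between words, these methods cannot accurately identify off-topic sentences, and their off-topic scores are inaccurate.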



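Summary of the Invention:

The object of the present invention is to address the shortcomings of conventional unsupervised methods by providing a new unsupervised method for analyzing off-topic sentences in short English essays. The method takes full account of the semantic relatedness between words and requires no prior training on model essays written on the same topic. Given only the title of the essay, it can accurately identify the off-topic sentences in the essay and compute an off-topic score for the essay's sentences.

The technical solution that achieves this object is as follows:

An unsupervised method for analyzing off-topic sentences in short English essays comprises four modules connected in sequence: an essay preprocessing module, a multi-source semantic representation model construction module, an essay representation model construction module, and an off-topic sentence analysis module. The overall processing steps are shown in Figure 1.

The analysis method comprises the following processing steps: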

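(1) Essay preprocessing module. First, input the essay to be analyzed and its title, apply coreference resolution and lowercasing to both, and split the essay into sentences. Second, apply part-of-speech tagging and phrase chunking to the output of the first step, yielding the words and noun phrases that make up the title and each sentence of the essay. Third, remove stop words from and stem the noun phrases in each sentence and in the title, and join the words within each noun phrase with underscores. Fourth, output, for each sentence of the essay and for the title, a noun phrase list and a list of the words outside noun phrases.

(2) Multi-source semantic representation model construction module. First, input a neural probabilistic word vector space, a word co-occurrence word vector space, a commonsense concept semantic network, and the synonym sets of an English semantic dictionary. Second, strip punctuation from the vocabularies of these four resources, join the words within each phrase with underscores, and output the result. Third, remove stop words from and lowercase the multi-word phrases in the output of the second step. Fourth, merge the processed word co-occurrence vector space and neural probabilistic vector space. Fifth, refine the merged vector space using the synonym sets of the English semantic dictionary. Sixth, apply sparse symmetric processing to the commonsense concept semantic network after stop-word removal and lowercasing. Seventh, use the sparse symmetric commonsense concept semantic network to extend the refined result of the fifth step, yielding the multi-source semantic representation model.

(3) Essay representation model construction module. First, input the preprocessed title from the preprocessing module and map its noun phrases and remaining words into the multi-source semantic representation model to obtain their vectors. Second, using a pre-trained inverse document frequency set, compute the weighted sum of the vectors of the words and noun phrases in the title. Third, compute the principal component of the summed vector and remove it (as in step p413) to obtain the vector representation of the title. Fourth, input the preprocessed essay, map the noun phrases and remaining words of each sentence into the model to obtain their vectors, and compute the weighted sum of the word and noun phrase vectors of each sentence. Fifth, compute the principal component of each summed vector and remove it (as in step p425) to obtain the vector representation of each sentence.

(4) Off-topic sentence analysis module. First, input the title vector output by the essay representation model construction module. Second, input the vector of each sentence of the essay. Third, compute the semantic similarity between the title vector and each sentence vector, and take the mean as the similarity between the essay and its title. Fourth, input the topics in a preset English topic library, represent them as vectors through the essay representation model construction module, and compute the semantic similarity between the essay and each topic in the library. Fifth, sort the essay-title similarity together with the essay-topic similarities in descending order; if the essay-title similarity ranks in the top five, proceed to the sixth step, otherwise judge the essay to be completely unrelated to its title and stop. Sixth, compute the semantic similarity between the title vector and every sentence vector and compare each with a preset threshold; any sentence whose similarity is below the threshold is judged off-topic and output. Seventh, count the off-topic sentences and the total number of sentences, take the ratio of off-topic sentences to total sentences as the essay's off-topic score, and generate a comment on the degree to which the essay's sentences are off-topic.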

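1. Basic concepts and definitions used in the method

(1) Coreference resolution training set

The coreference resolution training set of the present invention consists of model English essays that are free of vocabulary, grammar, and expression errors and free of pervasive coreference phenomena.

(2) Structure of the part-of-speech tagging

Part-of-speech tagging is applied to the words in the essay to be analyzed and in its title. The tagged output has the following format:

word1[POS1#POS2#POS3…] word2[POS1#POS2#POS3…] … wordn[POS1#POS2#POS3…]

(3) Structure of the phrase chunking

Phrase chunking is applied to the phrases in the essay to be analyzed and in its title. The chunked output has the following format:

word1/chunk-tag1, word2/chunk-tag2, … wordn/chunk-tagn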

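(4) Structure of the vector spaces

The neural probabilistic word vector space and the word co-occurrence word vector space have the following structure:

word1 [300-dimensional vector]

word2 [300-dimensional vector]

wordn [300-dimensional vector]

…

phrase1 [300-dimensional vector]

phrase2 [300-dimensional vector]

phrasen [300-dimensional vector]

…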

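(5) Synonym sets of the English semantic dictionary

The English semantic dictionary groups words by part of speech into four groups: nouns, verbs, adjectives, and adverbs. Within each group, words with similar meanings are linked to one another to form synonym sets. The synonym sets have the following structure:

word1 synonym1-of-word1 synonym2-of-word1 … synonymn-of-word1

…

wordn synonym1-of-wordn synonym2-of-wordn … synonymn-of-wordn

…

(6) Structure of the commonsense concept semantic network

The commonsense concept semantic network is a graph containing a large amount of commonsense semantic knowledge. It links English concepts that are related by common sense with weighted, labeled edges. The semantic relations carried by the labels are either symmetric or asymmetric: symmetric relations include synonymy and similarity, and asymmetric relations include causation and containment. The network has the following structure:

concept1 [weighted, labeled directed edge] concept2 [weighted, labeled directed edge] … [weighted, labeled directed edge] conceptn …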

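(7) Structure of the sparse symmetric association matrix of the commonsense concept semantic network

The present invention removes the labels between concepts in the commonsense concept semantic network, converts its directed edges into undirected edges, and then, through sparse symmetric processing, represents it as a sparse symmetric association matrix with the following structure:

weight(concept1, concept1) weight(concept1, concept2) … weight(concept1, conceptn) …

weight(concept2, concept1) weight(concept2, concept2) … weight(concept2, conceptn) …

…

weight(conceptn, concept1) weight(conceptn, concept2) … weight(conceptn, conceptn) …

…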
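The conversion in definition (7) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the edge tuples, the dict-of-dicts storage, and the decision to sum the weights of edges that connect the same pair of concepts are all assumptions, since the text does not say how parallel or opposite-direction edges are combined.

```python
from collections import defaultdict


def build_symmetric_matrix(edges):
    """Collapse (head, label, tail, weight) edges into a sparse symmetric matrix.

    The matrix is stored as a dict of dicts that holds only non-zero entries.
    Labels are dropped and edge direction is ignored, as in definition (7).
    Summing the weights of parallel edges is an assumption of this sketch.
    """
    matrix = defaultdict(dict)
    for head, _label, tail, weight in edges:
        matrix[head][tail] = matrix[head].get(tail, 0.0) + weight
        if head != tail:
            # Mirror the entry so that matrix[a][b] == matrix[b][a].
            matrix[tail][head] = matrix[tail].get(head, 0.0) + weight
    return {concept: dict(row) for concept, row in matrix.items()}


# Hypothetical edges, for illustration only.
edges = [
    ("cigarette", "Causes", "lung_cancer", 2.0),
    ("lung_cancer", "RelatedTo", "cigarette", 1.0),
    ("smoke", "Synonym", "smoking", 1.5),
]
matrix = build_symmetric_matrix(edges)
```

Storing only the non-zero entries keeps the structure sparse, which matters because most pairs of concepts in such a network have no edge between them.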

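(8) Structure of the word inverse document frequency set

The inverse document frequency of a word is the ratio of the total number of documents in the training English text to the number of those documents that contain the word. The set of inverse document frequencies of all words in the training text is called the word inverse document frequency set, and has the following structure:

word1 [inverse document frequency of word1]

word2 [inverse document frequency of word2]

…

wordn [inverse document frequency of wordn]

…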
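Definition (8) can be sketched in a few lines. The function name and the toy documents below are hypothetical. The sketch uses the plain ratio exactly as the definition states it; many implementations take a logarithm of this ratio, but the text does not say so, and that variant is not assumed here.

```python
def build_idf_set(documents):
    """Return {word: N / df(word)} for a list of tokenized documents.

    N is the number of documents and df(word) is the number of documents
    that contain the word, following the ratio given in definition (8).
    """
    total = len(documents)
    doc_freq = {}
    for tokens in documents:
        # Use a set so that a word is counted once per document.
        for word in set(tokens):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: total / df for word, df in doc_freq.items()}


# Toy corpus, for illustration only.
docs = [["smoke", "restaurant"], ["smoke", "health"], ["world", "health"], ["smoke"]]
idf = build_idf_set(docs)
```

Words that occur in few documents receive a large value, so they contribute more to the weighted sum of formula (2).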

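(9) English essay topic library

The English essay topic library contains many English essay topics and has the following structure:

essay topic 1

essay topic 2

…

essay topic n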

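2. Formulas used in the method

(1) Inverse document frequency of a word: [formula not reproduced in the source text]

(2) Weighted sum of word vectors and noun phrase vectors: [formula not reproduced in the source text]

In formula (2), i indexes the words and j indexes the noun phrases being summed; n and m are the total numbers of words and noun phrases, respectively; α and β are the weight coefficients of the word vectors and the noun phrase vectors; and the inverse document frequency of word i and of the noun in noun phrase j is computed by formula (1).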
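The image of formula (2) is not available, so the sketch below is a reconstruction from the symbol descriptions only: it assumes the weighted sum has the form α·Σᵢ idf(wordᵢ)·vᵢ + β·Σⱼ idf(phraseⱼ)·uⱼ, with no normalization. The function name, the default coefficients, and the toy vectors are hypothetical.

```python
def weighted_sum(word_vecs, word_idfs, phrase_vecs, phrase_idfs, alpha=1.0, beta=1.0):
    """IDF-weighted sum of word vectors and noun phrase vectors.

    Assumed form (the original formula is not available):
        alpha * sum_i idf_i * v_i  +  beta * sum_j idf_j * u_j
    """
    if not word_vecs and not phrase_vecs:
        raise ValueError("at least one vector is required")
    dim = len(word_vecs[0]) if word_vecs else len(phrase_vecs[0])
    total = [0.0] * dim
    for vec, idf in zip(word_vecs, word_idfs):
        for k in range(dim):
            total[k] += alpha * idf * vec[k]
    for vec, idf in zip(phrase_vecs, phrase_idfs):
        for k in range(dim):
            total[k] += beta * idf * vec[k]
    return total


# Two word vectors and one noun phrase vector in a toy 2-dimensional space.
result = weighted_sum(
    word_vecs=[[1.0, 0.0], [0.0, 1.0]],
    word_idfs=[2.0, 1.0],
    phrase_vecs=[[1.0, 1.0]],
    phrase_idfs=[3.0],
    alpha=0.5,
    beta=2.0,
)
```

Choosing β larger than α, as in this toy call, would let noun phrases dominate the sentence vector; the text does not state which values the method uses.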

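(3) Semantic similarity between the essay title and the essay: [formula not reproduced in the source text]

In formula (3), n is the total number of sentences in the essay to be analyzed.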
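The summary describes this quantity as the mean, over all n sentences, of the similarity between the title vector and each sentence vector. A minimal sketch follows; cosine similarity is an assumption here, because the text only speaks of "semantic similarity" without naming the measure, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def essay_title_similarity(title_vec, sentence_vecs):
    """Mean of the title-to-sentence similarities over all n sentences."""
    sims = [cosine(title_vec, s) for s in sentence_vecs]
    return sum(sims) / len(sims)


# Toy vectors: one sentence parallel to the title, one orthogonal to it.
similarity = essay_title_similarity([1.0, 0.0], [[2.0, 0.0], [0.0, 3.0]])
```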

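(4) Semantic similarity between the essay title and a sentence of the essay: [formula not reproduced in the source text]

(5) Off-topic score of the essay's sentences: [formula not reproduced in the source text]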
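Formulas (4) and (5) can be illustrated together. The summary states that a sentence is off-topic when its similarity to the title is below a preset threshold, and that the score is the ratio of off-topic sentences to all sentences. In the sketch below, cosine similarity and the threshold value 0.5 are assumptions; the text does not give either.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def off_topic_report(title_vec, sentence_vecs, threshold):
    """Return (1-based indices of off-topic sentences, off-topic score)."""
    flagged = [
        index
        for index, vec in enumerate(sentence_vecs, start=1)
        if cosine(title_vec, vec) < threshold
    ]
    score = len(flagged) / len(sentence_vecs)
    return flagged, score


# Toy data: sentences 2 and 4 point away from the title vector.
flagged, score = off_topic_report(
    [1.0, 0.0],
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]],
    threshold=0.5,  # hypothetical value
)
```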

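The detailed processing steps of the method are as follows.

As shown in Figure 2, the essay preprocessing module proceeds as follows:

p201 Start;

p202 Read the essay to be analyzed and its title;

p203 Concatenate the essay and its title end to end into a single text and run coreference resolution on it to obtain the coreference chains;

p204 Read the title;

p205 Check whether the coreference chain of each pronoun in the title contains a noun phrase; if so, go to p206, otherwise go to p207;

p206 Replace the pronouns in the title with the noun or noun phrase from their coreference chain;

p207 Split the title into sentences and tokens;

p208 Lowercase the tokens of the title;

p209 Apply part-of-speech tagging and phrase chunking to the lowercased title, and output its noun phrase list and its list of words outside noun phrases;

p210 Remove stop words from and stem the noun phrases in the title's noun phrase list, and join the words within each noun phrase with underscores;

p211 Read the essay;

p212 Check whether the coreference chain of each pronoun in the essay contains a noun or noun phrase; if so, go to p213, otherwise go to p214;

p213 Replace the pronouns in the essay with the noun or noun phrase from their coreference chain;

p214 Split the essay into sentences and tokens;

p215 Lowercase the tokens of the essay;

p216 Apply part-of-speech tagging and phrase chunking to the lowercased essay sentence by sentence, and output, for each sentence, its noun phrase list and its list of words outside noun phrases;

p217 Remove stop words from and stem the noun phrases in each sentence's noun phrase list, and join the words within each noun phrase with underscores;

p218 End.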
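Steps p210 and p217 can be sketched as below. This is only an illustration under stated assumptions: the stop-word list is a tiny hypothetical one, and the stemmer is a pluggable argument that defaults to the identity function, because the text does not name the stop-word list or the stemming algorithm it uses.

```python
# Tiny hypothetical stop-word list; the text does not specify the real one.
STOP_WORDS = {"a", "an", "the", "all", "some", "many", "this", "our",
              "we", "it", "them", "i"}


def normalize_noun_phrase(phrase, stem=lambda word: word):
    """Lowercase, drop stop words, stem, and join the rest with underscores."""
    kept = [stem(w) for w in phrase.lower().split() if w not in STOP_WORDS]
    return "_".join(kept)


def normalize_noun_phrases(phrases, stem=lambda word: word):
    """Normalize each phrase and discard those that become empty."""
    normalized = (normalize_noun_phrase(p, stem) for p in phrases)
    return [p for p in normalized if p]


cleaned = normalize_noun_phrases(["we", "a popular activity", "the country", "Lung cancer"])
```

With a real stemmer or lemmatizer passed as `stem`, plural forms such as "public places" would additionally be reduced to their base form before joining.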

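As shown in Figure 3, the multi-source semantic representation model construction module proceeds as follows:

p301 Start;

p302 Train the neural probabilistic word vector space on the training corpus;

p303 Input the vocabularies of the neural probabilistic word vector space, the word co-occurrence word vector space, the commonsense concept semantic network, and the synonym sets of the English semantic dictionary;

p304 Remove all punctuation between words from the vocabularies of the four resources input in the previous step;

p305 Lowercase the vocabularies after punctuation removal;

p306 Remove stop words from the lowercased vocabularies;

p307 Stem the words in the vocabularies after stop-word removal;

p308 Join the words within each phrase of the stemmed vocabularies with underscores, and output the four resources with their processed vocabularies;

p309 Input the neural probabilistic word vector space and the word co-occurrence word vector space with processed vocabularies;

p310 For each word that appears in only one of the two vector spaces, construct a vector for it in the other space, so that the two vocabularies coincide;

p311 For each word, concatenate its 300-dimensional vectors from the two vector spaces end to end into a 600-dimensional vector;

p312 Reduce the concatenated 600-dimensional vectors to 300 dimensions by singular value decomposition;

p313 Apply L2-norm normalization to the reduced vectors, and output the vector space that fuses the neural probabilistic and word co-occurrence vector spaces;

p314 Refine the fused vector space using the synonym sets of the English semantic dictionary, reducing the Euclidean distance between the vectors of synonyms;

p315 Input the commonsense concept semantic network with processed vocabulary;

p316 Remove the labels from the commonsense concept semantic network and represent it as an undirected graph;

p317 Represent the undirected graph as a sparse symmetric association matrix and output it;

p318 Input the vector space refined with the synonym sets;

p319 Use the sparse symmetric commonsense concept semantic network to extend the refined vector space, reducing the Euclidean distance between the vectors of words linked by commonsense associations, and output the resulting multi-source semantic representation model;

p320 End.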
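Step p314 says only that synonym vectors are pulled closer in Euclidean distance. One common way to do this is a retrofitting-style iterative average, in which each word's vector is repeatedly replaced by a blend of its original vector and the current vectors of its synonyms. The sketch below uses that technique as a stand-in; the update rule, the `alpha` weight, and the iteration count are assumptions and may differ from the formula the patent minimizes.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pull_synonyms_closer(vectors, synonyms, iterations=10, alpha=1.0):
    """Retrofitting-style refinement (assumed technique, see lead-in).

    Each pass sets a word's vector to
        (alpha * original + mean of its synonyms' current vectors) / (alpha + 1).
    Words without synonyms in the vocabulary keep their original vectors.
    """
    original = {w: list(v) for w, v in vectors.items()}
    refined = {w: list(v) for w, v in vectors.items()}
    for _ in range(iterations):
        for word, neighbours in synonyms.items():
            known = [n for n in neighbours if n in refined]
            if word not in refined or not known:
                continue
            beta = 1.0 / len(known)  # neighbour weights sum to 1
            refined[word] = [
                (alpha * original[word][k] + sum(beta * refined[n][k] for n in known))
                / (alpha + 1.0)
                for k in range(len(original[word]))
            ]
    return refined


# Toy 2-dimensional vectors; "hanging" and "suspension" are listed as synonyms.
vectors = {"hanging": [1.0, 0.0], "suspension": [0.0, 1.0], "restaurant": [5.0, 5.0]}
synonyms = {"hanging": ["suspension"], "suspension": ["hanging"]}
refined = pull_synonyms_closer(vectors, synonyms)
```

Keeping the original vector in the blend prevents synonyms from collapsing onto a single point, so the distributional information from the fused space is retained. The same update could be driven by the rows of the sparse symmetric matrix for step p319, with edge weights in place of the uniform `beta`.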

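As shown in Figure 4, the essay representation model construction module proceeds as follows:

p401 Start;

p402 Read, in order, the words and noun phrases from the preprocessed title's noun phrase list and list of words outside noun phrases;

p403 Check whether the item read is a noun phrase; if so, go to p404, otherwise go to p407;

p404 Map the noun phrase into the multi-source semantic representation model;

p405 Check whether the noun phrase has a vector in the model; if so, go to p409, otherwise go to p406;

p406 Split the noun phrase into words;

p407 Map the word into the multi-source semantic representation model;

p408 Check whether the word has a vector in the model; if so, go to p409, otherwise go to p410;

p409 Retrieve the vector of the word or noun phrase and store it in the title's word vector list or noun phrase vector list;

p410 Check whether the item is the last word or noun phrase in the title's lists; if so, go to p411, otherwise return to p402;

p411 Read the title's word vector list and noun phrase vector list;

p412 Using the inverse document frequency set, compute each word's inverse document frequency by formula (1), then compute the weighted sum of the word vectors and noun phrase vectors by formula (2);

p413 Use principal component analysis to compute the principal component of the weighted-sum vector from the previous step and remove it, yielding the vector representation of the title;

p414 Read, in order, the words and noun phrases from each preprocessed sentence's noun phrase list and list of words outside noun phrases;

p415 Check whether the item read is a noun phrase; if so, go to p416, otherwise go to p419;

p416 Map the noun phrase into the multi-source semantic representation model;

p417 Check whether the noun phrase has a vector in the model; if so, go to p421, otherwise go to p418;

p418 Split the noun phrase into words;

p419 Map the word into the multi-source semantic representation model;

p420 Check whether the word has a vector in the model; if so, go to p421, otherwise go to p422;

p421 Retrieve the vector of the word or noun phrase and store it in the corresponding sentence's word vector list or noun phrase vector list;

p422 Check whether the item is the last word or noun phrase in the essay's lists; if so, go to p423, otherwise return to p414;

p423 Read the word vector list and noun phrase vector list of each sentence;

p424 For each sentence, compute the weighted sum of its word vectors and noun phrase vectors by formula (2);

p425 Use principal component analysis to compute the principal component of the weighted-sum vector from the previous step and remove it, yielding the sentence vector of each sentence;

p426 End.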
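The lookup loop of steps p402 to p410 (and p414 to p422) can be sketched as follows: a noun phrase is looked up as a whole first, and only if the model has no vector for it is it split into words that are looked up one by one. The plain dict standing in for the model, the `(token, is_noun_phrase)` input pairs, and the toy vectors are hypothetical.

```python
def lookup_vectors(items, model):
    """Map preprocessed items to vectors with the phrase-then-word fallback.

    items: list of (token, is_noun_phrase) pairs; noun phrases use
           underscores between words, e.g. "lung_cancer".
    model: dict from token to vector (stand-in for the semantic model).
    Returns (word_vectors, phrase_vectors); tokens unknown to the model
    are skipped, as in the "no" branch of p408.
    """
    word_vecs, phrase_vecs = [], []
    for token, is_phrase in items:
        if is_phrase and token in model:
            phrase_vecs.append(model[token])
            continue
        # Unknown noun phrase: fall back to its component words (p406).
        words = token.split("_") if is_phrase else [token]
        for word in words:
            if word in model:
                word_vecs.append(model[word])
    return word_vecs, phrase_vecs


# Toy model: "public_place" is missing as a phrase, "nicotinism" is missing entirely.
model = {
    "lung_cancer": [1.0, 0.0],
    "public": [0.0, 1.0],
    "place": [1.0, 1.0],
    "suffer": [0.5, 0.5],
}
items = [("lung_cancer", True), ("public_place", True), ("suffer", False), ("nicotinism", False)]
word_vecs, phrase_vecs = lookup_vectors(items, model)
```

Keeping the two output lists separate matters because formula (2) weights word vectors and noun phrase vectors with different coefficients.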

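As shown in Figure 5, the off-topic sentence analysis module proceeds as follows:

p501 Start;

p502 Read the title vector of the essay to be analyzed;

p503 At the same time, read the sentence vectors of all sentences in the essay;

p504 Substitute the title vector and all sentence vectors into formula (3) to obtain the semantic similarity between the title and the essay;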

p505 Using formula (3), compute the semantic similarity between the essay and each topic in the English topic library;

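p506 Sort in descending order the essay-title similarity together with the similarities between the essay and each topic in the English topic library;

p507 Check whether the essay-title similarity ranks in the top 5; if so, go to p509, otherwise go to p508;

p508 Judge the essay to be an off-topic essay that is completely unrelated to its topic;

p509 Read the title vector;

p510 Read the essay's sentence vectors in order;

p511 Compute the semantic similarity between the title and the sentence by formula (4);

p512 Check whether the similarity between the title vector and the sentence vector is below the preset threshold; if so, go to p513, otherwise go to p510;

p513 Judge the sentence corresponding to the sentence vector to be an off-topic sentence;

p514 Check whether the off-topic sentence corresponds to the last sentence vector in the essay's sentence vector list; if so, go to p515, otherwise go to p510;

p515 Count the total number of off-topic sentences in the essay;

p516 Compute the off-topic score of the essay's sentences by formula (5);

p517 Generate the comment on the off-topic analysis of the essay's sentences;

p518 End.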
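The ranking gate of steps p506 and p507 can be sketched as follows. The function name and the similarity values are hypothetical, and giving the title the best rank among tied values is an assumption, since the text does not say how ties are handled.

```python
def passes_topic_gate(title_similarity, library_similarities, top_k=5):
    """Return True if the essay-title similarity ranks within the top_k.

    The title similarity is ranked together with the similarities between
    the essay and every topic in the topic library, in descending order.
    Ties are resolved in favour of the title (assumption of this sketch).
    """
    ranked = sorted([title_similarity] + list(library_similarities), reverse=True)
    rank = ranked.index(title_similarity) + 1
    return rank <= top_k


# Hypothetical similarities between one essay and six library topics.
library = [0.9, 0.8, 0.7, 0.65, 0.3, 0.1]
on_topic = passes_topic_gate(0.62, library)   # title ranks 5th
unrelated = passes_topic_gate(0.2, library)   # title ranks 6th
```

Only essays that pass this gate go on to the sentence-level threshold check of steps p509 to p514; the others are reported as completely unrelated to their title.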

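The analysis method of the present invention solves the problem that conventional unsupervised methods ignore the semantic relatedness between words and therefore cannot accurately identify off-topic sentences or score the degree of off-topic content accurately. After an English essay is processed by this method, the output consists of the off-topic sentences in the essay, together with the off-topic score and comment for the essay's sentences.

Brief Description of the Drawings

Figure 1 is a diagram of the overall processing steps of the analysis method;

Figure 2 is a diagram of the processing steps of the essay preprocessing module;

Figure 3 is a diagram of the processing steps of the multi-source semantic representation model construction module;

Figure 4 is a diagram of the processing steps of the essay representation model construction module;

Figure 5 is a diagram of the processing steps of the off-topic sentence analysis module.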

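Detailed Description

The invention is further described below with reference to an embodiment and the drawings, which do not limit the invention.

Embodiment:

With reference to Figures 1 to 5, the method is carried out as follows.

Step 1: run the essay preprocessing module

The input essay is taken from the essays by Chinese students in the International Corpus Network of Asian Learners of English (ICNALE). Its title and content are as follows:

Title of the essay to be analyzed:

whether smoking should be completely banned at all the restaurant in the country

Content of the essay to be analyzed: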

smoking is a popular activity among human beings, especially men. but as we all know, cigarettes contain some poison, many countries have banned smoking at public places. it's not enough, stopping smoking should not only at public places but also at restaurants in the country. scientists have improved that cigarette has nicotine. it's a kind of poison. people who smoking for a long time may get nicotinism, if it's very serious, people will die for this. and people who always smoking also will have lung cancer. it can't be cured. the patients will suffer unimaginable pain. and the family of the patients will be so sad. so, for human being's health, smoking should be banned completely at all restaurants. smoking is not only bad for smoking people, but also unhealthy for the people around them. when one person is smoking, the people around him may get second-hand smokes. the second-hand smokes also contain nicotine, also harmful for human's body. and we see that, if people smoking in a restaurant, everyone in this restaurant will be harmed by them. as far as i know, smoking also pollute environment. so if we can ban smoking, we will save our world. in a word, smoking is bad for our health and also harmful for our world. banning smoking completely at all restaurants in the country is a little step, one day we will stop smoking all over the world!

(1) Coreference resolution is applied to the title and content of the essay. The resulting coreference chains are:

chain1 - ["smoking" in sentence 1, "smoking" in sentence 2, "smoking" in sentence 3, "smoking" in sentence 11, "smoking" in sentence 12, "smoking" in sentence 16, "smoking" in sentence 17, "smoking" in sentence 18, "smoking" in sentence 19]

chain2 - ["lung cancer" in sentence 7, "it" in sentence 8]

chain3 - ["the patients" in sentence 9, "the patients" in sentence 10]

chain4 - ["also at restaurants in the country" in sentence 3, "all restaurants in the country" in sentence 19]

chain5 - ["many countries" in sentence 2, "the country" in sentence 3, "the country" in sentence 19]

chain6 - ["smoking people" in sentence 12, "them" in sentence 12]

chain7 - ["we" in sentence 15, "we" in sentence 17, "we" in sentence 17, "our" in sentence 17, "our" in sentence 18, "our" in sentence 18, "we" in sentence 19]

chain8 - ["the people around them" in sentence 12, "the people around him" in sentence 13]

chain9 - ["our world" in sentence 17, "our world" in sentence 18, "the world" in sentence 19]

chain10 - ["one person" in sentence 13, "him" in sentence 13]

chain11 - ["cigarette" in sentence 4, "it" in sentence 5]

chain12 - ["a restaurant" in sentence 15, "this restaurant" in sentence 15]

chain13 - ["everyone in this restaurant" in sentence 15, "them" in sentence 15]

The pronouns in the title and content of the essay are replaced with the noun phrase from the coreference chain they belong to. The result is:

Title after coreference replacement:

whether smoking should be completely banned at all the restaurant in the country

Content after coreference replacement:

smoking is a popular activity among human beings, especially men. but as we all know, cigarettes contain some poison, many countries have banned smoking at public places. it's not enough, stopping smoking should not only at public places but also at restaurants in the country. scientists have improved that cigarette has nicotine. cigarette is a kind of poison. people who smoking for a long time may get nicotinism, if it's very serious, people will die for this. and people who always smoking also will have lung cancer. lung cancer can't be cured. the patients will suffer unimaginable pain. and the family of the patients will be so sad. so, for human being's health, smoking should be banned completely at all restaurants. smoking is not only bad for smoking people, but also unhealthy for the people around smoking people. when one person is smoking, the people around one person may get second-hand smokes. the second-hand smokes also contain nicotine, also harmful for human's body. and we see that, if people smoking in a restaurant, everyone in this restaurant will be harmed by them. as far as i know, smoking also pollute environment. so if we can ban smoking, we will save our world. in a word, smoking is bad for our health and also harmful for our world. banning smoking completely at all restaurants in the country is a little step, one day we will stop smoking all over the world!

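After coreference replacement, phrase chunking is applied to the title. The resulting noun phrase list and list of words outside noun phrases are:

Noun phrase list: [smoking, all the restaurant, the country]

Words outside noun phrases: [whether, should, be, completely, banned, at, in]

Stop-word removal and stemming are then applied to the noun phrases in the noun phrase list, and the words within each noun phrase are joined with underscores. The final preprocessed lists for the title are:

Noun phrase list: [smoking, restaurant, country]

Words outside noun phrases: [whether, should, be, completely, banned, at, in]

After coreference replacement, phrase chunking is applied to each sentence of the essay. The resulting noun phrase list and list of words outside noun phrases for each sentence are: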

Sentence 1 of the essay:

Noun phrase list: [smoking, a popular activity, human beings, men]

Words outside noun phrases: [is, among, human, especially]

Sentence 2 of the essay:

Noun phrase list: [we, cigarettes, some poison, many countries, smoking, public places]

Words outside noun phrases: [but, as, all, know, contain, have, banned, smoking, at]

Sentence 3 of the essay:

Noun phrase list: [it, stopping smoking, public places, restaurants, the country]

Words outside noun phrases: [is, not, enough, should, not, only, at, but, also, at, in]

Sentence 4 of the essay:

Noun phrase list: [scientists, cigarette, nicotine]

Words outside noun phrases: [have, improved, that, has]

Sentence 5 of the essay:

Noun phrase list: [cigarette, a kind of poison]

Words outside noun phrases: [is]

Sentence 6 of the essay:

Noun phrase list: [people, a long time, nicotinism, it, people, this]

Words outside noun phrases: [who, smoking, for, may, get, if, is, very, serious, will, die, for]

Sentence 7 of the essay:

Noun phrase list: [people, lung cancer]

Words outside noun phrases: [and, who, smoking, also, will, have]

Sentence 8 of the essay:

Noun phrase list: [lung cancer]

Words outside noun phrases: [can, not, be, cured]

Sentence 9 of the essay:

Noun phrase list: [the patients, unimaginable pain]

Words outside noun phrases: [will, suffer]

Sentence 10 of the essay:

Noun phrase list: [the family, the patients]

Words outside noun phrases: [and, of, will, be, so, sad]

Sentence 11 of the essay:

Noun phrase list: [human being, health, smoking, all restaurants]

Words outside noun phrases: [so, for, should, be, banned, completely, at]

Sentence 12 of the essay:

Noun phrase list: [smoking, smoking people, the people, smoking people]

Words outside noun phrases: [is, not, only, bad, for, but, also, unhealthy, for, around]

Sentence 13 of the essay:

Noun phrase list: [one person, the people, one person, second-hand smokes]

Words outside noun phrases: [when, is, smoking, around, may, get]

Sentence 14 of the essay:

Noun phrase list: [the second-hand smokes, nicotine, human's body]

Words outside noun phrases: [also, contain, also, harmful, for]

Sentence 15 of the essay:

Noun phrase list: [we, people, a restaurant, everyone, this restaurant, them]

Words outside noun phrases: [and, see, that, if, smoking, in, in, will, be, harmed, by]

Sentence 16 of the essay:

Noun phrase list: [i, smoking, environment]

Words outside noun phrases: [as, far, as, know, also, pollute]

Sentence 17 of the essay:

Noun phrase list: [we, smoking, we, our world]

Words outside noun phrases: [so, if, can, ban, will, save]

Sentence 18 of the essay:

Noun phrase list: [a word, smoking, our health, our world]

Words outside noun phrases: [in, is, bad, for, and, also, harmful, for]

Sentence 19 of the essay:

Noun phrase list: [smoking, all restaurants, the country, a little step, one day, we, the world]

Words outside noun phrases: [banning, completely, at, in, is, we, will, stop, smokingall, over]

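Stop-word removal and stemming are then applied to the noun phrases in each noun phrase list, and the words within each noun phrase are joined with underscores. The final preprocessed lists for each sentence of the essay are:

Sentence 1 of the essay:

Noun phrase list: [smoke, popular_activity, human_being, man]

Words outside noun phrases: [is, among, human, especially]

Sentence 2 of the essay:

Noun phrase list: [cigarette, poison, country, smoke, public_place]

Words outside noun phrases: [but, as, all, know, contain, have, banned, smoking, at]

Sentence 3 of the essay:

Noun phrase list: [stop_smoke, public_place, restaurant, country]

Words outside noun phrases: [is, not, enough, should, not, only, at, but, also, at, in]

Sentence 4 of the essay:

Noun phrase list: [scientist, cigarette, nicotine]

Words outside noun phrases: [have, improved, that, has]

Sentence 5 of the essay:

Noun phrase list: [cigarette, poison]

Words outside noun phrases: [is]

Sentence 6 of the essay:

Noun phrase list: [people, long_time, nicotinism, people]

Words outside noun phrases: [who, smoking, for, may, get, if, is, very, serious, will, die, for]

Sentence 7 of the essay:

Noun phrase list: [people, lung_cancer]

Words outside noun phrases: [and, who, smoking, also, will, have]

Sentence 8 of the essay:

Noun phrase list: [lung_cancer]

Words outside noun phrases: [can, not, be, cured]

Sentence 9 of the essay:

Noun phrase list: [patient, unimaginable_pain]

Words outside noun phrases: [will, suffer]

Sentence 10 of the essay:

Noun phrase list: [family, patient]

Words outside noun phrases: [and, of, will, be, so, sad]

Sentence 11 of the essay:

Noun phrase list: [human_being, health, smoke, restaurant]

Words outside noun phrases: [so, for, should, be, banned, completely, at]

Sentence 12 of the essay:

Noun phrase list: [smoke, smoke_people, people, smoke_people]

Words outside noun phrases: [is, not, only, bad, for, but, also, unhealthy, for, around]

Sentence 13 of the essay:

Noun phrase list: [person, people, person, second-hand_smoke]

Words outside noun phrases: [when, is, smoking, around, may, get]

Sentence 14 of the essay:

Noun phrase list: [second-hand_smoke, nicotine, human_body]

Words outside noun phrases: [also, contain, also, harmful, for]

Sentence 15 of the essay:

Noun phrase list: [people, restaurant, restaurant]

Words outside noun phrases: [and, see, that, if, smoking, in, in, will, be, harmed, by]

Sentence 16 of the essay:

Noun phrase list: [smoke, environment]

Words outside noun phrases: [as, far, as, know, also, pollute]

Sentence 17 of the essay:

Noun phrase list: [smoke, world]

Words outside noun phrases: [so, if, can, ban, will, save]

Sentence 18 of the essay:

Noun phrase list: [word, smoke, health, world]

Words outside noun phrases: [in, is, bad, for, and, also, harmful, for]

Sentence 19 of the essay:

Noun phrase list: [smoke, restaurant, country, little_step, one_day, world]

Words outside noun phrases: [banning, completely, at, in, is, we, will, stop, smokingall, over].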

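Step 2: run the multi-source semantic representation model construction module

Input the training corpus. One article from the corpus is shown as an example (punctuation already removed and words already lowercased):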

james sprunt community college is community college located in kenansville north carolina founded in as james sprunt technical institute the college is named for james menzies sprunt scottish immigrant who became teacher presbyterian minister and the longtime pastor of grove presbyterian church in kenansville james sprunt institute active from to was also named for him in popular culture on the adult swim program tim and eric awesome show great job the fictional acting teacher tairy greene portrayed by zach galifianakis mentions that he studied acting for years at james sprunt community college under the tutelage of randy tutelage references external links official website one alpine skier from mexico competed at the winter olympics in sarajevo yugoslavia it was the first time since that an athlete from mexico competed at the winter games alpine skiing men athlete event final run rank run rank total rank hubertus von hohenlohe downhill colspan giant slalom slalom references official olympic reports

The neural probabilistic word vector space is trained on the training corpus. Some of the resulting word vectors are:

hanging0.23919-0.0577880.28155-0.92667-0.0441210.53662-0.27426-0.119410.1951-0.68673-0.273440.144330.11874-0.18663-0.443650.162450.094339-0.05949-0.0331760.138150.31370.0419380.26662-0.263940.26521-0.51041-0.15387-0.326630.254360.407840.358610.10144-0.0933940.10339-0.216250.29438-0.28886-0.44582-0.174070.295660.1764-0.50125-0.3882-0.224070.02331-0.11821-0.038837-0.203650.047709-0.39679-0.31851-0.082251-0.16661-0.13728-0.14927-0.45177-0.418030.175020.128970.221660.473910.11040.239740.37242-0.048588-0.91877-0.146110.20126-0.026047-0.069531-0.42555-0.38649-0.45285-0.0028376-0.257420.420570.353410.577550.20441-0.55476-0.30309-0.74144-0.0459780.5066-0.329280.0241820.0959370.501570.0554860.84590.18132-0.15232-0.039162-0.20784-0.0237330.137880.094208-0.74280.36825-0.65908-0.12860.212140.163670.073155-0.226390.0750420.49079-0.22637-0.03359-0.22398-0.69074-0.17205-0.048139-0.276870.963290.23638-0.443680.3157-0.088773-0.75049-0.55308-0.42920.663120.42478-0.16673-0.34658-0.30785-0.160350.0959840.21619-0.012272-0.0041859-0.032236-0.27677-0.32136-0.071497-0.193960.523920.28644-0.255870.0830770.511010.17382-0.024035-0.21401-0.1862-0.2082-0.492920.51177-0.51363-0.333730.086778-0.5377-0.26319-0.073083-0.0999810.19140.0056626-0.342080.589640.0129380.378220.125640.228960.24906-0.0269210.676130.11836-0.079872-0.1395-0.0748330.060482-0.49981-0.355970.127360.0153930.0771780.5560.40861-0.315110.59776-0.278470.481150.249370.041354-0.326870.51374-0.236760.024635-0.109530.32448-0.032524-0.1040.22082-0.481-0.37435-0.238580.27401-0.34379-0.138240.957940.37038-0.148580.163980.11623-0.086880.51634-0.21811-0.13780.18631-0.0368170.353860.13214-0.388330.510030.00622720.37996-0.41538-0.36582-0.0411860.06872-0.149550.060356-0.134820.102520.166870.025868-0.0035660.14212-0.361130.53408-0.15335-0.10818-0.067050.33439-0.088653-0.36444-0.354670.11725-0.372680.1166-0.10938-0.614520.26268-0.17361-0.323850.474050.55166-0.23739-0.20907-0.381280.0161590.78867-0.698470.933660.50606-0.0849480.22711-0.1791-0.239
550.1005-0.253810.484080.0213380.042411-0.094261-0.1956-0.067508-0.52924-0.038720.217460.27120.151960.012143-0.505690.13181-1.3256-0.41987-0.437890.12313-0.294990.018041-0.19957-0.69534-0.554030.215120.22967-0.044464-0.42413-0.155530.481370.32465-0.117570.19226-0.12712-0.0696250.21740.18468-0.19653-0.26582

……

suspension0.828510.00597690.031603-0.79928-0.11199-0.14388-0.00057314-0.72653-0.52745-1.1323-0.144010.47453-0.34809-0.79474-0.309840.0583290.439290.486-0.0014909-0.438280.26094-0.339110.211-0.0025407-0.54024-0.069989-0.03812-0.350160.219320.38787-0.19020.396960.493430.22657-0.477740.550180.23512-0.147840.170640.91089-0.732760.20272-0.61477-0.589270.145160.078067-0.686430.052211-0.0251820.067288-0.414890.38113-0.0491040.0175940.524820.13034-0.0548840.013173-0.612610.00565670.259860.15701-0.20333-0.84091-0.691660.12119-0.16841-0.450290.43662-0.00210730.23936-0.209960.272580.292140.225160.357140.0736630.2506-0.158410.502480.252490.092292-0.187920.07224-0.6072-0.446690.5217-0.028627-0.313230.88892-0.194880.000503520.21954-0.00431310.197460.73817-0.38324-0.391830.366610.17661-0.802730.10527-0.21081-0.231270.20850.151430.040899-0.069252-0.4092-0.380350.43222-0.0743840.1133-0.23557-0.15003-0.13831-0.18897-0.10059-0.287740.10831-0.13443-0.71330.24719-0.07813-0.10248-0.48975-0.073150.0670330.00942440.474180.192440.48936-0.3279-0.5349-0.0955450.158760.015618-0.29264-0.27575-0.058562-0.192040.525120.37785-0.27022-0.096685-0.36485-0.53822-0.937930.17441-0.37570.57556-0.61246-0.46770.484370.60666-0.622260.21757-0.66482-0.693660.309410.351370.082329-0.63005-0.045535-0.231710.51897-0.0824820.17250.088461-0.42847-0.237920.61107-0.029530.302430.314910.0757690.420030.0557450.0630060.376710.543710.0728690.0581720.37396-0.42847-0.21185-0.125920.057227-0.18631-0.18949-0.734260.47688-0.259040.0264130.04966-0.440390.37894-0.100110.444920.302710.251710.12711-0.0251120.30298-0.093830.35510.20199-0.35224-0.546520.132180.77310.086395-0.29460.38397-0.31528-0.14553-0.21731-0.21936-0.33665-0.180380.20154-0.34890.299580.373780.896810.69910.185950.238670.060575-0.064854-0.38944-0.367630.20581-0.49150.54620.192820.075935-0.212150.70653-0.20930.31674-0.041055-0.15759-0.58329-0.458450.24568-0.137780.58895-0.19625-0.20648-0.296080.228890.40737-0.5743-0.407190.32741-0.35409-0.204-0.079592-0.48132-1.078
20.194370.259130.486970.05461-0.364860.583740.29714-0.36717-0.528930.658330.15373-0.502950.591150.224570.34445-1.9855-0.191410.0605320.68289-0.152530.020860.2751-0.243350.0051796-0.41662-0.24436-0.22382-0.6991-0.49319-0.5425-0.158010.083770.266290.263540.37876-0.050207-0.256020.38773-0.51315

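After punctuation removal, lowercasing, and the other vocabulary processing steps, the vocabularies of the neural probabilistic word vector space, the word co-occurrence word vector space, the commonsense concept semantic network, and the synonym sets of the English semantic dictionary have the following format (words within a phrase are joined with underscores):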

hanging

……

suspension

……

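Next, formula (2) is used to compute word vectors for the words that do not overlap between the neural probabilistic word vector space and the word co-occurrence word vector space. Once the two vocabularies coincide, the two 300-dimensional vectors of each word are concatenated end to end into a 600-dimensional vector, reduced to 300 dimensions using formula (3), and normalized. Some of the word vectors in the processed vector space are: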

hanging0.031802-0.41789-0.27364-0.20835-0.088522-0.31877-2.0278-0.41936-0.023931-0.514660.49406-0.59810.16253-0.0954270.00557030.34199-0.739120.509310.197870.14534-0.462780.112410.204520.492450.087480.451970.50791-0.0474720.11835-0.065630.126310.022095-0.468160.11895-0.013502-0.204650.0663560.35878-0.267070.12467-0.0401260.0949480.19945-0.13872-0.390240.0855010.0556820.0388670.21418-0.15199-0.012582-0.10986-0.122240.44494-0.281760.129750.381560.193380.10542-0.11028-0.363760.22414-0.3059-0.0052694-0.23288-0.33320.111310.303170.154180.42172-0.24225-0.14368-0.20453-0.0742820.48042-0.210810.264370.0025660.43707-0.390650.201750.58272-0.18350.16389-0.0603410.841810.121790.342760.097630.333010.0564390.142840.117950.136520.33669-0.94013-1.3226-0.256070.29516-0.453050.130820.0290080.01140.056039-0.702030.19809-0.287230.37381-0.298440.099573-0.459950.627450.22335-0.055385-0.0679560.46065-0.29261-0.370560.12883-0.11338-0.074837-0.09774-0.031991-0.570730.092527-0.229650.35743-0.38747-0.2066-0.407870.131670.47939-0.34588-0.60255-0.297190.15197-0.168320.316990.185740.693010.03065-0.400890.840130.07810.288460.37415-0.36459-0.333280.854080.041278-0.199530.099816-0.34423-0.718090.36066-0.13962-0.323440.13438-0.1345-0.13256-0.571860.064128-0.14432-0.34887-0.82901-0.20519-0.313880.0212820.015116-0.17831-0.148140.28881-0.285640.206410.095834-0.407460.181650.0149680.331480.333610.380420.0282430.016214-0.14282-0.0632640.257130.234980.43597-0.15115-0.544840.463780.031480.0618570.00928640.465140.18515-0.06307-0.393780.576450.248780.464480.29624-0.068711-0.32654-0.15502-0.34664-0.28382-0.770120.570470.20652-0.0510770.0203180.29471-0.0267030.0478150.067252-0.0711840.23071-0.32768-0.147320.51635-0.359710.18805-0.21643-2.2853-0.69781-0.29208-0.0235640.1249-0.43929-0.40031-0.14997-0.56425-0.46338-0.500270.228550.16941-0.36181-0.35983-0.024519-0.18026-0.0291220.42308-0.083804-0.448250.354850.0214510.806660.139230.18715-0.58805-0.401030.262030.30486-0.150850.19503-0.250530.35659-0.039562-0.19839-
0.94252-0.0985590.193130.264490.0617650.13377-0.153020.7158-0.17790.29480.21325-0.10421-0.5558-0.001484-0.4335-0.046032-0.21391-0.28922-0.0975280.0747061.3054-0.189050.40572-0.35583-0.018936-0.242380.282220.133730.017749-0.088195-0.127810.269960.040815-0.12904-0.24057-0.100230.110140.115250.171450.6364

……

suspension-0.101890.0152460.5322-0.139560.38735-0.29086-1.95490.90194-0.0532890.567720.13875-0.34544-0.43041-0.36561.06290.54318-0.598770.773250.251720.12254-0.257380.622180.224240.33933-0.265920.0801130.51347-0.393770.701590.076896-0.51523-0.201770.051362-0.053497-0.309120.0267360.12980.260380.477620.85427-0.979360.305870.049706-0.0714780.35464-0.23545-0.210220.0030818-0.00970560.28543-0.1214-0.34855-0.467580.32384-0.099212-0.329660.011059-0.21445-0.0036375-0.0670050.33843-0.11606-0.558020.13923-0.26454-0.127770.160280.132990.322950.1059-0.562810.162650.151450.12045-0.0399210.015654-0.86113-0.4585-0.41292-0.098355-0.452210.352360.10289-0.0999650.01852-0.059492-0.246570.27229-0.1931-0.08291-0.0089034-0.018566-0.912850.218910.146180.31276-2.0524-0.273690.506480.916520.203580.0028857-0.098201-0.326460.28659-0.73806-0.305470.2015-0.109940.169410.0898170.16318-0.24181-0.34227-0.098446-0.140910.10663-0.03339-0.182710.170060.15107-0.18167-0.034147-0.244911.2143-0.809180.69265-0.23376-0.081987-1.214-0.0113140.0295780.42143-0.37014-0.467160.95952-0.56868-0.46442-0.2009-0.630730.22198-0.634590.46664-0.61074-0.122440.571680.12979-0.61571-0.06987-0.35169-0.33271-0.20907-0.24406-0.60876-0.87862-0.0472430.34044-0.522770.022991-0.350520.17174-0.15517-0.013945-0.00104140.086391-0.453540.913660.49668-0.55765-0.0128620.106140.21335-0.644320.47922-0.384980.538250.081585-0.457670.11247-0.424660.28374-0.351060.267110.233150.69870.842170.0741950.18651-0.42782-0.31332-0.051101-1.3789-0.021236-0.144530.429550.446480.53659-0.30197-0.11940.405330.221950.35115-1.0343-0.31510.024244-0.33766-0.47648-0.68371-0.15932-0.828430.19854-0.23612-0.081498-0.684140.30149-0.394940.222660.27925-0.47785-0.501290.32878-0.066530.105060.57818-1.777-0.82636-0.39695-0.36098-0.19781-0.182560.199760.205220.22025-0.140330.612760.41369-0.589740.97-0.039828-0.045545-0.44788-0.613680.372710.27534-0.46063-0.209080.3343-0.18215-0.666340.24056-0.509490.471370.155220.39767-0.577870.393990.29711-0.19939-0.238770.47019-0.4
73670.40109-0.24246-0.00625510.164360.39355-0.036887-0.0968990.0171980.382610.19252-0.0515590.59492-0.187550.0819510.10761-0.479710.14990.18809-0.16989-0.00507940.42582-0.0087314-0.19854-0.49675-0.503480.274960.25943-0.363130.1716-0.25746-0.24901-0.55776-1.2544-0.50924-0.241220.138420.23861-0.36239-0.43775

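Input the synonym sets of the English semantic dictionary, whose content is as follows:

hanging … suspension …

…

The normalized vector space above is refined with the synonym sets of the English semantic dictionary by minimizing formula (5). Finally, the sparse symmetric commonsense concept semantic network is used, through formula (6), to extend the vector space refined with the synonym sets, yielding the multi-source semantic representation model. Part of the multi-source semantic representation model is shown below: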

hanging0.07650.05820.10550.1206-0.06310.0423-0.09270.0020-0.03230.02490.00140.0367-0.1064-0.06560.06300.0728-0.00850.1082-0.0236-0.01680.03610.04610.11760.0231-0.1262-0.0111-0.01030.02900.02440.0017-0.0548-0.08140.05740.03300.00850.17190.0286-0.0348-0.0434-0.0693-0.06520.0254-0.0615-0.02770.02140.0068-0.02900.02080.03720.08320.0013-0.0453-0.01430.01830.0778-0.0862-0.0288-0.0634-0.0123-0.05690.1548-0.0047-0.0191-0.1275-0.0298-0.00370.14140.0227-0.04340.1280-0.06050.07120.0961-0.0480-0.0401-0.0375-0.0387-0.0052-0.1026-0.04730.08810.09500.12060.01860.03620.12720.0155-0.0751-0.1250-0.0076-0.04500.0382-0.0426-0.00160.05340.0714-0.0378-0.1132-0.04990.05980.0677-0.0375-0.02120.00170.0233-0.15020.0092-0.09460.00500.03110.07390.0434-0.0333-0.0492-0.0080-0.03460.0386-0.0361-0.0073-0.0627-0.00070.0010-0.0711-0.01990.0322-0.0638-0.04930.02860.0050-0.08480.0272-0.1084-0.0260-0.11460.1262-0.0443-0.0703-0.02170.03040.06420.17710.02500.08340.08350.0018-0.0456-0.1341-0.03590.0205-0.0172-0.0170-0.00930.01400.0305-0.04150.03160.0713-0.0850-0.01350.0221-0.07200.08210.05960.0337-0.09320.0104-0.0035-0.02330.0459-0.0108-0.02360.0269-0.0727-0.01060.0149-0.0498-0.02140.04120.07370.03990.0134-0.02650.02000.0817-0.0625-0.0266-0.04580.0157-0.0469-0.04490.09690.0045-0.0316-0.0225-0.00330.05910.0167-0.04890.1051-0.0101-0.02820.03890.0213-0.0820-0.0696-0.0625-0.09070.0494-0.0169-0.0012-0.05770.01910.0513-0.05890.11570.07150.02770.0901-0.03160.00530.03530.05180.0248-0.0177-0.06310.0629-0.0525-0.0245-0.0335-0.03800.01860.0258-0.1095-0.02800.0366-0.05330.02900.00680.0371-0.03720.0114-0.05410.01190.04400.06080.05960.04920.05450.03590.06400.0594-0.0105-0.01620.0082-0.0707-0.01380.04790.01480.03250.03330.02150.06990.0531-0.02060.0395-0.00060.12860.05860.0581-0.0475-0.0569-0.00340.1164-0.00660.0069-0.0866-0.0174-0.0272-0.03610.0159-0.0573-0.02270.0417-0.09480.00590.09720.01570.01690.01860.0323-0.0055-0.01740.0025-0.0100-0.0275-0.03180.0054-0.00280.07010.0491

……

suspension0.11880.01410.0118-0.1078-0.0965-0.0690-0.02740.0467-0.04760.0213-0.14680.0148-0.0361-0.0762-0.01290.0078-0.0430-0.0495-0.0199-0.18350.00870.05410.04680.0244-0.13190.03130.00450.06730.06260.0816-0.10500.0200-0.03520.0473-0.02980.0216-0.08860.14500.01950.0212-0.0597-0.11090.0276-0.0326-0.00630.01870.07220.0073-0.05760.10570.0011-0.0270-0.00360.00550.05350.0614-0.0797-0.0136-0.0072-0.05140.0294-0.04150.03650.16580.0162-0.11260.06250.0459-0.01820.12100.10890.02260.03310.0234-0.06760.05520.0079-0.00080.00040.04820.09560.07570.01270.01880.03800.04480.0091-0.0170-0.0259-0.0187-0.0257-0.01790.03750.03760.02950.08650.1049-0.0819-0.01550.07420.1480-0.04490.02590.03150.04200.00140.0552-0.01700.03140.12810.01210.15410.0101-0.0899-0.01450.05310.02330.0588-0.0302-0.07800.0800-0.0904-0.0346-0.04430.0487-0.11160.0052-0.05710.0780-0.10710.01750.04090.10530.0020-0.0031-0.0452-0.03150.05490.00830.03500.0987-0.01370.06350.0898-0.1462-0.00440.05630.0110-0.03760.0079-0.00130.08940.0189-0.0249-0.03260.01010.1208-0.02640.0388-0.0162-0.0058-0.01680.03400.08720.04420.0264-0.03770.05810.02370.02500.0007-0.0038-0.0914-0.0037-0.06940.0060-0.02150.08700.01490.01960.0582-0.0356-0.0200-0.07700.00380.06550.0028-0.0405-0.05880.00320.0840-0.0235-0.0003-0.03470.0048-0.05360.0644-0.0213-0.1193-0.0412-0.0965-0.02990.0766-0.10560.0451-0.01370.02120.0342-0.05980.0381-0.01830.1068-0.0411-0.07390.02440.0642-0.08790.1046-0.11730.01870.06140.0459-0.0581-0.0522-0.01000.0205-0.0605-0.10230.0702-0.0114-0.02320.05720.0441-0.0392-0.0994-0.06920.02530.01960.0396-0.06800.0752-0.0348-0.0236-0.01700.00240.05610.0619-0.03030.00630.0026-0.0078-0.02960.01370.0270-0.02200.07070.05480.03600.04670.07730.0186-0.0332-0.0215-0.0485-0.03670.02750.02460.0631-0.0969-0.0334-0.00120.06710.0579-0.0146-0.02730.0651-0.00050.0522-0.02720.0192-0.0515-0.05180.0278-0.02510.00820.10100.0836-0.00740.00280.08170.02200.02860.02130.04250.00590.01320.03230.0380-0.02010.0408。

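Step 3: run the essay representation model construction module

The noun phrase list and word list obtained by preprocessing the title are passed through the essay representation module to obtain the corresponding 300-dimensional vector representation. The result is:

Title vector of the essay to be analyzed: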

[0.58,0.28,0.22,-0.47,0.41,0.34,0.97,0.40,-0.45,0.49,-0.09,0.17,0.22,0.02,0.51,0.12,0.14,-0.43,-0.01,0.56,0.41,-0.46,0.47,0.31,-0.09,0.23,0.09,0.33,-0.13,-0.21,0.32,0.26,-0.22,0.07,-0.05,0.52,0.25,-0.14,-0.23,-0.00,-0.14,0.02,-0.24,0.22,0.30,-0.26,0.03,0.02,0.41,0.24,0.01,0.16,-0.06,0.10,0.47,-0.07,-0.20,-0.45,0.09,-0.10,-0.14,0.40,-0.07,0.21,0.08,0.01,0.24,0.26,0.06,0.03,-0.11,-0.10,-0.16,-0.13,0.38,-0.40,0.12,0.27,-0.29,-0.31,-0.05,0.09,0.38,0.30,-0.01,-0.25,0.01,0.08,-0.13,0.25,-0.16,0.19,0.03,0.24,0.08,0.31,0.13,0.02,-0.04,0.24,-0.03,0.35,-0.15,-0.21,-0.14,-0.35,0.02,0.16,0.04,-0.09,-0.07,0.04,0.25,-0.11,0.35,-0.01,0.10,-0.05,-0.48,0.25,-0.02,-0.13,0.36,-0.24,0.03,0.09,-0.07,0.01,-0.10,-0.22,0.03,-0.37,0.22,0.21,-0.05,0.05,-0.02,0.11,0.03,0.08,0.01,-0.14,-0.22,-0.11,0.13,0.12,-0.22,-0.07,0.09,0.00,0.23,0.00,0.01,0.05,-0.08,0.27,0.25,-0.38,-0.15,-0.03,-0.44,0.07,-0.29,0.05,0.29,-0.02,0.12,-0.14,-0.06,0.28,0.05,-0.10,0.22,-0.29,0.10,-0.29,-0.55,-0.04,-0.05,0.23,0.03,0.23,-0.46,-0.14,0.01,-0.22,-0.32,0.21,0.01,0.01,0.27,0.06,0.26,-0.42,0.13,0.13,0.07,0.09,0.08,0.11,-0.19,0.15,0.11,-0.12,0.04,-0.05,-0.05,0.08,0.21,0.17,0.07,-0.23,-0.08,0.12,-0.00,-0.15,0.20,0.08,-0.03,0.29,0.19,0.01,-0.16,-0.18,0.02,0.04,-0.12,0.12,0.09,-0.34,0.22,0.02,-0.11,0.01,0.11,-0.07,-0.29,0.01,-0.10,-0.18,-0.04,0.12,-0.05,0.15,-0.10,-0.03,0.18,0.05,0.04,0.03,-0.17,-0.14,-0.05,0.12,-0.10,-0.23,0.17,-0.11,-0.19,0.32,-0.24,-0.26,-0.01,-0.03,0.24,0.10,-0.19,0.02,0.04,0.09,0.01,-0.37,-0.13,0.08,-0.20,-0.04,-0.01,0.15,-0.07,0.02,-0.07,-0.32,0.08,0.09,-0.13,-0.01,-0.20,0.10,-0.20,-0.12,0.04,-0.01,-0.05,-0.23,0.09,0.05,-0.03,-0.05,0.14,0.04]

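After preprocessing, each sentence of the essay under analysis yields a noun-phrase list and a word list; passing these through the English essay representation module produces the corresponding 300-dimensional vectors, shown below:

Sentence vectors of the essay under analysis: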

Sentence vector of sentence 1: [0.46,0.61,0.21,-0.08,-0.09,-0.12,0.94,0.11,-0.10,0.13,-0.20,-0.04,0.05,-0.28,0.13,0.11,0.56,0.14,0.04,0.31,0.12,-0.28,0.23,0.20,-0.08,0.05,-0.11,0.30,-0.40,-0.09,0.12,0.09,0.05,0.23,-0.33,0.54,0.20,0.19,0.45,-0.11,0.32,0.02,-0.35,-0.24,0.17,0.08,0.55,0.27,0.03,-0.20,0.10,-0.29,0.22,0.28,0.33,-0.15,0.13,-0.22,0.35,0.34,-0.08,-0.05,0.14,0.31,0.11,0.29,0.15,-0.03,-0.28,-0.28,-0.44,0.17,0.29,0.15,0.20,-0.03,0.13,0.50,0.09,-0.13,-0.10,0.46,0.05,0.09,-0.30,-0.35,-0.10,0.22,-0.18,0.10,-0.06,-0.07,0.21,0.23,0.21,0.32,-0.08,-0.06,0.09,-0.25,0.15,-0.04,-0.11,0.19,-0.06,-0.11,-0.07,-0.14,-0.13,0.13,-0.07,0.01,0.19,-0.04,0.04,0.19,0.10,-0.21,0.15,-0.03,0.32,-0.24,0.17,-0.31,0.12,-0.06,-0.03,0.08,0.01,-0.11,0.15,0.03,0.20,-0.11,0.12,-0.23,-0.15,0.22,-0.01,0.03,0.20,0.29,-0.16,0.15,-0.14,0.34,-0.09,0.04,0.11,-0.01,0.20,0.14,-0.08,0.11,0.18,-0.05,0.30,0.15,-0.18,0.22,-0.01,-0.20,0.15,-0.23,0.15,0.18,-0.14,-0.09,-0.19,0.31,-0.04,0.03,-0.13,-0.16,-0.16,-0.01,-0.05,0.02,0.08,0.10,0.05,-0.04,0.00,0.26,-0.33,0.01,-0.22,0.10,-0.00,-0.04,0.25,0.07,-0.13,0.08,-0.12,0.01,-0.09,-0.10,-0.15,-0.02,-0.19,-0.01,-0.06,-0.14,0.27,-0.01,0.00,-0.01,0.04,0.13,0.04,0.08,-0.12,-0.05,-0.09,-0.07,-0.03,0.08,-0.13,0.07,-0.01,0.15,-0.26,0.01,0.04,0.23,0.04,0.02,0.18,-0.06,-0.03,0.11,-0.10,-0.11,-0.13,0.07,0.02,0.02,0.16,0.05,-0.13,-0.21,-0.05,-0.21,-0.17,-0.10,0.04,0.13,-0.15,0.15,0.22,-0.16,0.23,-0.20,-0.07,-0.16,0.11,0.23,-0.10,0.19,0.10,-0.22,-0.05,0.05,0.02,0.03,-0.21,0.11,-0.11,0.11,0.03,-0.21,0.01,0.01,-0.09,-0.14,-0.00,-0.19,-0.16,-0.10,0.20,0.00,0.02,0.08,-0.17,-0.00,-0.00,-0.08,-0.14,-0.00,-0.00,0.01,0.10,-0.09,-0.10,-0.07,-0.01,-0.12,0.02,0.17]

……

Sentence vector of sentence 19: [0.75,0.44,0.11,-0.25,0.57,-0.10,0.92,0.17,-0.29,0.13,-0.24,-0.02,0.20,-0.17,0.12,0.19,0.42,0.15,0.12,0.21,0.33,-0.43,-0.08,0.12,0.29,0.25,-0.05,-0.22,0.15,-0.12,0.33,0.02,-0.21,-0.19,-0.42,0.44,0.21,-0.05,0.40,0.18,-0.12,-0.04,-0.17,0.10,-0.14,0.15,0.12,-0.11,-0.29,-0.18,-0.09,0.03,-0.14,0.14,0.70,0.02,-0.08,-0.24,0.14,0.08,0.32,-0.02,0.03,-0.31,0.44,0.02,0.07,0.32,0.09,-0.30,-0.02,0.32,-0.34,0.25,0.20,-0.12,0.21,-0.03,-0.19,-0.07,-0.11,0.45,0.22,0.11,0.17,-0.02,-0.24,0.34,0.06,0.16,-0.38,0.13,0.46,0.22,-0.36,0.25,0.13,0.08,0.31,0.06,-0.12,0.10,0.18,0.11,-0.46,0.04,-0.13,-0.02,-0.33,0.01,-0.18,-0.27,0.01,-0.04,0.09,0.07,0.07,0.37,-0.38,-0.47,0.43,-0.01,-0.09,-0.05,0.19,0.04,0.17,-0.01,-0.32,0.07,-0.06,0.05,-0.11,0.02,0.01,-0.42,-0.01,-0.18,-0.19,0.11,-0.01,-0.18,-0.02,-0.07,-0.01,0.23,-0.03,-0.09,0.03,0.04,0.09,-0.18,-0.08,0.26,0.26,0.22,0.30,-0.58,-0.24,0.28,-0.01,-0.16,-0.21,-0.16,0.05,-0.06,-0.16,-0.23,-0.12,0.12,0.17,0.07,-0.30,-0.21,-0.17,-0.40,-0.29,-0.04,0.20,-0.18,-0.14,0.01,-0.44,0.08,-0.31,-0.13,-0.14,0.08,-0.09,-0.11,0.07,0.12,-0.10,0.14,-0.25,0.09,0.04,-0.01,0.00,-0.24,-0.24,0.11,-0.17,0.21,0.13,0.13,0.36,-0.08,-0.03,0.01,0.19,-0.19,0.04,0.17,-0.07,-0.40,-0.16,0.19,-0.15,0.03,-0.11,0.02,0.06,0.09,0.00,-0.17,0.02,0.05,0.08,-0.35,0.12,0.27,-0.00,-0.03,-0.16,-0.09,-0.08,-0.27,-0.13,-0.00,-0.16,-0.21,-0.11,-0.29,-0.09,0.15,-0.02,0.11,0.11,-0.06,0.08,-0.05,-0.01,-0.21,0.04,-0.24,0.03,0.13,-0.14,0.14,-0.02,0.08,-0.02,-0.05,-0.14,-0.14,-0.19,-0.10,-0.06,-0.07,-0.02,-0.15,0.15,0.06,-0.10,-0.05,-0.04,-0.01,0.02,0.11,-0.36,-0.11,-0.09,0.23,-0.06,-0.01,0.19,0.01,0.06,0.08,-0.15,-0.00,-0.20,-0.08,0.12,0.02,0.09,-0.14,-0.02,-0.12].

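Step 4: Run the "English essay sentence off-topic analysis module"

Using the title vector and the sentence vectors of the essay under analysis obtained in Step 3, Formula (9) gives the semantic similarity between the title and the essay, as follows:

Semantic similarity between the title and the essay under analysis: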

0.5785602927207947
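The essay-level score above can be sketched in code. Formula (9) itself is not reproduced in this part of the text, so the sketch assumes cosine similarity between the title vector and each sentence vector, averaged over all sentences, which matches the summary's description of taking the mean of the per-sentence similarities. The function names `cosine_similarity` and `essay_title_similarity` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors.

    Assumption: the patent's "semantic similarity" is taken to be cosine
    similarity; the exact formula is defined elsewhere in the document.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a zero vector carries no direction, so treat it as unrelated
    return dot / (norm_u * norm_v)


def essay_title_similarity(title_vec, sentence_vecs):
    """Mean similarity between the title vector and every sentence vector."""
    if not sentence_vecs:
        return 0.0
    sims = [cosine_similarity(title_vec, s) for s in sentence_vecs]
    return sum(sims) / len(sims)
```

With the 300-dimensional title vector and the 19 sentence vectors from Step 3 as inputs, a function of this shape would return a single value comparable to the one reported above.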

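Formula (9) likewise gives the semantic similarity between the essay under analysis and each of the 140 topics in the English essay topic library, as follows: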

0.21049652993679047

0.2283303588628769

-0.03294780105352402

……

0.13407745957374573

-0.02940783090889454

0.196487158536911

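The invention specifies that if the similarity between the essay under analysis and its title does not rank in the top 5, the essay is judged unrelated to the topic under analysis. The similarity between the essay and its title is sorted in descending order together with the similarities between the essay and the topics in the topic library. Here the similarity between the essay and its title ranks 1st, so the essay is not unrelated to its topic. The next step therefore extracts the off-topic sentences and scores how well the essay stays on topic.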
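The top-5 gate described above can be sketched as a rank check. This is a minimal illustration, not the patent's implementation: the function name `is_related_to_title` and the `top_k` parameter are hypothetical, and ties are resolved in the essay's favor as an assumption.

```python
def is_related_to_title(title_similarity, topic_similarities, top_k=5):
    """Return True if the essay-title similarity ranks within the top_k
    when sorted in descending order together with the essay-topic similarities.

    The rank is 1 plus the number of library topics that score strictly
    higher than the title (assumption: ties favor the title).
    """
    rank = 1 + sum(1 for s in topic_similarities if s > title_similarity)
    return rank <= top_k


# Only the six topic similarities printed in the text are used here;
# the remaining values of the 140-topic library are elided in the source.
shown_topic_sims = [
    0.21049652993679047,
    0.2283303588628769,
    -0.03294780105352402,
    0.13407745957374573,
    -0.02940783090889454,
    0.196487158536911,
]
print(is_related_to_title(0.5785602927207947, shown_topic_sims))
```

An essay that fails this check is reported as unrelated to its topic and the sentence-level analysis is skipped.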

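First, Formula (10) gives the similarity between the title and each sentence of the essay under analysis. The results are as follows:

Semantic similarity between the title and sentence 1 of the essay under analysis:

0.35596713423728943

Semantic similarity between the title and sentence 2 of the essay under analysis:

0.6706677675247192

Semantic similarity between the title and sentence 3 of the essay under analysis:

0.6703278422355652

……

Semantic similarity between the title and sentence 17 of the essay under analysis:

0.4644368588924408

Semantic similarity between the title and sentence 18 of the essay under analysis:

0.3410434126853943

Semantic similarity between the title and sentence 19 of the essay under analysis:

0.7801513671875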

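The invention sets the off-topic sentence extraction threshold to 0.25: a sentence whose semantic similarity to the title is below 0.25 is judged off topic. The off-topic sentences in the essay under analysis are therefore sentences 7 and 8, shown below:

Off-topic sentences:

Sentence 7 of the essay: the patients will suffer unimaginable pain.

Sentence 8 of the essay: and the family of the patient will be so sad.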
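The threshold test can be sketched as a simple filter over the per-sentence similarities. The function name `find_off_topic_sentences` is hypothetical, and the sample values in the usage lines are illustrative only; the similarities of sentences 4 to 16 are elided in the text and are not reconstructed here.

```python
def find_off_topic_sentences(sentence_similarities, threshold=0.25):
    """Return the 1-based indices of sentences whose similarity to the
    title is strictly below the threshold (0.25 in this embodiment)."""
    return [
        index
        for index, similarity in enumerate(sentence_similarities, start=1)
        if similarity < threshold
    ]


# Illustrative values, not taken from the embodiment.
sample_similarities = [0.36, 0.67, 0.10, 0.24, 0.25]
print(find_off_topic_sentences(sample_similarities))
```

The comparison is strict, so a sentence scoring exactly 0.25 is kept as on topic, following the wording "below 0.25".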

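Using Formula (11) of the invention, the sentence off-topic score and the generated comment for the essay under analysis are as follows:

Sentence off-topic score of the essay under analysis: 10.5 points

Sentence off-topic comment for the essay under analysis: The content of this English essay is basically on topic.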
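The reported score is consistent with the share of off-topic sentences expressed on a 100-point scale: 2 off-topic sentences out of 19 gives about 10.5. Formula (11) is not reproduced in this part of the text, so the sketch below assumes that form; the function name `off_topic_score` is hypothetical, and the score bands that select the comment are not stated here and are therefore left out.

```python
def off_topic_score(num_off_topic, num_sentences):
    """Share of off-topic sentences on a 0-100 scale.

    Assumption: Formula (11) is the ratio of off-topic sentences to all
    sentences multiplied by 100, which reproduces the value in the text.
    """
    if num_sentences <= 0:
        raise ValueError("the essay must contain at least one sentence")
    return 100.0 * num_off_topic / num_sentences


# 2 off-topic sentences (7 and 8) out of 19 sentences in the embodiment.
print(round(off_topic_score(2, 19), 1))
```

A lower score means fewer off-topic sentences, which agrees with the comment that the essay is basically on topic.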
