Abstract: With the development of artificial intelligence, applications of natural language understanding have become increasingly widespread, and almost every system that processes Chinese text must first perform word segmentation. Chinese word segmentation is the technique of splitting a Chinese sentence into words; it is the basis on which a computer understands the meaning of Chinese characters and the most important preprocessing step in Chinese information processing systems. The recognition of unknown words is a major factor affecting the accuracy of Chinese word segmentation. Unknown words are, broadly, words not included in the segmentation system's dictionary of common words. Chinese unknown words come in many types with varied structural patterns, are numerous, and are constantly being coined, so they can never be fully covered by a common-word dictionary. If a text contains unrecognized unknown words, the precision and recall of segmentation are directly degraded. Although many segmentation systems at home and abroad have improved the precision and recall of unknown-word recognition, misidentified and missed unknown words still interfere with Chinese information retrieval and with correct Chinese word segmentation.
First, this thesis takes the People's Daily corpus (2001~2004) as the experimental corpus and segments it with the word segmentation software of the Chinese Academy of Sciences (CAS). The thesis mainly processes the fragments formed by runs of consecutive single characters (segmentation fragments) and decides whether they are likely to be unknown words. The segmented corpus is analyzed with Professor Chen Xiaohe's package of algorithms: from a large-scale corpus we estimate, for each character, its occurrence probability, its single-character-word probability, and its single-character non-word probability, and implement the algorithm on these statistics. Three test corpora were selected; from the total number of extracted unknown words, the number of correctly extracted unknown words, and the number of unknown words that were not extracted, the precision and recall were respectively 84.61%/91.67%, 81.66%/98.0%, and 83.33%/90.91%. The results show that the system achieves a relatively high recognition rate for unknown words.
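The single-character-word and single-character non-word probabilities described above can be estimated by counting, in the segmented corpus, how often each character occurs as a standalone one-character token versus inside a longer token. A minimal sketch of this estimation follows; the function and variable names, and the toy corpus, are illustrative and not taken from the thesis:

```python
from collections import Counter

def char_word_stats(segmented_sentences):
    """Estimate per-character word/non-word usage probabilities
    from a word-segmented corpus given as a list of token lists."""
    as_word = Counter()   # occurrences as a single-character word
    total = Counter()     # all occurrences of the character
    for tokens in segmented_sentences:
        for tok in tokens:
            for ch in tok:
                total[ch] += 1
            if len(tok) == 1:
                as_word[tok] += 1
    stats = {}
    for ch, n in total.items():
        p_word = as_word[ch] / n            # single-character-word probability
        stats[ch] = (p_word, 1.0 - p_word)  # (word prob, non-word prob)
    return stats

# Toy segmented corpus; a real run would use the segmented People's Daily text.
corpus = [["他", "是", "学生"], ["学", "而", "时", "习", "之"]]
stats = char_word_stats(corpus)
# "学" occurs once as a word and once inside "学生", so its word
# probability is 0.5 and its non-word probability is 0.5.
```

A character with a high non-word probability that nevertheless appears inside a segmentation fragment is then a signal that the fragment may hide an unknown word.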
Keywords: unknown words, segmentation fragments, single-character non-word probability, single-character word probability
Abstract: With the development of artificial intelligence, applications of natural language understanding have become increasingly widespread, and almost any system based on Chinese text must go through the step of word segmentation. Chinese word segmentation is a technique for splitting Chinese sentences into words; it is the foundation on which a computer understands Chinese characters and the most important preprocessing technique in Chinese information processing systems. The identification of unknown words is an important factor in the accuracy of Chinese word segmentation. So-called unknown words mainly refer to words not included in the common-word dictionary of the segmentation system. Unknown words in Chinese have many types, varied structural patterns, and large numbers, and they are constantly being coined, so they cannot be fully included in a common-word dictionary. If an article contains unknown words that cannot be identified, this directly affects the precision and recall of Chinese word segmentation. Although there is much segmentation software at home and abroad, and the precision and recall of unknown-word recognition have improved, the misidentification and omission of unknown words still interfere with Chinese information retrieval and with correct Chinese word segmentation.
First, we select the People's Daily corpus (2001~2004) as the experimental corpus and segment it with the CAS word segmentation software. This paper mainly deals with the fragments consisting of runs of consecutive single characters (segmentation fragments) and determines whether they are likely to be unknown words. The segmented corpus is analyzed with the package of algorithms of Professor Chen Xiaohe: from a large corpus we calculate the occurrence probability of each character, its single-character-word probability, and its single-character non-word probability, and implement the algorithm on these data. This paper selects three test corpora; from the total number of extracted unknown words, the number of correctly extracted unknown words, and the number of unknown words not extracted, the precision and recall rates were respectively 84.61%/91.67%, 81.66%/98.0%, and 83.33%/90.91%. The results show that the system's recognition rate for unknown words is relatively high.
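The reported figures follow the standard definitions: precision is the number of correctly extracted unknown words divided by the total number extracted, and recall is the number correctly extracted divided by the total number of unknown words in the test corpus (correct plus missed). A small sketch of this evaluation follows; the example words are illustrative neologisms, not items from the thesis's test corpora:

```python
def precision_recall(extracted, gold):
    """Precision/recall for unknown-word extraction.

    extracted: set of words the system proposed as unknown words
    gold:      set of true unknown words in the test corpus
    """
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

extracted = {"微博", "给力", "蜗居", "山寨"}          # 4 proposed
gold = {"微博", "给力", "蜗居", "房奴", "秒杀"}        # 5 true unknown words
p, r = precision_recall(extracted, gold)
# 3 of the 4 proposals are correct (precision 0.75);
# 3 of the 5 true unknown words were found (recall 0.6).
```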
Key words: unknown words; segmentation fragments; single-character non-word probability; single-character word probability