
Indexing Term Techniques in Chinese Full-Text Information Retrieval Systems and the Implementation of a Word Segmentation System

Abstract: This paper introduces the indexing term techniques commonly used in Chinese full-text retrieval systems (characters, n-grams, and words) and discusses the characteristics of each. It then focuses on the method of using words as indexing terms and on the problem of Chinese word segmentation in full-text retrieval. Finally, it presents a hybrid maximum matching segmentation algorithm.

Keywords: information retrieval; Chinese information processing; word segmentation


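1 Introduction

In a full-text information retrieval system, the choice of indexing terms is a basic and very important problem. The first step in handling input documents and user queries is to decompose them into sets of indexing terms; only then can the relevance between a query and a document be computed. In English full-text retrieval systems this decomposition is very simple, because words are normally chosen as indexing terms and English words are separated by delimiters such as spaces. For Chinese full-text retrieval systems the decomposition is more complicated. One must first decide which unit to index: characters, words, or phrases. Most existing research holds that words should be the indexing terms, for two reasons. First, words fit people's natural way of thinking. Second, with words as indexing terms, the theory and methods already developed for English full-text retrieval systems can be reused.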


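Using words as indexing terms requires word segmentation, that is, decomposing a continuous string of Chinese characters into a set of words. Correct segmentation is not easy. In Chinese there are no delimiters between characters or between words, so segmentation generally relies on a dictionary. Chinese word formation, however, is very flexible and the number of words is practically unlimited, so a complete dictionary cannot be built. To overcome the difficulties of word-based indexing, other methods have been proposed, such as indexing by character or by bigrams and trigrams.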


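This paper first briefly introduces the various types of indexing term techniques and analyzes their advantages and disadvantages for Chinese retrieval. It then concentrates on the design and implementation of a word segmentation system for word-based indexing.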


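2 Indexing Terms and the Representation of Chinese Text

2.1 Characters

Using characters as indexing terms is the simplest method, and decomposing text into indexing terms is trivial to implement. GB2312 defines 6763 Chinese characters, so the index set is very small, containing at most 6763 terms. In this respect the advantage over other indexing term techniques (such as words and n-grams) is very clear. Character-based indexing also has obvious drawbacks. First, matching precision is low. For example, if the user's query is "识别" (recognize) and a document contains the sentence "你是否还认识别的人?" (Do you still know anyone else?), a character-based retrieval method will judge the document relevant to the query. Second, in Chinese the same concept can be expressed in several ways, such as "中文", "汉语", and "国语" (all meaning the Chinese language). Character-based retrieval cannot handle this kind of problem.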

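The false match described above can be reproduced with a minimal sketch. It assumes a simplified matching model in which a query matches whenever all of its characters occur in the document; the function names are hypothetical and not taken from the paper.

```python
def char_terms(text):
    """Return the set of Chinese characters in text, used as index terms."""
    return {ch for ch in text if "\u4e00" <= ch <= "\u9fff"}


def char_match(query, document):
    """Assumed model: a query matches if all its characters occur in the document."""
    return char_terms(query) <= char_terms(document)


doc = "你是否还认识别的人?"
print(char_match("识别", doc))            # True, although the document is unrelated
print(char_match("中文", "汉语和国语"))    # False, although the concepts are the same
```

The first call shows the precision problem and the second shows the synonym problem: neither can be fixed at the character level.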

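2.2 n-grams

Bigrams and trigrams are the n-grams most commonly used in full-text retrieval. The idea of bigram indexing is to take every pair of adjacent Chinese characters in the text as an indexing term, so that the last character of one term is the first character of the next. For example, the string C1C2C3C4C5 yields the indexing terms C1C2, C2C3, C3C4, C4C5. Trigram indexing works the same way except that each term consists of three characters; the same string yields the trigram terms C1C2C3, C2C3C4, C3C4C5.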


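Like character indexing, n-gram indexing makes it very easy to decompose text into indexing terms. Its index space, however, is enormous, and using n-grams likewise prevents the system from exploiting linguistic knowledge.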


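2.3 Words

Most researchers currently hold that Chinese full-text retrieval should also use the word as the indexing unit, that is, the indexing terms should be Chinese words. The benefits are clear: this matches human habits, helps improve query precision, and makes it easier for the system to exploit linguistic knowledge. A cross-language query system, if one is to be designed later, makes word-based indexing indispensable. Using words as indexing terms, however, requires the word segmentation problem to be solved first.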


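3 A Hybrid Forward Maximum Matching Algorithm

Chinese word segmentation has been studied for more than twenty years, and many segmentation algorithms have been proposed. Broadly, they fall into four classes. The first class is dictionary-based mechanical segmentation. The second is statistics-based segmentation. The third is hybrid segmentation combining the first and second classes. The fourth is knowledge-based segmentation expert systems.

Each segmentation algorithm has its own domain of application. Full-text retrieval involves large numbers of documents and demands high speed, so we designed a hybrid forward maximum matching algorithm. It uses rules and character frequency information to resolve segmentation ambiguity, and it adopts the three-word chunk method [1]. To speed up word lookup during segmentation, the dictionary is organized as a first-character index structure.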
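For orientation, plain forward maximum matching, the framework that the hybrid algorithm builds on, can be sketched as follows. This is the textbook baseline without any disambiguation, not the authors' implementation; the toy dictionary and the `max_len` limit are assumptions made for illustration.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy baseline: at each position take the longest dictionary word."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a word is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the scan cannot stall.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


toy_dictionary = {"研究", "研究生", "生命", "起源"}  # hypothetical entries
print(forward_max_match("研究生命的起源", toy_dictionary))
# ['研究生', '命', '的', '起源']
```

The greedy choice of "研究生" at the first position forces a poor segmentation of the rest of the phrase, which is the kind of ambiguity the rule-based extension is meant to resolve.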


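3.1 Three-Word Chunks and Rules for Resolving Ambiguity

The three-word chunk is a method for handling segmentation ambiguity. When an ambiguity arises during segmentation (suppose the string is C1C2C3C4C5C6, the current character is C1, and both C1 and C1C2 are words), the algorithm looks ahead for two more words. A string of three words formed in this way is called a three-word chunk. We find all possible three-word chunks and take the chunk with the greatest length as the most likely segmentation.

Suppose the string is C1C2C3C4C5C6, both C1 and C1C2 are words, and the possible three-word chunks include the following.

1  C1  C2  C3C4
2  C1C2  C3C4  C5
3  C1C2  C3C4  C5C6

The third chunk has the greatest length, so we take C1C2 in the third chunk as the correct segmentation and accept it as a word. Segmentation then starts again from C3 and continues until the end of the string.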
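Chunk enumeration can be sketched as a small recursive search. The helper names, the `max_len` limit, and the convention that any single character counts as a word are assumptions made for this illustration; the letters a to f stand in for C1 to C6.

```python
def words_at(text, start, dictionary, max_len=4):
    """Candidate words beginning at start; a lone character always counts."""
    found = []
    for size in range(1, min(max_len, len(text) - start) + 1):
        piece = text[start:start + size]
        if size == 1 or piece in dictionary:
            found.append(piece)
    return found


def three_word_chunks(text, start, dictionary, max_len=4):
    """Enumerate chunks of up to three consecutive words beginning at start."""
    chunks = []

    def extend(pos, chunk):
        # Stop after three words, or earlier if the string is exhausted.
        if len(chunk) == 3 or pos == len(text):
            chunks.append(tuple(chunk))
            return
        for word in words_at(text, pos, dictionary, max_len):
            extend(pos + len(word), chunk + [word])

    extend(start, [])
    return chunks


toy_dictionary = {"ab", "cd", "ef"}  # stands in for C1C2, C3C4, C5C6
chunks = three_word_chunks("abcdef", 0, toy_dictionary)
longest = max(chunks, key=lambda chunk: sum(len(word) for word in chunk))
print(longest)     # ('ab', 'cd', 'ef')
print(longest[0])  # 'ab' is accepted; segmentation would resume at 'c'
```

Only the first word of the winning chunk is committed, so the look-ahead is repeated from the next position rather than trusting the whole chunk.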


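Our segmentation algorithm uses forward maximum matching as its framework. When an ambiguity arises during segmentation, the following rules are applied to resolve it.

Rule 1

The first word of the chunk with the greatest length is the correct segmentation.

Rule 2

If the chunk with the greatest length is not unique, choose the three-word chunk with the smallest variation in word length. The implicit assumption behind this rule is that word lengths are evenly distributed in a document.

For example:

1  研究 生命 的 起源

2  研究生 命 的 起源

By this rule, "研究" in chunk 1 is chosen as the correct segmentation.

Rule 3

If the chunks with the greatest length are not unique and have the same word-length variation, the first word of the chunk with the largest average word length is the correct segmentation. The implicit assumption behind this rule is that multi-character words are more likely to be encountered than single-character words. This rule is useful only when some chunks consist of one or two words.

Rule 4

If none of the preceding rules determines which chunk to choose, compute for each chunk the sum of the frequencies of its single-character words and choose the chunk with the largest sum.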
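The four rules can be read as a lexicographic comparison, sketched below. Two points are assumptions rather than details from the paper: "variation in word length" is measured here as the population variance of the word lengths, and the frequency table `char_freq` holds made-up values.

```python
from statistics import pvariance


def choose_chunk(chunks, char_freq):
    """Pick a chunk by applying Rules 1 to 4 in order as tie-breakers."""
    def key(chunk):
        lengths = [len(word) for word in chunk]
        total = sum(lengths)                 # Rule 1: greatest total length
        variation = pvariance(lengths)       # Rule 2: smallest variation (assumed: variance)
        average = total / len(lengths)       # Rule 3: largest average word length
        single_freq = sum(char_freq.get(word, 0)
                          for word in chunk if len(word) == 1)  # Rule 4
        return (total, -variation, average, single_freq)

    return max(chunks, key=key)


# First three words of each candidate segmentation from the example above.
candidates = [("研究", "生命", "的"), ("研究生", "命", "的")]
best = choose_chunk(candidates, char_freq={})
print(best[0])  # 研究
```

Both candidates cover five characters, so Rule 1 ties; the lengths (2, 2, 1) vary less than (3, 1, 1), so Rule 2 selects the first chunk and "研究" is taken as the word.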


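3.2 Dictionary Organization and Word Lookup

The dictionary contains 120,000 word entries. It is organized as a first-character index structure, shown schematically below.

[Figure: schematic of the first-character index structure of the dictionary]

The dictionary consists of two parts: an index part and the dictionary body. Each entry in the index part consists of a character, its frequency, and a pointer to the starting address of all words that begin with that character. The body holds the word entries, stored in order of increasing length. This organization was chosen to speed up word lookup.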
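An in-memory analogue of this layout might look like the sketch below. It is an assumption-laden illustration: a Python dict plays the role of the index part, a per-character list replaces the pointer into the dictionary body, and the class and function names as well as the frequency value are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class IndexEntry:
    char_freq: int = 0                          # frequency of the first character
    words: list = field(default_factory=list)   # words with this first character


def build_dictionary(words, char_freq):
    """Group words by first character and store each group shortest first."""
    index = {}
    for word in words:
        entry = index.setdefault(word[0], IndexEntry(char_freq.get(word[0], 0)))
        entry.words.append(word)
    for entry in index.values():
        entry.words.sort(key=len)
    return index


def lookup(index, word):
    """Return True if word is present, scanning only its first-character group."""
    if not word or word[0] not in index:
        return False
    for candidate in index[word[0]].words:
        if len(candidate) > len(word):
            break  # entries are sorted by length, so no later entry can match
        if candidate == word:
            return True
    return False


index = build_dictionary(["研究生", "研究", "生命", "起源"], {"研": 120})
print(index["研"].words)       # ['研究', '研究生']
print(lookup(index, "研究生"))  # True
print(lookup(index, "研发"))    # False
```

Sorting each group by length lets a lookup stop as soon as the stored entries become longer than the candidate, which is one plausible reading of why the entries are stored shortest first.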


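4 Conclusion

This paper has presented a hybrid word segmentation algorithm. Four rules were introduced to resolve segmentation ambiguity: when an ambiguity arises during segmentation, three-word chunks are generated and the rules are applied to them. The proposed algorithm has been put into practical use in a full-text retrieval system.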


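References

[1] Chen K.J., Liu S.H. Word identification for Mandarin Chinese sentences. COLING-92.

[2] Huang Changning (黄昌宁) et al. 语言信息处理专论 (Monograph on Language Information Processing). Tsinghua University Press, April 1996.

[3] Yao Tianshun (姚天顺) et al. 基于规则的汉语自动分词系统 (A rule-based automatic Chinese word segmentation system). 中文信息学报 (Journal of Chinese Information Processing), 1990, No. 1.

[4] Wu Shengyuan (吴胜远). 一种汉语分词方法 (A Chinese word segmentation method). 计算机研究与发展 (Journal of Computer Research and Development), 1996, No. 4.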

 


