F.74. zhparser

F.74.1. CONFIGURATION
F.74.2. EXAMPLE
F.74.3. Custom dictionary
F.74.4. Custom dictionary 2.1

zhparser is a LightDB extension for full-text search of Chinese (Mandarin) text. It implements a Chinese-language parser based on the Simple Chinese Word Segmentation (SCWS) library.

F.74.1. CONFIGURATION

These options control dictionary loading and word segmentation behavior. None of them is required, and all default to false (that is, if an option is not set in the configuration file, zhparser behaves as if it were set to false).

Ignore all special symbols such as punctuation: zhparser.punctuation_ignore = f

Aggregate scattered characters automatically using two-character segmentation: zhparser.seg_with_duality = f

Load all dictionaries into memory: zhparser.dict_in_memory = f

Compound short words: zhparser.multi_short = f

Compound scattered characters in pairs: zhparser.multi_duality = f

Compound important single characters: zhparser.multi_zmain = f

Set zhparser non-stopwords: see lightdb_tsearch_non_stopwords for more information
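As noted below, every option except zhparser.extra_dicts and zhparser.dict_in_memory can be toggled per session with SET. A minimal sketch (it assumes the testzhcfg configuration created in the EXAMPLE section already exists):

```sql
-- Session-level toggles take effect immediately; no reload is needed.
SET zhparser.punctuation_ignore = on;
SET zhparser.multi_short = on;

-- Compare segmentation with and without short-word compounding.
SELECT to_tsquery('testzhcfg', '保障房资金压力');

SET zhparser.multi_short = off;
SELECT to_tsquery('testzhcfg', '保障房资金压力');
```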

In addition to the dictionaries shipped with zhparser, you can add custom dictionaries, which take precedence over the built-in ones. The files must be stored in the share/postgresql/tsearch_data directory. zhparser determines the dictionary format from the file extension: the .txt extension means the dictionary is in text format, and the .xdb extension means it is in XDB format. Multiple files are separated by commas and listed in descending order of priority, for example:

zhparser.extra_dicts = 'dict_extra.txt,mydict.xdb'

Note: The zhparser.extra_dicts and zhparser.dict_in_memory options must be set before the backend starts (they can be changed in the configuration file and reloaded, after which new connections pick up the change); the other options can be changed at any time within a session.
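Putting the settings above together, a server configuration file might contain entries like the following (parameter names are from this section; the dictionary file names are illustrative):

```
# lightdb.conf -- illustrative zhparser settings
zhparser.extra_dicts = 'dict_extra.txt,mydict.xdb'  # reload + new connection required
zhparser.dict_in_memory = true                      # reload + new connection required
zhparser.punctuation_ignore = true                  # may also be SET per session
```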

F.74.2. EXAMPLE

            -- make test configuration using parser

            CREATE TEXT SEARCH CONFIGURATION testzhcfg (PARSER = zhparser);

            -- add token mapping

            ALTER TEXT SEARCH CONFIGURATION testzhcfg ADD MAPPING FOR n,v,a,i,e,l WITH simple;

            -- ts_parse

            SELECT * FROM ts_parse('zhparser', 'hello world! 2010年保障房建设在全国范围内获全面启动,从中央到地方纷纷加大 了保障房的建设和投入力度 。2011年,保障房进入了更大规模的建设阶段。住房城乡建设部党组书记、部长姜伟新去年底在全国住房城乡建设工作会议上表示,要继续推进保障性安居工程建设。');

            -- test to_tsvector

            SELECT to_tsvector('testzhcfg','“今年保障房新开工数量虽然有所下调,但实际的年度在建规模以及竣工规模会超以往年份,相对应的对资金的需求也会创历>史纪录。”陈国强说。在他看来,与2011年相比,2012年的保障房建设在资金配套上的压力将更为严峻。');

            -- test to_tsquery

            SELECT to_tsquery('testzhcfg', '保障房资金压力');
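The tsquery produced above can be matched against a tsvector with the standard @@ operator; a minimal sketch reusing text from this example:

```sql
-- Match a segmented query against segmented document text.
SELECT to_tsvector('testzhcfg',
                   '2012年的保障房建设在资金配套上的压力将更为严峻')
       @@ to_tsquery('testzhcfg', '保障房资金压力');
```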
			
            -- test lightdb_tsearch_non_stopwords
            
            SET lightdb_tsearch_non_stopwords  = '1';
            ERROR:  invalid value for parameter "lightdb_tsearch_non_stopwords": "1"
            DETAIL:  lightdb_tsearch_non_stopwords not support '1' as non stopword.
            
            SET lightdb_tsearch_non_stopwords  = '=@:_-';
            SELECT to_tsvector('testzhcfg', '浙江省杭州市this is from lt_hs_tab table 滨江区江南大道logged at 2020-12-21 恒生电子 00:22:32, username=zjh&pwd=balabala');
                                                                                                            to_tsvector                                                                                                   
            ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
            '00:22:32':16 '2020-12-21':13 'at':12 'from':5 'is':4 'logged':11 'lt_hs_tab':6 'pwd=balabala':18 'table':7 'this':3 'username=zjh':17 '大道':10 '恒生':14 '杭州市':2 '江南':9 '浙江省':1 '滨江区':8 '电子':15
            (1 row)
            
            SET lightdb_tsearch_non_stopwords  = '';
            SELECT to_tsvector('testzhcfg', 'this is from lt_hs_tab table logged at 2020-12-21 00:22:32, username=zjh&pwd=balabala');
                                                                                    to_tsvector                                                                            
            ------------------------------------------------------------------------------------------------------------------------------------------------------------------
            '00':11 '12':9 '2020':8 '21':10 '22':12 '32':13 'at':7 'balabala':17 'from':3 'is':2 'logged':6 'lt_hs_tab':4 'pwd':16 'table':5 'this':1 'username':14 'zjh':15
            (1 row)
        

F.74.3. Custom dictionary

** The TXT dictionary format is compatible with the text dictionaries used by the SCWS scws-gen-dict command-line tool **

  • One entry per line; comment lines starting with # or a semicolon are ignored and skipped

    Each line consists of 4 fields, in order: "word" (composed of Chinese characters, or no more than 3 letters), "TF", "IDF", and "part of speech". Fields are separated by spaces or tabs; any number of separators may be used, so columns can be aligned for readability

    All fields except "word" may be omitted; if omitted, TF and IDF default to 1.0 and the part of speech to "@"

    TXT dictionaries are loaded dynamically: zhparser monitors the file's modification time and automatically converts the dictionary to an XDB file stored in the system temporary directory. It is therefore recommended that TXT dictionaries not be too large

    To delete a word, set its part of speech to "!"; the word is then treated as invalid even if it exists in other core dictionaries
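A sample text dictionary following the rules above (the file name and all entries are illustrative):

```
# dict_extra.txt -- illustrative entries; only "word" is mandatory
# word        TF     IDF   part of speech
资金压力      10.0   8.0   n
; TF/IDF default to 1.0 and part of speech to @ when omitted:
保障房
; setting part of speech to "!" deletes the word, even from core dictionaries:
测试词        1.0    1.0   !
```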

Note: 1. A custom dictionary can be in TXT format or binary XDB format. The XDB format is more efficient and better suited to large dictionaries. You can use the bundled SCWS tool scws-gen-dict to convert a text dictionary to XDB format. 2. The default zhparser dictionary is Simplified Chinese.

F.74.4. Custom dictionary 2.1

** Custom dictionary 2.1 makes custom dictionaries portable and remains compatible with the features provided by 1.0 **

Custom dictionaries require superuser privileges. They are database-level (not instance-level): each database has its own custom word list, stored under the data directory in base/<database ID> (version 2.0 stored it under share/tsearch_data).

            test=# SELECT * FROM ts_parse('zhparser', '保障房资金压力');
             tokid | token
            -------+-------
               118 | 保障
               110 | 房
               110 | 资金
               110 | 压力

            test=# insert into zhparser.zhprs_custom_word values('资金压力');
            -- to delete a word: insert into zhparser.zhprs_custom_word(word, attr) values('word', '!');
            -- \d zhparser.zhprs_custom_word shows the table structure; TF and IDF fields are also supported
            test=# select sync_zhprs_custom_word();
             sync_zhprs_custom_word
            ------------------------

            (1 row)

            test=# \q -- after syncing, re-establish the connection for the change to take effect
            [lzzhang@lzzhang-pc bin]$ ./psql -U lzzhang -d test -p 1600
            test=# SELECT * FROM ts_parse('zhparser', '保障房资金压力');
             tokid |  token
            -------+----------
               118 | 保障
               110 | 房
               120 | 资金压力
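Deleting a custom word follows the same pattern as adding one; a sketch based on the steps above, reusing the word from this example:

```sql
-- Mark the custom word invalid, then sync.
insert into zhparser.zhprs_custom_word(word, attr) values('资金压力', '!');
select sync_zhprs_custom_word();
-- Reconnect; '资金压力' is then segmented into separate words again.
```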