LinuZau ToolBox Wiki - WalWiki/カスタマイズ/日本語文字列を正しく検索

海、箸等の文字列を正しく検索します。

[index for euc char]

Perlでは海、箸等の特定の文字の検索で問題が生じます。
この問題を修正します。
参考：Perlメモ

euc_index()の追加

# yakty add [index for euc char] start
sub euc_index {
    my ($text, $word, $index) = @_;
    return index($text, $word, $index) if(length($word) > 2); # 高速化
    return -1 if(index($text, $word, $index) == -1);
    my %matchpos;
    my $pos=0;
    foreach ($text =~ /($EUC_CHAR)/go){
        $matchpos{$pos} = 1;
        $pos += length($_);
    }
    while(($pos = index($text, $word, $index)) != -1){
        last if($matchpos{$pos});
        $index = $pos+1;
    }
    return $pos;
}
# yakty add [index for euc char] end

高速化とコメントしてある下記の処理をコメントアウトするとさらに正しく検索を行います。ただし、２バイトより長い文字列がおかしな部分にマッチする可能性は低いので、２バイトより長い文字列を検索する場合には通常のindex()を使用する事で高速化を図っています。

return index($text, $word, $index) if(length($word) > 2); # 高速化

match_text()内の変更

編集

    # 検索
    my $match_text  = ($case_sense) ? $text  : lc($text);
    my @match_links = ($case_sense) ? @links : map { $_ = lc($_); } @links;
    my %match;
    foreach my $word (@words) {
        my $match_word = ($case_sense) ? $word : lc($word);
        #my $pos = index($match_text, $match_word);           # yakty del [index for euc char]
        my $pos = &euc_index($match_text, $match_word);       # yakty add [index for euc char]
        if    ($pos != -1)                                       { $match{$word} = $pos; }
        #elsif (grep {index($_, $match_word) != -1} @match_links) { $match{$word} = -1; }          # yakty del [index for euc char]
        elsif (grep {&euc_index($_, $match_word) != -1} @match_links) { $match{$word} = -1; }      # yakty add [index for euc char]
    }