LinuZau ToolBox Wiki - WalWiki/Customization/Searching Japanese Strings Correctly

Makes searches for strings such as 海 and 箸 return correct results.

* [index for euc char]

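In Perl, searching for certain characters such as 海 and 箸 gives wrong results. index() compares raw bytes, so in EUC-JP text a two-byte character can match at a position that straddles two adjacent characters (the second byte of one character followed by the first byte of the next).
This change fixes the problem.
Reference: [[Perlメモ http://www.din.or.jp/~ohzaki/perl.htm]]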
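
The false hit can be reproduced with a few raw bytes. This is a minimal sketch, assuming the text is held as undecoded EUC-JP bytes, where こ is A4 B3 and 海 is B3 A4; the variable names are illustrative only and do not come from WalWiki.

```perl
use strict;
use warnings;

# Raw EUC-JP bytes, not decoded into Perl characters.
my $text = "\xA4\xB3\xA4\xB3";   # "ここ" (こ is A4 B3 in EUC-JP)
my $word = "\xB3\xA4";           # "海"  (B3 A4 in EUC-JP)

# index() compares bytes, so it reports a hit at byte offset 1:
# the second byte of the first こ plus the first byte of the second こ.
my $pos = index($text, $word);
print "byte-wise index: $pos\n";   # prints 1, although 海 is not in the text
```

The same thing happens with 箸 (C8 A4) inside "とと" (A4 C8 A4 C8), which is why characters whose second byte is A4 are the typical victims.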

** Adding euc_index()

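 # yakty add [index for euc char] start
 # Like index(), but only reports hits that begin on an EUC-JP character boundary.
 # $EUC_CHAR is assumed to be defined elsewhere in the script as a pattern
 # that matches exactly one EUC-JP character (ASCII included).
 sub euc_index {
     my ($text, $word, $index) = @_;
     $index = 0 unless defined $index;   # callers may omit the start offset
     return index($text, $word, $index) if(length($word) > 2); # 高速化 (speed-up)
     return -1 if(index($text, $word, $index) == -1);
     # Record the byte offset at which each character starts.
     my %matchpos;
     my $pos=0;
     foreach ($text =~ /($EUC_CHAR)/go){
         $matchpos{$pos} = 1;
         $pos += length($_);
     }
     # Skip byte-wise hits that do not start on a character boundary.
     while(($pos = index($text, $word, $index)) != -1){
         last if($matchpos{$pos});
         $index = $pos+1;
     }
     return $pos;
 }
 # yakty add [index for euc char] end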
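
The core of the routine is the table of byte offsets at which characters start. The sketch below isolates that step so it can be run on its own. Both the definition of $EUC_CHAR and the helper name char_starts() are assumptions made for illustration; the page does not show how $EUC_CHAR is defined in the real script. It also assumes the input is valid EUC-JP, since stray bytes would be skipped by the pattern and shift the offsets.

```perl
use strict;
use warnings;

# Assumed definition of one EUC-JP character: ASCII, half-width kana (SS2),
# a three-byte JIS X 0212 character (SS3), or a two-byte JIS X 0208 character.
my $EUC_CHAR = qr/[\x00-\x7F]|\x8E[\xA1-\xDF]|\x8F[\xA1-\xFE][\xA1-\xFE]|[\xA1-\xFE][\xA1-\xFE]/;

# Hypothetical helper: returns a hash ref whose keys are the byte offsets
# at which a character begins in $text.
sub char_starts {
    my ($text) = @_;
    my %starts;
    my $pos = 0;
    foreach my $char ($text =~ /($EUC_CHAR)/g) {
        $starts{$pos} = 1;
        $pos += length($char);
    }
    return \%starts;
}

my $starts = char_starts("a\xA4\xB3\xA4\xB3");   # "aここ"
print join(',', sort { $a <=> $b } keys %$starts), "\n";   # 0,1,3
```

A byte-wise hit at offset 2 in this text would be rejected because 2 is not in the table, while a hit at offset 1 or 3 would be accepted.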

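Commenting out the line below, marked 高速化 (speed-up), makes the search stricter still. A string longer than two bytes is unlikely to match at a wrong position, so for such strings euc_index() uses the ordinary index() to save time.
 return index($text, $word, $index) if(length($word) > 2); # 高速化 (speed-up)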
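
The trade-off can be seen with a contrived case. This sketch assumes raw EUC-JP bytes and uses a four-byte search word, which is the case the shortcut hands to plain index(); the variable names are illustrative only.

```perl
use strict;
use warnings;

my $text = "\xA4\xB3\xA4\xB3\xA4\xB3";   # "こここ" in EUC-JP
my $word = "\xB3\xA4\xB3\xA4";           # "海海": four bytes, so length($word) > 2

# With the shortcut enabled, the search is a plain byte-wise index().
my $shortcut_pos = index($text, $word, 0);
print "shortcut result: $shortcut_pos\n";   # 1: a hit that straddles characters
```

Such input is rare in practice, which is the reason given above for keeping the shortcut; commenting it out trades some speed for boundary checks on every search word.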

** Changes inside match_text()

     # Search
     my $match_text  = ($case_sense) ? $text  : lc($text);
     my @match_links = ($case_sense) ? @links : map { $_ = lc($_); } @links;
     my %match;
     foreach my $word (@words) {
         my $match_word = ($case_sense) ? $word : lc($word);
         #my $pos = index($match_text, $match_word);           # yakty del [index for euc char]
         my $pos = &euc_index($match_text, $match_word);       # yakty add [index for euc char]
         if    ($pos != -1)                                       { $match{$word} = $pos; }
         #elsif (grep {index($_, $match_word) != -1} @match_links) { $match{$word} = -1; }          # yakty del [index for euc char]
         elsif (grep {&euc_index($_, $match_word) != -1} @match_links) { $match{$word} = -1; }      # yakty add [index for euc char]
     }
