NAME
Lingua::JA::TermExtractor - Term Extractor
SYNOPSIS
use Lingua::JA::TermExtractor;
use utf8;
use feature qw/say/;
use Data::Printer;
my $extractor = Lingua::JA::TermExtractor->new(
    df_file     => './df.tch', # Please download from http://misc.pawafuru.com/webidf/.
    pos1_filter => [qw/非自立 代名詞 ナイ形容詞語幹 副詞可能 サ変接続/],
    ng_word     => [qw/編集 本人 自身 自分 たち さん/],
);
p $extractor->extract($document)->dump;
p $extractor->extract(\@documents)->dump;
for my $result (@{ $extractor->extract(\@documents)->list(50) })
{
    my ($word, $score) = each %{$result};
    say "$word: $score";
}
DESCRIPTION
Lingua::JA::TermExtractor extracts terms from one or more documents and ranks them by their TF*WebIDF or BM25 scores.
METHODS
new( %config || \%config )
Creates a new Lingua::JA::TermExtractor instance.
The following default configuration is used if you do not set %config.
KEY                 DEFAULT VALUE
-----------------   ---------------
k1                  2.0
b                   0.75
pos1_filter         [qw/非自立 代名詞 ナイ形容詞語幹 副詞可能/]
pos2_filter         []
pos3_filter         []
ng_word             []
term_length_min     2
term_length_max     30
concat_max          30
tf_min              1
df_min              0
df_max              250_0000_0000
fetch_unk_word_df   0
db_auto             1
guess_df            1
idf_type            3
api                 'YahooPremium'
appid               undef
driver              'TokyoCabinet'
df_file             './df.tch'
fetch_df            0
expires_in          365
documents           250_0000_0000
Furl_HTTP           undef
verbose             0
- k1 => $weight
  The weight of term frequency (TF).
- b => $weight
  The weight of document length normalization.
- pos(1|2|3)_filter, ng_word, term_length_(min|max), concat_max, tf_min, df_(min|max), fetch_unk_word_df, db_auto, guess_df
  See Lingua::JA::TFWebIDF.
- idf_type, api, appid, driver, df_file, fetch_df, expires_in, documents, Furl_HTTP, verbose
  See Lingua::JA::WebIDF.
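To illustrate how k1 and b typically enter a BM25 score, here is a minimal pure-Perl sketch of the textbook BM25 term weight. It is not taken from this module's source: the function name bm25_term_score and all of its arguments are hypothetical, and the module's actual formula (including how it derives the IDF component) may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: textbook BM25 weight of one term in one document.
#   $tf       - term frequency in the document
#   $idf      - inverse document frequency of the term (supplied by the caller)
#   $len      - length of the document
#   $avg_len  - average document length in the collection
#   $k1       - plays the role of the k1 config key (default 2.0)
#   $b_weight - plays the role of the b config key (default 0.75)
sub bm25_term_score {
    my ($tf, $idf, $len, $avg_len, $k1, $b_weight) = @_;
    $k1       = 2.0  unless defined $k1;
    $b_weight = 0.75 unless defined $b_weight;

    # b = 0 disables length normalization; b = 1 applies it fully.
    my $norm = 1 - $b_weight + $b_weight * $len / $avg_len;

    # A larger k1 lets repeated occurrences keep raising the score for longer.
    return $idf * $tf * ($k1 + 1) / ($tf + $k1 * $norm);
}

# A document of exactly average length, term seen 3 times, IDF of 1.0:
# 1.0 * 3 * 3 / (3 + 2 * 1) = 1.8
printf "%.3f\n", bm25_term_score(3, 1.0, 100, 100);   # prints 1.800
```

With b set to 0 the document length drops out of the score entirely, which is why b is described as the weight of length normalization.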
extract( $document || \@documents )
Extracts terms from $document or \@documents and ranks them by their TF*WebIDF or BM25 scores.
A single $document is scored with TF*WebIDF; an array reference of documents (\@documents) is scored with BM25.
Word segmentation and POS tagging are performed by MeCab.
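The ranking step can be pictured with a small pure-Perl sketch. It does not call MeCab or this module: rank_terms is a hypothetical helper, and the term counts and IDF values are made-up illustrative numbers. It multiplies each term's TF by its IDF and returns the terms in descending score order as single-pair hash references, the shape iterated over in the SYNOPSIS.

```perl
use strict;
use warnings;

# Hypothetical helper: rank terms by TF * IDF, highest score first.
# Takes two hash references (term => TF, term => IDF) and returns an
# array reference of single-pair hash references ({ term => score }).
sub rank_terms {
    my ($tf_of, $idf_of) = @_;

    my %score;
    $score{$_} = $tf_of->{$_} * $idf_of->{$_} for keys %{$tf_of};

    # Sort by descending score; break ties alphabetically for stable output.
    my @sorted = sort { $score{$b} <=> $score{$a} || $a cmp $b } keys %score;

    return [ map { +{ $_ => $score{$_} } } @sorted ];
}

# Illustrative numbers only, not real document frequencies.
my %tf  = (tokyo => 3,   university => 2,   research => 1);
my %idf = (tokyo => 1.5, university => 4.0, research => 2.0);

for my $result (@{ rank_terms(\%tf, \%idf) }) {
    my ($word, $score) = %{$result};
    print "$word: $score\n";
}
```

Here university ranks first because its higher IDF outweighs the more frequent but more common tokyo.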
tfidf, tf
See Lingua::JA::TFWebIDF.
idf, df, purge, db_open, db_close
See Lingua::JA::WebIDF.
AUTHOR
pawa <pawapawa@cpan.org>
SEE ALSO
Lingua::JA::WebIDF::Driver::TokyoTyrant
LICENSE
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.