NAME
Text::SenseClusters::Wikipedia::LabelEvaluation - Module for evaluation of labels of the clusters.
SYNOPSIS
The following code snippet evaluates the labels by comparing them with text data for a gold-standard key from Wikipedia.
# Including the LabelEvaluation Module.
use Text::SenseClusters::Wikipedia::LabelEvaluation;
# Including the FileHandle module.
use FileHandle;
# File that will contain the label information.
my $labelFileName = "temp_label.txt";
# Defining the file handle for the label file.
our $labelFileHandle = FileHandle->new(">$labelFileName");
# Writing into the label file.
print $labelFileHandle "Cluster 0 (Descriptive): George Bush, Al Gore, White House,".
" COMMENTARY k, Cox News, George W, BRITAIN London, U S, ".
"Prime Minister, New York \n\n";
print $labelFileHandle "Cluster 0 (Discriminating): George Bush, COMMENTARY k, Cox ".
"News, BRITAIN London \n\n";
print $labelFileHandle "Cluster 1 (Descriptive): U S, Al Gore, White House, more than,".
"George W, York Times, New York, Prime Minister, President ".
"<head>B_T</head>, the the \n\n";
print $labelFileHandle "Cluster 1 (Discriminating): more than, York Times, President ".
"<head>B_T</head>, the the \n";
# File that will contain the topic information.
my $topicFileName = "temp_topic.txt";
# Defining the file handle for the topic file.
our $topicFileHandle = FileHandle->new(">$topicFileName");
# Writing the comma-separated topics into the topic file.
print $topicFileHandle "Bill Clinton , Tony Blair \n";
# Closing the handles.
close($labelFileHandle);
close($topicFileHandle);
# Options to pass to the LabelEvaluation module.
my %inputOptions = (
labelFile => $labelFileName,
labelKeyFile => $topicFileName
);
# Calling the LabelEvaluation module with a reference to the options hash,
# which names the label and topic files.
my $score = Text::SenseClusters::Wikipedia::LabelEvaluation->
new (\%inputOptions);
# Printing the score.
print "\nScore of label evaluation is :: $score \n";
# Deleting the temporary label and topic files.
unlink $labelFileName or warn "Could not unlink $labelFileName: $!";
unlink $topicFileName or warn "Could not unlink $topicFileName: $!";
DESCRIPTION
This program compares the results obtained from SenseClusters with gold
standards. The gold standards are obtained from two independent and
reliable sources:
1. Wikipedia
2. WordNet
To fetch the Wikipedia data it uses the WWW::Wikipedia module from CPAN,
and to compare the labels with the gold standards it uses the Text::Similarity
module. The comparison result is then processed further to obtain the final
result and its score.
Result:
a) Decision Matrix:
Based on the similarity comparison of the labels with the gold standards,
the decision matrix is calculated as below:
For example:
===========================================================================
            |  Cluster0  |  Cluster1  |  Row Total
---------------------------------------------------------------------------
Topic#1     |    271     |    2713    |    2984
---------------------------------------------------------------------------
Topic#2     |    2396    |    306     |    2702
---------------------------------------------------------------------------
Col Total   |    2667    |    3019    |    5686
===========================================================================
b) Calculated decision matrix:
Based on the decision matrix, a new calculated matrix is printed.
Each cell in this matrix contains the probability value:
CELL_VALUE_IN_DECISION_MATRIX / TOTAL_SCORE_OF_DECISION_MATRIX
For example:
For cell: Cluster0 - Topic#1
i) Value = 271 / 5686 = 0.048
Based on the above decision matrix, the new calculated matrix is:
========================================================================
            |  Cluster0  |  Cluster1
------------------------------------------------------------------------
Topic#1     |   0.048    |   0.477
------------------------------------------------------------------------
Topic#2     |   0.421    |   0.054
------------------------------------------------------------------------
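The division described above can be sketched in a few lines of Perl. This is only an illustration of the arithmetic, not the module's own code; the hash layout `$matrix{topic}{cluster}` and the variable names are assumptions made for the example.

```perl
use strict;
use warnings;

# Decision matrix from the example above: $matrix{topic}{cluster} = raw score.
# (This layout is an assumption made for illustration.)
my %matrix = (
    'Topic#1' => { Cluster0 => 271,  Cluster1 => 2713 },
    'Topic#2' => { Cluster0 => 2396, Cluster1 => 306  },
);

# Sum every cell to get the total score of the decision matrix.
my $total = 0;
for my $topic (keys %matrix) {
    $total += $_ for values %{ $matrix{$topic} };
}

# Divide each cell by the total to get its probability value.
my %probability;
for my $topic (sort keys %matrix) {
    for my $cluster (sort keys %{ $matrix{$topic} }) {
        $probability{$topic}{$cluster} = $matrix{$topic}{$cluster} / $total;
        printf "%-8s %-9s %.3f\n", $topic, $cluster, $probability{$topic}{$cluster};
    }
}
```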
c) Interpreting the calculated decision matrix:
1. Row-wise comparison
For each topic, the "row scores" are compared and the cluster with the
maximum value is assigned to that topic.
For example:
a) Topic#1 Cluster1 (max-row-score = 0.477)
b) Topic#2 Cluster0 (max-row-score = 0.421)
2. Column-wise comparison
For each cluster, the "col scores" are compared and the topic with the
maximum value is assigned to that cluster.
For example:
a) Cluster0 Topic#2 (max-col-score = 0.421)
b) Cluster1 Topic#1 (max-col-score = 0.477)
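The two comparisons can be sketched as follows. This is a minimal illustration under assumptions: the `%prob` layout and the variable names are hypothetical, and the handling of ties (which this document does not describe) is simply whatever the code below happens to do.

```perl
use strict;
use warnings;

# Calculated matrix from the example above: $prob{topic}{cluster}.
my %prob = (
    'Topic#1' => { Cluster0 => 0.048, Cluster1 => 0.477 },
    'Topic#2' => { Cluster0 => 0.421, Cluster1 => 0.054 },
);

# Row-wise: for each topic, keep the cluster with the highest value.
my %bestClusterForTopic;
for my $topic (keys %prob) {
    my ($best) = sort { $prob{$topic}{$b} <=> $prob{$topic}{$a} }
                 keys %{ $prob{$topic} };
    $bestClusterForTopic{$topic} = $best;
}

# Column-wise: for each cluster, keep the topic with the highest value.
my %bestTopicForCluster;
for my $topic (keys %prob) {
    for my $cluster (keys %{ $prob{$topic} }) {
        my $current = $bestTopicForCluster{$cluster};
        if (!defined $current
            || $prob{$topic}{$cluster} > $prob{$current}{$cluster}) {
            $bestTopicForCluster{$cluster} = $topic;
        }
    }
}

print "$_ -> $bestClusterForTopic{$_}\n" for sort keys %bestClusterForTopic;
print "$_ -> $bestTopicForCluster{$_}\n" for sort keys %bestTopicForCluster;
```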
d) Deriving the final conclusion from the above two comparisons:
The results of the row-wise and column-wise comparisons are matched
against each other. Only the matching results are then printed.
For example:
1. Row-wise comparison
a) Topic#1 Cluster1
b) Topic#2 Cluster0
2. Column-wise comparison
a) Cluster0 Topic#2
b) Cluster1 Topic#1
Matching result:
Cluster0 Topic#2
Cluster1 Topic#1
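One way to picture the matching step is to keep a cluster and topic pair only when both comparisons agree on it. The sketch below assumes the two results are held in plain hashes with hypothetical names; it is not taken from the module's source.

```perl
use strict;
use warnings;

# Results of the two comparisons from the example above.
my %bestClusterForTopic = ('Topic#1' => 'Cluster1', 'Topic#2' => 'Cluster0');
my %bestTopicForCluster = (Cluster0 => 'Topic#2', Cluster1 => 'Topic#1');

# Keep a cluster/topic pair only when both comparisons agree on it.
my %matched;
for my $cluster (sort keys %bestTopicForCluster) {
    my $topic = $bestTopicForCluster{$cluster};
    if (defined $bestClusterForTopic{$topic}
        && $bestClusterForTopic{$topic} eq $cluster) {
        $matched{$cluster} = $topic;
    }
}

print "$_ $matched{$_}\n" for sort keys %matched;
```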
e) Overall score:
This is the product of the probability scores of all the matching
cluster and topic pairs.
For example:
The score for the above example is: 0.421 * 0.477 = 0.201
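The product can be checked with a short sketch. The hash names are assumptions made for illustration, and the sketch says nothing about how the module scores the case where no pair matches, since this document does not state it.

```perl
use strict;
use warnings;

# Calculated matrix and matching pairs from the example above.
my %prob = (
    'Topic#1' => { Cluster0 => 0.048, Cluster1 => 0.477 },
    'Topic#2' => { Cluster0 => 0.421, Cluster1 => 0.054 },
);
my %matched = (Cluster0 => 'Topic#2', Cluster1 => 'Topic#1');

# Multiply the probability scores of every matching cluster/topic pair.
my $score = 1;
$score *= $prob{ $matched{$_} }{$_} for keys %matched;

printf "Overall score: %.3f\n", $score;
```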
BUGS
Label and topic values can currently be supplied only through files; the module should also accept them as string values.
Comparison against WordNet gold standards is not currently supported.
SEE ALSO
http://senseclusters.cvs.sourceforge.net/viewvc/senseclusters/LabelEvaluation/
@Last modified by : Anand Jha
@Last_Modified_Date : 30th Nov. 2012
@Modified Version : 0.04
AUTHORS
Ted Pedersen, University of Minnesota, Duluth
tpederse at d.umn.edu
Anand Jha, University of Minnesota, Duluth
jhaxx030 at d.umn.edu
COPYRIGHT AND LICENSE
Copyright (C), Ted Pedersen, Anand Jha
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
Help --------------------
The LabelEvaluation module expects an options hash as its required argument.
The options hash has the following elements:
1. labelFile:
Name of the file containing the labels from SenseClusters. The syntax of the file
must match that of a SenseClusters label file. This option is mandatory.
2. labelKeyFile:
Name of the file containing the comma-separated actual topics (keys) for the
clusters. This option is mandatory.
3. labelKeyLength:
This parameter sets the length of the data to be fetched from Wikipedia,
which is used as the reference data. The default is the first section of the
Wikipedia page.
4. weightRatio:
This ratio sets how much weight to give the discriminating labels relative to
the descriptive labels. The default value is 10.
5. stopList:
Name of the file containing the list of all stop words. This parameter is
optional.
6. isClean:
This option controls whether temporary files are kept. The default value is
true.
7. verbose:
This option enables detailed output. The default value is false.
8. help:
This option shows details about running this module. This parameter is
optional.
%inputOptions = (
labelFile => '<filelocation>/<SenseClusterLabelFileName>',
labelKeyFile => '<filelocation>/<ActualTopicName>',
labelKeyLength => '<LengthOfDataFetchedFromWikipedia>',
weightRatio => '<WeightageRatioOfDiscriminatingToDescriptiveLabel>',
stopList => '<filelocation>/<StopListFileLocation>',
isClean => 1,
verbose => 1,
help => 'help'
);
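As a rough illustration of how the mandatory options and documented defaults fit together, the sketch below uses a hypothetical helper named `withDefaults`. It is not part of the module, and the module may validate its options differently; only the default values (weightRatio 10, isClean true, verbose false) come from the list above.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the module: checks the two mandatory
# options and fills in the defaults documented above.
sub withDefaults {
    my ($options) = @_;
    for my $required (qw(labelFile labelKeyFile)) {
        die "Missing mandatory option: $required\n"
            unless defined $options->{$required};
    }
    my %defaults = (weightRatio => 10, isClean => 1, verbose => 0);
    # Caller-supplied values override the defaults.
    return { %defaults, %$options };
}

my $options = withDefaults({
    labelFile    => 'temp_label.txt',
    labelKeyFile => 'temp_topic.txt',
    verbose      => 1,
});

print "$_ => $options->{$_}\n" for sort keys %$options;
```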
function: makeDecisionOfSense
This function will do the evaluation of labels.
@argument1 : LabelSenseClusters DataType(Reference to HashOfHash)
@argument2 : StandardReferenceName: DataType(String)
Name of the external application.
Currently, its two possible values are:
1. Wikipedia
2. WordNet
@argument3 : StandardTerms: DataType(String)
Terms(comma separated) to be sent to Wikipedia or Wordnet for
getting the Gold Standard Labels.
@return : Score : DataType(Float)
Indicates the measure of overlap of current label mechanisms
with the Gold Standard Labels.
@description :
1). It goes through the hash which contains the clusters and label terms.
2). Each cluster's label terms are written to a file whose name is the
same as the cluster name (or number).
3). Then it goes through the standard terms against which the cluster
labels have to be compared.
4). For each term, a file named after the term is created; its content is
the data fetched from Wikipedia for that topic.
5). Then the cluster's data and the topic's data are compared using the
method from Text::Similarity::Overlaps.
6). Finally, the calculated scores are used to build the decision matrix
and to obtain the final score value.
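The first argument is a reference to a hash of hashes holding the clusters and their label terms. The sketch below shows one way such a structure could be built from label-file lines in the format used in the SYNOPSIS. The function name `parseLabelLines` and the layout `$labels->{cluster}{type} = [terms]` are assumptions made for illustration; the module's internal layout may differ.

```perl
use strict;
use warnings;

# Hypothetical parser: turns label-file lines into
# $labels->{ClusterN}{Descriptive|Discriminating} = [ terms ].
sub parseLabelLines {
    my @lines = @_;
    my %labels;
    for my $line (@lines) {
        # Expected shape: "Cluster 0 (Descriptive): term, term, ..."
        next unless $line =~ /^Cluster\s+(\d+)\s+\((\w+)\):\s*(.*?)\s*$/;
        my ($cluster, $type, $terms) = ("Cluster$1", $2, $3);
        $labels{$cluster}{$type} = [ grep { length } split /\s*,\s*/, $terms ];
    }
    return \%labels;
}

my $labels = parseLabelLines(
    "Cluster 0 (Descriptive): George Bush, Al Gore, White House \n",
    "Cluster 0 (Discriminating): George Bush, BRITAIN London \n",
);

print join('|', @{ $labels->{Cluster0}{Descriptive} }), "\n";
```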
function: printDecisionMatrix
This function is responsible for printing the decision matrix.
@argument1 : clusterNameArrayRef: DataType(Reference_Of_Array)
Reference to Array containing Cluster Name.
@argument2 : standardTermsArrayRef: DataType(Reference_Of_Array)
Reference to Array containing Standard terms.
@argument3 : hashForClusterTopicScoreRef: DataType(Reference_Of_Hash)
Reference to hash containing Cluster Name, corresponding
StandardTopic and its score.
@return1 : topicTotalSumHash: DataType(Reference_Of_Hash)
Hash which contains the total score for a topic
against each cluster.
@return2 : clusterTotalSumHash: DataType(Reference_Of_Hash)
Hash which contains the total score for a cluster
against each topic.
@description :
1). It goes through the hash which contains the similarity score for
each cluster against the standard label terms.
2). It uses the above hash to print the decision matrix. An example of
the decision matrix is given below.
3). It also uses the scoring hash to build new hashes which store
a) the total score for a cluster against each topic.
b) the total score for a topic against each cluster.
Example of decision matrix
==============================================================================
               |  Cluster0   |  Cluster1   |
------------------------------------------------------------------------------
Bill Clinton:  |     11      |     12      |  23 (ROW TOTAL)
------------------------------------------------------------------------------
Tony Blair:    |     15      |      9      |  24 (ROW TOTAL)
------------------------------------------------------------------------------
Total          |     26      |     21      |  47
                 (COL TOTAL)   (COL TOTAL)    (Total Matrix Sum)
Where, 1) Cluster0, Cluster1 are cluster names.
2) Bill Clinton, Tony Blair are standard topics.
3) 23, 24 are the row totals of the topic scores. (ROW TOTAL)
4) 26, 21 are the column totals of the cluster scores. (COL TOTAL)
5) 47 is the total sum of the scores of all clusters against all topics.
(Total Matrix Sum)
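The row totals, column totals and total matrix sum in this example can be reproduced with the sketch below. The input layout `$score{cluster}{topic}` is an assumption made for illustration; only the names of the two result hashes follow the return values documented above.

```perl
use strict;
use warnings;

# Scores from the example above: $score{cluster}{topic}.
# (This layout is an assumption made for illustration.)
my %score = (
    Cluster0 => { 'Bill Clinton' => 11, 'Tony Blair' => 15 },
    Cluster1 => { 'Bill Clinton' => 12, 'Tony Blair' => 9  },
);

my (%topicTotalSum, %clusterTotalSum);
my $matrixSum = 0;
for my $cluster (keys %score) {
    for my $topic (keys %{ $score{$cluster} }) {
        my $value = $score{$cluster}{$topic};
        $topicTotalSum{$topic}     += $value;   # row total
        $clusterTotalSum{$cluster} += $value;   # column total
        $matrixSum                 += $value;   # total matrix sum
    }
}

print "$_: $topicTotalSum{$_} (ROW TOTAL)\n"   for sort keys %topicTotalSum;
print "$_: $clusterTotalSum{$_} (COL TOTAL)\n" for sort keys %clusterTotalSum;
print "Total Matrix Sum: $matrixSum\n";
```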