Notes on Reading the MUSE Project
Table of Contents
1 Muse_Python
The muse_python part is an interface that launches the Stanford CoreNLP package to perform tokenization, coreference resolution, etc. This package provides two ways of invoking the parser: 1) an interactive style provided by the StanfordCoreNLP class, 2) batch_parse.
This package uses pexpect to drive corenlp as an interactive command line. Besides, the author likes json very much. I also found the package xmltodict (handles xml like json). Below, I list all the standard python packages and site-packages used in this project.
- standard
- json
- os, re, sys, shutil, tempfile, collections
- traceback used in verbose mode
- subprocess calls commands to process the xml output file, bypassing the command-line interface limit
- site-package
- pexpect runs an application under an interactive command line
- xmltodict parse xml to ordered dict
- progressbar show progressbar in verbose mode
- unidecode transliterates unicode text to ASCII
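As a rough illustration of what xmltodict provides, the same XML-to-nested-dict conversion can be approximated with the standard library; xml_to_dict here is a hypothetical helper written for this note, not part of the package:

```python
import xml.etree.ElementTree as ET

def xml_to_dict(elem):
    """Recursively convert an ElementTree element into nested dicts,
    mimicking what xmltodict.parse() produces for simple documents."""
    children = list(elem)
    if not children:
        return elem.text
    result = {}
    for child in children:
        value = xml_to_dict(child)
        if child.tag in result:  # repeated tags become a list
            existing = result[child.tag]
            if not isinstance(existing, list):
                result[child.tag] = [existing]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result

doc = ET.fromstring("<doc><sentence>Hello</sentence><sentence>World</sentence></doc>")
print(xml_to_dict(doc))  # {'sentence': ['Hello', 'World']}
```

This is why the notes say xmltodict lets you "handle xml like json": the parse tree becomes plain dicts and lists.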
From the default.properties file, the Stanford corenlp package executes the tokenization, sentence splitting, POS tagging, lemmatization, NER, parsing, and coreference resolution tasks. There are also many commented-out options.
The class StanfordCoreNLP works in a server mode along with pexpect? I am not sure.
Besides the corenlp module, there is a muse module in this package. It contains:
- io read CoNLL and LiirCoref data
- lm supposed to be an interface to work with the RNNLM, not completed yet?
- modules
- represe representation of texts, sentences, etc. in python data structures
- resource Read SemLink data. VerbClassManager cooperates with nltk.corpus.VerbnetCorpusReader. What is SemLink? What exactly does the VerbClassManager work for?
- spliter Sentence splitter, tokenizer
- srl semiSupervised?
- taggers SimplePOSTagger
2 Terence and Lindo
2.1 LINDO
- be.liir.TimeML* as well as be.liir.TempEval2013* are created for specific tasks
- be.lindo.api* should be used as api for normal usage
- be.lindo.experiments* maybe for experiment settings?
- be.lindo.messages* contains the classes used to model from document to token, from annotation to parse.
- be.lindo.objects, be.lindo.test, be.lindo.utils
2.1.1 be.lindo.api*
2.1.1.1 be.lindo.api.ml
Classifier and Features
- AbstractClassifier
- String classify(Parse, SmartFeatureGenerator)
- getPositiveClassProb(Parse, SmartFeatureGenerator, String posClassLabel)
- evaluate: recognize timexes from a Statement and evaluate the results. The result is a list of Parse.
- detect: detect if a parse corresponds to a timex.
- SmartFeatureGenerator
- float[] transformFeatures(String[], String[]) string to float: the second string array receives, for each string in the first array, the substring before the last "=" (the featureValues); the float value is parsed from the substring after the last "=".
- String[] getFeatures(Parse phrase, Statement context)
- getWordBasedFeatures
get the most probable temporal token in the phrase; add its lemma, word, and POS. Special attention is paid to cardinal numbers (CD). Also the POS list of the phrase, the phrase pattern, etc.
- getCharacterPattern: "Word 32" => "Xxxx_99"; the compact form is "Xx_9". Symbols "./" are replaced.
- getTokenList: get a list of Strings; numbers are converted into patterns.
- getAllTokens: concatenate strings by whitespace.
- getSimpleEnding: end characters.
- getGoldLabel: get MlLabel of token
- getToken: get next token in parse.
- getParseBasedFeatures
- getWordBasedFeatures (same as described above)
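A minimal Python sketch of two of the helpers described above, transformFeatures and getCharacterPattern; the function bodies are my guesses from the notes (in particular, I assume the compact form simply collapses runs of repeated pattern characters), not the original Java:

```python
import re

def transform_features(features):
    """Split "name=value" feature strings into names and float values,
    splitting at the LAST '=' as the notes describe."""
    names, values = [], []
    for f in features:
        name, _, value = f.rpartition("=")
        names.append(name)
        values.append(float(value))
    return names, values

def character_pattern(text, compact=False):
    """'Word 32' -> 'Xxxx_99'; compact form collapses runs: 'Xx_9'."""
    out = []
    for ch in text:
        if ch.isupper():
            out.append("X")
        elif ch.islower():
            out.append("x")
        elif ch.isdigit():
            out.append("9")
        elif ch.isspace():
            out.append("_")
        else:
            out.append(ch)  # symbol handling ("./") is not modeled here
    pattern = "".join(out)
    if compact:
        pattern = re.sub(r"(.)\1+", r"\1", pattern)  # collapse repeated chars
    return pattern

print(transform_features(["word=1.0", "pos=NN=0.5"]))  # (['word', 'pos=NN'], [1.0, 0.5])
print(character_pattern("Word 32"))                    # Xxxx_99
print(character_pattern("Word 32", compact=True))      # Xx_9
```

Splitting at the last "=" matters because feature names themselves may contain "=" (e.g. "pos=NN=0.5").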
2.1.1.2 be.lindo.api.timex
2.1.1.3 be.lindo.api.tlnk
2.1.1.4 be.lindo.api*
2.1.2 be.liir.TempEval2013 & be.liir.TimeML
2.1.2.1 be.liir.TempEval2013
- event
- EventDetector String classify(Token, FeatureGenerator): gets features from the feature generator using the token and its context, transforms the features into a float array and new strings via the feature generator, then uses the model to evaluate the new strings and float values, obtaining a list of results from which the best one is picked.
- FeatureGenerator extends SmartFeatureGenerator
Provides many functions, but keeps the transform function the same as in the parent class.
Adds word, POS, CPOS, end2, end3, plus:
- hypernyms
- synonyms
- derivations
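The classify flow of EventDetector can be sketched as below, with a stand-in for the maxent model evaluation (the real GISModel and feature set are not reproduced; classify, fake_eval, and the outcome labels are illustrative names):

```python
def classify(token_features, model_eval, outcomes):
    """Schematic of EventDetector.classify: evaluate the feature values
    with the model and return the outcome with the highest probability.
    model_eval stands in for opennlp.maxent.GISModel.eval()."""
    probs = model_eval(token_features)
    best = max(range(len(outcomes)), key=lambda i: probs[i])
    return outcomes[best]

# stand-in model: pretend 'EVENT' gets high probability for verb tokens
fake_eval = lambda feats: [0.8, 0.2] if "pos=VB" in feats else [0.1, 0.9]

print(classify(["word=run", "pos=VB"], fake_eval, ["EVENT", "O"]))  # EVENT
print(classify(["word=cat", "pos=NN"], fake_eval, ["EVENT", "O"]))  # O
```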
2.2 Dependency of terence to lindo
2.2.1 All referenced classes
import be.lindo.api.corpus.CorpusReader;
import be.lindo.api.ml.Outcomes;
import be.lindo.api.ml.dependency.features.CPOSTag;
import be.lindo.api.ml.dependency.features.Form;
import be.lindo.api.ml.dependency.features.Lemma;
import be.lindo.api.ml.dependency.features.POSTag;
import be.lindo.api.ml.dependency.malt.MaltDataWriter;
import be.lindo.api.nlp.adapters.OpenNLPFactory;
import be.lindo.api.nlp.adapters.WebNLPFactoryLoader;
import be.lindo.api.timex.model.Calendar;
import be.lindo.api.tlink.EventStemFrequency;
import be.lindo.api.util.Const;
import be.lindo.api.util.EditDistance;
import be.lindo.api.util.WebConst;
import be.lindo.api.util.WordNet;
import be.lindo.experiments.recognition.Classifier;
import be.lindo.experiments.tlinks.fables.FablesFeatureGenerator;
import be.lindo.experiments.tlinks.fables.parse.MaltFeatureGenerator;
import be.lindo.experiments.tlinks.fables.parse.MaltParser;
import be.lindo.messages.AnnotatedElement;
import be.lindo.messages.Event;
import be.lindo.messages.Parse;
import be.lindo.messages.Statement;
import be.lindo.messages.TLink;
import be.lindo.messages.TimeX;
import be.lindo.messages.Token;
import be.lindo.messages.malt.Node;
import be.lindo.utils.Container;
import be.lindo.utils.LINDOConsts;
import be.lindo.utils.TimexUtils;
Lindo is designed for question answering.
2.2.1.1 Package messages
- AnnotatedElement extends Statement implements Comparable
- NamedNodeMap attributes// dom name (String) to nodes
- Node annotatingNode // dom xml node
- String id; String fileName; int startIndex // index of the first token in the sentence
- HashMap<String, String> // two maps from attribute name to value and vice versa.
- ArrayList<String> // annotationLines, annotationAttributeLines
- ArrayList<Integer> tokenNums
- Event extends AnnotatedElement
- Timex extends AnnotatedElement
- Type: date, duration, time, set
- Role: creation_time, modification_time, etc.
- Value: duration, weekdate, season, etc.
- String[] features
- Calendar // defines seasons, quarters, etc.; parses a time and gets the corresponding equivalent time expressions, e.g. date -> day of week -> season.
- TLink
- type
- ids of entities and link
- Parse
- Parse: parent, head
- List<Parse>: parts
- Span, String: type, List<Token> tokens.
- Statement extends Question
- Question implements Serializable, Cloneable
- String //string representation, etc., List<Token>, Parse //the syntactic parse, List<String> chunks
- ArrayList<ArrayList<SemanticLabel>>
- String normlizedQuestion, originalQuestion.
- ArrayList<AnnotatedElement> annotatedEvent, recognizedEvent, annotated timex, recognized timex, annotatedSignals
- ArrayList<ArrayList<Token>> ngrams
- String sourceFile, dct.
- ArrayList<String> timex, ArrayList<Parse> candidates;
- Token implements Serializable
- index in sentence, String annotationLine?, pos, int (start,
end), String convertedString, some boolean for syntactic
states, Span, String chunkTag. Statement //sentence. String[]
features.
- HashMap<Integer, ArrayList<RuleResults>> ?. The integer means the level
- HashMap<Sematic,String> different labels in different contexts?
- HashMap<Enum, ArrayList<Object>> map of TimeML label -> BIO labels? In Question, only the first element is treated, with a plan to warn if there is more than one label. Twisted design. The second (why second?) Object is converted into a NamedNodeMap.
- ArrayList<Container<String, Double>> contextSynonyms. Synonyms and probabilities
- ArrayList<Container<String, Double>> contextIndependentSynonyms
- Parse parentParse ?
- Span int (start, end), String merge?
- SemanticLabel String role; String chunk; ArrayList<Token> tokens;
- malt.Node
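The Calendar class noted above maps a concrete time to coarser equivalent expressions (date -> day of week -> season). A hypothetical Python re-implementation of that idea, assuming meteorological seasons and names of my own choosing:

```python
from datetime import date

# month -> season, assuming Northern-hemisphere meteorological seasons
SEASONS = {12: "winter", 1: "winter", 2: "winter",
           3: "spring", 4: "spring", 5: "spring",
           6: "summer", 7: "summer", 8: "summer",
           9: "autumn", 10: "autumn", 11: "autumn"}

def equivalent_expressions(d):
    """Map a date to equivalent coarser time expressions:
    day of week, quarter, and season."""
    return {
        "weekday": d.strftime("%A"),
        "quarter": "Q%d" % ((d.month - 1) // 3 + 1),
        "season": SEASONS[d.month],
    }

print(equivalent_expressions(date(2013, 7, 4)))
# {'weekday': 'Thursday', 'quarter': 'Q3', 'season': 'summer'}
```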
2.2.1.2 Packages api
- CorpusReader, read/get files under directory
- ml.Outcomes; confusion matrix?
- ml.dependency.features.CPOSTag; return concise POS of the head token
- ml.dependency.features.Form; get token from Token/AnnotatedElement(Quasi head token, which is the last token)
- ml.dependency.features.Lemma; Lemma of token returned by Form
- ml.dependency.features.POSTag; POS of token returned by Form
- ml.dependency.malt.MaltDataWriter;
- nlp.adapters.OpenNLPFactory; OpenNLP processors
- nlp.adapters.WebNLPFactoryLoader; wrapper for web.
- timex.model.Calendar; time converter
- tlink.EventStemFrequency; load frequency from file
- util.Const;
- util.EditDistance;
- util.WebConst;
- util.WordNet;
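util.EditDistance presumably computes something like Levenshtein distance; a standard dynamic-programming sketch of that algorithm (not the lindo implementation):

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b, using a rolling
    one-row dynamic-programming table."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(edit_distance("kitten", "sitting"))  # 3
```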
2.2.1.3 Packages experiment
- recognition.Classifier;
GISModel…
List<Parse> = AbstractClassifier.simpleEvaluate(Question, classifier,
outcomes, false, SmartFeatureGenerator)
- AbstractClassifier.simpleDetect *to do, very long function*
- tlinks.fables.FablesFeatureGenerator; generate features for tlinks
- tlinks.fables.parse.MaltFeatureGenerator; ???
- tlinks.fables.parse.MaltParser; invoke malt parser (dependency parser)
2.2.2 Terence references
Start at be.liir.nlptools.adapters.TAFAdapter. Steps:
- init (protected)
- processStory(string)
- preProcess(string) for pg // replace ';' by '.'
- processSyntax(string)
- Document result = processSemantics(string)
for each sentence
- timexAdapter
- UMLSAdapter.detectMedicalConcepts
- eventAdapter
- getEventParticipants (tokens, empty map, empty list, empty map) remove events with participants, but if the map is empty, do nothing!
- participantProcess will create HasParticipant, but does nothing when previous step does nothing.
- …
- eventProcess, construct the maps and collections between/of tokens and events
- clinkAdapter
- getCauseLinksWithSUBJ// only process HasParticipant objects.
- getCauseLinks// HasParticipant with diff constraints
- getCauseLinksNoSUBJ// use word match and dependency rules(hard coded)
- getCauseLinksbyCausePhrases// use word match and dependency rules(hard coded)
- postProcess
- getTLinks if !pg //
- factual events
- parse dependency for factual events // strange
- tLinkPostProcess(tlinks, jaoContent) editTLinks(tlinks, clinks) // use cause links to calibrate BEFORE, AFTER tlink relations.
- postProcess(Document result)
- return result
- createNewDocument(string)
- new timexAdapter, clinkAdapter
- createSentences
- for each sentence
- Statement.processIt // data structure from lindo doTokenize, doTag, doParse, not doSemanticParse, not isAtimex
- SPreprocAdapter.replaceTokens replace string of token for parentheses (e.g. -LRB- to "(")
- timexAdapter.getTimexes return List<Parse> by classifier.simpleDetect(Statement, classifier, new Outcomes(new String[]{"0", "1"})). detect parse, sub-parse.
- convert to new data structure (JAO tokens)
- eventAdapter.annotate use EventDetector from lindo. opennlp.maxent.GISModel classifies vectors generated by FeatureGenerator; classifies each token.
- parseDependency (for relations) Relation here represents the dependency relation between tokens.
- getEventParticipants filter sub-obj relations, return such kind of relations associated to events.
- participantProcess get the participant tokens (JAO)
- eventProcess creating maps and list for tokens and events.
- clinkAdapter.getLinks // Cause Link
- getCauseLinksWithSUBJ
- getCauseLinks
- getCauseLinksNoSUBJ
- getCauseLinksbyCausePhrases
- postProcess
- tLinkPostProcess // tlinks is set to null before this function is invoked, so it would do nothing! Uses cLinks.
- Link, source, targets, label, id, innerId. Calls getLinksWith twice to get all links that contain the current link's target and source (the two can be swapped).
- Optional commented code getTLinks
- remove modal and negated events
- parse dependency for the filtered event token sequence? strange.
- create tlink for each dependency relation
- getCoferClusters
- postProcess // merge duplicated entityMentions
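The processStory control flow outlined above can be sketched as follows; everything except preProcess (which the notes say replaces ';' by '.') is a stub standing in for the real adapters:

```python
def pre_process(text):
    """TAFAdapter.preProcess for pg: replace ';' by '.'."""
    return text.replace(";", ".")

def process_story(text):
    """Schematic of the TAFAdapter.processStory flow: preprocess,
    syntactic processing, semantic processing, postprocess."""
    text = pre_process(text)
    syntax = process_syntax(text)          # tokenize / tag / parse
    document = process_semantics(syntax)   # timex, events, clinks, tlinks
    return post_process(document)

# stubs standing in for the real processing steps
process_syntax = lambda t: t.split(".")
process_semantics = lambda sents: {"sentences": [s.strip() for s in sents if s.strip()]}
post_process = lambda doc: doc

print(process_story("The wolf ran; the lamb hid."))
# {'sentences': ['The wolf ran', 'the lamb hid']}
```

The point of the sketch is only the ordering of the four phases; each stub would be an adapter (timexAdapter, eventAdapter, clinkAdapter, ...) in the real code.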
- preprocessing: srl, coref, action (server service)
task definition:
- Timex3
- Relation
- Event
- Document
feature: model: train: test: