RUPEE Reconstruction Notes
1 TODO Python [2/6]
[X] Create a corpus set view over multiple corpora.
I define a corpus set as one of the original corpus divisions provided by the task organizers, distinguished by the year of the task and by its purpose (training, development, or test). However, to meet the needs of algorithm development, or to train a stronger classifier for participating in the challenge, one usually merges several of these corpora. So I created a corpus set view that works as a wrapper around the original files and preprocessing results, so that only one copy of these files is kept on disk.
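A minimal sketch of such a view, assuming each corpus set is just a year, a split name, and a tuple of file paths. The names CorpusSet, CorpusSetView, and the example file names are hypothetical and are not taken from the RUPEE code.

```python
from dataclasses import dataclass
from itertools import chain
from typing import Iterator, Tuple


@dataclass(frozen=True)
class CorpusSet:
    """One original corpus set, identified by task year and split (hypothetical)."""
    year: int
    split: str                   # "train", "devel" or "test"
    doc_paths: Tuple[str, ...]   # paths to files that already exist on disk


class CorpusSetView:
    """Read-only view over several corpus sets; it only stores references."""

    def __init__(self, *corpus_sets: CorpusSet) -> None:
        self._sets = corpus_sets

    def __iter__(self) -> Iterator[str]:
        # Walk the sets in the order given, without copying any file.
        return chain.from_iterable(s.doc_paths for s in self._sets)

    def __len__(self) -> int:
        return sum(len(s.doc_paths) for s in self._sets)


# Example: merge a training set and a development set for training.
train = CorpusSet(2011, "train", ("a.txt",))
devel = CorpusSet(2011, "devel", ("b.txt", "c.txt"))
merged = CorpusSetView(train, devel)
```

Because the view keeps only paths, merging sets in different combinations costs no extra disk space.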
However, because I did not implement a serialization process, the temporary files produced by the prediction models cannot be separated from the original data. Hence, I still have to save some redundant files for each task. This problem can be solved by giving ids to all objects, so that different data can be saved in different XML files while the common data is shared by every step of the model.
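The id-based idea could look roughly like the sketch below, which uses the standard xml.etree.ElementTree module. The element names, the id scheme (doc_id.tN), and the example tokens are assumptions made for illustration, not the actual RUPEE file format.

```python
import xml.etree.ElementTree as ET


def tokens_to_xml(doc_id, tokens):
    """Shared data: one <token> element per token, each with a stable id."""
    root = ET.Element("document", id=doc_id)
    for i, text in enumerate(tokens):
        tok = ET.SubElement(root, "token", id=f"{doc_id}.t{i}")
        tok.text = text
    return root


def predictions_to_xml(doc_id, labels):
    """Temporary model output: refers to tokens by id instead of copying them."""
    root = ET.Element("predictions", document=doc_id)
    for token_id, label in labels.items():
        ET.SubElement(root, "label", token=token_id, value=label)
    return root


shared = tokens_to_xml("doc1", ["IL-2", "binds"])
temp = predictions_to_xml("doc1", {"doc1.t0": "Protein"})

# Each tree can then be written to its own file, for example:
# ET.ElementTree(shared).write("doc1.tokens.xml")
# ET.ElementTree(temp).write("doc1.pred.xml")
```

The shared file is written once, and each step of the model only adds a small file of its own that points back to it by id.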
[X] Serialize objects and save them into XML files separately.
Attributes of corpus without annotation.
class Document(object):
    '''
    Document object contains the view of document with respect to a
    specific tokenization along with the corresponding sentences.

    Attributes:
        hyper_doc           The reference to the HyperDoc
        tokens              The list of all the tokens
        sentences           The list of Sentences
        charOffset2token    The index from character index to Token
    '''


class HyperDoc(object):
    '''
    HyperDoc is the abstract document that does not contain the concrete
    tokenizations. It contains a list of HyperSent, which contains the
    sentence boundary, entity and annotation information.

    Attributes:
        id                      file name, for example: PMID-997...
        raw_text                original text
        corpus_name             corpus name of this document
        documents               dictionary of Documents
        hyper_sents             list of HyperSents
        proteins                dictionary of proteins,
                                {protein_label -> protein_object}
        target_annotation       dictionary of annotations to create
                                (gold annotation, temporary annotation),
                                {annotation_name -> trigger_annotation}
        charOffset2hyper_sent   dictionary of character indices to HyperSent
    '''


class Sentence(object):
    '''
    Sentence object contains the tokens, index of dependency parse
    trees/graphs and phrase structure trees.

    Attributes:
        tokens          list of Tokens
        parseTrees      index of phrase structure trees,
                        {parser_name -> parse tree}
        dependencies    index of dependency parse trees/graphs,
                        {parser_name -> parse tree/graph}
        hyper_sent      HyperSent that contains this Sentence
        document        Document that this Sentence belongs to
    '''


class HyperSent(object):
    '''
    HyperSent manages the Sentence, annotations

    Attributes:
        charOffsetBeg     Start char index
        charOffsetEnd     End char index
        hyper_doc         HyperDoc that this HyperSent belongs to
        indexInDocument   The list of Sentences
        sentences         The dictionary of sentences {tok_name -> Sentence}
    '''


class Token(object):
    '''
    basic element, contain the raw text, lemma, character position ...

    Attributes:
        lemma           lemma
        stem            stemmed word
        raw_text        raw text
        decorated       ...
        pos             part of speech
        charOffsetBeg   index of first character
        charOffsetEnd   index of last character
        document        Document that this Token belongs to
        sentence        Sentence that this Token belongs to
        ner             ...
    '''
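As an illustration of the charOffset2token index, here is a small sketch. It assumes, following the Token docstring, that charOffsetEnd is the index of the last character (inclusive). SimpleToken, whitespace_tokenize, and build_char_offset_index are hypothetical stand-ins, not the real classes.

```python
import re


class SimpleToken:
    """Stand-in for Token that keeps only the fields the index needs."""

    def __init__(self, raw_text, charOffsetBeg, charOffsetEnd):
        self.raw_text = raw_text
        self.charOffsetBeg = charOffsetBeg   # index of first character
        self.charOffsetEnd = charOffsetEnd   # index of last character (inclusive)


def whitespace_tokenize(text):
    """Toy tokenizer: every run of non-whitespace characters is a token."""
    return [SimpleToken(m.group(), m.start(), m.end() - 1)
            for m in re.finditer(r"\S+", text)]


def build_char_offset_index(tokens):
    """Map every character index covered by a token to that token."""
    index = {}
    for token in tokens:
        for offset in range(token.charOffsetBeg, token.charOffsetEnd + 1):
            index[offset] = token
    return index


text = "IL-2 binds"
tokens = whitespace_tokenize(text)
charOffset2token = build_char_offset_index(tokens)
```

With such an index, a character span from an annotation file can be mapped to tokens with a few dictionary lookups. Characters between tokens, such as whitespace, simply have no entry.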
[ ] Save different annotations separately. Annotations are independent of the preprocessed documents. Annotations are saved in both HyperDoc and HyperSent, but Entity objects are saved only in HyperSent. HyperDoc and Document do not access Entity directly.
[ ] Remove 'base token protocol'.
[ ] Protocol to solve one file.
[ ] Protocol to invoke protein name recognition application.
2 DONE Java [2/2]
- State "DONE" from "TODO"
- State "TODO" from "TODO"
2.1 DONE Archive into an executable.
- State "DONE" from "TODO"
- State "TODO" from ""
2.2 DONE Make command line interface.
- State "DONE" from "TODO"
- State "TODO" from ""
3 DONE Display
- State "DONE" from "SINGLE"
Using BRAT.
With an Apache server installed on the machine and /home/xiaoliu/Public/
as the server folder, clone the git repository of brat into Public and
run the installation script brat/install.sh. This script asks for the
user name and password for accessing the server (I think these are only
used to edit the annotations) and the mail address of the administrator,
and creates two directories, data and work, with appropriate group and
permissions. The data we wish to display should be placed under data,
while work is for the script to generate some necessary files. If Apache
is properly configured to run CGI scripts, you should now be able to
visit brat at localhost/brat, which is automatically redirected to
localhost/brat/index.xhtml. If Apache is not properly configured, you
will see nothing and be told that you cannot access the brat folder. You
can rename the file .htaccess to .htaccess_bp to force access to
index.xhtml and collect the error information. An alternative way to
gather the error messages is to run brat/tools/troubleshooting.sh.
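To place data under the data directory, each document needs a .txt file and a matching .ann file in brat's standoff format. The sketch below writes text-bound annotations only. It assumes entities are given as (type, first index, last index) tuples with an inclusive last index, as in the Token attributes above, and converts them to brat's exclusive end offset. The function names and the example entity are hypothetical.

```python
from pathlib import Path


def to_brat_ann(text, entities):
    """Render entities as brat text-bound annotation lines (T1, T2, ...)."""
    lines = []
    for n, (etype, beg, last) in enumerate(entities, start=1):
        end = last + 1   # brat end offsets are exclusive
        lines.append(f"T{n}\t{etype} {beg} {end}\t{text[beg:end]}\n")
    return "".join(lines)


def write_brat_document(data_dir, doc_id, text, entities):
    """Write doc_id.txt and doc_id.ann side by side in data_dir."""
    data_dir = Path(data_dir)
    (data_dir / f"{doc_id}.txt").write_text(text, encoding="utf-8")
    (data_dir / f"{doc_id}.ann").write_text(
        to_brat_ann(text, entities), encoding="utf-8")


ann = to_brat_ann("IL-2 binds", [("Protein", 0, 3)])
```

brat also reads the allowed annotation types from an annotation.conf file in the data directory, which this sketch does not create.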
The CGI configuration tutorial given by the developers does not work on
my machine. Here I record what I did to make it work. First, do not edit
/etc/apache2/httpd.conf; edit
/etc/apache2/sites-available/000-default.conf instead. In addition to
what the tutorial says to add, you have to add ExecCGI to Options to
make it work. Then restart the Apache server with
sudo service apache2 restart.
To summarize, the following is needed for Apache CGI:
- Add
    ScriptAlias /cgi-bin/ /usr/lib/cgi-bin/
    <Directory /home/xiaoliu/Public>
        Options Indexes FollowSymLinks ExecCGI
        AllowOverride Options Indexes FileInfo Limit
        Require all granted
        AddHandler cgi-script .cgi
    </Directory>
  into /etc/apache2/sites-available/000-default.conf.
- Run sudo a2enmod cgi and then sudo service apache2 restart.
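One way to check the CGI setup independently of brat is a tiny test script. The sketch below is an assumption about how one might test it and is not part of the brat instructions. Saved under a hypothetical name such as /home/xiaoliu/Public/test.cgi and made executable with chmod +x, it should show a plain-text message at localhost/test.cgi if ExecCGI and the .cgi handler are active.

```python
#!/usr/bin/env python3
"""Minimal CGI script for checking that Apache executes .cgi files."""
import sys


def render_response():
    # A CGI response is a header block, a blank line, then the body.
    body = "CGI is working\n"
    return "Content-Type: text/plain\r\n\r\n" + body


if __name__ == "__main__":
    sys.stdout.write(render_response())
```

If the browser shows the source of the script instead of the message, Apache is serving the file as static content and the CGI configuration is still not in effect.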
4 SINGLE Web Site [0/2]
[ ] Input, background execution, display.
[ ] Protein name recognition.