RUPEE Reconstruction Notes

1 TODO Python [2/6]

  • [X] Create a corpus sets view over multiple corpora.

I define corpus sets as the original corpus partitions provided by the task organizers, which are distinguished by the year of the task and by their purpose (training, development, or test). However, to meet the needs of algorithm development, or to train a stronger classifier for participating in the challenge, one usually merges some of these corpora. So I created a corpus sets view that works as a wrapper around the original files and the preprocessing results, so that only one copy of these files is kept on disk.
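As a minimal sketch of the wrapper idea, assuming the documents of every corpus set are already loaded into one shared dictionary, a view can simply remember which sets it covers and iterate over them without copying anything. The class name CorpusSetView, the set names and the document values below are hypothetical and only illustrate the idea; they are not the actual implementation.

```python
class CorpusSetView(object):
    """Hypothetical read-only view over several corpus sets in one shared store."""

    def __init__(self, store, set_names):
        # store: {set_name -> {doc_id -> document}}, loaded once and shared
        self._store = store
        self._set_names = list(set_names)

    def __iter__(self):
        # Yield the documents of each member set without copying them
        for name in self._set_names:
            for doc_id in sorted(self._store[name]):
                yield self._store[name][doc_id]

    def __len__(self):
        return sum(len(self._store[name]) for name in self._set_names)


# Hypothetical corpus sets, keyed by task year and purpose
store = {
    "2011-train": {"PMID-1": "doc one", "PMID-2": "doc two"},
    "2011-devel": {"PMID-3": "doc three"},
    "2013-train": {"PMID-4": "doc four"},
}
train_plus_devel = CorpusSetView(store, ["2011-train", "2011-devel"])
```

Merging two sets then costs only a list of names, and every view reads from the same single copy of the data.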

However, as I did not implement a serialization process, the temporary files produced by the prediction models cannot be separated from the original data. Hence, I still have to save some redundant files for each task. This problem can be solved by giving ids to all the objects, so that different data can be saved in different XML files while the common data remains shared by every step of the model.
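The id-based idea can be sketched as follows, assuming tokens and annotations are written to two separate XML files and the annotation file refers to tokens by id only. The element names, attribute names and function names here are hypothetical and chosen for illustration; the sketch uses only the standard xml.etree.ElementTree module.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def save_tokens(path, tokens):
    # tokens: list of (token_id, raw_text, charOffsetBeg, charOffsetEnd)
    root = ET.Element("tokens")
    for tok_id, text, beg, end in tokens:
        ET.SubElement(root, "token", id=tok_id, beg=str(beg), end=str(end)).text = text
    ET.ElementTree(root).write(path, encoding="utf-8")


def save_annotations(path, annotations):
    # annotations: list of (annotation_id, label, token_id); only ids are stored
    root = ET.Element("annotations")
    for ann_id, label, tok_id in annotations:
        ET.SubElement(root, "annotation", id=ann_id, label=label, token=tok_id)
    ET.ElementTree(root).write(path, encoding="utf-8")


def load_annotated_text(token_path, annotation_path):
    # Resolve each annotation's token id against the shared token file
    text_by_id = {t.get("id"): t.text for t in ET.parse(token_path).getroot()}
    return [(a.get("label"), text_by_id[a.get("token")])
            for a in ET.parse(annotation_path).getroot()]


workdir = tempfile.mkdtemp()
token_file = os.path.join(workdir, "tokens.xml")
gold_file = os.path.join(workdir, "gold.xml")
save_tokens(token_file, [("t1", "IL-2", 0, 3), ("t2", "binds", 5, 9)])
save_annotations(gold_file, [("a1", "Binding", "t2")])
resolved = load_annotated_text(token_file, gold_file)
```

With this layout a prediction model can write its own annotation file next to the gold one, and both keep pointing at the same token file.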

  • [X] Serialize objects and save them into XML files separately

Attributes of the corpus objects, without annotations.

class Document(object):
    '''
    Document object contains the view of a document with respect to a specific tokenization,
    along with the corresponding sentences.

    Attributes:
        hyper_doc           The reference to the HyperDoc
        tokens              The list of all the tokens
        sentences           The list of Sentences
        charOffset2token    The index from character index to Token
    '''
class HyperDoc(object):
    '''
    HyperDoc is the abstract document that does not contain the concrete tokenizations.
    It contains a list of HyperSent, which contains the sentence boundary, entity and
    annotation information.

    Attributes:
        id                   file name, for example: PMID-997...
        raw_text             original text
        corpus_name          corpus name of this document
        documents            dictionary of Documents
        hyper_sents          list of HyperSents
        proteins             dictionary of proteins, {protein_label -> protein_object}
        target_annotation    dictionary of annotations to create (gold annotation, temporary annotation), {annotation_name -> trigger_annotation}
        charOffset2hyper_sent dictionary of character indices to HyperSent
    '''
class Sentence(object):
    '''
    Sentence object contains the tokens, index of dependency parse trees/graphs and phrase structure trees.

    Attributes:
        tokens          list of Tokens
        parseTrees      index of phrase structure trees, {parser_name -> parse tree}
        dependencies    index of dependency parse trees/graphs, {parser_name -> parse tree/graph}
        hyper_sent      HyperSent that contains this Sentence
        document        Document that this Sentence belongs to
    '''
class HyperSent(object):
    '''
    HyperSent manages the Sentences and the annotations.

    Attributes:
        charOffsetBeg       Start char index
        charOffsetEnd       End char index
        hyper_doc           HyperDoc that this HyperSent belongs to
        indexInDocument     Index of this HyperSent in the document
        sentences           The dictionary of sentences {tok_name -> Sentence}
    '''
class Token(object):
    '''
    Basic element; contains the raw text, lemma, character positions, ...

    Attributes:
        lemma               lemma
        stem                stemmed word
        raw_text            raw text
        decorated           ...
        pos                 part of speech
        charOffsetBeg       index of first character
        charOffsetEnd       index of last character
        document            Document that this Token belongs to
        sentence            Sentence that this Token belongs to
        ner                 ...
    '''
  • [ ] Save different annotations separately. Annotations are independent of the preprocessed documents. Annotations are saved in both HyperDoc and HyperSent, but Entity objects are saved only in HyperSent. HyperDoc and Document do not access the Entity objects directly.
  • [ ] Remove 'base token protocol'.
  • [ ] Protocol to solve one file.
  • [ ] Protocol to invoke protein name recognition application.
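
The charOffset2token index listed among the Document attributes above could be built roughly as in the sketch below. It assumes that charOffsetEnd is inclusive (the docstring calls it the index of the last character) and represents each token as a plain tuple instead of a Token object; the function name build_char_offset_index is hypothetical.

```python
def build_char_offset_index(tokens):
    # tokens: list of (raw_text, charOffsetBeg, charOffsetEnd), end inclusive
    index = {}
    for token in tokens:
        _, beg, end = token
        # Map every character position covered by the token to the token itself
        for offset in range(beg, end + 1):
            index[offset] = token
    return index


tokens = [("IL-2", 0, 3), ("binds", 5, 9)]
char_offset_to_token = build_char_offset_index(tokens)
```

Positions that fall between tokens, such as the space at offset 4 in this example, are simply absent from the index, so a lookup can tell whitespace from token text.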

2 DONE Java [2/2]

  • State "DONE" from "TODO" [2014-10-09 jeu. 18:14]
  • State "TODO" from "TODO" [2014-10-09 jeu. 18:11]

2.1 DONE Archive into an executable.

  • State "DONE" from "TODO" [2014-10-09 jeu. 18:14]
  • State "TODO" from "" [2014-10-09 jeu. 18:13]

2.2 DONE Make command line interface.

  • State "DONE" from "TODO" [2014-10-09 jeu. 18:14]
  • State "TODO" from "" [2014-10-09 jeu. 18:13]

3 DONE Display

  • State "DONE" from "SINGLE" [2014-10-14 mar. 14:17]

Using BRAT.

With the Apache server installed on the machine and the server folder set to /home/xiaoliu/Public/, clone the git repository of brat into Public and run the installation script brat/install.sh. This script asks for a user name and password for accessing the server (I think these are only used to edit the annotations) and for the mail address of the administrator, and then creates two directories, data and work, with the appropriate group and permissions. The data we wish to display should be placed under the data directory, while work is where the script generates some necessary files. If your Apache is properly configured to run CGI scripts, you should now be able to visit brat at localhost/brat, which automatically redirects to localhost/brat/index.xhtml. If the Apache server is not properly configured, you will see nothing and be told that you cannot access the brat folder. You can rename the file .htaccess to .htaccess_bp to force access to index.xhtml and collect the error information. An alternative way to gather the error messages is to run brat/tools/troubleshooting.sh.

The CGI configuration tutorial given by the developers does not work on my machine. Here I record what I did to make it work. First, do not edit /etc/apache2/httpd.conf; edit /etc/apache2/sites-available/000-default.conf instead. In addition to the settings listed in the tutorial, you have to add ExecCGI to Options to make it work. Then restart the Apache server with sudo service apache2 restart.

To summarize, the steps needed for Apache CGI are listed below.

  • Add
ScriptAlias /cgi-bin/ /usr/lib/cgi-bin/
<Directory /home/xiaoliu/Public>
    Options Indexes FollowSymLinks ExecCGI
    AllowOverride Options Indexes FileInfo Limit
    Require all granted
    AddHandler cgi-script .cgi
</Directory>

into /etc/apache2/sites-available/000-default.conf.

  • Run sudo a2enmod cgi and then sudo service apache2 restart.
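
Since a missing ExecCGI was what broke my setup, a small check of the configuration text can save some time. The sketch below is only an illustration: the function name missing_cgi_settings is hypothetical, it looks at the text of a single Directory block with regular expressions, and it does not replace Apache's own configuration parsing.

```python
import re


def missing_cgi_settings(conf_text, directory):
    # Find the <Directory ...> block for the given path
    block = re.search(
        r"<Directory\s+%s/?>(.*?)</Directory>" % re.escape(directory),
        conf_text, re.DOTALL)
    if block is None:
        return ["<Directory %s>" % directory]
    body = block.group(1)
    missing = []
    # ExecCGI must appear on an Options line (not on AllowOverride)
    if not re.search(r"^\s*Options\b.*\bExecCGI\b", body, re.MULTILINE):
        missing.append("ExecCGI")
    # .cgi files must be handled as CGI scripts
    if not re.search(r"^\s*AddHandler\s+cgi-script\b.*\.cgi\b", body, re.MULTILINE):
        missing.append("AddHandler cgi-script .cgi")
    return missing


# Sample text with ExecCGI deliberately left out of Options
sample = """
<Directory /home/xiaoliu/Public>
    Options Indexes FollowSymLinks
    AllowOverride Options Indexes FileInfo Limit
    Require all granted
    AddHandler cgi-script .cgi
</Directory>
"""
problems = missing_cgi_settings(sample, "/home/xiaoliu/Public")
```

In practice the text would be read from /etc/apache2/sites-available/000-default.conf; an empty result means both settings were found in the block.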

4 SINGLE Web Site [0/2]

  • [ ] Input, background execution, display.
  • [ ] Protein name recognition.

Author: Xiao LIU

Created: 2014-10-29 Wed 18:05

Emacs 24.3.1 (Org mode 8.2.10)