Gene Regulation Network (GRN) task in BioNLP

1. GRN task definition
- 1.1. GRN annotations have three levels
- 1.2. Generation of network
2. KU Leuven method
- 2.1. Framework
- 2.2. FEATURES

To work on the patient guideline event extraction part of the MUSE ¹ project, I was advised to study the methods used by the KU Leuven team in the BioNLP workshop. They attended this workshop to develop and evaluate algorithms on the benchmark data set, and planned to apply the resulting algorithms to the patient guideline event extraction problem.

1 GRN task definition

The GRN task consists of two steps: first, extract formulas from text; second, build the GRN from the extracted formulas. The algorithm for the second step was developed by the task organizers, so participants only have to work on the algorithm for the first step.

1.1 GRN annotations have three levels

1. Text-bound entities are given in both the training and test data. Unlike the GENIA task, the GRN task also provides trigger words and distinguishes entity types such as gene, protein, gene family, etc. These typed entities are called genic entities.
2. Biomedical events and relations resemble the simple events in the GENIA task. However, the defined relations distinguish passive and active roles; for example, Transcription_from and Transcription_by are defined as two separate events. Promoter_of and Master_of represent the knowledge more precisely. The argument types are strictly defined with respect to the entity types (gene, protein, etc.) of the first level. This level of annotation is called events and relations, and it does NOT contain recursive events.
3. Interactions comprise six types of relations: Binding, Transcription, Activation, Requirement, Inhibition and Regulation. The first two types represent mechanisms, the next three represent effects, and the last collects all other relations. Interactions can be recursive.

1.2 Generation of network

Though only the level 3 annotation and the network are submitted to the official evaluation, constructing the GRN requires inference over the level 2 annotation for interactions that do not link directly to genic entities.

If the agent/target of an interaction is a genic named entity, the agent/target node is the gene identifier of the entity. If the entity does not have a gene identifier, it is not a genic name. In the GENIA task, some protein entities are sub-strings, such as Il-1,2,3. Does GRN contain similar annotations? Are they ignored (2 and 3 do not have a gene identifier)?
If the agent/target is an event, the node is the entity referenced by the event.
If the agent/target is a relation, both arguments of the relation (its agent and its target) are nodes.
If the agent/target is a promoter, the agent is the argument that follows the promoter_of or master_of_promoter relation.
Edges are ordered by a hierarchy, and edges with lower priority are removed. When both (A, Transcription, B) and (A, Regulation, B) exist, (A, Transcription, B) is kept.
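
The edge-pruning rule can be sketched as follows. This is only an illustration, not the organizers' algorithm: the notes confirm only that Transcription takes priority over Regulation, so the full ranking in `PRIORITY` (mechanisms, then effects, then Regulation) is an assumption, and `prune_edges` is a hypothetical helper name.

```python
from typing import Dict, Iterable, List, Tuple

Edge = Tuple[str, str, str]  # (agent, interaction_type, target)

# ASSUMPTION: lower number = higher priority. Only "Transcription beats
# Regulation" is stated in the notes; the rest of the ranking is illustrative.
PRIORITY: Dict[str, int] = {
    "Binding": 0,
    "Transcription": 0,
    "Activation": 1,
    "Requirement": 1,
    "Inhibition": 1,
    "Regulation": 2,
}


def prune_edges(edges: Iterable[Edge]) -> List[Edge]:
    """Keep, for every (agent, target) pair, only the highest-priority edges."""
    edge_list = list(edges)

    # First pass: find the best (smallest) rank seen for each node pair.
    best_rank: Dict[Tuple[str, str], int] = {}
    for agent, itype, target in edge_list:
        rank = PRIORITY[itype]
        key = (agent, target)
        if key not in best_rank or rank < best_rank[key]:
            best_rank[key] = rank

    # Second pass: keep edges at the best rank, dropping exact duplicates.
    kept: List[Edge] = []
    seen = set()
    for edge in edge_list:
        agent, itype, target = edge
        if PRIORITY[itype] == best_rank[(agent, target)] and edge not in seen:
            kept.append(edge)
            seen.add(edge)
    return kept


example = [
    ("A", "Transcription", "B"),
    ("A", "Regulation", "B"),
    ("C", "Regulation", "D"),
]
pruned = prune_edges(example)
```

Here `pruned` keeps (A, Transcription, B) and (C, Regulation, D). Edges that tie at the best rank for the same pair are all kept, which is another assumption about the hierarchy.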

2 KU Leuven method

2.1 Framework

SVMLight implementation in the Shogun Machine Learning Toolbox. All pairs of genic entities in a sentence are considered as candidates. Differential weighting is used to deal with the data imbalance. Did they work on extracting the level 2 annotation?
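
A minimal sketch of the candidate generation and class weighting, in plain Python and without the Shogun API. Two assumptions are not confirmed by the notes: pairs are treated as ordered (agent, target), and the differential weights are taken as inverse class frequencies. The function names and entity labels are hypothetical.

```python
from collections import Counter
from itertools import permutations
from typing import Dict, Hashable, List, Sequence, Tuple


def candidate_pairs(entities: Sequence[str]) -> List[Tuple[str, str]]:
    """All ordered (agent, target) pairs of distinct entities in one sentence."""
    return list(permutations(entities, 2))


def class_weights(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """ASSUMED scheme: weight each class by n / (n_classes * class_count),
    so that rarer classes receive larger weights."""
    counts = Counter(labels)
    n = len(labels)
    return {label: n / (len(counts) * count) for label, count in counts.items()}


# Hypothetical sentence with three genic entities.
pairs = candidate_pairs(["geneA", "geneB", "geneC"])

# Hypothetical labels: one positive pair among four candidates.
weights = class_weights([1, 0, 0, 0])
```

With three entities there are six ordered pairs, most of which carry no interaction, which is the imbalance the weighting is meant to offset.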

2.2 FEATURES

Entity features f_ent and pairwise features f_extra. The Stanford parse tree is used. Dependency tree or phrase structure tree?

f_ent contains the base features and context features for all the words in the entity. Features are normalized by the number of words.
1. Base features f_base:
  1. Entity type.
  2. Similarity scores for words in the dictionary, by shared beginning (details? stemming?).
  3. Part-of-speech tags produced by NLTK. Similarity scores?
  4. Location of words in the sentence, normalized to (0,1). Subspace of the two location dimensions of the two entities?
  5. Depth in the parse tree.
2. Context features: a weighted average over all other words in the sentence, computed as a weighted sum of the f_base feature vectors of the words in the sentence. The weights are $\alpha^{d_i(w, w_j)}$, where $d_i(w, w_j)$ is the parse tree distance from $w$ to $w_j$ in sentence $i$.
f_extra: distance between the two entities in the Stanford parse tree, and the location and count of Promoter entities.
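
The context features described above can be sketched in plain Python. This is a rough illustration under stated assumptions, not the KU Leuven implementation: the value of `alpha`, the exclusion of the word itself, and the normalization by the sum of weights are all guesses, and the parse tree distances are supplied as a precomputed matrix.

```python
from typing import List


def context_features(
    base: List[List[float]],
    dist: List[List[int]],
    alpha: float = 0.5,  # ASSUMPTION: the decay base is not given in the notes
) -> List[List[float]]:
    """For each word j, average the f_base vectors of the other words k,
    weighted by alpha ** dist[j][k] (parse tree distance from j to k)."""
    n = len(base)
    dim = len(base[0]) if base else 0
    result: List[List[float]] = []
    for j in range(n):
        # ASSUMPTION: the word itself gets weight 0 ("all other words").
        w = [alpha ** dist[j][k] if k != j else 0.0 for k in range(n)]
        total = sum(w)
        vec = [0.0] * dim
        for k in range(n):
            for d in range(dim):
                vec[d] += w[k] * base[k][d]
        # ASSUMPTION: normalize by the weight sum to get a weighted average.
        if total > 0:
            vec = [v / total for v in vec]
        result.append(vec)
    return result


# Toy sentence of three words with two-dimensional base features.
base = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
dist = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
ctx = context_features(base, dist, alpha=0.5)
```

With `alpha` below 1, words that are close in the parse tree contribute more than distant ones. Dropping the normalization step would give the plain weighted sum instead.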

Footnotes:

Machine Understanding for interactive StorytElling (MUSE) project http://www.muse-project.eu/index.html. Username: muse, password: Pa4MpPw@kul.