# Summary of Reviews for the Paper Submitted to BMC Bioinformatics

This document contains the summary of reviews for the paper I submitted to BMC Bioinformatics (BioNLP special issue). The full reviews are shared on Google Docs. I divide the reviews into five groups: 1) English rephrasing, 2) details enrichment, 3) arguable comments, 4) complement, 5) format. The English rephrasing section contains comments that ask us to rephrase English descriptions. The details enrichment section contains requests to add details of experiments, references, etc. Arguable comments are points on which reviewers hold different opinions. The complement section includes requests to complete our application and generalize it to all the tasks. The format comments section focuses on conflicts between our current paper and the BMC journal format.

I add my explanation in bold at the beginning of each section, and my comments in /green italic/ style after the list items. Issues that are struck through are solved. For issues followed by an underlined sentence, I know how to solve them but have not yet taken action.

## 1 English Rephrasing Comments

The English rephrasing comments are the easiest to deal with. I list all comments of this kind in this section.

• In the Abstract: "Admittedly" should be changed
• In the Background: "An example of event" should be "an example of an event"
• In the Data subsection: "abstracts of full articles" should be "abstracts or full articles"
• In the Post-Processing subsection: "extra-arguments" should be split into two words
• In the Related Work subsection: "are ran" should be "are run"; "consequential" should be changed
• In the Performance Analysis subsection: "rightfully" should be changed

## 2 Details Enrichment Comments

Most of the details enrichment comments ask me to add more details about definitions, methods and hyper-parameters, which is not difficult.

• Abstract
• "the best results" is not true. The reviewer thinks it is fair to add the FAUST system to the comparison, according to the task requirements.
• Background
• "huge amounts of electronic bio-medical documents" is unnecessarily imprecise.
I do not understand what the reviewer wants. A precise number of documents?
• "[bio-medical documents] involve domain-specific […] dependencies" is an unclear claim. The same as the comment above.
• "15 times faster [than Riedel et al.] " should specify the condition (e.g. omission of feature caching)
• Data
• "extracts of articles from PubMed Central": some are only in PubMed, not PMC.
This is my mistake; I thought they were the same thing, inferring from their names. In fact, PubMed is a super-set of PMC: articles in PMC can be freely downloaded, whereas only abstracts are accessible for the other articles in PubMed.
• About the discussion of consistency and its effect on performance, the reviewer suggests contrasting our analysis with that presented in "Overview of the ID, EPI and REL tasks of BioNLP Shared Task 2011"
I read the suggested article and found the statement "each entity annotation consists of a type". Note that this article was published for BioNLP 2011; it does not explain the inconsistency between BioNLP 2009 and 2011.
• Methods
• Algorithm 1
• The reviewer suggests replacing "score and label" and "highest score" with more precise descriptions of the steps.
• Fitting the Pairwise Model
• "two hyper-parameters that are set by cross-validation". More information on the selection (e.g. the considered values) is required.
• The method that the author used to deal with the heavy class imbalance should be added.
We have already cited the paper describing the method we used (the C+/C- weighting).
• Computational Considerations
• "SVM-based classification scales well" is unnecessarily imprecise. Efficient linear SVM implementations are invariant to the number of examples at prediction time (the weight vector is explicitly represented).
I do not understand this point.
• Pre-Processing & Features
• Add examples of the two tokenizations
"inhibit NF-kappaB-dependent pro-inflammatory gene transcription" vs "inhibit NF-kappaB dependent pro-inflammatory gene transcription"
• Add examples of each feature for clarity.
Should we add examples in the feature table or in the running text? The table does not have room for examples, whereas enumerating examples for each feature in the text would be unwieldy.
• Add details of generating IntAct features.
• Identify the "heuristics from the UCLEED system"
Does this requirement mean explaining how the dependency path is trimmed?
• Related Work
• Mention rule-based (e.g. NCBI) and pattern-based (e.g. NICTA) methods.
• Results
• Add the comparison with the system by Miwa et al (2010), as it reported one of the best performances, particularly with Binding events.
This system was run on the BioNLP 2009 data set; we do not have results on that data set.
• Add the FAUST system to the comparison
• Conclusion
• Provide more detail regarding future work
• "the best result" is not true (FAUST is the best).
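To help me understand the Computational Considerations remark ("efficient linear SVM implementations are invariant to the number of examples at prediction time"), here is a minimal sketch of the point: a linear SVM keeps an explicit weight vector, so predicting one example costs a single dot product of length d, whereas a kernel SVM sums over its support vectors, whose number grows with the training set. All names below are illustrative, not taken from our system.

```python
import numpy as np

def linear_svm_predict(w, b, x):
    # Linear SVM: the decision function is sign(w . x + b).
    # Cost: one dot product of length d -- independent of training-set size.
    return np.sign(w @ x + b)

def kernel_svm_predict(support_vectors, alphas, b, x, kernel):
    # Kernel SVM: one kernel evaluation per support vector, so the
    # prediction cost grows with the number of (support) training examples.
    score = sum(a * kernel(sv, x) for a, sv in zip(alphas, support_vectors))
    return np.sign(score + b)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 10
    w, x = rng.normal(size=d), rng.normal(size=d)
    print(linear_svm_predict(w, 0.0, x))  # always +1.0 or -1.0
```

So the reviewer's point is that for a linear model, prediction time does not scale with how many examples were used for training; only training time does.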

## 3 Arguable Comments

The arguable comments are points on which reviewers disagree with us. Some of them need support from new experiments, which could cost a certain amount of time. Considering the camera-ready deadline and my defense date, I have to organize the schedule for them carefully. Besides, I have doubts about some of these comments.

• Parts of the terminology are confusing. The word "entity" can be interpreted both as a term in biology and as a term in text processing. Besides, the trigger is not exactly a "named entity".
I think it is better to replace "named entity" with "text entity" and to add a definition stating that a text entity is a character string.
• Methods
• Algorithm
• The definition of P_S=(c_i,a_j) is inconsistent, because the arguments are initialized with proteins but come to involve predicted triggers during the process.
• Pre-Processing
• About the claim "high quality dependency parse trees require a fine grained tokenization", the reviewer requires us to provide supporting evidence.
It is hard to give direct evidence of the quality of a dependency parse. We can only say that dependency parses with fine-grained tokenization provide better features for this task; the quality of the dependency parse itself is not defined. To demonstrate the benefit of dependency parses with fine-grained tokenization, we would have to run experiments with dependency parsing based on coarse tokenization.
• "trees are finally obtained using ….". The reviewer thought such parses were provided in the supporting data.
The supporting data were generated using a different tokenization, and hence differ from what we generated.
Emphasize that we computed them ourselves.
• Related work
• "rich features coding for (trigger, argument) pairs [16] are only used by pipeline models for assigning arguments": this is false at least for the TEES and EVEX systems.
/This differs from what I read. TEES has two steps: trigger detection and edge detection. The pair features are not used in trigger detection./
Rephrase and add an explanation for the reviewer.
• Results
• Genia Shared Task 2013
• "it is difficult to estimate standard errors properly": consider referencing statistical significance results in [1].
Reply in the response email
• clarify if EVEX and TEES also use the data-set from BioNLP2011
They did not discuss this in their workshop papers. I have sent an email to ask which data set they used. However, even if they reply saying they used the BioNLP 2011 data set, their email cannot serve as a reference.
• Genia Shared Task 2011
• "TEES (no detailed results available)": available on the task web page and in "The Genia Event and Protein Co-reference tasks of the BioNLP Shared Task 2011"
• precision-recall curves with different features
• Figure 5 in the results is easy to misinterpret. From this curve it seems that V-walk features are better left out. The authors mention that this is due to the limited size of the test set, but I would like to see a more careful analysis of how this phenomenon scales with the size of the test set. Related to this, the authors could do a more elaborate feature selection step: this is easy to do with an SVM, and improved results using feature selection for event extraction have been described before. I would like to discuss this.
Not worth running experiments, but I have to be aware of feature selection methods (e.g. the Lasso).

• training duration
• more theoretical characterization of the computational complexity (Reviewer 1)
This should go in the Methods section. I do not fully understand the reviewer's point.
Explain our limitation with a soft description that acknowledges the uncertainty.
• Scala is slower than C; the training duration may be partly due to the programming language. (reviewer 1)
• The claimed efficiency order (pipeline > RUPEE > globally joint system) is rough and lacks support. There is no training-duration test for the pipeline system and only one case for the joint system. (reviewer 3)
20 minutes for the pipeline counterpart.
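As a note to myself on the feature selection point above (Figure 5, the Lasso), here is a minimal sketch of L1-based feature selection: an L1 penalty drives the weights of uninformative features to exactly zero, so they can be dropped. This uses plain ISTA (proximal gradient) on a least-squares loss with synthetic data; a real setup would use an L1-regularized SVM or logistic loss on our event-extraction features. All names are illustrative.

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1: shrink each coordinate toward zero.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    # ISTA: gradient step on (1/2n)||Xw - y||^2, then soft-threshold (L1 prox).
    n, d = X.shape
    w = np.zeros(d)
    step = n / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n
        w = soft_threshold(w - step * grad, step * lam)
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = 3.0 * X[:, 0] - 2.0 * X[:, 1]   # only features 0 and 1 matter
    w = lasso_ista(X, y, lam=0.1)
    # weights of features 2..4 are driven to (exactly) zero: drop them
    print(np.nonzero(np.abs(w) > 1e-3)[0])
```

The features whose learned weight is zero are the candidates to leave out, which is the kind of principled selection the reviewer is asking about instead of ablating whole feature families.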

## 4 Complement Comments

The requirements in the complement comments need a lot of time. In my opinion, the comments of reviewer 3 concern future work.

• Make the software usable outside the BioNLP community: a web tool or API that lets biologists apply this method to text data. (reviewer 1)
Display it like http://corpora.informatik.hu-berlin.de
It would cost a large amount of time to add protein recognition ourselves.
• Make a general framework for all the BioNLP tasks; this would break the constraints of the Genia task. (reviewer 3) This is not a requirement for this paper, but for future work.

## 5 Format Comments

The format comments are essential to getting our paper published.

• Figures (e.g. figure 5) are hard to interpret when printed in gray-scale.
• BMC Bioinformatics does not allow footnotes and requires that URLs be included in references. Link to the BMC format guidelines.

## 6 Reports

### 6.1 Coarse Tokenization

I ran experiments to determine the respective impacts of coarse and fine-grained tokenization on the final performance. Since we argue about the effects of dependency parses with different tokenizations, it is necessary to find the optimal thresholds on the shortest dependency paths, as I did before. Hence, the first part of these experiments selects the thresholds on the BioNLP 2013 development set, and the second part tests the performance on both the BioNLP 2011 and 2013 test sets using the optimal thresholds.

#### 6.1.1 Dependency Path Threshold Selection

I used the training sets from BioNLP 2011 and 2013 to train the classifiers and evaluated them on the BioNLP 2013 development set. Unlike the previous experiments made to select the optimal thresholds, I selected the thresholds on the RUPEE model instead of the pairwise model. Hence, the intermediate results are not comparable to those reported in my thesis.

1. Trigger-Theme Pair Detection Step

I list below the f-scores, precisions and recalls of the single-argument events and of the total trigger-theme pair predictions. I spent much time last Thursday and Friday resolving conflicts in the data, so I ran all the experiments from a shell script to save time. However, I wrote a wrong command in the shell script and lost some results for the trigger-theme pairs. From the experimental results, I selected 2, 3 and None for the next step (Binding argument fusion).

| Dep. length | <2 | <3 | <4 | <5 | <6 | None |
|---|---|---|---|---|---|---|
| SVT | 75.09 | 75.10 | 73.82 | 74.27 | 73.08 | 74.70 |
| prec. | 85.31 | 85.05 | 78.64 | 78.74 | 75.19 | 80.54 |
| recall | 67.05 | 67.23 | 69.85 | 70.28 | 71.08 | 69.65 |
| THEME-TOTAL | 63.45 | 63.11 | | | | 63.22 |
| prec. | 77.05 | 72.25 | | | | 65.00 |
| recall | 53.94 | 56.03 | | | | 61.54 |
2. Binding Argument Fusion Step

I list below the f-scores of the Binding event predictions using different thresholds. Each row gives the results of experiments with a given threshold from the previous step ("Prev" in the first column), and each column corresponds to a threshold used during the Binding argument fusion step. It seems that thresholding the dependency paths does not improve the performance. Hence, I keep only one configuration for the next step: the threshold for the first step is 3 and there is no threshold for the Binding argument fusion step.

| Dep. length | 3 | 4 | 5 | 6 | None |
|---|---|---|---|---|---|
| Prev = 2 | 39.94 | 41.30 | 44.18 | 44.18 | 44.11 |
| Prev = 3 | 39.89 | 42.84 | 45.93 | 45.60 | 46.00 |
| Prev = None | 39.69 | 42.42 | 45.54 | 45.30 | 45.65 |
3. Regulation Cause Argument Assignment Step

In the last step, we have two thresholds to select: one for the path between the trigger and the Cause argument, and another for the path between the two arguments. The table below lists all the experimental results, where each row lists the total event f-scores of experiments using a given trigger-cause threshold and each column lists the results using a given theme-cause threshold. The differences between these results are not very significant and it is hard to find a general pattern. Therefore, I simply used the configuration with the best result ($$2\le l \le None$$, $$l\le 3$$).

| Dep. length | l<2 | l<3 | l<4 | None |
|---|---|---|---|---|
| $$1\le l\le 5$$ | 50.76 | 50.72 | 51.00 | 51.22 |
| $$2\le l\le 5$$ | 50.86 | 50.93 | 51.02 | 51.14 |
| $$3\le l\le 5$$ | 50.35 | 50.01 | 50.09 | 50.31 |
| None | 50.86 | 50.79 | 50.90 | 49.85 |
| $$1\le l\le 4$$ | 50.51 | 50.73 | 50.80 | 50.87 |
| $$2\le l\le 4$$ | 50.86 | 50.89 | 50.89 | 51.15 |
| $$3\le l\le 4$$ | 50.02 | 50.18 | 50.11 | 50.06 |
| $$1\le l\le None$$ | 50.86 | 50.79 | 50.90 | 51.17 |
| $$2\le l\le None$$ | 51.04 | 51.46 | 51.39 | 51.24 |
| $$3\le l\le None$$ | 50.35 | 50.43 | 50.26 | 50.42 |
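The two-threshold selection above amounts to a plain grid search: evaluate every (trigger-cause, theme-cause) threshold pair on the development set and keep the configuration with the best f-score. A minimal sketch, where the scores are a few cells copied from the table above and the helper name is my own:

```python
def select_thresholds(dev_scores):
    """Return the threshold pair with the highest development f-score.

    dev_scores: dict mapping (trigger_cause_thr, theme_cause_thr) -> f-score,
    where a threshold may be None (no thresholding).
    """
    return max(dev_scores, key=dev_scores.get)

if __name__ == "__main__":
    # (2, None) stands for the row "2 <= l <= None"; 3 for the column "l < 3".
    dev_scores = {
        ((2, None), 3): 51.46,    # best cell in the table
        ((2, None), None): 51.24,
        ((1, 5), None): 51.22,
        ((3, 5), 3): 50.01,
    }
    print(select_thresholds(dev_scores))  # ((2, None), 3)
```

Keeping the selection in one small function would also make it easy to rerun the whole grid for the coarse-tokenization experiments.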

#### 6.1.2 Test Experiments

After selecting the optimal thresholds on the BioNLP 2013 development set, I trained and evaluated the model with these thresholds for the online test evaluations. The result of the experiment using coarse tokenization on the BioNLP 2011 test set is surprisingly high: only $$0.25\%$$ lower than fine-grained tokenization. But on the BioNLP 2013 test set, fine-grained tokenization outperforms coarse tokenization significantly.

| | Coarse | Fine-Grained |
|---|---|---|
| BioNLP2011 | 55.35 | 55.6 |
| BioNLP2013 | 52.16 | 54.4 |

Created: 2014-10-29 Wed 18:05
