Summary of Reviews for the Paper Submitted to BMC Bioinformatics
This document summarizes the reviews of the paper I submitted to BMC Bioinformatics (BioNLP special issue). Please find the full reviews shared on Google Docs. I divide the reviews into five groups: 1) English rephrasing, 2) details enrichment, 3) arguable comments, 4) complement, 5) format. The English rephrasing section contains the comments that require rephrasing English descriptions. The details enrichment section contains requests to add details of experiments, references, etc. The arguable comments are points where the reviewers' opinions differ from ours. The complement section includes requests to complete our application and to generalize it to all the tasks. The format comments section focuses on conflicts between our current paper and the BMC journal format.
I add my explanation in bold at the beginning of each section, and my comments in /green italic/ style after the list items. Issues that are struck through are solved. For issues followed by an underlined sentence, I know how to solve them but have not yet taken action.
1 English Rephrasing Comments
The English rephrasing comments are the easiest to deal with. I list all of them in this section.
- In the Abstract: "Admittedly" is required to be changed.
- In the Background: "An example of event" should be "an example of an event".
- In the Data subsection: "abstracts of full articles" is supposed to be replaced by "abstracts or full articles".
- In the Post-Processing subsection: "extra-arguments" is supposed to be split into two words.
- In the Related Work subsection: "are ran" should be "are run"; "consequential" is required to be changed.
- In the Performance Analysis subsection: "rightfully" is required to be changed.
2 Details Enrichment Comments
Most of the details enrichment comments ask me to add more details about the definitions, methods and hyper-parameters, which is not difficult.
- Abstract
  - "the best results" is not true. The reviewer thinks it is fair to add the FAUST system to the comparison, according to the task requirements.
- Background
  - "huge amounts of electronic bio-medical documents" is unnecessarily imprecise.
    I do not understand what the reviewer wants. A precise number of documents?
  - "[bio-medical documents] involve domain-specific […] dependencies" is an unclear claim.
    The same as the comment above.
  - "15 times faster [than Riedel et al.]" should specify the condition (e.g. omission of feature caching).
- Data
  - "extracts of articles from PubMed Central": some are only in PubMed, not PMC.
    It is my mistake. I thought they were the same thing, inferring from their names. In fact, PubMed is a super-set of PMC: articles in PMC can be freely downloaded, while for the other articles in PubMed only the abstracts are accessible.
  - About the discussion of consistency and its effect on performance, the reviewer suggests contrasting our analysis with that presented in "Overview of the ID, EPI and REL tasks of BioNLP Shared Task 2011".
    I read the suggested article and found the statement "each entity annotation consists of a type". Note that this article was published for BioNLP 2011; it does not explain the inconsistency between BioNLP 2009 and 2011.
- Methods
  - Algorithm 1
    The reviewer suggests replacing "score and label" and "highest score" with more precise descriptions of the steps.
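One way to make that step explicit (a hypothetical sketch of how I read it, not the paper's actual pseudo-code; `score` stands in for the model's decision function):

```python
def best_label(candidate, labels, score):
    """Make "score and label ... highest score" explicit: evaluate every
    candidate label with the scoring function and keep the argmax."""
    return max(((lab, score(candidate, lab)) for lab in labels),
               key=lambda pair: pair[1])
```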
- Fitting the Pairwise Model
  - "two hyper-parameters that are set by cross-validation": more information on the selection (e.g. the values considered) is required.
  - The method used to deal with the heavy class imbalance should be added.
    We had already cited the paper describing the method we used, which is C+/C-.
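The C+/C- scheme can be illustrated as a linear SVM trained by sub-gradient descent with asymmetric per-class misclassification costs (a minimal toy sketch only; the paper's actual solver and implementation are different):

```python
def train_weighted_svm(X, y, c_pos, c_neg, lr=0.01, epochs=300):
    """Sub-gradient descent for a linear SVM with asymmetric costs:
    hinge-loss violations on positive examples are weighted by c_pos,
    on negatives by c_neg (the C+/C- scheme for class imbalance)."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            cost = c_pos if yi > 0 else c_neg
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:
                # hinge loss active: step towards the example,
                # scaled by its class-dependent cost
                w = [wj + lr * (cost * yi * xj - wj / n)
                     for wj, xj in zip(w, xi)]
                b += lr * cost * yi
            else:
                # only the regularizer contributes
                w = [wj - lr * wj / n for wj in w]
    return w, b
```

Setting c_pos larger than c_neg pushes the separator away from the rare positive class, which is the point of C+/C- under heavy imbalance.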
- Computational Considerations
  - "SVM-based classification scales well" is unnecessarily imprecise. Efficient linear SVM implementations are invariant to the number of examples at prediction time (the weight vector is explicitly represented).
    I do not understand.
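What the reviewer means, as far as I understand it: a linear SVM stores its learned weight vector explicitly, so predicting one example costs a single dot product, independent of training-set size. A minimal sketch (hypothetical names):

```python
def linear_svm_predict(w, b, x):
    """Score one example against an explicitly represented weight
    vector. Cost is O(len(w)): it does not depend on how many training
    examples produced w (a kernel SVM, by contrast, sums over its
    support vectors at prediction time)."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1
```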
- Pre-Processing & Features
  - Add examples of the two tokenizations: "inhibit NF-kappaB-dependent pro-inflammatory gene transcription" vs "inhibit NF-kappaB dependent pro-inflammatory gene transcription".
  - Add examples of each feature for clarity.
    Should we add the examples in the feature table or in a paragraph? The table is not large enough to contain examples, whereas enumerating examples for each feature in a paragraph would be very ugly.
  - Add details on how the IntAct features are generated.
  - Identify the "heuristics from the UCLEED system".
    Does this requirement mean explaining how to trim the dependency path? Reply in the response.
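To illustrate the two tokenizations, a toy sketch (the paper's actual fine-grained tokenizer is smarter: per the example above it keeps the protein name "NF-kappaB" whole and only detaches "-dependent"; this naive version simply splits every hyphen):

```python
import re

def coarse_tokenize(text):
    """Split on whitespace only: hyphenated compounds stay whole."""
    return text.split()

def fine_tokenize(text):
    """Also split on hyphens, so pieces of compounds such as
    "NF-kappaB-dependent" become tokens of their own."""
    return re.findall(r"[^\s\-]+", text)
```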
- Related Work
Mention rule-based (e.g. NCBI) and pattern-based (e.g. NICTA) methods.
- Results
  - Add a comparison with the system by Miwa et al. (2010), as it reported one of the best performances, particularly on Binding events.
    This system was run on the BioNLP 2009 data set; we do not have results on that data set.
  - Add the FAUST system to the comparison.
- Conclusion
  - Provide more detail regarding future work.
  - "the best result" is not true (FAUST is the best).
3 Arguable Comments
The arguable comments contain points that the reviewers disagree with us on. Some of these comments need support from new experiments, which could cost a certain amount of time. Considering the deadline of the camera-ready version and my defense date, I have to organize the schedule for them carefully. Besides, I have doubts about some of these comments.
- Parts of the terminology are confusing. The word "entity" can be interpreted as a term in biology and as a term in text processing. Besides, the trigger is not exactly a "named entity".
  I think it is better to replace "named entity" with "text entity" and to add the definition that a text entity is a character chain.
- Methods
  - Algorithm
    The definition of \(P_S=(c_i,a_j)\) is inconsistent because the arguments are initialized with proteins but involve predicted triggers during the process.
  - Pre-Processing
    - About the claim "high quality dependency parse trees require a fine grained tokenization", the reviewer requires us to provide evidence supporting this claim.
      It is hard to give direct evidence of the quality of a dependency parse. We can only say that dependency parses with fine-grained tokenization provide better features for this task; the quality of the dependency parse itself is not defined. To demonstrate the benefit of fine-grained tokenization, we would have to run experiments with dependency parsing based on coarse tokenization.
    - "trees are finally obtained using ….": the reviewer thought such parses are provided in the supporting data?
      The supporting data are generated using a different tokenization, hence differ from what we generated. Emphasize that we computed them ourselves.
  - Related Work
    - "rich features coding for (trigger, argument) pairs [16] are only used by pipeline models for assigning arguments": this is false at least for the TEES and EVEX systems.
      /This differs from what I read. TEES has two steps: trigger detection and edge detection. The features of pairs are not used in trigger detection./ Rephrase and add an explanation for the reviewer.
  - Results
    - Genia Shared Task 2013
      - "it is difficult to estimate standard errors properly": consider referencing the statistical significance results in [1].
        Reply in the response email.
      - Clarify whether EVEX and TEES also use the data set from BioNLP 2011.
        They did not discuss this in the papers published in the workshop. I have sent a mail asking which data set they used. However, even if they reply that they used the BioNLP 2011 data set, their mail cannot act as a reference.
- Genia Shared Task 2011
      - "TEES (no detailed results available)": the results are available on the task web page and in "The Genia Event and Protein Co-reference tasks of the BioNLP Shared Task 2011".
- Precision-recall curves with different features
  - Figure 5 in the results is easy to misinterpret. From this curve it seems that V-walk features are better left out. The authors mention that this is due to the limited size of the test set, but I would like to see a more careful analysis of how this phenomenon scales with the size of the test set. Related to this, the authors could do a more elaborate feature selection step: this is easy to do with SVM, and improved results using feature selection for event extraction have been described before.
    I would like to discuss this. It is not worth running experiments, but I have to be aware of feature selection methods (Lasso).
- Training duration
  - A more theoretical characterization of the computational complexity. (reviewer 1)
    It should be in the Methods section. I do not understand the reviewer's opinion. Explain our limitation with a soft description that acknowledges the uncertainty.
  - Scala is slower than C; the training duration may be caused by the programming language. (reviewer 1)
  - The efficiency order pipeline > RUPEE > globally joint system is rough and lacks support. There is no training duration test for the pipeline system and only one case for the joint system. (reviewer 3)
    20 minutes for the pipeline counterpart.
4 Complement Comments
The requirements in the complement comments need much time. The comments of reviewer 3 are, in my opinion, for future work.
- Make the software usable outside the BioNLP community: a web tool or API that allows biologists to use this method on text data. (reviewer 1) Display it like http://corpora.informatik.hu-berlin.de
  It would cost a large amount of time to add the protein recognition ourselves.
- Make a general framework for all the BioNLP tasks; this will break the constraints in the Genia task. (reviewer 3)
  It is not a requirement for this paper, but for further work.
5 Format Comments
Comments about the format are essential for publishing our paper.
- Figures (e.g. Figure 5) are hard to interpret when printed in gray-scale.
- BMC Bioinformatics does not allow footnotes and requires that URLs be included in references. Link to the BMC format.
6 Reports
6.1 Coarse Tokenization
I ran experiments to determine the different impacts of coarse and fine-grained tokenization on the final performance. Since we argue about the effects of dependency parses with different tokenizations, it is necessary to find the optimal thresholds on the shortest dependency paths, as I did before. Hence, the first part of these experiments selects the thresholds on the BioNLP2013 development set, and the second part tests the performance on both the BioNLP2011 and 2013 test sets using the optimal thresholds.
6.1.1 Selecting the Dependency Path Thresholds
I used the training sets from BioNLP2011 and 2013 to train the classifiers and evaluated them on the BioNLP2013 development set. Unlike the previous experiments made to select the optimal thresholds, I selected the thresholds on the RUPEE model instead of the pairwise model. Hence, the intermediate results are not comparable to those in my thesis.
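The thresholding itself is simple; a sketch of how I apply it (a hypothetical helper, not the actual RUPEE code): a candidate pair survives only if the length of its shortest dependency path falls inside the window, where None means the bound is not applied, matching the "None" rows and columns of the tables in this section.

```python
def keep_candidate(path_len, lo=None, hi=None):
    """Filter a candidate (trigger, argument) pair by the length of
    the shortest dependency path between its two end points.
    `None` disables the corresponding bound."""
    if lo is not None and path_len < lo:
        return False
    if hi is not None and path_len > hi:
        return False
    return True
```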
- Trigger-Theme Pair Detection Step
I list below the f-scores, precisions and recalls of the single-argument events and of the total trigger-theme pair predictions. I spent much time last Thursday and Friday solving conflicts in the data, so I ran all the experiments using a shell script to save time. However, I wrote a wrong command in the shell script, so I lost some results about the trigger-theme pairs. From the experimental results, I selected 2, 3 and None for the experiments of the next step (Binding argument fusion).
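The f-scores in these tables are the usual harmonic mean of precision and recall; e.g. for the first SVT column below, f1(85.31, 67.05) ≈ 75.09:

```python
def f1(precision, recall):
    """F-score: harmonic mean of precision and recall (in percent)."""
    return 2 * precision * recall / (precision + recall)
```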
| Dep length | <2 | <3 | <4 | <5 | <6 | None |
|---|---|---|---|---|---|---|
| SVT | 75.09 | 75.10 | 73.82 | 74.27 | 73.08 | 74.70 |
| prec. | 85.31 | 85.05 | 78.64 | 78.74 | 75.19 | 80.54 |
| recall | 67.05 | 67.23 | 69.85 | 70.28 | 71.08 | 69.65 |

For THEME-TOTAL, only three columns survived the faulty script: f-scores 63.45, 63.11, 63.22, with precisions 77.05, 72.25, 65.00 and recalls 53.94, 56.03, 61.54.

- Binding Argument Fusion Step
I list below the f-scores of the Binding events prediction using different thresholds. Each line gives the results of experiments using different thresholds during the Binding argument fusion step; "Prev" in the first column indicates the threshold used in the previous step. It seems that thresholding the dependency paths does not improve the performance. Hence, I only keep one configuration for the next step, where the threshold for the first step is 3 and there is no threshold for the Binding argument fusion step.
| Dep length | 3 | 4 | 5 | 6 | None |
|---|---|---|---|---|---|
| Prev = 2 | 39.94 | 41.30 | 44.18 | 44.18 | 44.11 |
| Prev = 3 | 39.89 | 42.84 | 45.93 | 45.60 | 46.00 |
| Prev = None | 39.69 | 42.42 | 45.54 | 45.30 | 45.65 |

- Regulation Cause Argument Assignment Step
In the last step, we have two thresholds to select: one for the path between the trigger and the cause argument, the other for the path between the two arguments. The table below lists all the experimental results, where each line gives the total event f-scores of experiments using different trigger-cause thresholds and each column gives the results of experiments using different theme-cause thresholds. The differences between these results are not very significant and it is hard to find a general law in them. Therefore, I simply used the configuration giving the best result (\(2\le l\le None\), \(l\le 3\)).
| Dep length | \(l<2\) | \(l<3\) | \(l<4\) | None |
|---|---|---|---|---|
| \(1\le l\le 5\) | 50.76 | 50.72 | 51.00 | 51.22 |
| \(2\le l\le 5\) | 50.86 | 50.93 | 51.02 | 51.14 |
| \(3\le l\le 5\) | 50.35 | 50.01 | 50.09 | 50.31 |
| None | 50.86 | 50.79 | 50.90 | 49.85 |
| \(1\le l\le 4\) | 50.51 | 50.73 | 50.80 | 50.87 |
| \(2\le l\le 4\) | 50.86 | 50.89 | 50.89 | 51.15 |
| \(3\le l\le 4\) | 50.02 | 50.18 | 50.11 | 50.06 |
| \(1\le l\le None\) | 50.86 | 50.79 | 50.90 | 51.17 |
| \(2\le l\le None\) | 51.04 | 51.46 | 51.39 | 51.24 |
| \(3\le l\le None\) | 50.35 | 50.43 | 50.26 | 50.42 |
6.1.2 Test Experiments
After selecting the optimal thresholds on the BioNLP2013 development set, I trained and evaluated the model with these thresholds for the online test evaluations. The result of the experiment using coarse tokenization on the BioNLP2011 test set is surprisingly high: only \(0.25\) points lower than fine-grained tokenization. But on the BioNLP2013 test set, fine-grained tokenization outperforms coarse tokenization significantly.
| Test set | Coarse | Fine-Grained |
|---|---|---|
| BioNLP2011 | 55.35 | 55.6 |
| BioNLP2013 | 52.16 | 54.4 |