Today I presented a paper (co-authored with Nizar Habash) on statistical parsing at the IWPT 2011 conference. A three-day event at Dublin City University, the conference was a great opportunity to meet some leading names in Computational Linguistics and Natural Language Parsing, and to discuss ideas for further research and future collaboration.
Due to family commitments, I was only able to attend the first day of the conference to present my talk, so I missed the rest of the three-day event. It was still a great experience, however.
The paper I presented, One-Step Statistical Parsing of Hybrid Dependency-Constituency Syntactic Representations, was well received, judging by the feedback and response I got after the talk. I managed to get across the key points of the research in the presentation: the linguistic context for why the Quranic Treebank uses a hybrid syntactic representation, the rich morphological features annotated in the treebank, and the challenges these give rise to for statistical parsing. I mentioned that although there could be many ways to solve the hybrid parsing problem, we focused on transition-based shift-reduce parsing, as opposed to a graph-based parsing algorithm – in other words, more like MaltParser than MSTParser.
At the end of the talk, I had time to answer a few questions.
In the first question, Joakim Nivre wanted some further clarification on exactly what the input to the parser was. Although the presentation described the input as gold-standard morphologically tagged text with segmentation, I did not make clear during the talk whether empty categories were assumed in the input, or whether these were generated by the parser. This was a fair point. The paper covers this in more detail – the parser handles elision directly, and empty categories are not assumed in the input. We take only the original source text, segmented and annotated with morphological features.
The second question, by Feiyu Xu, related to how the parser produced phrase structure – in particular, how it was possible to produce complete subgraphs under a phrase or clause. The assumption, I explained, was that at a certain point in its operation, the parser would learn to recognize the head of a subgraph at the top of the stack that should be raised to a phrase. Of course, not all phrases could be formed this way, but given the strong accuracy of the parser for hybrid phrase structure reported in the paper, this would appear to be a reasonable assumption.
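To make that mechanism concrete, here is a minimal sketch of the idea – a shift-reduce transition system whose action set is extended with a hypothetical PHRASE action that raises the subtree headed by the stack top into a labelled phrase node. This is an illustration of the general technique only, not the paper's implementation; the Node class, the action names, and the scripted action sequence are all my own assumptions for the example.

```python
# Illustrative sketch only: a shift-reduce parser with an extra
# PHRASE action for hybrid dependency-constituency output.
# (Hypothetical code, not the implementation described in the paper.)

class Node:
    """A tree node: a word (leaf) or a phrase label with children."""
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

def parse(tokens, actions):
    """Replay a scripted action sequence. A real transition-based
    parser would instead predict each action with a classifier."""
    stack, buffer = [], [Node(t) for t in tokens]
    for act in actions:
        if act == "SHIFT":
            # move the next input word onto the stack
            stack.append(buffer.pop(0))
        elif act == "LEFT-ARC":
            # stack[-2] becomes a dependent of stack[-1]
            head, dep = stack.pop(), stack.pop()
            head.children.insert(0, dep)
            stack.append(head)
        elif act == "RIGHT-ARC":
            # stack[-1] becomes a dependent of stack[-2]
            dep = stack.pop()
            stack[-1].children.append(dep)
        elif act.startswith("PHRASE-"):
            # hypothetical hybrid action: wrap the finished subtree
            # on top of the stack in a labelled phrase node
            label = act.split("-", 1)[1]
            stack.append(Node(label, [stack.pop()]))
    return stack

def show(node):
    """Render a node as a bracketed string."""
    if not node.children:
        return node.label
    return "(%s %s)" % (node.label, " ".join(show(c) for c in node.children))
```

Replaying SHIFT, SHIFT, LEFT-ARC, PHRASE-NP over the toy input ["the", "cat"] leaves a single NP node on the stack whose dependency subtree has "cat" as head and "the" as its dependent.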
In the last question, Mark Steedman wanted to know more about the traditional Arabic grammar used as the linguistic framework for annotating the Quranic Arabic Treebank. In particular, the question was under what conditions the grammar would treat a chunk as a phrase and give it a phrase label, as opposed to using only dependency structure. My answer was that, as far as I could tell from working through numerous examples in the grammatical gold-standard reference texts, phrases appear to be made explicit in the grammar when a chunk can stand alone, independent of the rest of the sentence, such as an embedded sentence or subordinate clause.
Opportunities for Future Work and Collaboration
I met a lot of interesting and smart people at the conference, too many to list all here by name. Overall, I received two pieces of common feedback when discussing my parsing research. The first was that the hybrid representation was interesting and appealing as a research idea given that not much work has been done in this area, and that there is definitely merit in combining the best of both representations into a single treebank. Secondly, a lot of the feedback I received centred on the next logical step in the research, which would be to integrate morphological analysis into the parser. This would allow the parser to run against raw text instead of using gold-standard morphological input. Different people had different ideas about how this could be done, but nearly everyone agreed it was an important next step.
I also learnt that although some recent initial work has been done on integrating POS-tagging and transition-based dependency parsing for Chinese, there does not appear to be any work on joint morphological analysis for transition-based dependency parsing in any language. Kenji Sagae confirmed my own hunch that a fully integrated transition-based approach would need some form of non-deterministic parsing, in order to explore the joint disambiguation search space. He pointed me to his 2010 ACL paper on introducing dynamic programming into shift-reduce parsing. He suggested that I might want to get in touch with Takuya Matsuzaki (also at IWPT 2011), whose 2011 IJCNLP paper uses the same algorithm as Kenji's to perform joint POS-tagging and syntactic dependency parsing for Chinese. Interestingly, Nizar had pointed out a related 2011 EMNLP paper to me back in July, also on joint tagging for Chinese, but with a focus on graph algorithms instead of transition parsing – another good paper.
I later met with Khalil Sima'an who, it turns out, speaks Arabic as well as Hebrew. Interestingly, Khalil was Reut Tsarfaty's co-supervisor during her PhD thesis on joint morphological and syntactic analysis for Hebrew. Khalil also knows Eric Atwell, my PhD supervisor at the University of Leeds. He advised that research into joint morphological and syntactic analysis for Arabic was definitely needed.
Finally, I ended the day with a follow-up discussion with Joakim Nivre after the main conference talks had ended. Joakim was open to the idea of collaborating on future research, especially if this involved doing further work on transition-based parsing. Some ideas could include revisiting his work on hybrid parsing for Swedish and German. He seemed impressed with my presentation and the paper, and especially liked the strong empirical results – achieving around 90% accuracy (near state-of-the-art) for dependency parsing. Confirming the other feedback I had received today, he thought that joint morphological and syntactic analysis would be the way to go for further research into parsing Classical Arabic. He also liked the way in which the basic MaltParser algorithm had been extended using additional parser actions to handle hybrid parsing – apparently something he had wanted to do himself for some time.
We also talked briefly about different possible ways to add non-determinism to the parser, as a step towards joint morphological disambiguation. Dynamic programming could be one way, but Joakim suggested that even experimenting with vanilla beam search would be a good first step.
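As a concrete illustration of that first step, here is a minimal sketch of vanilla beam search over parser transitions: instead of committing greedily to the single best action, the parser keeps the k highest-scoring partial analyses at each step. The step and is_final functions, the toy scoring, and the beam size are all assumptions for the example; a real parser would score each transition from features of the current parser configuration.

```python
# Illustrative sketch only: plain ("vanilla") beam search over a
# generic transition system. States and scoring are stand-ins.
import heapq

def beam_search(initial_state, step, is_final, beam_size=4):
    """Keep the beam_size highest-scoring partial analyses at each
    step. `step(state)` yields (action_score, next_state) pairs;
    `is_final(state)` says when an analysis is complete."""
    beam = [(0.0, initial_state)]
    while not all(is_final(s) for _, s in beam):
        candidates = []
        for score, state in beam:
            if is_final(state):
                # completed analyses stay in the beam unchanged
                candidates.append((score, state))
                continue
            for action_score, next_state in step(state):
                candidates.append((score + action_score, next_state))
        # prune back down to the top beam_size candidates
        beam = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    # return the highest-scoring complete analysis
    return max(beam, key=lambda c: c[0])
```

As a toy usage, let each state be the tuple of actions taken so far, let every step offer action "a" with score 1.0 or "b" with score 2.0, and stop after three actions: the search then returns the all-"b" sequence with score 6.0.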
All in all, a great day. I met some very intelligent people, experts in their respective fields, and also got to listen to some very interesting and relevant talks. A shame I could only stay for the first day instead of the full three-day conference, but I do have a lot else going on right now with work and family. I would definitely like to pursue some of the ideas discussed today for further collaboration. Hybrid parsing was of interest to the conference delegates given that it is a bit different and not often studied. I also heard from nearly everyone I spoke to that joint morphological disambiguation and syntactic parsing would be a very interesting next step. From what I could tell, the state-of-the-art in this particular research area for transition-based parsing was to include only POS tagging as a joint task. Joakim suggested that including morphological analysis directly into a transition-based parser would be new research, but something that other researchers might soon be looking at as well.