Collocation-based tagging and bracketing programs have attained promising results. Yet they have not reached the stage where they could be used as pre-processors for full-fledged parsing: accuracy is still not high enough.

To improve accuracy, it is necessary to investigate the points where statistical data is being misinterpreted, leading to incorrect results.

In this paper we investigate the inaccuracy that is introduced when a pre-processor relies solely on collocations and blurs the distinction between two separate relations: thematic relations and sentential relations.

Thematic relations are word pairs, not necessarily adjacent (e.g., adjourn a meeting), that encode information at the concept level. Sentential relations, on the other hand, concern adjacent word pairs that form a noun group. For example, preferred stock is a noun group that must be identified as such at the syntactic level.

Blurring the difference between these two phenomena contributes to errors in the tagging of pairs such as expressed concerns, a verb-noun construct, as opposed to preferred stocks, an adjective-noun construct. Although both relations are manifested in the corpus as high mutual-information collocations, they possess different properties and need to be separated.

In our method, we distinguish between these two cases by asking additional questions of the corpus. By definition, thematic relations take on further variations in the corpus. Expressed concerns (a thematic relation) also appears as concerns expressed, expressing concerns, express his concerns, etc. On the other hand, preferred stock (a sentential relation) does not take any such syntactic variations.

We show how this method impacts preprocessing and parsing, and we provide empirical results based on the analysis of an 80-million word corpus.
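The variation test described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (crude_stem, find_variations, classify_pair), the suffix-stripping rule, the three-word window, the single-variation threshold, and the toy corpus are all assumptions made for illustration. It also skips the mutual-information step that would select candidate collocations in the first place.

```python
def crude_stem(word):
    """Hypothetical suffix stripper; a real system would use a proper morphological analyzer."""
    w = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Keep words ending in "ss" (e.g. "express") intact and avoid over-short stems.
        if w.endswith(suffix) and not w.endswith("ss") and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def find_variations(corpus, first, second, window=3):
    """Collect surface forms in which the two words co-occur within `window`
    tokens of each other, excluding the original adjacent pair itself."""
    s1, s2 = crude_stem(first), crude_stem(second)
    original = (first.lower(), second.lower())
    variations = set()
    for sentence in corpus:
        stems = [crude_stem(tok) for tok in sentence]
        for i, stem in enumerate(stems):
            if stem not in (s1, s2):
                continue
            other = s2 if stem == s1 else s1
            for j in range(i + 1, min(i + 1 + window, len(stems))):
                if stems[j] != other:
                    continue
                surface = tuple(tok.lower() for tok in sentence[i : j + 1])
                if surface != original:
                    variations.add(surface)
    return variations


def classify_pair(corpus, first, second, min_variations=1):
    """Label a collocation 'thematic' if the corpus shows variations of it,
    otherwise 'sentential' (assumed threshold: one variation)."""
    variants = find_variations(corpus, first, second)
    return "thematic" if len(variants) >= min_variations else "sentential"


# Toy corpus invented for this example (tokenized sentences).
toy_corpus = [
    "analysts expressed concerns about the merger".split(),
    "the concerns expressed by analysts were ignored".split(),
    "he kept expressing concerns".split(),
    "she wanted to express his concerns".split(),
    "the company issued preferred stock".split(),
    "preferred stock pays a fixed dividend".split(),
]

if __name__ == "__main__":
    for pair in [("expressed", "concerns"), ("preferred", "stock")]:
        print(pair, classify_pair(toy_corpus, *pair))
```

On this toy corpus, expressed concerns is found in reversed, inflected, and interrupted forms, while preferred stock occurs only as the fixed adjacent pair. A real corpus would likely call for a higher threshold and a less naive stemmer.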