Item descriptions on an online e-commerce site such as eBay combine item-specific information with generic information such as shipping and return policies, requests for feedback, and contact information. Extracting these textual segments from the item descriptions is non-trivial because the descriptions contain HTML markup, advertisements, templates, and navigational elements. Since sellers have considerable editorial freedom in how they describe their items, many descriptions lack homogeneity and compactness. Very often, the relevant information has to be extracted from incomplete, ill-formed discourse units, which adds to the challenge of finding coherent segments. In this paper, we describe an approach that identifies item-specific text segments in eBay descriptions. The approach uses a bootstrapping technique to learn high-quality semantic lexicons for item-agnostic text segments. We first extract useful text by removing HTML markup with a boilerplate-removal technique that preserves markup information and captures visual segmentation. Each segment is then processed to extract discourse units that play the same role as sentences. A clustering technique then identifies thematic breaks to extract coherent segments. We evaluate our approach on a diverse set of descriptions and show that it outperforms a commonly used approach that relies only on the title keywords.
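To make the first stages of this pipeline concrete, the following is a minimal sketch, not the authors' implementation. It assumes that block-level HTML tags approximate visual segmentation, and it replaces the bootstrapped semantic lexicon with a small hand-written seed list and a simple word-ratio rule. The names `BLOCK_TAGS`, `GENERIC_SEEDS`, `visual_segments`, `is_generic`, and the `threshold` value are all hypothetical choices made for illustration; the clustering step for thematic breaks is not shown.

```python
from html.parser import HTMLParser

# Assumption: these tags mark visual segment boundaries.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "table", "h1", "h2", "h3", "hr"}

# Assumption: a tiny seed lexicon for item-agnostic (generic) text.
# The paper learns such lexicons by bootstrapping; this list is hand-written.
GENERIC_SEEDS = {"shipping", "return", "returns", "refund",
                 "feedback", "contact", "payment"}


class SegmentExtractor(HTMLParser):
    """Collect text and start a new segment at every block-level tag."""

    def __init__(self):
        super().__init__()
        self.segments = []
        self._buffer = []

    def _flush(self):
        # Join buffered text and normalize whitespace.
        text = " ".join(" ".join(self._buffer).split())
        if text:
            self.segments.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()


def visual_segments(html):
    """Strip markup but keep block boundaries as segment breaks."""
    parser = SegmentExtractor()
    parser.feed(html)
    parser.close()
    return parser.segments


def is_generic(segment, lexicon=GENERIC_SEEDS, threshold=0.2):
    """Flag a segment when enough of its words appear in the lexicon."""
    words = [w.strip(".,:;!?()").lower() for w in segment.split()]
    words = [w for w in words if w]
    if not words:
        return False
    hits = sum(1 for w in words if w in lexicon)
    return hits / len(words) >= threshold


def item_specific_segments(html):
    """Keep only the segments that do not look generic."""
    return [s for s in visual_segments(html) if not is_generic(s)]


if __name__ == "__main__":
    sample = ("<div><p>Vintage Nikon F3 camera body, light wear.</p>"
              "<p>Shipping and return policy: refund within 14 days.</p></div>")
    for segment in item_specific_segments(sample):
        print(segment)
```

A fixed seed list like this one would miss seller-specific phrasing, which is the gap the bootstrapped lexicons described above are meant to close.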