Segmenting eBay item descriptions into coherent sections

  • Authors:
  • Smruthi Mukund;Nitin Indurkhya;Neel Sundaresan

  • Affiliations:
  • CEDAR, SUNY, Buffalo, Amherst, NY;eBay Research Labs, San Jose, CA;eBay Research Labs, San Jose, CA

  • Venue:
  • Proceedings of the 2011 Joint Workshop on Multilingual OCR and Analytics for Noisy Unstructured Text Data
  • Year:
  • 2011

Abstract

Item descriptions on an online e-commerce site such as eBay combine item-specific information with generic information such as shipping and return policies, requests for feedback, and contact information. Extracting these textual segments from the item descriptions is non-trivial because the descriptions contain HTML markup, advertisements, templates, and navigational elements. Since sellers have considerable editorial freedom in how they describe their items, many descriptions lack homogeneity and compactness. Very often, the relevant information has to be extracted from incomplete, ill-formed discourse units, which adds to the challenge of finding coherent segments. In this paper we describe an approach that identifies item-specific text segments in eBay descriptions. The approach uses a bootstrapping technique to learn high-quality semantic lexicons for item-agnostic text segments. We first extract useful text by removing HTML markup with a boilerplate-removal technique that preserves markup information and captures visual segmentation. Each segment is then processed to extract discourse units that play the same role as sentences. A clustering technique then identifies thematic breaks to extract coherent segments. We evaluate our approach on a diverse set of descriptions and show that it outperforms a commonly used approach that relies only on title keywords.
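
The pipeline outlined in the abstract (markup-aware text extraction, discourse units, then thematic segmentation) might be pictured with the minimal Python sketch below. It is not the authors' implementation: the hand-written `SEED_LEXICON` is a hypothetical stand-in for the bootstrapped semantic lexicons, grouping adjacent units that share a label is a simple substitute for the clustering step, and the names `BlockExtractor`, `discourse_units`, `label_unit`, and `segment` are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser
from itertools import groupby

# Hypothetical seed lexicon for item-agnostic themes. The paper learns its
# lexicons by bootstrapping, which this sketch does not attempt.
SEED_LEXICON = {
    "shipping": {"shipping", "ship", "delivery", "postage"},
    "returns": {"return", "returns", "refund", "exchange"},
    "feedback": {"feedback", "rating"},
    "contact": {"contact", "email", "questions"},
}

# Tags assumed to mark a visual break between text blocks.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "td", "table", "h1", "h2", "h3", "hr"}


class BlockExtractor(HTMLParser):
    """Collect text, treating block-level tags as visual segment boundaries."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = []

    def _flush(self):
        # Join buffered text, collapse whitespace, and store non-empty blocks.
        text = " ".join(" ".join(self._buffer).split())
        if text:
            self.blocks.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()


def discourse_units(block):
    """Split a block into sentence-like units after ., !, ? or ;."""
    parts = re.split(r"(?<=[.!?;])\s+", block)
    return [p.strip() for p in parts if p.strip()]


def label_unit(unit):
    """Return the best-matching item-agnostic theme, or 'item' if none match."""
    tokens = set(re.findall(r"[a-z]+", unit.lower()))
    scores = {theme: len(tokens & words) for theme, words in SEED_LEXICON.items()}
    theme, best = max(scores.items(), key=lambda kv: kv[1])
    return theme if best > 0 else "item"


def segment(html):
    """Group consecutive units with the same label into one segment."""
    parser = BlockExtractor()
    parser.feed(html)
    parser.close()
    units = [u for block in parser.blocks for u in discourse_units(block)]
    labeled = [(label_unit(u), u) for u in units]
    return [(label, [u for _, u in group])
            for label, group in groupby(labeled, key=lambda pair: pair[0])]
```

Calling `segment` on a description's HTML returns a list of `(label, units)` pairs, where a change of label between neighbouring units plays the role of a thematic break. A real system would also need to skip script and style content and to cope with the ill-formed discourse units the abstract mentions, neither of which this sketch handles.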