Unsupervised user-generated content extraction by dependency relationships

Authors:
Jingwei Zhang;Yuming Lin;Xueqing Gong;Weining Qian;Aoying Zhou
Affiliations:
Institute of Massive Computing, Software Engineering Institute, East China Normal University, Shanghai, China (all authors)
Venue:
WISE'11 Proceedings of the 12th international conference on Web information system engineering
Year:
2011

Abstract

User-generated content is valuable for tasks such as event detection and opinion mining, but it is difficult to extract because Web 2.0 pages give users considerable freedom in how they present their content. Compared with machine-generated content, user-generated content is highly personalized: it often takes on complex styles, combines various kinds of information, and embeds much noise. Deep user participation has substantially changed the data acquisition environment and breaks the hidden assumption of traditional extraction methods, namely that Web pages are relatively regular. As a result, traditional extraction methods do not adapt well to complex user-generated content. In this paper, we treat user-generated content as unstable content and propose an unsupervised approach to extract high-quality user-generated content free of noise. The stable information in machine-generated content, which traditional extraction methods often ignore, is first picked out by a two-stage filtering operation consisting of page-level filtering and template-level filtering. A path accompanying distance is then defined to compute the dependency relationships between unstable and stable information, and these relationships guide the location of user-generated content. Our approach fully considers structure, content, and the dependency information between stable and unstable content to ensure the accuracy of user data extraction. The whole process requires no manual intervention. Experimental results show that the approach performs well and is robust.
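
The stable/unstable distinction can be illustrated with a small sketch. This is not the paper's algorithm: the abstract does not define the two filtering stages or the path accompanying distance, so everything below is an assumption made for illustration. Pages are modelled as lists of hypothetical (path, text) pairs in document order, a path counts as stable when it appears on every page with identical text, and the "nearest preceding stable path" is only a crude stand-in for a dependency relationship.

```python
from collections import defaultdict


def split_stable_unstable(pages):
    """Split node paths into stable and unstable sets.

    Assumption (not from the paper): a path is "stable" when it occurs on
    every sampled page and always carries the same text.
    """
    texts = defaultdict(set)    # path -> distinct texts seen at that path
    seen_on = defaultdict(set)  # path -> indexes of pages containing it
    for index, page in enumerate(pages):
        for path, text in page:
            texts[path].add(text)
            seen_on[path].add(index)
    stable = {
        path for path in texts
        if len(seen_on[path]) == len(pages) and len(texts[path]) == 1
    }
    unstable = set(texts) - stable
    return stable, unstable


def nearest_stable_anchor(page, stable):
    """Map each unstable path on one page to the closest preceding stable path.

    This is a hypothetical proxy for a dependency between unstable and
    stable content; it is not the path accompanying distance.
    """
    anchors = {}
    last_stable = None
    for path, _text in page:
        if path in stable:
            last_stable = path
        else:
            anchors[path] = last_stable
    return anchors


# Two toy forum pages sharing a template; only the <p> text is user-written.
pages = [
    [("/html/body/div/span", "Posted by"),
     ("/html/body/div/p", "Great phone, poor battery."),
     ("/html/body/div/a", "Reply")],
    [("/html/body/div/span", "Posted by"),
     ("/html/body/div/p", "Arrived late but works."),
     ("/html/body/div/a", "Reply")],
]

stable, unstable = split_stable_unstable(pages)
anchors = nearest_stable_anchor(pages[0], stable)
```

In this toy input the template labels end up in `stable`, the paragraph path ends up in `unstable`, and `anchors` ties the paragraph to the label that precedes it, which is the kind of stable-to-unstable link the abstract describes using to locate user content.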