Towards a global schema for web entities

Authors:
Conglei Yao;Yongjian Yu;Sicong Shou;Xiaoming Li
Affiliations:
Peking University, Beijing, China;Peking University, Beijing, China;Peking University, Beijing, China;Peking University, Beijing, China
Venue:
Proceedings of the 17th international conference on World Wide Web
Year:
2008

Citing 21
Cited 0

Extracting schema from semistructured data

SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
Fast and effective text mining using linear-time document clustering

KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining
Reconciling schemas of disparate data sources: a machine-learning approach

SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
Database intergration using neural networks: implementation and experiences

Knowledge and Information Systems
Generic Schema Matching with Cupid

Proceedings of the 27th International Conference on Very Large Data Bases
Jedi: Extracting and Synthesizing Information from the Web

COOPIS '98 Proceedings of the 3rd IFCIS International Conference on Cooperative Information Systems
Extracting Logical Schema from the Web

Applied Intelligence
On schema matching with opaque column names and data values

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Statistical schema matching across web query interfaces

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Mining data records in Web pages

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Web-scale information extraction in knowitall: (preliminary results)

Proceedings of the 13th international conference on World Wide Web
iMAP: discovering complex semantic matches between database schemas

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Discovering complex matchings across web query interfaces: a correlation mining approach

Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
A holistic paradigm for large scale schema matching

ACM SIGMOD Record
Corpus-Based Schema Matching

ICDE '05 Proceedings of the 21st International Conference on Data Engineering
Web data extraction based on partial tree alignment

WWW '05 Proceedings of the 14th international conference on World Wide Web
A comparison of algorithms for maximum entropy parameter estimation

COLING-02 proceedings of the 6th conference on Natural language learning - Volume 20
Extracting Objects from the Web

ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
Simultaneous record detection and attribute labeling in web data extraction

Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
COMA: a system for flexible combination of schema matching approaches

VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
Instance-based schema matching for web databases by domain-specific query probing

VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30

Quantified Score

Hi-index	0.00

Visualization

Abstract

Popular entities often have thousands of instances on the Web. In this paper, we focus on the case where they are presented in table-like format, namely appearing with their attribute names. It is observed that, on one hand, for the same entity, different web pages often incorporate different attributes; on the other, for the same attribute, different web pages often use different attribute names (labels). Therefore, it is imaginably difficult to produce a global attribute schema for all the web entities of a given entity type based on their web instances, although the global attribute schema is usually highly desired in web entity instances integration and web object extraction. To this end, we propose a novel framework of automatically learning a global attribute schema for all web entities of one specific entity type. Under this framework, an iterative instances extraction procedure is first employed to extract sufficient web entity instances to discover enough attribute labels. Next, based on the labels, entity instances, and related web pages, a maximum entropy-based schema discovery approach is adopted to learn the global attribute schema for the target entity type. Experimental results on the Chinese Web achieve weighted average Fscores of 0.7122 and 0.7123 on two global attribute schemas for person-type and movie-type web entities, respectively. These results show that our framework is general, efficient and effective.