Departments of organizations such as companies and universities tend to publish information on their own Web sites. For example, descriptions of the members of a university laboratory may appear on the laboratory's Web site, the department's Web site, and so on. However, inconsistencies may arise between the descriptions on these sites if they are updated at different times or managed under different policies. Such inconsistencies are hard to find on large-scale Web sites, and the maintenance cost of doing so is high. Record linkage techniques have been developed to determine whether two entities represented as relational records are approximately the same. Current methods focus on simple objects, which are represented by individual records. But objects often consist of several simple objects; that is, they are compound objects. For example, a research team object may contain several researcher objects. In this case, the research team is a compound object and the individual researchers are simple objects. Current record-level linkage methods cannot link such compound objects correctly when a record of one compound object does not match its counterpart in the other. We propose novel methods of linking compound objects to support the maintenance of large-scale Web sites. We first extract relational records of Web objects by exploiting the structure of the Web pages they appear on and the linguistic features of their descriptions. Then, after record-level linkage, we find linkable compound objects by examining features of the compound objects, namely record continuity, common attribute values, and co-occurrences. Experimental results show that our method can detect compound objects that cannot be detected by record-level linkage alone.
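The idea of linking compound objects on top of record-level linkage can be illustrated with a minimal sketch. It is not the method proposed in the paper: the function names, the string-similarity measure (`difflib.SequenceMatcher`), the thresholds, and the sample records are all assumptions made for illustration, and the score below uses only a simple co-occurrence ratio as a stand-in for the three features named above (record continuity, common attribute values, and co-occurrences).

```python
from difflib import SequenceMatcher


def record_similarity(a: dict, b: dict) -> float:
    """Mean string similarity over the attributes both records share."""
    shared = sorted(set(a) & set(b))
    if not shared:
        return 0.0
    scores = [
        SequenceMatcher(None, str(a[k]).lower(), str(b[k]).lower()).ratio()
        for k in shared
    ]
    return sum(scores) / len(scores)


def linked_pairs(left: list, right: list, threshold: float = 0.85) -> list:
    """Record-level linkage: greedily pair each left record with its
    most similar unused right record, if the similarity reaches the threshold."""
    pairs = []
    used = set()
    for i, rec in enumerate(left):
        best_j, best_score = None, threshold
        for j, other in enumerate(right):
            if j in used:
                continue
            score = record_similarity(rec, other)
            if score >= best_score:
                best_j, best_score = j, score
        if best_j is not None:
            used.add(best_j)
            pairs.append((i, best_j))
    return pairs


def compound_score(left: list, right: list, threshold: float = 0.85) -> float:
    """Fraction of member records that are linked across two compound objects
    (an assumed, simplified co-occurrence measure)."""
    if not left or not right:
        return 0.0
    return len(linked_pairs(left, right, threshold)) / max(len(left), len(right))


# Hypothetical data: the same research team described on two Web sites.
lab_page = [
    {"name": "Alice Tanaka", "role": "Professor"},
    {"name": "Bob Sato", "role": "Lecturer"},
    {"name": "Carol Ito", "role": "PhD Student"},
]
dept_page = [
    {"name": "Alice Tanaka", "role": "Professor"},
    {"name": "Bob Sato", "role": "Lecturer"},
    {"name": "Daisuke Mori", "role": "Postdoc"},  # member list out of sync
]

pairs = linked_pairs(lab_page, dept_page)
score = compound_score(lab_page, dept_page)
same_team = score >= 0.6  # assumed compound-level threshold
```

In this sketch the third records do not match, so record-level linkage alone leaves them unlinked. Because two of the three members co-occur, the compound-level score still suggests that both pages describe the same team, which in turn flags the unmatched records as a possible inconsistency to review.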