New indices for text: PAT Trees and PAT arrays
Information retrieval
IEPAD: information extraction based on pattern discovery
Proceedings of the 10th international conference on World Wide Web
A brief survey of web data extraction tools
ACM SIGMOD Record
Hierarchical Wrapper Induction for Semistructured Information Sources
Autonomous Agents and Multi-Agent Systems
Mining the Web: Discovering Knowledge from HyperText Data
Mining the Web: Discovering Knowledge from HyperText Data
Visual Web Information Extraction with Lixto
Proceedings of the 27th International Conference on Very Large Data Bases
RoadRunner: Towards Automatic Data Extraction from Large Web Sites
Proceedings of the 27th International Conference on Very Large Data Bases
Semi-Automatic Wrapper Generation for Commercial Web Sources
Proceedings of the IFIP TC8 / WG8.1 Working Conference on Engineering Information Systems in the Internet Context
Extracting structured data from Web pages
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Structured Data Extraction from the Web Based on Partial Tree Alignment
IEEE Transactions on Knowledge and Data Engineering
Automatically maintaining wrappers for semi-structured web sources
Data & Knowledge Engineering
Extracting web data using instance-based learning
WISE'05 Proceedings of the 6th international conference on Web Information Systems Engineering
Finding and Extracting Data Records from Web Pages
Journal of Signal Processing Systems
Hi-index | 0.00 |
Many HTML pages are generated by software programs by querying some underlying databases and then filling in a template with the data. In these situations the metainformation about the data structure is lost, so automated software programs cannot process these data in such powerful manners as information from databases. We propose a set of novel techniques for detecting structured records in a web page and extracting the data values that constitute them. Our method needs only an input page. It starts by identifying the data region of interest in the page. Then it is partitioned into records by using a clustering method that groups similar subtrees in the DOM tree of the page. Finally, the attributes of the data records are extracted by using a method based on multiple string alignment. We have tested our techniques with a high number of real web sources, obtaining high precision and recall values.