Introduction to the special issue on the web as corpus
Computational Linguistics - Special issue on web as corpus
Developing feeds with RSS and Atom
Constructing a large scale text corpus based on the grid and trustworthiness
TSD'07 Proceedings of the 10th international conference on Text, speech and dialogue
This paper presents a new approach and a software tool for collecting specialized corpora from the Web. The approach takes advantage of a very popular XML-based standard used on the Web for sharing content among websites: RSS (Really Simple Syndication). After a brief introduction to RSS, we explain why this type of data source is of interest for corpus development. Finally, we present Corporator, an open-source program designed for collecting corpora from RSS feeds.
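
To make the general idea concrete, the following is a minimal sketch of how text might be harvested from an RSS 2.0 feed. It is not Corporator's implementation, which is not described here. The sample feed, the example.org URLs, and the function names `extract_items` and `add_to_corpus` are all hypothetical, and keying the corpus on each item's link to avoid duplicates is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 document, inlined so the sketch runs without network access.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example science news</title>
    <link>http://example.org/science</link>
    <description>Hypothetical feed used only for illustration</description>
    <item>
      <title>First headline</title>
      <link>http://example.org/science/1</link>
      <description>Short summary of the first article.</description>
    </item>
    <item>
      <title>Second headline</title>
      <link>http://example.org/science/2</link>
      <description>Short summary of the second article.</description>
    </item>
  </channel>
</rss>"""


def extract_items(feed_xml):
    """Parse an RSS 2.0 string and return one record per <item>."""
    root = ET.fromstring(feed_xml)
    # The channel title can serve as a rough topic label for the records.
    channel_title = root.findtext("channel/title", default="")
    records = []
    for item in root.iter("item"):
        records.append({
            "channel": channel_title,
            "title": item.findtext("title", default="").strip(),
            "link": item.findtext("link", default="").strip(),
            "text": item.findtext("description", default="").strip(),
        })
    return records


def add_to_corpus(corpus, records):
    """Store records keyed by link (an assumed de-duplication strategy).

    Returns the number of records that were not already in the corpus.
    """
    added = 0
    for record in records:
        if record["link"] not in corpus:
            corpus[record["link"]] = record
            added += 1
    return added


corpus = {}
new_count = add_to_corpus(corpus, extract_items(SAMPLE_FEED))
print(new_count, "new items stored")
```

Because feeds are updated over time, a collector of this kind would presumably poll the same feed repeatedly; the link-based key means a second pass over an unchanged feed adds nothing. In practice the `description` element often holds only a summary, so retrieving the full article behind each link would be a separate step not shown here.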