Towards the automation of address identification

Authors:
Fernanda Morillo;Javier Aparicio;Borja González-Albo;Luz Moreno
Affiliations:
Instituto de Estudios Documentales sobre Ciencia y Tecnología (IEDCYT), Centro de Ciencias Humanas y Sociales (CCHS), Spanish National Research Council (CSIC), Madrid, Spain 28037 (all authors)
Venue:
Scientometrics
Year:
2013

Abstract

A new semi-automatic method is presented to standardize or codify addresses in order to produce bibliometric indicators from bibliographic databases. The hypothesis is that this new method normalizes authors' addresses reliably and that its results are easy and quick to obtain. To test the method, a set of already hand-coded data is used to verify its reliability: 136,821 Spanish documents (2006-2008) previously downloaded from the Web of Science database. Unique addresses from this set were selected to produce a list of keywords representing various institutional sectors. Once the list of terms is obtained, addresses are standardized with this information and the result is compared with the previous hand-coded data. Several tests are performed to analyze the possible association between the two systems (automatic and hand-coding), calculating recall and precision as well as some statistical directional and symmetric measures. The outcome shows a good relation between both methods. Although these results are quite general, this overview of institutional sectors is a good starting point for developing a second approach for the selection of particular centers. The system is novel in that it does not depend on previously existing master lists or tables, and it has a certain impact on the automation of tasks. The validity of the hypothesis has been proved, taking into account not only the statistical measures but also the fact that obtaining general and detailed scientific output is less time-consuming, and will become even less so as the resulting master tables are fed back and reused for the same kind of data. The same method could be used with any country and/or database by creating a new master list that takes into account its specific characteristics.
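The workflow summarized in the abstract (build a keyword list per institutional sector, assign each address to a sector, then compare against hand-coded labels using precision and recall) can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration: the keywords, the sector labels, the sample addresses, the first-match rule, and the function names `assign_sector` and `precision_recall` are hypothetical and are not taken from the study's master list or procedure.

```python
# Hypothetical keyword list: terms and sector labels are illustrative
# assumptions, not the master list built in the study.
SECTOR_KEYWORDS = {
    "University": ["univ"],
    "Hospital": ["hosp"],
    "CSIC": ["csic"],
}


def assign_sector(address, keywords=SECTOR_KEYWORDS, default="Unassigned"):
    """Return the first sector whose keyword occurs in the address."""
    text = address.lower()
    for sector, terms in keywords.items():
        if any(term in text for term in terms):
            return sector
    return default


def precision_recall(predicted, reference, sector):
    """Precision and recall of the automatic labels for one sector."""
    pairs = list(zip(predicted, reference))
    tp = sum(1 for p, r in pairs if p == sector and r == sector)
    fp = sum(1 for p, r in pairs if p == sector and r != sector)
    fn = sum(1 for p, r in pairs if p != sector and r == sector)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up sample addresses with made-up hand-coded sectors.
addresses = [
    "Univ Complutense Madrid, Fac Med, Madrid, Spain",
    "Hosp Clin Barcelona, Barcelona, Spain",
    "CSIC, Inst Quim Fis Rocasolano, Madrid, Spain",
    "Hosp Univ La Paz, Madrid, Spain",
]
hand_coded = ["University", "Hospital", "CSIC", "Hospital"]

predicted = [assign_sector(a) for a in addresses]

for sector in SECTOR_KEYWORDS:
    p, r = precision_recall(predicted, hand_coded, sector)
    print(f"{sector}: precision={p:.2f} recall={r:.2f}")
```

In this toy data the last address contains both "hosp" and "univ", so the first-match rule labels it University while the hand coding says Hospital. That single disagreement lowers precision for University and recall for Hospital, which is the kind of discrepancy the comparison between automatic and hand-coded results is meant to expose.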