Tokenization
The process of nominal data linkage depends on the matching of names.
The spelling of names can vary widely, especially in historical periods
in which spelling was not standardized. Because computer keyboards are
not uniformly standardized across different languages, computer data
entry may even exacerbate this problem. Various schemes have been used
to standardize the spelling of names (e.g., Soundex). Our own scheme
depended on our knowledge of the languages or orthographies used locally
in Slavonia (Croatian, Latin, German, Hungarian). Names were also
truncated to a maximum of 10 characters for last names and 5 characters
for first (baptismal) names and place names. The concordance lists
showing our system of name tokenization can be reached through the links
below. Each entry gives the original, untokenized form first, followed
by its token. The list for last names also includes a frequency count in
the first column.
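The general idea of tokenizing a name and then truncating it can be sketched roughly as follows. This is only an illustration: the length limits (10 and 5) come from the description above, but the function name `tokenize`, the table `HYPOTHETICAL_RULES`, the stripping of diacritics, the lower-case output, and the choice to truncate after normalizing are all assumptions and do not reproduce the actual concordance lists.

```python
import unicodedata

# Maximum token lengths as stated in the text.
MAX_LEN = {"last": 10, "first": 5, "place": 5}

# Hypothetical spelling equivalences, for illustration only.
# The real scheme was built from knowledge of local orthographies
# and is recorded in the concordance lists, not in this table.
HYPOTHETICAL_RULES = [("sch", "s"), ("cz", "c"), ("w", "v"), ("y", "i")]


def tokenize(name, kind):
    """Return an illustrative token for a name of the given kind."""
    # Lower-case and decompose accented letters into base + combining mark.
    decomposed = unicodedata.normalize("NFKD", name.lower())
    # Keep letters only; this drops combining marks, spaces, and punctuation.
    letters = "".join(ch for ch in decomposed if ch.isalpha())
    # Apply the (assumed) spelling equivalences in order.
    for old, new in HYPOTHETICAL_RULES:
        letters = letters.replace(old, new)
    # Truncate to the maximum length for this kind of name.
    return letters[:MAX_LEN[kind]]


print(tokenize("Schwarzenberger", "last"))
print(tokenize("Katharina", "first"))
```

Under these assumed rules, variant spellings such as "Weiss" and "Veiss" map to the same token, which is the property that makes matching across records possible.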
Last Names
First Names
Place Names