
Tokenization

The process of nominal data linkage depends on matching names. Spelling can vary widely, especially in historical periods when it was not standardized. Because computer keyboards are not uniformly standardized across languages, computer data entry may even exacerbate this problem. Various schemes have been used to standardize the spelling of names (e.g., Soundex). Our own scheme drew on our knowledge of the languages and orthographies used locally in Slavonia (Croatian, Latin, German, Hungarian). Names were also truncated: last names to a maximum of 10 characters, and first (baptismal) names and place names to a maximum of 5 characters. The concordance lists showing our system of tokenization can be reached through the links below. In each list, the original, untokenized form precedes its token; the list for last names additionally has a frequency count in the first column.

Last Names

First Names

Place Names
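
As a rough illustration of how a concordance lookup and the length limits described above might be combined, here is a minimal Python sketch. It is not our actual procedure: the concordance entries, the function name `tokenize`, the use of upper case, and the choice to truncate after the lookup rather than before it are all assumptions made for the example. Only the limits of 10 and 5 characters come from the text.

```python
# Maximum token lengths stated in the text:
# 10 characters for last names, 5 for first (baptismal) and place names.
MAX_LEN = {"last": 10, "first": 5, "place": 5}

# Hypothetical concordance entries, invented for illustration only.
# The real lists are the ones linked above.
CONCORDANCE = {
    "last": {"KOVACSEVICH": "KOVACEVIC", "KOVATSEVICH": "KOVACEVIC"},
    "first": {"JOHANNES": "JOANNES"},
    "place": {"SZLATINA": "SLATINA"},
}


def tokenize(name, kind):
    """Return a token for `name`, where `kind` is 'last', 'first' or 'place'.

    Assumption: the standardized spelling is looked up first and the
    result is then truncated; the source does not state the order.
    """
    key = name.strip().upper()
    # Fall back to the name itself when no concordance entry exists.
    standard = CONCORDANCE[kind].get(key, key)
    return standard[:MAX_LEN[kind]]
```

With these invented entries, two variant spellings such as "Kovacsevich" and "Kovatsevich" map to the same token, while a name with no entry is simply upper-cased and truncated to the limit for its kind.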
