Sources for data files:

Unknown (files taken from various sources; we need to replace these before shipping):
  british-english.txt (the standard British English dictionary that ships with Ubuntu systems)

Mendeley:
  hint-email.txt
  hint-keywords.txt
  hint-institution.txt
  months.txt
  person-titles-after.txt
  person-titles.txt
  van-names.txt

CiteSeerX (0.12):
  (The license for the data files is not stated explicitly in the files, but the
  project itself is under the Apache License v2.0. The papers describing the header
  parsing service designed for CiteSeer provide some background on the data sources.
  See Han, Giles et al. 2003, 'Automatic Document Metadata Extraction using Support
  Vector Machines'.)
  first-names.txt
  surnames.txt
  country-names.txt
  chinese-surnames.txt