CELEX - Lexical databases
The Dutch database contains information on 381,292 present-day Dutch word forms, corresponding to 124,136 lemmata. The final release of the English database contains 52,446 lemmata representing 160,594 word forms. The German database holds 51,728 lemmata with 365,530 corresponding word forms.
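To compare the three databases, the counts above can be turned into an average number of word forms per lemma. The following is only a minimal sketch of that arithmetic: the names `COUNTS` and `forms_per_lemma` are illustrative and not part of CELEX, and the figures are copied from the paragraph above rather than read from the database files.

```python
# Lemma and word-form counts as stated in the description above.
# COUNTS and forms_per_lemma are hypothetical names used for illustration.
COUNTS = {
    "Dutch": {"lemmata": 124_136, "word_forms": 381_292},
    "English": {"lemmata": 52_446, "word_forms": 160_594},
    "German": {"lemmata": 51_728, "word_forms": 365_530},
}


def forms_per_lemma(counts):
    """Return the average number of word forms per lemma for each language."""
    return {
        language: entry["word_forms"] / entry["lemmata"]
        for language, entry in counts.items()
    }


ratios = forms_per_lemma(COUNTS)
for language, ratio in ratios.items():
    print(f"{language}: {ratio:.2f} word forms per lemma")
```

On these figures, Dutch and English each come out at roughly three word forms per lemma, while German comes out at roughly seven.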
Apart from orthographic features, the CELEX database comprises representations of the phonological, morphological, syntactic and frequency properties of lemmata. For Dutch and English lemma homographs, frequencies have been disambiguated on the basis of the 42.4-million-word Dutch INL text corpus and the 17.9-million-word English Collins/COBUILD text corpus.
Furthermore, information has been collected on syntactic and semantic subcategorizations for Dutch. Also, the Dutch orthography has been converted to comply with the 1994 spelling reform. These extensions have been made available as part of the Spoken Dutch Corpus (see the Dutch Language And Speech Technology Repository) and the web-based CELEX version.