Manual

WordGen’s manual is available below or as an off-line PDF version (click here). For the theoretical framework of the parameters described below, we refer to:

Duyck, W., Desmet, T., Verbeke, L., & Brysbaert, M. (2004). WordGen: A Tool for Word Selection and Non-Word Generation in Dutch, German, English, and French. Behavior Research Methods, Instruments & Computers, 36(3), 488-499. (full text available here)

 

‘Options’

+ Language selection:

-> Select one of four available languages. For Dutch, English and German, the respective Celex lemma frequency databases are used. For French, WordGen uses the lexique.org lemma database.

+ Write output to file

-> WordGen allows for the output to be saved to a data file (the default file is called WWG_LOG.prn and is saved in the program’s directory/folder). If this option is not chosen, the output only appears in the right frame of the program and is lost when the program is shut down.

+ Detailed Output

-> When selected, WordGen provides more detailed information in the output window. This extra information consists of (a) the neighbors that were found in the database and (b) the separate type frequencies of each of the word’s bigrams.

+ Search type for Celex or Lexique

-> Choose ‘Linear’ (alphabetic) or ‘Random’.

-> Linear: if WordGen is asked to generate a word, it will give you the first word it encounters (in the alphabetically sorted lemma database) that satisfies all constraints. Hence, setting this option will always result in the same word if WordGen is probed for a word satisfying the same constraints.

-> Random: WordGen enters the lexical database at a random position and returns the first word it encounters that satisfies all constraints. Hence, setting this option will usually result in a different output each time.

-> Note: if WordGen is asked to generate a list of word stimuli (see below) while ‘Linear’ is selected here, that list will contain only one distinct word!
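-> The difference between the two search types can be illustrated with a minimal Python sketch. This is not WordGen’s source code: the function name find_word, the toy lexicon, and the wrap-around at the end of the list are assumptions made for illustration only.

```python
import random

def find_word(lexicon, satisfies, mode="linear", rng=random):
    """Return the first entry that satisfies the constraints, or None.

    lexicon must be alphabetically sorted. 'linear' always starts at the
    first entry; 'random' starts at a random entry. Wrapping around to the
    start of the list is an assumption of this sketch.
    """
    n = len(lexicon)
    if n == 0:
        return None
    start = 0 if mode == "linear" else rng.randrange(n)
    for offset in range(n):
        word = lexicon[(start + offset) % n]
        if satisfies(word):
            return word
    return None

# Hypothetical toy lexicon, not taken from Celex or Lexique.
toy_lexicon = sorted(["bank", "bark", "barn", "book", "boot"])
starts_with_b = lambda w: w.startswith("b")

# Linear search: ten requests with the same constraints, one distinct word.
linear_words = {find_word(toy_lexicon, starts_with_b) for _ in range(10)}
print(linear_words)  # {'bank'}
```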

+ Limit search time option

-> Non-word searching time can be limited to any number of seconds (the default is 30 seconds). This option is provided because asking the program for a non-word with a constraint combination that is too narrowly defined, or even impossible, can lead to an infinite (or very long-lasting) search. For instance, asking the program for a Dutch ten-letter non-word with fourteen neighbors and a very low summated bigram frequency is unlikely to be successfully completed within a reasonable time (if it is possible to find such a non-word at all). Without a time limit, the program would simply keep on searching. Users who are looking for non-words that have to meet strict, but not impossible, constraints should set the time limit very high or deactivate it altogether. In practice, the to-be-generated non-words will usually be matched to existing words, which mostly leads to reasonable combinations of constraints.

-> Note that this time limit does not apply to word generation because it does not take much time for the program to perform an exhaustive check of every word in the databases against the constraints that were set.

+ Abort list generation time limit

-> ‘Break’: if WordGen is asked to generate a list of stimuli (see below), the program will stop generating stimuli if one of the stimuli is not found within the specified time limit.

-> ‘Continue’: if this happens, WordGen proceeds to the next stimulus.
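-> As a rough illustration of how the time limit interacts with the ‘Break’ and ‘Continue’ settings, consider the following Python sketch. The names generate_list, try_once and on_timeout are hypothetical and do not correspond to WordGen internals; try_once stands for a single attempt to find stimulus number i and returns None on failure.

```python
import time

def generate_list(n_items, try_once, time_limit=30.0, on_timeout="break"):
    """Collect up to n_items stimuli, giving each one time_limit seconds."""
    results = []
    for i in range(n_items):
        deadline = time.monotonic() + time_limit
        found = None
        while time.monotonic() < deadline:
            found = try_once(i)
            if found is not None:
                break
        if found is not None:
            results.append(found)
        elif on_timeout == "break":
            break  # 'Break': abort the whole list
        # 'Continue': skip this stimulus and proceed to the next one
    return results

# Toy attempt function: stimulus 1 can never be found.
attempt = lambda i: None if i == 1 else f"w{i}"
print(generate_list(3, attempt, time_limit=0.01, on_timeout="break"))     # ['w0']
print(generate_list(3, attempt, time_limit=0.01, on_timeout="continue"))  # ['w0', 'w2']
```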

 

‘Generation’

+ Generate

-> Specify whether you want to generate a non-word or select a word from the lexical database

-> WARNING: When ‘word’ is selected, WordGen will sometimes generate words that are in fact not real words (especially in Dutch; English example: ‘sango’). These stimuli actually have an entry in the CELEX lexical database! Thus, they appeared somehow in a written text which was compiled for CELEX. These errors (sometimes typing errors) are inherent in CELEX and cannot be corrected by WordGen.

-> WARNING: When ‘non-word’ is selected, WordGen will sometimes generate an existing word. This is because WordGen only uses lemma databases, which do not contain inflected word forms.

+ Constrain number of neighbours

-> Specify the number of orthographic neighbors that the word/non-word you are looking for should have.

-> For words, WordGen counts and reports all lemma entries in the lexical database which share all but one letter with the generated word.

-> For non-words, increasing the number of neighbours will result in more ‘wordlike’ non-words.

-> When setting this constraint, it is important to realise that this variable is highly correlated with (non)word length. Click here to see the distribution of neighborhood size as a function of word length for words in the four lexical corpora. This information might be useful to select reasonable neighbour constraints. For example, these figures illustrate that it does not make sense to probe WordGen for a Dutch 8-letter word with 11 neighbors.
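-> The neighbour count can be sketched in a few lines of Python. The sketch assumes the usual reading of ‘share all but one letter’: a neighbour has the same length and differs in exactly one letter position. The toy lexicon is hypothetical and much smaller than the Celex or Lexique databases.

```python
def is_neighbour(a, b):
    """True if a and b have equal length and differ in exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1

def neighbours(string, lexicon):
    """Return all lexicon entries that are orthographic neighbours of string."""
    return [w for w in lexicon if is_neighbour(string, w)]

# Hypothetical toy lexicon for illustration only.
toy_lexicon = ["book", "boek", "boot", "cook", "bark", "boo"]
print(neighbours("book", toy_lexicon))  # ['boek', 'boot', 'cook']
```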

+ Constrain word frequency (only available in word generation)

-> Specify minimum and maximum frequency that the word you are looking for should have.

-> In order to increase comparability across languages/studies, WordGen only includes frequency per million words.

-> WordGen only includes lemma frequencies. For example, the frequency of the word ‘bank’ includes the frequency of both associated English word forms, i.e. the furniture and the financial institution.

-> In accordance with the word recognition literature, we prefer to use the logarithm (base 10) of the frequency/million as a measure of word frequency. This rescaling corrects for the fact that the difference between two words occurring respectively 1 and 3 times per million is not the same as the difference between two words occurring respectively 101 and 103 times per million.

-> IMPORTANT: In addition to the theoretical argument discussed above, we also STRONGLY advise using log freq/million for computational reasons. WordGen’s source code primarily uses log freq/million numbers. Raw frequency is computed as the inverse (base 10) logarithm of log freq/million; therefore, raw frequencies may contain (very) small rounding errors. They are included as approximate equivalents of the log values, which might be more difficult to interpret.
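-> The rescaling argument and the rounding issue can be checked with a short Python calculation. The two-decimal rounding used here is an assumption for illustration; the precision WordGen stores internally is not documented in this manual.

```python
import math

def to_log(freq_per_million):
    """Base-10 logarithm of a frequency per million."""
    return math.log10(freq_per_million)

def to_raw(log_freq):
    """Inverse (base 10) logarithm: back from log freq/million to raw frequency."""
    return 10 ** log_freq

# The same raw difference (2 per million) is large at the low end of the
# scale and negligible at the high end.
d_low = to_log(3) - to_log(1)       # about 0.477
d_high = to_log(103) - to_log(101)  # about 0.0085
print(round(d_low, 3), round(d_high, 4))

# Converting a rounded log value back gives only an approximate raw frequency.
approx_raw = to_raw(round(to_log(3), 2))  # about 3.02 rather than 3
print(approx_raw)
```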

+ Constrain summated type bigram frequency

-> Specify minimum and maximum summated bigram frequency that the (non-)word you are looking for should have.

-> For nonwords, increasing the bigram frequency boundaries will generally result in more ‘wordlike’ non-words.

-> WordGen summates the position-nonspecific frequency of each bigram of the generated letter string (word or non-word), based on how many times a bigram occurs in the Celex or Lexique lemma databases, independent of its position in the word. For example, the Dutch word ‘boek’ has a bigram frequency of 19898, which is the sum of the number of occurrences of each of its bigrams: ‘bo’ (4123), ‘oe’ (9120), and ‘ek’ (6655) in the Celex.

-> Because the four languages have a different number of words in the database, there is a big difference between the bigram frequencies for these languages. For instance, the Dutch and English databases in CELEX contain 124,136 and 52,447 entries, respectively. This means that on average Dutch summated bigram frequencies will be more than twice as high as English summated bigram frequencies. Also, because the program works with summated bigram frequencies, on average the bigram frequency for short words will be lower than the bigram frequency for long words. To help the user set the summated bigram frequency constraint, we included four figures with the distribution of bigram frequencies as a function of word length, plotted separately for each language. This is available here.
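-> The ‘boek’ example can be reproduced with a minimal Python sketch. The three bigram counts are the ones quoted in this manual; the helper names are hypothetical, and counting every occurrence of a bigram within a lemma (rather than once per lemma) is an assumption of the sketch.

```python
from collections import Counter

def bigrams(s):
    """All adjacent letter pairs of s, in order."""
    return [s[i:i + 2] for i in range(len(s) - 1)]

def type_bigram_counts(lexicon):
    """Position-nonspecific bigram counts over a list of lemmas."""
    counts = Counter()
    for lemma in lexicon:
        counts.update(bigrams(lemma))
    return counts

def summated_bigram_frequency(s, counts):
    """Sum of the counts of each bigram of s (unknown bigrams count as 0)."""
    return sum(counts.get(bg, 0) for bg in bigrams(s))

# Counts for Dutch as quoted in the manual text.
dutch_counts = {"bo": 4123, "oe": 9120, "ek": 6655}
print(summated_bigram_frequency("boek", dutch_counts))  # 19898
```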

+ Constrain minimum legal bigram frequency (only available in non-word generation)

-> Specify the minimum bigram frequency that each of the bigrams of the generated word/non-word should have.

-> In general, increasing this number will result in more ‘wordlike’ non-words.

-> This parameter supplements the summated bigram frequency parameter discussed above. It prevents WordGen from generating words/non-words which contain illegal bigrams but still have a high summated bigram frequency because one of the other bigrams is very frequent.

-> Because Celex also contains some typing errors, it is not advised to use ‘1’ for example as the criterion for a ‘legal’ bigram. WordGen’s default value has been shown to be efficient in practice.
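-> A minimal Python sketch shows why this constraint is needed alongside the summated frequency. The counts for ‘bo’, ‘oe’ and ‘ek’ come from this manual; the count of 1 for ‘kq’ is invented for illustration, and the threshold of 30 is the value used in the batch example further below, not necessarily WordGen’s default.

```python
def passes_legal_bigram(s, counts, minimum):
    """True if every bigram of s occurs at least `minimum` times."""
    return all(counts.get(s[i:i + 2], 0) >= minimum for i in range(len(s) - 1))

# 'kq': 1 is a hypothetical count standing in for a typing error in the corpus.
counts = {"bo": 4123, "oe": 9120, "ek": 6655, "kq": 1}

print(passes_legal_bigram("boek", counts, 30))   # True
print(passes_legal_bigram("boekq", counts, 30))  # False

# The summated frequency of 'boekq' is still high despite the illegal bigram.
summed = sum(counts[b] for b in ("bo", "oe", "ek", "kq"))
print(summed)  # 19899
```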

+ Constrain minimum position-specific onset/suffix bigram frequency (only available in non-word generation)

-> Specify the minimum position-specific bigram frequency that the first and last bigram of the generated word/non-word should have.

-> WordGen counts how many words in the lexical database contain the first/last bigram of the generated nonword as the first/last bigram of the word (position specific). For example, the Dutch word ‘boek’ has a bigram frequency of 19898, which is the sum of the number of occurrences of each of its bigrams: ‘bo’ (4123), ‘oe’ (9120), and ‘ek’ (6655) in the Celex. The position specific onset bigram frequency of the first bigram, ‘bo’, in Dutch is 1608. Hence, there are 1608 words in Celex that begin with the letters ‘bo’. The suffix bigram frequency of ‘ek’ is 1815. Hence, there are 1815 words in Celex ending in ‘ek’.

-> In general, increasing this number will result in more ‘wordlike’ non-words.

-> This parameter supplements the bigram frequency parameters discussed above. Some bigrams are quite frequent in a certain language, but almost never occur as the first (or last) two letters of a word (e.g. ‘rt’ in English). This option prevents WordGen from generating a non-word which is illegal because it has such an onset (or suffix).

-> Because Celex also contains some typing errors, it is not advised to use ‘1’ for example as the criterion for a ‘legal’ bigram. WordGen’s default value has been shown to be efficient in practice.
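-> The position-specific counts can be sketched in Python as follows. The toy lexicon and the function names are hypothetical; with a real lemma database the counts would of course be far larger (e.g. 1608 Dutch lemmas beginning with ‘bo’, as stated above).

```python
from collections import Counter

def positional_counts(lexicon):
    """Count how often each bigram starts (onset) or ends (suffix) a lemma."""
    onset = Counter(w[:2] for w in lexicon if len(w) >= 2)
    suffix = Counter(w[-2:] for w in lexicon if len(w) >= 2)
    return onset, suffix

def passes_positional(s, onset, suffix, minimum):
    """True if the first and last bigram of s are frequent enough in position."""
    return onset[s[:2]] >= minimum and suffix[s[-2:]] >= minimum

# Hypothetical English toy lexicon: 'rt' is common as a suffix, never an onset.
toy_lexicon = ["book", "boot", "bolt", "took", "art", "cart", "part"]
onset, suffix = positional_counts(toy_lexicon)

print(onset["bo"], suffix["rt"], onset["rt"])     # 3 3 0
print(passes_positional("bort", onset, suffix, 2))  # True
print(passes_positional("rtok", onset, suffix, 1))  # False
```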

+ Use heuristic (only available in non-word generation)

-> By default, WordGen generates random letter strings, and then checks whether the generated letter string meets all the specified constraints (about 1000 strings per second). When using a difficult combination of strict constraints, this process can take some time, especially when searching for long non-words. In that case, it may be advisable to use the heuristic approach.

-> When selected, WordGen will create a nonword by randomly selecting an entry from the lexical corpus, and changing one letter from that lemma. It then checks whether the created non-word meets all other constraints.
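-> The two generation strategies can be contrasted in a minimal Python sketch. It is not WordGen’s implementation: the function names, the plain a-z alphabet, the max_tries cut-off and the rejection of candidates that are themselves lexicon entries are all assumptions of the sketch. The accept argument stands for ‘all other constraints’.

```python
import random
import string

ALPHABET = string.ascii_lowercase  # assumption: plain a-z only

def random_candidate(length, rng):
    """Default strategy: a completely random letter string."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))

def heuristic_candidate(pool, rng):
    """Heuristic strategy: take a random lemma and change one letter."""
    base = rng.choice(pool)
    pos = rng.randrange(len(base))
    letter = rng.choice([c for c in ALPHABET if c != base[pos]])
    return base[:pos] + letter + base[pos + 1:]

def generate_nonword(lexicon, length, accept, use_heuristic=False,
                     max_tries=100_000, seed=None):
    """Generate-and-test loop; returns None if no candidate is accepted."""
    rng = random.Random(seed)
    known = set(lexicon)
    pool = [w for w in lexicon if len(w) == length]
    if use_heuristic and not pool:
        return None  # no lemma of the requested length to start from
    for _ in range(max_tries):
        if use_heuristic:
            candidate = heuristic_candidate(pool, rng)
        else:
            candidate = random_candidate(length, rng)
        if candidate not in known and accept(candidate):
            return candidate
    return None

# Hypothetical toy lexicon; accept everything to keep the example short.
toy_lexicon = ["book", "boot", "cart", "part"]
print(generate_nonword(toy_lexicon, 4, lambda s: True, use_heuristic=True))
```

-> Because every heuristic candidate is one letter away from an existing lemma, it starts out with at least one neighbour, which is why this approach tends to reach ‘wordlike’ constraint combinations faster than purely random strings.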

+ Use wildcard

-> Specify any letters that the generated word/non-word should have (position specific)

-> Use an asterisk as the wildcard. For example, if you are looking for a 6-letter word/non-word starting with a ‘b’, and ending in a ‘k’, enter ‘b****k’ (without the quotation marks).
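-> The wildcard pattern can be read as a position-by-position template. A minimal Python sketch of that reading (the function name matches_wildcard is hypothetical):

```python
def matches_wildcard(s, pattern):
    """True if s has the pattern's length and agrees on every non-'*' position."""
    return len(s) == len(pattern) and all(
        p == "*" or p == c for p, c in zip(pattern, s)
    )

print(matches_wildcard("bedeck", "b****k"))  # True
print(matches_wildcard("basket", "b****k"))  # False: does not end in 'k'
print(matches_wildcard("book", "b****k"))    # False: wrong length
```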

+ Forbidden letter list

-> Specify any letters that the generated word/non-word should not contain (not position specific)

-> Type in all ‘forbidden’ letters next to each other, without spaces. For example, if you do not want any words/non-words containing the letters ‘x’, ‘y’ or ‘z’, simply enter ‘xyz’ (without the quotation marks).
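-> In other words, a candidate is rejected as soon as it shares any letter with the forbidden list, regardless of position. A minimal Python sketch (the function name is hypothetical):

```python
def avoids_forbidden(s, forbidden):
    """True if s contains none of the letters in the forbidden string."""
    return set(s).isdisjoint(forbidden)

print(avoids_forbidden("book", "xyz"))  # True
print(avoids_forbidden("boxy", "xyz"))  # False
```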

+ Load/save parameters

-> Sometimes, it can take a while before the user finds an adequate set of parameters for the specific stimuli that are needed. Also, when generating nonwords, it may be useful to be able to look up the criteria that were used some time after the nonwords were created. In those cases, it may be useful to save the parameter combination in a file on the hard disk. This information is stored in a plain text file (.pmf extension), in any location that the user wants. It can be loaded back into the GUI with the ‘load parameters’ button.

-> An example parameter file can be found here.

+ Generate

-> When clicked, WordGen generates a single word/non-word satisfying the constraints that were set.

+ Generate List frame

-> It is often the case that researchers need several words/non-words satisfying the same constraints. Also, somebody may want to see several nonwords, for example, all satisfying specified constraints, before manually selecting one from that list. Or, somebody may want a list of words in a certain frequency category, before manually selecting all nouns from that list. In those cases, WordGen can generate a list of words/non-words satisfying the same set of parameters.

-> Enter the desired number of words/non-words. Note that WordGen still operates within its ‘search time limit’ constraint (see above). For each word/non-word, WordGen will try to find a letter string within the specified time limit. If this fails for a given word, WordGen will abort if ‘break’ is selected in the ‘options’ pane (see above). In that case, the output file will contain fewer stimuli than requested. When generating long lists of words/non-words with strict constraint combinations, it is advised to disable the ‘limit search time’ feature, or select ‘continue’ in the ‘options’ pane (see above).

-> Always specify a location and file name where WordGen can store its search results. By default, the file extension is .prn. IMPORTANT: With each new search, WordGen overwrites this file if another filename is not specified with the new search!

-> An example output file can be seen here. This file contains WordGen’s output when it was probed for 100 five-letter English non-words having 2-10 neighbours and satisfying the minimum legal bigram and legal onset/suffix constraints. This parameter combination corresponds to this example parameter file (also mentioned above).

-> The output file contains the following fields (separated by spaces, exportable to a spreadsheet program)

– (non)wordstring

– log frequency per million of wordstring (this field is set to 0 for non-words)

– number of neighbours (n)

– neighbourstrings (1…n)

– summated type bigram frequency

– number of bigrams (o)

– bigramstrings [1…o]

– type bigram frequencies [1…o]

-> If ‘write detailed output to file’ is not selected, WordGen will only write the first field [(non)wordstring].
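-> For users who prefer to post-process the detailed output with a script rather than a spreadsheet, the field list above can be parsed as sketched below in Python. The sketch assumes that each stimulus occupies one line and that the fields appear exactly in the listed order, separated by spaces; this should be verified against an actual output file. The sample line is constructed by hand (the bigram values come from the ‘boek’ example, while the log frequency and the neighbour list are invented placeholders) and is not real WordGen output.

```python
def parse_detailed_line(line):
    """Split one detailed output line into its fields (assumed layout)."""
    tokens = line.split()
    n = int(tokens[2])              # number of neighbours
    after_neighbours = 3 + n
    o = int(tokens[after_neighbours + 1])  # number of bigrams
    first_bigram = after_neighbours + 2
    return {
        "string": tokens[0],
        "log_freq": float(tokens[1]),
        "neighbours": tokens[3:after_neighbours],
        "summated_bigram_freq": int(tokens[after_neighbours]),
        "bigrams": tokens[first_bigram:first_bigram + o],
        "bigram_freqs": [int(t) for t in tokens[first_bigram + o:first_bigram + 2 * o]],
    }

# Hand-made sample line; the values 1.5, 'hoek' and 'boer' are placeholders.
sample = "boek 1.5 2 hoek boer 19898 3 bo oe ek 4123 9120 6655"
record = parse_detailed_line(sample)
print(record["neighbours"], record["bigram_freqs"])
```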

-> IMPORTANT: if WordGen is asked to generate a list of word stimuli while ‘linear’ is selected on the ‘options’ pane (see above), that list will contain only one distinct word!

‘Checking’

+ Word/Non-word to check

-> Simply enter the word/non-word for which you want to know the log freq/million, bigram frequency measures & neighbour count.

-> For detailed word/non-word information (e.g. what are the neighbours?), see the ‘options’ pane described above.

-> If a letter string is checked as a word, and it does not appear as a separate entry in the lemma database, WordGen will report this. This makes it possible to check whether a letter string is an existing word in a certain language. WARNING: inflections (e.g. plurals) may be existing words, but will not be recognized as such because they do not constitute separate entries in the lemma databases.

+ Cross-language checking

-> Specify with which language you wish to cross-check.

-> When selected, WordGen will parse the entered letter string through the two databases associated with the specified languages. For example, for the English-Dutch homograph ‘room’ (meaning ‘cream’ in Dutch), WordGen will report both the English and Dutch language-specific frequency of that word.

-> This feature also makes it possible to search for cross-language neighbours. In the example above, WordGen will report the Dutch neighbors of the English word ‘room’.

‘Batch mode’

+ Although WordGen is designed to provide an easy-to-use ‘click and retrieve’ graphical user interface (GUI) for word selection and nonword generation, repetitive queries can be highly automated using the batch mode feature.

+ Commands may be entered in the command line box, or in separate batch files (recommended).

+ This allows the experienced user to specify the different parameter settings of a large stimulus set before WordGen is probed for results. That way, WordGen can be programmed to search independently and uninterrupted for a large stimulus set, without human intervention.

+ An example batch syntax file can be seen here. With this file, WordGen is first probed for an English (-language 2) non-word (-w 0) of four letters (-n 4), with 2 to 7 neighbours (-neighbor 2 7) and a summated bigram frequency from 6000 to 18000 (-bigramF 6000 18000), satisfying the legal bigram (-legal 30) and legal onset/suffix bigram (-posbigramF 15) constraints. WordGen is probed for 10 nonwords (-l 10) and has to write the detailed (‘true’) results to output.prn on the c:\ drive. While doing this, WordGen will display its progress in the output window (‘-show’). After this, WordGen will proceed to the next command in the file. In the example batch file above, it is probed for a Dutch 5-letter word starting with a ‘b’.

+ The command syntax is described in the wwg_batch_options.prn file delivered with the program, and here:

option [parameters]  //explanation

-show //write options to screen

-reset //set all values to default (** see below)

-language [1-4] //language: Dutch(1) English(2) French(3) German(4)

-f [filename] //read parameters from file (pmf file, without extension!)

-w [0-1] //switch between non-words (0) / words (1)

-t [0-1] //switch between checking (0) and generation (1)

-l [n_strings][outputfilename][details (true-false)] //’Generate List’ options

-n [number of letters] //number of letters

-bigramF [minimum][maximum] //constrain min and max bigram freq

-celexF [minimum][maximum] //constrain min and max celex / lexique freq

-neighbor [minimum][maximum] //constrain min and max number of neighbours

-wildcard [wildcard] //using wildcard

-heuristic //using heuristic

-forbidden [forbidden letters] //exclude forbidden letters

-legal [frequency] //min legal bigram frequency constraint

-posbigramF [frequency] //minimum positional bigram frequency constraint

 

//turning off options and constraints: without arguments!

-lx

-bigramFx

-celexFx

-neighborx

-wildcardx

-heuristicx

-forbiddenx

-legalx

-posbigramFx

 

//execute: all further options on the same input line are discarded

-go!

 

IMPORTANT notes on WordGen syntax:

+ if a constraint is set, it stays on for the next -go! command, unless it is reset (using the -reset or -[constraint]x command)

+ the constraint settings of the batch system and the GUI are stored separately. Thus, clicking constraints with the mouse does not have an impact on the batch system.

+ batch files can contain line breaks. In the command line box in the GUI, everything should be entered as a single line of text. For example, the following command should be entered to write 10 English 4-letter nonwords, having 2 to 7 neighbours, to a file called output.prn on the c:\ drive: -language 2 -w 0 -n 4 -neighbor 2 7 -l 10 c:\output.prn true -show -go!

+ do not use spaces in filenames (not yet…)

+ when wwg encounters an error, it continues command line processing

+ default values are:

– language=English

– words / non-words=NONWORDS

– task=GENERATION

– Number of letters=5

– neighbor constraint=false

– celex / lexique frequency constraint=false

– bigram frequency constraint=false

– minimum legal bigram frequency constraint=false

– minimum positional bigram frequency constraint=false

– use wild card (obliged letter constraint)=false

– forbidden letters constraint=false

– use heuristic=false

– generate list=false
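+ When many similar queries are needed, the batch commands can be assembled by a small script instead of being typed by hand. The Python sketch below only uses options listed in the syntax table above; the function build_command and its argument names are hypothetical helpers, not part of WordGen. It refuses filenames containing spaces, in line with the note above.

```python
def build_command(language, word, n_letters, neighbor=None, bigram_f=None,
                  legal=None, posbigram_f=None, list_spec=None, show=False):
    """Assemble a single-line WordGen batch command ending in -go!."""
    parts = [f"-language {language}", f"-w {1 if word else 0}", f"-n {n_letters}"]
    if neighbor is not None:
        parts.append(f"-neighbor {neighbor[0]} {neighbor[1]}")
    if bigram_f is not None:
        parts.append(f"-bigramF {bigram_f[0]} {bigram_f[1]}")
    if legal is not None:
        parts.append(f"-legal {legal}")
    if posbigram_f is not None:
        parts.append(f"-posbigramF {posbigram_f}")
    if list_spec is not None:
        count, filename, details = list_spec
        if " " in filename:
            raise ValueError("filenames must not contain spaces")
        parts.append(f"-l {count} {filename} {'true' if details else 'false'}")
    if show:
        parts.append("-show")
    parts.append("-go!")  # everything after -go! on the same line is discarded
    return " ".join(parts)

# Ten English 4-letter nonwords with 2 to 7 neighbours, detailed output.
command = build_command(2, False, 4, neighbor=(2, 7),
                        list_spec=(10, r"c:\output.prn", True), show=True)
print(command)
```

+ Writing one such line per query into a batch file gives the same effect as the hand-written example file, since each constraint stays active until it is reset.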

New: after you have installed the program, you can also download a small add-on tool here. This program allows you to automatically check a long list of stimuli in a text file for lexical characteristics, without having to enter the stimuli manually.

Unzip this file in the WordGen install folder that you chose. Put your stimuli in the file named batch.in. Run batch_checking.exe. Choose the language, and the output will be written to batch.out.
