This program reads a text file containing lines of words, and then, depending on the criteria you select, puts the words you want in one file ("Wheat") and the words you don't want into another file ("Chaff"). It separates the proverbial wheat from the proverbial chaff.
The original text file will remain untouched, whilst all the words will be copied into either the "Wheat" file or the "Chaff" file. This means that you can check the "Wheat" and "Chaff" files to see if the program removed the words you wanted, and if it didn't you can try different options with nothing to lose.
Download it here. [4.37KB] Windows 95, 98, ME, 2k, XP.
Written in assembly language by Tim Warriner 30/12/2003.
For example, suppose I had the following list of words:
If I decided I only wanted words that were 7 letters long, then the program would put words with more or fewer letters into the "Chaff" file, whilst putting 7-letter words into the "Wheat" file.
The resulting "Chaff" file would look like this:
The resulting "Wheat" file would look like this:
I wrote the program in order to sort out word lists and dictionaries. If you collect random words off the internet and from files off your computer you may end up with a lot of gibberish words that would take ages to sort out. The example above takes up 55 bytes so isn't too difficult to sort by hand. When you have a 40 MB dictionary it is pretty much impossible to do by hand. An example of part of such a dictionary is this:
All the words here are obviously nonsense. Some contain numbers, some are just random letters, all of them start with "ww".
Most of the options are designed to filter out a lot of the characteristics that aren't possessed by everyday English words. Each option has its benefits and drawbacks - the more words a filter removes, the more likely it is that a few valid English words will get taken out as well. For example, choosing to remove all words that contain a "Q" without a following "U" will remove a lot of nonsense and foreign words, but will also remove words you may want to keep such as "Iraq", "Qatar", "Qwerty" and so on. Although it was never intentional, the options tend to remove words that come from, or are derived from, the following languages: Welsh, Arabic, Dutch, German and Slavic (Polish, Russian etc).
This program can handle both DOS and Unix text files as the file being filtered. A good program for creating word lists is my Tugboat program here. Using it you can collect strings from within any file. A good program for viewing, sorting and generally handling huge text files is UltraEdit by IDM, which can manage text files of 50 MB or so with no apparent effort at all.
The program seems to work OK, although I haven't tested it much. Someday I'll think of a better way of presenting the options, but until then you'll have to make do with its HUGE window.
How to Use:
Choose one or more options then drag and drop the file you want dealt with onto the window. The "Wheat" and "Chaff" files will be put in the same directory that the program is run from. If the files already exist then the program complains rather than overwriting them. Remember "Wheat" is good, "Chaff" is bad.
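The overall flow described above can be sketched in Python. This is only an illustration, not the program's actual code (which is written in assembly): the function name `split_file`, the `keep` callback, the output names `Wheat.txt` and `Chaff.txt`, and the Latin-1 encoding are all assumptions made for the example.

```python
import os

def split_file(src_path, keep, out_dir="."):
    """Copy each line of src_path into a Wheat or Chaff file inside out_dir."""
    wheat_path = os.path.join(out_dir, "Wheat.txt")  # hypothetical file name
    chaff_path = os.path.join(out_dir, "Chaff.txt")  # hypothetical file name
    # Complain rather than overwrite existing output files.
    for path in (wheat_path, chaff_path):
        if os.path.exists(path):
            raise FileExistsError(path)
    # newline="" leaves line endings untranslated, so both DOS (\r\n) and
    # Unix (\n) endings are stripped by hand below.
    with open(src_path, "r", encoding="latin-1", newline="") as src, \
         open(wheat_path, "x", encoding="latin-1", newline="") as wheat, \
         open(chaff_path, "x", encoding="latin-1", newline="") as chaff:
        for raw in src:
            line = raw.rstrip("\r\n")
            # keep(line) is True for lines you want ("Wheat" is good).
            target = wheat if keep(line) else chaff
            target.write(line + "\n")
```

The source file is only ever opened for reading, so it stays untouched whatever the filter does.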
The Options in Detail:
Remove lines not consisting entirely of alphanumeric characters or spaces:
This will get rid of lines containing punctuation, accents, or symbols including hyphens.
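As a rough Python sketch of this test (the function name is made up, and treating only unaccented ASCII letters and digits as "alphanumeric" is an assumption based on the description above):

```python
def only_alnum_or_space(line):
    """True if every character is an unaccented letter, a digit, or a space."""
    return all(ch == " " or (ch.isascii() and ch.isalnum()) for ch in line)

# A line is removed (sent to Chaff) when the test fails.
for sample in ["catfish", "dvd 1", "co-op", "don't"]:
    print(sample, "->", "Wheat" if only_alnum_or_space(sample) else "Chaff")
```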
Remove lines containing any spaces:
Any line containing one or more spaces goes into the "Chaff" file.
Remove lines consisting entirely of numbers:
This will help get rid of gibberish, whilst keeping more useful number strings (should you want them).
Remove lines containing ANY numbers:
Good for when you just want letters. It's sensible to combine this with options 1 and 2 (the alphanumeric-only and no-spaces options).
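The difference between the two number options can be shown with a small Python sketch. The function names are hypothetical, and "numbers" is assumed to mean the digits 0 to 9.

```python
DIGITS = "0123456789"

def is_all_digits(line):
    """True for lines made up of nothing but digits (the milder option)."""
    return line != "" and all(ch in DIGITS for ch in line)

def has_any_digit(line):
    """True for lines containing at least one digit (the stricter option)."""
    return any(ch in DIGITS for ch in line)
```

With these, "2003" is caught by both tests, "dvd1" only by `has_any_digit`, and "catfish" by neither.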
Remove lines with no vowels (treat y as a vowel).
As far as I know there are no proper English words that don't contain either a vowel or the letter 'y'. Therefore this is a good filter for quickly cutting down on rubbish. You will lose some improper words such as "dvd".
Remove lines with no vowels (treat y as a consonant).
This goes one step further and will weed out more words, but at the same time you will lose words such as:
Remove lines with either no vowels OR no numbers (y is a vowel).
A line is excluded only if it contains neither a vowel nor a number. In other words, "catfish" is kept, "dvd1" is kept, "rhythm" is kept, but "dvd" is excluded.
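The three vowel options differ only in whether "y" counts as a vowel and whether a digit can rescue a line. A minimal Python sketch, with hypothetical function names and assuming the checks ignore case:

```python
def has_vowel(line, y_is_vowel):
    vowels = "aeiouy" if y_is_vowel else "aeiou"
    return any(ch in vowels for ch in line.lower())

def has_digit(line):
    return any(ch in "0123456789" for ch in line)

def removed_no_vowel_y_vowel(line):
    """No vowels, treating y as a vowel."""
    return not has_vowel(line, y_is_vowel=True)

def removed_no_vowel_y_consonant(line):
    """No vowels, treating y as a consonant."""
    return not has_vowel(line, y_is_vowel=False)

def removed_no_vowel_and_no_digit(line):
    """Neither a vowel (y counts) nor a digit."""
    return not has_vowel(line, y_is_vowel=True) and not has_digit(line)
```

Under these assumptions "rhythm" survives the first and third tests but not the second, and "dvd1" survives only the third.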
Remove lines containing a "Q" without a consecutive "U".
This gets rid of a lot of gibberish words but you will lose some words that have Arabic roots such as:
and also words such as "Qwerty".
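One way to express this test is a regular expression with a negative lookahead. The sketch below is in Python; the function name is made up and case-insensitive matching is an assumption.

```python
import re

# A "q" that is not immediately followed by a "u" (including a final "q").
Q_WITHOUT_U = re.compile(r"q(?!u)", re.IGNORECASE)

def removed_q_without_u(line):
    return Q_WITHOUT_U.search(line) is not None
```

This sends "Iraq", "Qatar" and "Qwerty" to the Chaff file and keeps words like "queen".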
Remove lines containing 2 repeating vowels. (aa, ii, uu, yy)
This will exclude many foreign words especially those with Dutch and Arabic roots. You will lose words such as:
Remove lines containing 2 repeating vowels. (ii, uu, yy)
This is the same as option 9 (the previous option), except that "aa" is allowed, so words like "aardvark" are kept.
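The two repeated-vowel options can be sketched as one Python function with a switch for "aa". The name `removed_double_vowel` and the `allow_aa` flag are invented for the example, and lower-casing the line first is an assumption.

```python
def removed_double_vowel(line, allow_aa=False):
    """True if the line contains one of the doubled vowels being filtered."""
    pairs = ("ii", "uu", "yy") if allow_aa else ("aa", "ii", "uu", "yy")
    word = line.lower()
    return any(pair in word for pair in pairs)

print(removed_double_vowel("aardvark"))                 # stricter option
print(removed_double_vowel("aardvark", allow_aa=True))  # milder option
```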
Remove lines containing 2 repeating "bad" consonants: jj, qq, vv, ww, xx
You will lose things such as:
Remove lines containing 3 or more consecutive repeating characters.
As far as I know there are no English words containing 3 consecutive identical letters. This is one of the best options for getting rid of gibberish.
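Both repeat-based options above (doubled "bad" consonants and runs of three identical characters) are easy to sketch in Python. The function names are hypothetical, and ignoring case is an assumption.

```python
import re

BAD_DOUBLES = ("jj", "qq", "vv", "ww", "xx")
# Any character followed by two more copies of itself.
TRIPLE = re.compile(r"(.)\1\1")

def removed_bad_double(line):
    word = line.lower()
    return any(pair in word for pair in BAD_DOUBLES)

def removed_triple_repeat(line):
    return TRIPLE.search(line.lower()) is not None
```

Under these rules "savvy" and "powwow" fall foul of the doubled-consonant test, while "bookkeeper" passes the triple-repeat test because it only has doubles.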
Remove lines containing 3 or more consecutive consonants (keeping certain combinations).
This will only keep words that contain 3 consecutive consonants if one of the following is true:
- There is an h, r, s or l in the 3 consonants.
- If the 3 consonants are: mpt, nct, wkw, ptn
- If the last 2 of the three consonants are: kn, dw or a double letter
- If the first 2 of the three consonants are: mc, ng, gn, ck, nd, nt, wd, nk, mb, mp, rt, nc, ld, dg, wn, ft, wf, bt, ct, wk, mn, xt
You will lose surprisingly few good words from this. One you will lose is "deptford".
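The rules above can be sketched as a sliding window over every run of three letters. This Python version is an interpretation, not the program's real logic: the names are invented, and it assumes "y" counts as a consonant, that case is ignored, and that every three-consonant window in a line must pass one of the exceptions.

```python
VOWELS = set("aeiou")  # assumption: "y" is treated as a consonant here
SOFT = set("hrsl")
OK_TRIPLES = {"mpt", "nct", "wkw", "ptn"}
OK_LAST_TWO = {"kn", "dw"}
OK_FIRST_TWO = {"mc", "ng", "gn", "ck", "nd", "nt", "wd", "nk", "mb", "mp",
                "rt", "nc", "ld", "dg", "wn", "ft", "wf", "bt", "ct", "wk",
                "mn", "xt"}

def is_consonant(ch):
    return "a" <= ch <= "z" and ch not in VOWELS

def triple_allowed(t):
    return (any(ch in SOFT for ch in t)   # contains h, r, s or l
            or t in OK_TRIPLES            # one of the listed triples
            or t[1:] in OK_LAST_TWO       # ends in kn or dw
            or t[1] == t[2]               # ends in a double letter
            or t[:2] in OK_FIRST_TWO)     # starts with a listed pair

def removed_consonant_cluster(line):
    word = line.lower()
    for i in range(len(word) - 2):
        t = word[i:i + 3]
        if all(is_consonant(ch) for ch in t) and not triple_allowed(t):
            return True
    return False
```

With this reading, "deptford" is removed because of "ptf", while "attempt" (mpt) and "handbag" (nd) are kept.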
Remove lines containing "bad" sequences.
This will remove all words that contain:
aew, bfx, cz, cbs, fv, fx, gij, ij, kti, ljm, nkk, pff, sj (but not misj), sz (but not misz), tz, tuj, yj, zk, zl (but not zzle).
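A possible Python sketch of this check follows. The names are hypothetical, case is assumed to be ignored, and each "but not" exception is interpreted as: the sequence is tolerated only where it occurs inside that longer string.

```python
BAD = ["aew", "bfx", "cz", "cbs", "fv", "fx", "gij", "ij", "kti", "ljm",
       "nkk", "pff", "tz", "tuj", "yj", "zk"]
# sequence -> (tolerated context, offset of the sequence inside that context)
BAD_WITH_EXCEPTION = {"sj": ("misj", 2), "sz": ("misz", 2), "zl": ("zzle", 1)}

def removed_bad_sequence(line):
    word = line.lower()
    if any(seq in word for seq in BAD):
        return True
    for seq, (context, offset) in BAD_WITH_EXCEPTION.items():
        start = word.find(seq)
        while start != -1:
            begin = start - offset
            # Bad unless this occurrence sits inside the tolerated context.
            if begin < 0 or word[begin:begin + len(context)] != context:
                return True
            start = word.find(seq, start + 1)
    return False
```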
You will lose words such as:
Remove lines starting with 2 consonants.
This removes all lines starting with two consonants apart from *r, *h, s*, *y, *l, *w (where "*" is any OTHER letter), mc, ps, kn? and gn? (where "?" is any vowel).
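The rule above is sketched below in Python under several assumptions: "y" counts as a consonant, case is ignored, and "any OTHER letter" is read as "a letter different from the fixed one" (so "tr" is allowed but "rr" is not). The function names are made up.

```python
VOWELS = set("aeiou")  # assumption: "y" is treated as a consonant here

def is_consonant(ch):
    return "a" <= ch <= "z" and ch not in VOWELS

def removed_two_consonant_start(line):
    word = line.lower()
    if len(word) < 2 or not (is_consonant(word[0]) and is_consonant(word[1])):
        return False  # does not start with two consonants at all
    first, second = word[0], word[1]
    if second in "rhylw" and first != second:   # *r, *h, *y, *l, *w
        return False
    if first == "s" and second != "s":          # s*
        return False
    if word[:2] in ("mc", "ps"):
        return False
    if word[:2] in ("kn", "gn") and len(word) > 2 and word[2] in VOWELS:
        return False                            # kn? and gn?
    return True
```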
Examples of words you may want to keep that will be lost with this option are:
Remove lines ending with "bad" sequences (wm, jf, sw, ij, cl, zl, zt, jt, wj, jp, dt, dk, lz, ov, cbs).
You will lose words such as:
Remove lines starting with string entry.
This will exclude words that start with the whole of the string in the "String Entry" box. For example, if the string entry was "aard", then it would weed out words such as "aardvark", but keep words such as "softvark" or "aar".
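Both this option and the "bad" ending option before it come down to testing one end of the line, which Python's `str.startswith` and `str.endswith` do directly. The function names are hypothetical and ignoring case is an assumption.

```python
BAD_ENDINGS = ("wm", "jf", "sw", "ij", "cl", "zl", "zt", "jt", "wj", "jp",
               "dt", "dk", "lz", "ov", "cbs")

def removed_bad_ending(line):
    # endswith accepts a tuple and checks each ending in turn.
    return line.lower().endswith(BAD_ENDINGS)

def removed_by_string_entry(line, entry):
    # An empty entry would match every line, so treat it as "no filter".
    return entry != "" and line.lower().startswith(entry.lower())

for word in ["aardvark", "softvark", "aar"]:
    print(word, removed_by_string_entry(word, "aard"))
```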
Remove lines that aren't between x and x letters long (inclusive).
This will remove all words that are longer or shorter than the limits you enter. If both limits are set to the same number then only words of exactly that length will be kept.
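The inclusive length check is a one-liner in Python. The function and parameter names here are invented for the illustration.

```python
def removed_by_length(line, shortest, longest):
    """True if the line's length falls outside shortest..longest (inclusive)."""
    return not (shortest <= len(line) <= longest)

# Setting both limits to 7 keeps only 7-letter words.
for word in ["catfish", "dvd", "aardvark"]:
    print(word, "Chaff" if removed_by_length(word, 7, 7) else "Wheat")
```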
1: Use string entry as prefix for all lines (put into Wheat, Chaff unused).
2: Use string entry as suffix for all lines (put into Wheat, Chaff unused).
These options will either prefix or suffix each line of the input file with the word that was entered in the "String Entry" box. The results will be put into the Wheat file (No Chaff file will be created). You cannot combine these options with any of the other options. This is not really related to the Wheat from Chaff idea, but it can be useful.
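As a small Python illustration of these two modes (the function names are made up, and the lines are assumed to have had their line endings stripped already):

```python
def add_prefix(lines, entry):
    """Put the string entry in front of every line."""
    return [entry + line for line in lines]

def add_suffix(lines, entry):
    """Put the string entry at the end of every line."""
    return [line + entry for line in lines]

words = ["vark", "wolf"]
print(add_prefix(words, "aard"))
print(add_suffix(words, "s"))
```

Every line goes to the Wheat output in this mode, since nothing is being filtered.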