Predictive Text Alternatives Program

This program tries to find words that a mobile phone with predictive text might confuse with one another. For example, if you enter the word "knives" into a text message you may end up with the word "loiter" instead, and vice versa.

(Yes, I wrote this a long time ago.)

Some of these combinations can be profound, amusing or interesting.

Depending on the words that are in your phone, some examples that can occur are:

    madam      ocean
    stocking   rumbling
    shored     rinsed
    rooms      snoop
    putrid     stupid
    boney      comfy
    fancies    damages
    heroics    hernias
    internee   governed
    sterile    puerile
    subside    quashed
    suffrage   steerage
    sussex     surrey
    swinger    pygmies
    thriving   visiting
    tonguing   vomiting
    toupee     unused
    toxoid     townie
    trussed    usurped

Download the latest version of the program here (4.61KB)
For Windows 95, 98, Me, NT4, 2K, XP.
Written by me, Tim Warriner, January 2004 in Assembly language.

Download a reasonable dictionary file here. (144KB)

Instead of running the program you can just download the results this dictionary would produce:
Without the filter working here (110KB)
With the filter working here (8.22KB).

The program is based on an English phone keypad:
2: abc
3: def
4: ghi
5: jkl
6: mno
7: pqrs
8: tuv
9: wxyz
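
As a rough illustration of the keypad table above, here is a minimal Python sketch (not the original assembly code) that turns a word into the digits you would press. The names KEYPAD, LETTER_TO_KEY and key_sequence are made up for this example. Two words can be confused by the phone when they give the same digits.

```python
# Keypad layout copied from the table above.
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}

# Invert the table so each letter points at its key.
LETTER_TO_KEY = {
    letter: key for key, letters in KEYPAD.items() for letter in letters
}


def key_sequence(word):
    """Return the digits pressed to type `word` (letters a-z only)."""
    return "".join(LETTER_TO_KEY[letter] for letter in word.lower())


print(key_sequence("knives"))  # 564837
print(key_sequence("loiter"))  # 564837
```

Both words come out as 564837, which is why the phone cannot tell them apart.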

Obviously any words it finds would have to be in the phone's vocabulary for them to appear on the phone.

How to use:
1: choose a text file containing the words you'd like to find predictive text alternatives for.
2: choose a text dictionary file that these words can be checked against.
3: choose an output file that the results will be put into.
4: press 'Go'. (You can enter the same text file for steps 1 and 2.)

If you select the filter option then the program will ignore any words that aren't between 5 and 8 letters long. It will also ignore results that have more than half the letters the same as the original. The filter may make the program run faster and the results it finds will tend to be more interesting, but at the same time it will miss out on thousands of potentially good results.
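
The two filter rules can be sketched in Python as below. This is only a guess at the behaviour, not the program's actual code: in particular it assumes "the same letters" means the same letter in the same position, and the function names are invented for the example.

```python
def passes_length_filter(word):
    """Keep only words between 5 and 8 letters long."""
    return 5 <= len(word) <= 8


def too_similar(original, alternative):
    """True if more than half the letters match the original.

    Assumption: letters are compared position by position. The two
    words always have the same length, since they share a key sequence.
    """
    same = sum(1 for a, b in zip(original, alternative) if a == b)
    return same > len(original) / 2


print(passes_length_filter("aa"))         # False
print(too_similar("knives", "loiter"))    # False
print(too_similar("sterile", "puerile"))  # True
```

Under this reading "knives" and "loiter" share only 2 of 6 letters, so the pair survives, while "sterile" and "puerile" share 5 of 7 and would be dropped.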

The program updates its display of the word it has reached every 2 seconds. If you can't wait that long then you can click on the 'View all progress' button. This may slow the program down, as the CPU has to spend time writing text as well as finding strings.

- The dictionary file MUST be in alphabetical order and in lowercase. The program relies on this to work faster: it jumps to the appropriate letter in the dictionary instead of going from a to z. The dictionary can have DOS or Unix line endings. (It doesn't matter if you don't know what this means.)

- The text file containing words to find alternatives for and the dictionary can be the same file.

- It can take a long time to check against a very large dictionary, although recent versions of the program are a lot faster. The dictionary supplied here, when used as the word list and the dictionary at the same time, will take:

34 minutes on a Pentium 4 2.8GHz without the filter (92,123 possible alternative words tested per second).
21 minutes on a Pentium 4 2.8GHz with the filter (153,455 alternatives per second).
1 hour, 10 minutes on a Pentium 3 Celeron 633MHz without the filter (49,256 per second).
43 minutes on a Pentium 3 Celeron 633MHz with the filter (79,666 per second).

- Shorter words tend not to have interesting alternatives; longer words tend to take a long time to find alternatives for. The best word length, in my view, is between 4 and 8 letters. You can use my Wheat from chaff program to trim dictionaries down to words of this size.

If every letter in a word is on a key with three letters (e.g. a -> a, b, c), then the number of alternatives for that word is 3 to the power of the number of letters in that word. This count includes the word itself.

  aa has 9 alternatives (3^2)
  aaa has 27
  aaaa has 81
  aaaaa has 243
  aaaaaa has 729
  aaaaaaa has 2187
  aaaaaaaa has 6561
  aaaaaaaaa has 19683
  aaaaaaaaaa has 59049

Words containing p, q, r or s, or w, x, y or z, have even more, because the 7 and 9 keys each carry four letters. The program has to check each of these strings against the dictionary, which gives an idea of why a very long word will take a very long time to complete. A 20 letter word would have at least 3,486,784,401 (3^20) different alternatives - if the program could check 100,000 strings a second (which is fast for a good dictionary) it would take nearly 10 hours for just that one word. Ignoring all sense, this program will let you use words containing 50 letters.
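
To make the arithmetic concrete, here is a small Python sketch that counts and lists the candidate strings for a word. It is an illustration only and says nothing about how the assembly program does it internally; GROUP_OF, count_alternatives, alternatives and hours_to_check are names invented here, and the 100,000 strings per second default is just the figure quoted above. The counts include the original word itself.

```python
from itertools import product
from math import prod

KEYPAD_GROUPS = ["abc", "def", "ghi", "jkl", "mno", "pqrs", "tuv", "wxyz"]

# Map each letter to the full group of letters sharing its key.
GROUP_OF = {letter: group for group in KEYPAD_GROUPS for letter in group}


def count_alternatives(word):
    """Multiply together the number of letters on each key used."""
    return prod(len(GROUP_OF[letter]) for letter in word.lower())


def alternatives(word):
    """Yield every string typed with the same key presses as `word`."""
    groups = [GROUP_OF[letter] for letter in word.lower()]
    for letters in product(*groups):
        yield "".join(letters)


def hours_to_check(word, strings_per_second=100_000):
    """Rough time to test every candidate at the given rate."""
    return count_alternatives(word) / strings_per_second / 3600


print(list(alternatives("aa")))
# ['aa', 'ab', 'ac', 'ba', 'bb', 'bc', 'ca', 'cb', 'cc']
print(count_alternatives("a" * 10))        # 59049
print(count_alternatives("pygmies"))       # 5184
print(round(hours_to_check("a" * 20), 1))  # 9.7
```

"pygmies" has three letters on four-letter keys (p, y, s), so it has 4 x 4 x 4 x 3^4 = 5184 candidates rather than 3^7 = 2187.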

- The program will only ever be as good as the word lists and dictionaries you use. There are thousands of word lists on the internet that you can try out or you can use the one supplied above to start you off. The best program for opening, sorting and generally playing with very large text files is UltraEdit (look on the internet for it). It can easily cope with 50MB text files, should you ever have anything that large.

- To get the fastest results from the program use a small word list containing short words, and a small dictionary.
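
If a word list you've found doesn't meet the dictionary requirements in the notes above (lowercase, alphabetical order), something like the following Python sketch could tidy it up first. It is not part of the program. The function name is made up, removing duplicates is an extra the notes don't ask for, and it assumes plain a-z words, for which Python's default sort order is alphabetical.

```python
def normalise_word_list(text):
    """Return the unique lowercase words in `text`, sorted."""
    # splitlines() copes with both DOS (\r\n) and Unix (\n) line endings.
    words = {line.strip().lower() for line in text.splitlines()}
    words.discard("")  # drop blank lines
    return sorted(words)


sample = "Loiter\r\nknives\nApple\n\nknives\n"
print(normalise_word_list(sample))  # ['apple', 'knives', 'loiter']
```

To use it on a real file you would read the file into a string, pass it through the function, and write the result back out one word per line.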

Various things worthy of note:
It should work in all versions of Windows since Windows 95. It's been tested in Windows XP Pro, Windows 2000, and Windows 98SE. I've improved the error messages when file handling goes wrong. Now, except with the rarer file errors, it will produce proper error messages instead of hex numbers.

Once I had got the program to work properly I just spent more time trying to optimise it. This was done for the sake of programming rather than to actually get more alternatives quicker. I'm not really that obsessed with knowing predictive text alternatives. I can use everything I've developed here to make other programs I've written run faster or look prettier.
The version here processes the dictionary supplied above in 34 minutes; the previous version did it in 37 minutes; the version before that took four and a half hours.

When the program reads the dictionary for the first time, it marks out the first occurrence of each block of words starting with each letter in the alphabet. Within each first letter block it marks out the second letters too. In this way when it is comparing strings it can just jump to the relevant section of the dictionary instead of searching through the entire file. For example with the word "apple" it would jump to the "ap" section, and search only that part of the dictionary.
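
The idea of marking out sections by first and second letter can be sketched in Python roughly as follows. This is a simplified stand-in for whatever the assembly version does: it keys a single table on the first two letters rather than marking first and second letters separately, and build_prefix_index and in_dictionary are invented names. It depends on the word list already being sorted.

```python
def build_prefix_index(sorted_words):
    """Map each two-letter prefix to the [start, end) range of its block."""
    index = {}
    for position, word in enumerate(sorted_words):
        prefix = word[:2]
        if prefix not in index:
            index[prefix] = [position, position + 1]
        else:
            # In a sorted list the block is contiguous, so extend its end.
            index[prefix][1] = position + 1
    return index


def in_dictionary(word, sorted_words, index):
    """Search only the block of words sharing the first two letters."""
    block = index.get(word[:2])
    if block is None:
        return False
    start, end = block
    return any(sorted_words[i] == word for i in range(start, end))


words = ["ape", "apple", "apply", "banana", "knives", "loiter"]
index = build_prefix_index(words)
print(index["ap"])                            # [0, 3]
print(in_dictionary("apple", words, index))   # True
print(in_dictionary("apricot", words, index)) # False
```

Looking up "apple" only examines the three "ap" words instead of the whole list.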

If you hibernate the computer (Windows 2000, XP) the program will notice and stop timing. Therefore when it shows the statistics on completion, it won't take into account the time the computer was switched off. This is probably the most pointless addition to a program that doesn't really do anything that useful that I've ever done.

If the priority of the program is set to anything but "normal", the program will reset it to "normal" when the computer is hibernating, to let hibernation proceed more quickly. Without this, hibernation can be extremely slow and unreliable.