Results from Word Ratios Analysis

Discussion

People use common words in consistent ratios without being aware of it. These are simple words such as "at", "the", "I", "to" and so on. In large texts, the ratios of these words may form a sort of fingerprint for a particular author, with the values varying only slightly over time. If two pieces of a similar type are written within a few months of each other, the ratios should be approximately the same.

We know that, if genuine, the Proof of Evidence was written between the first hearing date of 10th October 1994 and Tod Potsgrove's death early in the week of 31st October 1994, having been supplied with the transcripts of the covertly recorded tapes at the beginning of that week. On the first day of the hearing, 10th October 1994, the company stated that the interlocutory meeting was a short meeting of only ten to fifteen minutes at which only passwords were discussed. It was then revealed that it was in fact a bullying session lasting over an hour, as the one-hour dictaphone tape had run out. The Proof of Evidence changes the company's story to "Whilst the meeting started at 3 o'clock, I must have left the building by around 4.15pm or 4.30pm, because I know from my diary I was elsewhere at 5.30pm", so it must have been written after 10th October 1994. Tod Potsgrove was, of course, unable to write any proof of evidence that would be acceptable in a court after he had died early in the week of 31st October 1994.

One serious problem with this type of analysis of the Proof of Evidence is that there is so little to go on - several thousand words, hardly a book. Finding enough repetitions of a word to make the sample viable was the first step.

Method

The text was loaded into a word processor (Lotus Word Pro 97) and all of the words that the automatic spell checker did not recognise were hidden - that is, the highlighting for those words was turned off.

In turn, each test word was placed in the search and replace function, and a non-existent word that the spell checker would highlight was used to replace all occurrences of the test word in the text. Each time, the search and replace function reports how many occurrences there have been, and this number was noted. After each test, the undo button was pressed to return the text to its original state. The results of this preliminary search are shown in Table 1.

Word   the  to   I   of  and  that  was  in  not  a/an  he  on  for  it  with  at  did  am  do
Count  158  115  79  67  65   62    46   46  42   38    35  34  29   27  20    19  16   4   1
Table 1
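
The counting step can also be done programmatically. The following is a minimal sketch in Python, not the procedure used above: it assumes case-insensitive, whole-word matching, which may differ from how the word processor's search and replace counts occurrences. The function name count_test_words and the sample sentence are invented for illustration.

```python
import re
from collections import Counter


def count_test_words(text, test_words):
    """Count whole-word, case-insensitive occurrences of each test word.

    Assumption: a "word" is a run of letters and apostrophes, so a
    contraction such as "I'm" is one token and is not counted as "I".
    """
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    counts = Counter(tokens)
    return {word: counts[word.lower()] for word in test_words}


# Hypothetical sample sentence, not taken from the Proof of Evidence.
sample = "I went to the office and the manager said that I was to wait."
print(count_test_words(sample, ["the", "to", "I", "and", "that"]))
```

Unlike the replace-and-undo approach, this leaves the text untouched, so no restoring step is needed between test words.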

As there are 17 paragraphs in the document, a minimum of 60 occurrences was chosen as a compromise between having enough occurrences in each paragraph and having a reasonable number of words to choose from. Once these words had been chosen, the replacement and highlighting process was repeated, this time counting the number of replacements highlighted in each paragraph by the automatic spell checker. These results were noted and then calculated as percentages of the wordage for each paragraph.
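
As an illustration of the selection and the percentage calculation, this sketch applies the 60-occurrence minimum to the counts in Table 1 and converts a count into a percentage of the wordage. The names TABLE_1, MIN_OCCURRENCES and percentage are hypothetical; the total of 2744 words is the figure given for the whole document in Table 2.

```python
# Occurrence counts copied from Table 1.
TABLE_1 = {
    "the": 158, "to": 115, "I": 79, "of": 67, "and": 65, "that": 62,
    "was": 46, "in": 46, "not": 42, "a/an": 38, "he": 35, "on": 34,
    "for": 29, "it": 27, "with": 20, "at": 19, "did": 16, "am": 4, "do": 1,
}

MIN_OCCURRENCES = 60  # threshold stated in the text

# Keep only the words that occur often enough to be worth testing.
chosen = [word for word, count in TABLE_1.items() if count >= MIN_OCCURRENCES]


def percentage(count, total_words):
    """Express a count as a percentage of the wordage, to two decimals."""
    return round(100 * count / total_words, 2)


print(chosen)
print(percentage(TABLE_1["the"], 2744))  # "the" in the whole document
```

The six words that pass the threshold are the same six that head the columns of Table 2.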

Results

Section  the   to    I     of    and   that  Total  to/the  I/the   I/to
name     (%)   (%)   (%)   (%)   (%)   (%)   words  (x100)  (x100)  (x100)
Whole    5.76  4.08  2.88  2.44  2.37  2.26  2744   71      50      71
1        3.75  3.75  6.25  6.25  7.50  0.00  80     100     167     167
2        8.77  3.51  3.51  3.51  5.26  0.00  57     40      40      100
3        6.13  5.52  1.23  1.23  1.23  1.84  163    90      20      22
4        4.28  5.35  3.21  2.67  2.14  1.60  187    125     75      60
5        8.08  5.05  1.01  1.01  3.03  3.03  99     63      13      20
6        6.84  4.28  1.71  0.00  0.85  2.56  117    63      25      40
7        2.61  7.84  4.58  1.31  2.61  1.96  153    300     175     58
8        3.51  1.75  3.51  1.75  5.26  5.26  57     50      100     200
9        7.02  1.32  4.82  1.75  3.95  2.19  228    19      69      367
10       5.00  5.00  2.00  1.00  0.00  1.00  100    100     40      40
11       3.24  7.19  5.04  1.80  1.44  2.16  278    222     155     70
12       8.43  2.41  3.01  3.01  3.01  2.41  166    29      36      125
13       7.75  2.33  1.94  4.26  2.71  1.94  258    30      25      83
14       9.89  2.20  3.30  4.40  2.20  3.30  91     22      33      150
14a      3.61  2.06  0.52  1.03  1.55  3.09  194    57      14      25
15       8.00  5.00  1.00  4.00  0.00  2.00  100    63      13      20
16       5.29  3.85  2.40  3.13  2.16  2.88  416    73      45      63
Table 2.

Table 2 shows the results expressed as percentages of the words in each paragraph. The results for "of", "and" and "that" each include a zero, so these words were not used for ratio calculations. The ratios for to:the, I:the and I:to were calculated and are included above.
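
The exclusion of words with a zero and the ratio calculation can be sketched as follows, using a few rows copied from Table 2. This is an illustration only: the names WORDS, TABLE_2, usable_words, ratio_x100 and ratios are hypothetical. Because the percentages in Table 2 are already rounded, ratios recomputed from them in this way can differ slightly from figures worked out from the raw counts.

```python
WORDS = ["the", "to", "I", "of", "and", "that"]

# Percentages copied from Table 2: whole document and paragraphs 1, 6, 10.
TABLE_2 = {
    "Whole": [5.76, 4.08, 2.88, 2.44, 2.37, 2.26],
    "1":     [3.75, 3.75, 6.25, 6.25, 7.50, 0.00],
    "6":     [6.84, 4.28, 1.71, 0.00, 0.85, 2.56],
    "10":    [5.00, 5.00, 2.00, 1.00, 0.00, 1.00],
}


def usable_words(table):
    """Words whose percentage is non-zero in every row of the table."""
    return [
        word
        for i, word in enumerate(WORDS)
        if all(row[i] > 0 for row in table.values())
    ]


def ratio_x100(numerator_pct, denominator_pct):
    """Ratio of two percentages, times 100, rounded to a whole number."""
    return round(100 * numerator_pct / denominator_pct)


def ratios(row):
    """The three ratios used in Table 2 for one row of percentages."""
    the, to, i = row[0], row[1], row[2]
    return {
        "to/the": ratio_x100(to, the),
        "I/the": ratio_x100(i, the),
        "I/to": ratio_x100(i, to),
    }


print(usable_words(TABLE_2))
print(ratios(TABLE_2["Whole"]))
print(ratios(TABLE_2["1"]))
```

A zero in the denominator would make a ratio undefined, and a zero in the numerator makes it uninformative, which is why only "the", "to" and "I" are carried forward.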

In addition, the word count for each section is included so that it can be seen how significant any anomaly is. Paragraphs 2 and 8, for example, have only 57 words each, so the potential for error is greater than in paragraphs 7 or 11, which have well over 100 words each.


Thus, an anomaly has to be more pronounced in the shorter paragraphs for it to be considered genuine. This is the case with paragraph 8, which has a large proportion of "that"s and a low proportion of "to"s.
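
To put a rough figure on this, the sketch below works out how many percentage points a single occurrence of a word contributes in paragraphs of different lengths, using word totals from Table 2. The function name points_per_word is hypothetical.

```python
def points_per_word(total_words):
    """Percentage points contributed by one occurrence of a word."""
    return round(100 / total_words, 2)


# Paragraph number and total words, copied from Table 2.
for name, total in [("8", 57), ("7", 153), ("11", 278)]:
    print(f"paragraph {name}: {total} words, "
          f"{points_per_word(total)} points per occurrence")
```

In a 57-word paragraph one occurrence is worth about 1.75 points, against about 0.36 points in a 278-word paragraph, so a single extra or missing word moves the short paragraphs far more.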


Inferences

Level  the     to    I         of  and      that    to/the  I/the         I/to
High   2,14    11,7  9,11,1    1   8,2,1    8       4,11,7  9,4,8,11,1,7  8,9
Low    7,11,8  9,8   14a,15,3  6   15,10,3  1,2,10  -       -             5,15,3,14a
Table 3.

Table 3 lists the paragraphs that are anomalous for each measure, with outstanding anomalies highlighted. Table 4 shows these anomalies for each paragraph.

We can see from Table 3 that there were no significantly low values of the to/the or I/the ratios. In Table 4, we can see that "I" values vary markedly between paragraph 1 (80 words) and paragraph 14a (194 words).


Section  the  to  I  of  and  that  to/the  I/the  I/to  High (weighted)  Low (weighted)  Total (weighted)
1        -    -   H  H   H    L     -       H      -     4 (8)            1 (2)           5 (10)
2        H    -   -  -   H    L     -       -      -     2 (2)            1 (2)           3 (4)
3        -    -   L  -   L    -     -       -      L     0 (0)            3 (3)           3 (3)
4        -    -   -  -   -    -     H       H      -     2 (2)            0 (0)           2 (2)
5        -    -   -  -   -    -     -       -      L     0 (0)            1 (1)           1 (1)
6        -    -   -  L   -    -     -       -      -     0 (0)            1 (2)           1 (2)
7        L    H   -  -   -    -     -       H      -     2 (4)            1 (2)           3 (6)
8        L    L   -  -   H    H     -       H      H     4 (5)            2 (3)           6 (8)
9        -    L   H  -   -    -     -       H      H     3 (4)            1 (2)           4 (6)
10       -    -   -  -   L    L     -       -      -     0 (0)            2 (3)           2 (3)
11       L    H   H  -   -    -     H       H      -     4 (7)            1 (1)           5 (8)
12       -    -   -  -   -    -     -       -      -     0 (0)            0 (0)           0 (0)
13       -    -   -  -   -    -     -       -      -     0 (0)            0 (0)           0 (0)
14       H    -   -  -   -    -     -       -      -     1 (2)            0 (0)           1 (2)
14a      -    -   L  -   -    -     -       -      L     0 (0)            2 (3)           2 (3)
15       -    -   L  -   L    -     -       -      L     0 (0)            3 (4)           3 (4)
16       -    -   -  -   -    -     -       -      -     0 (0)            0 (0)           0 (0)
Table 4.
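
The unweighted High, Low and Total figures in Table 4 are simple tallies of the H and L marks in each row, which can be sketched as below. The marks are copied from Table 4 in column order (the, to, I, of, and, that, to/the, I/the, I/to), with "-" standing for no anomaly; the exact column positions of the marks are reconstructed from the table layout and should be treated as an assumption. The weighted figures are not reproduced, because the weights depend on which anomalies were judged outstanding. The names MARKS and tally are hypothetical.

```python
# H/L marks per paragraph, copied from Table 4; "-" means no anomaly.
# Column positions are an assumption; only the counts of H and L are used.
MARKS = {
    "1":   "--HHHL-H-",
    "2":   "H---HL---",
    "3":   "--L-L---L",
    "4":   "------HH-",
    "5":   "--------L",
    "6":   "---L-----",
    "7":   "LH-----H-",
    "8":   "LL--HH-HH",
    "9":   "-LH----HH",
    "10":  "----LL---",
    "11":  "LHH---HH-",
    "12":  "---------",
    "13":  "---------",
    "14":  "H--------",
    "14a": "--L-----L",
    "15":  "--L-L---L",
    "16":  "---------",
}


def tally(marks):
    """Return (high, low, total) unweighted anomaly counts for one row."""
    high = marks.count("H")
    low = marks.count("L")
    return high, low, high + low


for name, marks in MARKS.items():
    high, low, total = tally(marks)
    print(f"{name:>3}: high={high} low={low} total={total}")
```

Sorting the paragraphs by these totals gives a quick ranking, although the weighted figures in Table 4 are the ones used in the discussion that follows.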

Paragraphs 1 (80 words), 8 (57 words) and 11 (278 words) have the highest numbers of weighted anomalies, with paragraphs 1 and 11 having the best distinctions of highs and lows.

Paragraphs 12 (166 words), 13 (258 words), 16 (416 words), 5 (99 words), 6 (117 words) and 14 (91 words) are unremarkable and were possibly written by the same author.

The changes in values involving "I" may be the most important, as people tend to talk in a particular way, constructing sentences around rules that influence how often they refer to themselves.

If a piece were being written by someone else, they would be conscious of the number of times "I" was used and, unless they took the trouble to analyse the paragraphs themselves, would have a higher probability of producing a significantly different proportion of "I"s. Paragraphs that are remarkable in terms of levels of "I" are 1 (80 words) and 6 (117 words), high and low respectively, and in terms of "I/the" are 11 (278 words), 1 (80 words) and 7 (153 words).

Paragraphs 12, 13 and 16 set the norm for the overall writing according to this method of analysis, and paragraphs 1, 7, 8, 9 and 11 appear to be of different origin. It is, of course, a pity that the document is not many times longer, as this would increase the reliability of any analysis done upon it. However, we are not in a position to ask the alleged author to expand upon the document.

The short length of paragraphs 2, 8, 1 and 14 makes the results on those paragraphs less reliable. It can therefore be said that, based upon the limited extent of the available material, paragraphs 7, 9 and 11 appear to originate from a source different to most of the rest of the document, and there is a reasonable probability that paragraphs 1 and 8 also originate from a source different to most of the rest of the material.


Copyright 1998 - 2003 P.A.Grosse.
All Rights Reserved