Approved For Release 2000/08/15 : CIA-RDP96-00792R000700960002-4

116 Papers / Statistical Issues and Methods
carry out a global significance test for such single hypotheses to
which a superordinate null hypothesis can be assigned.
It should be clear that by performing global significance
tests many psi experiments must lose their significance. I remem-
ber, though, that I also mentioned the interexperimental selection
above, for whose avoidance, at the least, all similar psi experiments
should be combined and submitted to a global significance test.
Through such a "meta-analysis," on the other hand, the signifi-
cance may increase so that the single experiment loses part of its
meaning.
My second theme is the reduction of beta errors in the sta-
tistical evaluation of psi experiments. The problem is to increase
the statistical efficiency (or power) of the significance tests in such
a way that, despite the avoidance of selection errors, minimal psi
effects can be statistically detected. I confine myself to two differ-
ent questions, both of which are of considerable importance to the
practice. The first question is: which are the statistically optimal
methods for correcting a given selection or for combining single re-
sults which shall undergo a global significance test?
Here it can first be answered that for any selection of a
single result there is a simple statistical correction possible that
replaces the global significance test. An approximate formula for
this purpose requires that one multiplies the p value of the selected
result by the number of results. Naturally, in this manner, the
p value will be strongly increased, so that the statistical signifi-
cance will in most cases disappear, as in the case of a global sig-
nificance test. Nevertheless, this is a universal and very simple
method of correcting intra- or interexperimental selection.
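The correction just described can be sketched in a few lines of Python. The function name and the cap at 1.0 are my additions for illustration, not the author's notation:

```python
def corrected_p(p_selected, n_results):
    """Multiply the p value of the selected (best) result by the number
    of results it was selected from, as described in the text; capping
    at 1.0 keeps the value a valid probability (a Bonferroni-style bound)."""
    return min(1.0, p_selected * n_results)
```

For example, a selected result with p = .01 drawn from ten comparable results corrects to p = .1, which is no longer significant; this illustrates why the significance usually disappears.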
Most of the other methods consist in weighted combinations of
the single results so as to obtain an efficient global significance
test. In the case of standard experiments that seems trivial be-
cause one needs only to add the different hits, whose sum can be
evaluated with a CR just as well as the separate results. However,
an analysis of intra- and interindividual distributions of psi scores
shows that the simple addition of hits is one of the statistically
least efficient methods, even for the aggregation of small experi-
mental units such as individual runs. The reason for this lies in
the strong variability of psi scores, which can vary even in bi-
polar fashion between psi-hitting and psi-missing so that the
deviations cancel each other out. Therefore, I have suggested
special (nonlinear) transformations weighting the single scores ac-
cording to their size. Finally, following the method of the likeli-
hood quotient, I came to a measure which is statistically most effi-
cient for strongly varying psi scores and is a linear function of
the well-known "run-score variance."
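For the "trivial" case of simply adding hits, the CR (critical ratio) evaluation mentioned above amounts to a normal approximation of the binomial. A minimal sketch, assuming the standard five-choice test with hit probability 1/5 (the parameter default is my assumption):

```python
import math

def critical_ratio(hits, trials, p_hit=0.2):
    """CR for a pooled hit count: the deviation of the summed hits from
    chance expectation, in units of the binomial standard deviation."""
    expected = trials * p_hit
    sd = math.sqrt(trials * p_hit * (1 - p_hit))
    return (hits - expected) / sd
```

Note how pooling this way lets psi-hitting and psi-missing cancel, as the text argues: 60 hits in one 200-trial series and 20 in another pool to 80 hits in 400 trials, exactly chance expectation, so the pooled CR is zero while each series alone deviates strongly.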
The second question refers to the identification of permissible
forms of selection which one could use to increase the statistical
efficiency. For example, the above definition of selection error al-
lows one to exclude any partial results from the global significance
test of an experiment if the exclusion ensues according to a criterion
that, under the null hypothesis, is independent of the respective
results. If one, in this way, discovers certain clues that particu-
lar experimental situations, certain subjects, certain variables, etc.,
could be unsuccessful, one is allowed to eliminate them. This
can be a great advantage because every nonsignificant partial result
reduces the significance of the total result.
In the statistical evaluation of a multivariate experiment,
one should reduce correlated criterion or predictor vari-
ables to a smaller number of factors by performing a factor analysis,
because in the case of correlated variables the statistical efficiency
decreases with the number of variables. Finally, the so-called
extreme-group method should be mentioned, according to which one
is allowed to eliminate the middle cases of the distribution of a vari-
able when calculating correlations. For example, one could eliminate
all the chance-scoring subjects in a correlational study, if enough
psi-hitters and psi-missers remain. The correlations between psi
variables and other variables could, in that way, become much more
significant.
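A minimal sketch of the extreme-group idea, assuming one drops subjects whose psi score falls in a middle band before correlating; the thresholds, data, and function name are illustrative, not from the text:

```python
import math

def extreme_group_r(xs, ys, lo, hi):
    """Pearson r computed only over subjects whose x (psi) score falls
    outside the middle band (lo, hi) -- the 'extreme-group' selection."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x <= lo or x >= hi]
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sx = math.sqrt(sum((x - mx) ** 2 for x, _ in pairs))
    sy = math.sqrt(sum((y - my) ** 2 for _, y in pairs))
    return sxy / (sx * sy)
```

Keeping only the extreme scorers typically strengthens an existing linear relation, which is exactly the effect, and the danger, the abstract points to.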
I am afraid my explanations will not lead to a decisive change
in the statistical methods of parapsychologists. When I pointed to
the problem of statistical selection errors at the 1980 PA Convention
in Reykjavik, it also did not have any considerable effect. One
must, apparently, turn to the psi skeptics to attain such effects.
Probably, selection errors serve the general psychological tendency
to synchronize the given empirical data with one's own expectations
regarding reality. Therefore, the final demand can only be to
answer one's own ways of acting with increased self-criticism, even
in such an objective area as mathematical statistics. Otherwise,
those cynics will be confirmed who always have contended that,
with statistics, one can prove everything.
EVALUATING FREE-RESPONSE RATING DATA
Sybo A. Schouten and Gert Camfferman (Parapsychology
Laboratory, University of Utrecht, Sorbonnelaan 16,
3584CA Utrecht, The Netherlands)
During recent decades the use of forced-choice methods
in experimental research in parapsychology has gradually declined
in favor of free-response techniques. A disadvantage of free-
response techniques is that they are rather time consuming. The
discrepancy in time investment between free-response and forced-
choice studies seems only acceptable if it can be proven that either
free-response studies are more sensitive for detecting ESP or that
knowledge is gained from the process analysis which free-response
studies allow. These two potential advantages of free-response
studies require, however, more sensitive techniques for analyzing
free-response data than evaluations based on hit/miss ratios which
are used with forced-choice methods.
An evaluation method often used in free-response studies is
one that employs different target sets for each trial and has the
subject assign ratings to all pictures of the set. A target set con-
sists of a number of pictures from which one is randomly selected
to serve as the actual target in the experiment. The others are
used as controls. The rating values assigned to pictures are based
on the agreement between mentation (reported or not) and the con-
tent (or perhaps symbolic meaning of the content) of the pictures.
Based on the ratings assigned to each response, the pictures can
be ranked and one of the familiar evaluation methods for preferen-
tial ranking may be applied. But by turning ratings into ranks the
greater sensitivity that the rating method might yield is lost. Hence,
a statistical evaluation is needed which does credit to the higher
sensitivity which ratings might offer. To this end most often Z-
scores are applied, first used and reported by Stanford and Mayer
in 1974 (JASPR, 1974, 182-191).
When free-response rating data of an experiment were analyzed
by applying nonparametric tests on the Stanford Z-score distribution
of the targets a significant result indicating psi-missing was ob-
served. However, it soon became clear that the result was purely
artifactual and could be explained by the rating behavior of the
subjects. This led us to study the properties of the Stanford
Z-scores in more detail.
Hansen reported to the 1985 PA Convention (RIP 1985, 93-94)
that Z-score distributions are bimodal. We found that Z-score dis-
tributions are in all cases nonnormal, and symmetrical (though bi-
modal) only when subjects select ratings with equal probability from
the whole range.
distributions. Decreasing the range of the ratings results in more
irregular distributions. All distributions have an upper and lower
limit of Z-scores. In cases in which subjects select ratings with
unequal probabilities from the range applied, the distributions be-
come asymmetrical. Hence it can be concluded that rating behavior
influences the distributions of Stanford Z-scores. This seems an
important problem because in many cases the conditions of the ex-
periment will influence the rating behavior of subjects. That im-
plies that an influence of conditions on the rating behavior, and
consequently on the Z-scores, must be eliminated before a proper
evaluation of the difference as regards ESP scoring can be made.
Stanford Z-scores are also peculiar in some other respects.
Their value and range are rather sensitive to the number of equal
ratings assigned. In the case in which equal values are assigned,
the actual size of the ratings has no influence on the size of the
Stanford Z-score. For instance, 1-0-0-0-0 (first rating is target)
yields the same Stanford Z-score for the target as 100-0-0-0-0; in
both cases the target receives a Stanford Z-score of +1.72. Hence,
the Stanford Z-scores do not always reflect the similarities or dif-
ferences between mentation and targets that subjects express in
their assignment of rating values.
Another complication is that when relatively many ratings of
equal value are assigned, the Z-score distribution tends to become
discrete rather than continuous. Especially since free-response
studies in general involve few trials, the discrete character of such
distributions violates the assumptions on which many parametric and
nonparametric tests are based. To meet these objections a different
evaluation procedure based on a randomization test is proposed.
The randomization test is based on the sum of ratings over
the trials. In the case of assigning rating values to the control
pictures it can be assumed that ESP can have no effect on these
ratings. If we randomly select from each trial a control-picture
rating value and take the sum of these ratings, then, based on all
possible combinations of ratings for control pictures over the trials
a distribution is obtained which will tend to be normal even in the
case that the ratings themselves were selected with unequal proba-
bilities. The randomization test provides an answer to the question
to what extent the sum, over the trials, of the ratings assigned to
the target pictures deviates from the mean sum of the ratings as-
signed to the control pictures. Consequently, the sum of the rat-
ings assigned to the target pictures is expressed as a standard
normal score based on the distribution of the sum of the ratings
assigned to the controls. This standard normal score will be called
the "standardized sum-of-ratings score" or SSR score. A good ap-
proximation of this distribution is obtained by calculating the mean
and standard deviation from the mean and variance of the ratings
for the controls of the individual trials. The mean of the distribu-
tion of sum of scores will be equal to the sum of the means of con-
trols for the trials. The standard deviation is found by taking the
square root of the sum of the variances for control ratings over
the trials.
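The normal approximation just described can be sketched directly: the mean of the randomization distribution is the sum of the per-trial control means, and its variance is the sum of the per-trial control variances. The function name and the (target, controls) input format are my assumptions:

```python
import math

def ssr_score(trials):
    """Standardized sum-of-ratings (SSR) score: the sum of the target
    ratings standardized against the distribution of sums of randomly
    chosen control ratings.  `trials` is a list of
    (target_rating, [control_ratings]) pairs, one per trial."""
    target_sum = sum(t for t, _ in trials)
    # Mean of the randomization distribution: sum of per-trial control means.
    mean = sum(sum(c) / len(c) for _, c in trials)
    # Variance: sum of per-trial (population) variances of the control ratings,
    # since one control rating is drawn independently per trial.
    var = 0.0
    for _, c in trials:
        m = sum(c) / len(c)
        var += sum((x - m) ** 2 for x in c) / len(c)
    return (target_sum - mean) / math.sqrt(var)
```

Because the score is a sum over trials standardized by its own mean and standard deviation, it tends to normality even when individual ratings are chosen with unequal probabilities, which is the point of the construction.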
Since SSR scores can be assumed to be standard normal,
their associated probability can be obtained from the standard nor-
mal distribution. SSR scores of different conditions can be com-
pared because SSR scores not only can be considered standard
normal but also are independent of differences between conditions
in range of ratings, rating behavior, or number of controls applied.
To obtain ESP scores for individual trials the rating value
assigned to the target is converted into a standardized average
rating score for the target (SAR score).
The distribution of the sum of ratings for the controls can
be considered as the distribution of ratings associated with that
condition. Reduced to the level of individual trials we assume this
distribution to be typical for the condition and express all ratings
in this distribution of average ratings. Thus, each rating is con-
verted into a standard normal score by computing its distance from
the mean of average ratings for the controls of the trials and divid-
ing it by the standard deviation observed for these average ratings.
Then for each trial a SAR score for the target is defined as
the difference between this standard normal score for the target
and the average standard normal score for target and controls.
Since the SAR scores are based on true standard normal scores,
which means scores obtained from a normal distribution, SAR scores
can be considered normal too. For each trial the sum of SAR scores
for controls and targets is zero. Therefore, in the case of related
samples we might compare individual achievement over conditions by
calculating a product-moment correlation between the SAR scores of
the two conditions.
Although the randomization test described above seems sta-
tistically sound, we further studied its properties, especially regard-
ing its sensitivity to detect ESP. To this end we conducted a com-
puter simulation of 100 "experiments" for each combination of two
variables. Each experiment consisted of 20 trials and 5 pictures
per trial and was simulated by randomly generating 20 rows of 5
numbers between rating values 0 and 30, inclusive. The two vari-
ables involved were subjects' rating behavior and amount of ESP.
For rating behavior we manipulated the probability of selecting rat-
ing values of zero. The amount of ESP was operationalized as the
number of subjects assigning the highest rating value to the target
in addition to what could be expected by chance.
From the data obtained it can be concluded that in most con-
ditions the sensitivity of the SSR scores is rather low and less than
that of, for instance, a simple binomial test. Only
in extreme cases of rating behavior and amount of ESP do the SSR
scores become more sensitive than the binomial test. For instance,
in the case of 5 ESP hits when in total 5 + 15/5 = 8 hits can be ex-
pected, the binomial yields an exact one-tailed probability of p = .01
whereas the SSR score yields on average a Z of 1.7 with an associ-
ated one-tailed probability of .045.
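The exact one-tailed binomial probability used for comparison can be computed directly. A sketch assuming a "hit" means the subject assigns the highest rating to the target, with chance probability 1/5 for five pictures per trial (the function name and parameter default are mine):

```python
from math import comb

def binomial_upper_p(hits, trials, p_hit=0.2):
    """Exact one-tailed probability of obtaining `hits` or more successes
    in `trials` independent trials with success probability `p_hit`."""
    return sum(comb(trials, k) * p_hit**k * (1 - p_hit)**(trials - k)
               for k in range(hits, trials + 1))
```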
In the same simulation studies Stanford Z-scores were com-
puted. We know that the distributions for these Z-scores are non-
normal, but leaving this aside, we found that in most cases the sen-
sitivity of t-test evaluations based on Stanford Z-scores is compar-
able to that of evaluations based on SSR scores. However, SSR
scores appear more sensitive than Stanford Z-scores in cases of
strong ESP and extreme rating behavior.
From these findings some practical conclusions can be drawn.
In general we must assume that the ESP influence on the data is
relatively little. Hence, unless there is reason to expect a strong
ESP influence in the experiment the binomial test can be assumed to
be more sensitive than an evaluation based on the rating values.
The same applies for experiments in which no extreme rating be-
havior can be expected, for instance, in an experiment in which
an atomistic approach to the judging is followed. In that case we
expect in general nonzero ratings assigned to all pictures, and our
findings show that in that case the SSR scores, as well as Stan-
ford's Z-scores, are rather insensitive.
A METHODOLOGY FOR THE DEVELOPMENT OF A
KNOWLEDGE-BASED JUDGING SYSTEM FOR FREE-RESPONSE
MATERIALS
Dick J. Bierman (Dept. of Psychology, University of Amsterdam)
It has been found that certain judges perform consistently
better than others when matching targets to a target set. It seems
unlikely that this is purely because of the judge's psi, since psi
generally does not display consistent behavior. Therefore, it might
be hypothesized that it is the (intuitive) knowledge of the specific
judge that accounts for his better performance on this task. It
has been proposed (Morris, EJP, 1986, 137-149) that the use of
expert systems might help researchers in tasks where they lack
expertise, such as in the detection of fraud. Morris argues that
the expertise of magicians could be formalized in such a system and
made available to each individual researcher. Similarly, the exper-
tise of the best judges of free-response material could become avail-
able through implementation of a knowledge-based free-response
judging system. This use of techniques from the field of artificial
intelligence (AI) to represent scarce knowledge should not be con-
fused with the use of AI techniques for the representation of free-
response material (Maren, RIP 1986, 97-99). According to Maren,
the free-response aterial and the protocols should be represented
in the form of trees in which the nodes are perceivable "objects,"
like "flames," and the links represent relations, like "adjacent to."
We expect that focusing our attention on the (knowledge used in
the) human matching process might reveal more fundamental informa-
tion about the role of the meaning of the material. It is striking
that in Maren's proposed representation of complex target material
only visual features are present. Actually, the type of visual
matching that Maren proposes to be done by a machine can be bet-
ter performed by any sighted human.
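Maren's tree representation, as summarized here, can be sketched as a small labelled graph. The objects, the relation label, and the helper function are illustrative only, not taken from Maren's proposal:

```python
# Nodes are perceivable "objects"; links are labelled relations between them.
protocol = {
    "nodes": ["flames", "house", "tree"],
    "links": [("flames", "adjacent to", "house"),
              ("house", "adjacent to", "tree")],
}

def related(graph, obj, relation):
    """Return the objects linked to `obj` by the given relation label."""
    return [c for a, r, c in graph["links"] if a == obj and r == relation]
```

Matching a mentation protocol against a target would then reduce to comparing two such graphs, which is exactly the purely structural, visual-feature matching the author argues is less interesting than the knowledge a human judge brings to bear.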