Approved For Release 2000/08/15 : CIA-RDP96-00792R000700960002-4

116 Papers / Statistical Issues and Methods
carry out a global significance test for such single hypotheses to
which a superordinate null hypothesis can be assigned.
It should be clear that by performing global significance
tests many psi experiments must lose their significance. I remem-
ber, though, that I also mentioned the interexperimental selection
above, for whose avoidance, at the least, all similar psi experiments
should be combined and submitted to a global significance test.
Through such a "meta-analysis," on the other hand, the signifi-
cance may increase so that the single experiment loses part of its
meaning.
My second theme is the reduction of beta errors in the sta-
tistical evaluation of psi experiments. The problem is to increase
the statistical efficiency (or power) of the significance tests in such
a way that, despite the avoidance of selection errors, minimal psi
effects can be statistically detected. I confine myself to two differ-
ent questions, both of which are of considerable importance to the
practice. The first question is: which are the statistically optimal
methods for correcting a given selection or for combining single re-
sults which shall undergo a global significance test?
Here it can first be answered that for any selection of a
single result there is a simple statistical correction possible that
replaces the global significance test. An approximate formula for
this purpose requires that one multiplies the p value of the selected
result by the number of results. Naturally, in this manner, the
p value will be strongly increased, so that the statistical signifi-
cance will in most cases disappear, as in the case of a global sig-
nificance test. Nevertheless, this is a universal and very simple
method of correcting intra- or interexperimental selection.
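The correction just described can be sketched in a few lines of Python. The function name and the cap at 1.0 are my additions for illustration, not the author's notation:

```python
def corrected_p(p_selected, n_results):
    """Multiply the p value of the selected (best) result by the number
    of results it was selected from, as described in the text; capping
    at 1.0 keeps the value a valid probability (a Bonferroni-style bound)."""
    return min(1.0, p_selected * n_results)
```

For example, a selected result with p = .01 drawn from ten comparable results corrects to p = .1, which is no longer significant; this illustrates why the significance usually disappears.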
Most of the other methods consist in weighted combinations of
the single results so as to obtain an efficient global significance
test. In the case of standard experiments that seems trivial be-
cause one needs only to add the different hits, whose sum can be
evaluated with a CR just as well as the separate results. However,
an analysis of intra- and interindividual distributions of psi scores
shows that the simple addition of hits is one of the statistically
least efficient methods, even for the aggregation of small experi-
mental units such as individual runs. The reason for this lies in
the strong variability of psi scores, which can vary even in bi-
polar fashion between psi-hitting and psi-missing so that the
deviations cancel each other out. Therefore, I have suggested
special (nonlinear) transformations weighting the single scores ac-
cording to their size. Finally, following the method of the likeli-
hood quotient, I came to a measure which is statistically most effi-
cient for strongly varying psi scores and is a linear function of
the well-known "run-score variance."
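For the "trivial" case of simply adding hits, the CR (critical ratio) evaluation mentioned above amounts to a normal approximation of the binomial. A minimal sketch, assuming the standard five-choice test with hit probability 1/5 (the parameter default is my assumption):

```python
import math

def critical_ratio(hits, trials, p_hit=0.2):
    """CR for a pooled hit count: the deviation of the summed hits from
    chance expectation, in units of the binomial standard deviation."""
    expected = trials * p_hit
    sd = math.sqrt(trials * p_hit * (1 - p_hit))
    return (hits - expected) / sd
```

Note how pooling this way lets psi-hitting and psi-missing cancel, as the text argues: 60 hits in one 200-trial series and 20 in another pool to 80 hits in 400 trials, exactly chance expectation, so the pooled CR is zero while each series alone deviates strongly.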
The second question refers to the identification of permissible
forms of selection which one could use to increase the statistical
efficiency. For example, the above definition of selection error al-
lows one to exclude any partial results from the global significance
test of an experiment if the exclusion ensues according to a criterion
that, under the null hypothesis, is independent of the respective
results. If one, in this way, discovers certain clues that particu-
lar experimental situations, certain subjects, certain variables, etc.,
could be unsuccessful, one is allowed to eliminate them. This
can be a great advantage because every nonsignificant partial result
reduces the significance of the total result.
In the statistical evaluation of a multivariate experiment,
one should reduce correlated criterion or predictor vari-
ables to a smaller number of factors by performing a factor analysis,
because in the case of correlated variables the statistical efficiency
decreases with the number of variables. Finally, the so-called
extreme-group method should be mentioned, according to which one
is allowed to eliminate the middle cases of the distribution of a vari-
able when calculating correlations. For example, one could eliminate
all the chance-scoring subjects in a correlational study, if enough
psi-hitters and psi-missers remain. The correlations between psi
variables and other variables could, in that way, become much more
significant.
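A minimal sketch of the extreme-group idea, assuming one drops subjects whose psi score falls in a middle band before correlating; the thresholds, data, and function name are illustrative, not from the text:

```python
import math

def extreme_group_r(xs, ys, lo, hi):
    """Pearson r computed only over subjects whose x (psi) score falls
    outside the middle band (lo, hi) -- the 'extreme-group' selection."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x <= lo or x >= hi]
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sx = math.sqrt(sum((x - mx) ** 2 for x, _ in pairs))
    sy = math.sqrt(sum((y - my) ** 2 for _, y in pairs))
    return sxy / (sx * sy)
```

Keeping only the extreme scorers typically strengthens an existing linear relation, which is exactly the effect, and the danger, the abstract points to.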
I am afraid my explanations will not lead to a decisive change
in the statistical methods of parapsychologists. When I pointed to
the problem of statistical selection errors at the 1980 PA Convention
in Reykjavik, it also did not have any considerable effect. One
must, apparently, turn to the psi skeptics to attain such effects.
Probably, selection errors serve the general psychological tendency
to synchronize the given empirical data with one's own expectations
regarding reality. Therefore, the final demand can only be to
answer one's own ways of acting with increased self-criticism, even
in such an objective area as mathematical statistics. Otherwise,
those cynics will be confirmed who always have contended that,
with statistics, one can prove everything.
EVALUATING FREE-RESPONSE RATING DATA
Sybo A. Schouten and Gert Camfferman (Parapsychology
Laboratory, University of Utrecht, Sorbonnelaan 16,
3584CA Utrecht, The Netherlands)
During recent decades the use of forced-choice methods
in experimental research in parapsychology has gradually declined
in favor of free-response techniques. A disadvantage of free-
response techniques is that they are rather time consuming. The
discrepancy in time investment between free-response and forced-
choice studies seems only acceptable if it can be proven that either
free-response studies are more sensitive for detecting ESP or that
knowledge is gained from the process analysis which free-response
studies allow. These two potential advantages of free-response
studies require, however, more sensitive techniques for analyzing
free-response data than evaluations based on hit/miss ratios which
are used with forced-choice methods.
An evaluation method often used in free-response studies is
one that employs different target sets for each trial and has the
subject assign ratings to all pictures of the set. A target set con-
sists of a number of pictures from which one is randomly selected
to serve as the actual target in the experiment. The others are
used as controls. The rating values assigned to pictures are based
on the agreement between mentation (reported or not) and the con-
tent (or perhaps symbolic meaning of the content) of the pictures.
Based on the ratings assigned to each response, the pictures can
be ranked and one of the familiar evaluation methods for preferen-
tial ranking may be applied. But by turning ratings into ranks the
greater sensitivity that the rating method might yield is lost. Hence,
a statistical evaluation is needed which does credit to the higher
sensitivity which ratings might offer. To this end most often Z-
scores are applied, first used and reported by Stanford and Mayer
in 1974 (JASPR, 1974, 182-191).
When free-response rating data of an experiment were analyzed
by applying nonparametric tests on the Stanford Z-score distribution
of the targets a significant result indicating psi-missing was ob-
served. However, it soon became clear that the result was purely
artifactual and could be explained by the rating behavior of the
subjects. This led us to study the properties of the Stanford
Z-scores in more detail.
Hansen reported to the 1985 PA Convention (RIP 1985, 93-94)
that Z-score distributions are bimodal. We found that Z-score dis-
tributions are in all cases nonnormal, and symmetrical (though bi-
modal) only when subjects select ratings with equal probability from
the whole range.
distributions. Decreasing the range of the ratings results in more
irregular distributions. All distributions have an upper and lower
limit of Z-scores. In cases in which subjects select ratings with
unequal probabilities from the range applied, the distributions be-
come asymmetrical. Hence it can be concluded that rating behavior
influences the distributions of Stanford Z-scores. This seems an
important problem because in many cases the conditions of the ex-
periment will influence the rating behavior of subjects. That im-
plies that an influence of conditions on the rating behavior, and
consequently on the Z-scores, must be eliminated before a proper
evaluation of the difference as regards ESP scoring can be made.
Stanford Z-scores are also peculiar in some other respects.
Their value and range are rather sensitive to the number of equal
ratings assigned. In the case in which equal values are assigned,
the actual size of the ratings has no influence on the size of the
Stanford Z-score. For instance, 1-0-0-0-0 (first rating is target)
yields the same Stanford Z-score for the target as 100-0-0-0-0; in
both cases the target receives a Stanford Z-score of +1.72. Hence,
the Stanford Z-scores do not always reflect the similarities or dif-
ferences between mentation and targets that subjects express in
their assignment of rating values.
Another complication is that when relatively many ratings of
equal value are assigned, the Z-score distribution tends to become
discrete rather than continuous. Especially since free-response
studies in general involve few trials, the discrete character of such
distributions violates the assumptions on which many parametric and
nonparametric tests are based. To meet these objections a different
evaluation procedure based on a randomization test is proposed.
The randomization test is based on the sum of ratings over
the trials. In the case of assigning rating values to the control
pictures it can be assumed that ESP can have no effect on these
ratings. If we randomly select from each trial a control-picture
rating value and take the sum of these ratings, then, based on all
possible combinations of ratings for control pictures over the trials
a distribution is obtained which will tend to be normal even in the
case that the ratings themselves were selected with unequal proba-
bilities. The randomization test provides an answer to the question
to what extent the sum, over the trials, of the ratings assigned to
the target pictures deviates from the mean sum of the ratings as-
signed to the control pictures. Consequently, the sum of the rat-
ings assigned to the target pictures is expressed as a standard
normal score based on the distribution of the sum of the ratings
assigned to the controls. This standard normal score will be called
the "standardized sum-of-ratings score" or SSR score. A good ap-
proximation of this distribution is obtained by calculating the mean
and standard deviation from the mean and variance of the ratings
for the controls of the individual trials. The mean of the distribu-
tion of sum of scores will be equal to the sum of the means of con-
trols for the trials. The standard deviation is found by taking the
square root of the sum of the variances for control ratings over
the trials.
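The normal approximation just described can be sketched directly: the mean of the randomization distribution is the sum of the per-trial control means, and its variance is the sum of the per-trial control variances. The function name and the (target, controls) input format are my assumptions:

```python
import math

def ssr_score(trials):
    """Standardized sum-of-ratings (SSR) score: the sum of the target
    ratings standardized against the distribution of sums of randomly
    chosen control ratings.  `trials` is a list of
    (target_rating, [control_ratings]) pairs, one per trial."""
    target_sum = sum(t for t, _ in trials)
    # Mean of the randomization distribution: sum of per-trial control means.
    mean = sum(sum(c) / len(c) for _, c in trials)
    # Variance: sum of per-trial (population) variances of the control ratings,
    # since one control rating is drawn independently per trial.
    var = 0.0
    for _, c in trials:
        m = sum(c) / len(c)
        var += sum((x - m) ** 2 for x in c) / len(c)
    return (target_sum - mean) / math.sqrt(var)
```

Because the score is a sum over trials standardized by its own mean and standard deviation, it tends to normality even when individual ratings are chosen with unequal probabilities, which is the point of the construction.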
Since SSR scores can be assumed to be standard normal,
their associated probability can be obtained from the standard nor-
mal distribution. SSR scores of different conditions can be com-
pared because SSR scores not only can be considered standard
normal but also are independent of differences between conditions
in range of ratings, rating behavior, or number of controls applied.
To obtain ESP scores for individual trials the rating value
assigned to the target is converted into a standardized average
rating score for the target (SAR score).
The distribution of the sum of ratings for the controls can
be considered as the distribution of ratings associated with that
condition. Reduced to the level of individual trials we assume this
distribution to be typical for the condition and express all ratings
in this distribution of average ratings. Thus, each rating is con-
verted into a standard normal score by computing its distance from
the mean of average ratings for the controls of the trials and divid-
ing it by the standard deviation observed for these average ratings.
Then for each trial a SAR score for the target is defined as
the difference between this standard normal score for the target
and the average standard normal score for target and controls.
Since the SAR scores are based on true standard normal scores,
which means scores obtained from a normal distribution, SAR scores
can be considered normal too. For each trial the sum of SAR scores
for controls and targets is zero. Therefore, in the case of related
samples we might compare individual achievement over conditions by
calculating a product-moment correlation between the SAR scores of
the two conditions.
Although the randomization test described above seems sta-
tistically sound, we further studied its properties, especially regard-
ing its sensitivity to detect ESP. To this end we conducted a com-
puter simulation of 100 "experiments" for each combination of two
variables. Each experiment consisted of 20 trials and 5 pictures
per trial and was simulated by randomly generating 20 rows of 5
numbers between rating values 0 and 30, inclusive. The two vari-
ables involved were subjects' rating behavior and amount of ESP.
For rating behavior we manipulated the probability of selecting rat-
ing values of zero. The amount of ESP was operationalized as the
number of subjects assigning the highest rating value to the target
in addition to what could be expected by chance.
From the data obtained it can be concluded that in most con-
ditions the sensitivity of the SSR scores is rather low and less than
that of, for instance, a simple binomial test. Only
in extreme cases of rating behavior and amount of ESP do the SSR
scores become more sensitive than the binomial test. For instance,
in the case of 5 ESP hits when in total 5 + 15/5 = 8 hits can be ex-
pected, the binomial yields an exact one-tailed probability of p = .01
whereas the SSR score yields on average a Z of 1.7 with an associ-
ated one-tailed probability of .045.
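The exact one-tailed binomial probability used for comparison can be computed directly. A sketch assuming a "hit" means the subject assigns the highest rating to the target, with chance probability 1/5 for five pictures per trial (the function name and parameter default are mine):

```python
from math import comb

def binomial_upper_p(hits, trials, p_hit=0.2):
    """Exact one-tailed probability of obtaining `hits` or more successes
    in `trials` independent trials with success probability `p_hit`."""
    return sum(comb(trials, k) * p_hit**k * (1 - p_hit)**(trials - k)
               for k in range(hits, trials + 1))
```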
In the same simulation studies Stanford Z-scores were com-
puted. We know that the distributions for these Z-scores are non-
normal, but leaving this aside, we found that in most cases the sen-
sitivity of t-test evaluations based on Stanford Z-scores is compar-
able to that of evaluations based on SSR scores. However, SSR
scores appear more sensitive than Stanford Z-scores in cases of
strong ESP and extreme rating behavior.
From these findings some practical conclusions can be drawn.
In general we must assume that the ESP influence on the data is
relatively little. Hence, unless there is reason to expect a strong
ESP influence in the experiment the binomial test can be assumed to
be more sensitive than an evaluation based on the rating values.
The same applies for experiments in which no extreme rating be-
havior can be expected, for instance, in an experiment in which
an atomistic approach to the judging is followed. In that case we
expect in general nonzero ratings assigned to all pictures, and our
findings show that in that case the SSR scores, as well as Stan-
ford's Z-scores, are rather insensitive.
A METHODOLOGY FOR THE DEVELOPMENT OF A
KNOWLEDGE-BASED JUDGING SYSTEM FOR FREE-RESPONSE
MATERIALS
Dick J. Bierman (Dept. of Psychology, University of Amsterdam)
It has been found that certain judges perform consistently
better than others when matching targets to a target set. It seems
unlikely that this is purely because of the judge's psi, since psi
generally does not display consistent behavior. Therefore, it might
be hypothesized that it is the (intuitive) knowledge of the specific
judge that accounts for his better performance on this task. It
has been proposed (Morris, EJP, 1986, 137-149) that the use of
expert systems might help researchers in tasks where they lack
expertise, such as in the detection of fraud. Morris argues that
the expertise of magicians could be formalized in such a system and
made available to each individual researcher. Similarly, the exper-
tise of the best judges of free-response material could become avail-
able through implementation of a knowledge-based free-response
judging system. This use of techniques from the field of artificial
intelligence (AI) to represent scarce knowledge should not be con-
fused with the use of AI techniques for the representation of free-
response material (Maren, RIP 1986, 97-99). According to Maren,
the free-response aterial and the protocols should be represented
in the form of trees in which the nodes are perceivable "objects,"
like "flames," and the links represent relations, like "adjacent to."
We expect that focusing our attention on the (knowledge used in
the) human matching process might reveal more fundamental informa-
tion about the role of the meaning of the material. It is striking
that in Maren's proposed representation of complex target material
only visual features are present. Actually, the type of visual
matching that Maren proposes to be done by a machine can be bet-
ter performed by any sighted human.
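Maren's tree representation, as summarized here, can be sketched as a small labelled graph. The objects, the relation label, and the helper function are illustrative only, not taken from Maren's proposal:

```python
# Nodes are perceivable "objects"; links are labelled relations between them.
protocol = {
    "nodes": ["flames", "house", "tree"],
    "links": [("flames", "adjacent to", "house"),
              ("house", "adjacent to", "tree")],
}

def related(graph, obj, relation):
    """Return the objects linked to `obj` by the given relation label."""
    return [c for a, r, c in graph["links"] if a == obj and r == relation]
```

Matching a mentation protocol against a target would then reduce to comparing two such graphs, which is exactly the purely structural, visual-feature matching the author argues is less interesting than the knowledge a human judge brings to bear.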