Pula: Botswana Journal of African Studies vol.17 (2003) nO.1 Designing and Coding Survey Instruments for Statistical Analysis. Ntonghanwah Forcheh Department of Statistics, Universityof Botswana email: Forchehn@mopipLub.bw Abstract A forgollen part or generally ignored aspect of survey research is the preparation of the research instrument for suitable and efficient data entry and analysis. What ever is available appears to elude most researchers in the social sciences and humanities. The consequence is that these researchers, some of whom are authorities in their fields, frequently fall victim to poorly designed instruments, which can only be used to answer their research questions in the most extraneous manner. Ajier many years of interacting with researchersfrom several areas of enquiry, and following urging from some of the researchers, I decided to write this paper as a sort of 'where there is no doctor' type of paper. Theproblems discussed are illustrated using actual questionnaire items from a wide range of instruments, which I have been involved in at one stage or another. The main focus is how to turn questionnaire items into suitable data for data entry and analysis. Other issues discussed include understanding data handling in statistical packages, especially in relation to the traditional classifications of variables that most empirical researchers are familiar with. The paper highlights challenges in designing questionnaire instruments that are relevant in investigating the research hypotheses in social and behavioural enquiries. Introduction At the dawn of Statistics as a scientific discipline, an excited enthusiast predicted its future as follows: Even on the very threshold of the scientific realization of this valuable adjunct it is clear that a great future awaits our present 'Statistik', and we may reasonably anticipate that the combination of statistical data and analysis will create a science which will excel every other based on mathematics, not even excepting astronomy, mechanics and physics (Zeuner circa 1860, in Pearson 1978). Within a century of this prediction, Statistics has now penetrated into more disciplines than any other scientific subject. No university degree programme is nowadays complete without a course in at least elementary statistics. In their revolutionary 'Outcome Based Education (OBE)' programme, the South Africa education ministry has stated as one of its goals 'to make all South Africans Statistically literate'. Even self-confessed qualitative researchers still have to depend on statistical methods for sample selection. Despite this widespread interest in statistical methods and their use in scientific and behavioural inquiries, many researchers continue to stumble as they try to apply the methods in their research endeavours. During the last decade, I have interacted extensively, as consultant and/or adviser, with researchers and consultants from a wide range of backgrounds, notably energy and environmental studies, market 67 research, HIV/AIDS surveys, agriculture, career and counselling, information and communication, and have identified recurrent problems to most researchers, especially those from the social and behavioural sciences. These problems are identifying appropriate instruments for investigating research hypotheses, preparing such instruments in forms suitable for quick and efficient data entry and analysis, and selecting the appropriate statistical tools for performing the analyses. The focus of this paper is to provide a systematic and comprehensive approach to preparing quantitative survey instruments when the resulting data are to be captured and analysed using a statistical software package. The outcome of the suggested methods would be a tidy, well-coded questionnaire aimed at avoiding expensive and time-wasting coding after data collection. Even after a questionnaire has been well designed, determining variables from complex questionnaire items can still be a frustrating endeavour. This has been the case in a number of large-scale studies brought to my attention, in which several thousand questionnaires had been captured into the software packages, only to discover that some of the variable definitions were so bad that the data could not be analysed after data entry. The paper therefore, provides illustrated methods of how to turn questionnaire items into variables for purposes of data entry and analysis. Suggestions are also made on how to select variable names that do not follow the usual methods of using fancy and confusing abbreviations, dealing with open-ended questions, linking questionnaire items to research objectives. A brief discussion of the need to relate survey questions to research objectives is presented in the next section. This is followed in section 3 by a discussion of the main classes of variables and data types, with emphasis on their representation in computer programmes. The different types of questions encountered in surveys in the social sciences, behavioural studies and related disciplines are presented in section 4, and the associated variables that should be derived from these questions are discussed. Questions considered range from the simplest YesfNo-type questions to the complicated grid, combination and open-ended questions. Throughout, questionnaire items from actual questionnaires are used. The paper should be relevant to any researcher conducting a survey with the view of analysing the resulting data using a statistical package. Data Representation in Statistical Packages A benefit of the computer revolution is that standard data analysis packages now include a wide array of data analysis tools which until recently were available only from purposely written programmes. An understanding of the assumptions behind most ~f the techniques requires more advanced statistical knowledge than is possible fro~ mt~oductory statistics courses offered to most undergraduate students in the SOCialsCiences and humanities. Yet with a few clicks of the mouse, pages of results from each technique can be generated, giving the impression that expert knowledge is no longer required, if only one is computer literate. The inappropriate application of these tools to data analysis is becoming the main source of the misuse of statistics. A common example of this misuse is when techniques designed for the analysis. of quantitative data such as comparing means, standard deviation, Pearson corr~lat~on coefficient and fitting linear regression are used to analyse pre-coded qualitative data. 68 For example, in a paper on job satisfaction, an experienced researcher included the following two questions among other questionnaire items: Q2: Highest Level of Education I I. None I 2. Primary I 3. Secondary I 4. Tertiary The responses were appropriately coded and labelled using codes: 1='Very satisfied', 2='Satisfied', 3='Neutral', 4='Dissatisfied' and 5= 'Very Dissatisfied' 1 'None', 2 ' Primary', 3 'Secondary', and 4 'Tertiary' Data for both questions were then entered into the computer as numbers rather than categories. The researcher wanted to determine if there was any correlation between job satisfaction and educational level, and to compare the level of satisfaction of male respondents with that of female respondents. The Pearson correlation coefficient between job satisfaction and the level of education was found to be -0.333, with a 'two-sided significance' of 0.000 (n=308). The mean and standard deviations of the level of job Satisfaction and the level of education were tabulated as shown in Table 1 below: Table 1. Comparison of Job Satisfaction and Level of Education of Female and male respondents: Gender Statistic Job Satisfaction Education Level Male Mean 3.25 2.35 Std. Deviation 1.24 0.75 Female Mean 3.47 2.01 Std. Deviation 1.13 0.78 Total Mean 3.42 2.08 Std. Deviation 1.15 0.79 The researcher concluded that there was a significant negative correlation between job satisfaction and the level of education, and further that male respondents were slightly more satisfied (mean = 3.25) than females (mean = 3.47). Also, males had attended higher levels of education (mean=2.35) than female respondents (mean = 2.01). Subsequent discussions were based on these and similar results. The numbers as presented unfortunately do not prove any link between higher education and job satisfaction. To prove such a link, one needs to compare education levels of those satisfied with levels for those not satisfied, using tools such as contingency table analysis, chi-squared tests for association, ordinal regression, etc. The negative correlation coefficient between job satisfaction and level of education is a result of the arbitrary coding system employed and the value is also dependent on 69 coding scale. The application and interpretation of the mean. standard deviation. linear correlation coefficients and tests based on them such as t-tests. analysis of variance and regression are based on the assumption that the data are measured on a ratio scale rather than nominal or ordinal scale (such as number of years of education as opposed to education level, income as opposed to income category. etc.). In general, the appropriate tools for analysing data depend on the type of variable/data as well as the objective of the study (or research question). While concepts such as mean age, mean weight, mean number of years of education. have an inherent meaning, concepts such as mean marital status. mean level of satisfaction. mean level of education are virtually meaningless. Relating Survey Questions to Research Objectives Relating survey questions to research objectives requires a clear idea of the full meaning of issues to be investigated, the clarification of any ambiguous words or phrases in the stated objectives and the specification of the population to be surveyed. The type and source of information required in order to address the general aims and objectives as well as each of the specific objectives of the study must be properly specified from the onset. Operationally, it will usually be necessary to break down each main objective into sub-objectives (specific objectives) in which ambiguous terms are clarified. Each specific objective may itself be broken down into further sub-objectives. until the basic concepts and issues to be investigated have been clearly defined. A similar 'step- wise-refinement' approach can then be utilised to determine the information to be gathered. In this regard, the specific objectives could be rephrased in terms of research questions. Illustration 1 Consider a study whose main objective is to establish the needs of potential orphans from terminally ill patients in Botslmna. with a l'iell' to formulating appropriate policy procedures to assist the children. Key issues that must be resolved early on include: • Who is a terminally ill patient and where can thev be contacted') • What questions should the questionnaire contain to ensure that all relevant information is captured and unnecessarv information is avoided? • What age group of children should be included (the usual restriction to under 18 may be inadequate, since some children over 18 may still be in schoo!)? • How should the questionnaire items be phrased to avotd further distress to the family, who would naturally still harbour hopes that their loved one would recover? Illustration 2 Consider a study whose main objective is 10 i/ll'esligale Ihe causes of school dropout in Botswana. This main objective must be broken down into sub-objectives before any meaningful ~ttempt can be made at designing the relevant questions. Some specific objectives mclude: 70 1. To determine the 'causes' of dropout in Primary, junior secondary and senior secondary schools in Botswana. 2. To investigate if the 'causes' of dropout differ between the rural, semi-urban and urban areas. 3. To determine if there are gender-forces at play in school dropout. 4. To suggest recommendations on how to reduce/or eliminate school dropout. Key terms will also need clarification, for example: a) What does 'school dropout' actually mean and how would it be measured? • Does it mean those children of school going age not actually in school? • Is it to be the difference between the school enrolments at the beginning of the year and the enrolments at the end? • What about those children who transferred/changed schools during the year, or those who joined in the middle of the school year? b) Is classification of schools as 'rural', 'semi-urban' and 'urban' clear and unambiguous? c) How are possible 'gender forces' to be measured? Will differences between male and female dropout rates indicate gender forces? What about sex and or marital status of head of household? d) What other background factors might be related to causes of dropout that may assist in policy formulation? For example, might the causes of dropout be related to: • Parents' factors such as their occupation, socio-economic status, culture, religion, marital status, age, income level, etc.? • Child's factors, such as physical or psychological impairment, age at which the child started school, earlier abuse by adults, etc.? • School factors such as conflict between school discipline and home discipline, victimisation by peers and/or teachers, dislike of school food, times at which classes begin, etc.? e) Is the severity of the problem known or is it to be estimated? Does dropout rate vary widely from region to region? In order to investigate the first two main objectives, the data collection instrument (such as a questionnaire), must differentiate between the different school levels as well as between rural, semi-urban and urban schools. If 'gender forces' are to be measured using differences in male and female dropout rates as well as sex of head of the household, then appropriate variables must be included in the survey instrument to capture this information. Similarly, variables must be included in the instrument to capture information necessary for investigating each of the factors listed in (a) to (e). Illustration 3 Suppose that your department wishes to set up a Postgraduate degree programme during the next National Development Programme (NDP). As one of the 71 conditions for the University approving the programme. you must carry out a 'needs assessment'. The primary objective appears to be quite simple: To carry out a needs assessment of the proposed programme. But then: how is need to be defined? a) Does it refer to citizens expressing an interest in joining the programme (as proposed)? b) Is it the number of students that potential sponsors are prepared to support in a year? c) Is it the number of job vacancies requiring people with the given qualifications? d) Is it the existing pool of graduates who meet the entry requirements? e) Is it the projected enrolments at the undergraduate level? f) Is it the proportionate number of graduates from sister departments who are joining their own existing programmes? g) Is it the age-specific national populations? h) What University and/or Departmental factors (such as content of the proposed programme, previous experience in lower level courses in that department, faculty/university regulations, etc.) may influence interest in the programme? i) What external factors such as sponsorship, perceived market demand for graduates, perceived prestige of graduating from that department/university might influence demand? The appropriate definition will be arrived at through consultation with all stakeholders. More than one definition may even be adopted in order to satisfy different stakeholders. The survey population as well as the key questions to include in the survey instrument will depend on which of the above definitions are adopted. Data Types Vs Variable Types Data resulting from a survey are generally entered into a statistical package as cases and variables. In a survey, each questionnaire will produce one case if all information collected relate to a single respondent (such as the head of a household), or each questionnaire could produce two or more cases if some of the required information relates to more than one person (such as members within a household). The variables consist of specific responses to each question. Thus one questionnaire item may produce a single variable if it leads to a single response only, or one question could produce two or more variables, if it leads to two or more possible responses. I.n stati~tical analyses, variables are distinguishable in a number of ways dep<:ndl?g mamly on the type of values that they take, and sometimes on area of application. The two main classifications are qualitative (string) and quantitative (numeric) variables . . Quali~ative variables are also called categorical variables in the stati~tics literature, smce the underlying values that they take simply divide objects mto 72 different mutually exclusive categories (such as male/female; poor/average/good, etc.). These categories can be stored in the computer as they are (Le. as strings such as Male, Female, Good, Poor, etc.) or they can be coded into numbers and the numbers entered into the computer instead of their text equivalents. Non-numeric values take up a lot more computer memory than numeric values. Also typing text values such as (Male or Female) is much more time consuming and error prone than typing numbers (such as I and 2). Furthermore, data analysis programmes think of upper and lower case characters as representing different values. Hence the three string values 'male', 'Male', 'MALE' are taken to be different values. As such, the values of qualitative variables are usually given numeric codes (such as I for Male and 2 for Female), and these numbers are then entered into the computer in the place of the actual text values. This innovative method of dealing with an otherwise intractable problem has become a source of much confusion in data entry and analysis. Once data have been entered as numbers, analysis tools have no way of telling that these numbers are arbitrary. This is because data analysis programmes 'see' only data types and not variable types. The data type for a given variable is the way in which the values for that variable are represented in the computer. Typically, the data will be stored either as numbers (numeric), letters (alpha) or a combination of both (alphanumeric or string). When numeric codes are used to denote nominal and ordinal categories of a variable, the data type is then numeric, while the variable type is string. Appropriate tools for analysis apply to variable types which the researcher knows, and not the data types which the computer software accepts. The onus is on the researcher to know the type for each variable, and hence use the appropriate statistical tools for data analysis. Failing this, one risks computing meaningless summary measures such as average gender, average age group, average level of satisfaction, and the like, especially when using tools such as regression and analysis of variance, which implicitly compute and utilise these measures in order to provide the model parameters and significance. Types of Questions and Associated Variables Synonymous to variable classification is the classification of questions in a questionnaire. This section considers the most common classes of questions and the corresponding variable types that can be derived from such questions. Each type of question has been illustrated by examples from actual surveys. No attempt has been made to improve the phrasing of questions or the codes used for the values of the qualitative variables. The coding systems used for categorical variables is created only by numeric data types, since these are the most efficient data types for both data entry and analysis. Dichotomous or Yes/No Questions A dichotomous question is one that has only two possible answers. Such questions are also referred to as YES/NO questions. Examples "~;ld include the following: AI, H,,, yoo to '~" ;n the ",min. ""tion' (DRP. 1999) Yes No c=J A2. District Number I 73 S 1 Ngamiland West ,(Botswana, 1997) S2 NgamiiandEast A3. Sex (of Respondent), (Nkrumah,199S). 1. Male 2. Female A4. Men must share,in house work (IEC, 1997) .-:..' Agree _ Disagree AS. What was your final result last academic year? o 'Repeatedtheyear 1 Proceeded to, the next year A6: Health'Status (of potential orphan): H1V Positive /D (Orphans Form B) While some questions have a natural ..diehotom¥ in their response, such as At-A3, for some, 'the dichotOmy is only forced on the respondent by the stated options (A4, AS), while for 'Others,.there .is.no ordering at all (A6). Each dichotomous 'question, produces'. a' single nOmintll (dichotomous) variable atdata entry~ even when there is a naturatordering in the answers. such as in questions A4 and AS.' When using numbers to represent the values of a dichotomous variable, it is good practice to, use ,'0' •and '1" unless the" codes have a, particular meaning (as question A2).There are several advantages to this convention'during data analysis. Forexampte, the total number of'Yes'cases can be obtained just by summing the responses. Also, some statistical procedures such as linear regression will give fueaningfulresults only if the binary responses are recoded as 0 and 1 (also called dummy variables). Multlple-ch?ice Questions lVlth a Sln,"'lIerp~nse Here the respondent is provided with a Faftle'ot\~~ionsifromwRfe~. mHyond'opti~1,is, to beseleeted.The categories could be ftominafior'or'dt11at For nominal OItegorie5~Janyset;ofnumbers coUld be used to' representthe~ttgon'res.F()r •• $e 'of'dItEllntrYf.eael'l cat"lory 'skoutdbe "pre-eoded in the qUestion. at Age group: 1)15--20 2) 21----29 3)3()"';-39'4)4(495) S0+(FSS,I998) B2. What is the main s<>ureeoffoodofyour household? (CBPPI9J1) f Own production 1 Market ;purcbases 2 Government food rations 3 Wages in Kind 4 Gifts from Relatives S Other (specify) " 6 B3. How do you rate the state of refuse collection in your area? (Gaborone, 1997) Very good Good Satisfactory Poor Don't Know Usually, each multiple-choice question with a single response produces a single nominal or ordinal variable. If the question includes 'Other (specify)' category as in B2, then the specified responses can either be analysed qualitatively, or an additional string variable could be created, and treated as an open-ended question. It should be noted however, that open-ended questions are time consuming to complete, complicated to analyse, and usually contain much higher non-response rates than closed-ended questions. Question B I and B3 above lead to one ordinal variable each, while question B2 leads to a nominal variable. Note that the wording of B2 is crucial, if it is to lead to a single variable. It is possible, and indeed likely, that many respondents will have more than one source of food. By requesting just the 'main source', perhaps the researcher wanted to restrict attention only to one source. Had the question been phrased as - 'What are the sources of food for your household?' - then the question would have been a multiple-choice question with several responses, sometimes referred to as 'Check-All-that-Apply' questions. These are discussed in the next section. Multiple-choice Question with Several Responses A multiple-choice question with several responses is a collection of dichotomous questions under one main question. During data entry, each category produces a dichotomous (Yes/No) variable. Examples: C 1. Which of the following do you think can promote career guidance activities/services for youths in Botswana? Please put X in front of your response. (NCSS, 1998) I. Adequate career resources/materials _n__m n __n_n n_n __n __n __nmnn 2. Trained personnel in career services n_m_nn nm __nn __m nnn n 3. Upgrading career services/activities in schools _mn m m m _ 4. Establishing of Career Centres for youth in the regions-mm-mn-----n-----n 5. Other (speci fy)__n nnn n __n __nn __n __nn nn n nn _ C2. Which of the following languages do you speak at home? (MPSAQ. 1998) I 3 English I ~etswana Kalanga 75 C3. Since 1966, several elections have been held in Botswana in every 5 years. Please indicate whether or not you voted. (DRP, 1999) 1965 1969 1974 1979 1984 1989 1994 C4. Identify all the correct ways you know on how to avoid pregnancies (please tick every way you know) (lEC, 1997) 1. Non-Penetratin2 Sex 9. Male Sterilisation 2. Condoms 10. Female Sterilisation 3. Withdrawal II. Full Abstinence 4. IUD 12. Periodic Abstinence 5. Iniectibles 13. Use Traditional Doctor 6. Diaohra2mlFoam/Jellv 14. Morning After Pill 7. Pills 15. Home Preparation 8. Abortion 16. Have sex with a Virgin Question Cl from the Career Services study is equivalent to 4 different Yes/No questions on whether each of the specified items 1-4, can promote career guidance, and a fifth open-ended question about which other activity could promote such services. The question thus leads directly to 4 nominal variables: QCla 'Adequate Career resources can promote career guidance'; QC 1b 'Trained Person in career services can promote career guidance', and similarly, for QClc and QCld. The string variable QCle 'Other factors that can promote career guidance' should be defined if many respondents select the 'Other (specify)' category. Question C2 from the Media Profile and Situation Analysis Questionnaire (MPSAQ) leads to 3 dichotomous variables: QC2a 'Speaks English at home'; QC2b 'Speaks Setswana at home'; QC2c 'Speaks Kalanga at home'. Question C3 from the University of Botswana 1999 Democracy Project questionnaire leads to 7 nominal 'Voted/Did not vote' variables corresponding to the seven different elections. However, on closer examination this question should really have three categories (1: 'Voted', 2: 'Did not vote' and 3: 'Was not eligible'). The added category is needed to distinguish those people who were eligible and did not vote, from those who were not eligible due to some restrictions, such as age. Question C4 from the Information, Education and Culture (IEC) study appears to be a collection of 16 Yes/No questions, which could be represented by 16 76 dichotomous variables. However, the purpose of the question was to assess the knowledge level of respondents about methods of preventing pregnancies rather than measuring how many respondents answered yes/no to a particular category. So ultimately, what is required is an indicator of how knowledgeable the respondents were. For data entry purposes, the 16 nominal (Yes/No) variables (QC4 1 _ QC4_16) should be defined. During data analysis, however, two indicator vari;bles need to be created to measure the knowledge level of the respondents. These new variables could be: QC4A: 'Number of correct responses' and QC4B 'Number of incorrect responses' Without loss of generality, let us suppose that QC4_1 to QC4_1 1 represent the correct ways, and QC4_12 to QC4_16 represented the incorrect ways of preventing pregnancies. Then the number of correct responses, QC4A would be obtained by counting the number of 'Yes's' among QC4_1 to QC4_1 I, while QC4B would be obtained by counting the number of 'Yes's' among QC4_12 to Q26_16. This counting would be greatly facilitated if the values of QC4_1 to QC4_16 were pre- coded with 1 used to represent 'Yes' and 0 used to represent 'No' respectively. Ranking Questions A Ranking question is a multiple-choice type question in which the respondent ranks the given options. The range of values is equal to the number of options. Unlike the usual multiple-choice questions, the answer to each of the categories of a rank question are inter-dependent. Firstly, one needs to see all the options, before deciding on which is the best, and which is the worst. Secondly, tied ranks are usually not allowed, so that once one category has been given a particular rank, no other category is allowed be given the same rank. Hence there should be only 1 tick in each row, and in each column of the table. D 1 Rank the following social activities according to your interest (I =best, 2=2nd best, etc.) Rank Activity 1 2 3 4 5 6 1. Eatinl! Out 2. Going to the Movies 3. Going to Nite clubs 4. Attending house parties 5. Watchinl! TV 6. Recreational sports (Soccer, sQuash, etc.) There are two ways of going about a ranking question. If greater emphasis is on the categories, rather than ranks, then each category will generate an ordinal variable. Hence in question DI, the 6 ordinal variables will be: QD1_1 '(rank of) Eating out' QDI_2 'Going to movies' up-to QD1_6 'Recreational sports'. 77 If emphasis is on determining which activities are ranked first. which are ranked second, etc., then each rank will generate a nominal variable with six categories, representing the six activities, that is: I 'Eating Out"; 2 'going to movies' ... '; 6 'Recreational sports'. The six (6) nominal variables are: QDla 'Most preferred social activity'; QD I b 'Second most preferred social activity'; up-to QDlf 'Least Preferred Social Activity'. Grid Questions Grid Questions are essentially a collection of multiple-choice questions into a single question. Each category in the grid is equivalent to a multiple- choice question (either with a single response or with several responses). Examples are as follows; E I. Which type of music do you like/dislike hearing on the radio? O=Hate it; I=Dislike it; 2=Like it; 3=Like it a lot; 4=Love it!; 9=D;m~tKmwit (RLS,1998) Tvpe of Music 0 1 2 3 4 9 I. R&B 2. Gospel 3. Traditional 4. Reggae 5. Hip HOP 6. Jazz 7. CHOIR 8. AfiuTL ,I_~ .Nlma ctc.) 9. De Gong (Thebe etc.) 10. Kwaito II. Kwasa Kwasa 12. Other (specify) Question E I combines II multiple-choice questions, each with a single response into a single question. Each of the I I questions then leads to a single ordinal variable: QEI_I 'Extent to which respondent would like to hear R&B on the radio'. up-to QE 1_I I 'Extent to which respondent would like to hear Kwasa Kwasa on the radio'. th The 12 category (other) is equivalent to asking respondents to list any music they would like/dislike, and then specify to what extent they would like/dislike hearing the music. This can lead potentially to many multiple response questions. [n general however, most respondents will leave this option blank and hence the few responses can be analysed qualitatively. Note the inclusion of the last category (Do not know). This ensures that each respondent answers each of the questions. Some research questions in this particular study required an analysis involving the music that respondents liked and those that they disliked. The required 78 variables can be computed from QE I_I to QE I_II. Unlike for rank questions, it is not appropriate to define just 4 variables, each having 12 possible categories corresponding to the different types of music: R&B, Gospel, etc.: QEI_A 'Music that Hate' QEI B 'Music that Dislike' QEI_C 'Music that Like' QEI D 'Music that Like a Lot' This is because a variable can have only one value per unit of analysis in the database, whereas for each of the variables QEI_A to QEI_D, each respondent can specify more than one answer. Faced with such responses, some researchers have resolved to enter values such as 137, etc. to indicate that the respondent had selected options I, 3 and 7 for the given variable. Others resort to more fancy/ingenious methods of recoding, but in my experience, none of these approaches is ever adequate beyond frequency analysis of data. The second example (E2) of a grid question is taken from a 1996 survey of teachers about a proposed curriculum in Population/Family life Education (POP/FLE). Here each of the 10 topics constitutes a multiple-choice question with several (I to 8) possible responses. E2. If a Policy is made to include all the topics listed (a-k) below in the POP/FLE curricula circle what problems listed (1-8), you would have in teaching? I None 5 Against Religion 2 Lack of teaching Materials 6 Opposition from parents 3 Lack of adequate Knowledge 7 Opposition from religious leaders 4 Against Culture 8 Other (specify) (POP/FLE,1996) a. Population growth and development I 2 3 4 5 6 7 8 b. Economic and Social development I 2 3 4 5 6 7 8 c. National Resources and environment I 2 3 4 5 6 7 8 d. Marriage, family life and welfare I 2 3 4 5 6 7 8 e. Pregnancy and birth control I 2 3 4 5 6 7 8 f. Human reproduction I 2 3 4 5 6 7 8 g. Social-Cultural factors influencing I 2 3 4 5 6 7 8 sexual development and sexual life h. Social responsibility I 2 3 4 5 6 7 8 i. Problems and issues relating to I 2 3 4 5 6 7 8 sexual conduct i.e. undesired pregnancy, STDs, abortion, HIY/AIDS j. Gender Issues-Traditional and I 2 3 4 5 6 7 8 changing social roles and relationships between men and women K Other (specify) I 2 3 4 5 6 7 8 79 This grid is equivalent to IIx8 = 88 nominal (Yes/No) variables, corresponding to the II questions (a to k) and the eight (8) yes/no responses to each question. The following variables could be defined: QEZal_1 'There will be No problem in teaching if Population growth & development is included' QEZal_Z 'We shall experience lack of teaching materials if Population growth & development is included' Until: QEZL7 'We shall experience opposition from religious leaders if gender issues are included' QEZk_8 'We shall experience other problems if other specified issues are included' Any other method of representation will lead to loss in information, or analysis problems. For example, treating each of the question a-k as a multiple-choice question with a single response will not work. This is because each respondent can select more than one reason from lack of teaching materials to 'other'. During data analysis, different multi-response sets can be generated, depending on the questions of interest. Furthermore, if it turns out that no respondent listed more than 3 problems say, then 4 multi-response sets can be defined for 'No Problem', I S\ Znd and 3rd problems, each containing 10 categories corresponding to each of the 10 policies. If on the other hand, interest is to see how each problem (say lack of teaching materials) affects the different policies, then 6 multi-response variables corresponding to the 6 problems: a) lack of teaching materials, b) Lack of adequate knowledge, and so on, could be defined. Each of the 6 variables will also contain 10 categories corresponding to each of the 10 policies. Combination Questions Combination questions arise when more than one type of response is required from each question. Consider question FI (below) from a survey of Hotel users in Botswana. Fl. Which hotels do you stay in most often? (Under the rank column, record I for the first choice, 2 for the second choice and so on). Then for each hotel ranked, indicate what was your main reason for choosing this hotel. 80 Main Reasons for choosing hotel (Please record the appropriate number next to each hotel) J =Cost 2=Service 3=Availability 4=Reputation 5=Membership 6=Company choice 7= Other BWHU,1997) Hotel Rank Main reason Bosele Hotel-Best Western Botsalo Travel Inn City Lodge Gaborone Sun Gaborone Travel Inn Grand Palm/Sheraton Maransz Morninp; Star Mowana Safari Lodsze Oasis President Rilevs-Best Western Sedia Tati-F/Town Thaoama Cresta Lodsze This question is really a combination question comprising of what looks like a rank question, and what looks like a multiple-choice question with several responses. Note that each respondent would have stayed in anything between 1 and all 16 hotels. He/she will rank only those hotels that they have stayed in, and give corresponding reasons why they chose that hotel. That is why the resulting questions are not exactly rank and multiple-choice questions. The general approach to organising the resulting data is to define two sets of variables; one set for the information on ranks and the other set for the 'reasons'. For example, to collect information on the hotel rankings, define 16 rank-type questions corresponding to the 16 hotels, each with ranks 0, I, 2,..,k, where k is the number of hotels that the respondent has stayed in (lc:;16) and 0 indicates that the respondent has not stayed in that hotel. For instance, you could have: QFlal_1 'Rank of Bosele Hotel Best Western' QFlal_2 'Rank of Botsalo Travel Inn': QFlal_16 'Rank ofCresta Lodge' QFla2_1 'Main Reason for choosing Bosele Hotel Best Western' QFla2_2 'Main Reason for choosing Botsalo Travel Inn' up-to QFla2_16 'Main Reason for choosing Cresta Lodge' To organise information on the reasons given, define 16 nominal variables, one for each hotel, and having 8 categories: 1 = Cost; 2 = Service; until 7 = 'Other'; 8 'Combination of reasons'. For this particular study, it turned out that most people ever stayed only in a few hotels (rarely more than 4). Furthermore, the consultants were interested only in the 81 top three ranked hotels. Thus an alternative approach was to define 3 nominal variables: QFla: 'Top Ranked hotel' QFlb: 'Second ranked hotel' and QFlc 'Third ranked hotel'. These are nominal variables, with possible categories: 1= 'Bosele Best Western', 2 'Botsalo Travel Inn' ... ,16 = 'Cresta Lodge'. In addition, QFI band QFc will include a 17th category; 0 'Not Applicable' for those who only stay in one or two hotels respectively. Open-Ended Questions Survey instruments in the so-called qualitative research studies in social science are usually dominated by open-ended questions. Experts in qualitative data analysis can best defend the merits of using open-ended questions to gather information. For large scale surveys or surveys in which the use of statistical packages for data analysis is envisaged, open-ended questions should be avoided. Although great improvements have been made in the ability of statistical packages to handle open-ended questions, any advantages in using open-ended questions may be outweighed by numerous problems from data gathering to data analysis. Typical problems include: • the time spent to respond to, and to record the response to the questions during data collection, • the increased likelihood of misunderstanding the question, which may lead to inappropriate responses or non-response, • the time taken to capture the information into computer, and likely errors, • the difficulties in organising the responses into meaningful categories and the corresponding costs on data analysis. Thus as much as possible, open-ended questions should be avoided in large-scale surveys. Flexibility could be added by providing, as the last category of a close-ended quantitative question, the category: Other (specify) . The appropriate categories to any qualitative question could be obtained through a pilot study and literature review. Consider the following questionnaire from a 1998 Media Profile and Situation Analysis (MPSAQ) questionnaire: GI. Name three types of TV programmes that you watch regularly: When the data were recorded as three string variables: 1 QGla: '1. programme'; QGlb '2nd Programme'; QGlc '3rd programme'; There were 50 different responses for QG la, 51 responses for QG I band 40 responses for QGlc. These included duplicates such as 'COMDIES', 'COMEDIES' and 'COMEDY', and similar programmes like 'FILMS' and 'MOVIES'; 'LADUMA', 'MABALENG'; 'SPORTS' and 'SOCCER', etc. For the purpose of the study, these fine distinctions were unnecessary. It took several hours of analysis, meetings and computation to recode the variables into 10 categories: 82 I 'Chat Shows'; 2 'Comedies'; 3 'Documentaries" 4 'Drama" 5 'Music'; 6 'Movies'; 7 'News'; 8 'Soaps'; 9 'Sports'; 10 :Others'. If a pilot study had been undertaken, a close-ended multiple-choice question with the above options could have been used, and some time and money saved in data capture and coding. In general, it is advisable to always carry out a pilot study, even for small-scale studies. The size of the questionnaire and the diversity of the respondents should guide the size of the pilot study. A large questionnaire or a large- scale study would require more interviews to reveal all the shortcomings. On the other hand, a questionnaire with only a limited number of items would require only a small number of respondents to validate it. Conclusion An attempt has been made to address the key issues in the organisation of survey data to facilitate proper data capture and analysis using statistical packages. The choice of the issues covered in the paper is based on the demand from numerous researchers that I have interacted with over the years. The issues are particularly relevant for researchers who wish to collect primary information for research in new areas such as HIV/AIDS, were standard instruments are not yet fully developed. Indeed Forms A, B, C and D currently used in Botswana for collecting information in this area need much revision and standardization. Any researcher wishing to undertake this novel responsibility should benefit from the issues raised and the advice provided in this paper. No doubts that not all areas of concern to survey-researchers have been address. To do so requires an entire book, for which this article can only form a chapter. The suggestions provided have been tested on many research studies and have been shared with many colleagues and students involved in survey studies. Very positive reviews have also been obtained from researchers on environmental science and biology/ecology involved in conducting and analysing field experimental studies. Bibliography Czaja R. and Blair J., (1996), Designing Surveys. A guide to decisions and procedures, Pine Forge Press: London. Freund 1. E., Williams FJ and Perles B. M., (1993), Elementary Business Statistics. Prentice-Hall: New Jersey. Stevens S.S., (1951), Handbook of Experimental Psychology, John Wiley and sons, Inc.: New York. Varkevisser C. M., Pathmanathan I. and Brownlee A., (1995), Designing and Conducting Health Systems Research, (unpublished training guide) IDRC,Canada Questionnaires Referred to Botswana (1997), Monitoring the effects of the CBPP eradication in Ngamiland. Ministry of Agriculture, Botswana BWHU (1997), Best Western Hotels users survey. Momarketing Options, Botswana DRP (1999), The 1999 Democracy Research Program Questionnaire, University of Bot5":ana. FSS (1998), A Bachelor of Social Science Project Questionnaire, (personal ~ommumcatlOn). Gaborone (1997), Gaborone City Council Survey of Refuse Collection ServIces.10. Gaboron~. IEC, (1997), The 1997 Information Communication and Education Survey, Mmlstry of Fmance, Botswana. 83 MPSAQ (1998), The 1998 Media Profile and Situation Analysis Questionnaire. Ministry of Finance, Botswana NCSS, (1998), The 1998. National Careers Services Study, Department of Non-formal Basic Education, Botswana. POPIFILE (1996), Population and Family Life Education Opinion Survey, (personal communication). RLS (1998), Radio Listeners Survey. Private radio Consortium, Gaborone. • I wish to acknowledge all those peers and editors who assisted in reviewing this paper. 84