Please complete this problem set in R and a word processing program, including graphics and tables where appropriate. The problem set is due by email ([email protected]) at midnight on March 30, 2015. Please attach both the problem set answers and the R code file to the email.

I have provided tips for the necessary R code where appropriate, and additionally remind you that everything in this problem set appears in the code files for the lectures. Feel free to use optional arguments to the functions to do things like customize graphics, if you like. Some arguments will be necessary, e.g. <function>(<object>, na.rm=T) where R cannot calculate a statistic because of missing data (NAs).

Please feel free to ask for feedback as you make progress on the homework. My goal is for everyone to receive full credit and understand the material.

1 Preamble and Data

First, download the data [country2008] from D2L, and save it in the folder you will designate as your working directory. Second, begin your R code, starting with the preamble, and then setting your working directory and loading the data:

## <name>
## <email>
## <date>
##
## Georgia State University
## POLS 3800-<section>, Introduction to Research Methods
## Problem Set 1

## Load libraries
library(<package>)

## Set working directory
setwd("<path to folder>")

## Load data
dat <- dget("country2008")
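As a rough illustration of what dget() does: it reads back an R object that was previously written to a text file with dput(). The sketch below uses a small made-up data frame and a temporary file; both are assumptions for illustration and have nothing to do with the contents of country2008.

```r
# Hypothetical toy data frame; NOT the country2008 data
toy <- data.frame(country = c("A", "B", "C"),
                  pop = c(1200, 56000, NA))

# Write the object to a text file, then read it back with dget()
path <- file.path(tempdir(), "toy_example")
dput(toy, file = path)
toy2 <- dget(path)

# Check that the object read back matches the original
all.equal(toy, toy2)
```

In the problem set itself you only need the dget() step, since the data file has already been created for you.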

2 Summary Statistics

2.1 Overview

Examine a broad summary of your data [dim(<data object>), summary(<data object>)].

1. How many observations are in the dataset? What are the units of observation? Is the data cross-sectional, time-series, or both (TSCS)? How many variables are there?

2. What sorts of summary statistics does summary(<data object>) provide? Why do some variables have different statistics from others?
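A minimal sketch of the two overview functions on a made-up data frame may help before you run them on the real data. The variable names and values below are hypothetical; the point is only that summary() reports different statistics depending on the type of each variable.

```r
# Hypothetical toy data; names and values are made up for illustration
toy <- data.frame(
  region = factor(c("europe", "asia", "asia", "africa")),
  pop    = c(5400, 130000, NA, 22000)   # in 1,000s, with one missing value
)

dim(toy)      # number of rows (observations), then number of columns (variables)
summary(toy)  # counts for the factor; quartiles, mean, and NA count for the numeric
```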

2.2 Measures of Central Tendency and Variation

1. How many observations are for European countries [dat$europe]? North and South American [dat$americas]? Asian [dat$asia]? African [dat$africa]? Present your answer as a one-way frequency table; use either table(<variable>) for each variable, or summary(<data set>) to find the sums. What kind of variable (in terms of level of measurement) is each regional indicator? If one drew an observation at random from the data, from what region is it most likely to come? What measure of central tendency did you use to determine this?

2. What is the mean [mean(<data object>)] population [dat$pop, in 1,000s] of the countries in the data? The median [median(<data object>)]? Are those two answers the same for this data, or different? Why? (Note that if a variable has missing data, the function will return NA unless you include the argument na.rm=T, e.g., mean(<data object>, na.rm=T); this applies to almost all the summary statistic functions.)

What happens if one takes the trimmed mean of the population variable, dropping the two smallest and two largest observations? Use:

mean(sort(<data object>)[-c(1,2,length(sort(<data object>))-1,length(sort(<data object>)))], na.rm=T)

which sorts the variable from smallest to largest, excludes the first two and last two observations, and takes the mean while ignoring missing data.

What happens if you take the logarithm of the population variable [mean(log(dat$pop))]? Are the mean and median closer to each other? Plot and include in your answer two histograms [1] of the variable, with and without a log-transformation:

[1] Tip: "portable network graphics," i.e. .png files, are a useful format for images because such files can be scaled in a document without suffering as much loss of resolution.

png("pophist1.png",width=600,height=480) # saves a graphic to the working directory
hist(<data object>)
dev.off() # closes the graphical device

png("pophist2.png",width=600,height=480)
hist(log(<data object>))
dev.off()

What is the interquartile range of the population variable [IQR(<data object>)]? Create, include, and interpret a box-whisker plot in your answer for both the normal and transformed variable:

png("popboxwhisk1.png",width=480,height=600)
boxplot(<data object>)
dev.off()

png("popboxwhisk2.png",width=480,height=600)
boxplot(log(<data object>))
dev.off()

Are there outliers in either plot? How is an outlier defined in a box-whisker plot?

3. What is the variance [var(<data object>)] of the per capita income variables for men and women [dat$income.m and dat$income.f]? The standard deviation [sd(<data object>)]? The standard error? (For this last statistic you will need to divide the standard deviation by the square root of the number of non-missing observations [length(which(!is.na(<data object>)))].) The data here is a sample. What is the 95% confidence interval for the sample mean [c(mean(<data object>)-qnorm(0.975)*<standard error>, mean(<data object>)+qnorm(0.975)*<standard error>)]? The 90% confidence interval?
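The calculations in this section can be sketched on simulated data before you apply them to the country variables. Everything below is an assumption for illustration: x is a made-up right-skewed variable with one missing value, not dat$pop or the income variables, and the confidence intervals use the normal approximation given in the formula above.

```r
set.seed(1)                                      # for reproducibility
x <- c(exp(rnorm(50, mean = 9, sd = 1.5)), NA)   # simulated skewed values plus one NA

mean(x, na.rm = TRUE)     # sensitive to the largest values
median(x, na.rm = TRUE)   # less sensitive to the long right tail

# Trimmed mean: sort() drops NAs, then exclude the two smallest and two largest
xs  <- sort(x)
n_s <- length(xs)
mean(xs[-c(1, 2, n_s - 1, n_s)])

# The same statistics on the log scale
mean(log(x), na.rm = TRUE)
median(log(x), na.rm = TRUE)

# Standard error of the mean, using only non-missing observations
n  <- length(which(!is.na(x)))
se <- sd(x, na.rm = TRUE) / sqrt(n)

# Confidence intervals: lower bound subtracts, upper bound adds
m    <- mean(x, na.rm = TRUE)
ci95 <- c(m - qnorm(0.975) * se, m + qnorm(0.975) * se)
ci90 <- c(m - qnorm(0.95) * se,  m + qnorm(0.95) * se)
ci95
ci90
```

Note that the 90% interval uses qnorm(0.95) because 5% of the distribution is left in each tail, so it is narrower than the 95% interval.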

3 Simple Hypothesis Tests

1. Examine male per capita income [dat$income.m]. Can we reject the null hypothesis (given a significance level α = 0.05) that the population mean for male per capita income is $12,000? $11,500? $11,000? Use the t.test() function to answer this question. What if our significance level is 0.01?

2. What is the average (mean) of the mean age of marriage for men in the data [dat$mean.marr.m]? For women [dat$mean.marr.f]? Do a difference-in-means hypothesis test to determine the probability of observing a sample difference between mean marriage ages for each sex at least as large as the one in the data, given a null hypothesis that the mean marriage ages are the same (h0 : µm = µf). What is the alternative hypothesis (hA) being tested here? Is the test one-tailed or two-tailed?

3. Define a p-value.
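For reference, here is a minimal sketch of how t.test() can be called for both kinds of test in this section. The data are simulated and the means, standard deviations, and sample sizes are made up; only the function calls carry over to the problem set.

```r
set.seed(2)
# Simulated data; all values are hypothetical
income <- rnorm(100, mean = 11800, sd = 3000)

# One-sample t-test of H0: population mean = 12000 (two-sided by default)
res <- t.test(income, mu = 12000)
res$statistic   # t statistic
res$p.value     # compare with the chosen significance level
res$conf.int    # 95% confidence interval by default

# A 99% interval corresponds to a significance level of 0.01
t.test(income, mu = 12000, conf.level = 0.99)$conf.int

# Two-sample difference-in-means test of H0: the two means are equal
age_m <- rnorm(80, mean = 27, sd = 3)
age_f <- rnorm(80, mean = 24, sd = 3)
t.test(age_m, age_f)   # Welch test, two-sided by default
```

The alternative argument of t.test() ("two.sided", "less", or "greater") controls whether the test is one-tailed or two-tailed.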

4 Measures of Correlation

4.1 Categorical Variables

Download the Titanic data set from D2L and load it [2]:

dat2 <- dget("titanic")

1. Create a two-way frequency table [table(<data>)] with the sex [dat2$Sex] and survival [dat2$Survived] variables; include the table in your answer to the problem set. Can we reject the null that sex and survival were not associated (h0 : πM = πF)? How can you tell? Use both the chisq.test(<table>) and prop.test(<table>) functions. Have we proven that sex and survival were associated?

2. Create a two-way frequency table with the class [dat2$Class] and survival [dat2$Survived] variables; include the table in your answer to the problem set. If our alternative hypothesis is that class affects the probability of survival, what is the null hypothesis? What is the test-statistic here? How can we interpret the p-value?
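The two test functions can be tried first on a small made-up table. The counts below are hypothetical and are not the Titanic data; the sketch is only meant to show what each function expects and returns for a 2x2 table.

```r
# Hypothetical counts, NOT the Titanic data
tab <- matrix(c(30, 70,
                60, 40),
              nrow = 2, byrow = TRUE,
              dimnames = list(group = c("A", "B"),
                              outcome = c("yes", "no")))
tab

# Chi-squared test of independence (continuity-corrected by default for 2x2 tables)
chi <- chisq.test(tab)
chi$statistic
chi$p.value

# prop.test() treats column 1 as "successes" and column 2 as "failures"
pt <- prop.test(tab)
pt$estimate   # proportion of column-1 outcomes in each row
pt$p.value
```

Because prop.test() reads the first column as successes, check the column order of your own table before interpreting the estimated proportions.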

4.2 Continuous Variables

1. Create a scatterplot [plot(<data object 1>,<data object 2>)] with the country variable for female life expectancy [dat$lifexp.f] on the x-axis and the variable for mean marriage age for women [dat$mean.marr.f] on the y-axis; include the plot in your answer. Does it appear that the two variables are related in some way? If so, describe the apparent relationship. What is the correlation coefficient (r) [cor(<data object 1>,<data object 2>, use="pairwise")] for these two variables? Interpret the coefficient.

2. Create a scatterplot with a smoothed line fitting the data [scatter.smooth(<data object 1>,<data object 2>)] for female literacy and the proportion of the labor force that is female [dat$laborforce.f]. What is the correlation coefficient for these variables? Is it a useful statistic here? What does it mean that the relationship between female literacy and the proportion of the labor force that is female is somewhat nonlinear?
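To see why the correlation coefficient can be a poor summary of a nonlinear relationship, consider the following sketch on simulated data. The variables are made up and deliberately much more curved than anything in the country data; loess.smooth() is used here only because it returns the points of the curve that scatter.smooth() would draw, without opening a graphics device.

```r
set.seed(3)
# Simulated data with a strongly curved relationship and two missing values
x <- runif(100, min = 0, max = 10)
y <- (x - 5)^2 + rnorm(100, sd = 2)
y[c(4, 17)] <- NA

# Pearson's r using only pairs where both values are observed
r <- cor(x, y, use = "pairwise")
r

# Points on the smoothed curve, computed from complete pairs only
ok <- complete.cases(x, y)
sm <- loess.smooth(x[ok], y[ok])
head(cbind(sm$x, sm$y))
```

Calling scatter.smooth(x[ok], y[ok]) would plot the points with this smoothed line added.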

[2] R may not accept typographic (curly) quotation marks; if you copy and paste code from this document into your code file, you may need to replace them with straight quotation marks.

 
