Quantitative Methods in Geography: 2015

Wednesday, April 29, 2015

Regression Analysis

Part 1

Study Question: A study on crime rates and poverty was conducted for Town X. The local news station got hold of some data and claimed that as the number of kids who get free lunches increases, so does crime. You feel this might be a bit silly, so you run a regression with the given data. Using SPSS to do your regression, determine if the news is correct. What percentage of persons will get free lunch with a crime rate of 79.7? How confident are you in your results? Provide the data and a response addressing these questions.

Null Hypothesis: There is no linear relationship between the number of kids getting free lunch and the crime rate.

SPSS Results:

Analysis: Associating the number of kids who get free lunches with the crime rate seems rather abstract and unlikely to hold. However, the regression analysis contradicts that perception. In this instance the null hypothesis is rejected, because there is a statistically significant linear relationship between the number of kids getting free lunch and the crime rate. Furthermore, based on the R value, which indicates correlation, the linear relationship is a positive one. This means the news is correct in the sense that as the number of kids who get free lunches increases, so does the crime rate, although the regression shows an association and not that one causes the other.

Despite there being a linear relationship and a positive correlation, the relationship is not very strong. This statement is based on the coefficient of determination (R squared), which is equal to 0.173. That means only about 17.3% of the variation in crime rate is explained by the percentage of kids receiving free lunch, so although a linear relationship exists, it is a weak one.
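As a quick check on what an R squared of 0.173 implies, the short sketch below converts it back to a correlation coefficient and to a share of explained variation. It assumes a simple one-predictor regression, where r is the square root of R squared and takes the sign of the slope (the slope of 1.685 in the regression equation is positive). The variable names are only illustrative.

```python
import math

r_squared = 0.173   # coefficient of determination reported in the analysis
slope_sign = 1      # assumption: positive, because the fitted slope (1.685) is positive

# In a one-predictor regression, r is the signed square root of R squared.
r = slope_sign * math.sqrt(r_squared)

explained_pct = r_squared * 100          # variation in crime rate explained by the model
unexplained_pct = (1 - r_squared) * 100  # variation left unexplained

print(f"r = {r:.3f}")                             # r = 0.416
print(f"explained = {explained_pct:.1f}%")        # explained = 17.3%
print(f"unexplained = {unexplained_pct:.1f}%")    # unexplained = 82.7%
```

A correlation of roughly 0.42 is positive but modest, which matches the reading that most of the variation in crime rate is left unexplained by free lunch percentage.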

If the crime rate is 79.7 what percentage of kids will get free lunch?

Regression equation: y = a + bx

y = dependent variable (Crime Rate)

x = independent variable (Percent Free Lunch)

79.7 = 21.819 + 1.685x

x = (79.7 - 21.819) / 1.685 = 34.35%

If the crime rate is equal to 79.7, the percentage of kids who will get free lunch is 34.35%.
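The arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the rearrangement x = (y - a) / b using the intercept and slope from the regression equation; the function names are my own and are not part of the SPSS output.

```python
# Coefficients taken from the regression equation above.
intercept = 21.819  # a
slope = 1.685       # b


def predict_crime_rate(pct_free_lunch):
    """Forward form of the equation: y = a + b * x."""
    return intercept + slope * pct_free_lunch


def free_lunch_for_crime_rate(crime_rate):
    """Rearranged form of the same equation: x = (y - a) / b."""
    return (crime_rate - intercept) / slope


x = free_lunch_for_crime_rate(79.7)
print(round(x, 2))  # 34.35
```

Plugging the result back into predict_crime_rate returns 79.7, which confirms the rearrangement.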

I am not very confident in this prediction because of the coefficient of determination. The R squared value is so low, indicating a weak linear relationship, that the prediction is unlikely to be accurate.

Part 2

Introduction

Regression analysis is used to investigate the relationship between two variables. The goal of regression analysis is to be able to predict the value of one variable from the other. In essence, regression analysis estimates how much a dependent variable changes as an independent variable changes; on its own it does not prove that one causes the other. For this exercise the University of Wisconsin System is under investigation. The purpose of the exercise is to analyze the enrollment numbers of the UW System schools and explore why students choose the schools they end up attending. This will be done through regression analysis of enrollment at specific UW schools against a few select variables.

Methods

The variables assessed for a relationship with UW school enrollment are: percent with a bachelor's degree for each county in Wisconsin, median household income per county, and the distance of each county from each school. Performing regression analysis on these variables against the enrollment of a specific school helps determine which variable most influenced students in choosing to attend certain schools. The regression analysis was performed in the SPSS statistics software.

For this exercise the task was to select two of the UW System schools and run a regression analysis on each against the three variables. The results of the regression show which variables have a significant relationship with enrollment at that particular school. After the variables were analyzed and the significant variables were established, the residuals of the results were mapped. A residual is the difference between an observed value and the value predicted by the trend line. The standardized residuals mapped here work like z-scores: they express how many standard deviations a specific piece of data lies above or below the trend line. Thus, the further the value is from zero, the more of an outlier it is, and the more of an outlier it is, the more interesting it becomes for analyzing the data.
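To make the idea of a residual concrete, here is a minimal sketch in plain Python that fits a least-squares trend line and expresses each point's residual in standard-deviation units. The data are invented purely for illustration (they are not the Wisconsin county data), the function names are my own, and dividing by the standard error of the estimate with n - 2 degrees of freedom is an assumption about how the standardized values were produced.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def standardized_residuals(xs, ys):
    """Residual (observed minus predicted) divided by the standard error of the estimate."""
    intercept, slope = fit_line(xs, ys)
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    n = len(xs)
    # Two parameters (intercept and slope) were estimated, hence n - 2.
    std_error = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
    return [e / std_error for e in residuals]


# Hypothetical data: the last point sits well above the general trend.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 20.0]

z = standardized_residuals(xs, ys)
for x, value in zip(xs, z):
    print(x, round(value, 2))
```

Positive values mean the observation is higher than the trend line predicts and negative values mean it is lower, which is how the colors on a residual map are read.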

In order to figure out how much of a factor distance is in a student's choice of school, the distance of each county from each school is analyzed. However, this variable must be normalized; otherwise it will give misleading results. Without normalization, the results are skewed by the counties that have a large population. The counties within Wisconsin that contain the large cities are most likely going to send the highest number of students to every school, so it is important to normalize the distance variable by the population of the counties, so that the effect distance has on a student's decision can truly be assessed.

Results

Regression analysis was run in SPSS to determine which variables are significant for each school. The two schools I chose to analyze are UW-Eau Claire and UW-Superior. Once a variable was found to be significant, its residuals were mapped to establish which counties are the largest outliers.

Figure 1: SPSS regression results for UW-Eau Claire and the bachelor's degree variable.

Figure 2: Map of the residuals for the bachelor's degree variable and UW-Eau Claire.

Figure 3: SPSS regression results for UW-Eau Claire and the distance from each county variable.

Figure 4: Map of the residuals for the distance from each county variable and UW-Eau Claire.

Figure 5: SPSS regression results for UW-Superior and the distance from each county variable.

Figure 6: Map of the residuals for the distance from each county variable and UW-Superior.

Discussion

UW-Eau Claire and Percent with Bachelor's Degree per County

The first thing that is noticeable when assessing the relationship between UW-Eau Claire enrollment and the percent with a bachelor's degree per county is the lack of a strong coefficient of determination. Although there is a significant linear relationship between the two variables, the weak coefficient of determination does not provide great confidence that students go to UW-Eau Claire because they come from counties with high percentages of people with bachelor's degrees. Perhaps a more telling relationship would be seen between the overall UW System and the counties with high percentages of people with bachelor's degrees, rather than just UW-Eau Claire. That analysis would examine the commonly held belief that people who come from well-educated areas are more likely to seek out higher education. As for the results based on UW-Eau Claire alone, it appears students are more likely to go to Eau Claire if they come from a more highly educated county, but that does not entirely explain why they chose Eau Claire over the other UW schools.

UW-Eau Claire and Distance from each County

On the other hand, not only does the distance from each county have a linear relationship with the number of students that attend UW-Eau Claire, it also possesses an extremely high coefficient of determination. This signifies that the relationship between the two variables is not only significant, but can also be used to make predictions with reasonable confidence. It indicates that one of the major reasons students choose UW-Eau Claire is most likely their proximity to the school. This is exemplified in the map of the residuals, where the counties surrounding Eau Claire are all in the higher portion of the legend, indicating that the closer you are to Eau Claire, the more likely you are to attend. However, a few exceptions can be seen. The counties with major cities, such as Green Bay, Madison, and the suburbs of Milwaukee, all have a high portion of people going to UW-Eau Claire. This may simply be because normalizing the data to factor out the weight of the high-population areas was not fully effective, or it could be due to a large number of students wanting to escape the bigger cities. That remains to be determined, but it does not negate the evidence that distance is linearly associated with the number of UW-Eau Claire students.

A couple of other interesting counties are on the opposite side of the residual spectrum. The first is the county that contains the city of Milwaukee, Wisconsin's most populous city, in the southeastern part of the state. This county shows up in the darkest shade of green, indicating that, even allowing for its distance from Eau Claire, it should have more students attending there based on its population. This could be due to the lower incomes in the city of Milwaukee, as most of the people with decent incomes have moved towards the suburbs.

The final county I would like to point out is the one containing UW-Superior. This county also shows up in green, indicating that more students from this county should be attending UW-Eau Claire. This is interesting because of how it ties into the next regression analysis that showed significant results: the number of students attending UW-Superior and the distance from each county.

UW-Superior and Distance from each County

UW-Superior enrollment and the distance from each county variable show a significant linear relationship. However, once again the coefficient of determination is very weak. This would suggest that predictions cannot or should not be made from this data, but there does appear to be one extreme outlier that seems to be driving the statistics. This outlier is the county in which both the city of Superior and the university are located. This county is by far the leading county in sending students to UW-Superior. These results can then be compared to the ones found when analyzing UW-Eau Claire and the distance variable. Comparing the two suggests that the main reason more students from the county where Superior is located are not attending UW-Eau Claire is UW-Superior itself. UW-Superior's residual map clearly shows that, besides the county it is located in, there is essentially no spatial pattern indicating any other trend. Therefore it may be safe to predict that if a potential student comes from the city of Superior or the county it is located in, they will most likely head to UW-Superior.

Conclusion

Running regression analysis to determine trends and relationships between UW System school enrollment and other variables proved to be a very interesting task. What was found was that the distance between a potential student and the college they choose is a huge factor. This makes perfect sense to me, since distance was pretty much the number one factor when I made my decision. Most people tend not to travel too far from where they grew up, so if there is a UW school located nearby, they will most likely choose it. Running the regression analysis on all UW schools would be a very intriguing task, and the results would be of great use to compare with the results I obtained above. Regression analysis is a strong statistical tool for establishing whether two variables are related and whether predictions can be made from them. For this example, the results could potentially be used by UW-Eau Claire or UW-Superior to establish which areas would be good to advertise in and which areas would just be a waste of time.

Thursday, April 9, 2015

Correlation and Spatial Autocorrelation

Part 1

The first part of this exercise was focused on correlation. Below are the tasks with the appropriate responses.

Correlation between Distance and Sound Levels

1. Using the data below as well as Excel and SPSS do the following things:

a. In Excel:

i. Create a scatterplot (with Distance on the X axis)

ii. On the scatterplot place the trend line

b. In SPSS find the Pearson Correlation

c. Show the results of the Pearson Correlation

d. What is the hypothesis (remember the Sig. level)

e. Summarize your findings below. Make sure to explain the strength and the direction of the results as well as explain your hypothesis test.

Scatter Plot:

Correlation Chart:

Hypothesis: The null hypothesis would be that there is no linear relationship between the distance and sound level variables. The significance level will be set at 0.05.

Results: The correlation chart indicates that the null hypothesis is rejected, because the significance value (Sig.) of the correlation is below the selected significance level of 0.05. In fact, even if the significance level were set at 0.01, the correlation would still be significant.

There is a strong negative correlation between the distance and sound level variables, based on the Pearson correlation of -0.896. The scatter plot confirms this. Because the points are packed so tightly around the trend line, the correlation is strong. The direction is also clearly negative, because sound level decreases as distance increases.
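For reference, the Pearson correlation that SPSS reports can be computed by hand as the covariance of the two variables divided by the product of their spreads. The sketch below does that in plain Python. The distance and sound values are made up for illustration because the assignment data are not reproduced here, so the result will not equal -0.896; the function name is also my own.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: sum of cross-deviations over the root of the product of squared deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical readings: sound level drops as distance grows.
distance = [10, 20, 30, 40, 50, 60]
sound = [95, 88, 84, 75, 72, 65]

r = pearson_r(distance, sound)
print(round(r, 3))
```

The sign of r gives the direction (negative here, since sound falls as distance rises) and its closeness to -1 or 1 gives the strength. Whether the value is significant still has to be judged from the Sig. value in the SPSS output.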

Census Tracts and Population in Milwaukee County

Create a correlation matrix with all the data in it:

Perwhite = Percent White Pop. for the 307 Census Tracts in Milwaukee County

PerBlack = Percent Black

PerHis = Percent Hispanic

NO_HS = Percent with No High School Diploma

BS = Percent with a Bachelor's Degree

Below_Pove = Percent below the Poverty Level

Walk = Percent that walk to work

Results: This correlation matrix contains data representing Milwaukee County. Running correlation tests on all of the variables shows which variables are associated with each other. It is very important to remember that just because two variables correlate with each other, it does not mean one causes the other. In this matrix, every pair marked with two stars has a correlation that is significant at the 0.01 level. For example, there is a strong negative correlation between the variables percent white and percent black. This means that, in general, census tracts with higher percentages of black residents have lower percentages of white residents. Another example of correlation found in this data is between percent with a bachelor's degree and percent below the poverty level. In this case there is a fairly strong negative correlation, indicating that tracts with a higher percentage of bachelor's degree holders tend to have a lower percentage of people below the poverty level. An example of strong positive correlation between two variables is percent below the poverty level and percent black. This does not mean that if you are black you will be poor; it means there is statistically significant evidence that areas with higher percentages of black residents also have higher percentages of people living in poverty.