What is this Blog About?



This quantitative methods in geography blog will showcase the skills and techniques learned in the course GEOG 328 from the University of Wisconsin Eau Claire. The focus of this class is on relating quantitative methods and statistics to geography.

Wednesday, April 29, 2015

Regression Analysis

Part 1

Study Question: A study on crime rates and poverty was conducted for Town X. The local news station got hold of some data and claimed that as the number of kids who get free lunches increases, so does crime. You feel this might be a bit silly, so you run a regression with the given data. Using SPSS to do your regression, determine if the news is correct. What percentage of persons will get free lunch with a crime rate of 79.7? How confident are you in your results? Provide the data and a response addressing these questions.

Null Hypothesis: There is no linear relationship between the number of kids getting free lunch and the crime rate. 

SPSS Results


Analysis: The idea of associating the number of kids who get free lunches with the crime rate seems rather abstract and unlikely to be true. However, the regression analysis suggests otherwise. In this instance the null hypothesis is rejected, because the regression shows a significant linear relationship between the number of kids getting free lunch and the crime rate. Furthermore, based on the R value, which indicates correlation, the linear relationship is a positive one. This means the news is correct: as the number of kids who get free lunches increases, so does the crime rate.

Despite there being a linear relationship and a positive correlation, however, the relationship is not very strong. This statement is based on the coefficient of determination (R squared), which is equal to 0.173. This is a very low value: only about 17% of the variation in the crime rate is explained by the percentage of kids getting free lunch. So although a linear relationship exists, it is not a strong one.

If the crime rate is 79.7 what percentage of kids will get free lunch?

Regression equation ---    y = a + bx
y = dependent variable (Crime Rate)
x = independent variable (Percent Free Lunch)

79.7 = 21.819 + 1.685x

x= 34.35 %

If crime rate is equal to 79.7 the percentage of kids who will get free lunch is 34.35%
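
The arithmetic above can be checked with a short Python sketch. The function name `solve_for_x` is made up for illustration; the intercept (21.819) and slope (1.685) are the values from the regression equation above. Keep in mind that this simply rearranges the fitted equation, which is what the study question asks for; it is not the same as fitting a separate regression with percent free lunch as the dependent variable.

```python
def solve_for_x(y, intercept, slope):
    """Rearrange y = a + b*x to get x = (y - a) / b."""
    if slope == 0:
        raise ValueError("slope must be non-zero to solve for x")
    return (y - intercept) / slope


# Values taken from the regression equation above.
percent_free_lunch = solve_for_x(y=79.7, intercept=21.819, slope=1.685)
print(round(percent_free_lunch, 2))  # 34.35
```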

I am not very confident in this prediction due to the coefficient of determination. Because the R squared value is so low, indicating a weak linear relationship, the likelihood of the prediction being accurate is low as well.

Part 2

Introduction

Regression analysis is used to investigate the relationship between two variables. The goal of regression analysis is to be able to predict the value of one variable from the other. In essence, regression analysis measures how strongly a dependent variable is associated with an independent variable, although a significant relationship does not by itself prove that one causes the other. For this exercise the University of Wisconsin System is under investigation. The purpose of the exercise is to analyze the enrollment numbers of each UW System school and explore why students choose the schools they end up attending. This will be done through regression analysis on specific UW schools and a few select variables.

Methods

The variables to be assessed for a relationship with the UW schools are: percent with a Bachelor's degree for each county in Wisconsin, median household income per county, and the distance each county is from each school. Performing regression analysis on these variables with the enrollment of a specific school will help determine which variable is most strongly associated with students choosing to attend that school. The regression analysis was performed in the SPSS statistics software.

For this exercise the task was to select two of the UW System schools and run regression analysis on them using the three variables. The results of the regression show which variables have a significant relationship with enrollment at that particular school. After the significant variables were established, the residuals were mapped. A residual is the difference between an observed value and the value predicted by the trend line. Standardized residuals work much like z-scores: they represent how many standard deviations a specific piece of data falls above or below the trend line. Thus, the further the number is from zero, the more of an outlier it is, and the more of an outlier it is, the more interesting it becomes for analyzing the data.
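
To make the idea of a standardized residual concrete, here is a minimal Python sketch. It is not the SPSS procedure itself; it fits a simple least-squares line by hand and divides each residual by the standard error of the estimate, which is one common way to standardize. The numbers in `x_values` and `y_values` are invented for illustration and are not the county data used in this exercise.

```python
import math


def standardized_residuals(x, y):
    """Fit y = a + b*x by least squares and return residuals in standard-error units."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual = observed value minus the value predicted by the trend line.
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Standard error of the estimate uses n - 2 degrees of freedom.
    std_error = math.sqrt(sum(r ** 2 for r in residuals) / (n - 2))
    return [r / std_error for r in residuals]


# Hypothetical data: the fifth observation sits far above the trend line.
x_values = [10, 20, 30, 40, 50, 60]
y_values = [5, 9, 16, 20, 60, 31]
z = standardized_residuals(x_values, y_values)
largest = max(range(len(z)), key=lambda i: abs(z[i]))
print(largest, round(z[largest], 2))
```

The observation with the largest absolute standardized residual is the one that would stand out most on a residual map.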

In order to figure out how much of a factor distance is in a student's choice of school, the distance each county is from each school is analyzed. However, this variable must be normalized, otherwise it will give misleading results. When analyzing the distance variable, the results are going to be skewed by the counties that have a large population. The counties within Wisconsin that contain the large cities are most likely going to have the highest number of students at every school, so it is important to normalize the distance variable by the population of the counties, so that the effect distance has on a student's decision can truly be assessed.

Results

Regression analysis was run in SPSS to determine which variables are significant for each school. The two schools I chose to analyze are UW- Eau Claire and UW- Superior. Once the significant variables were identified, the residuals were mapped to establish which counties are the largest outliers.

Figure 1: SPSS regression results for UW- Eau Claire and the variable bachelor degrees.

Figure 2: Map of the residuals for the bachelor degree variable and UW- Eau Claire

Figure 3: SPSS regression results for UW- Eau Claire and the distance from each county variable.

Figure 4: Map of the residuals for the distance from each county variable and UW- Eau Claire

Figure 5: SPSS regression results for UW- Superior and the distance from each county variable.

Figure 6: Map of the residuals for the distance from each county variable and UW- Superior

Discussion

UW-Eau Claire and Percent with Bachelors Degree per County

The first thing that is noticeable when assessing the relationship between UW- Eau Claire enrollment and the percent with a bachelor's degree per county variable is the lack of a strong coefficient of determination. Although there is a significant linear relationship between the two variables, the weak coefficient of determination does not provide great confidence that students go to UW- Eau Claire because they come from counties with high percentages of people with bachelor's degrees. Perhaps a more telling relationship would be seen between the overall UW System and the counties with high percentages of people with bachelor's degrees, rather than just UW- Eau Claire. That analysis would examine the commonly held belief that people who come from well-educated areas are more likely to seek out higher education. As for the results based on just UW- Eau Claire, it appears you are more likely to go to Eau Claire if you come from a more highly educated county, but that does not entirely explain why students chose Eau Claire over the other UW schools.

UW- Eau Claire and Distance from each County 

On the other hand, not only does the distance from each county have a linear relationship with the number of students that attend UW- Eau Claire, it possesses an extremely high coefficient of determination. This signifies that the relationship between the two variables is not only significant, but can also be used to make predictions with some confidence. It indicates that one of the major reasons students choose UW- Eau Claire is most likely their proximity to the school. This is exemplified in the map of the residuals, where the counties surrounding Eau Claire are all in the higher portion of the legend, indicating that the closer you are to Eau Claire the more likely you are to attend. However, there are a few exceptions. The counties with major cities, such as Green Bay, Madison, and the suburbs of Milwaukee, all have a high portion of people going to UW-Eau Claire. This may be because normalizing the data to factor out the weight of the high-population areas was not fully effective, or it could be due to a large number of students wanting to escape the bigger cities. That remains to be determined, but it does not negate the evidence showing that distance is linearly associated with the number of UW- Eau Claire students.

Another couple of counties that are interesting are on the opposite side of the residual spectrum. First is the county that contains the city of Milwaukee, Wisconsin's most populous city, in the southeastern part of the state. This county shows up in the darkest of green colors, indicating that despite it being far away from Eau Claire, it still should have more students attending there based on its population. This could be due to the lower incomes of the city of Milwaukee, as most of the people with decent incomes have moved towards the suburbs. 

The final county I would like to point out is the one containing UW- Superior. This county also shows up in a green color, indicating more students from this county should be attending UW- Eau Claire. This is interesting because of how it ties into the next regression analysis that showed significant results: the number of students attending UW-Superior and the distance from each county.

UW- Superior and Distance from each County 

The UW- Superior enrollment and distance from each county variables show a significant linear relationship. However, once again the coefficient of determination is very weak. This would suggest that predictions cannot or should not be made from this data, but there does appear to be one extreme outlier that seems to be driving the statistics. This outlier is the county in which both the city of Superior and the university are located. This county is by far the leading county in sending students to UW- Superior. These results can then be compared to the ones found when analyzing UW- Eau Claire and the distance variable. Comparing the two suggests that the main reason more students from the county where Superior is located are not attending UW- Eau Claire is UW- Superior itself. UW- Superior's residual map clearly shows that, besides the county it is located in, there is essentially no spatial pattern indicating any other trend. Therefore it may be reasonable to predict that if a potential student comes from the city of Superior or the county it is located in, they will most likely head to UW- Superior.

Conclusion

Running regression analysis to determine trends and relationships between UW System school enrollment and other variables proved to be a very interesting task. What was found was that the distance between a potential student and the college they choose is a huge factor. This makes perfect sense to me, since distance was pretty much the number one factor when I made my decision. Most people tend not to travel too far from where they grew up, so if there is a UW school located nearby, they will most likely choose it. Running the regression analysis on all UW schools would be a very intriguing task, and the results would be of great use to compare to the results that I obtained above. Regression analysis is a strong use of statistics to establish whether two variables are related and whether predictions can be made from them. For this example, the results could potentially be used by UW- Eau Claire or UW-Superior to establish which areas would be good to advertise in and which areas would just be a waste of time.







Thursday, April 9, 2015

Correlation and Spatial Autocorrelation

Part 1

The first part of this exercise was focused on correlation. Below are the tasks with the appropriate responses.

Correlation between Distance and Sound Levels

1. Using the data below as well as Excel and SPSS do the following things:
a. In Excel:
   i. Create a scatterplot (with Distance on the X axis)
   ii. On the scatterplot place the trend line
b. In SPSS find the Pearson Correlation
c. Show the results of the Pearson Correlation
d. What is the hypothesis (remember the Sig. level)
e. Summarize your findings below. Make sure to explain the strength and the direction of the results as well as explain your hypothesis test.


Scatter Plot: 

Correlation Chart:
Hypothesis: The null hypothesis would be that there is no linear relationship between the distance and sound level variables. The significance level will be set at 0.05.

Results: The correlation chart indicates that the null hypothesis would be rejected, because the significance (Sig.) value reported for the correlation is below the selected significance level of 0.05. In fact, even if the significance level were set at 0.01, the correlation would still be significant.

It appears that there is a strong negative correlation between the distance and sound level variables based on the -.896 Pearson Correlation result. The scatter plot also confirms this. Because the points are so tightly packed around the trend line, the correlation is strong. The direction is also clearly negative, because one variable decreases as the other increases.
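
For readers who want to see what the Pearson Correlation is actually computing, here is a minimal Python sketch. The `distance` and `sound_level` numbers are invented for illustration (they are not the assignment data), so the result will not match the -.896 reported by SPSS, and this sketch does not compute the Sig. value.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: covariation of x and y divided by their combined spread."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariation / (spread_x * spread_y)


# Hypothetical readings: sound level drops as distance grows.
distance = [10, 20, 30, 40, 50, 60]
sound_level = [95, 88, 84, 70, 66, 61]
r = pearson_r(distance, sound_level)
print(round(r, 3))
```

A value close to -1 means the points sit tightly around a downward-sloping trend line, which is the pattern described above.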

Census Tracts and Population in Milwaukee County


Create a correlation matrix with all the data in it:


Perwhite = Percent White Pop. for the 307 Census Tracts in Milwaukee County
PerBlack = Percent Black
PerHis = Percent Hispanic
NO_HS = Percent with No High School Diploma
BS = Percent with a Bachelor's Degree
Below_Pove = Percent below the Poverty Level
Walk = Percent that walk to work 

Results: This correlation matrix contains data representing Milwaukee County. Running correlation tests on all of the variables shows which variables are associated with each other. It is very important to remember that just because two variables correlate with each other, it does not mean one causes the other. In this example, every time there are two stars shown, there is a significant correlation between the two variables. For example, there is a strong negative correlation between the variables percent white and percent black. This means that in general, areas that have higher percentages of black people will have lower percentages of white people. Another example of correlation found with this data is between the variables percent with a bachelor's degree and percent below the poverty level. In this case there is a fairly strong negative correlation, indicating that areas with a high percentage of bachelor's degrees tend to have a low percentage of people below the poverty level. An example of a strong positive correlation between two variables would be percent below poverty level and percent black. This does not mean that if you are black you will be poor; it just means that there is statistically significant evidence showing that areas with higher percentages of black people also have higher percentages of people living in poverty.

Part 2

Introduction

The second part of this assignment takes a look at not only correlation, but spatial auto-correlation. The difference between the two is the element of space. Spatial auto-correlation works like correlation, but it takes into account how the values of a variable are geographically related. Basically, spatial auto-correlation examines whether similar values occur in areas that are near each other. In other words, spatial auto-correlation is looking for a clustering effect of a certain variable. Spatial auto-correlation can be extremely useful in determining not only which areas are different from one another, but how they are different.

To learn the importance of spatial auto-correlation and how it can potentially be used, a scenario has been created related to voting patterns in the state of Texas. In the scenario, the Texas Election Commission (TEC) wants analysis performed on the presidential elections of 1980 and 2008 to see if election patterns have changed over the 28 years between them. Given the voting data, the task is to use correlation and spatial auto-correlation to determine what the voting patterns are in the different areas of Texas.

Methods

Spatial auto-correlation can be performed in the software program called OpenGeoDa. This free software is designed to help with spatial data analysis, and one of its strong points is running spatial auto-correlation. In order to run spatial auto-correlation in OpenGeoDa, a weights file must first be created. Basically, a weights file records how the features being analyzed border or touch one another. For this example the weights file recorded which counties within the state of Texas touch or border one another, so a county that borders many others has more neighbors entering its calculation. The weights file essentially is what factors in the spatial portion of spatial auto-correlation.

With the weights file created, the actual data and variables can then be analyzed based on the weights. There are two useful analysis techniques in the OpenGeoDa software. The first is Moran's I, which compares the value of the selected variable at each location with the values at its neighboring locations, determining if there is any spatial auto-correlation. The result from this technique is a value that typically falls between -1 and 1. The higher the value, the more clustered the variable is; a value near 0 suggests no spatial pattern, and a negative value suggests that unlike values tend to sit next to each other. This technique also uses a quadrant system, much like a typical X,Y graph. On this graph a scatter plot is generated and a trend line can be placed. This trend line can be a very telling sign of the direction in which the data may be spatially auto-correlated.
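
As a rough illustration of what Moran's I measures, here is a minimal Python sketch. It is not how OpenGeoDa is implemented: it assumes simple binary weights (1 if two areas touch, 0 otherwise) stored in a made-up `neighbors` dictionary, and it uses four imaginary areas arranged in a row rather than Texas counties. The software's own weights handling may differ, for example by standardizing the weights.

```python
def morans_i(values, neighbors):
    """Global Moran's I with binary contiguity weights.

    values    -- list with one value per area
    neighbors -- dict mapping each area index to the indices it touches
    """
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    # Sum the products of deviations over every (area, neighbor) pair.
    cross_products = sum(
        deviations[i] * deviations[j] for i in range(n) for j in neighbors[i]
    )
    total_weight = sum(len(neighbors[i]) for i in range(n))
    squared_deviations = sum(d * d for d in deviations)
    return (n / total_weight) * (cross_products / squared_deviations)


# Four hypothetical areas in a row: 0 touches 1, 1 touches 2, 2 touches 3.
row_of_areas = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

clustered = morans_i([1, 2, 3, 4], row_of_areas)    # similar values sit together
alternating = morans_i([1, 4, 1, 4], row_of_areas)  # unlike values sit together
print(round(clustered, 3), round(alternating, 3))
```

The first arrangement gives a positive value because low values neighbor low values and high values neighbor high values; the second gives a negative value because every area is surrounded by its opposite.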

The second spatial auto-correlation technique is called local indicators of spatial auto-correlation (LISA). A LISA will essentially output a map that is similar to the X,Y graph output from the Moran's I. On this map, areas will be highlighted that represent:

  • Areas that have a high value of the selected variable surrounded by other areas that have high values
  • Areas that have a high value but are surrounded by areas with low values
  • Areas that have a low value but are surrounded by areas with high values
  • Areas that have a low value and are also surrounded by other areas with low values
Using this technique will give a visual representation of what the spatial auto-correlation patterns actually look like.
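
The four categories in the list above can be sketched in Python as follows. This only does the quadrant labeling, comparing each area's value and the average of its neighbors to the overall mean; an actual LISA map also tests which areas are statistically significant, which this sketch leaves out. The `neighbors` dictionary and the values are invented for illustration, and every area is assumed to have at least one neighbor.

```python
def quadrant_labels(values, neighbors):
    """Label each area by its own value and its neighbors' average, relative to the mean."""
    mean = sum(values) / len(values)
    labels = []
    for i, value in enumerate(values):
        # Average value of the areas that touch area i.
        neighbor_average = sum(values[j] for j in neighbors[i]) / len(neighbors[i])
        own = "High" if value >= mean else "Low"
        surrounding = "High" if neighbor_average >= mean else "Low"
        labels.append(own + "-" + surrounding)
    return labels


# Six hypothetical areas in a row, high values on the left and low on the right.
row_of_areas = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
labels = quadrant_labels([10, 9, 8, 1, 2, 1], row_of_areas)
print(labels)
```

In this made-up row the left-hand areas come out as High-High, the right-hand areas as Low-Low, and the area at the boundary as High-Low.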

Results


Figure 1: Percent Hispanic Moran's I 

Figure 2: Percent Hispanic LISA
Running the spatial auto-correlation tools on the percent Hispanic data for the state of Texas gives a strong idea of where Hispanic people are located within the state. The southern part of Texas is where the highest percentages of Hispanic people are clustered, with the northeast containing the most clustered non-Hispanic populations. Interpreting the Hispanic population is important because it may eventually shed some light on why some of the voting patterns of Texas exist.

Figure 3: Percent Democratic Vote in 1980 Moran's I

Figure 4: Percent Democratic Vote in 1980 LISA


Figure 5: Percent Democratic Vote in 2008 Moran's I

Figure 6: Percent Democratic Vote in 2008 LISA
The first takeaway from the spatial auto-correlation results on the percent Democratic vote in both 1980 and 2008 is the Moran's I result. By 2008 the percent Democratic vote had become more clustered than in 1980. Interpreting the LISA maps in light of the Moran's I result shows that the greater clustering appears similar to that of the Hispanic population. The southern counties of Texas are highly Democratic, which goes along with the tendency of Hispanics and other minorities to vote Democratic. One interesting feature of the two LISA maps of 1980 and 2008 is the northeast part of the state. There does not appear to be significant clustering of Democratic vote or lack of Democratic vote in that area, despite its strong clustering of non-Hispanic populations.

Figure 7: Voter Turnout 1980 Moran's I

Figure 8: Voter Turnout 1980 LISA

Figure 9: Voter Turnout 2008 Moran's I

Figure 10: Voter Turnout 2008 LISA
The spatial auto-correlation results for voter turnout in 1980 and 2008 show significantly less of a clustering effect. Based on the Moran's I results, both years showed relatively weak signs of clustering throughout the state. Looking at the LISA maps, the few areas that do show signs of clustering are quite interesting. The areas that have high voter turnout clustering just so happen to be two of the major metropolitan areas in the state of Texas. Both the Dallas/Fort Worth area and the San Antonio/Austin area have high clustering of voter turnout. This makes sense because the higher the population of an area, the higher the voter turnout is likely to be. What may be even more significant is the blue portion of the LISA map in the southern part of the state. This shows that there is a clustering of low voter turnout counties in this part of Texas. This is a telling sign because that is where the stronghold of the percent Democratic vote is clustered. That means that even though these areas largely vote Democratic, the voter turnout is relatively low, so it most likely will not sway an election. This makes sense for the state of Texas, which is considered a dominantly Republican state.
Figure 11: Correlation Matrix with Percent Democratic Vote and Voter Turnout for both 1980 and 2008
-VTP80 = Voter Turnout 1980
-VTP08 = Voter Turnout 2008
-PRES80D = % Democratic Vote 1980
-PRES08D = % Democratic Vote 2008
                                  
This correlation matrix echoes what was stated in the previous paragraph. Looking at the voter turnout variables and the percent Democratic vote variables, there is an apparent negative correlation. This indicates a pattern where the larger the voter turnout, the smaller the percent of the vote that is Democratic. The correlation is moderately strong, indicating that the percent Democratic vote will not always go down with higher voter turnout, but it is still likely.

Conclusion

Texas is generally considered to be a predominantly Republican state. The voting data from both 1980 and 2008 back this up, with areas of high voter turnout generally showing a lower Democratic percentage and areas of low voter turnout generally showing a higher Democratic percentage. This is a telling sign for the Texas Election Commission and clearly shows significant patterns in the voting data. Depending on the party affiliation of the TEC and the current governor (who requested the analysis to be done through the TEC), this analysis can be used in different ways. If the governor is Republican, which judging by these voting patterns he or she likely is, this data would indicate the major population hubs in the state. As long as the major population centers continue to turn out to vote, the Republicans will most likely continue to win elections. However, if the governor were a Democrat, the data would be used in a completely different direction. Instead of focusing on major population areas, the Democratic candidate would focus on the southern portion of the state, and specifically on getting the Hispanic portion of the population to get out and vote. If the Democrats could turn the southern counties into not just strongholds of Democratic vote, but also strongholds of high voter turnout, their chances in elections would greatly increase.

This assignment shows the importance of not just correlation and spatial auto-correlation, but of statistics in general. Specifically when it relates to politics, running statistics is essential to any campaign strategy, which also means a lot of money is sure to be spent on political statistics analysis. However, regular statistics is not enough. Incorporating space and geography into the mix, such as with spatial auto-correlation, is essential to understand which specific areas are important. Relating statistics to space allows for a more in-depth analysis than the numbers alone. This allows the people performing the analysis to ask deeper questions: not just what the data is, but why it is the way it is.