What is this Blog About?



This quantitative methods in geography blog showcases the skills and techniques learned in the course GEOG 328 at the University of Wisconsin-Eau Claire. The focus of this class is on relating quantitative methods and statistics to geography.

Wednesday, April 29, 2015

Regression Analysis

Part 1

Study Question: A study on crime rates and poverty was conducted for Town X. The local news station got hold of some data and claimed that as the number of kids that get free lunches increases, so does crime. You feel this might be a bit silly, so you run a regression equation with the given data. Using SPSS to do your regression, determine if the news is correct. What percentage of persons will get free lunch with a crime rate of 79.7? How confident are you in your results? Provide the data and a response addressing these questions.

Null Hypothesis: There is no linear relationship between the number of kids getting free lunch and the crime rate. 

SPSS Results


Analysis: Associating the number of kids that get free lunches with the crime rate seems rather abstract and unlikely to hold. However, the regression analysis contradicts this perception. In this instance the null hypothesis is rejected, because there is a statistically significant linear relationship between the number of kids getting free lunch and the crime rate. Furthermore, the R value, which indicates correlation, shows that the linear relationship is a positive one. The news is therefore correct that as the number of kids that get free lunches increases, so does the crime rate, although this association alone does not show that one causes the other.

Despite the linear relationship and the positive correlation, the relationship is not very strong. This statement is based on the coefficient of determination (R squared), which is equal to 0.173. This is a very low value: the percentage of kids getting free lunch explains only about 17% of the variation in crime rate, so although a linear relationship exists, it is not a strong one.

If the crime rate is 79.7 what percentage of kids will get free lunch?

Regression equation ---    y = a + bx
y = dependent variable (Crime Rate)
x = independent variable (Percent Free Lunch)

79.7 = 21.819 + 1.685x

x= 34.35 %

If the crime rate is equal to 79.7, the predicted percentage of kids who get free lunch is 34.35%.
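The rearrangement of the regression equation above can be checked with a short Python sketch. The intercept and slope are the values reported in this post; the function name is only an illustrative choice.

```python
# Coefficients from the SPSS output quoted above:
# crime rate = a + b * (percent free lunch)
INTERCEPT = 21.819  # a
SLOPE = 1.685       # b


def percent_free_lunch_for(crime_rate, a=INTERCEPT, b=SLOPE):
    """Solve crime_rate = a + b * x for x."""
    return (crime_rate - a) / b


x = percent_free_lunch_for(79.7)
print(f"x = {x:.2f} %")
```

Plugging the result back into y = a + bx returns the crime rate of 79.7, which confirms the algebra but says nothing about how reliable the prediction is.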

I am not very confident in this prediction because of the coefficient of determination. The R squared value is low, indicating a weak linear relationship, so the likelihood of the prediction being accurate is low as well.

Part 2

Introduction

Regression analysis is used to investigate the relationship between two variables. The goal of regression analysis is to be able to predict the value of one variable from the other. In essence, regression analysis estimates how much the dependent variable changes as the independent variable changes; on its own it shows association, not causation. For this exercise the University of Wisconsin System is under investigation. The purpose of the exercise is to analyze the enrollment numbers of each UW System school and explore why students choose the schools they end up attending. This will be done through regression analysis on specific UW schools and a few select variables.

Methods

The variables assessed for a relationship with the UW schools are: percent with a Bachelor's degree for each county in Wisconsin, median household income per county, and the distance each county is from each school. Performing regression analysis on these variables against the enrollment of a specific school will help determine which variable is most strongly associated with students choosing to attend that school. The regression analysis was performed in the SPSS statistics software.

For this exercise the task was to select two of the UW System schools and run regression analysis on them using the three variables. The results of the regression show which variables have a relationship or association with that particular school. After the variables were analyzed and the significant variables were established, the residuals of the results were mapped. A residual is the difference between an observed value and the value predicted by the trend line. When standardized, residuals behave like z-scores: they represent how many standard deviations a specific piece of data lies from the regression line. Thus, the further the number is from zero the more of an outlier it is, and the more of an outlier it is the more interesting it becomes for analyzing the data.
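As a rough illustration of what the mapped residuals represent, the sketch below fits a least-squares line and standardizes the residuals by the standard error of the estimate. The data are made up for illustration, and SPSS may scale its standardized residuals slightly differently, so treat this as a sketch rather than a reproduction of the SPSS output.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares intercept a and slope b for y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def standardized_residuals(xs, ys):
    """Residuals divided by the standard error of the estimate (n - 2 df)."""
    a, b = fit_line(xs, ys)
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    std_error = math.sqrt(sum(r * r for r in residuals) / (len(xs) - 2))
    return [r / std_error for r in residuals]


# Hypothetical values, not the county data used in this post.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 15, 8, 10]
z = standardized_residuals(xs, ys)
for x, value in zip(xs, z):
    print(f"x = {x}: standardized residual = {value:.2f}")
```

In this toy data set the third observation sits far above the trend line, so it has the largest standardized residual, which is the kind of county that stands out on a residual map.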

In order to figure out how much of a factor distance is in a student's choice of school, the distance each county is from each school is analyzed. However, this variable must be normalized; otherwise it will give misleading results. When analyzing the distance variable, the results are skewed by the counties that have a large population. The counties within Wisconsin that contain the large cities are most likely going to have the highest number of students at every school, so it is important to normalize the distance variable by the population of the counties, so the effect distance has on a student's decision can be assessed fairly.

Results

Regression analysis is run in SPSS to determine which variables are significant for each school. The two schools I chose to analyze are UW- Eau Claire and UW- Superior. Once the significant variables are identified, the residuals are mapped to establish which counties are the largest outliers.

Figure 1: SPSS regression results for UW- Eau Claire and the variable bachelor degrees.

Figure 2: Map of the residuals for the bachelor degree variable and UW- Eau Claire

Figure 3: SPSS regression results for UW- Eau Claire and the distance from each county variable.

Figure 4: Map of the residuals for the distance from each county variable and UW- Eau Claire

Figure 5: SPSS regression results for UW- Superior and the distance from each county variable.

Figure 6: Map of the residuals for the distance from each county variable and UW- Superior

Discussion

UW-Eau Claire and Percent with Bachelors Degree per County

The first thing that is noticeable when assessing the relationship between UW- Eau Claire enrollment and the percent with a bachelor's degree per county is the lack of a strong coefficient of determination. Although there is a significant linear relationship between the two variables, the weak coefficient of determination does not provide great confidence that students go to UW- Eau Claire because they come from counties with high percentages of people with bachelor's degrees. Perhaps a more telling relationship would be seen between the overall UW System and the counties with high percentages of people with bachelor's degrees rather than just UW- Eau Claire. That analysis would examine the commonly held belief that people who come from well educated areas are more likely to seek out higher education. As for the results based on just UW- Eau Claire, it appears students are more likely to go to Eau Claire if they come from a more highly educated county, but that does not entirely explain why they chose Eau Claire over the other UW schools.

UW- Eau Claire and Distance from each County 

On the other hand, not only does the distance from each county have a linear relationship with the number of students that attend UW- Eau Claire, it possesses an extremely high coefficient of determination. This signifies that the relationship between the two variables is not only significant, but can also be used with some confidence for prediction. It indicates that one of the major reasons students choose UW- Eau Claire is most likely their proximity to the school. This is exemplified in the map of the residuals, where the counties surrounding Eau Claire are all in the higher portion of the legend, indicating that the closer you are to Eau Claire the more likely you are to attend. However, there are a few exceptions. The counties with major cities, such as Green Bay, Madison, and the suburbs of Milwaukee, all have a high portion of people going to UW- Eau Claire. This may be because normalizing the data to factor out the weight of the high population areas was not fully effective, or it could be due to a large number of students wanting to escape the bigger cities. That remains to be determined, but it does not negate the evidence showing that distance is linearly associated with the number of UW- Eau Claire students.

Another couple of counties that are interesting are on the opposite side of the residual spectrum. First is the county that contains the city of Milwaukee, Wisconsin's most populous city, in the southeastern part of the state. This county shows up in the darkest of green colors, indicating that despite it being far away from Eau Claire, it still should have more students attending there based on its population. This could be due to the lower incomes of the city of Milwaukee, as most of the people with decent incomes have moved towards the suburbs. 

The final county I would like to point out is the one containing UW- Superior. This county also shows up in a green color, indicating more students from this county should be attending UW- Eau Claire. This is interesting because of how it ties into the next set of regression analysis that showed significant results, number of students attending UW-Superior and distance from each county. 

UW- Superior and Distance from each County 

The UW- Superior enrollment and distance from each county variables show a significant linear relationship. However, once again the coefficient of determination is very weak. This would suggest that predictions cannot or should not be made from this data, but there does appear to be one extreme outlier that seems to be driving the statistics. This outlier is the county in which both the city of Superior and the university are located. This county is by far the leading county in sending students to UW- Superior. These results can then be compared to the ones found when analyzing UW- Eau Claire and the distance variable. Comparing the two suggests that the main reason more students from the county where Superior is located are not attending UW- Eau Claire is UW- Superior itself. UW- Superior's residual map clearly shows that besides the county it is located in, there is essentially no spatial pattern indicating any other trend. Therefore it may be reasonable to predict that if a potential student comes from the city of Superior or the county it is located in, they will most likely head to UW- Superior.

Conclusion

Running regression analysis to determine trends and relationships between UW System school enrollment and other variables proved to be a very interesting task. What was found was that the distance between a potential student and the college they choose is a huge factor. This makes perfect sense to me, since distance was pretty much the number one factor when I made my decision. Most people tend not to travel too far from where they grew up, so if there is a UW school located nearby, they most likely will choose it. Running the regression analysis on all UW schools would be a large task, but the results would be of great use to compare to the results that I obtained above. Regression analysis is a strong use of statistics to establish if two variables are related and if predictions can be made from them. For this example, the results could potentially be used by UW- Eau Claire or UW- Superior to establish which areas would be good to advertise in and which areas would just be a waste of time.







Thursday, April 9, 2015

Correlation and Spatial Autocorrelation

Part 1

The first part of this exercise was focused on correlation. Below are the tasks with the appropriate responses.

Correlation between Distance and Sound Levels

1. Using the data below as well as Excel and SPSS do the following things:
   a. In Excel:
      i. Create a scatterplot (with Distance on the X axis)
      ii. On the scatterplot place the trend line
   b. In SPSS find the Pearson Correlation
   c. Show the results of the Pearson Correlation
   d. What is the hypothesis (remember the Sig. level)
   e. Summarize your findings below. Make sure to explain the strength and the direction of the results as well as explain your hypothesis test.


Scatter Plot: 

Correlation Chart:
Hypothesis: The null hypothesis would be that there is no linear relationship between the distance and sound level variables. The significance level will be set at 0.05.

Results: The correlation chart indicates that the null hypothesis would be rejected, because the significance value (Sig.) reported for the correlation is below the selected significance level of 0.05. In fact, even if the significance level were set at 0.01 the correlation would still be significant.

It appears that there is a strong negative correlation between the distance and sound level variables based off of the -.896 Pearson Correlation result. The scatter plot would also confirm this. Because the points are so tightly packed to the trend line, it shows that the correlation is strong. Also the direction is apparent as negative, because one variable decreases as the other increases.
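For readers who want to see what the Pearson Correlation measures, here is a minimal Python sketch of the formula. The distance and sound values below are hypothetical stand-ins, not the assignment data, so the resulting coefficient will not match the -.896 reported by SPSS.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: co-variation of x and y scaled by their spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical readings: sound level falls as distance grows.
distance = [10, 20, 30, 40, 50]
sound = [100, 90, 85, 70, 65]
r = pearson_r(distance, sound)
print(f"r = {r:.3f}")
```

A value close to -1 means the points hug a downward-sloping trend line, which matches the interpretation of the scatter plot given above.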

Census Tracts and Population in Milwaukee County


Create a correlation matrix with all the data in it:


Perwhite = Percent White Pop. for the 307 Census Tracts in Milwaukee County
PerBlack = Percent Black
PerHis = Percent Hispanic
NO_HS = Percent with No High School Diploma
BS = Percent with a Bachelor's Degree
Below_Pove = Percent below the Poverty Level
Walk = Percent that walk to work 

Results: This correlation matrix contains data representing Milwaukee County. Running correlation tests on all of the variables shows which variables are associated with each other. It is very important to remember that just because two variables correlate with each other, it does not mean one causes the other. In this example, every time two stars are shown, the correlation between the two variables is significant (SPSS uses two stars to flag significance at the 0.01 level). For example, there is a strong negative correlation between the variables percent white and percent black. This means that in general, areas that have higher percentages of black residents have lower percentages of white residents. Another example is between the variables percent with a bachelor's degree and percent below the poverty level. In this case there is a fairly strong negative correlation, indicating that tracts with a higher percentage of bachelor's degrees tend to have a lower percentage of people below the poverty level. An example of strong positive correlation between two variables is percent below poverty level and percent black. This does not mean that if you are black you will be poor; it means there is statistically significant evidence that areas with higher percentages of black residents also have higher percentages of people living in poverty.

Part 2

Introduction

The second part of this assignment takes a look at not only correlation, but spatial auto-correlation. The difference between the two is the element of space. Spatial auto-correlation is similar to correlation, but it takes into account how values of a variable are geographically related. Basically, spatial auto-correlation examines whether similar values occur in areas that are near each other. In other words, spatial auto-correlation looks for a clustering effect in a certain variable. Spatial auto-correlation can be extremely useful in determining not only which areas are different from one another, but how they are different.

To learn the importance of spatial auto-correlation and how it can potentially be used, a scenario has been created related to voting patterns in the state of Texas. In the scenario, the Texas Election Commission (TEC) wants analysis performed on the presidential elections of 1980 and 2008 to see if election patterns have changed over the 28 years between them. Given the voting data, the task is to use correlation and spatial auto-correlation to determine what the voting patterns are in the different areas of Texas.

Methods

Spatial auto-correlation can be performed in the software program OpenGeoDa. This free software is designed to help with spatial data analysis, and one of its strong points is running spatial auto-correlation. In order to run spatial auto-correlation in OpenGeoDa, a weights file must first be created. A weights file assigns weights that describe how the features being analyzed border or touch one another. For this example the weights file determined how much each county within the state of Texas touched or bordered another. This means that bigger counties with long borders would have a larger weight. The weights file is essentially what factors in the spatial portion of spatial auto-correlation.

With the weights file created, the actual data and variables can then be analyzed based on the weights. There are two useful analysis techniques in the OpenGeoDa software. The first is Moran's I, which compares the value of the selected variable at any one location with the values at its neighboring locations, as defined by the weights file, to determine if there is any spatial auto-correlation. The result from this technique is a value that typically falls between -1 and 1. The higher the value, the more clustered the variable is; values near 0 indicate no spatial pattern, and values near -1 indicate a dispersed pattern. This technique also uses a quadrant system, much like a typical X,Y graph. On this graph a scatter plot is generated and a trend line can be placed. This trend line can be a very telling sign of the direction in which the data may be spatially auto-correlated.
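To make the Moran's I value more concrete, the sketch below computes it by hand for four locations arranged in a row, where each location neighbors the one next to it. The weights matrix and values are illustrative assumptions; spatial software commonly row-standardizes the weights, which this simplified version does not do.

```python
def morans_i(values, weights):
    """Global Moran's I for a list of values and a square weights matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    weight_sum = sum(sum(row) for row in weights)
    # Cross-products of deviations for every pair of neighbors.
    cross = sum(
        weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n)
    )
    variance_term = sum(d * d for d in dev)
    return (n / weight_sum) * (cross / variance_term)


# Four hypothetical locations in a row: 1 means "shares a border".
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

clustered = morans_i([1, 1, 5, 5], chain)    # similar values side by side
alternating = morans_i([1, 5, 1, 5], chain)  # high and low values alternate
print(f"clustered: {clustered:.3f}, alternating: {alternating:.3f}")
```

The clustered arrangement gives a positive value and the alternating arrangement gives a negative one, which is the same contrast the Moran's I scatter plot trend line expresses.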

The second spatial auto-correlation technique is called local indicators of spatial auto-correlation (LISA). A LISA essentially outputs a map that corresponds to the quadrants of the X,Y graph output from the Moran's I. On this map areas will be highlighted that represent:

  • Areas that have a high value of the selected variable surrounded by other areas that have high values
  • Areas that have a high value but are surrounded by areas with low values
  • Areas that have a low value but are surrounded by areas with high values
  • Areas that have a low value and are also surrounded by other areas with low values

Using this technique gives a visual representation of what the spatial auto-correlation patterns actually look like.
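The four categories above can be sketched in Python by comparing each area's value, and the average value of its neighbors, with the overall mean. The values and neighbor lists here are hypothetical, and a real LISA also tests each area for statistical significance before coloring it, which this sketch leaves out.

```python
def quadrant_labels(values, neighbors):
    """Label each area by its own value and its neighbors' average value."""
    mean = sum(values) / len(values)
    labels = []
    for i, value in enumerate(values):
        # Spatial lag: the average value of the areas bordering area i.
        lag = sum(values[j] for j in neighbors[i]) / len(neighbors[i])
        own = "High" if value > mean else "Low"
        around = "High" if lag > mean else "Low"
        labels.append(f"{own}-{around}")
    return labels


# Six hypothetical areas in a row; each borders the areas beside it.
values = [9, 9, 1, 2, 1, 9]
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}

labels = quadrant_labels(values, neighbors)
for area, label in enumerate(labels):
    print(f"area {area}: {label}")
```

High-High and Low-Low areas correspond to clusters, while High-Low and Low-High areas are the spatial outliers.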

Results


Figure 1: Percent Hispanic Moran's I 

Figure 2: Percent Hispanic LISA
Running the spatial auto-correlation tools on the percent Hispanic data for the state of Texas gives a strong idea of where Hispanic people are located within the state. The southern part of Texas is where the highest percentages of Hispanic people are clustered, with the northeast containing the most clustered non-Hispanic populations. Interpreting the Hispanic population is important because it may eventually shed some light on why some of the voting patterns of Texas exist.

Figure 3: Percent Democratic Vote in 1980 Moran's I

Figure 4: Percent Democratic Vote in 1980 LISA


Figure 5: Percent Democratic Vote in 2008 Moran's I

Figure 6: Percent Democratic Vote in 2008 LISA
The first takeaway from the spatial auto-correlation results on the percent Democratic vote in both 1980 and 2008 is the Moran's I result. By 2008 the percent Democratic vote had become more clustered. Interpreting the LISA maps alongside the Moran's I result shows that the greater clustering effect appears similar to that of the Hispanic population clustering. The southern counties of Texas are highly Democratic, which goes along with the tendency of Hispanics and other minorities to vote Democratic. One interesting feature of the two LISA maps of 1980 and 2008 is the northeast part of the state. There does not appear to be significant clustering of Democratic vote or lack of Democratic vote in that area despite its strong clustering of non-Hispanic populations.

Figure 7: Voter Turnout 1980 Moran's I

Figure 8: Voter Turnout 1980 LISA

Figure 9: Voter Turnout 2008 Moran's I

Figure 10: Voter Turnout 2008 LISA
The spatial auto-correlation results for voter turnout in 1980 and 2008 show noticeably less of a clustering effect. Based on the Moran's I results, both years showed relatively weak signs of clustering throughout the state. Looking at the LISA maps, the few areas that do show signs of clustering are quite interesting. The areas that have high voter turnout clustering happen to be two of the major metropolitan areas in the state of Texas: both the Dallas/Fort Worth area and the San Antonio/Austin area have high clustering of voter turnout. This makes sense because the higher the population of an area, the higher the voter turnout will most likely be. What may be even more significant is the blue portion of the LISA map in the southern part of the state. This shows that there is a clustering of low voter turnout counties in this part of Texas. This is a telling sign because that is where the clustering stronghold of the percent Democratic vote is. That means that even though these areas largely vote Democratic, the voter turnout is relatively low, so it most likely will not sway an election. This makes sense for the state of Texas, which is considered a dominant Republican state.
Figure 11: Correlation Matrix with Percent Democratic Vote and Voter Turnout for both 1980 and 2008
-VTP80 = Voter Turnout 1980
-VTP08 = Voter Turnout 2008
-PRES80D = % Democratic Vote 1980
-PRES08D = % Democratic Vote 2008
                                  
This correlation matrix echoes what was stated in the previous paragraph. Looking at the voter turnout variables and the percent Democratic vote variables, there is an apparent negative correlation. This indicates a pattern where the larger the voter turnout, the smaller the percent of the vote that is Democratic. The correlation is moderately strong, meaning the percent Democratic vote will not always be lower where voter turnout is higher, but it is still likely.

Conclusion

Texas is generally considered to be a predominantly Republican state. The voting data records in both 1980 and 2008 back this up by showing areas of high voter turnout generally having a lower Democratic percentage and areas of low voter turnout generally having a higher Democratic percentage. This is a telling sign for the Texas Election Commission and clearly shows significant patterns in the voting data. Depending on the party affiliation of the TEC and the current governor (who requested the analysis to be done through the TEC), this analysis can be used in different ways. If the governor is Republican, which judging by these voting patterns he/she is, this data would indicate the major population hubs in the state. As long as the major population centers continue to turn out to vote, the Republicans will most likely continue to win elections. However, if the governor were a Democrat, the data would be used in a completely different direction. Instead of focusing on major population areas, the Democratic candidate would focus on the southern portion of the state and specifically on getting the Hispanic portion of the population to get out and vote. If the Democrats could turn the southern counties into not just strongholds of Democratic vote, but also strongholds of high voter turnout, their chances in elections would greatly increase.

This assignment shows the importance of not just correlation and spatial auto-correlation, but of statistics in general. Specifically when it relates to politics, running statistics is essential to any campaign strategy, which also means a lot of money is sure to be spent on political statistics analysis. However, regular statistics is not enough. Incorporating space and geography into the mix, such as with spatial auto-correlation, is essential to understand which specific areas are important. Relating statistics to space allows for a more in-depth analysis than the numbers alone. This allows the people performing the analysis to ask deeper questions: not just what the data is, but why it is the way it is.



Monday, March 16, 2015

Significance and Chi-Squared Testing

Part 1

Part one of this exercise was centered on significance testing. Below are the questions with their answers. 
1.       



Fill out the chart above! (10 pts)
α = Significance level
z or t = is it a z or t test
z or t Value = Critical Value


2.       The Department of the Interior in Washington D.C. estimates that the number of particular invasive species in a certain county (Bucks County) should number as follows (averages based on data from the whole state of Pennsylvania) per acre: Asian-Long Horned Beetle, 4; Emerald Ash Borer Beetle, 10; and Golden Nematode, 75.  A survey of 50 fields had the following results: (10 pts)

Species                      μ       σ
Asian-Long Horned Beetle     3.2     0.73
Emerald Ash Borer Beetle     11.7    1.3
Golden Nematode              77      5.71

a.       Test the hypothesis for each of these species.  Assume that each is 2 tailed with a Confidence Level of 95% *Use the appropriate test
b.      Be sure to present the null and alternative hypotheses for each as well as conclusions
c.       What can be ascertained pertaining to the findings about these invasive species in Bucks County?

Asian-Long Horned Beetle:
Null Hypothesis- There is no statistical difference between the sample number and the expected number of Asian-long horned beetles collected
Alternative Hypothesis- There is a statistical difference between the sample number and the expected number of Asian-long horned beetles collected
Results: A z-test is used, resulting in a test statistic of -7.75, which is less than the critical value of -1.96. Therefore the null hypothesis would be rejected: the test statistic shows with 95% confidence that the sample collected is significantly below the hypothesized amount.

Emerald Ash Borer Beetle:
Null Hypothesis- There is no statistical difference between the sample number and the expected number of Emerald Ash Borer beetles collected
Alternative Hypothesis- There is a statistical difference between the sample number and the expected number of Emerald Ash Borer beetles collected
Results: A z-test is used, resulting in a test statistic of 9.25, which is greater than the critical value of 1.96. Therefore the null hypothesis would be rejected: the test statistic shows with 95% confidence that the sample collected is significantly above the hypothesized amount.

Golden Nematode:
Null Hypothesis- There is no statistical difference between the sample number and the expected number of Golden Nematodes collected
Alternative Hypothesis- There is a statistical difference between the sample number and the expected number of Golden Nematodes collected
Results: A z-test is used, resulting in a test statistic of 2.48, which is greater than the critical value of 1.96. Therefore the null hypothesis would be rejected: the test statistic shows with 95% confidence that the sample collected is significantly above the hypothesized amount.
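The three test statistics above can be reproduced with a short Python sketch using the survey means, standard deviations, and sample size of 50 fields given in the question. The function and variable names are illustrative choices.

```python
import math


def z_statistic(sample_mean, hypothesized_mean, std_dev, n):
    """z = (sample mean - hypothesized mean) / (std dev / sqrt(n))."""
    return (sample_mean - hypothesized_mean) / (std_dev / math.sqrt(n))


CRITICAL_Z = 1.96  # two-tailed critical value at the 95% confidence level
N_FIELDS = 50

# (sample mean, hypothesized mean, standard deviation) from the question.
surveys = {
    "Asian-Long Horned Beetle": (3.2, 4, 0.73),
    "Emerald Ash Borer Beetle": (11.7, 10, 1.3),
    "Golden Nematode": (77, 75, 5.71),
}

results = {
    name: z_statistic(mean, expected, sd, N_FIELDS)
    for name, (mean, expected, sd) in surveys.items()
}

for name, z in results.items():
    decision = "reject" if abs(z) > CRITICAL_Z else "fail to reject"
    print(f"{name}: z = {z:.2f}, {decision} the null hypothesis")
```

Because each test is two-tailed, the decision compares the absolute value of z against 1.96, which is why both the negative beetle result and the positive results lead to rejection.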


3.       An exhaustive survey of all users of a wilderness park taken in 1960 revealed that the average number of persons per party was 2.1.  In a random sample of 25 parties in 1985, the average was 3.4 persons with a standard deviation of 1.32 (one tailed test, 95% Con. Level) (5 pts)

a.       Test the hypothesis that the number of people per party has changed in the intervening years.  (State null and alternative hypotheses)
Null Hypothesis- The average number of people per party within the park has not changed between 1960 and 1985
Alternative Hypothesis- The average number of people per party within the park has changed between 1960 and 1985
Results: Due to the small sample size a t test is conducted, resulting in a t-value of 4.92, which exceeds the one-tailed critical value of 1.711 for 24 degrees of freedom. Therefore, the null hypothesis would be rejected, meaning the average number of people per party within the park has changed significantly between 1960 and 1985, and it has increased.
b.      What is the corresponding probability value

0.000026 or 0.0026%
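The t-value for the park question can be checked the same way. This sketch only computes the test statistic and compares it with the one-tailed critical value of 1.711 taken from a standard t table for 24 degrees of freedom; it does not compute the probability value, which needs a t distribution function from statistical software.

```python
import math


def t_statistic(sample_mean, population_mean, sample_sd, n):
    """t = (sample mean - population mean) / (sample sd / sqrt(n))."""
    return (sample_mean - population_mean) / (sample_sd / math.sqrt(n))


N_PARTIES = 25
DEGREES_OF_FREEDOM = N_PARTIES - 1
CRITICAL_T = 1.711  # one-tailed, alpha = 0.05, 24 df (from a t table)

t = t_statistic(sample_mean=3.4, population_mean=2.1, sample_sd=1.32, n=N_PARTIES)
decision = "reject" if t > CRITICAL_T else "fail to reject"
print(f"t = {t:.2f} with {DEGREES_OF_FREEDOM} df: {decision} the null hypothesis")
```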

Part 2

Introduction

Part two of this assignment focused on Chi-Squared testing. The Chi-Squared testing dealt with the phrase "up north" as it is used in Wisconsin. The phrase "up north" is often used to describe the differences between the northern and southern people of the state. Along with that phrase comes a series of judgments or even prejudices. Usually the main difference implied by the phrase is that the people of the north are more rural and the people of the south are from the city. The goal of this part of the lab was to use Chi-Squared testing to determine if the phrase "up north" actually has any real data to back up the differences it implies.

Methods

The first step to determining if there are major differences between the northern and southern portions of Wisconsin is to develop a dividing line between the north and south. The idea would be to find the direct line where people would begin to describe the areas as "up north". This can be a rather arbitrary idea considering there is no fine line dividing north and south in Wisconsin and the concept of "up north" varies depending on the person. However, for simplicity and in order to find results I used highway 29 as the rough dividing line between north and south. Highway 29 runs east-west for the most part across the entire state. Highway 29 provides a pretty decent dividing marker, where counties north of it are considered "up north" and counties south of it are the southern counties.

Figure 1: Map of Wisconsin counties split roughly along Highway 29
The next step of the lab was to choose three variables that most people generally associate with being up north. The three variables I chose were snowmobile trails, designated natural areas, and population over 65. All three are variables I would typically assume to be higher in the northern parts of the state: snowmobile trails and natural areas because of the more rural nature of the landscape in the north, and the population over 65 because I would expect most of the cities in the south to contain more young people than old.

Once the variables were selected I could begin the statistical testing. The program used for statistical testing is called SPSS, or Statistical Package for the Social Sciences. Within SPSS, Chi-Squared testing can be performed. Chi-Squared testing is a statistical process that compares the observed distribution to the expected distribution. The best way to do the Chi-Squared testing with the three variables is to revalue the data within the variables. Revaluing the data by giving each value a category from 1 to 4 makes it easier to understand distributions and comparisons. For example, each value within the snowmobile trails data set was revalued as either a 1, 2, 3, or 4 depending on its initial value. To do this, the range of the data was found and then divided by four, creating an increment used to calculate the range of each of the new value categories (1, 2, 3, or 4).

So for the snowmobile data the maximum initial value is 641 and the minimum is 0, making the range of the data 641. Dividing the range of 641 by 4 gives the increment value, ~160, on which each new value category is based. The next step is to take the maximum, 641, and subtract the increment value, ~160, resulting in 481. Therefore, any value that falls between 481 and 641 is given a new category value of 4, meaning these are the highest values in the snowmobile data set. The next step is to subtract ~160 from 481, resulting in 321. Thus any value in the 321-481 range is given a new category value of 3. Following this process to find categories 1 and 2, and repeating the same process for each variable, allows proper Chi-Squared testing to be done.
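The revaluing steps described above amount to equal-interval classification, which can be sketched in Python as follows. The minimum of 0 and maximum of 641 come from the snowmobile data described here, while the function name is illustrative and the exact increment (160.25) is used instead of the rounded ~160, so category boundaries differ very slightly from the rounded ones in the text.

```python
def equal_interval_category(value, minimum, maximum, classes=4):
    """Assign a value to one of `classes` equal-width categories (1 = lowest)."""
    increment = (maximum - minimum) / classes
    category = int((value - minimum) / increment) + 1
    # The maximum itself would land in an extra category, so cap it.
    return min(category, classes)


SNOWMOBILE_MIN = 0
SNOWMOBILE_MAX = 641

for miles in (0, 100, 200, 400, 500, 641):
    category = equal_interval_category(miles, SNOWMOBILE_MIN, SNOWMOBILE_MAX)
    print(f"{miles} miles -> category {category}")
```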

The reason for doing this is that the counties in each category value form a distribution, and a Chi-Squared test specializes in comparing distributions. The test determines how many counties would be expected to fall into each category and compares that to how many actually fall into each category. This is especially important because there are more counties in the southern portion of the state, below Highway 29, than in the north. Using this testing method allows the comparison between north and south to take place despite the difference in the total number of counties.
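To show how the expected counts account for the north having fewer counties, here is a minimal Python sketch of the calculation. The observed counts are hypothetical and are not the values from my SPSS output; the function names are also just for illustration. Each expected count is the row total times the column total divided by the grand total, which is the standard formula for a Chi-Squared test on a cross-tabulation.

```python
def expected_counts(observed):
    """Expected count per cell: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_squared(observed):
    """Sum of (observed - expected)^2 / expected over every cell.

    Assumes no category is completely empty (an expected count of 0
    would cause a division by zero).
    """
    expected = expected_counts(observed)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


# HYPOTHETICAL counts of counties in categories 1-4 (not the lab's data).
observed = [
    [5, 8, 9, 6],    # north of Highway 29
    [20, 14, 7, 3],  # south of Highway 29
]

for region, row in zip(("north", "south"), expected_counts(observed)):
    print(region, [round(e, 1) for e in row])
print("chi-squared:", round(chi_squared(observed), 2))
```

Because the expected counts are scaled by each region's row total, a region with fewer counties is only expected to contribute fewer counties to each category, which is what makes the north and south comparable.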

Results

Maps can be made for each variable and differences can be noted. However, making the maps without the statistical testing does not prove there is a real statistical difference. So the best route is to compare the maps with the Chi-Squared statistics and find instances where they complement each other. Note: For each Chi-Squared results table the "1" category represents the north and the "2" category represents the south.

Snowmobile Trails in Miles

Figure 2: Map of Snowmobile trails in miles by county
 This map does appear to show that the northern counties of the state do have a greater number of miles of snowmobile trails. The majority of the darker colored purple counties do indeed fall north of Highway 29.
Figure 3: Chi-Squared testing results for the snowmobile trails variable
The results of the Chi-Squared testing agree with what the map above portrays. For the northern counties in the 3 and 4 categories, the categories that represent the higher number of snowmobile trail miles, the observed count is higher than the expected count (if the distribution were even). This means there are more northern counties in the categories of high snowmobile trail miles than expected. The opposite is also true: there are fewer southern counties than expected in the categories of high snowmobile trail miles, and more southern counties than expected in the categories of low snowmobile trail miles.

Designated Natural Areas

Figure 4: State natural areas in acres by county
 Looking at this map it appears that there is no real difference between the north and south when it comes to state natural areas. In fact it almost appears that the southern counties may have more natural areas.
Figure 5: Chi-Squared testing results for the state natural areas variable
The Chi-Squared testing results disagree with what the map above displays. The testing results show that the northern counties actually possess a larger number of counties with high acreage of natural areas than expected. This is a sign that, despite not appearing on the map to have a large number of state natural areas, statistically speaking they do have a good proportion of state natural areas. Basically, this means the northern counties, compared to the south, do have a large number of natural areas for how little land they actually cover.

Population over 65

Figure 6: Map of Population over age 65 by county
This map shows that the total population over 65 is largely dictated by overall population numbers. This can be seen because the counties in darker shades of red are the counties closest to Wisconsin's two biggest cities, Milwaukee and Madison.
Figure 7: Chi-Squared testing results for the variable of population over 65
The Chi-Squared testing results confirm what the above map displays: the major cities dictate the population over 65 variable, and most likely every other total population variable for Wisconsin. Basically, no meaningful Chi-Squared analysis can be done because of the extreme outliers in total population found in the cities of Milwaukee and Madison.

Conclusion

The major connotation that goes along with the phrase "up north" is that the north is more rural and country compared to the city-driven south. Based on the three variables I selected, this phrase appears to be somewhat accurate. According to my Chi-Squared testing results, the north has proportionately more snowmobile trails and state natural areas, two variables that I consider representative of the "up north" phrase. The population over 65 variable was not able to provide anything meaningful, due to the large city populations. In retrospect this was a poor variable to choose for finding the difference between northern and southern Wisconsin. Overall it appears that there is some relevance to the phrase "up north", but of course more than three variables should be studied before claiming that phrase to be totally correct.



Tuesday, February 24, 2015

Using Z-Scores, Mean Center, and Standard Distance to Study Tornadoes in Kansas and Oklahoma

Introduction

Tornadoes in the states of Kansas and Oklahoma are a common occurrence. Because they are so common, there is an increasingly popular idea to build tornado shelters. But not all of the citizens of these states agree on the need for shelters, and those who do disagree about where the shelters should be placed. Using GIS and quantitative methods such as standard distance, mean center, and z-scores, a logical analysis can be performed on the area. The outcome of this analysis should help predict where tornado shelters should be located, if they are necessary at all.

Methods

To examine the placement of tornado shelters, I must analyze related data. The data that I will study are:
  • Shapefile of the counties of Kansas and Oklahoma with the total number of tornado occurrences per county.
  • Tornado locations with their widths from the years 1995-2006
  • Tornado locations with their widths from the years 2007-2012
Note: For the purpose of this exercise, I will assume that the wider a tornado the more powerful it is.

The analysis tools I will perform on this data are as follows:
  • Z-Scores- A z-score expresses how many standard deviations a specific value lies from the mean. If a z-score is positive the value is above the mean, and if it is negative the value is below the mean. Z-scores show how close a value is to the mean and can be of great use in determining the probability of events.
  • Mean Center- The mean center of a set of data is its spatial center. This is the same concept as an average or a mean, except that it describes a spatial location rather than just a number. The mean center can be used much like a mean can be, and it can be found via a tool within ArcMap.
  • Weighted Mean Center- A weighted mean center is the same as a mean center, but with a weight applied. Instead of giving the exact spatial center of a set of data, it gives the center of the data weighted by another factor.
  • Standard Distance- If the mean center of a set of data is basically a spatially located mean, then the standard distance is basically the spatial equivalent of standard deviation. A standard distance takes the mean center and creates a circle around it that represents the first standard deviation of the data points. This tool can also be found within ArcMap.
  • Weighted Standard Distance- The weighted standard distance is the same as the normal standard distance, but instead of using the mean center of a set of data, it uses a weighted mean center.
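For readers who want to see the arithmetic behind the tools listed above, here is a minimal Python sketch of the mean center, weighted mean center, and standard distance. It is not the ArcMap implementation I used in the lab, just the textbook formulas. The function names, coordinates, and widths are hypothetical, and the coordinates are assumed to be in a projected (planar) coordinate system.

```python
import math


def mean_center(points, weights=None):
    """Average x and y of the points, optionally weighted (e.g. by tornado width)."""
    if weights is None:
        weights = [1.0] * len(points)
    total = sum(weights)
    x = sum(w * px for (px, _), w in zip(points, weights)) / total
    y = sum(w * py for (_, py), w in zip(points, weights)) / total
    return x, y


def standard_distance(points, weights=None):
    """Radius of the standard distance circle around the (weighted) mean center."""
    if weights is None:
        weights = [1.0] * len(points)
    cx, cy = mean_center(points, weights)
    total = sum(weights)
    # Weighted average of the squared distances from the mean center.
    variance = sum(
        w * ((px - cx) ** 2 + (py - cy) ** 2)
        for (px, py), w in zip(points, weights)
    ) / total
    return math.sqrt(variance)


# HYPOTHETICAL tornado locations (x, y) and widths, for illustration only.
tornadoes = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
widths = [50.0, 50.0, 200.0, 200.0]

print("mean center:", mean_center(tornadoes))
print("weighted mean center:", mean_center(tornadoes, widths))
print("standard distance:", round(standard_distance(tornadoes), 3))
print("weighted standard distance:", round(standard_distance(tornadoes, widths), 3))
```

In this made-up example the wider tornadoes are at the larger y values, so the weighted mean center is pulled in that direction while the unweighted mean center stays in the middle.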

Results


Figure 1: 6 maps representing the tornadoes in Kansas and Oklahoma
The above results were found by running the standard distance and mean center tools within ArcMap. The maps with red data points show the years 1995-2006 and the maps with blue data points show the years 2007-2012.

Mean Center Results

Comparing the mean center of the data with the weighted mean center was a useful way of analyzing the data. The weight placed upon the mean center was the width of each tornado, on the assumption that wider tornadoes are bigger and more destructive. When this weight was applied, the mean center shifted slightly towards the South. This suggests that the more dangerous tornadoes tend to occur towards Oklahoma rather than Kansas.

The other interesting thing to note from the mean center calculations is the difference in the years. 

Figure 2: Mean Center and Weighted Mean Centers of Tornadoes from the two different sets of years

Although both mean centers move Southward when weighted, the 2007-2012 data had a weighted mean center slightly more to the North and East than the earlier set of years, indicating a stronger presence of wider tornadoes to the North and East.

Standard Distance Results

The standard distance results reinforced what the mean center analysis showed. Again, the standard distance circle for 2007-2012 sat slightly more to the Northeast. This simply means that the strongest tornadoes were recorded more in that direction when compared to past years. The reasoning behind those shifts is unclear.

What is clear is that the mean centers and standard distance circles are all centrally located within the study area. Even when the weight was applied, the mean center shifted only slightly. This shows that no particular area tends to be affected more by tornadoes, so these results are rather inconclusive in determining where tornado shelters should be built.

Standard Deviation Mapping Results

Figure 3: Standard deviation of total number of tornadoes per county

Because the initial analysis did not produce conclusive results, it is important to look deeper to discover patterns within the data. Above is a map that represents the standard deviation of the total number of tornadoes per county for the years 2007-2012. It highlights the counties that are well above the mean in total number of tornadoes. The counties in the darkest shade of red and with the highest number of yellow dots are the counties that would benefit most from tornado shelters.

It is also important to compare these results to the mean center and standard distance results. Because weighting by width did not severely change where the tornadoes were centered, it is fairly safe to presume that the more dangerous tornadoes occur all over the two states. Therefore, the counties with the highest number of tornadoes should be the ones that receive the most tornado shelters.

Z-Scores Results

Z-scores can also be used to determine which counties experience a higher number of tornadoes. To find a z-score, the mean and standard deviation must be found first. In this case, I found the mean and standard deviation of the total number of tornadoes per county for the years 2007-2012.

Mean = 4, meaning the average county has had 4 tornadoes
Standard Deviation = 4

For example purposes, the z-scores were found for 3 counties:

Russell = 5.25
Caddo = 2.25
Alfalfa = 0.25

These z-scores can be compared with those of other counties to determine which counties are experiencing a high or low number of tornadoes. In this case, all three counties were above the mean, since all three z-scores are positive. Russell County's z-score is so extremely high that it represents an outlier in the data. This would be one of the darkest red counties on the map above that could definitely use tornado shelters.
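The z-score calculation itself is just the value minus the mean, divided by the standard deviation. The short Python sketch below works backwards from the three z-scores reported above to the tornado counts they imply. Those counts are not listed in this post, so they are an inference that assumes the mean and standard deviation are exactly 4 rather than rounded values.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that a value lies above (+) or below (-) the mean."""
    return (value - mean) / std_dev


MEAN = 4     # mean tornadoes per county, 2007-2012 (from the text)
STD_DEV = 4  # standard deviation (from the text)

# Z-scores reported in the post; the implied counts are back-calculated,
# assuming MEAN and STD_DEV are exact and not rounded.
reported = {"Russell": 5.25, "Caddo": 2.25, "Alfalfa": 0.25}

for county, z in reported.items():
    implied_count = MEAN + z * STD_DEV
    print(county, "implied count:", implied_count,
          "z-score:", z_score(implied_count, MEAN, STD_DEV))
```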

Z-scores can also be used to determine probability, such as the likelihood that a certain number of tornadoes can be expected within the same time frame in the future. For this example the time frame is 5 years, so predictions can be made for the next 5 years.

For example:

70% of the counties over the next 5 years will experience over 1.92 tornadoes.
20% of the counties over the next 5 years will experience over 7.36 tornadoes.

This can be calculated for any percentage desired as long as you know the mean and standard deviation of the data. However, with this set of data these numbers do not mean much. They do not address the research question and in no way help determine where tornado shelters should be placed.
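As a rough check on the two percentages above, the same thresholds can be computed with Python's standard library. This sketch assumes the tornado counts follow a normal distribution with a mean of 4 and a standard deviation of 4, which is the assumption behind looking up a z-score in a table. The function name is made up for this example. The results land very close to 1.92 and 7.36; the small differences come from the z-table values being rounded to two decimals (-0.52 and 0.84).

```python
from statistics import NormalDist


def count_exceeded_by(share, mean, std_dev):
    """Tornado count that the given share of counties is expected to exceed.

    Assumes a normal distribution, as a z-table lookup does.
    """
    # If `share` of counties are above the value, then (1 - share) are below it.
    z = NormalDist().inv_cdf(1 - share)
    return mean + z * std_dev


MEAN = 4     # from the text
STD_DEV = 4  # from the text

for share in (0.70, 0.20):
    threshold = count_exceeded_by(share, MEAN, STD_DEV)
    print(f"{share:.0%} of counties are expected to exceed {threshold:.2f} tornadoes")
```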

Conclusion

The results, specifically the standard deviation map, showed that there were definitely areas with a higher volume of tornadoes. If the states decided to build tornado shelters, those counties would be the best place to start. However, it is also important to consider other factors, such as population, when building tornado shelters. Building a shelter where history has shown lots of tornadoes happen, but where no people live, would be pointless. It is always important to dig deeper into your data and results and realize exactly what they mean. In this exercise some z-scores were useful while other predictability numbers were not. Understanding your results is more important than just knowing how to find them.