What is this Blog About?



This quantitative methods in geography blog showcases the skills and techniques learned in the course GEOG 328 at the University of Wisconsin-Eau Claire. The focus of this class is on relating quantitative methods and statistics to geography.

Monday, March 16, 2015

Significance and Chi-Squared Testing

Part 1

Part one of this exercise was centered on significance testing. Below are the questions with their answers. 
1.       



Fill out the chart above! (10 pts)
α = Significance level
z or t = whether the test is a z test or a t test
z or t Value = Critical value


2.       The Department of the Interior in Washington, D.C. estimates that the numbers of particular invasive species in a certain county (Bucks County) should be as follows per acre (averages based on data from the whole state of Pennsylvania): Asian Long-Horned Beetle, 4; Emerald Ash Borer Beetle, 10; and Golden Nematode, 75.  A survey of 50 fields had the following results: (10 pts)

                Species                        Mean (μ)     Std. dev. (σ)
                Asian Long-Horned Beetle       3.2          0.73
                Emerald Ash Borer Beetle       11.7         1.3
                Golden Nematode                77           5.71
               
a.       Test the hypothesis for each of these species.  Assume that each test is two-tailed with a confidence level of 95%. *Use the appropriate test
b.      Be sure to present the null and alternative hypotheses for each as well as conclusions
c.       What can be ascertained from the findings about these invasive species in Bucks County?

Asian Long-Horned Beetle:
Null Hypothesis- There is no statistical difference between the sample number and the expected number of Asian Long-Horned Beetles collected
Alternative Hypothesis- There is a statistical difference between the sample number and the expected number of Asian Long-Horned Beetles collected
Results: A z-test gives a test statistic of -7.75, which is less than the critical value of -1.96. Therefore, the null hypothesis is rejected: at the 95% confidence level, the sample mean is significantly below the hypothesized amount.

Emerald Ash Borer Beetle:
Null Hypothesis- There is no statistical difference between the sample number and the expected number of Emerald Ash Borer Beetles collected
Alternative Hypothesis- There is a statistical difference between the sample number and the expected number of Emerald Ash Borer Beetles collected
Results: A z-test gives a test statistic of 9.25, which is greater than the critical value of 1.96. Therefore, the null hypothesis is rejected: at the 95% confidence level, the sample mean is significantly above the hypothesized amount.

Golden Nematode:
Null Hypothesis- There is no statistical difference between the sample number and the expected number of Golden Nematodes collected
Alternative Hypothesis- There is a statistical difference between the sample number and the expected number of Golden Nematodes collected
Results: A z-test gives a test statistic of 2.48, which is greater than the critical value of 1.96. Therefore, the null hypothesis is rejected: at the 95% confidence level, the sample mean is significantly above the hypothesized amount.
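
The three test statistics above can be reproduced with a few lines of Python. This is only a minimal sketch, assuming the usual one-sample formula z = (sample mean - hypothesized mean) / (standard deviation / sqrt(n)) with n = 50 fields. The function name z_statistic and the dictionary layout are my own illustration and not part of the assignment.

```python
from math import sqrt
from statistics import NormalDist


def z_statistic(sample_mean, hypothesized_mean, std_dev, n):
    """One-sample z statistic: (sample mean - hypothesized mean) / (s / sqrt(n))."""
    return (sample_mean - hypothesized_mean) / (std_dev / sqrt(n))


# Two-tailed critical value at the 95% confidence level (about 1.96).
critical = NormalDist().inv_cdf(0.975)

# (hypothesized mean, sample mean, standard deviation) taken from the question.
species = {
    "Asian Long-Horned Beetle": (4, 3.2, 0.73),
    "Emerald Ash Borer Beetle": (10, 11.7, 1.3),
    "Golden Nematode": (75, 77, 5.71),
}

for name, (expected, observed, sd) in species.items():
    z = z_statistic(observed, expected, sd, n=50)
    decision = "reject" if abs(z) > critical else "fail to reject"
    print(f"{name}: z = {z:.2f}, {decision} the null hypothesis")
```

Comparing the absolute value of z to a single critical value works here because each test is two-tailed.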


3.       An exhaustive survey of all users of a wilderness park taken in 1960 revealed that the average number of persons per party was 2.1.  In a random sample of 25 parties in 1985, the average was 3.4 persons with a standard deviation of 1.32 (one-tailed test, 95% confidence level) (5 pts)

a.       Test the hypothesis that the number of people per party has changed in the intervening years.  (State null and alternative hypotheses)
Null Hypothesis- The average number of people per party within the park has not changed between 1960 and 1985
Alternative Hypothesis- The average number of people per party within the park has changed between 1960 and 1985
Results: Due to the small sample size (n = 25), a t-test is conducted, resulting in a t-value of 4.92. This exceeds the one-tailed critical value of 1.711 (24 degrees of freedom), so the null hypothesis is rejected: the average number of people per party within the park changed significantly between 1960 and 1985, and it increased.
b.      What is the corresponding probability value?

0.000026 or 0.0026%
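
The t-value and its probability can be checked the same way. The sketch below assumes the one-sample t formula with n - 1 = 24 degrees of freedom. Instead of a statistics package or a printed table, it approximates the upper-tail probability by integrating the t density with Simpson's rule, so the helper names (t_statistic, t_pdf, upper_tail_p) and the step count are my own choices rather than anything prescribed by the lab.

```python
from math import gamma, pi, sqrt


def t_statistic(sample_mean, hypothesized_mean, std_dev, n):
    """One-sample t statistic: (sample mean - hypothesized mean) / (s / sqrt(n))."""
    return (sample_mean - hypothesized_mean) / (std_dev / sqrt(n))


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def upper_tail_p(t, df, steps=2000):
    """One-tailed P(T > t) for t >= 0, using Simpson's rule on [0, t]."""
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    # Half of the distribution lies above 0; subtract the area from 0 to t.
    return 0.5 - total * h / 3


t = t_statistic(3.4, 2.1, 1.32, 25)
p = upper_tail_p(t, df=24)
print(f"t = {t:.2f}, one-tailed p = {p:.6f}")
```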

Part 2

Introduction

Part two of this assignment focused on Chi-Squared testing, applied to the phrase "up north" as it is used in Wisconsin. The phrase is often used to describe the differences between the northern and southern people of the state, and it carries a series of judgments or even prejudices. The main distinction it implies is that the people of the north are more rural and the people of the south are more urban. The goal of this part of the lab was to use Chi-Squared testing to determine whether any real data back up the differences the phrase implies.

Methods

The first step in determining whether there are major differences between the northern and southern portions of Wisconsin is to develop a dividing line between the two. Ideally, this line would fall where people begin to describe an area as "up north". Any such line is somewhat arbitrary, since there is no sharp boundary between north and south in Wisconsin and the concept of "up north" varies from person to person. For simplicity, I used Highway 29, which runs east-west across nearly the entire state, as the rough dividing line. Counties north of Highway 29 are considered "up north" and counties south of it are the southern counties.

Figure 1: Map of Wisconsin counties split roughly along Highway 29
The next step of the lab was to choose three variables that most people associate with being up north. The three variables I chose were snowmobile trails, designated natural areas, and population over 65. I assumed all three would be higher in the northern parts of the state: snowmobile trails and natural areas because of the more rural landscape in the north, and population over 65 because I would expect the cities in the south to contain more young people than old.

Once the variables were selected, I could begin the statistical testing. The program used was SPSS (Statistical Package for the Social Sciences), which can perform Chi-Squared testing. Chi-Squared testing is a statistical process that compares an observed distribution to an expected distribution. The best way to run the test on the three variables is to revalue the data within each variable. Assigning each value a category from 1 to 4 makes the distributions easier to understand and compare. For example, each value in the snowmobile trails data set was revalued as a 1, 2, 3, or 4 depending on its initial value. To do this, the range of the data was found and divided by four, creating an increment used to calculate the bounds of each new value category (1, 2, 3, or 4).

For the snowmobile data, the maximum initial value is 641 and the minimum is 0, making the range 641. Dividing the range by 4 gives an increment of ~160, which sets the width of each new value category. Subtracting the increment from the maximum, 641, results in 481. Therefore, any value between 481 and 641 is given a new category value of 4, meaning these are the highest values in the snowmobile data set. Subtracting ~160 from 481 results in 321, so any value in the 321-481 range is given a category value of 3. Following the same process to find categories 1 and 2, and repeating it for each variable, prepares the data for Chi-Squared testing.
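
The revaluing step can be sketched in Python as follows. Note two assumptions that differ slightly from the walk-through: the sketch uses the exact increment of 160.25 rather than the rounded ~160, so the break points land at 160.25, 320.5, and 480.75, and a value sitting exactly on a break point is placed in the higher category. The function name equal_interval_class and the sample trail mileages are hypothetical.

```python
def equal_interval_class(value, minimum, maximum, classes=4):
    """Assign a value to one of `classes` equal-width categories (1 = lowest)."""
    increment = (maximum - minimum) / classes  # 641 / 4 = 160.25 for the trails data
    index = int((value - minimum) / increment) + 1
    # The maximum itself would otherwise land in a fifth category, so cap it.
    return min(index, classes)


# Hypothetical trail mileages, used only to show the revaluing.
for miles in (0, 100, 200, 400, 500, 641):
    category = equal_interval_class(miles, minimum=0, maximum=641)
    print(f"{miles} miles -> category {category}")
```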

The reason for doing this is that the counts of counties in each category form a distribution, and a Chi-Squared test specializes in comparing distributions. The test determines how many values are expected to fall into each category and compares that to how many actually do. This is especially important because there are more counties in the southern portion of the state, below Highway 29, than in the north. Using this method allows the comparison between north and south to take place despite the difference in the total number of counties.
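
To make the expected-versus-observed comparison concrete, here is a rough Python sketch of the arithmetic behind a Chi-Squared test on a region-by-category table. The lab itself used SPSS, so this is not the procedure that was run. The counts below are invented for illustration and are not the lab's data, and the function names are my own. Each expected count is the row total times the column total divided by the grand total, which is what lets regions with different numbers of counties be compared fairly.

```python
def expected_counts(observed):
    """Expected count per cell: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_squared(observed):
    """Sum of (observed - expected)^2 / expected over every cell."""
    expected = expected_counts(observed)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


# Invented counts of counties in categories 1-4 (NOT the lab's actual data).
observed = [
    [5, 7, 9, 4],    # region 1: north of Highway 29
    [18, 15, 8, 4],  # region 2: south of Highway 29
]

degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)
print("expected:", expected_counts(observed))
print("chi-squared:", round(chi_squared(observed), 3), "df:", degrees_of_freedom)
```

The sketch assumes every category contains at least one county. A category with no counties at all would produce an expected count of zero and a division error.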

Results

Maps can be made for each variable and differences can be noted. However, a map without statistical testing does not prove there is a real statistical difference. The best route is to compare the maps with the Chi-Squared statistics and find instances where they complement each other. Note: in each Chi-Squared results table, the "1" category represents the north and the "2" category represents the south.

Snowmobile Trails in Miles

Figure 2: Map of Snowmobile trails in miles by county
This map does appear to show that the northern counties of the state have more miles of snowmobile trails. The majority of the darker purple counties do indeed fall north of Highway 29.
Figure 3: Chi-Squared testing results for the snowmobile trails variable
The results of the Chi-Squared testing agree with what the above map portrays. For the northern counties in categories 3 and 4, which represent the higher numbers of snowmobile trail miles, the observed count is higher than the expected count (the count that would occur if there were no difference between north and south). This means there are more northern counties in the high snowmobile trail mileage categories than expected. The opposite is also true: fewer southern counties than expected fall in the high mileage categories, and more southern counties than expected fall in the low mileage categories.

Designated Natural Areas

Figure 4: State natural areas in acres by county
Looking at this map, there appears to be no real difference between the north and south when it comes to state natural areas. In fact, it almost appears that the southern counties may have more natural areas.
Figure 5: Chi-Squared testing results for the state natural areas variable
The Chi-Squared testing results disagree with what the map above displays. The results show that more northern counties than expected fall in the categories with high acreage of natural areas. This is a sign that, despite not appearing on the map to have a large number of state natural areas, the northern counties statistically have a good proportion of them. In other words, compared to the south, the northern counties have a large number of natural areas for how little of the state they cover.

Population over 65

Figure 6: Map of Population over age 65 by county
This map shows that the total population over 65 is largely dictated by overall population numbers. The counties in darker shades of red are the counties closest to Wisconsin's two biggest cities, Milwaukee and Madison.
Figure 7: Chi-Squared testing results for the variable of population over 65
The Chi-Squared testing results confirm what the above map displays: the major cities dictate the population over 65 variable, and most likely every other total population variable for Wisconsin. No meaningful Chi-Squared analysis can be done because of the extreme outliers in total population found in Milwaukee and Madison.

Conclusion

The major connotation that goes along with the phrase "up north" is that the north is more rural and country compared to the city-driven south. Of the three variables I selected, two suggest that the phrase is somewhat accurate. According to my Chi-Squared testing results, the north has proportionately higher amounts of snowmobile trails and state natural areas, two variables that I consider representative of the "up north" phrase. The population over 65 variable did not provide anything meaningful because of the large city populations. In retrospect, it was a poor variable to choose for finding the difference between northern and southern Wisconsin. Overall, it appears that there is some relevance to the phrase "up north", but more than three variables should be studied before claiming that the phrase is entirely correct.