Researchers in several fields frequently use Internet Protocol (IP) addresses to locate web users. However, reports on the accuracy of those locations conflict, and little research has manually compared IP geolocation databases against the geographic information published on web pages. Some studies have used analytical techniques to estimate IP address accuracy, but there has not been a large-scale manual investigation. Members of the GIScience community have developed techniques for visualizing point data, but there has been little research on how to apply these practices to IP address point data. This thesis worked at the intersection of GIScience, Internet research, and geostatistics to examine IP address (IPv4) accuracy at the city level and to determine better methods for visualizing IP address point values.
The first part of this thesis examined IP geolocation accuracy in more depth and processed several datasets for locational accuracy. A previously built custom search tool, which uses popular search engine APIs to extract a list of web pages for a keyword, was used to gather web pages for six keywords: “Mitt Romney”, “Rick Santorum”, “Michael McGinn”, “Jerry Sanders”, “Flu”, and “HPV Vaccine”. While manually visiting each web page gathered by the tool, I collected three types of data. First, I categorized each web page into one of twelve categories, ranging from “Blog” and “News” to “Education” and “Governmental”. Second, I examined the slant of the web page, that is, whether it supported or attacked the subject at hand. Third, and most important, I looked for the mailing or street address of the web page’s content creator and compared this address to the location given for the page’s IP address.
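The comparison between a manually found street address and an IP-derived location can be illustrated with a distance check. The following is a minimal sketch, not the procedure used in the thesis: it assumes both locations have already been converted to latitude and longitude, it uses the haversine great-circle formula, and the function names and the 40 km "city-level" threshold are hypothetical choices made only for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; the haversine formula assumes a sphere


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def matches_city_level(ip_location, page_location, threshold_km=40.0):
    """Return True if the two (lat, lon) pairs lie within threshold_km.

    threshold_km is an assumed stand-in for "city-level" agreement,
    not a value taken from the thesis.
    """
    return haversine_km(*ip_location, *page_location) <= threshold_km


if __name__ == "__main__":
    # Hypothetical coordinates: an IP geolocation and a geocoded street address.
    ip_location = (47.6062, -122.3321)
    page_location = (47.6100, -122.2000)
    print(round(haversine_km(*ip_location, *page_location), 1), "km apart")
    print("city-level match:", matches_city_level(ip_location, page_location))
```

A fixed distance threshold is only one way to define agreement; matching on the city name returned by a geocoder would be an equally reasonable alternative.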
The second part of this thesis attempted to answer the question of how to visualize IP address geolocation data, using the processed IP address data. When IP addresses are framed as points and a kernel density function is used to create a surface, how do different input parameters affect the final surface, and therefore pattern recognition? This study examined the kernel radius value, the method of calculating a population value, the method used to normalize the data, and the interactions of the different categories discussed above. Comparing the resulting maps side by side gave a better understanding of the spatial patterns identified using IP addresses and helped identify better techniques for removing unwanted IP address spatial inaccuracies. The resulting spatial patterns were also compared with relevant properties of the keywords.
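To make the effect of the kernel radius concrete, the sketch below evaluates a kernel density surface over a small grid for two radii. It is an illustration under stated assumptions rather than the method used in the thesis: the quartic kernel is one common choice and the source does not say which kernel was used, the points are assumed to be in projected planar coordinates (for example kilometres), and the function names and sample points are hypothetical.

```python
import math


def quartic_kernel_density(points, x, y, radius):
    """Density at (x, y) from planar points, using a quartic kernel.

    Each point contributes (3 / pi) * (1 - u^2)^2 / radius^2, where u is its
    distance from (x, y) divided by radius; points at or beyond the radius
    contribute nothing. Each point's kernel integrates to 1 over the plane.
    """
    total = 0.0
    for px, py in points:
        u2 = ((x - px) ** 2 + (y - py) ** 2) / radius ** 2
        if u2 < 1.0:
            total += (3.0 / math.pi) * (1.0 - u2) ** 2
    return total / radius ** 2


def density_surface(points, xs, ys, radius):
    """Evaluate the density on a grid; rows follow ys, columns follow xs."""
    return [[quartic_kernel_density(points, x, y, radius) for x in xs]
            for y in ys]


if __name__ == "__main__":
    # Hypothetical projected coordinates for a handful of IP address points.
    points = [(0.0, 0.0), (0.5, 0.2), (3.0, 3.0)]
    grid = [0.0, 1.0, 2.0, 3.0]
    for radius in (1.0, 3.0):
        surface = density_surface(points, grid, grid, radius)
        print("radius", radius)
        for row in surface:
            print(["%.3f" % value for value in row])
```

A small radius produces sharp, isolated peaks around each point, while a larger radius merges nearby points into a smoother surface, which is why the radius strongly shapes the patterns a reader sees. Normalizing by population, which the study also examined, would require a second surface and is not shown here.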
This research attempted to find the optimal method for separating the ‘signal’ of meaningful data and spatial patterns from the background ‘noise’ of insignificant results. With a better understanding of the signal, IP address information becomes more robust for statistical research and can be used to comprehend multiple sources of online data. By manually extracting accurate locational information and then processing the results using a series of methods, this research has shown a more appropriate technique for displaying IP address information in spatial analysis and GIScience.
Some maps are included below, but detailed explanations and a full copy of the thesis are not available because the thesis is being reviewed for submission to a journal.