Statistical Analysis and Visualization Services for Spatially Integrated Social Science Datasets

Irfan Azeezullah, Friska Pambudi, Tung-Kai Shyy, Imran Azeezullah, Nigel Ward, Jane Hunter
eResearch Lab, School of ITEE
The University of Queensland
Brisbane, Australia
{s.azeezullah1, f.pambudi, t.shyy, s.azeezullah, n.ward4, j.hunter}@uq.edu.au

Robert J. Stimson
Director of AURIN (Australian Urban Research Infrastructure Network)
The University of Melbourne
Melbourne, Australia
rstimson@unimelb.edu.au

Abstract: The field of Spatially Integrated Social Science (SISS) recognizes that much data of interest to social scientists has an associated geographic location. SISS systems use geographic location as the basis for integrating heterogeneous social science data sets and for visualizing and analyzing the integrated results through mapping interfaces. However, sourcing data sets, aggregating data captured at different spatial scales, and implementing statistical analysis techniques over the data are highly complex and challenging steps, beyond the capabilities of many social scientists. The aim of the UQ SISS eResearch Facility (SISS-eRF) is to remove this burden from social scientists by providing a Web interface that allows researchers to quickly access relevant Australian socio-spatial datasets (e.g. census data, voting data), aggregate them spatially, conduct statistical modeling on the datasets and visualize spatial distribution patterns and statistical results. This paper describes the technical architecture and components of SISS-eRF and discusses the reasons that underpin the technological choices. It describes some case studies that demonstrate how SISS-eRF is being applied to prove hypotheses that relate particular voting patterns with socio-economic parameters (e.g., gender, age, housing, income, education, employment, religion/culture). Finally we outline our future plans for extending and deploying SISS-eRF across the Australian Social Science Community.
Keywords: spatial social science; data integration; statistical analysis; geospatial information systems.

I. INTRODUCTION

Social scientists have long recognized that adopting a spatial approach to understanding and analysing social science data is important in many fields, including demographic research, population science, understanding socio-economic inequalities across regions, population health and urban planning. The use of Geographic Information Systems (GIS) and spatial statistics in the social sciences enables researchers to identify geographical patterns and processes that are critical for informing decisions made by resource managers, infrastructure planners and policy makers, particularly in government agencies.

Many tools that support a spatial approach to social science have emerged over the past decade (e.g., Esri ArcGIS geospatial visualisation and analysis tools [18], SpaceStat statistical packages with spatial functionality [19]), but these tend to be either specialized and sophisticated with a steep learning curve, or commercial off-the-shelf products that are not user friendly, problem-specific, or flexible enough to be tailored to the demands of researchers. Spatially integrated social science requires a new kind of tool, capable of powerful analysis but comparatively accessible, extensible, flexible and user friendly.

In this paper, we describe the UQ SISS eResearch Facility (SISS-eRF)¹, a Web-based extensible framework that enables social scientists with little or no prior programming or statistical analysis skills to easily access social science data sets of interest to them, aggregate them at different geographic scales, analyse and model them statistically and generate visualizations and graphics that provide empirical evidence of spatial social science patterns or trends.

II. OBJECTIVES

There is already a wide variety of Web-based GIS applications in the fields of social sciences and planning.
These range from applications that enable the analysis of economic development, ecotourism, crime, voting patterns, service demands, and urban quality of life to spatial decision support systems that underpin decision-making in local and regional planning [3, 6, 9, 10, 11, 12, 13, 14, 15]. However, exploring and identifying spatial patterns and relationships between socio-economic parameters and other indicators (such as health, behaviour, crime and quality of life) is not easy for social scientists [7], given the large volumes of data involved, the need to understand and encode relationships between data and geography, and the need to implement appropriate statistical analysis techniques. For this reason, many of these existing projects have focused on the use of Web-based GIS for accessing and integrating spatial social science data on-line. Some projects have also augmented Web-based GIS capabilities with simple statistical modeling tools. But as far as we are aware, there are no examples of Web-based spatial social science applications that provide researchers with the full range of tools needed to prove hypotheses about particular spatial patterns in social science data. Additionally, although some of the previously cited systems are available online, most use proprietary technology, making them difficult to adapt or repurpose.

¹ http://www.esocialscience.org/

The objective of the work described here is to provide a Web-based system that combines geospatial interfaces with statistical methods and analytics to enable social scientists to easily access, correlate, visualize and explore spatially integrated social science data. In particular, we want to build a viable system that includes a critical mass of geospatial and statistical analysis tools, visualisation tools and useful data to prove hypotheses about spatial patterns.
Moreover, our aim is to build an extensible framework that can easily incorporate new statistical modeling tools as they become available and that is built on open source technologies.

III. CASE STUDY

A highly topical research area in spatial social science is exploring relationships between voting patterns and the demographic and socio-economic characteristics of polling booth catchments and electorates across a region. An understanding of such patterns is extremely valuable to social scientists as well as to political parties. For example, Taylor [17] showed through aggregation of 2001 census and polling booth data that there is a strong correlation between voter support for the Australian Greens Party and voters’ tertiary education and secularity. Stimson, Chhetri, and Shyy [16] applied discriminant analysis over 2001 census and voting data to determine:

• that voter support for the Labor Party leaned toward asset-poor, multicultural areas; and
• that the National Party and the One Nation Party tended to compete for votes in areas that were characterised as asset-rich, monocultural, low income and low education areas.

These patterns were then used to accurately identify heartlands of voter support for both the Coalition parties (Liberal Party, National Party and Country Liberal Party) and the Labor Party in the 2004 federal election. Zones in transition between those Coalition and Labor voter heartlands (marginal voting areas) were also identified.

Our initial goal is to evaluate SISS-eRF by collaborating with social scientists who are analysing relationships between voting patterns and socio-economic factors using the 2010 Australian Federal election data and 2006 Australian Bureau of Statistics Census data.

IV. SYSTEM ARCHITECTURE

Figure 1 illustrates the overall system architecture.
It is built on an open source software stack that comprises four key components:

• Backend Data Storage (PostgreSQL + PostGIS);
• Statistical analysis & visualisation services (Java + R, http://www.r-project.org/);
• Geospatial selection & visualisation services (GeoServer plus OpenLayers and GeoExt);
• User interface (jQuery plus metadata services).

A. Backend Data and Metadata Storage

Data variables and geographies are stored in a PostgreSQL database, extended with PostGIS support for indexing geographic objects. PostGIS follows the Simple Features for SQL specification from the Open Geospatial Consortium (OGC), and forms the basis for many open-source GIS projects and communities, including the OpenStreetMap project.

Figure 1: Overall System Architecture of the Spatially Integrated Social Science eResearch Facility (SISS-eRF)

Currently there are two methods to access the large data sets stored in the database: a direct JDBC query and a mediated data access mechanism using the Java Hibernate interface. To enable distributed data access and computation, we keep the compute modules separate from the data and specify interface APIs for access.

Raw data is manipulated in CSV to create new “derived” variables that answer common socio-spatial science questions (e.g., the number of Generation Y voters). CSV files were manually manipulated in Microsoft Excel to:

• Select and extract just the data of interest to our target researchers from large datasets;
• Create percentage figures comparing the value of a variable against the total population in a region (the raw data usually only contained absolute counts for a region);
• Create location quotients comparing each variable's local value against the national benchmark for that variable.

These converted files were then combined with ESRI Shape files defining geographic regions. The ESRI Shape files were then transformed into SQL format using the “shp2pgsql” tool and ingested into the database using “psql”.
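The two derived measures described above (percentages and location quotients) can be illustrated with a minimal sketch. This is not the project's code, which used manual spreadsheet steps; the `Region` class, the function names and the counts are all hypothetical, and the location quotient is assumed to be the local percentage divided by the percentage across all regions.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Region:
    """One geographic unit with raw counts (all values are hypothetical)."""
    name: str
    population: int  # total persons counted in the region
    count: int       # persons with the attribute of interest


def percentage(region: Region) -> float:
    """Share of the region's population with the attribute, in percent."""
    if region.population == 0:
        # Zero-population regions need cleaning before any derivation.
        raise ValueError(f"{region.name} has zero population")
    return 100.0 * region.count / region.population


def location_quotient(region: Region, regions: list[Region]) -> float:
    """Local percentage divided by the percentage over all regions."""
    total_population = sum(r.population for r in regions)
    total_count = sum(r.count for r in regions)
    if total_population == 0 or total_count == 0:
        raise ValueError("benchmark share is undefined")
    benchmark = 100.0 * total_count / total_population
    return percentage(region) / benchmark


regions = [
    Region("A", population=1000, count=250),
    Region("B", population=3000, count=450),
]
pct_a = percentage(regions[0])
lq_a = location_quotient(regions[0], regions)
lq_b = location_quotient(regions[1], regions)
```

Under this definition a quotient above 1 marks a region where the attribute is over-represented relative to the benchmark, and a quotient below 1 marks under-representation.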
In order to support selection and analysis of variables within the user interface, the storage layer exposes services that list:

• available geographical locations, their type (division/metro/state) and co-ordinates (latitude, longitude);
• available electoral and socio-economic variables, their type (to enable a decision about which statistical algorithms can be applied to the variable), and display names.

B. Statistical Analysis Services

Statistical and classification computation capabilities are implemented as RESTful services in Java. Statistical results and associated metadata are exposed by the Java services as XML using the XStream Java library [20] to enable easy integration into other services.

Initially we implemented three data classification algorithms in Java to test the performance: Equal interval, Quantile, and Natural breaks. Based on this experience, we decided to implement the following data classification algorithms in R [21] and call the R routines from Java:

• Equal Interval
• Quantile
• Jenks Breaks
• Fisher
• Standard Deviation
• Hierarchical clustering
• Block clustering
• K-means clustering

In addition to the data classification algorithms, we have integrated R implementations of Regression and Regression Line Fitting and are currently working on the following:

• Generalized linear models;
• Multinomial logistic regression;
• Proportional-odds logistic regression;
• ANOVA (ANalysis Of Variance).

Java servlets interface to the R routines using Rserve [22], which translates Java objects to R objects and vice versa. The Java servlets pass both data and commands to the R algorithms. The commands specify the type of algorithm requested, the specific type of implementation required, and any additional options needed for the algorithm to perform the computation. The results from R are provided as an R-object with embedded list responses.
These embedded list responses are parsed by the REngine library [23] and the required data and metadata are extracted into POJOs (Plain Old Java Objects), for interoperability with other Java modules in a workflow environment. This approach enables the Java routines to take advantage of sophisticated routines implemented in R.

One of the key reasons for implementing the algorithms in R is to take advantage of the existing cache of validated and trusted algorithms, rather than implementing such algorithms from scratch in pure Java. The current R-project distribution provides 70 dedicated social science ‘task-views’ packages [24] providing statistical analysis algorithms and more than 1500 packages targeting different domains. Another advantage of using R is that many of the algorithms are implemented in C/C++ and optimized for analyzing large scale datasets, resulting in performance that is 2 to 10 times faster than the classification algorithms we initially implemented in Java.

All statistical results, including the geo-spatial data classification, are exported as XML. These XML results are then transformed into the following formats for display via the frontend Web interface: images (PNG/JPEG) using Processing [26], interactive charts using Processing JS [25], and PDF representations using the iText Library [27].

C. Delivering Maps and Features

The GIS components in the SISS-eRF system, including some of the back-end systems and services, rely heavily on the widely adopted OpenGeo stack framework and architecture [28]. This architecture underpins many open-source GIS projects and communities, including OpenStreetMap.
The key components of the OpenGeo stack and architecture that are used in the SISS-eRF system are:

• Storage: PostGIS spatial database
• Application server: GeoServer map/feature server
• User interface map component: OpenLayers
• User interface framework: GeoExt

GeoServer is an open-source map, feature and transactional server for GIS data, written in Java. It serves as the reference implementation of the Open Geospatial Consortium Web Feature Service (WFS) and Web Coverage Service (WCS) standards. Each of the spatial databases in SISS-eRF is connected through a separate GeoServer namespace or workspace. GeoServer can automatically expose all the tables in a particular database as separate layers within a namespace. In the SISS-eRF system, we expose geospatial and social science data using the Web Map Service (WMS) and Web Feature Service (WFS).

We use the OpenLayers JavaScript API to display maps and layers in the browser that have been served as PNG images from GeoServer's WMS. We also use OpenLayers to convert the output from our classification services into maps by dynamically generating Styled Layer Descriptors (SLD) for the classification and passing these in requests to the GeoServer WMS.

GeoExt provides customisable mapping widgets, applications and data handling support. GeoExt is mainly used for interfacing the map with GeoServer's MapFish printing module. It also provides a panel that shows the map legends and a slider for map zooming.

D. User Interface Framework

We use the jQuery UI JavaScript library to deliver interactive widgets such as tabs and accordions in our user interface. Additionally, JavaScript in the user interface interacts with the metadata service mentioned previously to dynamically create the user interface components based on the available variables and geographies.

Processing JS is used to dynamically generate static and interactive charts and graphs in the user’s web browser.
If the system detects that the user’s browser does not support the HTML5 canvas element, then a server-side Processing library produces a JPEG representation of the result and serves that to the web browser instead.

V. USER INTERFACE AND FUNCTIONALITY

Consider again our case study from Section III. A social scientist is interested in identifying those demographic and socio-economic factors that are associated with people from the State of Victoria who vote for the Coalition (Liberal and National) parties. For this test case we use the following data:

• Primary votes cast for the Coalition candidate standing for the House of Representatives at the 2010 Australian federal election, from the 1719 polling booths in Victoria, Australia.
• Derived data from the 2006 census [1], which provides 48 variables that represent the demographic and socio-economic characteristics of the population living within those polling booth catchments (see TABLE I).

Multiple regression modeling is applied to gain an indication of which demographic and socio-economic factors are significant. Step-wise regression analysis identified 28 variables that are statistically significant at the 95% confidence level (see TABLE II), with an adjusted R² = 0.774. Thus those variables account for 77.4 percent of the variation across the 1719 polling booths in Victoria in the primary vote for the Coalition.

TABLE I. VARIABLES DERIVED FROM THE 2006 AUSTRALIAN CENSUS REPRESENTING THE DEMOGRAPHIC AND SOCIO-ECONOMIC CHARACTERISTICS OF POPULATIONS LIVING IN POLLING BOOTH CATCHMENTS

Age and sex
    % population males (MALES)
    % population age 0-17 years children and youth (YOUTH)
    % population age 18-22 years first voters (FIRST)
    % population age 23-34 years (GEN Y)
    % population age 35-44 years (GEN X)
    % population age 45-59 years boomer (BOOMERS)
    % population age 60-74 years (Post Depression Wartime Generation) (WW2GEM)
    % population age 75+ years (Pre Depression Generation) (DEPGEN)
Family and household structure
    % single person households (SINGLES)
    % couple without children households (COUPLES)
    % one parent family households (ONEPARENT)
    % couples with children households (COUPCHILD)
Housing tenure
    % households that are home owners (HOMEOWN)
    % households that are home purchasers (MORTGAGEES)
    % households that are private renters (RENTERS)
    % households that are public housing tenants (PUBHOUS)
Ethnicity/race
    % indigenous persons (INDIG)
    % born overseas (IMMIG)
    % born in UK (UK)
    % born in Southern and Eastern Europe (SEEUROPE)
    % born in Middle East (MIDEAST)
    % born in Asia (ASIA)
Religious affiliation
    % Catholic (CATH)
    % Anglican (ANG)
    % Pentecostal (PENT)
    % other Christian (OTHCHRIST)
    % Islamic (ISLAM)
    % other non-Christian religion (ONCHREL)
    % with no religion (NORELIG)
Residential stability/Mobility
    % of population not at the same address 5 years ago (MOBILE)
Digital divide
    % dwellings (not population) using Internet (INTERNET)
Engagement in work
    Labour force participation rate (INWORK)
    Unemployment rate (UNEMPLOY)
Industry of work
    % employed in Extractive Industries (EXTRACT)
    % employed in Transformative Industries (TRANSFORM)
    % employed in Distributive Services (DISTRIB)
    % employed in Producer/Business Services (BUSSERV)
    % employed in Social Services (SOCSERV)
    % employed in Administrative & support services (ADSS)
    % employed in Personal Services (PERSERV)
Occupation* (Robert Reich’s categories)
    % employed as routine production workers (ROUTPROD)
    % employed as in-person service workers (INPERS)
    % employed as symbolic analyst (SYMBA)
Human capital
    % persons age 15 years and over with a degree or higher qualification (DEGREE)
    % persons age 15 and over with a certificate, diploma or advanced diploma (CERTDIP)
Income#
    Low income category: % households in the lowest quintile for household weekly income (less than $650) (LOWINC)
    Middle income category: % households in the middle three quintiles for household weekly income ($650-$1,999) (MIDINC)
    High income category: % households in the highest quintile for household weekly income ($2,000+) (HIGHINC)

This analysis indicates that polling booth catchments which have a positive relationship to voting for the Coalition in the 2010 Federal election tend to have populations characterized by: employment in extractive, distributive, business services, or administrative industries; having Anglican, Pentecostal or other Christian religious affiliation; coming from high income households; being indigenous or of Asian descent; renting a house; having a paid job; and having moved house in the last 5 years.

Polling booth catchments which have a negative relationship to voting for the Coalition tend to have populations characterized by a greater incidence of: Generation Y, Generation X, Baby Boomers and Youths; persons age 15 years and over with a degree; routine production workers or those employed in social services or personal services; unemployed workers; single parent households; those born in the UK or southern and eastern Europe; Catholics; and first-time voters.

TABLE II. RESULTS OF A STEP-WISE REGRESSION MODEL INVESTIGATING THE RELATIONSHIP BETWEEN THE COALITION PRIMARY VOTE AND THE CHARACTERISTICS OF POPULATIONS LIVING IN POLLING BOOTH CATCHMENTS

30th model solution (Adjusted R² = 0.774)

Variable      Standardized Beta coefficient    t          Significance
(Constant)    67.979                           7.291      .000
EXTRACT       .205                             4.114      .000
ANG           .239                             10.904     .000
UNEMPLOY      -.141                            -8.206     .000
OTHCHRIST     .051                             2.890      .004
ONEPARENT     -.128                            -6.585     .000
GEN X         -.161                            -8.992     .000
HIGHINC       .157                             5.090      .000
INDIG         .089                             6.707      .000
GEN Y         -.416                            -13.126    .000
BOOMERS       -.118                            -5.013     .000
PENT          .056                             4.433      .000
RENTERS       .130                             5.626      .000
DISTRIB       .057                             2.376      .018
SEEUROPE      -.065                            -4.746     .000
DEGREE        -.325                            -8.569     .000
ROUTPROD      -.236                            -6.591     .000
ASIA          .101                             5.062      .000
INWORK        .153                             6.797      .000
BUSSERV       .089                             2.195      .028
UK            -.109                            -6.370     .000
CATH          -.093                            -5.574     .000
FIRST         -.064                            -3.948     .000
MALES         .060                             3.604      .000
ADSS          .031                             1.983      .048
PERSERV       -.059                            -3.565     .000
SOCSERV       -.063                            -2.737     .006
MOBILE        .056                             2.990      .003
YOUTH         -.053                            -2.433     .015

The Variable column lists the polling booth catchment demographic and socio-economic variables defined in TABLE I.

A. Statistical and geospatial visualisation

Based on the trends uncovered by the multiple regression analysis tools available within SISS-eRF, we can visualize the relationship between specific variables, for example between the percentage of votes for Coalition candidates and:

• the percentage of voters who are Anglican;
• the percentage of voters who are Generation Y (age).

Figures 2 & 3 below illustrate statistical visualizations of regression line fitting of these variables. There is a positive correlation between people who are Anglican and people who vote for the Coalition. There is a negative correlation between Generation Y voters and people who vote for the Coalition.

In addition, users are able to visualize these relationships spatially by displaying the layers, overlaid and color coded, in the mapping interface (juxtaposed alongside the statistical graphs).
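As a rough illustration of the regression line fitting behind Figures 2 and 3, the sketch below fits a single-predictor least squares line and computes R² and adjusted R². It is not the system's R routine; the function names are ours and the four data points are invented purely so the example runs.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two or more paired observations")
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values are all identical")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Proportion of the variation in y explained by the fitted line."""
    y_bar = mean(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n, k):
    """Penalise R^2 for fitting k predictors to n observations."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# Invented values: % Anglican and % Coalition primary vote per catchment.
anglican_pct = [5.0, 10.0, 15.0, 20.0]
coalition_pct = [30.0, 38.0, 41.0, 51.0]

slope, intercept = fit_line(anglican_pct, coalition_pct)
r2 = r_squared(anglican_pct, coalition_pct, slope, intercept)
adj_r2 = adjusted_r_squared(r2, n=len(anglican_pct), k=1)
```

A positive slope corresponds to the kind of positive relationship reported for the Anglican variable, and a negative slope to the kind reported for Generation Y. The adjusted figure is always at or below the raw R² once at least one predictor is fitted.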
Figures 4 & 5 below show how the relationships can be visualized on maps by color-coding the centroid of polling booth locations to represent the percentage of Coalition vote, and color-coding the polling booth catchment regions to represent the variable range (e.g., percentage of Anglicans).

Figure 2. Regression line fitting of percentage primary vote for the Coalition parties versus percentage of voters who are Anglican

Figure 3. Regression line fitting of percentage primary vote for the Coalition parties versus percentage of voters who are Generation Y

VI. USER INTERFACE

The SISS-eRF user interface is shown in the bottom part of Figure 1 as well as Figures 2-5. It has three main components, exposed as separate tabs.

A. Map Selection and Data Classification

In this tab (shown in Figures 4 and 5), there are two main components: the map controls (on LHS) and the map display panel (RHS). The map controls interface supports:

• Area and Layers selection; and
• Data Classification selection.

In the Area and Layers section, users may choose a particular region to display on the map (State or Electorate) and toggle the map's base and overlay layers. For example, the user can choose to display different levels of geography, e.g., electoral boundaries, polling booth catchments, Local Government Areas (LGAs) or suburb boundaries.

Figure 4. Percent primary votes for Coalition (color-coded polling booth locations) overlaid on percent of Anglicans (color-coded polling booth catchment regions)

Figure 5. Percent primary votes for Coalition (color-coded polling booth locations) overlaid on percent of GenerationYs (color-coded polling booth catchment regions).

One of the major functions supported by SISS-eRF is the generation of thematic map displays of variables in either regions (polygons) or points.
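Thematic display of this kind depends on assigning each value to a class. The sketch below illustrates the equal interval and quantile schemes named in Section IV.B, together with the two comparison measures (TWGV and TWGD) described later in this subsection. It is an illustration only, not the system's R code. We assume TWGV is the sum of squared deviations from class means and TWGD the sum of absolute deviations from class medians; the cited papers [2, 8] give the authoritative definitions. The function names and the sample values are hypothetical.

```python
from bisect import bisect_left
from statistics import mean, median


def equal_interval_breaks(values, k):
    """Upper bounds of k classes that split [min, max] into equal widths."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + width * i for i in range(1, k)] + [hi]


def quantile_breaks(values, k):
    """Upper bounds of k classes holding roughly equal numbers of values."""
    ordered = sorted(values)
    n = len(ordered)
    # Class i ends at the value in position ceil(i * n / k) - 1.
    return [ordered[(i * n + k - 1) // k - 1] for i in range(1, k + 1)]


def group_by_class(values, breaks):
    """Place each value in the first class whose upper bound is >= value."""
    groups = [[] for _ in breaks]
    for v in values:
        index = min(bisect_left(breaks, v), len(breaks) - 1)
        groups[index].append(v)
    return groups


def twgv(groups):
    """Sum of squared deviations from each class mean (assumed TWGV)."""
    total = 0.0
    for g in groups:
        if g:
            centre = mean(g)
            total += sum((v - centre) ** 2 for v in g)
    return total


def twgd(groups):
    """Sum of absolute deviations from each class median (assumed TWGD)."""
    total = 0.0
    for g in groups:
        if g:
            centre = median(g)
            total += sum(abs(v - centre) for v in g)
    return total


values = [1, 2, 3, 4, 5, 6, 7, 8, 100]  # hypothetical, with one outlier
equal_groups = group_by_class(values, equal_interval_breaks(values, 3))
quantile_groups = group_by_class(values, quantile_breaks(values, 3))
```

With these sample values the equal interval scheme isolates the outlier in its own class and leaves one class empty, while the quantile scheme fills every class evenly but mixes the outlier with ordinary values. Lower TWGV or TWGD indicates more homogeneous classes, which is what makes the two measures usable for comparing schemes.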
When users select the Data Classification interface, they are presented with a drop down menu which provides the following choices for thematic geographic display of the data:

• Equal interval, which classifies the features into equally divided ranges of attribute values;
• Quantile classification, in which each class contains approximately the same number of features;
• The natural breaks approach, which is a median-based natural breaks classification that optimises attribute similarity. This method is used in Figures 4 and 5.

Two statistical measures for comparing the performance of the equal interval, quantile and natural breaks approaches are also presented:

• total within group variance (TWGV) (referred to as group variation in [2]), which is associated with the grouping model of Fisher [4] and the Jenks [5] optimisation;
• total within group difference (TWGD) (referred to as absolute deviation in [2]), which is the measure structured in the median clustering objective [8].

Users are able to choose between TWGD, TWGV or no statistical comparison of the performance of the classification approaches.

B. Statistical Analysis

This tab enables users to conduct statistical analysis. It provides buttons enabling users to choose between the different statistical analysis algorithms supported by the system:

• Regression
• Generalized linear
• Multinomial logistic regression
• Proportional-odds logistic regression
• ANOVA (ANalysis Of Variance)

After one of the buttons is clicked, users are presented with a form to choose the variables that they wish to investigate further and a “Run Statistical Analysis” button to run the analysis. The output is shown on the “Results and Offline Download” tab. The Statistical Analysis tab also includes Metadata Information that describes each of the variables that the system is hosting.

C. Results and Offline Download

This tab (on the bottom RHS of Figure 1) displays the results of the statistical analysis to the user.
It can display static charts, linear or bar graphs, interactive charts, as well as text based statistical descriptions of the results (see Figures 2 and 3). Users can also choose to create and download PDF versions of the visualisations or CSV files of the results for offline use.

VII. EVALUATION

A. Testing Framework

To test the system from a user’s perspective, we integrated Selenium [29], a testing framework for Web applications, into a Web browser interface. This enabled us to validate the frontend, to parse the results from the backend and to verify the display and visualisation of data, input options and results. For example, we added test suites to check that all variables in our metadata catalog were exposed in the user interface (some geographies have over 300 statistical variables and 50 location quotients, making it easy to miss variables without automated testing).

One example of a test suite is the example described above. The Web application server is tasked to determine the statistical linear regression dependence of a single dependent variable upon a single independent variable, a subset of independent variables and a random selection of independent variables. Test results are logged to determine the failed tests and the number of failures.

To enable test-driven development and to support continuous integration of the Web frontend, middleware and backend, Selenium [29], Java unit tests and R unit tests are used.

B. User Feedback

Throughout the project, we collaborated with social scientists from the University of Queensland School of Geography Planning and Urban Environment, and from the University of Melbourne Australian Urban Research Infrastructure (AURIN) Project, who provided ongoing feedback about the evolving system.
Based on their feedback, we implemented the following extensions and refinements:

• Included support for scatter plots to display the data points associated with linear regression visualisations (as shown in Figures 2 & 3);
• Added the ability to generate grayscale maps that can be embedded in papers and presentations that do not support color;
• Parallelized the multiple regression algorithm in order to improve performance from O(log N) to O(1);
• Translated the classification algorithms from Java into R in order to improve their performance by up to 10 times;
• Added test suites to check that all variables in our metadata catalog were exposed in the user interface.

We also prioritized a list of additional statistical algorithms to add to the system in order to test new hypotheses suggested by these researchers.

The spatio-social scientists also uncovered some interesting artefacts in the underlying data: a few polling booth catchments had zero populations. On closer inspection, some of these artefacts represented polling booths in national parks, and one represented a polling booth located within an airport. After consultation with the researchers, some of these catchments were removed from the data, while others were merged with surrounding catchments.

VIII. DISCUSSION

A. Support for Multiple Analysis Approaches

The SISS-eRF has proven to be a versatile toolkit, able to support the different analytical methods required by socio-spatial scientists, geographers and regional scientists. Socio-spatial scientists typically use the system as described above: they first perform statistical analyses in order to test hypotheses and then create statistical and geographic visualisations to support their hypotheses and to communicate their results. Geographers and regional scientists, on the other hand, often begin by using the mapping interface to uncover perceived geospatial trends in the data, and then use the statistical analysis tools to confirm (or disprove) those trends.
Our system is designed to support both approaches.

B. Benefits of the Revised Web Architecture

This development extended and refined earlier efforts to establish an e-Research Facility for Socio-Spatial Analysis at the University of Queensland [7]. This pre-existing facility was built using proprietary technologies, and one of the primary goals of the work described here was to migrate this previous system over to open source technologies. As part of this migration we also incorporated modern Web technologies into the system and improved modularization:

• The new system separates presentation of statistical results from the construction of the results. In the previous system, R routines performed statistical analysis and created graphs representing the results. In the new system, R routines produce an R-Object consumed by a Java wrapper to produce an XML representation of the results. This representation can then be independently interpreted to create images, JavaScript-driven interactive charts and PDF representations of the results.
• We moved the definition of variables outside the user interface. JavaScript now interacts with a metadata service to dynamically create the user interface components. This means that new data can more easily be incorporated into the system.
• Map image tiles are now served from GeoServer via a Web Map Service (WMS). They are then rendered in the Web browser using a combination of the OpenLayers and GeoExt JavaScript libraries. This allows easier switching between map layers in the user interface.

C. Computing Challenges

The greatest challenge in implementing the described system was associated with the integration of a wide range of existing open source applications and tools that were not designed for the Web-based, interactive use cases that we have described.
The system integrated: R for statistical analysis, Processing.js for visualisation, iText for PDF processing, PostGIS and GeoServer for handling geographic spatial information and Java for handling the Web services, data and metadata. Testing and debugging the system was a challenge due to the difficulty associated with pin-pointing the origin of an error and the condition that caused it. Validation and cleaning of the input data was also very important to prevent errors being propagated into our derived results. Validity, accuracy and repeatability of the results are a core requirement of the project, so accurately capturing the provenance of the derived visualizations and graphics was another critical and challenging aspect of this project.

IX. FUTURE WORK

One of our project aims was to build a viable system that includes enough tools and useful data to allow social scientists to test and prove hypotheses about spatial patterns. Any viable system needs to contain a critical mass of value-added (derived) data of interest to researchers. Deriving variables from raw demographic, socio-economic and voting data and importing them into the system is currently a labour intensive, time consuming and potentially error-prone process. To ease the generation of value-added data, workflow tools and ingest tools to support specification and automatic generation of derived variables are a necessity. These tools could also automatically record the provenance / lineage of variables derived through this process.

A streamlined method to enable addition of new statistical algorithms and the ability to chain results into multiple analysis tools are some of the key next steps to provide a complete tool for online geo-spatial statistical analysis. We are also keen to incorporate new data sets (associated with housing, transport, labour force and crime) to evaluate the tools in the context of other social science sub-disciplines.
Finally, the AURIN (Australian Urban Research Infrastructure Network) project is a national initiative that is developing eResearch infrastructure to be shared by the Australian social science research community. We are currently working with AURIN to integrate our services into their framework. We also plan to integrate the 2011 Census data recently published by the Australian Bureau of Statistics.

X. CONCLUSIONS

This paper describes SISS-eRF: a Web-based extensible framework that enables social scientists with little or no prior programming or statistical analysis skills to easily access social science data sets of interest to them, aggregate them at different geographic scales, analyse and model them statistically, and generate visualizations and graphics that empirically demonstrate spatial social science patterns or trends. We aimed to build a viable system that includes a critical mass of geospatial, statistical analysis and visualisation tools and useful data to prove hypotheses about spatial patterns. In order to address this “viability” goal, we:

- involved social scientists in the development of the system, and iteratively developed the system in response to their feedback;
- used R and Rserve to build an extensible framework that can easily incorporate new statistical modeling tools as they become available;
- used only open source technologies so that others could adapt and repurpose our work;
- derived hundreds of statistical variables and associated them with geographies of interest to researchers.

The case studies in this paper show that SISS-eRF can be used to quickly and easily identify trends in voting patterns, demographic and socio-economic variables, and to prove or disprove researchers’ hypotheses. Based on feedback from spatio-social scientists, we are actively incorporating more datasets into the system (e.g., housing, crime, transport) and integrating a wider range of statistical fit algorithms that can test more complex hypotheses.
ACKNOWLEDGMENT

Development of the SISS-eRF was supported by the ARC Research Network in Spatially Integrated Social Science and the Australian National Data Service (ANDS) through the National Collaborative Research Infrastructure Strategy Program and the Education Investment Fund (EIF) Super Science Initiative.

REFERENCES

[1] Australian Bureau of Statistics (ABS), “2006 Census DataPacks,” from the 2006 Census of Population and Housing, Canberra ACT, 2007.
[2] R. G. Cromley, “A comparison of optimal classification strategies for choroplethic displays of spatially aggregated data,” International Journal of Geographical Information Systems, 10(4):405-424, 1996.
[3] S. Doyle, M. Dodge, and A. Smith, “The potential of Web-based mapping and virtual reality technologies for modelling urban environments,” Computers, Environment and Urban Systems, 22(2):137-155, 1998.
[4] W. D. Fisher, “On grouping for maximum homogeneity,” Journal of the American Statistical Association, 53:789-798, 1958.
[5] G. F. Jenks, “Generalization in statistical mapping,” Annals of the Association of American Geographers, 53(1):15-26, 1963.
[6] S. D. Kirkby and S. E. P. Pollitt, “Distributing spatial information to geographically disparate users: a case study of ecotourism and environmental management,” Australian Geographical Studies, 36(3):262-272, 1998.
[7] E. Liao, T-K. Shyy, and R. J. Stimson, “Developing a web-based e-research facility for socio-spatial analysis to investigate relationships between voting patterns and local population characteristics,” Journal of Spatial Science, 54(2):63-88, 2009.
[8] A. T. Murray and T-K. Shyy, “Integrating Attribute and Space Characteristics in Choropleth Display and Spatial Data Mining,” International Journal of Geographical Information Science, 14(7):649-667, 2000.
[9] S. Openshaw, I. Turton, J. Macgill, and J. Davy, “Putting the geographical analysis machine on the Internet,” in Innovations in GIS, vol. VI, B. Gittings, Ed. London: Taylor & Francis, 1999, pp. 121-131.
[10] Z-R. Peng, “An assessment framework for the development of Internet GIS,” Environment and Planning B: Planning and Design, 26(1):117-132, 1999.
[11] Z-R. Peng and M-H. Tsou, Internet GIS: distributed geographic information services for the Internet and wireless networks. New Jersey: John Wiley & Sons, 2003.
[12] T-K. Shyy, R. Stimson, and A. T. Murray, “An Internet GIS and spatial model to benchmark local government socioeconomic performance,” Australasian Journal of Regional Studies, 9(1):31-47, 2003.
[13] T-K. Shyy, R. Stimson, J. Western, A. T. Murray, and L. Mazerolle, “Web GIS for mapping community crime rates: approaches and challenges,” in Geographic Information Systems and Crime Analysis, F. Wang, Ed. Hershey: Idea Group Publishing, 2005, pp. 236-252.
[14] T-K. Shyy, R. Stimson, and P. Chhetri, “Web-based GIS for mapping voting patterns at the 2004 Australian federal election,” Applied GIS, 3(11):1-20, 2007.
[15] T-K. Shyy, R. Stimson, P. Chhetri, and J. Western, “Mapping quality of life in the south east Queensland region with a Web-based application,” Journal of Spatial Science, 52(2):13-22, 2007.
[16] R. J. Stimson, P. Chhetri, and T-K. Shyy, “Typology of local patterns of voter support for political parties at the 2004 federal election,” People and Place, 15(1):1-12, 2007.
[17] S. Taylor, Identifying Greens Voters with GIS, Honours thesis, Melbourne: University of Melbourne, 2003.
[18] ESRI ArcGIS [Online]. Available: http://www.esri.com/software/arcgis
[19] BioMedware SpaceStat [Online]. Available: http://www.biomedware.com/?module=Page&sID=spacestat
[20] XStream Java library [Online]. Available: http://xstream.codehaus.org
[21] R Development Core Team (2011). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0
[22] S. Urbanek, “Rserve: A Fast Way to Provide R Functionality to Applications,” in K. Hornik, F. Leisch, and A. Zeileis (eds.), Proceedings of the 3rd International Workshop on Distributed Statistical Computing, Vienna, Austria, 2003. ISSN 1609-395X
[23] REngine Java library [Online]. Available: http://www.rforge.net/org/docs/org/rosuda/REngine/REngine.html
[24] R Project social sciences web task view [Online]. Available: http://cran.r-project.org/web/views/SocialSciences.html
[25] Processing.js: A port of the Processing Visualization Language [Online]. Available: http://www.processingjs.org
[26] Processing: A visualisation language [Online]. Available: http://www.processing.org
[27] iText PDF library [Online]. Available: http://itextpdf.com
[28] OpenGeo Suite [Online]. Available: http://opengeo.org/
[29] SeleniumHQ Web application testing system [Online]. Available: http://seleniumhq.org