Statistical Analysis and Visualization Services for 
Spatially Integrated Social Science Datasets 
Irfan Azeezullah, Friska Pambudi, Tung-Kai Shyy, 
Imran Azeezullah, Nigel Ward, Jane Hunter 
eResearch Lab, School of ITEE 
The University of Queensland  
Brisbane, Australia 
{s.azeezullah1, f.pambudi, t.shyy, s.azeezullah, n.ward4, 
j.hunter}@uq.edu.au 
Robert J. Stimson 
Director of AURIN  
(Australian Urban Research Infrastructure Network) 
The University of Melbourne 
Melbourne, Australia 
rstimson@unimelb.edu.au
 
 
Abstract— The field of Spatially Integrated Social Science 
(SISS) recognizes that much data of interest to social scientists 
has an associated geographic location. SISS systems use 
geographic location as the basis for integrating heterogeneous 
social science data sets and for visualizing and analyzing the 
integrated results through mapping interfaces. However, 
sourcing data sets, aggregating data captured at different spatial 
scales, and implementing statistical analysis techniques over the 
data are highly complex and challenging steps, beyond the 
capabilities of many social scientists. The aim of the UQ SISS 
eResearch Facility (SISS-eRF) is to remove this burden from 
social scientists by providing a Web interface that allows 
researchers to quickly access relevant Australian socio-spatial 
datasets (e.g. census data, voting data), aggregate them spatially, 
conduct statistical modeling on the datasets and visualize spatial 
distribution patterns and statistical results. This paper describes 
the technical architecture and components of SISS-eRF and 
discusses the reasons that underpin the technological choices. It 
describes some case studies that demonstrate how SISS-eRF is 
being applied to prove hypotheses that relate particular voting 
patterns with socio-economic parameters (e.g., gender, age, 
housing, income, education, employment, religion/culture). 
Finally, we outline our future plans for extending and deploying 
SISS-eRF across the Australian Social Science Community. 
Keywords— spatial social science; data integration; statistical 
analysis; geospatial information systems. 
I. INTRODUCTION 
Social scientists have long recognized that adopting a 
spatial approach to understanding and analysing social science 
data is important in many fields including demographic 
research, population science, understanding socio-economic 
inequalities across regions, population health and urban 
planning. The use of Geographic Information Systems (GIS) 
and spatial statistics in the social sciences enables researchers to 
identify geographical patterns and processes that are critical for 
informing decisions made by resource managers, infrastructure 
planners and policy makers, particularly in government 
agencies. 
Many tools that support a spatial approach to social science 
have emerged over the past decade (e.g., Esri ArcGIS 
geospatial visualisation and analysis tools [18], SpaceStat 
statistical packages with a spatial functionality [19]) but these 
tend to be either specialized and sophisticated with a steep 
learning curve or commercial off-the-shelf products that are not 
user friendly, problem-specific, or flexible enough to be 
tailored to the demands of researchers. Spatially integrated 
social science requires a new kind of tool, capable of powerful 
analysis but comparatively accessible, extensible, flexible and 
user friendly. 
In this paper, we describe the UQ SISS eResearch Facility 
(SISS-eRF)¹, a Web-based extensible framework that enables 
social scientists with little or no prior programming or statistical 
analysis skills to easily access social science data sets of 
interest to them, aggregate them at different geographic scales, 
analyse and model them statistically and generate 
visualizations and graphics that provide empirical evidence of 
spatial social science patterns or trends. 
II. OBJECTIVES 
There is already a wide variety of Web-based GIS 
applications in the fields of social sciences and planning. These 
range from applications that enable the analysis of economic 
development, ecotourism, crime, voting patterns, service 
demands, and urban quality of life to spatial decision support 
systems to underpin decision-making in local and regional 
planning [3, 6, 9, 10, 11, 12, 13, 14, 15]. 
However, exploring and identifying spatial patterns and 
relationships between socio-economic parameters and other 
indicators (such as health, behaviour, crime, quality of life etc.) 
is not easy for social scientists [7], given the large volumes of 
data involved, the need to understand and encode relationships 
between data and geography, and the need to implement 
appropriate statistical analysis techniques. 
For this reason, many of these existing projects have 
focused on the use of Web-based GIS for accessing and 
integrating spatial social science data on-line. Some projects 
have also augmented Web-based GIS capabilities with simple 
statistical modeling tools. But as far as we are aware, there are 
no examples of Web-based spatial social science applications 
that provide researchers with the full range of tools needed to 
prove hypotheses about particular spatial patterns in social 
science data. Additionally, although some of the previously 
cited systems are available online, most use proprietary 
technology, making them difficult to adapt or repurpose. 
¹ http://www.esocialscience.org/ 
The objective of the work described here is to provide a 
Web-based system that combines geospatial interfaces and 
statistical methods and analytics to enable social scientists to 
easily access, correlate, visualize and explore spatially 
integrated social science data. In particular, we want to build a 
viable system that includes a critical mass of geospatial and 
statistical analysis tools, visualisation tools and useful data to 
prove hypotheses about spatial patterns. Moreover, our aim is to 
build an extensible framework that can easily incorporate new 
statistical modeling tools as they become available and that is 
built on open source technologies. 
III. CASE STUDY 
A highly topical research area in spatial social science is 
exploring relationships between voting patterns and 
demographic and socio-economic characteristics of polling 
booth catchments and electorates across a region. 
Understanding such patterns is extremely valuable information 
for social scientists as well as for political parties. For example, 
Taylor [17] showed through aggregation of 2001 census and 
polling booth data that there is a strong correlation between 
voter support for the Australian Greens Party and voters’ 
tertiary education and secularity.  
Stimson, Chhetri, and Shyy [16] applied discriminant 
analysis over 2001 census and voting data to determine: 
• that voter support for the Labor Party leaned toward asset-poor, 
multicultural areas; and 
• that the National Party and the One Nation Party tended to 
compete for votes in areas characterised as asset-rich, 
monocultural, low-income and low-education. 
These patterns were then used to accurately identify heartlands 
of voter support for both the Coalition parties (Liberal Party, 
National Party and Country Liberal Party) and the Labor Party 
in the 2004 federal election. Zones in transition between those 
Coalition and Labor voter heartlands (marginal voting areas) 
were also identified. 
Our goal is to evaluate SISS-eRF initially, by collaborating 
with social scientists who are analysing relationships between 
voting patterns and socio-economic factors using the 2010 
Australian Federal election data and 2006 Australian Bureau of 
Statistics Census data. 
IV. SYSTEM ARCHITECTURE 
Figure 1 illustrates the overall system architecture. It is built 
on an open source software stack that comprises four key 
components: 
• Backend data storage (PostgreSQL + PostGIS); 
• Statistical analysis & visualisation services (Java + R²); 
• Geospatial selection & visualisation services (GeoServer 
plus OpenLayers and GeoExt); 
• User interface (jQuery plus metadata services). 
                                                          
² http://www.r-project.org/ 
A. Backend Data and Metadata Storage 
Data variables and geographies are stored in a PostgreSQL 
database, extended with PostGIS support for storing and indexing 
geographic objects. PostGIS follows the Simple Features for SQL 
specification from the Open Geospatial Consortium (OGC), and forms the 
basis for many open-source GIS projects and communities, 
including the OpenStreetMap project. 
 
Figure 1: Overall System Architecture of the Spatially Integrated Social 
Science eResearch Facility (SISS-eRF) 
Currently there are two methods to access the large data sets 
stored in the database: a direct JDBC query and a mediated 
data access mechanism using the Java Hibernate interface. To 
enable distributed data access and computation we keep the 
compute modules separate from the data and specify interface 
APIs for access. 
Raw data in CSV format is manipulated to create new “derived” 
variables that answer common socio-spatial science questions 
(e.g., the number of Generation Y voters). The CSV files were manually 
manipulated in Microsoft Excel to: 
• Select and extract just the data of interest to our target 
researchers from large datasets; 
• Create percentage figures comparing the value of a 
variable against the total population in a region (the raw 
data usually only contained absolute counts for a region); 
• Create location quotients comparing each variable's local 
value against the national benchmark for that variable. 
These converted files were then combined with ESRI shapefiles 
defining geographic regions. The ESRI shapefiles were 
then transformed into SQL format using a tool named 
“shp2pgsql” and ingested into the database using “psql”. 
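The two derived measures listed above can be sketched as follows. This is a minimal illustration, not the spreadsheet procedure the authors used: the function names and the sample counts are hypothetical, and the location quotient is assumed to take its usual form of local share divided by national share.

```python
def percentage(count, total):
    """Share of a region's total population, expressed as a percentage."""
    if total <= 0:
        # A region with no population has no meaningful percentage.
        raise ValueError("total population must be positive")
    return 100.0 * count / total


def location_quotient(local_count, local_total, national_count, national_total):
    """Local share divided by national share (assumed definition).

    A value above 1.0 means the group is over-represented locally
    relative to the national benchmark; below 1.0, under-represented.
    """
    local_share = local_count / local_total
    national_share = national_count / national_total
    return local_share / national_share


# Hypothetical counts, for illustration only.
pct = percentage(300, 1200)
lq = location_quotient(300, 1200, 2_000_000, 20_000_000)
print(f"percentage = {pct:.1f}, location quotient = {lq:.2f}")
```

Storing both measures lets a researcher compare regions of very different size, which absolute counts alone do not allow.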
In order to support selection and analysis of variables 
within the user interface, the storage layer exposes services that 
list: 
• available geographical locations, their type 
(division/metro/state) and co-ordinates (latitude, longitude); 
• available electoral and socio-economic variables, their 
type (to enable a decision about which statistical 
algorithms can be applied to the variable), and display 
names. 
B. Statistical Analysis Services 
Statistical and classification computation capabilities are 
implemented as RESTful services in Java. Statistical results and 
associated metadata are exposed by the Java services as XML 
using the XStream Java library [20] to enable easy integration 
into other services. Initially we implemented three data 
classification algorithms in Java to test the performance: 
• Equal interval, 
• Quantile, and 
• Natural breaks. 
Based on this experience, we decided to implement the 
following data classification algorithms in R [21] and call the R 
routines from Java: 
• Equal Interval 
• Quantile 
• Jenks Breaks 
• Fisher 
• Standard Deviation 
• Hierarchical clustering 
• Block clustering 
• K-means clustering 
In addition to the data classification algorithms, we have 
integrated R implementations of Regression and Regression 
Line Fitting and are currently working on the following: 
• Generalized linear models; 
• Multinomial logistic regression; 
• Proportional-odds logistic regression; 
• ANOVA (ANalysis Of Variance). 
Java servlets interface to the R routines using Rserve [22] 
to translate Java objects to R objects and vice-versa. The Java 
servlets pass both data and commands to the R algorithms. The 
commands specify the type of the algorithm requested, the 
specific type of implementation required, and any additional 
options needed for the algorithm to perform the computation. 
The results from R are provided as an R-object with embedded 
list responses. These embedded list responses are parsed by the 
REngine library [23] and the required data and metadata are 
extracted into POJOs (Plain Old Java Objects), for 
interoperability with other Java modules in a workflow 
environment. 
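The command structure described above (algorithm type, implementation type, extra options) can be pictured with a small sketch. Everything here is an assumption made for illustration: the `AnalysisCommand` class and its field names are hypothetical, the sketch is in Python rather than the Java servlets the system uses, and it assumes the R side calls `classIntervals` from the classInt package, which the text does not state.

```python
from dataclasses import dataclass, field


def _r_literal(value):
    """Render a Python value as an R literal (booleans, numbers, strings)."""
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


@dataclass
class AnalysisCommand:
    """Hypothetical command: which algorithm, which variant, which options."""
    algorithm: str                      # e.g. "classification"
    implementation: str                 # e.g. "jenks", "quantile", "equal"
    options: dict = field(default_factory=dict)

    def to_r_expression(self, data_symbol="x"):
        """Build the R call as text; only classification is sketched here."""
        if self.algorithm != "classification":
            raise ValueError(f"unsupported algorithm: {self.algorithm}")
        args = [data_symbol]
        args += [f"{k} = {_r_literal(v)}" for k, v in sorted(self.options.items())]
        args.append(f"style = {_r_literal(self.implementation)}")
        return f"classInt::classIntervals({', '.join(args)})"


cmd = AnalysisCommand("classification", "jenks", {"n": 5})
expr = cmd.to_r_expression()
print(expr)
```

Keeping the command as structured data rather than a raw R string makes it easier to validate options before anything is sent to R.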
This approach enables the Java routines to take advantage 
of sophisticated routines implemented in R. One of the key 
reasons for implementing the algorithms in R is to take 
advantage of the existing cache of validated and trusted 
algorithms, rather than implementing such algorithms from 
scratch in pure Java. The current R-project distribution 
provides 70 dedicated social science ‘task-views’ packages [24]  
providing statistical analysis algorithms and more than 1500 
packages targeting different domains.  
Another advantage of using R is that many of the 
algorithms are implemented in C/C++ and optimized for 
analyzing large scale datasets, resulting in performance that 
is 2-10 times faster than that of the classification algorithms we 
initially implemented in Java. 
All statistical results including the geo-spatial data 
classification are exported as XML. These XML results are 
then transformed into the following formats for display via the 
frontend Web interface: 
• images (PNG/JPEG) using Processing [26], 
• interactive charts using Processing.js [25], and 
• PDF representations using the iText Library [27]. 
C. Delivering Maps and Features 
The GIS components in the SISS-eRF system, including 
some of the back-end systems and services, rely heavily on the 
widely adopted OpenGeo stack framework and architecture 
[28]. This architecture underpins many open-source GIS 
projects and communities, including OpenStreetMap.  
The key components of the OpenGeo stack and architecture 
that are used in the SISS-eRF system are: 
• Storage: PostGIS spatial database 
• Application server: GeoServer map/feature server 
• User interface map component: OpenLayers 
• User interface framework: GeoExt 
GeoServer is an open-source map/feature/transactional server, 
written in Java, for serving GIS data. It functions as a 
reference implementation of Open Geospatial Consortium 
specifications. Each of the spatial databases in SISS-eRF is 
connected through a separate GeoServer namespace or 
workspace. GeoServer can automatically expose all 
the tables from a particular database as separate 
layers within a namespace. In the SISS-eRF system, we expose 
geospatial and social science data using the Web Map Service 
(WMS) and Web Feature Service (WFS). 
We use the OpenLayers JavaScript API to display maps 
and layers in the browser that have been served as PNG images 
from the GeoServer's Web Map Service (WMS). We also use 
OpenLayers to convert the output from our classification 
services into maps by dynamically generating Styled Layer 
Descriptors (SLD) for the classification and passing these in 
requests to the GeoServer WMS. 
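To make the SLD step concrete, the sketch below builds an SLD 1.0.0 document with one rule per class from a list of class breaks. It is an illustration only: in SISS-eRF this is done in the browser with OpenLayers, whereas this sketch uses Python, and the function name, layer name and colours are hypothetical. How the resulting document is attached to the WMS request is not stated in the text.

```python
import xml.etree.ElementTree as ET

SLD_NS = "http://www.opengis.net/sld"
OGC_NS = "http://www.opengis.net/ogc"


def build_sld(layer_name, attribute, breaks, colors):
    """Return an SLD 1.0.0 string with one polygon rule per class.

    breaks holds k+1 ascending boundaries for k classes; colors holds k
    fill colours. PropertyIsBetween is inclusive at both ends, so a value
    lying exactly on an interior break matches two adjacent rules.
    """
    if len(colors) != len(breaks) - 1:
        raise ValueError("need exactly one colour per class")
    ET.register_namespace("sld", SLD_NS)
    ET.register_namespace("ogc", OGC_NS)

    def sld(tag):
        return f"{{{SLD_NS}}}{tag}"

    def ogc(tag):
        return f"{{{OGC_NS}}}{tag}"

    root = ET.Element(sld("StyledLayerDescriptor"), {"version": "1.0.0"})
    named = ET.SubElement(root, sld("NamedLayer"))
    ET.SubElement(named, sld("Name")).text = layer_name
    style = ET.SubElement(named, sld("UserStyle"))
    fts = ET.SubElement(style, sld("FeatureTypeStyle"))

    for lower, upper, color in zip(breaks, breaks[1:], colors):
        rule = ET.SubElement(fts, sld("Rule"))
        flt = ET.SubElement(rule, ogc("Filter"))
        between = ET.SubElement(flt, ogc("PropertyIsBetween"))
        ET.SubElement(between, ogc("PropertyName")).text = attribute
        lo = ET.SubElement(between, ogc("LowerBoundary"))
        ET.SubElement(lo, ogc("Literal")).text = str(lower)
        hi = ET.SubElement(between, ogc("UpperBoundary"))
        ET.SubElement(hi, ogc("Literal")).text = str(upper)
        symbolizer = ET.SubElement(rule, sld("PolygonSymbolizer"))
        fill = ET.SubElement(symbolizer, sld("Fill"))
        ET.SubElement(fill, sld("CssParameter"), {"name": "fill"}).text = color

    return ET.tostring(root, encoding="unicode")


# Hypothetical layer name and colours; ANG is the Anglican variable code.
sld_xml = build_sld("catchments", "ANG", [0, 10, 20, 40],
                    ["#fee8c8", "#fdbb84", "#e34a33"])
print(sld_xml[:80])
```

Generating the style per request keeps the classification result out of the server configuration, so the same layer can be re-themed for any variable or class count.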
GeoExt provides customisable mapping widgets, 
applications and data handling support. GeoExt is mainly 
used for interfacing the map with GeoServer's MapFish printing 
module. It also provides a panel that shows the map legends and 
a slider for zooming the map. 
D. User Interface Framework 
We use the jQuery UI JavaScript library to deliver 
interactive widgets such as tabs and accordions in our user 
interface. 
Additionally, JavaScript in the user interface interacts 
with the metadata service mentioned previously to dynamically 
create the user interface components based on the available 
variables and geographies. 
Processing.js is used to dynamically generate static and 
interactive charts and graphs in the user’s web browser. If the 
system detects that the user’s browser does not support the HTML5 
canvas, then a server-side Processing library produces a 
JPEG representation of the result and serves that to the web 
browser instead. 
V. USER INTERFACE AND FUNCTIONALITY 
Consider again our case study from Section III. A social 
scientist is interested in identifying those demographic and 
socio-economic factors that are associated with people from 
the State of Victoria who vote for the Coalition (Liberal and 
National) parties. 
For this test case we use the following data: primary votes cast 
for Coalition candidates standing for the House of 
Representatives at the 2010 Australian federal election at the 
1719 polling booths in Victoria, Australia. 
Derived data from the 2006 census [1] provides 48 variables 
that represent the demographic and socio-economic 
characteristics of the population living within those polling 
booth catchments (see Table I). 
Multiple regression modeling is applied to gain an indication of 
which demographic and socio-economic factors are significant. 
Step-wise regression analysis identified 28 variables 
that are statistically significant at the 95% confidence level 
(see Table II), with an adjusted R² = 0.774. Thus 
those variables account for 77.4 percent of the variation across 
the 1719 polling booths in Victoria in the primary vote for 
the Coalition. 
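For readers unfamiliar with the adjusted R² statistic, the sketch below shows the standard adjustment for the number of predictors. The sample size (1719 booths) and predictor count (28 variables) are taken from the text; the unadjusted R² of 0.80 is a made-up input used only to show the size of the penalty and is not a result from this study.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Standard adjusted R-squared: penalises R-squared for model size."""
    if n_obs - n_predictors - 1 <= 0:
        raise ValueError("need more observations than predictors + 1")
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# n_obs and n_predictors come from the case study; r2 is hypothetical.
example = adjusted_r2(r2=0.80, n_obs=1719, n_predictors=28)
print(f"adjusted R-squared = {example:.4f}")
```

With this many observations per predictor the penalty is small, so the adjusted and unadjusted figures stay close together.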
TABLE I. VARIABLES DERIVED FROM THE 2006 AUSTRALIAN CENSUS 
REPRESENTING THE DEMOGRAPHIC AND SOCIO-ECONOMIC CHARACTERISTICS 
OF POPULATIONS LIVING IN POLLING BOOTH CATCHMENTS 
Age and sex 
    % population males (MALES) 
    % population age 0-17 years, children and youth (YOUTH) 
    % population age 18-22 years, first voters (FIRST) 
    % population age 23-34 years (GEN Y) 
    % population age 35-44 years (GEN X) 
    % population age 45-59 years, boomers (BOOMERS) 
    % population age 60-74 years (Post Depression Wartime Generation) (WW2GEM) 
    % population age 75+ years (Pre Depression Generation) (DEPGEN) 
Family and household structure 
    % single person households (SINGLES) 
    % couple without children households (COUPLES) 
    % one parent family households (ONEPARENT) 
    % couples with children households (COUPCHILD) 
Housing tenure 
    % households that are home owners (HOMEOWN) 
    % households that are home purchasers (MORTGAGEES) 
    % households that are private renters (RENTERS) 
    % households that are public housing tenants (PUBHOUS) 
Ethnicity/race 
    % indigenous persons (INDIG) 
    % born overseas (IMMIG) 
    % born in UK (UK) 
    % born in Southern and Eastern Europe (SEEUROPE) 
    % born in Middle East (MIDEAST) 
    % born in Asia (ASIA) 
Religious affiliation 
    % Catholic (CATH) 
    % Anglican (ANG) 
    % Pentecostal (PENT) 
    % other Christian (OTHCHRIST) 
    % Islamic (ISLAM) 
    % other non-Christian religion (ONCHREL) 
    % with no religion (NORELIG) 
Residential stability/Mobility 
    % of population not at the same address 5 years ago (MOBILE) 
Digital divide 
    % dwellings (not population) using Internet (INTERNET) 
Engagement in work 
    Labour force participation rate (INWORK) 
    Unemployment rate (UNEMPLOY) 
Industry of work 
    % employed in Extractive Industries (EXTRACT) 
    % employed in Transformative Industries (TRANSFORM) 
    % employed in Distributive Services (DISTRIB) 
    % employed in Producer/Business Services (BUSSERV) 
    % employed in Social Services (SOCSERV) 
    % employed in Administrative & support services (ADSS) 
    % employed in Personal Services (PERSERV) 
Occupation* (Robert Reich’s categories) 
    % employed as routine production workers (ROUTPROD) 
    % employed as in-person service workers (INPERS) 
    % employed as symbolic analyst (SYMBA) 
Human capital 
    % persons age 15 years and over with a degree or higher qualification (DEGREE) 
    % persons age 15 and over with a certificate, diploma or advanced diploma (CERTDIP) 
Income# 
    Low income category: % households in the lowest quintile for household weekly income (less than $650) (LOWINC) 
    Middle income category: % households in the middle three quintiles for household weekly income ($650-$1,999) (MIDINC) 
    High income category: % households in the highest quintile for household weekly income ($2,000+) (HIGHINC) 
This analysis indicates that polling booth catchments which 
have a positive relationship to voting for the Coalition in the 
2010 federal election tend to have populations characterized by: 
employment in extractive, distributive, business services, or 
administrative industries; Anglican, Pentecostal or other 
Christian religious affiliation; high-income households; 
indigenous persons or persons born in Asia; private renting; 
labour force participation; a higher proportion of males; and 
having moved house in the last 5 years. 
Polling booth catchments which have a negative 
relationship to voting for the Coalition tend to have populations 
characterized by a greater incidence of Generation Y, 
Generation X, Baby Boomers and youth; persons aged 15 
years and over with a degree; routine production workers and 
those employed in social services or personal services; unemployed 
workers; one-parent households; those born in the UK or 
Southern and Eastern Europe; Catholics; and first-time voters. 
TABLE II. RESULTS OF A STEP-WISE REGRESSION MODEL INVESTIGATING 
THE RELATIONSHIP BETWEEN THE COALITION PRIMARY VOTE AND THE 
CHARACTERISTICS OF POPULATIONS LIVING IN POLLING BOOTH CATCHMENTS 
30th model solution (Adjusted R² = 0.774) 
Columns: polling booth catchment demographic and socio-economic 
variable; standardized Beta coefficient; t; significance. 
Variable       Beta       t          Significance 
(Constant)     67.979     7.291      .000 
EXTRACT        .205       4.114      .000 
ANG            .239       10.904     .000 
UNEMPLOY       -.141      -8.206     .000 
OTHCHRIST      .051       2.890      .004 
ONEPARENT      -.128      -6.585     .000 
GEN X          -.161      -8.992     .000 
HIGHINC        .157       5.090      .000 
INDIG          .089       6.707      .000 
GEN Y          -.416      -13.126    .000 
BOOMERS        -.118      -5.013     .000 
PENT           .056       4.433      .000 
RENTERS        .130       5.626      .000 
DISTRIB        .057       2.376      .018 
SEEUROPE       -.065      -4.746     .000 
DEGREE         -.325      -8.569     .000 
ROUTPROD       -.236      -6.591     .000 
ASIA           .101       5.062      .000 
INWORK         .153       6.797      .000 
BUSSERV        .089       2.195      .028 
UK             -.109      -6.370     .000 
CATH           -.093      -5.574     .000 
FIRST          -.064      -3.948     .000 
MALES          .060       3.604      .000 
ADSS           .031       1.983      .048 
PERSERV        -.059      -3.565     .000 
SOCSERV        -.063      -2.737     .006 
MOBILE         .056       2.990      .003 
YOUTH          -.053      -2.433     .015 
A. Statistical and geospatial visualisation 
Based on the trends uncovered by the multiple regression 
analysis tools available within SISS-eRF, we can visualize the 
relationship between specific variables, for example between 
the percentage of votes for Coalition candidates and: 
• the percentage of voters who are Anglican; 
• the percentage of voters who are Generation Y (aged 23-34 years). 
 
Figures 2 & 3 below illustrate statistical visualizations of 
regression line fitting of these variables. There is a positive 
correlation between people who are Anglican and people who 
vote for the Coalition. There is a negative correlation between 
Generation Y voters and people who vote for the Coalition. 
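Regression line fitting of this kind reduces to an ordinary least-squares fit of one variable on another. The sketch below is a minimal pure-Python version, not the R routine used by SISS-eRF. The two small data sets are synthetic and chosen only to show a positive and a negative slope; they are not values from the census or election data.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line y = intercept + slope * x, plus Pearson r."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally sized samples of length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r


# Synthetic percentages per catchment, for illustration only.
anglican_pct = [5, 10, 15, 20, 25]
coalition_a = [30, 38, 41, 50, 56]
gen_y_pct = [10, 15, 20, 25, 30]
coalition_b = [55, 50, 42, 37, 30]

slope_ang, intercept_ang, r_ang = fit_line(anglican_pct, coalition_a)
slope_gy, intercept_gy, r_gy = fit_line(gen_y_pct, coalition_b)
print(f"Anglican: slope={slope_ang:.2f}, r={r_ang:.2f}")
print(f"Gen Y:    slope={slope_gy:.2f}, r={r_gy:.2f}")
```

The sign of the slope (and of r) is what the fitted lines convey visually: upward for a positive association, downward for a negative one.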
In addition, users are also able to visualize these 
relationships spatially by displaying the layers, overlaid and 
color coded in the mapping interface (juxtaposed alongside the 
statistical graphs). 
Figures 4 & 5 below show how the relationships can be 
visualized on maps by color-coding the centroid of polling 
booth locations to represent the percentage of Coalition vote, 
and color-coding the polling booth catchment regions to 
represent the variable range (e.g., percentage of Anglicans). 
 
 
Figure 2. Regression line fitting of percentage primary vote for the Coalition 
parties versus percentage of voters who are Anglican 
 
Figure 3. Regression line fitting of percentage primary vote for the Coalition 
parties versus percentage of voters who are Generation Y 
VI. USER INTERFACE 
The SISS-eRF user interface is shown in the bottom part of 
Figure 1 as well as Figures 2-5. It has three main components, 
exposed as separate tabs. 
A. Map Selection and Data Classification 
In this tab (shown in Figures 4 and 5), there are two main 
components: the map controls (on the left-hand side) and the map 
display panel (on the right-hand side). The map controls interface 
supports Area and Layers selection and Data Classification selection. 
In the Area and Layers section, users may choose a particular 
region to display on the map (State or Electorate) and toggle 
the map's base and overlay layers. For example, the user can 
choose to display different levels of geography, such as electoral 
boundaries, polling booth catchments, Local Government 
Areas (LGAs) or suburb boundaries. 
 
Figure 4. Percent primary votes for Coalition (color-coded polling booth 
locations) overlaid on percent of Anglicans (color-coded polling booth 
catchment regions) 
 
Figure 5. Percent primary votes for Coalition (color-coded polling booth 
locations) overlaid on percent of GenerationYs (color-coded polling booth 
catchment regions). 
One of the major functions supported by SISS-eRF is the 
generation of thematic map displays of variables in either 
regions (polygons) or points. When users select the Data 
Classification interface, they are presented with a drop-down 
menu which provides the following choices for thematic 
geographic display of the data: 
• Equal interval, which classifies the features into equally 
divided ranges of attribute values; 
• Quantile classification, in which each class contains 
approximately the same number of features; 
• The natural breaks approach, which is a median-based 
natural breaks classification that optimises attribute 
similarity. This method is used in Figures 4 and 5. 
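The difference between the first two schemes is easiest to see on skewed data. The sketch below computes class breaks for equal interval and quantile classification in plain Python. It is an illustration rather than the R code used by the system, the function names are hypothetical, and the quantile breaks use linear interpolation between order statistics, which is one of several common conventions.

```python
def equal_interval_breaks(values, k):
    """k classes of equal width between the minimum and maximum."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(k)] + [hi]


def quantile_breaks(values, k):
    """k classes holding roughly equal numbers of features."""
    ordered = sorted(values)
    n = len(ordered)
    breaks = []
    for i in range(k + 1):
        pos = i * (n - 1) / k          # fractional index of the i-th break
        lower = int(pos)
        upper = min(lower + 1, n - 1)
        frac = pos - lower
        breaks.append(ordered[lower] + frac * (ordered[upper] - ordered[lower]))
    return breaks


def assign_class(value, breaks):
    """Class index for value; the last class includes the upper bound."""
    for i in range(len(breaks) - 2):
        if value < breaks[i + 1]:
            return i
    return len(breaks) - 2


def class_counts(values, breaks):
    counts = [0] * (len(breaks) - 1)
    for v in values:
        counts[assign_class(v, breaks)] += 1
    return counts


# Skewed sample: one outlier dominates the range.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
ei = equal_interval_breaks(sample, 3)
qu = quantile_breaks(sample, 3)
print("equal interval:", ei, class_counts(sample, ei))
print("quantile:      ", qu, class_counts(sample, qu))
```

On this sample, equal interval places nine of the ten values in the first class, while quantile spreads them almost evenly, which is why the choice of scheme changes how a thematic map reads.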
Two statistical measures for comparing the performance of the 
equal interval, quantile and natural breaks approaches are also 
presented: 
• total within group variance (TWGV) (referred to as 
group variation in [2]), which is associated with the grouping 
model of Fisher [4] and the Jenks [5] optimisation; 
• total within group difference (TWGD) 
(referred to as absolute deviation in [2]), which is the 
measure structured in the median clustering objective [8]. 
Users are able to choose between TWGD, TWGV or no 
statistical comparison of the performance of the classification 
approaches. 
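A minimal sketch of the two measures follows. It assumes the usual readings of these terms, which the text does not spell out: TWGV as the sum over classes of squared deviations from the class mean, and TWGD as the sum over classes of absolute deviations from the class median. The function names and sample groupings are hypothetical.

```python
from statistics import mean, median


def twgv(groups):
    """Total within group variance: squared deviations from each class mean."""
    total = 0.0
    for group in groups:
        if group:
            centre = mean(group)
            total += sum((x - centre) ** 2 for x in group)
    return total


def twgd(groups):
    """Total within group difference: absolute deviations from each class median."""
    total = 0.0
    for group in groups:
        if group:
            centre = median(group)
            total += sum(abs(x - centre) for x in group)
    return total


# Two candidate groupings of the same six values.
tight = [[1, 2, 3], [10, 11, 12]]
loose = [[1, 2], [3, 10, 11, 12]]
print("tight:", twgv(tight), twgd(tight))
print("loose:", twgv(loose), twgd(loose))
```

Lower values indicate more homogeneous classes, so either measure can rank competing classifications of the same variable.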
B. Statistical Analysis 
This tab enables users to conduct statistical analysis. It 
provides buttons enabling users to choose between the different 
statistical analysis algorithms supported by the system: 
• Regression 
• Generalized linear models 
• Multinomial logistic regression 
• Proportional-odds logistic regression 
• ANOVA (ANalysis Of Variance) 
After one of the buttons is clicked, users are presented with a 
form to choose those variables that they wish to investigate 
further and a “Run Statistical Analysis” button to actually run 
the analysis. The output result is shown on the “Results and 
Offline Download” tab. 
The Statistical Analysis tab includes Metadata Information 
that describes each of the variables that the system is hosting. 
C. Results and Offline Download 
This tab (on the bottom right-hand side of Figure 1) displays 
the results of the statistical analysis to the user. It can display 
static charts, line or bar graphs, interactive charts, as well as 
text-based statistical descriptions of the results (see Figures 2 and 3). 
Users can also choose to create and download PDF versions 
of the visualisations or CSV files of the results for offline use. 
VII. EVALUATION 
A. Testing Framework 
To test the system from a user’s perspective, we integrated 
Selenium [29], a testing framework for Web applications, into a 
Web browser interface. This enabled us to validate the frontend, 
to parse the results from the backend and to verify the display 
and visualisation of data, input options and results. For 
example, we added test suites to check that all variables in our 
metadata catalog were exposed in the user interface (some 
geographies have over 300 statistical variables and 50 location 
quotients - making it easy to miss variables without automated 
testing). 
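The core of such a completeness check is a set comparison between the catalog and what the interface exposes. The sketch below shows only that comparison, with plain lists standing in for both sides. In the real test suite the interface-side list would be collected from the rendered page through Selenium, which is not shown here, and the function name is hypothetical.

```python
def missing_from_ui(catalog_variables, ui_variables):
    """Variables present in the metadata catalog but not shown in the UI."""
    return sorted(set(catalog_variables) - set(ui_variables))


# Variable codes from Table I; the UI list deliberately omits one.
catalog = ["ANG", "CATH", "GEN Y", "HIGHINC"]
exposed = ["ANG", "GEN Y", "HIGHINC"]

missing = missing_from_ui(catalog, exposed)
print("missing from UI:", missing)
```

A test would then fail whenever the returned list is non-empty, reporting exactly which variables were dropped.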
One example of a test suite exercises the regression analysis 
described above. The Web application server is tasked with 
determining the linear regression dependence of a single dependent 
variable upon a single independent variable, a subset of 
independent variables and a random selection of independent 
variables. Test results are logged to identify the failed tests 
and the number of failures. 
To enable test-driven development and to support 
continuous integration of the Web frontend, middleware and 
backend, Selenium [29], Java unit tests and R unit tests are 
used. 
B. User Feedback 
Throughout the project, we collaborated with social 
scientists from the University of Queensland School of 
Geography Planning and Urban Environment, and from the 
University of Melbourne Australian Urban Research 
Infrastructure Network (AURIN) Project, who provided ongoing 
feedback about the evolving system. Based on their feedback, 
we implemented the following extensions and refinements: 
• Included support for scatter plots to display the data 
points associated with linear regression visualisations 
(as shown in Figures 1 & 2); 
• Added the ability to generate grayscale maps that can 
be embedded in papers and presentations that do not 
support color; 
• Parallelized the multiple regression algorithm in order 
to improve performance from O(log N) to O(1); 
• Translated the classification algorithms from Java into 
R in order to improve their performance by up to 10 
times; 
• Added test suites to check that all variables in our 
metadata catalog were exposed in the user interface. 
We also prioritized a list of additional statistical algorithms to 
add to the system in order to test new hypotheses suggested by 
these researchers. 
The spatio-social scientists also uncovered some interesting 
artefacts in the underlying data: a few polling booth catchments 
had zero populations. On closer inspection, some of these 
artefacts represented polling booths in national parks, and one 
represented a polling booth located within an airport. After 
consultation with the researchers, some of these catchments 
were removed from the data, while others were merged with 
surrounding catchments. 
VIII. DISCUSSION 
A. Support for Multiple Analysis Approaches 
The SISS-eRF has proven to be a versatile toolkit, able to 
support the different analytical methods required by socio-spatial 
scientists, geographers and regional scientists. Socio-spatial 
scientists typically use the system as described above: they 
first perform statistical analyses in order to test hypotheses and 
then create statistical and geographic visualisations to support 
their hypotheses and to communicate their results. 
Geographers and regional scientists, on the other hand, often 
begin by using the mapping interface to uncover perceived 
geospatial trends in the data, and then use the statistical 
analysis tools to confirm (or disprove) those trends. Our system 
is designed to support both approaches. 
B. Benefits of the Revised Web Architecture  
This development extended and refined earlier efforts to 
establish an e-Research Facility for Socio-Spatial Analysis at 
the University of Queensland [7]. This pre-existing facility was 
built using proprietary technologies, and one of the primary 
goals of the work described here was to migrate this previous 
system over to open source technologies. As part of this 
migration we also incorporated modern Web technologies into 
the system and improved modularization: 
• The new system separates the presentation of statistical results
from their construction. In the previous system,
R routines performed the statistical analysis and created
the graphs representing the results. In the new system, R
routines produce an R object that is consumed by a Java wrapper
to produce an XML representation of the results. This
representation can then be independently interpreted to
create images, JavaScript-driven interactive charts and PDF
representations of the results.
• We moved the definition of variables outside the user
interface. JavaScript now interacts with a metadata service
to dynamically create the user interface components. This
means that new data can more easily be incorporated into
the system.
• Map image tiles are now served from GeoServer via a
Web Map Service (WMS). They are then rendered in
the Web browser using a combination of the OpenLayers and
GeoExt JavaScript libraries. This allows
easier switching between map layers in the user interface.
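The separation of result construction from presentation described in the first point above can be illustrated with a minimal sketch. It is written in Python rather than the Java wrapper used in SISS-eRF, and the element names (`result`, `coefficient`), the `RegressionResult` type and both renderer functions are hypothetical, chosen only for illustration; the actual XML schema is not given in this paper. The numbers are made up.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class RegressionResult:
    """Hypothetical stand-in for the R object returned by an analysis routine."""
    model: str
    coefficients: dict  # variable name -> estimated coefficient
    r_squared: float


def to_xml(result: RegressionResult) -> str:
    """Serialise the result once; no presentation decisions are made here."""
    root = ET.Element("result", model=result.model,
                      rSquared=f"{result.r_squared:.3f}")
    for name, value in result.coefficients.items():
        ET.SubElement(root, "coefficient", name=name).text = f"{value:.4f}"
    return ET.tostring(root, encoding="unicode")


def render_text_table(xml_text: str) -> str:
    """One consumer of the XML: a plain-text table, e.g. for a report."""
    root = ET.fromstring(xml_text)
    lines = [f"model: {root.get('model')}  R^2: {root.get('rSquared')}"]
    for coef in root.findall("coefficient"):
        lines.append(f"{coef.get('name'):<12}{float(coef.text):>10.4f}")
    return "\n".join(lines)


def render_chart_series(xml_text: str) -> list:
    """Another consumer: (label, value) pairs for an interactive chart."""
    root = ET.fromstring(xml_text)
    return [(c.get("name"), float(c.text)) for c in root.findall("coefficient")]


if __name__ == "__main__":
    # Made-up values, for illustration only.
    result = RegressionResult("linear", {"income": 0.4213, "age": -0.0871}, 0.612)
    xml_text = to_xml(result)
    print(render_text_table(xml_text))
    print(render_chart_series(xml_text))
```

Because both renderers read only the XML, a further output format could be added without touching the analysis step.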
C. Computing Challenges 
The greatest challenge in implementing the described 
system was associated with the integration of a wide range of 
existing open source applications and tools that were not 
designed for the Web-based, interactive use cases that we have 
described. The system integrated: R for statistical analysis, 
Processing.js for visualisation, iText for PDF processing, 
PostGIS and GeoServer for handling geographic spatial 
information and Java for handling the Web services, 
data and metadata. Testing and debugging the system was 
challenging because, with so many interacting components, it was 
difficult to pinpoint the origin of an error and the condition 
that caused it. 
Validation and cleaning of the input data was also very 
important to prevent errors being propagated into our derived 
results. Validity, accuracy and repeatability of the results are 
core requirements of the project, so accurately capturing the 
provenance of the derived visualizations and graphics was 
another critical and challenging aspect of this project. 
IX. FUTURE WORK 
One of our project aims was to build a viable system that 
includes enough tools and useful data to allow social scientists 
to test and prove hypotheses about spatial patterns. 
Any viable system needs to contain a critical mass of value-
added (derived) data of interest to researchers. Deriving 
variables from raw demographic, socio-economic and voting 
data and importing them into the system is currently a 
labour-intensive, time-consuming and potentially error-prone process. 
To ease the generation of value-added data, workflow and 
ingest tools that support the specification and automatic generation 
of derived variables are needed. These tools could also 
automatically record the provenance / lineage of variables 
derived through this process. 
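One possible shape for such an ingest tool is sketched below in Python. This is an assumption about how the proposed tooling might look, not a description of SISS-eRF: the `DerivedVariable` specification, the `derive` function and the variable and region names are all hypothetical, and the counts are made up. The sketch computes a derived percentage for each region and records which raw variables and formula produced it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DerivedVariable:
    """Hypothetical declarative specification of a value-added variable."""
    name: str
    numerator: str    # name of a raw variable
    denominator: str  # name of a raw variable

    @property
    def formula(self) -> str:
        return f"100 * {self.numerator} / {self.denominator}"


def derive(spec, raw_rows, source):
    """Compute the variable per region and return it with a provenance record."""
    values = {}
    for region, row in raw_rows.items():
        denom = row[spec.denominator]
        # Empty regions yield None instead of raising a division error.
        values[region] = None if denom == 0 else 100.0 * row[spec.numerator] / denom
    provenance = {
        "variable": spec.name,
        "inputs": [spec.numerator, spec.denominator],
        "formula": spec.formula,
        "source": source,
    }
    return values, provenance


if __name__ == "__main__":
    # Made-up counts for two illustrative regions.
    raw = {
        "region_a": {"renters": 250, "households": 1000},
        "region_b": {"renters": 0, "households": 0},
    }
    spec = DerivedVariable("pct_renting", "renters", "households")
    values, provenance = derive(spec, raw, source="illustrative census extract")
    print(values)
    print(provenance)
```

Keeping the specification declarative means the provenance record can be generated from the same object that drives the computation, so the two cannot drift apart.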
A streamlined method to enable addition of new statistical 
algorithms and the ability to chain results into multiple analysis 
tools are some of the key next steps to provide a complete tool 
for online geo-spatial statistical analysis. We are also keen to 
incorporate new data sets (associated with housing, transport, 
labour force and crime) to evaluate the tools in the context of 
other social science sub-disciplines. 
Finally, the AURIN (Australian Urban Research 
Infrastructure Network) project is a national initiative that is 
developing eResearch infrastructure to be shared by the 
Australian social science research community. We are currently 
working with AURIN to integrate our services into their 
framework. We also plan to integrate the 2011 Census data 
recently published by the Australian Bureau of Statistics. 
X. CONCLUSIONS 
This paper describes SISS-eRF: a Web-based extensible 
framework that enables social scientists with little or no prior 
programming or statistical analysis skills to easily access social 
science data sets of interest to them, aggregate them at different 
geographic scales, analyse and model them statistically and 
generate visualizations and graphics that empirically demonstrate 
spatial social science patterns or trends. We aimed 
to build a viable system that includes a critical mass of 
geospatial, statistical analysis and visualisation tools and useful 
data to prove hypotheses about spatial patterns. In order to 
address this “viability” goal, we: 
• Involved social scientists in the development of the system, 
and iteratively developed the system in response to their 
feedback;  
• Used R and Rserve to build an extensible framework that 
can easily incorporate new statistical modeling tools as 
they become available; 
• Used only open source technologies so that others could 
adapt and repurpose our work; 
• Derived hundreds of statistical variables and associated 
them with geographies of interest to researchers. 
The case studies in this paper show that SISS-eRF can be used 
to quickly and easily identify trends in voting patterns, 
demographic and socio-economic variables, and to prove or 
disprove researchers’ hypotheses. Based on feedback from 
spatio-social scientists we are actively incorporating more 
datasets in the system (e.g., housing, crime, transport) and 
integrating a wider range of statistical fit algorithms that can 
test more complex hypotheses. 
ACKNOWLEDGMENT 
Development of the SISS-eRF was supported by the ARC 
Research Network in Spatially Integrated Social Science and 
the Australian National Data Service (ANDS) through the 
National Collaborative Research Infrastructure Strategy 
Program and the Education Investment Fund (EIF) Super 
Science Initiative. 
REFERENCES 
[1] Australian Bureau of Statistics (ABS), “2006 Census 
DataPacks,” from the 2006 Census of Population and Housing, 
Canberra ACT, 2007. 
[2] R. G. Cromley, “A comparison of optimal classification 
strategies for choroplethic displays of spatially aggregated data,” 
International Journal of Geographical Information Systems, 
10(4):405-424, 1996. 
[3] S. Doyle, M. Dodge, and A. Smith, “The potential of Web-based 
mapping and virtual reality technologies for modelling urban 
environments,” Computers, Environment and Urban Systems, 
22(2):137-155, 1998. 
[4] W. D. Fisher, “On grouping for maximum homogeneity,” 
Journal of the American Statistical Association, 53:789-798, 1958. 
[5] G. F. Jenks, “Generalization in statistical mapping,” Annals of 
the Association of American Geographers, 53(1):15-26, 1963. 
[6] S. D. Kirkby and S.E.P. Pollitt, “Distributing spatial information 
to geographically disparate users: a case study of ecotourism and 
environmental management,” Australian Geographical Studies, 
36(3):262-272, 1998. 
[7] E. Liao, T-K. Shyy, and R. J. Stimson, “Developing a web-based 
e-research facility for socio-spatial analysis to investigate 
relationships between voting patterns and local population 
characteristics,” Journal of Spatial Science, 54(2):63-88, 2009. 
[8] A. T. Murray and T-K. Shyy, “Integrating Attribute and Space 
Characteristics in Choropleth Display and Spatial Data Mining,” 
International Journal of Geographical Information Science, 
14(7):649-667, 2000. 
[9] S. Openshaw, I. Turton, J. Macgill, and J. Davy, “Putting the 
geographical analysis machine on the Internet,” in Innovations 
in GIS, vol. VI, B. Gittings, Ed. London: Taylor & Francis, 
1999, pp. 121-131. 
[10] Z-R. Peng, “An assessment framework for the development of 
Internet GIS,” Environment and Planning B: Planning and 
Design, 26(1):117-132, 1999. 
[11] Z-R. Peng and M-H. Tsou, Internet GIS: distributed geographic 
information services for the Internet and wireless networks. New 
Jersey: John Wiley & Sons, 2003. 
[12] T-K. Shyy, R. Stimson, and A. T. Murray, “An Internet GIS and 
spatial model to benchmark local government socioeconomic 
performance,” Australasian Journal of Regional Studies, 
9(1):31-47, 2003. 
[13] T-K. Shyy, R. Stimson, J. Western, A. T. Murray, and L. 
Mazerolle, “Web GIS for mapping community crime rates: 
approaches and challenges,” in Geographic Information Systems 
and Crime Analysis, F. Wang, Ed. Hershey: Idea Group 
Publishing, 2005, pp. 236-252. 
[14] T-K. Shyy, R. Stimson, and P. Chhetri, “Web-based GIS for 
mapping voting patterns at the 2004 Australian federal election,” 
Applied GIS, 3(11):1-20, 2007. 
[15] T-K. Shyy, R. Stimson, P. Chhetri, and J. Western, “Mapping 
quality of life in the south east Queensland region with a Web-
based application,” Journal of Spatial Science, 52(2):13-22, 
2007. 
[16] R. J. Stimson, P. Chhetri, and T-K. Shyy, “Typology of local 
patterns of voter support for political parties at the 2004 federal 
election,” People and Place, 15(1):1-12, 2007. 
[17] S. Taylor, Identifying Greens Voters with GIS, Honours thesis, 
Melbourne: University of Melbourne, 2003. 
[18] ESRI ArcGIS [Online]. Available: 
http://www.esri.com/software/arcgis 
[19] BioMedware SpaceStat [Online]. Available: 
http://www.biomedware.com/?module=Page&sID=spacestat 
[20] XStream Java library [Online]. Available: 
http://xstream.codehaus.org  
[21] R Development Core Team (2011). R: A Language and 
Environment for Statistical Computing. R Foundation for 
Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0 
[22] S. Urbanek, “Rserve: A Fast Way to Provide R Functionality to 
Applications,” in K. Hornik, F. Leisch, A. Zeileis (eds.), 
Proceedings of the 3rd International Workshop on Distributed 
Statistical Computing, Vienna, Austria, 2003. ISSN 1609-395X 
[23] REngine Java library [Online]. Available: 
http://www.rforge.net/org/docs/org/rosuda/REngine/REngine.html 
[24] R Project social sciences web task view [Online]. Available: 
http://cran.r-project.org/web/views/SocialSciences.html  
[25] Processing.js: A port of the Processing Visualization Language 
[Online]. Available: http://www.processingjs.org  
[26] Processing: A visualisation language [Online]. Available:  
http://www.processing.org  
[27] iText PDF library [Online]. Available: http://itextpdf.com  
[28] OpenGeo Suite [Online]. Available: http://opengeo.org/  
[29] SeleniumHQ Web application testing system [Online]. 
Available: http://seleniumhq.org