Big spatio-temporal datasets, available through both open and administrative data sources, offer significant potential for social science research. The magnitude of the data allows for increased resolution and analysis at the individual level. While there have been recent advances in forecasting techniques for highly granular temporal data, little attention has been given to segmenting the time series and finding homogeneous patterns. This paper proposes estimating behavioral profiles of individuals' activities over time using Gaussian Process-based models. In particular, the aim is to investigate how individuals or groups may be clustered according to the model parameters. This Bayesian non-parametric method is then tested by examining the predictability of the segments, using a combination of models to fit different parts of the temporal profiles. Model validity is assessed on a set of holdout data. The dataset consists of half-hourly energy consumption records from smart meters in more than 100,000 households in the UK and covers the period from 2015 to 2016. The methodological approach developed in the paper may be readily applied to datasets of similar structure and granularity, for example social media data, and may lead to improved accuracy in the prediction of social dynamics and behavior.
Research on customer satisfaction has increased substantially in recent years. However, the relative importance of, and relationships between, different determinants of satisfaction remain uncertain. Moreover, quantitative studies to date tend to test the significance of pre-determined factors thought to have an influence, with no scalable means of identifying other causes of user satisfaction. These gaps make it difficult to use available knowledge on user preferences for public service improvement. Meanwhile, digital technology development has enabled new methods of collecting user feedback, for example through online forums where users can comment freely on their experience. New tools are needed to analyze large volumes of such feedback. Topic models are proposed as a feasible solution for aggregating open-ended user opinions that can be easily deployed in the public sector. The insights generated can contribute to a more inclusive decision-making process in public service provision. This novel methodological approach is applied to a case of service reviews of publicly-funded primary care practices in England. Findings from the analysis of 145,000 reviews covering almost 7,700 primary care centers indicate that the quality of interactions with staff and bureaucratic exigencies are the key issues driving user satisfaction across England.
We present a database of parliamentary debates that contains the complete record of parliamentary speeches from Dáil Éireann, the lower house and principal chamber of the Irish parliament, from 1919 to 2013. In addition, the database contains background information on all TDs (Teachta Dála, members of parliament), such as their party affiliations, constituencies and office positions. The current version of the database includes close to 4.5 million speeches from 1,178 TDs. The speeches were downloaded from the official parliament website and further processed and parsed. Background information on TDs was collected from the member database of the parliament website. Data on cabinet positions (ministers and junior ministers) was collected from the official website of the government. A record linkage algorithm and human coders were used to match TDs and ministers.
- “Topology Analysis of International Networks Based on Debates in the United Nations” (with Stefano Gurciullo), arXiv:1707.09491 [cs.CL], 29 July 2017.
In complex, high-dimensional and unstructured data it is often difficult to extract meaningful patterns. This is especially the case when dealing with textual data. Recent studies in machine learning, information theory and network science have developed several novel instruments to extract the semantics of unstructured data and harness it to build a network of relations. Such approaches serve as an efficient tool for dimensionality reduction and pattern detection. This paper applies semantic network science to extract ideological proximity in the international arena, focusing on data from the General Debates in the UN General Assembly on topics of high salience to the international community. The UN General Debate corpus (UNGDC) covers all high-level debates in the UN General Assembly from 1970 to 2014 and includes all UN member states. The research proceeds in three main steps. First, Latent Dirichlet Allocation (LDA) is used to extract the topics of the UN speeches, and therefore semantic information. Each country is then assigned a vector specifying its exposure to each of the topics identified. Second, this intermediate output is used to construct a network of countries based on information-theoretic metrics, where the links capture similar vectorial patterns in the topic distributions. The topology of the networks is then analyzed through network properties such as density, path length and clustering. Finally, we identify specific topological features of our networks using the map equation framework to detect communities in our networks of countries.
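The network-construction step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the information-theoretic metric, so Jensen-Shannon divergence is assumed here, and the threshold, the function names (`js_divergence`, `build_network`, `density`) and the toy topic vectors are all hypothetical. The LDA step is skipped; the per-country topic distributions are taken as given.

```python
import math
from itertools import combinations


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions.

    Assumed metric: the abstract only says "information theoretical metrics".
    """
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence; terms with zero mass contribute nothing.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def build_network(topic_vectors, threshold=0.1):
    """Link two countries when their topic distributions are similar.

    The threshold is a hypothetical choice made for illustration only.
    """
    edges = set()
    for a, b in combinations(sorted(topic_vectors), 2):
        if js_divergence(topic_vectors[a], topic_vectors[b]) <= threshold:
            edges.add((a, b))
    return edges


def density(n_nodes, edges):
    """Share of possible undirected links that are present."""
    return 2 * len(edges) / (n_nodes * (n_nodes - 1))


# Toy topic exposures for three made-up countries (not real UNGDC output).
topic_vectors = {
    "A": [0.7, 0.2, 0.1],
    "B": [0.6, 0.3, 0.1],
    "C": [0.1, 0.1, 0.8],
}
edges = build_network(topic_vectors)
network_density = density(len(topic_vectors), edges)
```

In this toy example only countries A and B are linked, because their topic exposures are close while C concentrates on a different topic. Path length, clustering and community detection would follow on the resulting edge set.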
- “Detecting Policy Preferences and Dynamics in the UN General Debate with Neural Word Embeddings” (with Stefano Gurciullo), arXiv:1707.03490 [cs.CL], 11 July 2017.