Note

As of February 2011, the live RDF copy of the Young Lives data is offline. Future developments to create a stable set of RDF data for Young Lives are being planned.

This demonstrator explores the process of putting development-focussed social science research data online using linked data conventions.

Young Lives

Young Lives is an international study of childhood poverty, involving 12,000 children in 4 countries over 15 years. It is led by a team in the Department of International Development at the University of Oxford in association with research and policy partners in the 4 study countries: Ethiopia, India, Peru and Vietnam.

This exploration/demonstrator project has worked with a recent health dataset containing young people's responses to a wide range of questions about their health situation and behaviours. The dataset covers responses from Peru.

The demonstrator has sought to:


Modelling and representing the data

The dataset being modelled contained two main kinds of data:

  • The survey responses (microdata);
  • The survey questions (meta-data);

The original dataset was available as a flat-structured SPSS table, with one row for each child/young person and with the columns (variables) giving their responses to each question. However, for some questions the response was split across multiple columns (e.g. where one question was asked but had a multiple-choice answer, the response would be recorded as a yes/no under a separate variable for each option in the multiple-choice list). In modelling the data we were looking for a structure which provided:

  • Simplicity & flat structure - to make querying the data easier and to help those unfamiliar with the graph structures possible in RDF to quickly find and use full sets of the data;
  • Comparability with SPSS data - keeping the same variable names;
  • Ease of annotation - allowing additional information to be attached to questions;
  • Re-use of other vocabularies and ontologies - increasing the chance of existing tool-chains being able to operate against the dataset;
  • Making linkages - allowing data from the wider web of linked data to be used in the demonstrator;

In looking for vocabularies to re-use we explored the work of the Data Documentation Initiative (DDI), which offers an XML Schema for recording meta-data about studies and their components. Whilst an intermediate DDI file was used to generate the RDF representation of our variables, DDI focuses on the meta-data for whole studies (beyond the scope of this demonstrator) and lacks a widely available RDF version (the standard is currently only available as an XML Schema), so we turned instead to the developing DataCube vocabulary.

Whilst the Young Lives microdata is not strictly statistical data (statistics are generated from calculations upon it, but the raw data records individual observations), adapting the DataCube vocabulary appeared to provide a well-designed flat (or flattenable) structure with plenty of ways to annotate the data. The responses from each individual child can be modelled as an 'Observation' (the microdata), and each question as a 'MeasureProperty' of that observation. Both Observations and MeasureProperties are resources which can be annotated.
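The row-to-Observation mapping described above can be sketched in a few lines of Python. This is only an illustration, not the PHP conversion scripts used in the project: the `BASE` namespace, the variable names `SMOKEA`/`SMOKEB` and the child identifier are hypothetical, and the DataCube namespace shown is the PURL-based one rather than the temporary URIs used in the demonstrator.

```python
QB = "http://purl.org/linked-data/cube#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF_AGENT = "http://xmlns.com/foaf/0.1/Agent"
BASE = "http://data.younglives.org.uk/example/"  # hypothetical namespace


def row_to_triples(child_id, row):
    """Turn one flat survey row into (subject, predicate, object) triples."""
    subject = f"{BASE}observation/{child_id}"
    triples = [
        (subject, RDF_TYPE, QB + "Observation"),
        (subject, RDF_TYPE, FOAF_AGENT),
    ]
    for variable, value in row.items():
        # Each SPSS variable name is kept and becomes one property URI.
        triples.append((subject, f"{BASE}variable/{variable}", value))
    return triples


def to_ntriples(triples):
    """Serialise triples as N-Triples lines.

    Simplification: any object starting with "http://" is treated as a URI,
    everything else becomes a plain string literal.
    """
    lines = []
    for s, p, o in triples:
        text = str(o)
        if text.startswith("http://"):
            obj = f"<{text}>"
        else:
            escaped = text.replace("\\", "\\\\").replace('"', '\\"')
            obj = f'"{escaped}"'
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines)


if __name__ == "__main__":
    example = row_to_triples("PE0001", {"SMOKEA": "1", "SMOKEB": "0"})
    print(to_ntriples(example))
```

Keeping one subject per child and one property per variable preserves the flat shape of the SPSS table, which is the property the modelling goals above ask for.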

Observations are additionally typed as foaf:Agent, using the FOAF (Friend of a Friend) vocabulary. This can be used to distinguish Observations about individual children in our datastore from aggregate statistical observations. It would also provide the basis for articulating the relationships between agents which other areas of the Young Lives study ask about (the dataset being modelled contained only individual children; however, other datasets from the study contain parents and details of family relationships). The deprecated foaf:dateOfBirth property is also used in our model, although discussions around the FOAF project suggest an alternative modelling should be used.

A new variable named LOCATION has been added to the dataset, identifying the region in which the child is located (or showing Peru if the region cannot be revealed for anonymisation reasons). Locations are modelled by reference to resources from http://ontologi.es/place/ which returns RDF data about the names of places (from their short ISO code names) and some basic information about their relationships.

MeasureProperties are given URIs based upon the original variable name from the SPSS file. They are annotated with labels (again from the original SPSS file) and, where relevant, are linked to a codeList which gives all the possible response values. CodeLists are modelled using SKOS concepts, though no effort has yet been made to link these concept lists to concept lists published elsewhere. CodeLists are re-used, making it possible to see which variables re-use the same codes (e.g. see Instances Linking Here for this list).
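Because each variable points at a shared codeList, finding the variables that re-use the same codes is a matter of inverting that link. A minimal sketch follows, assuming the variable-to-codeList links have already been read into a Python dictionary; the variable and code list names are hypothetical.

```python
from collections import defaultdict


def variables_by_code_list(code_list_of):
    """Invert a {variable: code list} mapping into {code list: [variables]}."""
    shared = defaultdict(list)
    # Sorting makes the order of variables within each list predictable.
    for variable, code_list in sorted(code_list_of.items()):
        shared[code_list].append(variable)
    return dict(shared)


if __name__ == "__main__":
    links = {"SMOKEB": "yesno", "SMOKEA": "yesno", "LOCATION": "regions"}
    for code_list, variables in variables_by_code_list(links).items():
        print(code_list, variables)
```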

To identify groups of questions we created our own set of QuestionGroup properties to allow annotation of a set of questions. For example, the questions on smoking form a group which has an attached comment and a note describing the collection of questions. (The smoking group was built from hand-written RDF, with one-way properties asserted (i.e. that the group hasQuestions), and then run through the cwm reasoning engine to expand out the opposite directional property (isInQuestionGroup), making it easy to navigate to a question group from a question itself.)
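The inverse-property expansion that cwm performed can be illustrated without a reasoner. The sketch below is a plain-Python stand-in, not the cwm rules used in the project; triples are simple tuples, and the group and question identifiers are hypothetical.

```python
HAS_QUESTIONS = "hasQuestions"
IS_IN_QUESTION_GROUP = "isInQuestionGroup"


def add_inverse(triples, prop, inverse):
    """Return the triples plus an inverse statement for every use of `prop`."""
    result = set(triples)
    for subject, predicate, obj in triples:
        if predicate == prop:
            # group hasQuestions question  =>  question isInQuestionGroup group
            result.add((obj, inverse, subject))
    return result


if __name__ == "__main__":
    asserted = [
        ("SmokingGroup", HAS_QUESTIONS, "SMOKEA"),
        ("SmokingGroup", HAS_QUESTIONS, "SMOKEB"),
    ]
    for triple in sorted(add_inverse(asserted, HAS_QUESTIONS, IS_IN_QUESTION_GROUP)):
        print(triple)
```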

Issues

Provenance

Only minimal provenance information is provided in this dataset. Further work would want to consider use of the Open Provenance Model to clearly indicate what has happened at each step of the data processing.

URIs

We have used http://data.younglives.org.uk/ as the main URI for the dataset (graph) we have created - although there is no server live there at present.

Licenses

We have not addressed issues of licensing of data in our work.

Stable vocabularies?

The DataCube vocabulary and the representations of SDMX in RDF are currently under development. The URIs used in our data for SDMX terms were taken from a mailing list posting (and appear relatively reliable, if not clearly 'officially sanctioned'), and there was no URI for the DataCube vocabulary clearly available at the time of development. This means that, in the case of the DataCube classes and properties, we created temporary URIs at http://data.younglives.org.uk/ontologies/datacube (generally using the prefix 'qb') to refer to DataCube elements. However, we have since discovered this definition of the DataCube vocabulary in RDF that makes use of a PURL redirect to provide a stable URI for the DataCube vocabulary.

Whose vocabulary?

The vocabularies chosen are in many cases still evolving. It is possible that properties we are using will cease to be part of their specification, or new conventions for, or restrictions on, modelling will be introduced. What impacts will this have on our attempt to create a stable rendering of our data?

Measures or dimensions?

Figure: Statistical table showing one measure and three dimensions

Most semantic representations of statistical data would divide the columns in a flat spreadsheet into dimensions and measures. A cell in the table contains either a 'dimension' by which measurements can be displayed/sorted/explored (such as Gender; Location; Timeperiod), or a 'measure' which presents the result of some measurement process or calculation. A single statistic will consist of one measure and any number of dimensions. For example, the table here shows two statistics (measures), each with three dimensions (gender; age; location).

However, the microdata from the Young Lives survey, with each individual child's responses to survey questions, consists of a large number of measures, and which measures constitute dimensions may depend on the analysis to which the data is being put. The Statistical Core Vocabulary (SCOVO) can only represent data with a single measure and multiple dimensions. The DataCube vocabulary, currently under development, provides a more flexible way of modelling statistical data, allowing for a version of data representation with multiple measures against certain dimensions.

In the microdata modelling we have broken from the strict DataCube model and created Observations consisting entirely of MeasureProperties with no dimensions. (Question: Is this bad practice? Should we have created some new vocabulary subclassing DataCube classes and properties in order to create something different for our purposes, rather than using, and breaking the strict interpretation of, the DataCube vocabulary?)


Publishing the data online

With our model decided upon we created a range of PHP scripts which converted the original data into RDF formats. It would be possible to publish the RDF linked data simply by creating a range of files and a URL structure and placing these on a web server. For example, we could create a file for each child called CHILDNAME.RDF and create some sort of index of these allowing them to be easily found, with questions articulated in files at their individual URIs. However, to allow the data to be queried more easily, it's advantageous to load it into an RDF store.

To give us an interface for exploring and manipulating the RDF data we chose OntoWiki using a Virtuoso datastore. OntoWiki provides for collaborative browser-based display and editing of RDF. It can also be configured (though it is not in our case) to sit on a linked-data domain (e.g. the currently imaginary http://data.younglives.org.uk/ domain) and to return human-readable information to human browsers, and machine-readable RDF to computers, for any resource within that domain.
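Returning HTML to browsers and RDF to machines for the same URI is usually done by inspecting the HTTP Accept header. The sketch below shows the general idea only; it is not OntoWiki's implementation, and both the list of supported media types and the fallback to HTML are assumptions made for illustration.

```python
# Media types this hypothetical server can produce, mapped to a representation.
SUPPORTED = {
    "text/html": "html",
    "application/rdf+xml": "rdf",
    "text/turtle": "rdf",
}


def choose_representation(accept_header, default="html"):
    """Pick 'html' or 'rdf' from an Accept header, honouring q-values."""
    best_kind, best_q = default, 0.0
    for part in accept_header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        quality = 1.0  # a media type without an explicit q parameter has q=1
        for param in fields[1:]:
            if param.startswith("q="):
                try:
                    quality = float(param[2:])
                except ValueError:
                    quality = 0.0
        if media_type in SUPPORTED and quality > best_q:
            best_kind, best_q = SUPPORTED[media_type], quality
    return best_kind


if __name__ == "__main__":
    print(choose_representation("text/html,application/xhtml+xml;q=0.9"))
    print(choose_representation("application/rdf+xml"))
```

Wildcards such as `*/*` are not matched here, so a client sending only a wildcard falls through to the default representation.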

You can browse the data in the Young Lives model here.


Linking data

During the modelling of our data we have used some properties and classes from existing vocabularies, such as the SDMX terms for the sex of an individual. If the questions from which some Young Lives questions are derived, or the code-lists re-used by the Young Lives survey, were available in linked data form on the web, we could have chosen to link against these also (and in theory to import any linked data meta-data about them into our data store).

Using the Linked Open Data cloud diagram we identified possible places where literal strings, or local resources, in our model could be replaced by resources from the growing web of linked data. In particular, this was possible for the location identifiers for child observations. The http://ontologi.es/places service provided the ability to link against ISO 3166-2 regional country subdivisions. The http://ontologi.es/places server returns machine-readable data about the country a region is in (and the regions within a country) using Dublin Core terms. The OntoWiki platform makes it easy for us to import this linked data into our own datastore for local querying using 'Linked Data Wrappers' (which go off and fetch RDF data into the local store). We can also take advantage of the 'sponger' capability of the RDF database and query engine we are using (Virtuoso), which can be instructed to dynamically grab linked resources at query-time. For example, the query below, which can be issued on the SPARQL endpoint here (click here for results), will fetch labels for countries and regions from the relevant ontologi.es/places resources at run-time (i.e. this data is not stored locally, but is fetched across the web of linked data).

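	# Virtuoso sponger pragmas: fetch linked RDF at query time,
	# following links for at most 5 rounds and 500 resources
	define input:grab-all "yes"
	define input:grab-intermediate "yes"
	define input:grab-depth 5
	define input:grab-limit 500
	define input:grab-seealso  
	define input:grab-seealso  

	PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
	PREFIX foaf: <http://xmlns.com/foaf/0.1/>
	PREFIX yl: 
	PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
	PREFIX owl: <http://www.w3.org/2002/07/owl#>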

	SELECT DISTINCT * 
	WHERE {
	  ?agent rdf:type foaf:Agent . 
	  ?agent yl:LOCATION ?location.
	  ?location rdfs:label ?label.
	}
		

The sameAs.org service was developed to indicate relationships of identity between different resources in the linked data web. The Ontologi.es/place service provides a seeAlso property for each country or region pointing to the sameAs.org records for that particular country. In our data, good records only appear to exist for Peru itself, pointing to an RDF version of the CIA World Factbook entry for Peru which we can import into our dataset. sameAs.org was built as a result of a research project, and no easy way is provided for users to input new relationships into it.

When we import linked data via a seeAlso or sameAs link into our datastore we need either (a) to merge the statements about this seeAlso / sameAs resource into the original resource (using some form of reasoning, such as cwm); or (b) when querying the data, to check for properties of seeAlso and sameAs resources as well. Our datastore and query engine (Virtuoso) allows us to fetch back all the statements made about 'sameAs' resources at run-time in a query using the 'DEFINE input:same-as "yes"' flag. So, the query below (results via this link) fetches all the things known about the resource for Peru and any resources which are indicated as the same as this resource.

		DEFINE input:same-as "yes"
		SELECT *
		WHERE
		 {
		    ?p ?o.
		 }
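Option (a) above, merging statements across sameAs links before querying, can be sketched in plain Python. This is an illustrative stand-in for a reasoner such as cwm, not the project's actual process: it only follows direct (one-hop) sameAs links, and the resource names in the example are hypothetical.

```python
SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def merge_same_as(triples):
    """Copy each statement onto resources directly linked by owl:sameAs."""
    aliases = {}
    for subject, predicate, obj in triples:
        if predicate == SAME_AS:
            # sameAs is symmetric, so record the link in both directions.
            aliases.setdefault(subject, set()).add(obj)
            aliases.setdefault(obj, set()).add(subject)

    merged = set(triples)
    for subject, predicate, obj in triples:
        if predicate == SAME_AS:
            continue
        for alias in aliases.get(subject, ()):
            merged.add((alias, predicate, obj))
    return merged


if __name__ == "__main__":
    data = [
        ("place:PE", SAME_AS, "factbook:Peru"),
        ("factbook:Peru", "rdfs:label", "Peru"),
    ]
    for triple in sorted(merge_same_as(data)):
        print(triple)
```

Merging in advance makes later queries simpler at the cost of duplicating statements in the store; the query-time flag shown above avoids the duplication.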
		

The sameAs.org service does not provide any good suggestions for related resources for our regional locations. Two resources can provide us with additional data, however: (1) DBpedia - a machine-readable rendering of information from Wikipedia pages; (2) geonames.org - providing basic geodata about locations. To use these sources we manually add 'seeAlso' or 'sameAs' statements and then use OntoWiki to import the linked data to our datastore (or leave the data to be sponged up by Virtuoso at query time). We have only made explicit links for Lima in the current dataset.

Comparison data

It became clear during the demonstrator that there was a lack of data in linked-data form which would be 'comparable' to the Young Lives data. Whilst questions in the Young Lives study are often based upon questions from other studies, none of the results of those studies are openly published. This meant that, to explore the potential of linked data not only for pulling in extra information to be used in research (and providing access to data) but also for allowing easier exploration and comparison between datasets, we needed to generate some demonstration comparison data. We found some data available on youth smoking prevalence rates in Latin America, and so focussed on a comparison of smoking prevalence in the Young Lives regions in the dataset with wider statistics for Latin America.

This involved creating a new set of aggregate statistic DataCubes from our dataset, identifying the single measure (Smoking Prevalence) by a range of different dimensions. The data we generated (see all the instances linking to this dataset resource) was created relatively crudely and should not be considered accurate, and we note that there did not seem to be an established way in the DataCube vocabulary of articulating the statistical significance of a value.
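Deriving an aggregate statistic of this kind from microdata amounts to grouping records by a chosen dimension and computing one measure per group. The sketch below assumes each child's record is a dictionary; the field names `region`, `gender` and `smokes` and the sample values are hypothetical, not taken from the Young Lives data.

```python
from collections import defaultdict


def prevalence_by(records, dimension, measure="smokes"):
    """Return {dimension value: share of records where `measure` is true}."""
    counts = defaultdict(lambda: [0, 0])  # value -> [smokers, respondents]
    for record in records:
        key = record[dimension]
        counts[key][1] += 1
        if record[measure]:
            counts[key][0] += 1
    return {key: smokers / total for key, (smokers, total) in counts.items()}


if __name__ == "__main__":
    sample = [
        {"region": "Lima", "gender": "F", "smokes": True},
        {"region": "Lima", "gender": "M", "smokes": False},
        {"region": "Other", "gender": "F", "smokes": False},
    ]
    print(prevalence_by(sample, "region"))
    print(prevalence_by(sample, "gender"))
```

The same microdata yields a different cube for each choice of dimension, which is why the distinction between measures and dimensions depends on the analysis being carried out.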

To get comparable data we also had to crawl data from the Pan American Tobacco Online Information System (PATIOS) to find the smoking prevalence rate among 13-15 year olds for a range of Latin American countries. We had to build custom mappings from the internal Country IDs used by PATIOS to ISO 3166-1 codes and, again, these mappings in the demonstrator should not be considered fully checked and verified.
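A mapping step like this is easier to check if unmapped identifiers are reported rather than silently dropped. In the sketch below the numeric country IDs and the rates are invented for illustration and are not the real PATIOS identifiers or figures; only the ISO 3166-1 alpha-2 codes (PE, BR, CL) are real.

```python
def map_country_ids(rows, id_to_iso):
    """Split (country_id, rate) rows into mapped rows and unmapped IDs."""
    mapped, unmapped = [], []
    for country_id, rate in rows:
        iso_code = id_to_iso.get(country_id)
        if iso_code is None:
            unmapped.append(country_id)  # keep for manual review
        else:
            mapped.append((iso_code, rate))
    return mapped, unmapped


if __name__ == "__main__":
    hypothetical_ids = {101: "PE", 102: "BR", 103: "CL"}
    crawled = [(101, 0.2), (102, 0.1), (999, 0.3)]
    mapped_rows, missing = map_country_ids(crawled, hypothetical_ids)
    print(mapped_rows)
    print("IDs needing a mapping:", missing)
```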

By articulating both the Young Lives aggregate data and the data from PATIOS using the same 'data structure definition' (DSD) in the DataCube vocabulary, we can query them together. This is the approach taken by our visualisation demonstrator, which looks to (a) find any data structure definitions in a datastore; (b) look for datasets that share that definition; (c) show ways of comparing them. In an environment where linked data was more widespread, we might envisage Data Structure Definitions for certain key global indicators being published on the web (for example, on the WHO website), and anyone then being able to generate their comparable datasets re-using that shared DSD.
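Steps (a) and (b) can be sketched as a grouping over the `qb:structure` links between datasets and their definitions. This is an illustration over in-memory tuples rather than the visualisation demonstrator's own code, and the dataset and DSD names are hypothetical; `qb:structure` is the DataCube property that attaches a data structure definition to a dataset.

```python
QB_STRUCTURE = "http://purl.org/linked-data/cube#structure"


def datasets_by_structure(triples):
    """Group datasets by DSD, keeping only DSDs shared by two or more datasets."""
    groups = {}
    for subject, predicate, obj in triples:
        if predicate == QB_STRUCTURE:
            groups.setdefault(obj, []).append(subject)
    return {
        dsd: sorted(datasets)
        for dsd, datasets in groups.items()
        if len(datasets) > 1
    }


if __name__ == "__main__":
    store = [
        ("yl:smokingByRegion", QB_STRUCTURE, "dsd:smokingPrevalence"),
        ("patios:smokingByCountry", QB_STRUCTURE, "dsd:smokingPrevalence"),
        ("yl:otherDataset", QB_STRUCTURE, "dsd:other"),
    ]
    print(datasets_by_structure(store))
```

Datasets that appear together under one DSD have the same dimensions and measure, so they are candidates for side-by-side comparison.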

Visualisation

We commissioned a simple visualisation to show the relationship between the data from our datastore, and comparable data from third parties. You can find the Comparator visualisation here.