Thursday, 21 April 2011

Recommendation service

Do I really want to write a SOAP service (SUSHI uses SOAP, but there don't seem to be any mature open-source Java clients or servers available for hacking)? How about a REST service? This could be a neat solution using something like Apache CXF.

Either of these would be great but as I only have a few simple data requests to execute, I've decided to go for a quick and easy solution - Apache XML-RPC deployed in a servlet.
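
As far as I understand it, Apache XML-RPC exposes the public methods of plain Java handler classes, so the server side can stay very small. Here is a minimal sketch of what such a handler might look like. The class name, method name, map keys and placeholder rows are all my own illustrative assumptions, not the deployed code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical handler: names and data are illustrative assumptions.
class RecommendationHandler {

    // XML-RPC has no custom object types, so each suggestion is returned
    // as a struct (a Map) inside an array (an Object[]).
    public Object[] recommend(String handle, int limit) {
        // Placeholder rows; a real handler would look up suggestions for
        // the given handle in the database. The handle is unused here.
        String[][] rows = {
            {"123456789/11", "First example title"},
            {"123456789/12", "Second example title"},
            {"123456789/13", "Third example title"},
        };
        int n = Math.max(0, Math.min(limit, rows.length));
        Object[] result = new Object[n];
        for (int i = 0; i < n; i++) {
            Map<String, Object> item = new HashMap<>();
            item.put("handle", rows[i][0]);
            item.put("title", rows[i][1]);
            result[i] = item;
        }
        return result;
    }
}
```

If I've read the Apache XML-RPC documentation correctly, the servlet finds handlers through an XmlRpcServlet.properties file that maps a handler name to a class, so a client would call something like Recommender.recommend with a handle and a limit.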

Behind this I'm using a MySQL database with Apache DBCP managing the connection pool. I was going to use the lightweight MyBatis data mapper framework (formerly known as iBATIS) but again, as I'm only using a couple of queries, the flexibility it provides isn't really worth the overhead.

The test set-up is working so now all I've got to do is tidy it up, deploy the server as a service and deploy clients within the six DSpace institutional repositories. How long have I got?


Consuming and Querying data

So what service should I use? My first thought was to push data to a SQL database, then I thought of Solr. Solr is fast and efficient; it's great for powerful full-text and faceted search, hit highlighting and rich document (e.g., Word, PDF) handling. So I pushed OpenURL Context Objects from DSpace to Solr and used simple queries to view the captured activity data.

Then I thought again. What data do I want returned from a recommendation service? I just want a few item handles and some metadata as suggestions to view alongside the current resource. I found I could do this using a SQL query on a test database but wasn't sure if I could construct queries with inner joins using Solr. I'm not that familiar with Solr and couldn't find what I wanted. A patch was available for the latest release that could do this but then again, maybe this isn't one of Solr's strengths... or maybe I don't have the right data structure.
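
To make the inner-join idea concrete, here is a rough sketch of one way such a query could work: join the activity table to itself on the requester, so that items used by the same people as the current item float to the top. The table and column names (activity, requester, handle) are assumptions for illustration, and the Java method is just an in-memory equivalent of the query so the logic can be checked without a database.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class CoViewRecommender {

    // Hypothetical schema: activity(requester, handle). The self-join pairs
    // rows for the current item with rows for other items by the same requester.
    static final String SQL =
        "SELECT b.handle, COUNT(DISTINCT b.requester) AS shared "
      + "FROM activity a INNER JOIN activity b "
      + "ON a.requester = b.requester AND a.handle <> b.handle "
      + "WHERE a.handle = ? "
      + "GROUP BY b.handle ORDER BY shared DESC, b.handle LIMIT ?";

    // In-memory equivalent; each event is a {requester, handle} pair.
    static List<String> recommend(List<String[]> events, String current, int limit) {
        // Requesters who used the current item.
        Set<String> audience = new HashSet<>();
        for (String[] e : events) {
            if (e[1].equals(current)) audience.add(e[0]);
        }
        // For every other item, the distinct requesters it shares with the audience.
        Map<String, Set<String>> shared = new HashMap<>();
        for (String[] e : events) {
            if (audience.contains(e[0]) && !e[1].equals(current)) {
                shared.computeIfAbsent(e[1], k -> new HashSet<>()).add(e[0]);
            }
        }
        // Most shared requesters first, ties broken by handle.
        List<String> handles = new ArrayList<>(shared.keySet());
        handles.sort((x, y) -> {
            int byCount = Integer.compare(shared.get(y).size(), shared.get(x).size());
            return byCount != 0 ? byCount : x.compareTo(y);
        });
        int n = Math.max(0, Math.min(limit, handles.size()));
        return new ArrayList<>(handles.subList(0, n));
    }
}
```

Metadata such as the title could come back from the same query with one more join, which is exactly the sort of thing that is easy in SQL.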

My thoughts turned back to basing a service on a SQL database.


Hunting and Gathering data

The PIRUS2 project has conveniently produced a patch for DSpace (and EPrints) for capturing activity data and either making it available via OAI-PMH or pushing it to a tracker service. I'm grateful to Paul Needham from Cranfield who gave me an insight into the architecture they were using.

I patched the DSpace code and was soon making usage data available for harvesting. However, I wanted to avoid the hassle of harvesting via OAI-PMH so looked more closely at the tracker code. This is a neat solution and uses Spring injection to create a listener on the DSpace Event service to capture downloads. With a little hacking to also capture item views I created an AEIOU activity class. The beauty of this is that all that is required to update the DSpace code is a change to the Spring context configuration (an XML file) and the addition of a Java jar file.


Who What Why and When?

How best to represent the activity data we're gathering and passing around? Several projects (PIRUS2, OA-Statistics, SURE, NEEO) have already considered this and based their exchange of data (as XML) on the OpenURL Context Object - the standard was recommended in the JISC Usage Statistics Final Report. Knowledge Exchange have produced international guidelines for the aggregation and exchange of usage statistics (from a repository to a central server using OAI-PMH) in an attempt to harmonise any subtle differences.

Obviously then, OpenURL Context Objects are the way to go but how far can I bend the standard without breaking it? Should I encrypt the Requester IP address and do I really need to provide the C-class subnet address and country code? If we have the IP addresses we can determine subnet and country code. Fortunately, the recommendations from Knowledge Exchange recognise this and don't require them.
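
As a rough illustration of the two options, here is a sketch that derives the C-class subnet from an IPv4 address and hides the address itself. Note the swap: it uses a salted SHA-256 hash rather than true encryption, so it is one-way, and the class name, method names and salt handling are my own assumptions rather than anything the guidelines prescribe.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class RequesterIp {

    // C-class (/24) subnet of a dotted IPv4 address: keep the first three octets.
    static String cClassSubnet(String ipv4) {
        String[] octets = ipv4.trim().split("\\.");
        if (octets.length != 4) {
            throw new IllegalArgumentException("Not a dotted IPv4 address: " + ipv4);
        }
        for (String o : octets) {
            // parseInt throws NumberFormatException (an IllegalArgumentException) on junk.
            int v = Integer.parseInt(o);
            if (v < 0 || v > 255) {
                throw new IllegalArgumentException("Octet out of range: " + o);
            }
        }
        return octets[0] + "." + octets[1] + "." + octets[2] + ".0";
    }

    // One-way pseudonym for the requester: hex-encoded SHA-256 of salt + IP.
    // This is hashing, not encryption, so the IP cannot be decrypted later.
    static String pseudonymise(String ip, String salt) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest((salt + ip).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be present on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

One consequence of hashing is that the subnet and country code can no longer be worked out afterwards, so anything derived from the raw IP address would have to be captured before the address is hidden.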

So for the needs of this project, where we're concerned with a closed system within a national context, I think I can bend the standard a little and not lose any information. I can use an authenticated service. I also want to include some metadata - the resource title and author maybe.

So here's the activity data mapped to a Context Object:
  • Timestamp (Request time) Mandatory
  • Referent identifier (The URL of the object file or the metadata record that is requested) Mandatory
  • Referent other identifier (The URI of the object file or the metadata record that is requested) Mandatory if applicable
  • Referring Entity (Referrer URL) Mandatory if applicable
  • Requester Identifier (Request IP address - encrypted possibly!) Mandatory
  • Service type (objectFile or descriptiveMetadata) Mandatory
  • Resolver identifier (The baseURL of the repository) Mandatory
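
The mapping above could be carried around in code as a small value class that refuses to be built without its mandatory fields. This is only a sketch: the class and field names are my own, chosen to mirror the list, and it says nothing about how the Context Object is serialised to XML.

```java
import java.time.Instant;

// Hypothetical value class mirroring the list above; names are assumptions.
final class ActivityEvent {

    enum ServiceType { OBJECT_FILE, DESCRIPTIVE_METADATA }

    final Instant timestamp;        // Mandatory
    final String referentId;        // Mandatory
    final String referentOtherId;   // Mandatory if applicable, otherwise null
    final String referringEntity;   // Mandatory if applicable, otherwise null
    final String requesterId;       // Mandatory (possibly a hidden IP address)
    final ServiceType serviceType;  // Mandatory
    final String resolverId;        // Mandatory

    ActivityEvent(Instant timestamp, String referentId, String referentOtherId,
                  String referringEntity, String requesterId,
                  ServiceType serviceType, String resolverId) {
        this.timestamp = require(timestamp, "timestamp");
        this.referentId = require(referentId, "referentId");
        this.referentOtherId = referentOtherId;
        this.referringEntity = referringEntity;
        this.requesterId = require(requesterId, "requesterId");
        this.serviceType = require(serviceType, "serviceType");
        this.resolverId = require(resolverId, "resolverId");
    }

    // Rejects null values and blank strings for mandatory fields.
    private static <T> T require(T value, String name) {
        boolean blank = value instanceof String && ((String) value).trim().isEmpty();
        if (value == null || blank) {
            throw new IllegalArgumentException("Missing mandatory field: " + name);
        }
        return value;
    }
}
```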


Friday, 15 April 2011


We hypothesise that "The provision of a shared recommendation service will increase the visibility and usage of Welsh research outputs".

This will be demonstrated through quantitative and qualitative assessments:

  1. By a [significant] increase in attention and usage data for items held within the six core institutional repositories
  2. By establishing a user focus group to explore the potential of the recommendation service and its impact on repository users