Thursday, 28 July 2011

AEIOU Report

AEIOU Outputs

Software - released under an Apache Software Foundation license
  • XML-RPC client/server
    • Java source code - svn://
    • XML-RPC server uses a MySQL data source (database schema included)
    • Includes implementation for Apache Mahout Recommender

  • DSpace Activity Exporter
    • Java source code - svn://

  • DSpace Activity Exporter (Binary)
    • This includes a binary distribution with the required Java libraries (jar files) and a how-to for deploying and configuring it with DSpace

Recipes - released under a Creative Commons license

Next Steps

We would like to encourage further community use of a generic Recommendation service to leverage information in large activity data sets. Along similar lines, we would also like to develop a generic Reporting service based on activity data.
  • Refine Recommendation service
    • Further develop and explore Apache Mahout implementation
    • Provide flexible configuration for easy deployment to exploit data models with simple User / Item preferences
    • Requires offline processing for building User / Item similarities with large data sets
    • Mahout includes an API for assessing the relevance of recommendations
    • Develop a simple RESTful service for (collaborating) institutions
    • Open-source release for community with full documentation
  • Develop Reporting Service based on activity data aggregations

How can other institutions benefit?

If you'd like to start tracking user activity in your DSpace repository and exploit the activity data to provide recommended items, follow the steps below.
  • Download and build the AEIOU XML-RPC server for tracking user activity (requires a MySQL database) - AEIOU Recipe 1
  • Download the AEIOU Activity exporter software and configure for use with your DSpace repository - AEIOU Recipe 2
  • Add a link to a Data Privacy Policy from your DSpace repository to inform users of how their data is being collected and used

Most significant lessons

  • Think Big! Your activity data will grow very quickly and things (e.g. SQL Queries) that took a few milliseconds will take tens of seconds. Use open source technologies that are tuned for Big Data (e.g. Apache Solr and Mahout) or process data offline and optimise your database and code - see Deploying a massively scalable recommender system with Apache Mahout
  • Clean up! Your data may contain a lot of noise, which makes processing less efficient and results less relevant. For example, filter out robots and web crawlers (used by search engines for indexing web sites) and exclude double clicks. Try using queries to identify unusually high frequencies of events generated by servers and flag these.
  • Use a simple data model with core elements that can be aggregated with (or mapped to) other activity data sets - see Who What Why and When
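As a rough illustration of the "Clean up!" lesson, here is a minimal Python sketch of the two filters mentioned above. It is not the AEIOU code: the event layout (IP address, item, timestamp), the 2-second double-click window and the threshold of 100 events per hour are all assumptions picked for the example.

```python
from collections import Counter
from datetime import datetime, timedelta


def drop_double_clicks(events, window=timedelta(seconds=2)):
    """Keep one event when the same IP requests the same item again within `window`.

    `events` is a list of (ip, item, timestamp) tuples (a hypothetical layout).
    """
    last_seen = {}
    kept = []
    for ip, item, ts in sorted(events, key=lambda e: e[2]):
        previous = last_seen.get((ip, item))
        if previous is None or ts - previous > window:
            kept.append((ip, item, ts))
        # Remember the latest request so a burst of rapid clicks collapses to one.
        last_seen[(ip, item)] = ts
    return kept


def flag_high_frequency_ips(events, max_events_per_hour=100):
    """Return the IPs that exceed the threshold in any single clock hour."""
    per_hour = Counter(
        (ip, ts.replace(minute=0, second=0, microsecond=0)) for ip, _, ts in events
    )
    return {ip for (ip, _), count in per_hour.items() if count > max_events_per_hour}
```

Flagged addresses could then be reviewed by hand or excluded before any recommendations are calculated; a sensible threshold will depend on the repository's normal traffic.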

Processing Activity Data - Recommending items

Initially we used SQL queries to identify items that had been viewed or downloaded by users within specific 'windows' or session times (10, 15 and 30 minutes). Refinements were made to rank the list of recommended items by ascending time and number of views and downloads. We have a requirement that the service should respond within 750 milliseconds; otherwise the client connection (from the repository) will time out and no recommended items are displayed. The connection timeout is configured at the repository and is intended to avoid delays when viewing items.
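To give a flavour of the session-window approach, here is a small sketch in Python. It uses an in-memory SQLite database as a stand-in for MySQL, and the `events` table, its columns and the ranking (number of distinct IP addresses, then most recent view) are assumptions made for the example rather than the actual AEIOU schema or query.

```python
import sqlite3

WINDOW_SECONDS = 30 * 60  # 30-minute session window

# Self-join: find other items requested by the same IP within the window
# of a request for the seed item. Table and column names are hypothetical.
CO_VIEW_QUERY = """
SELECT other.item            AS item,
       COUNT(DISTINCT other.ip) AS viewers,
       MAX(other.ts)         AS last_seen
FROM events AS seed
JOIN events AS other
  ON other.ip = seed.ip
 AND other.item <> seed.item
 AND ABS(other.ts - seed.ts) <= :window
WHERE seed.item = :item
GROUP BY other.item
ORDER BY viewers DESC, last_seen DESC
LIMIT :limit
"""


def recommend_coviewed(conn, item, limit=5, window=WINDOW_SECONDS):
    """Return up to `limit` items viewed in the same session window as `item`."""
    params = {"item": item, "window": window, "limit": limit}
    return [row[0] for row in conn.execute(CO_VIEW_QUERY, params)]


def demo_connection():
    """Build a tiny in-memory data set (timestamps are seconds since an epoch)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (ip TEXT, item TEXT, ts INTEGER)")
    conn.execute("CREATE INDEX idx_events_item ON events (item)")
    conn.execute("CREATE INDEX idx_events_ip_ts ON events (ip, ts)")
    conn.executemany(
        "INSERT INTO events VALUES (?, ?, ?)",
        [
            ("A", "item1", 0), ("A", "item2", 600), ("A", "item3", 4000),
            ("B", "item1", 100), ("B", "item2", 200), ("B", "item4", 300),
            ("C", "item1", 0), ("C", "item4", 5000),
        ],
    )
    return conn
```

With the demo data, `recommend_coviewed(demo_connection(), "item1")` returns `item2` ahead of `item4`; `item3` is left out because it was viewed more than 30 minutes after `item1`. The self-join is also where the cost grows as the number of events per IP address increases.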

Unsurprisingly, queries took longer to run as the data set grew (over 150,000 events) and query time was noticeably influenced by the number of events per user (IP address). Filtering out IP addresses from robots, optimising the database and increasing the timeout to 2 seconds temporarily overcame this problem.

However, it was clear that this would not be scalable and that other algorithms for generating recommended items may be required. A little research suggested that Apache Mahout Recommender / Collaborative filtering techniques were worth exploring. We are currently testing Recommenders based on item preferences determined by whether or not an item has been viewed (boolean preference) or the total number of views per item. Item recommenders use similarities which require pre-processing using a variety of algorithms (including correlations). An API also exists for testing the relevance of the recommended items and we will be using this over the next few weeks to assess and refine the service.
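The idea behind an item-based recommender with boolean preferences can be shown in a few lines of plain Python. This is only a sketch of the technique and does not use the Mahout API; it assumes the Tanimoto (Jaccard) coefficient as the similarity measure, which is one of several options, and the function names are invented for the example.

```python
from collections import defaultdict
from itertools import combinations


def build_item_similarities(views):
    """Pre-compute item-item similarities from boolean (viewed / not viewed) data.

    `views` is an iterable of (user, item) pairs. The similarity of two items
    is the Tanimoto coefficient: users who viewed both / users who viewed either.
    """
    users_by_item = defaultdict(set)
    for user, item in views:
        users_by_item[item].add(user)

    similarities = defaultdict(dict)
    for a, b in combinations(sorted(users_by_item), 2):
        shared = len(users_by_item[a] & users_by_item[b])
        if shared:
            either = len(users_by_item[a] | users_by_item[b])
            similarities[a][b] = similarities[b][a] = shared / either
    return similarities


def recommend_similar(similarities, item, limit=5):
    """Return the items most similar to `item`, best first (ties broken by name)."""
    ranked = sorted(similarities.get(item, {}).items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _ in ranked[:limit]]
```

The expensive part (comparing every pair of items) happens once, offline, in `build_item_similarities`; answering a request is then just a lookup, which is what makes this style of recommender attractive when the service has to respond quickly.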

Privacy and Anonymisation

See the Licensing & Data Privacy Policy statement.


Licensing & reuse of Software & Data

The AEIOU project is aggregating activity data generated by users (both registered and anonymous) who download or view an item held in an institutional repository. The data used to describe this activity is represented by an OpenURL Context Object (see previous post) which is stored and processed to provide the shared Recommendation Service and includes the Request IP address.

Data Protection & Privacy Issues
The IP Address identifies the computer from which the request originated and is used to provide the notion of a user session. Although this may not directly identify a user (e.g. the computer may be shared publicly), under the Data Protection Act (DPA), IP addresses may constitute personal data if an individual user can be identified by using a combination of that IP address and other information. This applies even when personal data are anonymised after collection.

New European legislation came into force on 26 May 2011 and the Information Commissioner's Office (ICO) Code of Practice has been revised. The Code now clearly states that in many cases IP addresses will be personal data, and that the DPA will therefore apply. These changes also apply to the use of cookies and methods for collecting and processing information about how a user might access and use a website. An exception exists for the use of cookies that are deemed "strictly necessary" for a service "explicitly" requested by a user. In general, the regulations advise that an assessment should be made of the impact on privacy and of whether the data collection is strictly necessary, and that the need to obtain meaningful consent should reflect this.

We also need to consider that the AEIOU project is aggregating and processing data (that includes IP Addresses) originating from other institutional repositories with no direct end-user relationship. The Using OpenURL Activity Data project has addressed this by notifying institutions that sign up for their OpenURL resolver service. We have no explicit agreement with the partners involved in the current project but aim to review their existing privacy policies should the service be continued. For example, do policies for storing and processing user data cover repository reporting software and Google Analytics, and should users be made aware of this through the repository website?

The current cookie policy for Aberystwyth University can be found here

In order to comply with recent changes to the ICO Code of Practice we have been advised that, as a minimum requirement, we should include text in the header or footer of repository web pages and a link to a Data Privacy Policy that clearly informs users about how their data is being used and whether it is passed to third parties (e.g. Google). Where possible, they should also be given the option to opt out of supplying personal information (IP address) to the Recommendation service. This would not affect them receiving recommendations but their information would not be stored or processed as part of the service.

Anonymisation & Re-use of data
We will make data available to individual partners and hope to provide a reporting service (based on the activity data) so that institutions can view usage statistics in a National context. We also hope to publicly release the data with regard to personal data encryption and licensing outlined below. Ideally, we would like to release OpenURL Context Object data as XML but in the short term this will be made available in CSV format.

The JISC Usage Statistics Review looked at European legal constraints for recording and aggregating log files and noted that the processing of IP-addresses is strongly regulated in certain countries (e.g. Germany) and that current interpretation may be ambiguous. In such cases, they advise that "To avoid legal problems it would be best to pseudonymize IP-addresses shortly after the usage event or not to use IP-addresses at all but to promote the implementation of some sort of session identification, which does not record IP-addresses".

Currently, we are hashing the IP addresses using the MD5 algorithm recommended in the Knowledge Exchange Usage Statistics Guidelines so that personal data is anonymised. Although MD5 is a relatively fast and efficient algorithm, it has been shown to have security vulnerabilities and other more secure hash functions (e.g. the SHA-2 family) are recommended. If this becomes an issue we could release data with a stronger hash or replace the IP address with a system identifier as suggested above. Removing the IP address would, however, compromise the ability to aggregate data.
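For illustration, here is a minimal Python sketch of a stronger alternative. It is an assumption on our part rather than what the project or the Guidelines specify: it uses a keyed hash (HMAC with SHA-256) because an unkeyed hash of an IPv4 address can be reversed simply by hashing every one of the roughly 4.3 billion possible addresses. The function name and the idea of a shared secret key are hypothetical.

```python
import hashlib
import hmac


def pseudonymise_ip(ip, secret_key):
    """Return a hex HMAC-SHA-256 digest of the IP address.

    `secret_key` is a bytes value kept private by whoever processes the data
    (hypothetical; not part of the published guidelines).
    """
    return hmac.new(secret_key, ip.encode("ascii"), hashlib.sha256).hexdigest()
```

The same address always maps to the same digest for a given key, so events can still be grouped into sessions and aggregated, provided every contributing repository uses the same key.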

The Knowledge Exchange Usage Statistics Guidelines also point out that when the IP address is obfuscated, information about the geographic location is lost. They therefore recommend using the C-Class subnet part of the IP address, which will give a regional (network) location but cannot identify a personal computer. This would be appropriate where activity data is used for reporting statistics.
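Keeping only the C-Class part of an address amounts to zeroing the last octet. A short sketch using Python's standard `ipaddress` module is shown below; the function name is made up for the example and it assumes IPv4 addresses only.

```python
import ipaddress


def c_class_subnet(ip):
    """Return the /24 network address for an IPv4 address (last octet zeroed)."""
    # strict=False lets us pass a host address rather than a network address.
    network = ipaddress.ip_network(f"{ip}/24", strict=False)
    return str(network.network_address)
```

For example, `c_class_subnet("192.0.2.57")` gives `192.0.2.0`, which is shared by up to 256 addresses on the same network.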

Document outputs, software and any data that is released will be licensed according to the IPR section in the project plan.


Thursday, 30 June 2011

Anticipated benefits

The AEIOU project’s aim is to increase the visibility and usage of academic research taking place in Wales by aggregating Welsh institutional repository activity data to provide a “Frequently viewed together” recommendation service, such as those used by Amazon and many other e-commerce websites, and to encourage greater use of repositories across Wales.

Anticipated benefits of the project are:

Reporting (using the core activity data)

The activity data generated through use of the recommendation service will provide a central, reliable data source for institutional reporting within a national context. We have found that Administrators are not always aware of the statistics available through repository software and that often the log files required for this are missing or not configured correctly. Furthermore, some repositories are not enabled with Google Analytics. Reports can be generated to identify ‘hits’, items most viewed, items downloaded, etc., and can assist Senior Management when formulating research policy.

Promotion (via the recommendation service)

  • Raising the profile of Welsh research and enhancing the reputation of the individual institutions: The repository acts as a showcase for the Institutions’ research projects and outputs
  • Increasing collaboration between institutions and individual researchers by directing those who are interested in a specific area of research they have identified in a repository to other similar research articles in the same repository and in other repositories across Wales. This upholds the HEFCW and WAG strategies outlined below:

The Higher Education Funding Council for Wales (HEFCW) Corporate Strategy states its intention to promote a culture of collaboration within Wales and to work towards the expectation outlined in the Welsh Assembly Government document ‘For Our Future’ to “increase the impact of university research, through targeting support on areas of strength and national priority, and promoting collaboration.” Strategic collaboration in Wales is evidenced through existing and emerging research partnerships, such as the Aber-Bangor partnership and the St David’s Day declaration.
  • Attracting postgraduates: Institutions are increasingly trying to attract postgraduates and the recommendation service should encourage potential students to look for related research topics and be directed towards those Welsh HEIs where the research is being carried out.
  • Helping to support and sustain the existing repository network and infrastructure by making research more visible, attracting more visitors, and increasing repository usage.

Success criteria
This will be measured by an increase in activity data for items held in the institutional repositories and through a user focus group to explore the impact on repository users.

Future Sustainability
The project is directly aligned with the HEFCW corporate strategy, in particular their aims to secure “sustainable excellent research in higher education by building up quality and quantity to strengthen the research base in Wales.” By increasing both the visibility and critical mass of the research output from Wales the long-term impact of the project is naturally embedded into national policy and institutional strategies for research.

Potential Funding sources
The following could be approached for financial support:

  • HEFCW - see strategy document
  • Institutions’ Research Offices – the repository and the recommendation service can serve to promote research within and across institutions in Wales.
  • Research Councils – offering a solution to multi-disciplinary and cross institutional research output retrieval


Thursday, 9 June 2011

AEIOU First Focus Group meeting

Yesterday we held the first AEIOU Focus Group meeting to demonstrate the Recommendation service, and it was trialled by the Group for the first time. Four of the six core repositories had the service deployed - and the remaining two will be available shortly once some upgrade issues are resolved. The Focus Group searched on a provided list of words and it was fascinating to see the recommendation service start to work - with user activity triggering recommendations across repositories as users moved from one search item to another. There was quite a bit of 'noise' to begin with but gradually the viewed item lists started to become meaningful. An online questionnaire was provided and participants were asked to complete this before leaving the meeting.
Feedback on the testing exercise and the recommendation service was very positive and group discussions led to several useful suggestions on how the service could be improved to provide added value. See below for key suggestions:

  • Identify repository an item is from - should include this in results
  • Views vs downloads - consider weighting, i.e. by number of times viewed
  • Number of recommendations. Differences of opinion on this – 5 or 6 seem about right.
  • Location of recommendations on item page - above or below? Maybe useful to have a link at the top of the record to take you to the recommendations at the bottom of the record.
  • Session time - 30 mins? Consensus that this is about right.
  • Added value: Ability to produce reports from aggregated data – per institution/per item/per selection of items – such as for SCONUL and SUSHI reporting.
  • Google search (restricted to the 12 repositories) based on the item metadata could be added to suggest ‘similar items’ alongside recommended items.


Monday, 9 May 2011

AEIOU - The Business Case

The AEIOU project had an interesting visit from Tom Frankin on 21st April who helped us to develop two technical 'recipes' for the Activity Strand cookbook. These still need to be refined but it was an informative exercise, particularly for me as project manager, as it helped me to understand the processes involved in software development.

Tom also took a look at the business case I am putting together and gave me some extremely useful advice about not trying to oversell the benefits as this weakens the message. I do need all the advice I can get given the current financial climate as it is not an easy task to convince a management team, trying to find ways to save money, of the benefits of a product. I suspect this will apply to all the projects.


Thursday, 21 April 2011

Recommendation service

Do I really want to write a SOAP service (SUSHI uses SOAP but there don't seem to be mature open-source Java clients/servers available for hacking!)? How about a REST service? This could be a neat solution using something like Apache CXF.

Either of these would be great but as I only have a few simple data requests to execute, I've decided to go for a quick and easy solution - Apache XML-RPC deployed in a servlet.

Behind this I'm using a MySQL database with Apache DBCP handling connections and queries. I was going to use the lightweight MyBatis data mapper framework (formerly known as iBatis) but again, as I'm only using a couple of queries, the flexibility it provides isn't really worth the overhead.
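To show how little is needed for this kind of request/response exchange, here is a sketch using Python's standard-library XML-RPC modules rather than Apache XML-RPC in a servlet. Everything specific in it is made up for the example: the method name `aeiou.getRecommendations`, the item handles and the in-memory dictionary standing in for the MySQL queries.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer

# Stand-in for the database lookup (hypothetical handles).
RECOMMENDATIONS = {"1234/1": ["1234/2", "1234/3"]}


def get_recommendations(handle, limit):
    """Return up to `limit` recommended item handles for the given handle."""
    return RECOMMENDATIONS.get(handle, [])[:limit]


def start_server():
    """Start an XML-RPC server on a free local port in a background thread."""
    server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
    server.register_function(get_recommendations, "aeiou.getRecommendations")
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch(server, handle, limit=5):
    """Client side: call the remote method as a repository plug-in might."""
    host, port = server.server_address
    with ServerProxy(f"http://{host}:{port}/") as proxy:
        return proxy.aeiou.getRecommendations(handle, limit)
```

A client only needs the server URL and the method name, which is the appeal of XML-RPC when there are just a few simple data requests to support.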

The test set-up is working so now all I've got to do is tidy it up, deploy the server as a service and deploy clients within the six DSpace institutional repositories. How long have I got?


Consuming and Querying data

So what service should I use? My first thoughts were to push data to a SQL database, then I thought of Solr. Solr is fast and efficient, it's great for powerful full-text and faceted search, hit highlighting and rich document (e.g., Word, PDF) handling. So I pushed OpenURL Context Objects from DSpace to Solr and used simple queries to view the captured activity data.

Then I thought again. What data do I want returned from a recommendation service? I just want a few item handles and some metadata as suggestions to view alongside the current resource. I found I could do this using an SQL query on a test database but wasn't sure if I could construct queries with inner joins using Solr. I'm not that familiar with Solr and couldn't find what I wanted. A patch was available for the latest release that could do this but then again, maybe this isn't one of Solr's strengths... or maybe I don't have the right data structure.

My thoughts turned back to basing a service on a SQL database.