<AEIOU> Activity data to Enhance & Increase Open-access Usage: AEIOU Report

AEIOU Outputs

Software - released under the Apache License

XML-RPC client/server
- Java source code - svn://aeiou.aber.ac.uk/svn/aeiou-xmlrpc
- XML-RPC server uses a MySQL data source (database schema included)
- Includes implementation for Apache Mahout Recommender
DSpace Activity Exporter
- Java source code - svn://aeiou.aber.ac.uk/svn/aeiou-activity-exporter
DSpace Activity Exporter (Binary)
- A binary distribution with the required Java libraries (jar files) and a how-to for deploying and configuring it with DSpace

Recipes - released under a Creative Commons license

Building & Deploying the XML-RPC server (AEIOU Recipe 1)
Exporting Activity Data from DSpace & getting Recommendations (AEIOU Recipe 2)

Guide

Implementing a recommendation service using Apache Mahout Recommender (to be completed)

Report

Evaluation report - released under a Creative Commons license

Next Steps

We would like to encourage further community use of a generic Recommendation service to leverage information in large activity data sets. Along similar lines, we would also like to develop a generic Reporting service based on activity data.

Refine Recommendation service
- Further develop and explore Apache Mahout implementation
- Provide flexible configuration for easy deployment to exploit data models with simple User / Item preferences
- Requires offline processing for building User / Item similarities with large data sets
- Mahout includes an API for assessing the relevance of recommendations
- Develop a simple RESTful service for (collaborating) institutions
- Open-source release for community with full documentation

Develop Reporting Service based on activity data aggregations
- As a simple RESTful service for (collaborating) institutions
- Explore JSON API for reports
- Implement SUSHI / COUNTER reports for open-source Java client/server

How can other institutions benefit?

If you'd like to start tracking user activity in your DSpace repository and exploit the activity data to provide recommended items, follow the steps below.

Download and build the AEIOU XML-RPC server for tracking user activity (requires a MySQL database) - AEIOU Recipe 1
Download the AEIOU Activity Exporter software and configure it for use with your DSpace repository - AEIOU Recipe 2
Add a link to a Data Privacy Policy from your DSpace repository to inform users of how their data is being collected and used

Most significant lessons

Think Big! Your activity data will grow very quickly, and things (e.g. SQL queries) that took a few milliseconds will take tens of seconds. Use open source technologies that are tuned for Big Data (e.g. Apache Solr and Mahout), or process data offline and optimise your database and code - see Deploying a massively scalable recommender system with Apache Mahout
Clean up! Your data may contain a lot of noise, which makes processing less efficient and results less relevant. For example, filter out the robots and web crawlers that search engines use for indexing web sites, and exclude double clicks. Try using queries to identify unusually high frequencies of events generated by servers and flag these.
Use a simple data model with core elements that can be aggregated with (or mapped to) other activity data sets - see Who What Why and When
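The "Clean up!" lesson above can be sketched in a few lines of Java. This is a minimal illustration, not the AEIOU code: the `ActivityCleaner` class, the `Event` fields, the user-agent markers and the thresholds are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical cleaning pass over raw activity events; not part of the AEIOU code base. */
final class ActivityCleaner {

    /** One tracked event: who (IP address and user agent), what (item id), when (epoch millis). */
    record Event(String ip, String userAgent, String itemId, long timestampMillis) {}

    // Assumed substrings that mark a crawler user agent; a real list would be much longer.
    private static final List<String> ROBOT_MARKERS = List.of("bot", "crawler", "spider");

    static boolean isRobot(Event e) {
        String agent = e.userAgent() == null ? "" : e.userAgent().toLowerCase();
        return ROBOT_MARKERS.stream().anyMatch(agent::contains);
    }

    /**
     * Drops robot events and double clicks. An event counts as a double click when the
     * same IP hit the same item less than doubleClickMillis after its previous hit on
     * that item (whether or not that previous hit was kept).
     * The input list must be sorted by timestamp.
     */
    static List<Event> clean(List<Event> events, long doubleClickMillis) {
        Map<String, Long> lastSeen = new HashMap<>();
        List<Event> kept = new ArrayList<>();
        for (Event e : events) {
            if (isRobot(e)) continue;
            String key = e.ip() + "|" + e.itemId();
            Long previous = lastSeen.put(key, e.timestampMillis());
            if (previous == null || e.timestampMillis() - previous >= doubleClickMillis) {
                kept.add(e);
            }
        }
        return kept;
    }

    /** Flags IP addresses with more than maxEvents events so they can be reviewed. */
    static Set<String> flagHighFrequency(List<Event> events, int maxEvents) {
        Map<String, Integer> counts = new HashMap<>();
        for (Event e : events) counts.merge(e.ip(), 1, Integer::sum);
        Set<String> flagged = new HashSet<>();
        counts.forEach((ip, n) -> { if (n > maxEvents) flagged.add(ip); });
        return flagged;
    }
}
```

Flagging rather than deleting high-frequency addresses keeps the decision reversible: a busy proxy or campus gateway can look like a robot by event count alone.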

Processing Activity Data - Recommending items

Initially we used SQL queries to identify items that had been viewed or downloaded by users within specific 'windows' or session times (10, 15 and 30 minutes). Refinements were made to rank the list of recommended items by ascending time and number of views and downloads. We require the service to respond within 750 milliseconds; otherwise the client connection (from the repository) times out and no recommended items are displayed. The connection timeout is configured at the repository and is intended to avoid delays when viewing items.
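The session-window idea can be illustrated in plain Java rather than SQL. This is a simplified sketch under stated assumptions: the `WindowRecommender` class and its `Event` record are hypothetical, users are identified by IP address only, and items are ranked by co-view count alone (the time-based ranking mentioned above is left out).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory version of the windowed co-view query; not the AEIOU SQL. */
final class WindowRecommender {

    record Event(String ip, String itemId, long timestampMillis) {}

    /**
     * Recommends items that the same IP address viewed within windowMillis of viewing
     * targetItem, ranked by the number of such co-views (most first).
     */
    static List<String> recommend(List<Event> events, String targetItem,
                                  long windowMillis, int limit) {
        // Group events by IP so co-views are only counted within one user's activity.
        Map<String, List<Event>> byIp = new HashMap<>();
        for (Event e : events) {
            byIp.computeIfAbsent(e.ip(), k -> new ArrayList<>()).add(e);
        }

        Map<String, Integer> coViews = new HashMap<>();
        for (List<Event> userEvents : byIp.values()) {
            for (Event anchor : userEvents) {
                if (!anchor.itemId().equals(targetItem)) continue;
                for (Event other : userEvents) {
                    long gap = Math.abs(other.timestampMillis() - anchor.timestampMillis());
                    if (gap <= windowMillis && !other.itemId().equals(targetItem)) {
                        coViews.merge(other.itemId(), 1, Integer::sum);
                    }
                }
            }
        }

        List<String> ranked = new ArrayList<>(coViews.keySet());
        // Highest co-view count first; ties broken by item id so the order is deterministic.
        ranked.sort((a, b) -> {
            int byCount = Integer.compare(coViews.get(b), coViews.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return ranked.subList(0, Math.min(limit, ranked.size()));
    }
}
```

For example, `WindowRecommender.recommend(events, "A", 10 * 60 * 1000L, 5)` would use the 10-minute window. The nested loop over one user's events shows why cost grows with the number of events per IP address.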

Unsurprisingly, queries took longer to run as the data set grew (over 150,000 events) and query time was noticeably influenced by the number of events per user (IP address). Filtering out IP addresses from robots, optimising the database and increasing the timeout to 2 seconds temporarily overcame this problem.

However, it was clear that this would not be scalable and that other algorithms for generating recommended items may be required. A little research suggested that Apache Mahout Recommender / Collaborative filtering techniques were worth exploring. We are currently testing recommenders based on item preferences determined either by whether or not an item has been viewed (boolean preference) or by the total number of views per item. Item recommenders use similarities, which require pre-processing using a variety of algorithms (including correlations). An API also exists for testing the relevance of the recommended items, and we will be using this over the next few weeks to assess and refine the service.
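To make the idea of an item similarity built from boolean preferences concrete, here is a small plain-Java sketch. It does not use the Mahout API and is not the AEIOU implementation: it computes the Tanimoto (Jaccard) coefficient, one common similarity for viewed / not-viewed data, and the `BooleanItemSimilarity` class and its method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical item-to-item similarity over boolean (viewed / not viewed) preferences. */
final class BooleanItemSimilarity {

    /**
     * Tanimoto (Jaccard) coefficient: users who viewed both items divided by
     * users who viewed either item. Returns 0.0 when neither item has viewers.
     */
    static double tanimoto(Set<String> viewersOfA, Set<String> viewersOfB) {
        Set<String> both = new HashSet<>(viewersOfA);
        both.retainAll(viewersOfB);
        int either = viewersOfA.size() + viewersOfB.size() - both.size();
        return either == 0 ? 0.0 : (double) both.size() / either;
    }

    /**
     * Returns the item most similar to target, or null if no other item shares a viewer.
     * On ties, the first item encountered wins.
     */
    static String mostSimilar(Map<String, Set<String>> viewersByItem, String target) {
        Set<String> targetViewers = viewersByItem.getOrDefault(target, Set.of());
        String best = null;
        double bestScore = 0.0;
        for (Map.Entry<String, Set<String>> entry : viewersByItem.entrySet()) {
            if (entry.getKey().equals(target)) continue;
            double score = tanimoto(targetViewers, entry.getValue());
            if (score > bestScore) {
                best = entry.getKey();
                bestScore = score;
            }
        }
        return best;
    }
}
```

Because every pair of items has to be compared, similarities like this are typically computed offline and stored, so that the online service only has to look them up.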

Privacy and Anonymisation

See Licensing & Data Privacy Policy statement.

Labels: #JISCAD

Thursday, 28 July 2011
