Code I/O


uClassify: A Java SDK for uClassify’s on-demand text classification web service

Obtain the full source code for uClassify and the test module from: https://github.com/udy/UClassify

Text analysis and mining has long been a topic of interest in academia; with technological improvements, concepts that once existed only in theory can now be offered to the masses on demand and for free.  Web services have changed the programming model completely. Add to that the ability to churn through large amounts of text in a few seconds and classify it accurately, and you have a blessing for processing unstructured documents on the fly.

Having investigated various text classification technologies, I must agree that text analysis services have evolved enough to be fully utilized for creating business software.  I believe that it is the most powerful area in information technology; the applications of text analysis and classification are widely known (Ref:  http://en.wikipedia.org/wiki/Text_mining).  My personal interest is to add some Business Intelligence sense to such classifications, hence the need to find quantifiable measures to evaluate the discoveries and to map business use cases to such powerful technologies.  One simple idea for applying text analysis to social networking is to process streaming feeds and eliminate noise according to user preferences.

A few of the freely consumable services dazzled me with their ease of use, their constantly improving accuracy, and the way they eliminate the need to purchase and support a complex solution.

Web services have moved the text analysis platform from on-premise installations to the cloud.  OpenCalais and uClassify [http://lifencode.com/lifencode/technology/textelligence-of-ontology-taxonomy-and-text-mining/] in particular can assist in building metrics based on text analysis and classification without much effort, which in my opinion is a value-add for application providers because it allows slicing and dicing of unstructured information.

A few of the interesting measures that can be used for building metrics are as follows:

  • age: age group the content is relevant for
  • language: language the content is written in
  • mood: happy/sad
  • tone: Business/Personal article
  • topics: This is very broad

Solution based on observation of patterns while using the uClassify service (RESTful API invocation)

While experimenting with the classification services from uClassify (RESTful; you can also evaluate the XML API if needed), I observed that the APIs return name:value pairs, where the name is the classification term and the value is the numeric percent relevance of the discovery.  I’m sure you will agree that such numeric values are essential for analysis and for comparison over time.
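To make the name:value idea concrete, here is a minimal sketch of how such pairs could be filtered once they are held in a Map. It does not call uClassify at all; the class name RelevanceFilter and the method atLeast are made up for illustration, and the two sample values are the Mood percentages from the output shown later in this post.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the uClassify SDK: it only works on
// name:value pairs that have already been read into a Map.
public class RelevanceFilter {

    // Keeps the classes whose relevance (in percent) is at or above the threshold,
    // preserving the order in which they were received.
    public static Map<String, Double> atLeast(Map<String, Double> results, double thresholdPercent) {
        Map<String, Double> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Double> entry : results.entrySet()) {
            if (entry.getValue() >= thresholdPercent) {
                kept.put(entry.getKey(), entry.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Double> mood = new LinkedHashMap<>();
        mood.put("upset", 25.72);
        mood.put("happy", 74.28);

        // Only the dominant class survives a 50% threshold.
        System.out.println(atLeast(mood, 50.0)); // prints {happy=74.28}
    }
}
```

Because the values are plain numbers, the same map can be stored and compared against a later run of the same classifier.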

What is needed is a Java-based consumption library for uClassify that encapsulates most of the repeated tasks involved in using the APIs.  I’m presenting the entire source code under the Apache 2.0 license, so that it can be used and extended by the community.

NOTE: uClassify must be used as a complementary service alongside other text analysis frameworks to bridge gaps between technologies and business needs.


Here is a simple snippet of the service consumption:


/*
 **
	Copyright 2010 Udaya Kumar (Udy)

	Licensed under the Apache License, Version 2.0 (the "License");
	you may not use this file except in compliance with the License.
	You may obtain a copy of the License at

	http://www.apache.org/licenses/LICENSE-2.0

	Unless required by applicable law or agreed to in writing, software
	distributed under the License is distributed on an "AS IS" BASIS,
	WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
	See the License for the specific language governing permissions and
	limitations under the License.
 **
 */
package org.onesun.textmining.uclassify.test;

import java.util.Map;

import org.onesun.textmining.uclassify.ResultHandler;
import org.onesun.textmining.uclassify.ServiceType;
import org.onesun.textmining.uclassify.UClassifyService;

public class UClassifyServiceTest {
	public void doTest(){
		for(ServiceType service : ServiceType.values()){
			String text =
				"A new survey has been launched in the United Kingdom to unearth the true nature of cyber stalking in the country."
				+ "\n"
				+ "The Network for Surviving Stalking has issued an \"Electronic Communication Harassment Observation\" or ECHO questionnaire in collaboration with the scientists at the University of Bedfordshire."
				+ "\n"
				+ "The survey has been commissioned to classify those who have been stalked on web and how according to a number of criteria."
				+ "\n"
				+ "The questionnaire will ask respondents if they were harassed or threatened on a social networking site such as Facebook, Twitter and LinkedIn, email service or Instant Messaging."
				+ "\n"
				+ "\"At the moment there are very few widely agreed guidelines or rules about how to behave online - we hope Echo will define behaviours that are generally experienced as anti-social or likely to cause distress in online communication.\" said Dr. Emma Short, head of the project ECHO."
				+ "\n"
				+ "The survey has been launched after Crown Prosecution Service (CPS) of the UK revealed a set of new guidelines for law enforcers tough on stalkers on web."
				+ "\n"
				+ "Read more: http://www.itproportal.com/security/news/article/2010/9/25/study-reveal-nature-cyberstalking-uk/#ixzz10YckSmCr";

			// *******************************************************************
			// DO NOT FORGET TO SET YOUR OWN KEY HERE BEFORE RUNNING APP
			// You can get a key from: http://www.uclassify.com/Register.aspx
			// *******************************************************************
			UClassifyService.setUClassifyReadAccessKey(null);
			// *******************************************************************

			UClassifyService uClassifyService = new UClassifyService(text, service, new ResultHandler() {

				@Override
				public void process(ServiceType serviceType, Map<String, Double> results) {
					System.out.println(
							"---------------------------------------------------------------------\n"
							+
							serviceType.getUrl() + " <<<>>> " + serviceType.getClassifier() + "\n" +
							"---------------------------------------------------------------------\n"
						);

					for(String key : results.keySet()){
						Double result = results.get(key);

						// interested in match >= 25%
						if(result >= 25) System.out.format("%1$-50s %2$10.2f\n", key, result);
					}
				}
			});

			try{
				uClassifyService.process();
			}catch(Exception e){
				e.printStackTrace();
			}

		}
	}

	public static void main(String[] args) {
		UClassifyServiceTest miningTest = new UClassifyServiceTest();

		miningTest.doTest();
	}
}

The result produced is as follows [the snippet was taken from http://www.itproportal.com/security/news/article/2010/9/25/study-reveal-nature-cyberstalking-uk/]:

---------------------------------------------------------------------
http://uclassify.com/browse/uClassify <<<>>> Ageanalyzer
---------------------------------------------------------------------

51-65                                                   36.23
---------------------------------------------------------------------
http://uclassify.com/browse/uClassify <<<>>> Text%20Language
---------------------------------------------------------------------

English                                                100.00
---------------------------------------------------------------------
http://www.uclassify.com/browse/prfekt <<<>>> Mood
---------------------------------------------------------------------

upset                                                   25.72
happy                                                   74.28
---------------------------------------------------------------------
http://uclassify.com/browse/prfekt <<<>>> Tonality
---------------------------------------------------------------------

Corporate                                               99.63
---------------------------------------------------------------------
http://uclassify.com/browse/uClassify <<<>>> Topics
---------------------------------------------------------------------

Society                                                 88.75

The percentage matches found will help a lot in pushing the right content to the right audience, thereby improving the quality of the service and bringing the audience back to it often.

Obtain the full source code for UClassifyService and the test module from: https://github.com/udy/UClassify


OAuth and Open Data Protocol: Security and data modeling in an “open” way

Bringing together information from various places of interest [namely: News, Social Networks, Emails] has become a common requirement these days. However, it brings common challenges in collecting and unifying that information.

First, one must be able to connect to the various sources.  Second, one must bring the information to a common place by understanding the varying structures [RSS, ATOM, GRAPH, more…] of the existing information and representing it in a well-defined, unified format.

OAuth comes to the rescue for the first challenge. For the second challenge, representing information collected from various sources, one can take the approach of the Open Data Protocol.  To take a similar approach, I recommend using ATOM as the preferred data model to represent and consume information.

In this post I am sharing my experience with OAuth, its benefits, and some tools to play with.  If you’re interested, I recommend you play with it.

The OAuth protocol and initiative help resolve the first challenge (connecting to a multitude of sources).  As more services implement OAuth, it will become the de facto protocol for connectivity.  There are some exceptions, though, such as Google and Facebook, which add their own custom parameters; this can make consumers scramble when it comes to choosing a good OAuth library.

One good OAuth library I found is Scribe; it is so far the best one I’ve come across.  Such simple interfaces help one embrace OAuth instead of being scared away; otherwise, consuming such services would really become a nightmare.

What you must have to start using OAuth:

  • Obtain your consumer key and secret from the service provider.
  • Store these credentials for your application in a secure way; the secret must be safeguarded, as it is the password for your application.
  • Once you have the key/secret, you must get three URIs to play with:
  • The Request Token URL: The endpoint URL required to ask for a request token
  • The Authorize URL: The endpoint URL to redirect to when the request token has been obtained
  • The Access Token URL: The endpoint URL to finally get the access token that enables your application to consume the service.
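As a rough sketch of how the three endpoints fit together, the class below simply holds them and builds the redirect for the authorize step. The class name OAuthEndpoints and the provider.example URLs are hypothetical placeholders, not any real provider’s endpoints. The sketch leaves out request signing and the token exchange entirely (that is the part a library like Scribe takes care of), and it assumes Java 10 or later for the URLEncoder overload that takes a Charset.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical holder for the three OAuth 1.0 endpoints of one provider.
public class OAuthEndpoints {
    private final String requestTokenUrl;
    private final String authorizeUrl;
    private final String accessTokenUrl;

    public OAuthEndpoints(String requestTokenUrl, String authorizeUrl, String accessTokenUrl) {
        this.requestTokenUrl = requestTokenUrl;
        this.authorizeUrl = authorizeUrl;
        this.accessTokenUrl = accessTokenUrl;
    }

    public String getRequestTokenUrl() { return requestTokenUrl; }

    public String getAccessTokenUrl() { return accessTokenUrl; }

    // Builds the URL the user is sent to once a request token has been obtained.
    // OAuth 1.0 passes the request token in the oauth_token query parameter.
    public String authorizationRedirect(String requestToken) {
        String separator = authorizeUrl.contains("?") ? "&" : "?";
        return authorizeUrl + separator + "oauth_token="
                + URLEncoder.encode(requestToken, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Placeholder URLs; a real provider documents its own.
        OAuthEndpoints endpoints = new OAuthEndpoints(
                "https://provider.example/oauth/request_token",
                "https://provider.example/oauth/authorize",
                "https://provider.example/oauth/access_token");

        // prints https://provider.example/oauth/authorize?oauth_token=abc123
        System.out.println(endpoints.authorizationRedirect("abc123"));
    }
}
```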

When you have all of the above information, you can use the API to connect and have some fun.  The simplest way is to use the OAuth playground and try out the entire process.  This will give you an idea of how OAuth functions.

For representing information in ATOM format, Apache’s Abdera project is a preferable choice.  To enhance it further into Activity Streams, the abdera-activitystreams project is worth considering.
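To give a feel for why ATOM works well as a common model, here is a small sketch that pulls entry titles out of an Atom document. It deliberately uses only the XML parser that ships with the JDK rather than Abdera, so it shows the shape of the data and not Abdera’s API. The class name AtomTitles and the trimmed-down sample feed are made up for illustration (a real feed also carries id, updated, and other elements).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical helper: reads the title of every entry in an Atom document.
public class AtomTitles {
    private static final String ATOM_NS = "http://www.w3.org/2005/Atom";

    public static List<String> entryTitles(String atomXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // Atom elements live in a namespace
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(atomXml.getBytes(StandardCharsets.UTF_8)));

            List<String> titles = new ArrayList<>();
            NodeList entries = doc.getElementsByTagNameNS(ATOM_NS, "entry");
            for (int i = 0; i < entries.getLength(); i++) {
                Element entry = (Element) entries.item(i);
                NodeList titleNodes = entry.getElementsByTagNameNS(ATOM_NS, "title");
                if (titleNodes.getLength() > 0) {
                    titles.add(titleNodes.item(0).getTextContent().trim());
                }
            }
            return titles;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Input could not be parsed as XML", e);
        }
    }

    public static void main(String[] args) {
        // Whatever the original source was, each item ends up as one entry.
        String feed =
                "<feed xmlns=\"http://www.w3.org/2005/Atom\">"
                + "<title>Unified stream</title>"
                + "<entry><title>Tweet from a friend</title></entry>"
                + "<entry><title>New mail from the team</title></entry>"
                + "</feed>";

        System.out.println(entryTitles(feed)); // prints [Tweet from a friend, New mail from the team]
    }
}
```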

What can one do with two such wonderful tools? The best thing is to unify information. Many are already doing this today, but those are PHP and Ruby on Rails initiatives; having such a framework for enterprise applications written in Java would be a nice thing, as it would speed up projects that consume such information.

I personally call this initiative ‘Atomator’, the process of atomifying information. Its purpose is to provide a Java framework that anyone wanting to consume information from various sources can use, and to which support for sources not yet covered can be added very conveniently.

Currently, I’ve added support for the following:

  • SAP Streamwork
  • Google Mail
  • LinkedIn
  • Twitter
  • TripIt
  • RSS feeds
  • ATOM feeds

Adding a few more commonly used sources will make this component a generic framework.  I also believe that Atomator can become a very interesting project for the open source community.  Let me know what you think of it.  Feel free to talk to me if you’re interested in knowing more about it.

A framework like that can enable you to build excellent tools for various business scenarios on cloud, desktop, and mobile platforms.

Tech fact: Did you know you could consume e-mail as an Atom feed? At least Google does that via https://mail.google.com/mail/feed/atom



Textelligence: Of ontology, taxonomy and text mining

Knowledge is power, and with awesome powers one can do wonders … unstructured data analysis and mining is becoming a key force for writing software that enables one to do things that were a dream a few years back.  I can say that the dream is becoming a reality.  For example, consider the area of “augmented reality”; this vision has been seen on screen from George Méliès’s The Hilarious Posters [1907] through Fight Club [1999], Minority Report [2002], Terminator 3 [2003], MUTO by Blu [2007], and up to Iron Man [2008], among many more.  How can such a reality be powered by technology, and what is needed?

The context is important; when the context is not understood, everything you see will be noise and useless.

Knowledge of content: When one is aware of a certain thing, it becomes easy to explore information or make good use of it.  Essentially, such awareness is important for inferring a context, or at least for making an assumption about the context.

When context and awareness are put together, one can imagine endless possibilities for making the sci-fi vision a reality.

To start with, when you want to write software that is as powerful as the dream, a very strong infrastructure is required to support it.  A few upcoming text mining services from content providers make it easy to apply a vast amount of knowledge to make the unknown known.

Anyone looking for on-demand services that enable text mining can try the following:

Calais: The toolkit offers an API to understand content semantics.

uClassify: Offers an API for detecting content language, categorization, gender and age recognition, and understanding the mood of the content.

Freebase: A growing repository of knowledge powered by the community.  It helps in recognizing entities and verifying them.

Now, with all these capabilities, putting them together to do whatever one wants should be as simple as writing a business case.

Subscribe to the feeds to get hands on with the examples … “coming soon”



Building a simple home entertainment server …

The need: Enable consumption of media in the home network while keeping the portability of media storage intact.

Having tons of photos, music and videos (in various formats) makes it hard to organize them, and discovering them becomes critical as time passes.  I was evaluating various solutions that would let me organize, stream and discover them so that they can be consumed in my home network, yet give me the flexibility to carry the content along when needed (especially when I travel).

I had the following options:

  1. NAS (Network storage): bulky and non-portable.  The storage sits in a centralized location and transporting the drive might be difficult.  Not all storage devices allow installation of custom software.
  2. Boxee: Pretty good at organizing content, fetching sub-titles, however, I do not like sharing my private information.
  3. UPnP media server: a little cumbersome to configure, and good ones aren’t free.
  4. iTunes: Requires format change, and works only with Apple products.  Perhaps I might end up buying more Apple tools and products (iPod, iPad, …), which change too quickly and make my investment look ridiculous.

However, none of the above gives me portability.

What did I finally come up with?

  • Used a portable drive – to dump all my resources
  • Used an old system that runs Ubuntu with my portable drive connected.
  • Used AutoIndex to index the drive.  This is simple, lightweight, runs as a PHP module in Apache.
  • I’ve customized AutoIndex to make MP3 audio and video (AVI, MPEG, MP4, MKV) playable in a browser.  No plugin is required for MP3; video, however, depends on a browser plugin.  I’ve also added icons for GNOME.  You can try out the updated code by downloading it from here.

I’ve also submitted the new changes to AutoIndex (in a discussion: )

When at home, use the media server; when on the go, just unplug the drive and take it along 🙂  This brings convenience and portability together.

The sandbox was built with:

  1. An old PC (a Pentium processor).
  2. Ubuntu Lucid Lynx
  3. Apache 2.0
  4. PHP 5.0
  5. AutoIndex
  6. Firefox 3.6.9 Browser



Find bugs before your customer does …

Performing static analysis of Java code on a regular basis is an extremely useful exercise, and I’ve found “FindBugs” to be a worthy tool for it.  One particular warning it raises, “exception is caught, when exception is not thrown”, may appear to be a false positive at first. However, the tool is basically recommending that you catch specific exception types instead of having a “catch all” clause that catches the base exception, since the latter will mask potential programming mistakes.  Catching specific exceptions and handling them appropriately, and perhaps differently, makes for a better error-handling approach.  Overall, it improves the readability of the code, so that others are better able to understand and extend the exception handling mechanism.

Try it out and fix the bugs before the customer spots them!
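Here is a small illustration of the difference, written for this post rather than taken from any FindBugs documentation. The class PortParser, its two methods, and the fallback value 8080 are all invented for the example; the point is only how a catch-all clause hides a bug that a specific catch lets through.

```java
// Hypothetical example: parse a port number from configuration text,
// falling back to 8080 when the text is not a number.
public class PortParser {
    private static final int DEFAULT_PORT = 8080;

    // Catch-all style: the only checked failure we expect is a bad number,
    // but catching Exception also swallows the NullPointerException
    // that raw.trim() throws when raw is null.
    public static int parsePortCatchAll(String raw) {
        try {
            return Integer.parseInt(raw.trim());
        } catch (Exception e) {
            return DEFAULT_PORT;
        }
    }

    // Specific style: only the anticipated failure is handled,
    // so a null argument still surfaces as a programming mistake.
    public static int parsePort(String raw) {
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            return DEFAULT_PORT;
        }
    }

    public static void main(String[] args) {
        System.out.println(parsePortCatchAll(null));   // prints 8080, hiding the null bug
        System.out.println(parsePort("not-a-number")); // prints 8080, the intended fallback
        try {
            parsePort(null);
        } catch (NullPointerException e) {
            System.out.println("null input surfaced as a NullPointerException");
        }
    }
}
```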

http://findbugs.sourceforge.net/



5 Minutes on Linux: Automating app launch on boot

Adding cron jobs is one way to automate the launching of apps. Another approach is to make those apps launch when a user logs in. Depending on how the app must be started, one can make the right choice.

Add a wrapper for the binary to be launched in /etc/init.d:

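# Single quotes keep "$1" literal, so the wrapper forwards start/stop to apachectl
printf '#!/bin/sh\n/home/tester/tools/apache-httpd/bin/apachectl "$1"\n' > /etc/init.d/start-apache
chmod +x /etc/init.d/start-apache
update-rc.d start-apache defaults 3
service start-apache start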

Here is my experience: I wanted to index media on an external drive. One of the coolest indexing apps I found is AutoIndex, which is a PHP module (so it runs under Apache httpd); however, this posed a challenge. The external NTFS drive gets mounted by gvfs (fuseblk). Since I wanted to manage the drive via gvfs (and not via ntfs-3g mounting via fstab), I thought it was a good idea to leave it that way.

Now the challenge was to get Apache, running as a service (launched by user “daemon”), to access the gvfs-mounted drive. The permissions cannot be changed (at least I have not found a resolution there).

Then, what’s the solution?

Install Apache and the PHP modules in a user account (user: tester)
Make the tester user log in automatically on boot
Add Apache (user installed) to launch on login (GNOME: System -> Preferences -> Startup Applications)
Add "gnome-screensaver-command -l" to lock the screen (GNOME: System -> Preferences -> Startup Applications)

You’ll have the user-installed Apache running on boot, with the screen locked for security purposes.

This is a hack I did at home to get media streaming to work with drives mounted by gvfs.