Skydiving with Search Hats On

I have accumulated quite a pile of search-related ideas over the years, most of which are not new algorithms but just new approaches to searching.  Since I am going to be exposed to some NDA-protected activities at Microsoft next week and there might be some overlap between their mindset and mine, I thought I should spill some out now just in case.  As to why I am helping Microsoft out and not Google: Microsoft asked, Google didn't.

The first of these is Search Hats.  A 'search hat' is just a metaphor for the 'why' behind searches.  When we search for information online, we are searching for a reason.  If those reasons are grouped by roles, perspectives, or interests, most people should end up with a handful of large clusters.

If each cluster is a hat a person wears when doing a search, a group of people with similar roles, perspectives, or interests should have similar sets of hats.  Search hats affect the presentation of search results such that items related to the roles, perspectives, or interests appear more prominently.

For example, if I search for 'Eclipse' while wearing the 'Software Developer' hat, I should get Eclipse IDE related links before links related to the astrophysical phenomenon.  Even if I were interested in the latter, the results I get back should be different depending on whether I am wearing a Physicist's hat or a Photographer's hat.

Information on which links are relevant to which hats can be culled by keeping track of which hats searchers are wearing when they do the searches.  The same information can be used to recommend hats a searcher might be interested in wearing.  Hats can also be shared among searchers explicitly.
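To make the bookkeeping concrete, here is a rough Python sketch of one way it could work.  The class name, the click-count scoring, and the example URLs are all my assumptions for illustration, not a description of any real search engine.

```python
from collections import defaultdict


class HatIndex:
    """Remembers which links searchers picked while wearing which hat."""

    def __init__(self):
        # clicks[hat][url] = how many times url was picked under that hat
        self.clicks = defaultdict(lambda: defaultdict(int))

    def record_click(self, hat, url):
        """Note that a searcher wearing `hat` picked `url`."""
        self.clicks[hat][url] += 1

    def rerank(self, hat, results):
        """Move links picked more often under `hat` toward the top.

        sorted() is stable, so links with equal counts keep the order
        the underlying search engine gave them.
        """
        counts = self.clicks.get(hat, {})
        return sorted(results, key=lambda url: -counts.get(url, 0))


# Hypothetical usage: same query, different hats, different ordering.
index = HatIndex()
index.record_click("Software Developer", "https://ide.example/eclipse")
index.record_click("Photographer", "https://photo.example/eclipse")
results = [
    "https://astro.example/eclipse",
    "https://photo.example/eclipse",
    "https://ide.example/eclipse",
]
developer_view = index.rerank("Software Developer", results)
photographer_view = index.rerank("Photographer", results)
```

A hat nobody has worn yet leaves the results in their original order, which seems like the right default.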

Like Dr. Seuss's magic hats, there are hats within hats, so searchers can browse for the hat that suits them by diving into hats or grabbing one of the hats returned as part of each search result.  Over time, a user's hat collection will be refined and adjusted to meet the user's search needs.

The nice thing about Search Hats for search service providers is that a) search results will be more accurate and contain less noise, b) hat collections are great for targeted ads, and c) users will find it difficult to abandon their hat collection.

Oops.  I am out of time so I'll have to cover the 'Skydiving' idea later.  Now where did I put my hat?

Spring in Autumn

This week is turning out to be a slow blogging week because I was busy wrestling with funky HTML email formats.  Generating email is easy, just as generating HTML is easy.  Trying to make sense of all the wild variations and loosey-gooseyness at the receiving end is tougher, and doing content surgery en route is even tougher.

I did have some enjoyable time integrating Spring Framework with the pure Java milter though.  It took me only a few hours to wash most of the configuration mess out of the code and into an XML file.  Nice.

Although I did notice the recent release of HiveMind 1.0 final, I went with Spring Framework because I felt more comfortable with its design and terminology than HiveMind's.  But then they are very similar, so which you choose to use is just a matter of taste.  HiveMind is fairly small though, because it is boxed in by the rest of the Jakarta projects in terms of functionality.  Joining a community means having more toes and egos to avoid stepping on.

Spring Framework, on the other hand, has a growing flotilla of integration packages.  Since I was short of time, I ignored them for now and used only the core and context packages.

Firefox on Fire

While it's cool to see that the Mozilla Foundation has met its goal of 1 million Firefox 1.0PR downloads in 10 days, it's sad to see that they achieved that by doing it the Old-Fashioned Way, delivering every byte to everyone themselves.

Instead of the irrelevant, so-called RSS Support, they could have added BitTorrent support to Firefox, enabling Mozilla servers to share the download frenzy stress with network download clients and enabling every Firefox installation to be BitTorrent-ready at the same time.

But then it's a puzzle why AOL, Mozilla's former patron, didn't add BitTorrent support across its product line (AIM, WinAmp, etc.) to make it easy for people to download multimedia.  IMHO, the best way to control illegal sharing of copyrighted goods is by controlling the client.  And let's not forget all the legal ways to leverage P2P technologies.

One example is peer-to-peer education.  Recording video or audio is so much simpler than writing books.  Why not ask people to share amateur How-To videos?  Time-Life made a bundle selling How-To books, and market interest is clearly there.  Turn on those Visa/MC logos and let people make money off teaching others how to fix things.

The best time to embrace a technology is when everyone is scared of it.  Grab it by the horns and flip it to your advantage.

Firefox Live Bookmarks == IE CDF?

William Slabbekoorn (see his comment in Firefox RSS Support) duplicates Firefox's Live Bookmarks feature for IE with a bit of server-side ASP code that transforms RSS into CDF.  You remember CDF, don't you?  If my feed had been in CDF format, the server-side component wouldn't have been necessary, which makes Live Bookmarks as uncrappy and useful as CDF.
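For the curious, the RSS-to-CDF transformation is small.  The sketch below is mine, not William's ASP code, and it is written in Python.  It assumes an un-namespaced RSS 2.0 feed and emits only a bare-bones subset of CDF (CHANNEL, ITEM, TITLE, and the HREF attribute); the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def rss_to_cdf(rss_xml):
    """Turn an RSS 2.0 document into a minimal CDF channel string."""
    rss_channel = ET.fromstring(rss_xml).find("channel")

    # The CDF root is a CHANNEL pointing at the site the feed describes.
    cdf = ET.Element("CHANNEL", HREF=rss_channel.findtext("link", default=""))
    ET.SubElement(cdf, "TITLE").text = rss_channel.findtext("title", default="")

    # Each RSS item becomes a CDF ITEM with its own HREF and TITLE.
    for item in rss_channel.iter("item"):
        entry = ET.SubElement(cdf, "ITEM", HREF=item.findtext("link", default=""))
        ET.SubElement(entry, "TITLE").text = item.findtext("title", default="")

    return ET.tostring(cdf, encoding="unicode")
```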

Below is a partial screenshot of my feed displayed as IE's 'Live Channel':

Full screen version from William Slabbekoorn (Local Copy).

IMHO, false praise is worse than no praise at all.

Bug Enhancement?

This entry in the latest list of changes to QDBM, a fast dbm-like open source library, gave me a good laugh:

A bug in the extended API was enhanced.

Aside from the typo, I have no complaints about QDBM.  Coming from me, that's a compliment.

Firefox RSS Support

I just finished looking at the code implementing Mozilla's RSS support (aka 'Live Bookmarks') and came up with these tips:

To make the orange RSS button show up in the bottom-right corner of Firefox when your webpage is displayed, add the following HTML fragment to your webpage's HEAD element for each feed.

<LINK type="{feedMimeType}" rel="alternate"
    title="{feedTitle}" href="{feedUrl}">

where {feedMimeType} can be:

application/rss+xml
application/atom+xml
application/x.atom+xml

If {feedMimeType} is not one of the above, then {feedTitle} has to be one of the following (case-sensitive):

rss
RSS
Atom

Otherwise, {feedTitle} can be anything.
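The rules above boil down to a few lines of logic.  Here is a Python sketch of them as I read them; it is not Firefox's actual code, the class and function names are made up, and the check that rel is exactly "alternate" is an assumption taken from the fragment above.

```python
from html.parser import HTMLParser

FEED_MIME_TYPES = {
    "application/rss+xml",
    "application/atom+xml",
    "application/x.atom+xml",
}
FEED_TITLES = {"rss", "RSS", "Atom"}  # compared case-sensitively


def looks_like_feed_link(mime_type, title):
    """A known MIME type is enough; otherwise the title must match."""
    if mime_type in FEED_MIME_TYPES:
        return True
    return title in FEED_TITLES


class FeedLinkFinder(HTMLParser):
    """Collects (title, href) for LINK tags that pass the rules."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, not values.
        if tag != "link":
            return
        attributes = dict(attrs)
        if attributes.get("rel") != "alternate":  # assumption, see lead-in
            return
        if looks_like_feed_link(attributes.get("type"), attributes.get("title")):
            self.feeds.append((attributes.get("title"), attributes.get("href")))
```

Feed a page's HTML to FeedLinkFinder().feed(...) and the feeds list holds whatever should light up the button.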

To make feed items appear correctly in the Firefox bookmark sidebar, your feed items *must* have both non-empty 'title' and 'link' tags.
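A quick way to check your own feed against that requirement is sketched below in Python.  It assumes an un-namespaced RSS 2.0 feed, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def displayable_items(rss_xml):
    """Return (title, link) for items that have both, non-empty."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        # findtext() gives None for a missing tag and "" for an empty one.
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        if title and link:
            items.append((title, link))
    return items
```

Any item missing from the returned list is one that would not show up properly as a live bookmark.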

And what do I think of Firefox's so-called RSS support?  Words like crappy and useless come to mind.

Update #1:

The following is a copy of my comment on Dan Gillmor's post quoting my 'crappy and useless' comment:

Some details behind my rather rude comment:

1. There is no such thing as RSS support in Firefox 1.0PR. Firefox 1.0PR *uses* RSS feeds to implement Live Bookmarks. While Live Bookmarks is useful for del.icio.us, live bookmarks are read-only.

2. While such use of RSS is laudable, they failed to distinguish between Live Bookmarks, a specific application of RSS, and the RSS technology, creating confusion as a result.

3. Live bookmark behavior is inconsistent across feeds. For link blogs, live bookmarks point to different destination sites. For other blogs, they point to different sections of a page or different pages at the same site. Live bookmarks confuse users and waste bandwidth.

4. Bookmark sidebars are too narrow to display item titles effectively.

Struts 1.2.2 Released

Struts is still widely used by server-side Java developers, but I stopped tracking it after the release of Struts 1.1 more than a year ago.  So it's no surprise that I didn't notice the release of Struts 1.2.2 until now.

Scanning through the Release Notes, I don't see any compelling reasons to upgrade.  Even worse, there are good reasons not to upgrade, such as the removal of code deprecated in 1.1, which will break some existing code.

I don't think I'll be upgrading my Struts 1.1-based projects and, for new Java webapp projects, I'll be using the Spring Framework.

So long and thanks for all the actions, Struts.  It's been fun watching the paint dry.

Vary: ETag

These days, I am not tracking the Atom mailing list too closely due to the traffic volume (currently 7 times XML-DEV traffic) and lack of time, but Tim's FooCamp2004 post prompted me to read Sam's Vary: ETag post and comments.

While I like the cleverness of the solution, I have misgivings about how practical it really is.  Aside from requiring Vary: ETag-aware clients to keep track of ETags, the solution requires a lot of server-side work for doubtful gain.

  • Everyone seems to agree that low traffic blogs won't see any noticeable gain.
     
  • I don't think the large blog services like TypePad and Blogger will gain much either because such services must support tens of thousands of feeds, each of which must be sliced and diced at the expense of CPU load to reduce bandwidth.  Parsing every feed for every request to figure out which subset to send back is not cheap.  Even if a cache is used, frequent editing of recent posts will increase the CPU load noticeably.
     
  • That leaves only feeds like the MSDN aggregated feed which will see noticeable bandwidth reduction at the expense of writing a custom Vary: ETag handler.

A key problem is that XML is not an efficient format if you are doing a lot of search and extraction.  Regular expressions can be used, but not reliably or fast enough unless the feed is preprocessed into a more palatable form (a canonical or proprietary regex-friendly format).

A similar but more practical solution might be to serve feeds as multipart MIME resources with sequenced parts.  Each feed item becomes a MIME part, and the feed metadata is also a MIME part.  An extra benefit is that binaries can be embedded and other content formats (e.g. RSS) can be supported as well.
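Here is a rough Python sketch of what such a multipart feed could look like, using the standard email package.  Nothing in it is a proposed standard: the function names, the Content-ID values used to carry the sequence numbers, and the choice of multipart/mixed are all assumptions for illustration.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def feed_as_multipart(feed_metadata_xml, item_xmls):
    """Package feed metadata and items as one multipart MIME message.

    Part 0 is the feed metadata; each following part is one item,
    labelled with a sequence number (oldest item first).
    """
    feed = MIMEMultipart("mixed")

    metadata = MIMEText(feed_metadata_xml, "xml", "utf-8")
    metadata["Content-ID"] = "<feed-metadata>"  # hypothetical label
    feed.attach(metadata)

    for sequence, item_xml in enumerate(item_xmls, start=1):
        part = MIMEText(item_xml, "xml", "utf-8")
        part["Content-ID"] = f"<item-{sequence}>"  # hypothetical label
        feed.attach(part)
    return feed


def items_after(feed, last_seen):
    """Return the item parts with a sequence number above last_seen."""
    item_parts = feed.get_payload()[1:]  # skip the metadata part
    return item_parts[last_seen:]
```

The point of the sequencing is that a server can pick the parts a client has not seen by position, without parsing any XML.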

Linux VMware Blues

If you are running a Linux guest under VMware like me and my blog's hyperlinks are green instead of blue, turn on subpixel font rendering to get the blues.

FYI, I am running RedHat 9 under VMware running on XP, primarily for development and testing.  For example, I needed to write a milter, so I initially wrote a C++ version using Eclipse running under the RH9 VMware guest.  The milter was talking to a sendmail server running inside the same virtual machine.  Eclipse CDT running inside the VM was rather difficult to work with, so I rewrote the milter in pure Java using Eclipse running on XP.

To debug, I configured the sendmail server running inside the VM to invoke the pure Java milter running under the Eclipse debugger outside the VM.  Then I sent both plain text and multipart MIME messages to the sendmail server inside the VM, using Evolution running inside the VM as well as Outlook running on another machine.  The sendmail server in turn invoked the milter running outside the VM.

While all this might be confusing to some, it worked amazingly well.