# MapDB – Useful Java Libraries

I’ve not written much Java in the last year but have been slowly getting back in to it over the last month or two. In the process of doing so I’ve been re-evaluating many of the libraries I commonly use when starting a new project and have been discovering many new ones. I thought I would write a small post on each of the interesting libraries detailing where they might be useful and maybe a small example or two. I am, by no means, an expert on any of them however. I’m also going to be sticking to libraries that I think are likely mature enough to use in a production environment.

To start with there’s MapDB (formerly JDBM4). MapDB is a pure-Java disk-backed off-heap transactional database which offers drop-in placements for common concurrent Java datastructures (ConcurrentMap and ConcurrentNavigableMap) as well as powerful features like secondary indexes. It offers a boat load of other features, such as compression and encryption.

Here’s a link to the Hello World Example.

Alternatives. Essentially any other Java in-memory database, such as:

# Algorithms for generating new ideas for startups

I used to rely on spontaneous inspiration to generate most of my ideas, normally while doing some activity where my mind was relatively free to wander like showering or going for a long drive. Recently though I’ve started to get frustrated at the rather unpredictable rate at which I amass ideas and have started developing more systematic approaches for generating new business ideas.

In many numerical disciplines, there are often functions that we want to sample from but where doing so is difficult. So in general, we cheat. We sample from a simpler approximation and then apply some form of correction afterwards. Consider these different approaches as approximations. For each technique i’ll explain how it works and then go through a few examples of applying it. Feel free to add more examples in the comments and i’ll update the article with them where possible.

#### Servicing trends

A fairly simple technique to kick us off.

1.  Identify a list of technologies, platforms and services on an upward trend
2. For each item in the list:
1. Find an existing concept that has similar characteristics
2. Identify services and complementary products
3. Consider the application of these to the trending item incorporating any new aspects it brings
##### Examples

Let’s look at a few existing servies as examples first.

##### Github

If we look at the trends for around 2008, it’s fairly easy to see Git trending.

Looking at similar, incumbent version control systems there were numerous companies offering hosting for open source projects using Subversion and CVS (Sourceforge and Google Code being two notable ones) and commercial repository hosting for private projects. Github successfully offered both.

##### Heroku

Another good example is Heroku, which originally catered for Ruby developers. While other languages like perl and PHP were often supported on shared hosting environments, Ruby (and specifically Rack) projects required specialist or dedicated environments.

#### Potentially new ideas

It’s fairly easy to rationalise any successful idea retroactively so let’s try applying this method to generate ideas.

Let’s follow the algorithm we defined earlier. First we need a list of things that are trending. Google Trends as we used to look at the VCS trends earlier is one option. A few other useful sources are Crunchbase and Google Insights.

Step one, we need a list of trending items. We’ll start with: Hadoop, MongoDB, Stanford and MIT’s online university courses.

Now for step two. We need to identify an existing concept that has similar characteristics. For Hadoop, probably the closest would be databases.

Step three. We now need to find services and complementary products that already exist for traditional databases. There are two good tricks here. The first is relatively simple, use a search engine to search for related items and looking for advertisements or heavily search engine optimised product pages. For example, searching Google and Bing for ‘mysql’, ‘postgres’ and ‘oracle’ coupled with ‘tools’, ‘services’. The second involves finding a few advertising-heavy sites dedicated to the particular thing you’re looking for and studying the contextually-relevant advertising. Places like eHow and wikiHow are great for this.

I spent twenty minutes doing both of the above and here’s a few things that jumped out:

- Consultancy (Performance tuning, version migration and platform migration seem to be popular)
- Automated performance monitoring/analysis
- Auto-scaling cloud hosting
- Cloud backup
- Query editor and database explorer desktop GUIs

Step four. Consider each of these services or products as a potential starting point for a business involving Hadoop while, if advantageous, incorporating any unique characteristics that Hadoop brings.

Many of these could equally be applied to MongoDB, another trending technology.

How about the recent trend of large online university classes kickstarted by Stanford and now continued by Udacity, Coursera and MITx? We’re probably very early on in the trend but that’s normally a good place to be.

How can we service these trends? Cranking the handle with the techniques from earlier and a little brainstorming:

- One-on-one tutoring for popular courses (or a marketplace for them)
- eBooks of questions and worked solutions{{ footnote(“you could even provide the questions for free but charge for the worked solutions or upsell having your solutions marked by a human with feedback.”) }}
- Specialised recruitment events/platform (‘eMilkRound‘)
- Blog/social media widgets showing your completed courses and current studying

I’ve not tried to filter any of these ideas, there may already been well entrenched competitors and they probably all need a little further development to find a good fit in the market (a little hill climbing).

# Finding similar items using minhashing

You find yourself with a sufficiently large pile of items (tweets, blog posts, cat pictures) and a key-value database. New items are arriving every minute and you’d really like a way of finding similar items that already exist in your dataset (either for duplication detection or finding related items).

Clearly we don’t want to scan our entire existing database of items every time we receive a new item but how do we avoid doing so? Minhashing to the rescue!

#### Minhashing

Minhashing is a technique for generating signatures for items. The more similar two items are, the higher the chance that they share the same minhash. Of course, we need a proper definition of similarity for this to be useful.

##### Similarity

For this we use the Jaccard Similarity:

$J(A,B) = \frac{|A\cap B|}{|A\cup B|}$

That is, the Jaccard Similarity is the number of elements two sets have in common divided by the total number elements in both sets. Obviously, a similarity of 0 means two sets contain no elements in common and a similarity of 1 indicates the sets contain the same elements.

If we treat our items as sets and the item attributes (such as tags or keywords) as the set elements then we can use the above definition to compute the similarity between two items. It’s worth pointing out here that we are making the assumption that an item attribute can only appear once in an item, there are ways of dealing with this later though.

A quick example. If we have two items { cat, dog, fish, banana } and { lemon, fish, dog, cat, cake } then the Jaccard Similarity would be:

$J(item_1,item_2)=\frac{|\{cat,dog,fish\}|}{|\{lemon,fish,cake,cat,dog,banana\}|}=\frac{3}{6}=0.5$

Now we have a way of computing the similarity between two sets, let us put it to one side and consider the actual process of generating a minhash. Before we get started, we need a hash function. Your programming language of choice’s standard library will almost certainly have a few acceptable ones within arm’s reach. We’ll call it $H$. The actual process goes as follows:

• Set the $h_{min}$ to infinity
• For each element $s$ in the set S:
• Compute the element’s hash $h_s=H(s)$
• If , the minimum hash

Simple. How does it work though?

First, let’s consider the hash function $H$. The function hashes a value to one of $M$ possible buckets. We assume that the number of buckets greatly exceeds the number of elements or features we’re likely to see, $M >> N$. With a good hash function, the probably of a value being hashed to any one of the $M$ buckets is $1/M$.

Now consider a set $A$ which we’re trying to compute the minimum hash for. Since each element from set $A$ has an equal probability of being hashed in to any one of the $M$ buckets, the probability that any given element is the minimum hash for the set is $1/N_A$ where $N_A$ is the number of elements in set $A$.

Let’s introduce a second set B. What is $p(h_{min}(A)=h_{min}(B))$, the probability that both sets share the same minimum hash? Since any element from combined elements of both sets has an equal probability of being the minimum hash then $p(h_{min}(A)=h_{min}(B))$ is equal to the ratio of the number of elements common to both sets over the total combined elements:

$p(h_{min}(A)=h_{min}(B)) = \frac{|A\cap B|}{|A\cup B|}$

which is the Jaccard Similarity!

What we’ve worked out above is that if we pick an arbitary hash function (picked at random from a particular family, for example) and calculate the minimum hashes for two sets then the probability that they share the same minimum hash is equal to the ratio of their common elements to their total elements.

##### Practical usage

For a key-value database we can use minhashes for keys, storing a list of the keys for the items that resulted in that minhash as the value. I use plural for minhashes because by storing multiple minhashes for a particular item, it is possible to fine-tune the probabilities of minhash matches towards a particular goal.

If speed is a concern for the particular application then minimising false positives is crucial. For this, combining multiple minhashes such that two sets only share the resulting combination if all of their minhashes match, can allow a tuning of sensitivity.

If two sets have a Jaccard Similarity of $s$ and minhashes are generated for $k$ different hash functions then the probability of a match is simply $s^k$. Choice of $k$ obviously depends on the desired characteristics of the application.

Alternatively, if ensuring a low false-negative is crucial then using minhashes generated from $j$ hash functions and selecting based on any match (effectively, the OR) will give a probability:

$p(h_{combined}(A)=h_{combined}(B))=1-(1-s)^j$

Obviously, these two strategies are not mutually exclusive and give you a range of parameters to tune depending on your application.