

Tag Cloud Font Distribution Algorithm
Submitted by kentbye on Fri, 2005-06-24 17:36.
Development | Drupal | Tagcloud
This post is a follow-up to the previous posts here, here and here, with more information on an algorithm for automating the font distribution for a Drupal tag cloud. There's also an optional alteration that would distribute the font sizes more evenly across a Power Law tag frequency distribution.
I'll step through the details below, with the intention of following up with some Drupal developers who can code this up in PHP and provide it to the Drupal community as a module. Most likely this module could be built on top of the tagadelic module. This is my first attempt at specifying additional Drupal functionality, in preparation for specifying other aspects of my Development Roadmap.
Being able to automatically create tag clouds, personal tag clouds and tag clouds based upon user-specified identities will be a very useful tool for visualizing the subjective context and qualitative opinions that volunteers have about interview sound bites. My intent is to get this tag cloud development rolling to build momentum for the other aspects of Phase 01 of my roadmap.
The rest of this post is pretty technical, but I'll include some graphics below for anyone else interested in following along. To start off, here's a graph of the original frequency distribution of all of the tags across EchoChamberProject.com.
There were originally 6 sections that I picked out to determine the different font sizes:
This provided intuitive thresholds of 10, 20, 30, 40 and 50 to break up the font sizes. However, plotting a graph of the distribution and eyeballing the thresholds only works if you're willing to do this laborious task manually. I needed a way to automate the font distribution so that the thresholds could be determined automatically across many different tag sample sizes. Here's what I came up with:
VARIABLES
Here's the code:
This code gives the following thresholds for determining the font size for the EchoChamberProject.com data set:
Here's a graph of the tag frequency distribution that uses the new thresholds:
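For readers who want to try the idea outside Drupal, here is a minimal Python sketch of linear thresholding. It is not the pseudocode from this post: the function name `linear_thresholds`, the choice to return only the interior cut points, and the sample counts are all assumptions made for illustration.

```python
def linear_thresholds(counts, num_font_sizes):
    """Split the range [min(counts), max(counts)] into equal-width bands.

    Returns the num_font_sizes - 1 interior cut points between bands.
    Assumption: a count exactly on a cut point belongs to the lower band.
    """
    lowest, highest = min(counts), max(counts)
    delta = (highest - lowest) / num_font_sizes  # width of one band
    return [lowest + i * delta for i in range(1, num_font_sizes)]


# Hypothetical counts; only the minimum (1) and maximum (53) match the post.
sample_counts = [1, 2, 2, 5, 9, 14, 27, 53]
cuts = linear_thresholds(sample_counts, 6)
print([round(c, 2) for c in cuts])
```

With a minimum of 1 and a maximum of 53, six font sizes give a band width of 52/6, or about 8.67, so the five cut points play the same role as the eyeballed thresholds of 10, 20, 30, 40 and 50.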
Only two tid fontsize values changed with these calculated thresholds compared to my original approximation. These calculated thresholds would then be used to assign a fontsize to each term's frequency count, which is contained in the tid_Count(1:#tid) array.
VARIABLES:
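The assignment step can be sketched in Python as a lookup against sorted cut points. This is an illustration under assumptions, not the original pseudocode: `font_size_for` is a hypothetical helper, the tag names and counts are made up, and the cut points are the eyeballed thresholds of 10, 20, 30, 40 and 50 mentioned earlier.

```python
from bisect import bisect_left


def font_size_for(count, cut_points):
    """Return a 1-based font size for a tag count.

    cut_points must be sorted ascending. A count equal to a cut point
    stays in the lower band (an assumption, not taken from the post).
    """
    return bisect_left(cut_points, count) + 1


cut_points = [10, 20, 30, 40, 50]  # five cut points give six font sizes

# Made-up terms and counts standing in for the tid_Count array.
tid_count = {"alpha": 53, "beta": 31, "gamma": 10, "delta": 1}
tid_fontsize = {term: font_size_for(n, cut_points) for term, n in tid_count.items()}
print(tid_fontsize)
```

Using a binary search keeps the lookup cheap even with many font sizes, although for six or eight sizes a plain loop over the thresholds would work just as well.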
At this point, the code could produce a tag cloud by outputting the following code for each taxonomy tid:
where
Chris Messina suggested using the span-style, font-size code for specifying the pixel sizes of the font:
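As a rough illustration of the span-style approach, the Python sketch below builds one link per tag with an inline pixel size. The helper names, the link path and the exact markup are assumptions for illustration only; the size mapping follows the 1 to 10px, 2 to 12px ... 8 to 24px scheme used in this post.

```python
from html import escape


def pixel_size(font_size, base_px=10, step_px=2):
    """Map font size 1..8 to 10px..24px in 2px steps."""
    return base_px + step_px * (font_size - 1)


def tag_span(name, url, font_size):
    """Build one tag link wrapped around a span with an inline font-size."""
    px = pixel_size(font_size)
    return (
        f'<a href="{escape(url, quote=True)}">'
        f'<span style="font-size: {px}px">{escape(name)}</span></a>'
    )


# The path below is only an example of what a term link might look like.
print(tag_span("example tag", "/taxonomy/term/5", 2))
```

Escaping the tag name and URL matters here because tags are user-supplied text and would otherwise be written straight into the page.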
So instead of using HTML font sizes of 6, 5, 4, 3, 2 and 1, the tag cloud would use pixel sizes. Notice that I upped #FontSizes from 6 to 8 and altered the pseudocode above to map 1 to 10px, 2 to 12px ... 8 to 24px.
REPLACE
WITH
Now let me point out one additional modification that could be made in order to distribute the font sizes more evenly across the Power Law Distribution of Folksonomy Tags. First let's take a look at the normal tag distribution using 8 different font sizes -- the maximum tid_Count is 53 with a minimum tid_Count of 1.
Now let's take a look at a graph of the normal tid_Fontsize distribution -- Notice how the Power Law nature of the quantized distribution is exacerbated.
Now let's take a look at a graph plotting the total number of times each fontsize is used. Notice that you get an inverse Power Law effect, with the smallest fontsize of 1 occurring for over half of the tid terms (78 / 122).
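The usage tally behind a graph like this can be sketched in a few lines of Python. The counts below are made up (only the minimum of 1 and maximum of 53 match this data set), and `linear_font_size` and `size_usage` are hypothetical helpers, so treat this as an illustration of the bookkeeping and not a reproduction of the 78 / 122 figure.

```python
import math
from collections import Counter


def linear_font_size(count, min_count, max_count, num_font_sizes):
    """Assign a 1-based font size by splitting the count range linearly."""
    if max_count == min_count:
        return 1
    delta = (max_count - min_count) / num_font_sizes
    band = math.ceil((count - min_count) / delta)
    return min(num_font_sizes, max(1, band))  # clamp to a valid size


def size_usage(counts, num_font_sizes):
    """Count how many tags land on each font size, smallest size first."""
    low, high = min(counts), max(counts)
    used = Counter(linear_font_size(c, low, high, num_font_sizes) for c in counts)
    return [used.get(size, 0) for size in range(1, num_font_sizes + 1)]


sample_counts = [1, 1, 1, 2, 2, 3, 5, 8, 14, 27, 53]  # made-up long-tail data
print(size_usage(sample_counts, 8))
```

Even with this tiny sample, most tags pile up in the smallest size, which is the same skew the graph shows.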
One way to even out the fontsize distribution in the tag cloud would be to use a logarithmic thresholding algorithm instead of a linear one. Specifically, the following two changes would be made to the pseudocode above:
REPLACE
WITH
AND REPLACE
WITH
Now let's take a look at the logarithmic tag distribution using 8 different font sizes -- using base-10 logarithms, the maximum tid_Count value is 100*log(53+2) = 174 with a minimum tid_Count value of 100*log(1+2) = 47.71. I added 2 to the tid_Count(x) value because log(1) = 0 and log(0) is undefined (it tends toward negative infinity), and either result would mess up the font distribution algorithm.
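Here is a minimal Python sketch of one way to read the logarithmic variant. It assumes the log transform is applied to the counts themselves and that the transformed range (47.71 to 174 for this data set) is then split into equal bands. Taking the log of both the linear thresholds and the counts would leave every tag in the same band as before, since the logarithm preserves ordering. The function names are hypothetical.

```python
import math


def log_weight(count):
    """Transform a raw tag count as 100 * log10(count + 2)."""
    return 100 * math.log10(count + 2)


def log_font_size(count, min_count, max_count, num_font_sizes):
    """Assign a 1-based font size by splitting the log-weight range linearly."""
    low, high = log_weight(min_count), log_weight(max_count)
    if high == low:
        return 1
    delta = (high - low) / num_font_sizes  # width of one band in log space
    band = math.ceil((log_weight(count) - low) / delta)
    return min(num_font_sizes, max(1, band))  # clamp to a valid size


print(round(log_weight(1), 2), round(log_weight(53), 2))
for c in (1, 3, 10, 53):
    print(c, log_font_size(c, 1, 53, 8))
```

Under these assumptions a tag used 10 times gets size 4 of 8, whereas a linear split of the raw range 1 to 53 would put it in size 2, so mid-frequency tags are pulled up out of the smallest sizes.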
Here are the normal and logarithmic graphs overlaid to show the difference -- the logarithmic data is transparent.
Now let's take a look at a graph of the logarithmic tid_Fontsize distribution -- Notice the Power Law nature of the quantized distribution is leveled out.
Here are the two graphs overlaid to show the difference -- again with the logarithmic data being transparent.
Finally, let's take a look at the new logarithmic fontsize distribution. The inverse Power Law effect is still present, but reduced as shown by a more even distribution throughout all of the sizes.
And here are the two graphs overlaid again:
The differences between linear and logarithmic thresholding may be too subtle to bother with, but I thought I would also publish the following 4 tag clouds for comparison. The first and third use the logarithmic distribution, and the second and fourth use the linear distribution. The first two are ordered by frequency and the last two alphabetically.
Below is the tag cloud ordered by frequency using the logarithmic fontsize distribution algorithm with 8 font sizes (10px to 24px) & "Span Style Font Size" code:
Below is the tag cloud ordered by frequency using the linear fontsize distribution algorithm with 6 font sizes (1 to 6) & "HTML Font Size" code:
Below is the alphabetized tag cloud using the logarithmic fontsize distribution algorithm with 8 font sizes (10px to 24px) & "Span Style Font Size" code:
Below is the alphabetized tag cloud using the linear fontsize distribution algorithm with 6 font sizes (1 to 6) & "HTML Font Size" code:
problem with logarithmic calculations in algorithm
Submitted by Anonymous (not verified) on Mon, 2006-05-29 11:05.
Hi, Example: // For logarithmic: tagWeight = ( ( linearTagWeight - minimumWeight ) / linearRange ) * logRange + thresholds[ 0 ]
tagWeight?
Submitted by Pepino (not verified) on Sun, 2007-01-14 20:46.
Hi, can you just explain a little more how you calculate the tagWeight? TIA
Stumbled on this algorithm
Submitted by Anonymous (not verified) on Fri, 2005-12-16 21:31.
Stumbled on this algorithm while googling, but found some inconsistencies in your explanation/charts. When you move to the example with 8 fontsizes, it seems like you forget to add the MINIMUM[tid_Count(1:#tid)] during the threshold calculation. Based on your example/pseudocode for the 6 font sizes, the thresholds should be:
Threshold(1) = (1 + (1 * 6.5)) = 7.5
Threshold(1) = 6.5
Not a big deal, but then I noticed that your Power Law alternative pseudocode differs from the results in your graph. You state the new Power Law Formula for calculating thresholds should be:
Loop Fontsize = 1 to #FontSizes
Based on your dataset (and assuming that the curly braces simply denote grouping), the thresholds should look something like:
Delta = 52/8 = 6.5
Threshold(0) = 100 * log(1 + (0 * 6.5) + 2) = 47
....
Instead in your graph, you have:
Am I reading your pseudocode wrong? By using your pseudocode, there is no difference between the two types of distributions, so I'm curious as to how you arrived at the set of numbers in the graph.
TagLines
Submitted by Francis (not verified) on Sat, 2005-07-02 23:18.
I made this: "TagLines", similar to TagCloud but with AJAX (like Google Maps): http://www.francisshanahan.com/taglines/default.aspx?cat=All It's an auto-folksonomy tool built using the Term Extraction APIs. Would love to get some feedback on it or suggestions for improvement.
Distribution Algorithm
Hi
Very interesting, thanks