Tag Cloud Font Distribution Algorithm


UPDATE 5-24-06: I have now included the source data for the charts below including term id, term name & frequency

This post is a follow-up to the previous posts here, here and here with more information on an algorithm for automating the font distribution for a Drupal tag cloud. There's also an optional alteration that would evenly distribute the font sizes across a Power Law tag frequency distribution.

I'll step through the details below with the intention of following up with some Drupal developers who will be able to code this up in PHP and provide it to the Drupal community as a module -- most likely this module could be built on top of the tagadelic module.

This has been my first attempt at specifying additional Drupal functionality in preparation for specifying other aspects of my Development Roadmap.

Being able to automatically create tag clouds, personal tag clouds and tag clouds based upon user-specified identities will be a very useful tool for visualizing the subjective context and qualitative opinions that volunteers have about interview sound bites.

My intent is to get this tag cloud development rolling to build momentum for the other aspects of the Phase 01 of my roadmap.

The remaining aspects of this post are pretty technical, but I'll include some graphics below for anyone else interested in following along...

To start off here's a photo of the original frequency distribution of all of the tags across EchoChamberProject.com.

Original Tag Distribution

There were originally 6 sections that I picked out to determine the different font sizes:

0-10 (fontsize=1)
10-20 (2)
20-30 (3)
30-40 (4)
40-50 (5)
50-60 (6).

This provided intuitive thresholds of 10, 20, 30, 40, 50 to break up the font sizes. However, plotting out a graph of the distribution and qualitatively eyeing the thresholds only works if you're doing this laborious task manually.

I needed a way to automate this font distribution algorithm so that it could determine the thresholds on its own across many different tag sample sizes. Here's what I came up with:

VARIABLES

MAXIMUM[tid_Count(1:#tid)] = tid_Count(40) = 53 (i.e. "New Media" is taxonomy term 40 & is used as a tag in 53 posts)
MINIMUM[tid_Count(1:#tid)] = tid_Count(89) = 1
#FontSizes = 6
Delta = (MAXIMUM[tid_Count(1:#tid)] - MINIMUM[tid_Count(1:#tid)])/#FontSizes = (53-1)/6 = 8.6667

Here's the code:

Loop Fontsize = 1 to #FontSizes
      Threshold(Fontsize) = MINIMUM[tid_Count(1:#tid)] + (Fontsize) * Delta
End Loop

This code gives the following thresholds for determining the font size for the EchoChamberProject.com data set:

Threshold(1) = 1+ 8.667 = 9.667
Threshold(2) = 18.333
Threshold(3) = 27
Threshold(4) = 35.667
Threshold(5) = 44.333
Threshold(6) = 53
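
As a quick check of the numbers above, here is a minimal Python sketch of the linear threshold calculation. The function name linear_thresholds is my own for illustration, and only the minimum (1) and maximum (53) counts from this data set are passed in, since those are the only values the formula uses.

```python
def linear_thresholds(counts, num_font_sizes):
    """Return the upper threshold for each font size, smallest font first."""
    lowest, highest = min(counts), max(counts)
    delta = (highest - lowest) / num_font_sizes
    # Threshold(Fontsize) = MINIMUM + Fontsize * Delta, for Fontsize = 1..N
    return [lowest + size * delta for size in range(1, num_font_sizes + 1)]


# Only the extremes matter for the thresholds: 1 (tid 89) and 53 (tid 40).
thresholds = linear_thresholds([1, 53], 6)
print([round(t, 3) for t in thresholds])
```

Rounded to three places, this reproduces the six thresholds listed above, from 9.667 up to 53.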

Here's a graph of the tag frequency distribution that uses the new thresholds:

Automatic Font Distribution w/ 6 fonts

Only two tid fontsize values changed with these calculated thresholds compared with my original approximation.

These calculated thresholds are then used to assign a fontsize to each term's frequency count in the tid_Count(1:#tid) array.

VARIABLES:

Threshold(1:#FontSizes) -- thresholds calculated above.
#tid -- the total number of term tid's for each vocabulary vid
tid_Count(1:#tid) comes from summing the total times each tid appears in the term_node table -- shown as the input to the font distribution algorithm in these flowcharts.

Loop tid_loop = 1 to #tid
      Fontset_flag=.FALSE.
      Loop fontsize_loop = 1 to #FontSizes
             If(Fontset_flag=.FALSE.) then
                    If(tid_Count(tid_loop) <= Threshold(fontsize_loop)) then
                         tid_Fontsize(tid_loop) = fontsize_loop
                         Fontset_flag=.TRUE.
                    End If
             End If
      End Loop
End Loop
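
For anyone who wants to try the assignment loop outside of Drupal, here is a rough Python sketch of the same logic. The name assign_font_sizes and the dictionary layout are assumptions of mine; tid 40 (53 posts) and tid 89 (1 post) come from this post, while tid 7 with a count of 20 is a made-up value purely for illustration.

```python
def assign_font_sizes(tid_count, thresholds):
    """Give each term id the first font size whose threshold covers its count."""
    tid_fontsize = {}
    for tid, count in tid_count.items():
        for fontsize, threshold in enumerate(thresholds, start=1):
            if count <= threshold:
                tid_fontsize[tid] = fontsize
                break  # plays the role of Fontset_flag in the pseudocode
        else:
            # No threshold matched (possible only through float rounding at
            # the top end), so fall back to the largest font size.
            tid_fontsize[tid] = len(thresholds)
    return tid_fontsize


# Linear thresholds for 6 font sizes with minimum 1 and maximum 53.
thresholds = [1 + size * (53 - 1) / 6 for size in range(1, 7)]

# tid 40 and tid 89 are from the post; tid 7 is hypothetical.
sample_counts = {40: 53, 89: 1, 7: 20}
print(assign_font_sizes(sample_counts, thresholds))
```

The break statement replaces the flag variable: once a threshold matches, the inner loop stops instead of skipping the remaining iterations.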

At this point, the code could produce a tag cloud by outputting the following code for each taxonomy tid:

<font size=4><a href="http://www.echochamberproject.com/taxonomy/term/23">Transparency</a></font>

where

tid=23;
tid_Fontsize(23) = 4; and
tid_Name(23) = Transparency

Chris Messina suggested using the span-style, font-size code for specifying the pixel sizes of the font:

<span style="font-size: 24px;"><a href="http://www.echochamberproject.com/taxonomy/term/20">Collaboration</a></span>

So instead of using HTML font sizes of 6, 5, 4, 3, 2 & 1, on my tag cloud I used span-style font sizes of 24px, 22px, 20px, 18px, 16px, 14px, 12px & 10px.

Notice I upped the #FontSizes from 6 to 8, and I altered the pseudocode above to correlate 1 to 10px, 2 to 12px... 8 to 24px.

REPLACE
If(tid_Count(tid_loop) <= Threshold(fontsize_loop)) then
      tid_Fontsize(tid_loop) = fontsize_loop
      Fontset_flag=.TRUE.
End If

WITH
If(tid_Count(tid_loop) <= Threshold(fontsize_loop)) then
      tid_Fontsize(tid_loop) = 10 + (fontsize_loop-1)*2
      Fontset_flag=.TRUE.
End If
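
Here is a small Python sketch of the pixel mapping and the span output described above. The helper names pixel_size and render_tag are hypothetical, and the escaping of the tag name is my own addition rather than something the pseudocode specifies.

```python
from html import escape

BASE_URL = "http://www.echochamberproject.com/taxonomy/term/"


def pixel_size(fontsize_index):
    """Map font size index 1..8 to 10px..24px in 2px steps."""
    return 10 + (fontsize_index - 1) * 2


def render_tag(tid, name, fontsize_index):
    """Build the span-style link for one taxonomy term."""
    return '<span style="font-size: {px}px;"><a href="{url}{tid}">{name}</a></span>'.format(
        px=pixel_size(fontsize_index),
        url=BASE_URL,
        tid=tid,
        name=escape(name),  # added safety for names containing & or <
    )


# Index 8 is the largest size, which corresponds to the 24px example above.
print(render_tag(20, "Collaboration", 8))
```

With index 8 this prints the same 24px span shown in the Collaboration example.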

Now let me just point out one additional potential modification that could be made in order to more evenly distribute the font sizes across the Power Law Distribution of Folksonomy Tags.

First let's take a look at the normal tag distribution using 8 different font sizes -- the maximum tid_Count is 53 with a minimum tid_Count of 1

Normal Tag Distribution with 8 Fontsizes

Now let's take a look at a graph of the normal tid_Fontsize distribution -- Notice how the Power Law nature of the quantized distribution is exacerbated.

Normal Font Distribution w/ 8 Fontsizes

Now let's take a look at a graph plotting the total number of times each fontsize is used. Notice that you get an inverse Power Law effect, with the smallest fontsize of one occurring for over half of the tid terms (78 / 122).

Normal Font Frequency

One way to even out the fontsize distribution in the tag cloud would be to use a logarithmic thresholding algorithm instead of a linear one. Specifically, the following two changes would occur to the pseudocode above:

REPLACE
Loop Fontsize = 1 to #FontSizes
      Threshold(Fontsize) = MINIMUM[tid_Count(1:#tid)] + (Fontsize) * Delta
End Loop

WITH
Loop Fontsize = 1 to #FontSizes
      Threshold(Fontsize) = 100 * log [{MINIMUM[tid_Count(1:#tid)] + (Fontsize) * Delta} + 2]
End Loop

AND REPLACE
If(tid_Count(tid_loop) <= Threshold(fontsize_loop)) then
      tid_Fontsize(tid_loop) = 10 + (fontsize_loop-1)*2
      Fontset_flag=.TRUE.
End If

WITH
If(100 * log[tid_Count(tid_loop) + 2] <= Threshold(fontsize_loop)) then
      tid_Fontsize(tid_loop) = 10 + (fontsize_loop-1)*2
      Fontset_flag=.TRUE.
End If

Now let's take a look at the logarithmic tag distribution using 8 different font sizes -- the maximum transformed tid_Count value is 100*log(53+2) = 174.04 with a minimum transformed value of 100*log(1+2) = 47.71 (log here is base 10). I added 2 to the tid_Count(x) value because log(1) = 0 and log(0) is undefined (it tends to negative infinity), and either result would mess up the font distribution algorithm.
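
The two endpoint values can be checked with a couple of lines of Python. The function name log_weight is just a label I am using here; the base-10 logarithm is inferred from the fact that 100*log(3) comes out to 47.71.

```python
import math


def log_weight(count):
    """Transform a tag count as 100 * log10(count + 2)."""
    return 100 * math.log10(count + 2)


print(round(log_weight(53), 2))  # maximum count in this data set
print(round(log_weight(1), 2))   # minimum count in this data set
```

This prints 174.04 and 47.71, matching the range quoted above.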

Logarithmic Tag Distribution w/ 8 Fonts

Here are the normal and logarithmic graphs overlaid to see the difference -- the logarithmic data is transparent.

Normal & Logarithmic Tag Distribution

Now let's take a look at a graph of the logarithmic tid_Fontsize distribution -- Notice the Power Law nature of the quantized distribution is leveled out.

Logarithmic Font Distribution w/ 8 Fonts

Here are the two graphs overlaid to see the difference -- again with the logarithmic data being transparent.

Normal & Logarithmic Font Distribution

Finally, let's take a look at the new logarithmic fontsize distribution. The inverse Power Law effect is still present, but reduced as shown by a more even distribution throughout all of the sizes.

Logarithmic Font Frequency

And here are the two graphs overlaid again:

Normal & Logarithmic Font Frequency

The differences between linear and logarithmic thresholding may be too subtle to bother with, but I thought I would also publish the following 4 tag clouds for comparison. The first and third use the logarithmic distribution, and the second and fourth use the linear distribution. The first two are ordered by frequency and the last two by alphabetization.

UPDATE 5-24-06: Here is the source data including term id, term name & frequency

Below is the tag cloud ordered by frequency using the logarithmic fontsize distribution algorithm with 8 font sizes (10px to 24px) & "Span Style Font Size" code:

New Media | Website | PR | Status | Collaboration | Drupal | Journalism | Transparency | Theory | Decentralization | Echo Chamber Project | Open Source | Film | Blog | Interview | Political | Worldview | Communications | Conference | Folksonomy | Media Criticism | Volunteer | Dialogue | International Law | Rosen | Evolution | Kent Bye | Objectivity | Plante | ToDo | Advisor | Civics | Roadmap | Wilber | About | CivicSpace | Ecosystem | Choice | Murphy | Sociology | ACH | del.icio.us | Intelligence Analysis | Science | Credibility | Distribution | Diversity | Errors | Final Cut Pro | Fundraising | Law | Philosophy of Science | Podcast | Political Bias | Activism | Analysis | CBS | Deception Detection | Editing | History | RSS | Social | Subjectivity | Vlog | ABC | AL Tubes | Economics | FCC | NYT | Sirota | Sundance | Training | Wiki | XML | Borger | Brody | Deliberation | EcoVillage | Identity | LAMP | Lobe | Maine | May | Media Logic | Metaphor | Mitchell | NBC | OHanlon | Psychology | Queen | Software | Spiral Dynamics | Strobel | Sustainability | Transcripts | Brown | Buddhism | Community | Digital Divide | Donnelly | Education | Fair Use | FireANT | Google | Human Rights | KM | Kwiatkowski | Landay | Loiseau | Math | Music | Nature | Schechter | Screencast | Sivaraksa | Skype | Social Capital | Tag Cloud | Thielmann | Thomas | Tiger | Wedgwood |

Below is the tag cloud ordered by frequency using the linear fontsize distribution algorithm with 6 font sizes (1 to 6) & "HTML Font Size" code:

New Media | Website | PR | Status | Collaboration | Drupal | Journalism | Transparency | Theory | Decentralization | Echo Chamber Project | Open Source | Film | Blog | Interview | Political | Worldview | Communications | Conference | Folksonomy | Media Criticism | Volunteer | Dialogue | International Law | Rosen | Evolution | Kent Bye | Objectivity | Plante | ToDo | Advisor | Civics | Roadmap | Wilber | About | CivicSpace | Ecosystem | Choice | Murphy | Sociology | ACH | del.icio.us | Intelligence Analysis | Science | Credibility | Distribution | Diversity | Errors | Final Cut Pro | Fundraising | Law | Philosophy of Science | Podcast | Political Bias | Activism | Analysis | CBS | Deception Detection | Editing | History | RSS | Social | Subjectivity | Vlog | ABC | AL Tubes | Economics | FCC | NYT | Sirota | Sundance | Training | Wiki | XML | Borger | Brody | Deliberation | EcoVillage | Identity | LAMP | Lobe | Maine | May | Media Logic | Metaphor | Mitchell | NBC | OHanlon | Psychology | Queen | Software | Spiral Dynamics | Strobel | Sustainability | Transcripts | Brown | Buddhism | Community | Digital Divide | Donnelly | Education | Fair Use | FireANT | Google | Human Rights | KM | Kwiatkowski | Landay | Loiseau | Math | Music | Nature | Schechter | Screencast | Sivaraksa | Skype | Social Capital | Tagcloud | Thielmann | Thomas | Tiger | Wedgwood

Below is the alphabetized tag cloud using the logarithmic fontsize distribution algorithm with 8 font sizes (10px to 24px) & "Span Style Font Size" code:

ABC | About | ACH | Activism | Advisor | AL Tubes | Analysis | Blog | Borger | Brody | Brown | Buddhism | CBS | Choice | Civics | CivicSpace | Collaboration | Communications | Community | Conference | Credibility | Decentralization | Deception Detection | del.icio.us | Deliberation | Dialogue | Digital Divide | Distribution | Diversity | Donnelly | Drupal | Echo Chamber Project | Economics | Ecosystem | EcoVillage | Editing | Education | Errors | Evolution | Fair Use | FCC | Film | Final Cut Pro | FireANT | Folksonomy | Fundraising | Google | History | Human Rights | Identity | Intelligence Analysis | International Law | Interview | Journalism | Kent Bye | KM | Kwiatkowski | LAMP | Landay | Law | Lobe | Loiseau | Maine | Math | May | Media Criticism | Media Logic | Metaphor | Mitchell | Murphy | Music | Nature | NBC | New Media | NYT | Objectivity | OHanlon | Open Source | Philosophy of Science | Plante | Podcast | Political | Political Bias | PR | Psychology | Queen | Roadmap | Rosen | RSS | Schechter | Science | Screencast | Sirota | Sivaraksa | Skype | Social | Social Capital | Sociology | Software | Spiral Dynamics | Status | Strobel | Subjectivity | Sundance | Sustainability | Tag Cloud | Theory | Thielmann | Thomas | Tiger | ToDo | Training | Transcripts | Transparency | Vlog | Volunteer | Website | Wedgwood | Wiki | Wilber | Worldview | XML

Below is the alphabetized tag cloud using the linear fontsize distribution algorithm with 6 font sizes (1 to 6) & "HTML Font Size" code:

ABC | About | ACH | Activism | Advisor | AL Tubes | Analysis | Blog | Borger | Brody | Brown | Buddhism | CBS | Choice | Civics | CivicSpace | Collaboration | Communications | Community | Conference | Credibility | Decentralization | Deception Detection | del.icio.us | Deliberation | Dialogue | Digital Divide | Distribution | Diversity | Donnelly | Drupal | Echo Chamber Project | Economics | Ecosystem | EcoVillage | Editing | Education | Errors | Evolution | Fair Use | FCC | Film | Final Cut Pro | FireANT | Folksonomy | Fundraising | Google | History | Human Rights | Identity | Intelligence Analysis | International Law | Interview | Journalism | Kent Bye | KM | Kwiatkowski | LAMP | Landay | Law | Lobe | Loiseau | Maine | Math | May | Media Criticism | Media Logic | Metaphor | Mitchell | Murphy | Music | Nature | NBC | New Media | NYT | Objectivity | OHanlon | Open Source | Philosophy of Science | Plante | Podcast | Political | Political Bias | PR | Psychology | Queen | Roadmap | Rosen | RSS | Schechter | Science | Screencast | Sirota | Sivaraksa | Skype | Social | Social Capital | Sociology | Software | Spiral Dynamics | Status | Strobel | Subjectivity | Sundance | Sustainability | Tagcloud | Theory | Thielmann | Thomas | Tiger | ToDo | Training | Transcripts | Transparency | Vlog | Volunteer | Website | Wedgwood | Wiki | Wilber | Worldview | XML

Distribution Algorithm

Hi
Very interesting, thanks

problem with logarithmic calculations in algorithm

Hi,
As a post indicated above, there is a problem with the logarithmic calculation.
The problem is that the value must not be converted to a logarithmic value a second time, but should rather be mapped in a linear manner onto the logarithmic range.

Example:
if the logarithmic range becomes (47-1000), then the value should be moved from the range (1-53) to the range (47-1000) in a *linear manner*.
Applying the log function a second time is wrong, since it "fixes" the value to the logarithmic range and the comparison becomes linear again. Hence there are no differences.

// For logarithmic:
instead of:
If(100 * log[tid_Count(tid_loop) + 2] <= Threshold(fontsize_loop)) then

use:
tagWeight = ( ( linearTagWeight - minimumWeight ) / linearRange ) * logRange + thresholds[ 0 ]
If( tagWeight <= Threshold(fontsize_loop)) then
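
One possible reading of the tagWeight formula above, sketched in Python. This is an interpretation, not the commenter's actual code: I am assuming linearTagWeight is the raw tag count, minimumWeight is the smallest count, linearRange is maximum minus minimum count, logRange is the width of the transformed range, and thresholds[0] is the lower end of that range (47.71 for the data in the post). The function name rescale_to_log_range is hypothetical.

```python
import math


def rescale_to_log_range(count, min_count, max_count):
    """Linearly map a raw count from [min_count, max_count] onto the
    range [100*log10(min_count + 2), 100*log10(max_count + 2)]."""
    log_min = 100 * math.log10(min_count + 2)   # assumed to be thresholds[0]
    log_max = 100 * math.log10(max_count + 2)
    linear_range = max_count - min_count
    log_range = log_max - log_min
    return (count - min_count) / linear_range * log_range + log_min


for count in (1, 27, 53):
    print(count, round(rescale_to_log_range(count, 1, 53), 2))
```

Under these assumptions the smallest count lands on 47.71, the largest on 174.04, and a count halfway between them lands halfway through the transformed range, which is what a linear move from one range to the other means.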

tagWeight?

Hi,

can you just explain a little more how you calculate the tagWeight?

TIA
Pepino

Stumbled on this algorithm

Stumbled on this algorithm while googling, but found some inconsistencies in your explanation/charts. When you move to the example with 8 fontsizes, it seems like you forget to add the MINIMUM[tid_Count(1:#tid)] during the threshold calculation.

Based on your example/pseudocode for the 6 font sizes, the thresholds for 8 font sizes (Delta = 52/8 = 6.5) should be:

Threshold(1) = (1 + (1 * 6.5)) = 7.5
Threshold(2) = (1 + (2 * 6.5)) = 14
Threshold(3) = ....

rather than:

Threshold(1) = 6.5
Threshold(2) = 13
Threshold(3) = ...

Not a big deal, but then I noticed that your Power Law alternative pseudocode differs from the results in your graph.

You state the new Power Law Formula for calculating thresholds should be:

Loop Fontsize = 1 to #FontSizes
Threshold(Fontsize) = 100 * log [{MINIMUM[tid_Count(1:#tid)] + (FontSize) * Delta} + 2]
End Loop

Based on your dataset (and that the curly braces simply denote grouping), the thresholds should look something like:

Delta = 52/8 = 6.5

Threshold(0) = 100 * log(1 + (0 * 6.5) + 2) = 47
Threshold(1) = 100 * log(1 + (1 * 6.5) + 2) = 97.7
Threshold(2) = 100 * log(1 + (2 * 6.5) + 2) = 120.411998266
Threshold(3) = 135.218251811
Threshold(4) = 146.23979979
Threshold(5) = 155.022835306
Threshold(6) = 162.32492904
Threshold(7) = 168.57417386
Threshold(8) = 174.036268949

....

Instead in your graph, you have:
Threshold(0) = 47
Threshold(1) = 62.79
Threshold(2) = 78.58
Threshold(3) = 94.37
Threshold(4) = 110.16
Threshold(5) = 125.95
Threshold(6) = 141.74
Threshold(7) = 157.53
Threshold(8) = 173.32

Am I reading your pseudocode wrong? By using your pseudocode, there is no difference between the two types of distributions, so I'm curious as to how you arrived at the set of numbers in the graph.
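
The point raised here can be checked numerically. The Python sketch below compares three bucketing schemes for every count from 1 to 53 with 8 font sizes: the linear thresholds, the logarithmic pseudocode exactly as written in the post, and thresholds spaced at equal widths across the transformed range. The third scheme is my guess at how the graph values were produced, based only on the fact that they are 15.79 apart (62.79 - 47 = 15.79, roughly (174.04 - 47.71) / 8); it is not confirmed by the post. All function and variable names are my own.

```python
import math

MIN_COUNT, MAX_COUNT, NUM_SIZES = 1, 53, 8
DELTA = (MAX_COUNT - MIN_COUNT) / NUM_SIZES  # 6.5


def log_weight(value):
    return 100 * math.log10(value + 2)


def first_bucket(value, thresholds):
    """Return the 1-based index of the first threshold that covers value."""
    for size, threshold in enumerate(thresholds, start=1):
        if value <= threshold:
            return size
    return len(thresholds)  # guard against float rounding at the top end


# Scheme 1: linear thresholds on the raw counts.
linear = [MIN_COUNT + i * DELTA for i in range(1, NUM_SIZES + 1)]

# Scheme 2: the post's pseudocode, i.e. the log of each linear threshold.
log_as_written = [log_weight(t) for t in linear]

# Scheme 3 (assumption): equal-width steps across the transformed range.
log_lo, log_hi = log_weight(MIN_COUNT), log_weight(MAX_COUNT)
step = (log_hi - log_lo) / NUM_SIZES
log_equal_width = [log_lo + i * step for i in range(1, NUM_SIZES + 1)]

counts = range(MIN_COUNT, MAX_COUNT + 1)
linear_sizes = [first_bucket(c, linear) for c in counts]
as_written_sizes = [first_bucket(log_weight(c), log_as_written) for c in counts]
equal_width_sizes = [first_bucket(log_weight(c), log_equal_width) for c in counts]

print(linear_sizes == as_written_sizes)   # True: same buckets as linear
print(linear_sizes == equal_width_sizes)  # False: buckets actually change
```

Because the logarithm is monotonic, transforming both the count and the threshold leaves every comparison unchanged, so the pseudocode as written matches the linear result exactly. Only when the thresholds themselves are spaced evenly in the transformed range do small counts get spread over more font sizes; for example, a count of 3 moves from size 1 to size 2.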

TagLines

I made this: "TagLines", similar to TagCloud but with AJAX(like Google Maps)

http://www.francisshanahan.com/taglines/default.aspx?cat=All

it's an auto-folksonomy tool built using the Term Extraction APIs.
It combines Ajax with Folksonomies and Yahoo Web Services to allow you
to search on RSS feeds, News, Movies, Images or just obtain the
original story, all without a page-refresh.

Would love to get some feedback on it or suggestions for improvement.
regards,
-fs