PHP Tag Cloud Remove Common Words
Friday, July 11th, 2008
If you have an application that allows a user to enter arbitrary tags for an entity, you might want to filter their tag input based on certain criteria. This tutorial assumes that you will be working with a space-separated list of tags, e.g. from a form input field. If your tag input is coming from an array, you can try $tags = implode(' ', $tag_array); to prepare it for the rest of the code presented here.
Firstly, we’ll cover the tag filter. A list of the most common English words was compiled by merging data from the following resources:
http://en.wikipedia.org/wiki/Most_common_words_in_English
http://www.askoxford.com/oec/mainpage/oec02/?view=uk
http://esl.about.com/library/vocabulary/bl1000_list1.htm
http://www.deafandblind.com/word_frequency.htm
We can look at the whole list later for reference, but to make an effective tag filter, many of the words had to be removed by hand. Additionally, this script disregards words with fewer than three letters, so those were removed as well. Many search applications, including the one that inspired this code, won’t index words that short. Here is a space-separated list of the common English-language words to be filtered:
able about after again all also and any are bad been before being between but came can cause change come could did differ different does don't down each end even every far few for form found four from get good great had has have her here him his how into its just keep let many may might more most much must near need never new next not now off one only other our out over part put said same say seem set should side some still such take than that the their them then there these they thing this three through too two upon use very was way went were what when where which while who will with would you your
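If you would rather keep the list as a space-separated string than type out the array by hand, something along these lines should build the filter array from it. This is only a sketch: the variable name $common_words is an assumption for illustration, and the list is shortened to a handful of words.

```php
<?php
// Shortened stand-in for the full space-separated list above (hypothetical variable name).
$common_words = "able about after don't you your";

// Pull out runs of three or more letters, the same way the tags themselves get extracted,
// so "don't" yields "don" (the lone "t" is too short to match).
preg_match_all('/[a-zA-Z]{3,}/', $common_words, $found);

// Lowercase and de-duplicate; array_values() renumbers the keys from 0.
$tag_filter = array_values(array_unique(array_map('strtolower', $found[0])));

print_r($tag_filter);
```

Extracting the words with the same pattern that is applied to the tags keeps the two lists comparable, which matters for words containing apostrophes.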
Let’s use the following example for a space-separated tag string:
fluffy is freeze you Rocket don't cute boulder fry
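The following PHP code will filter out the common words as well as words that contain fewer than three letters. It only pulls strings of alphabetic characters, and additionally converts all tags to lowercase. Because an apostrophe is not a letter, don't is split into don and t, which is why the filter array lists don rather than don't: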
$tag_filter = array('able', 'about', 'after', 'again', 'all', 'also', 'and', 'any', 'are', 'bad', 'been', 'before', 'being', 'between', 'but', 'came', 'can', 'cause', 'change', 'come', 'could', 'did', 'differ', 'different', 'does', 'don', 'down', 'each', 'end', 'even', 'every', 'far', 'few', 'for', 'form', 'found', 'four', 'from', 'get', 'good', 'great', 'had', 'has', 'have', 'her', 'here', 'him', 'his', 'how', 'into', 'its', 'just', 'keep', 'let', 'many', 'may', 'might', 'more', 'most', 'much', 'must', 'near', 'need', 'never', 'new', 'next', 'not', 'now', 'off', 'one', 'only', 'other', 'our', 'out', 'over', 'part', 'put', 'said', 'same', 'say', 'seem', 'set', 'should', 'side', 'some', 'still', 'such', 'take', 'than', 'that', 'the', 'their', 'them', 'then', 'there', 'these', 'they', 'thing', 'this', 'three', 'through', 'too', 'two', 'upon', 'use', 'very', 'was', 'way', 'went', 'were', 'what', 'when', 'where', 'which', 'while', 'who', 'will', 'with', 'would', 'you', 'your', );
$tags = 'fluffy is freeze you Rocket don\'t cute boulder fry';
preg_match_all('/([a-zA-Z]{3,})/', $tags, $matches);
$matches[0] = array_map('strtolower', $matches[0]);
$tags = array_diff($matches[0], $tag_filter);
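The $tags array would then be filtered and lowercased, producing the following output with print_r($tags). Note the gaps in the keys: array_diff() preserves the keys of its first argument, so the positions of the removed words you and don are simply missing: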
Array
(
[0] => fluffy
[1] => freeze
[3] => rocket
[5] => cute
[6] => boulder
[7] => fry
)
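If you want a cleanly numbered list, or want to reuse the logic in more than one place, a small helper along these lines is one option. It is a minimal sketch rather than part of the original script: the function name filter_tags and its parameters are assumptions, the stop-word list is shortened, and the type declarations require PHP 7 or later.

```php
<?php
/**
 * Hypothetical helper: extract lowercase tags of three or more letters
 * from a string and drop any that appear in $stop_words.
 */
function filter_tags(string $input, array $stop_words): array
{
    // Collect runs of three or more ASCII letters; everything else acts as a separator.
    preg_match_all('/[a-zA-Z]{3,}/', $input, $matches);

    $candidates = array_map('strtolower', $matches[0]);

    // array_diff() and array_unique() both keep the original keys,
    // so array_values() is used to renumber the result from 0.
    return array_values(array_unique(array_diff($candidates, $stop_words)));
}

// Shortened stop-word list, just for this example.
$stop_words = array('you', 'don', 'the');

$tags = filter_tags("fluffy is freeze you Rocket don't cute boulder fry", $stop_words);
print_r($tags);
```

The array_unique() call is an addition that also drops tags a user typed twice; remove it if duplicates should be kept.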
If you need to convert it from the array back to a space-separated string, try $tags = implode(' ', $tags);. You may also, of course, add more words to the word list, which could come in handy for other application-specific functions such as profanity filters.
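As a rough sketch of both ideas, the snippet below merges a few extra words into the filter and then joins the surviving tags back into a string. The variable $extra_words and the sample words in it are placeholders, and the common-word list is shortened for brevity.

```php
<?php
// Shortened common-word list, for illustration only.
$tag_filter = array('the', 'you', 'don');

// Hypothetical application-specific additions (placeholder words).
$extra_words = array('spam', 'test');
$tag_filter = array_unique(array_merge($tag_filter, $extra_words));

$tags = 'the fluffy spam Rocket test';
preg_match_all('/[a-zA-Z]{3,}/', $tags, $matches);
$tags = array_diff(array_map('strtolower', $matches[0]), $tag_filter);

// Join the remaining tags back into a space-separated string.
$tags = implode(' ', $tags);
echo $tags, "\n";
```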
Here is the full list of common words merged from the resources listed above, separated by spaces, including those with fewer than three letters:
a able about act add after again air all also am an and animal answer any are as ask at back bad be been before being between big boy build but by call came can case cause change child city close come company could country cover cross day did differ different do does don't down draw each early earth end even every eye fact far farm father feel few find first follow food for form found four from get give go good government great group grow had hand hard has have he head help her here high him his home hot house how i if important in into is it its just keep kind know land large last late learn leave left let life light like line little live long look low made make man many may me mean men might more most mother move mr mrs much must my name near need never new next night no north not now number of off office old on one only or other our out over own page part people person picture place plant play point port press problem public put read real right round run said same saw say school sea see seem self sentence set she should show side small so some sound spell stand start state still story study such sun take tell than that the their them then there these they thing think this thought three through time to too tree try turn two under up upon us use very want was water way we week well went were what when where which while who why will with woman word work world would write year you young your