I mentioned the other day that I've been tasked at work with building an Intranet search engine. I work as a consultant at Wells Fargo, which has a lot of departments, and I build database-driven web sites for them. A department came to my group recently and said they needed a search engine for the nearly 100 PDFs on their Intranet site. Small task, but I knew that if we built it right, other groups would ask for the same thing. We delivered the prototype and others are lining up. Very cool.
Lots of thought goes into a project like this. Given a cool puzzle, I find my head turning it around in my off-hours as well. What goes into a search engine?
First, because we're talking about many large PDFs, I need to convert them to text, because the PDF file itself looks like this:
ÈÍj?ãû&ÈçÇCOÕ5Õë¹Aá;
It's not searchable unless you're in Adobe Acrobat and viewing the document. I found a little conversion tool at VeryPDF. It does batch conversions, so first problem solved.
With the text in hand, it's time to consider how people search. Take this sentence, for example:
The Home Mortgage Assessment Program(TM), developed by Judy Edwards, addresses the needs of all borrowers and co-borrowers alike.
The engine can't just store all of the text and then allow phrase searches only. If it did, a user who knew it was a mortgage program of some sort and typed in "mortgage program" would miss it, because the actual phrase is "Mortgage Assessment Program." So the search has to work word by word.
Simple search engines on some web sites just store all of the words in a document into a big table of words and then return everything with either "mortgage" or "program" in it - not necessarily demanding that both are in it. So the user could be browsing documents with only the word "mortgage" in them and not "program." That's annoying, and wasteful of the user's time.
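Here's a rough sketch of the difference in Python. The language is picked only for illustration, and the function names and the two sample documents are made up for the example. It builds a word index and compares an "any word" search with an "every word" search.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search_any(index, query):
    """OR search: documents containing at least one query word."""
    hits = set()
    for word in query.lower().split():
        hits |= index.get(word, set())
    return hits

def search_all(index, query):
    """AND search: documents containing every query word."""
    words = query.lower().split()
    if not words:
        return set()
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return hits

# Hypothetical sample documents, invented for this example.
docs = {
    "a.pdf": "home mortgage assessment program",
    "b.pdf": "mortgage rates for this quarter",
}
index = build_index(docs)
print(search_any(index, "mortgage program"))  # both documents match
print(search_all(index, "mortgage program"))  # only a.pdf matches
```

The "every word" version is just a set intersection, so it costs little more than the sloppy version and spares the user the documents that only mention "mortgage."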
Next, the words themselves. Take the word "Program" in my example sentence. It's not just "Program" - it's "Program(TM)." So the words have to be analyzed and reduced to their true intent.
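A minimal sketch of that kind of cleanup, in Python, might look like the following. The list of marks to strip and the name `normalize` are assumptions for the example, not a description of the finished engine.

```python
import re

# Assumed trademark-style marks to drop; the list is illustrative only.
MARKS = re.compile(r"\((?:tm|r|c)\)|[™®©]", re.IGNORECASE)

def normalize(token):
    """Reduce a raw token such as 'Program(TM),' to a bare lowercase word."""
    token = MARKS.sub("", token)
    # Trim punctuation from both ends; inner hyphens and apostrophes survive.
    token = re.sub(r"^\W+|\W+$", "", token)
    return token.lower()

sentence = ("The Home Mortgage Assessment Program(TM), developed by Judy "
            "Edwards, addresses the needs of all borrowers and co-borrowers alike.")
words = [normalize(t) for t in sentence.split()]
words = [w for w in words if w]  # drop tokens that were nothing but marks
print(words)
```

With this, "Program(TM)," is indexed as plain "program," so a search for "program" finds it with an exact match.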
Further, the database will return results faster with exact match searches than with wildcard searches. English is a good mountain to climb with that goal in mind. Let's say that a user types "short-term loans" into the search engine. Will all documents have "short-term" hyphenated? Probably not. And even though they typed in "loans," they'll want to find hits for "loan" as well. Today, I find myself researching the grammar rules for pluralization and for possessive nouns so that the engine can "de-plural" words and users can get the best search results possible.
In my geeky way, I love this project.
Rule #1: The plural of nouns is usually formed by adding -s to a singular noun.
Rule #2: Nouns ending in s, z, x, sh, and ch form the plural by adding -es.
Rule #3: Nouns ending in -y preceded by a consonant form the plural by changing -y to -ies.
Rule #4: Nouns ending in -y preceded by a vowel form the plural by adding -s.
Rule #5: Most nouns ending in -o preceded by a consonant form the plural by adding -es.
Rule #6: Some nouns ending in -f or -fe form the plural by changing -f or -fe to -ves.
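Running those six rules backwards gives a first cut at "de-pluraling." The Python below is a naive sketch under that assumption, not a finished stemmer. The name `singularize` and the tiny `VES_EXCEPTIONS` table are invented for the example.

```python
# Illustrative exceptions where -ves comes from -fe rather than -f.
VES_EXCEPTIONS = {"knives": "knife", "wives": "wife"}

def singularize(word):
    """Naively undo the six plural rules above."""
    w = word.lower()
    if w in VES_EXCEPTIONS:
        return VES_EXCEPTIONS[w]
    # Leave short words and non-plurals like "glass" alone.
    if len(w) < 4 or not w.endswith("s") or w.endswith("ss"):
        return w
    if w.endswith("ies"):            # -y after a consonant became -ies
        return w[:-3] + "y"
    if w.endswith("ves"):            # -f became -ves
        return w[:-3] + "f"
    if w.endswith(("ses", "zes", "xes", "shes", "ches")):
        return w[:-2]                # -es after s, z, x, sh, ch
    if w.endswith("oes"):            # -es after consonant + o
        return w[:-2]
    return w[:-1]                    # plain -s

for plural in ["loans", "boxes", "companies", "days", "heroes", "leaves"]:
    print(plural, "->", singularize(plural))
```

The rules are ambiguous in reverse, so this gets some words wrong: "houses" comes out as "hous" and "movies" as "movy." That is why a list of exceptions has to sit alongside the rules.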
On Tuesday, I researched the 1,000 most common words in English. The search engine won't be indexing prepositions, adverbs, articles, conjunctions, and words like those.
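Skipping those words is a simple filter at indexing time. Here's a small Python sketch. The `STOP_WORDS` set is a tiny made-up sample, not the real list, which would come from something like the most-common-words research.

```python
# A tiny illustrative stop list: a few articles, conjunctions, prepositions.
STOP_WORDS = {
    "a", "an", "the",
    "and", "or", "but",
    "of", "by", "in", "on", "to", "for", "with",
}

def indexable(words):
    """Keep only the words worth storing in the index."""
    return [w for w in words if w.lower() not in STOP_WORDS]

words = "the needs of all borrowers and co-borrowers alike".split()
print(indexable(words))
```

The same filter has to be applied to the user's query, otherwise a search for "the loan" would demand a word the index never stored.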
And at the moment, I'm studying names... such as "Wells," like in "Wells Fargo." Here are some common names:
SMITH
If I de-plural normal words, I have to be careful not to de-plural names, such as Williams or Davis or Thomas or Jones, which end in -s.
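One way to handle that is to check a list of protected names before de-pluraling. This is a Python sketch under that assumption. `PROTECTED_NAMES` holds only the names mentioned above, and `strip_plural` is a deliberately crude stand-in for the real de-plural step.

```python
# Hypothetical protected list, seeded with the names mentioned in the post.
PROTECTED_NAMES = {"wells", "williams", "davis", "thomas", "jones"}

def strip_plural(word):
    """Stand-in de-pluralizer for this example: drops one trailing 's'."""
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def index_form(word):
    """Return the form of the word to store in the index."""
    w = word.lower()
    if w in PROTECTED_NAMES:
        return w  # leave names such as "Jones" untouched
    return strip_plural(w)

for word in ["Wells", "Jones", "loans"]:
    print(word, "->", index_form(word))
```

The trade-off is that an ordinary word that happens to match a name, like "wells" in the water sense, also stays plural.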
Once the words are broken down and indexed, it's time to create the search engine results given to the user. How do you decide which document gets listed at the top of the results?
If I have a document in front of me, I know what's most important to it by what appears first. The main idea is generally given first and then repeated throughout the document. So I'm giving a weight to the words in a document in their order of appearance. If the word "loan" appears 8 times in the upper half of a document, but appears 12 times in only the last part of another document, the first document will get listed first.
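To show the idea, here's a small Python sketch of position-based weighting. The linear decay formula is an assumption chosen for illustration (the exact weighting isn't spelled out here), and the two 100-word documents are synthetic.

```python
def position_score(words, term):
    """Sum a weight per occurrence of term; earlier occurrences weigh more.

    Assumed formula: weight is 1.0 at the first word and falls off
    linearly toward 0 at the last word.
    """
    n = len(words)
    return sum(1.0 - i / n for i, w in enumerate(words) if w == term)

# Two synthetic 100-word documents, invented for this example.
early = ["loan"] * 8 + ["filler"] * 92   # 8 hits at the very top
late = ["filler"] * 88 + ["loan"] * 12   # 12 hits at the very end

ranked = sorted(
    {"early.pdf": early, "late.pdf": late}.items(),
    key=lambda item: position_score(item[1], "loan"),
    reverse=True,
)
print([name for name, _ in ranked])
```

With these two documents, the eight early hits outscore the twelve late ones, so the first document is listed first. How steep the decay should be is a tuning question that only real documents can answer.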
Further, people will want to see the context of their search words in the matching documents. So I plan to grab 50 words to the right and 50 words to the left of the first occurrence of the word and display that, much like Google does.
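Grabbing that window is mostly list slicing. Here's a Python sketch. The function name `snippet` and the `radius` parameter are made up for the example, with 50 as the default to match the plan above.

```python
def snippet(words, term, radius=50):
    """Return up to `radius` words on each side of the first hit, or None."""
    try:
        hit = words.index(term)
    except ValueError:
        return None  # the term does not occur in this document
    start = max(0, hit - radius)  # don't run off the front of the document
    return " ".join(words[start:hit + radius + 1])

words = "the home mortgage assessment program addresses the needs of all borrowers".split()
print(snippet(words, "program", radius=2))
```

Slicing past the end of a Python list is harmless, so only the left edge needs the explicit `max(0, ...)` guard.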
I love puzzles like these. I released a prototype yesterday that had everything but the de-pluralization and users are beating up on it. Good response thus far. We'll see how the finished product turns out.
ETC: Found a cool word search engine that does pattern matching, so if you like crossword puzzles and a word is giving you a fit, go here. Do the "Common words only" search.
I was using it to find every plural "*ves" word that came from a singular "*f" word. Found them all!