
Indexing files

Ben Okopnik [ben at linuxgazette.net]


Tue, 3 Feb 2009 21:59:16 -0500

[ Karl, I hope you don't mind me copying this exchange to The Answer Gang; I'd like for the error that you pointed out to be noted in our next issue, and this is the best and easiest way to do it. If you have any further replies, please CC them to 'tag@lists.linuxgazette.net'. ]

On Tue, Feb 03, 2009 at 02:59:57PM -0500, Karl Vogel wrote:

> Very cool follow-up article!

Thanks, Karl; I appreciate that. That's a very, very fun program - again, thanks for introducing me to it!

> >> In a previous message, you unhesitatingly continued with this missive:
> 
> B> In practice, I've found that indexing HTML files with either "-ft" or
> B> "-fh" leads to exactly the same results - i.e., a working index for all
> B> the content - and so now I lump both of the above under "-ft".
> 
>    The display is different in the web interface.  I indexed the same small
>    collection of HTML files as both plain text and HTML, and then looked for
>    "samba troubleshooting".
> 
>    Search for the phrase when indexed as plain text:
>      http://localhost/search/plain/estseek.cgi?phrase=samba+troubleshooting
> 
>    Command used to index:
>      estcmd gather -sd -ft plain /tmp/searchlmaiHg
> 
>    Display:
>      SAMBA_Troubleshooting.htm 24428
>        <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN"> <html> <head> <meta
>        name="generator" content=" ...  me="Generator" content="Microsoft
>        Word 97"> <title>Troubleshooting Log for VOS Samba</title> </head>
>        <body link="#000 ...  /font></b></p> <p align="CENTER"><b><font
>        size="6">Samba</font></b></p> <p align="CENTER"><b><font size="6" ...
>        >Troubleshooting</font></b></p> <p align="CENTER"><b><font size="6"
>        ...  1516637">*</a></a></p> <p><a href="#_Toc531516638">Samba
>        Symptoms, Causes and Resolutions <a href="#_Toc531 ...
>      http://localhost/search/docs/SAMBA_Troubleshooting.htm - [detail]
> 
>    Click the "details" link and check the attributes:
>      @type: text/plain
>
>    Now do a search for the same phrase when indexed as HTML:
>      http://localhost/search/hyper/estseek.cgi?phrase=samba+troubleshooting
> 
>    Command used to index:
>      estcmd gather -sd -fh hyper /tmp/searchlmaiHg
> 
>    Display:
>      Troubleshooting Log for VOS Samba 25592
>        Samba Troubleshooting Guide Version 2.0.7 Paul Green May 22, 2002 -
>        2001, 2002 Paul Green.  Permi ...  ree Documentation License".  Contents
>        Terminology * Samba Symptoms, Causes and Resolutions * Introduction
>        * ...  and Editing Host Files from a PC * Miscellaneous * Samba Web
>        Access Tool (SWAT) * Troubleshooting * GNU Fre ...  t that I am unable
>        to offer personal assistance in troubleshooting specific problems.
>        Installation This section lists ...  hat arise during installation
>        and configuration of Samba.  Symptom: Cannot add a new HOST machine
>        to an NT D ...
>      http://localhost/search/docs/SAMBA_Troubleshooting.htm - [detail]
> 
>    Click the "details" link and check the attributes:
>      @type: text/html

I just ran a careful, step-by-step manual retest of the above, and you're absolutely right. I must have lost track of what I did during which test - it does indeed make a difference.

On the other hand - please bear with me while I think "out loud" about this - since the only place that difference shows up is in the cited "hit context" paragraphs in Hyperestraier and not in the content itself, I'm not sure how much extra effort this deserves. In order to make that small change - i.e., not have the HTML markup appear in the cited paragraph, which only shows up for a second or so during the process - you'd have to split the files into two streams, index each of them individually, then do 'extkeys/optimize/purge' on both... pretty much double the processing time and seriously increase the complexity of the build script. Doesn't seem like much of a payback for a whole lot of work.

I suppose you could use "-fx" to keep the modifications really simple: just add something like '-fx htm* H@"lynx -dump -nolist"' to the "estcmd gather" line... but if you're going to do that, you might as well set up processing for all the other "interesting" types of files: PDFs, RTFs, OpenOffice files, etc. (I was going to write about that, too, but figured it would become too complex at that point.) I guess it's a question of deciding where the cutoff point is and building the indexer to reflect that.
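Just to make that concrete, the whole line would look something like this - untested, with 'casket' and the document directory as stand-ins, and since lynx emits plain text the filter prefix probably wants to be T@ rather than H@:

  estcmd gather -sd -cl -fx ".htm,.html" "T@lynx -dump -nolist" casket /var/www/docs

If it turns out that estcmd hands its "-fx" filter an input filename and an output filename (the way the bundled estfx* scripts work) instead of reading the filter's stdout, a two-line wrapper script around 'lynx -dump -nolist' would take care of that.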

Overall, I don't think that doing major hackery just to fix the context paragraph is worthwhile. For myself, I'm going to leave it just as it is until I decide to start processing the other filetypes.

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *





Thu, 5 Feb 2009 13:08:06 -0500 (EST)

>> On Tue, 3 Feb 2009 21:59:16 -0500, 
>> Ben Okopnik <ben@linuxgazette.net> said:

B> On the other hand - please bear with me while I think "out loud" about
B> this - since the only place that difference shows up is in the cited
B> "hit context" paragraphs in Hyperestraier and not in the content
B> itself, I'm not sure how much extra effort this deserves.

Yup, this only starts to matter if you're searching lots of different filetypes. I was trying to index as much content on a fileserver as I could, to assist in records-office searches.

B> [...] you'd have to split the files into two streams, index each of
B> them individually, then do 'extkeys/optimize/purge' on both.

No, the extkeys/etc stuff only has to be done once if you're building one index to hold more than one type of files.

B> I suppose you could use "-fx" to keep the modifications really simple:
B> just add something like '-fx htm* H@"lynx -dump -nolist"' to the
B> "estcmd gather" line... but if you're going to do that, you might as
B> well set up processing for all the other "interesting" types of files:
B> PDFs, RTFs, OpenOffice files, etc.

And this is where I found the memory problem mentioned in the original article, not to mention all sorts of MS/Adobe files which aren't handled well by rtf2txt, antiword, xls2csv, and pdftotext. I finally had to resort to running "strings" on lots of things and hoping for the best. That's what the "locword" entry does in the example below.
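In case anyone wants to duplicate that: "locword" is just a local wrapper, not something you'll find in a package. Boiled down, it amounts to the sketch below, assuming the usual in-file/out-file calling convention for "-fx" filters:

  #!/bin/sh
  # locword (reconstructed sketch, not the real thing): pull whatever
  # printable text "strings" can find out of a binary Office file.
  # Assumes estcmd calls -fx filters as: filter <infile> <outfile>
  strings "$1" > "$2"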

The approach that worked best (failed least) was to run "file -i" on a fileset to get the MIME types, and then make a few passes through the resulting list to index what I could. Here's part of the script.

-- 
Karl Vogel                      I don't speak for the USAF or my company
--Washington Post "alternate definitions" contest

# --------------------------------------------------------------------------
# $ftype holds output from "file":
#
#   /tmp/something.xls| application/msword
#   /tmp/resume.pdf| application/pdf
#   /tmp/somedb.mdb| application/x-msaccess

opts="-cl -sd -cm -xh -cs 128"

(
    # -------------------------------------------------------------------
    # Plain text files.  The mimetypes file looks like this:
    #   application/x-perl
    #   application/x-shellscript
    #   message/news
    #   message/rfc822
    #   text/html
    #   text/plain
    #   text/rtf
    #   text/troff
    #   text/x-asm
    #   text/x-c
    #   text/x-mail
    #   text/x-news
    #   text/x-pascal
    #   text/x-tex
    #   text/xml

    logmsg starting plain text
    mimetypes='/usr/local/share/mime/plain-text'

    fgrep -f $mimetypes $ftype | cut -f1 -d'|' |
        estcmd gather $opts -ft $dbname -

    # -------------------------------------------------------------------
    # Word files

    logmsg starting Word
    exten=".doc,.msg,.xls,.xlw"

    grep 'application/msword' $ftype | cut -f1 -d'|' |
        estcmd gather $opts -fx "$exten" "T@locword" -fz $dbname -

    # -------------------------------------------------------------------
    # Access DBs

    logmsg starting Access
    exten=".mdb,.mde,.mdt,.use"

    grep 'application/x-msaccess' $ftype | cut -f1 -d'|' |
        estcmd gather $opts -fx "$exten" "T@locword" -fz $dbname -

    # -------------------------------------------------------------------
    # Excel files with different MIME type

    logmsg starting remaining Excel
    exten=".xls,.xlw"

    grep 'application/vnd.ms-excel' $ftype | cut -f1 -d'|' |
        estcmd gather $opts -fx "$exten" "T@locword" -fz $dbname -

    # -------------------------------------------------------------------
    # PDF files

    logmsg starting PDF
    exten=".pdf"

    grep 'application/pdf' $ftype | cut -f1 -d'|' |
        estcmd gather $opts -fx "$exten" "H@estfxpdftohtml" -fz $dbname -

    # -------------------------------------------------------------------
    # Index cleanup for searching.

    logmsg cleaning up index
    estcmd extkeys $dbname
    estcmd optimize $dbname
    estcmd purge -cl $dbname

) > BUILDLOG 2>&1
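Not shown above: building $ftype in the first place. One way to do it - the /export/docs path and the temp-file names are placeholders, and check whether your version of "file" has the -F separator option:

  fileset=/tmp/fileset.$$
  ftype=/tmp/ftype.$$

  # List the files of interest, then get "path| mime-type" lines in the
  # format the passes above expect.
  find /export/docs -type f > $fileset
  file -i -F '|' -f $fileset > $ftype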




Ben Okopnik [ben at linuxgazette.net]


Fri, 6 Feb 2009 19:30:21 -0500

On Thu, Feb 05, 2009 at 01:08:06PM -0500, Karl Vogel wrote:

> >> On Tue, 3 Feb 2009 21:59:16 -0500, 
> >> Ben Okopnik <ben@linuxgazette.net> said:
> 
> B> [...] you'd have to split the files into two streams, index each of
> B> them individually, then do 'extkeys/optimize/purge' on both.
> 
>    No, the extkeys/etc stuff only has to be done once if you're building
>    one index to hold more than one type of files.

You're right, of course.

> B> I suppose you could use "-fx" to keep the modifications really simple:
> B> just add something like '-fx htm* H@"lynx -dump -nolist"' to the
> B> "estcmd gather" line... but if you're going to do that, you might as
> B> well set up processing for all the other "interesting" types of files:
> B> PDFs, RTFs, OpenOffice files, etc.
> 
>    And this is where I found the memory problem mentioned in the original
>    article, not to mention all sorts of MS/Adobe files which aren't
>    handled well by rtf2txt, antiword, xls2csv, and pdftotext.  I finally
>    had to resort to running "strings" on lots of things and hoping for
>    the best.  That's what the "locword" entry does in the example below.

I did wonder about that. The way I saw it, trying to convert all the PDFs at once would really play hell on my poor underpowered laptop. :) So, I didn't actually go into indexing all the PDFs and such, although I did a couple of small test runs just to see what it would be like.

>    The approach that worked best (failed least) was to run "file -i" on a
>    fileset to get the MIME types, and then make a few passes through the
>    resulting list to index what I could.  Here's part of the script.

That certainly makes sense. I figured that for a simple indexing run, all you needed was a pipe of the sort I put together - but for anything more complicated, you'd need tempfiles, for exactly the reason you've stated (multiple passes.) I actually played around with that quite a bit ('tmp=`mktemp /tmp/searchXXXXXX`' plus 'trap "/bin/rm -rf $tmp" 0' are my friends!), and found it useful. [snipping script]
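In case that's too telegraphic for anyone, the skeleton looks like this (the 'find' is just a stand-in for whatever builds the file list):

  #!/bin/sh
  # Scratch file that's guaranteed to disappear when the script exits,
  # whether it finishes normally or gets interrupted.
  tmp=`mktemp /tmp/searchXXXXXX` || exit 1
  trap "/bin/rm -rf $tmp" 0

  # Pass 1: save the file list (or 'file -i' output, etc.) ...
  find /var/www/docs -type f > $tmp

  # Passes 2, 3, ...: reread $tmp as often as needed.
  grep -c . $tmp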

Thanks, Karl - I was actually going to write something like this for myself later. This will give me a good start on it; possibly, it'll be of help to any of our readers who have been following along with this.

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *





Sat, 7 Feb 2009 20:22:53 -0500 (EST)

>> On Fri, 6 Feb 2009 19:30:21 -0500,
>> Ben Okopnik <ben@linuxgazette.net> said:

B> The way I saw it, trying to convert all the PDFs at once would really
B> play hell on my poor underpowered laptop. :) So, I didn't actually go
B> into indexing all the PDFs and such, although I did a couple of small
B> test runs just to see what it would be like.

My ideal setup (not there yet, but I'm inching closer) is to have two distinct filetrees on a workstation or server. The first tree would be /, /usr, /src -- all the junk we know and love. The second tree (call it /shadow for now) would have drafts for most files under the first tree. (If you didn't see the first Estraier article, drafts -- or ".est" files -- are the guts of the system; they hold the stuff that's actually indexed.)

I don't much like databases for search/retrieval because they're not a really great fit. I don't like millions of tiny files, either; if you go down hard and have to run fsck, you not only have time for coffee, you can go to Colombia and pick the beans. My compromise looks like this:

* Create 256 directories under /shadow using hex digits 00-ff. Each directory has at most 256 zip files named the same way.

* Create a draft for any regular file of interest. One of the attributes in each draft will be the hash of the contents of the file being indexed. The MD5 hash of the filename without newline determines where the draft file will go. For example, we index /etc/motd like so:

me% echo /etc/motd | tr -d '\012' | md5sum
b3097c3f6cd13df91fac6e56735da0b6  -
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ <-- draft-filename
^^^^ <-- directory

me% md5sum /etc/motd
58d9f375623df94a2b26b0bcddb20e3d  /etc/motd

The file /shadow/b3/09.zip will hold a draft called 7c3f...b6.est. 7c3f...b6.est holds all the interesting stuff about /etc/motd: keywords, last modification time, and an attribute that holds a signature of the file contents:

@sig=58d9f375623df94a2b26b0bcddb20e3d

This way, we can go directly from any "interesting" file on the system to its corresponding draft by looking in no more than one zipfile, and the draft doesn't have to be updated or reindexed for searching unless the original file's contents have changed.
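Spelled out as a throwaway script - the /shadow root is from above, the rest of the plumbing is just one way to do it:

  #!/bin/sh
  # Hypothetical helper: given a filename, print where its draft lives
  # and what its content signature should be.
  file="$1"
  shadow=/shadow

  # MD5 of the *name* (no trailing newline) locates the draft ...
  namehash=`echo "$file" | tr -d '\012' | md5sum | cut -c1-32`
  dir=`echo $namehash   | cut -c1-2`      # e.g. b3
  zip=`echo $namehash   | cut -c3-4`      # e.g. 09
  draft=`echo $namehash | cut -c5-32`     # e.g. 7c3f...b6

  echo "draft: $shadow/$dir/$zip.zip -> $draft.est"

  # ... and MD5 of the *contents* tells us whether it needs reindexing.
  sig=`md5sum "$file" | cut -c1-32`
  echo "current @sig should be: $sig"

Feeding it /etc/motd reproduces the b3/09.zip split shown above (the @sig will obviously vary with the file's contents).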

I want something that will scale up to tens of millions of indexed files. I did a few experiments with this, and the 1.6 million files on my workstation would fit into 64k zipfiles with an average of 25 drafts per archive. My home directory has ~17,400 files taking up ~300 Mbytes; the equivalent draft files, zipped, take up 9.1 Mbytes (about 3% of the original file space) if you don't mind doing without phrase searches.

The current fad seems to be "consolidating" people's working files on some massive central server for searching, which is dumb on so many levels: crossing a network to get files that should be local, having a nice juicy single point of failure, and so on. If you want to search files without generating enough heat to boil the nearest body of water, put the draft files on the central server and index those instead.

-- 
Karl Vogel                      I don't speak for the USAF or my company
The outpatients are out in force tonight, I see.        --Tom Lehrer

