Please note: This module is no longer supported. Drupal 6 users are encouraged to use Search Files, which incorporates much of the functionality of Search Attachments.
The third version of Search Attachments 5.x-4, 5.x-4-dev-2008-03-05, is now available. This release contains a number of new features, including the ability to use PHP snippets as helpers, the ability for helpers to manage multiple extensions, and a number of new scalability, error catching, usability, and troubleshooting features.
This version will hopefully be the last before 5.x-4 is released as stable. I welcome any feedback and test results.
To upgrade from a previous version of 5.x-4-dev, uninstall the previous version, replace the old files with the files from the 5.x-4-dev-2008-03-05 tarball, and reindex your files. To upgrade from 5.x-3, replace the old files with the new ones and run update.php.
Detailed list of changes since 5.x-4-dev second release (2008-02-24):
Known bugs/issues
Changes since 5.x-4-dev (first release, 2008-02-21):
-Minor interface wording tweaks.
-Fixed bug in assigning search_attachments_helpers.id (thanks to Dinis for finding this).
-Fixed bug in webfm_driver.inc not returning file id.
-Increased maxlength of helper name and path form fields to match corresponding database fields.
-Added register_shutdown_function('search_attachments_update_shutdown') to search_attachments_update_index().
-Added ability to display multiple parent nodes in search results table view.
-Added trim function to remove leading periods from helper extension values.
-Added t() functions to drupal_set_messages() strings in .install file.
-Added a pure PHP helper text_helper.php to illustrate use of PHP helpers, and provide default text helper on Windows. Users are warned in README.txt and on install to move this file outside their web server document root.
-Added "if (function_exists('iconv'))" logic to search_attachments_convert_to_utf8() since not all systems have iconv loaded. This is a temporary fix. (thanks to mroswell for finding this)
-Changed search_attachments_perm() string from "administer attachment searching" to "administer attachment and file searching".
To install this new version, uninstall the previous version of 5.x-4-dev, copy over the new files, and run update.php. If you have configured helpers you might want to copy their path values before upgrading since you will need to re-configure them.
Version 5.x-4-dev of search_attachments is a thorough rewrite that incorporates the changes described at http://drupal.org/node/188895 -- basically, in addition to indexing files that are attached to nodes, the module now also indexes files that are not attached to nodes or managed by other Drupal modules. Also, reindexing of files is now triggered by changes to the files themselves, not to the parent nodes.
Below is the change log for 5.x-4-dev. This version should be considered beta software. However, I would very much appreciate testing and feedback so I can designate it as stable and move on to porting the module to Drupal 6.
To install the module, copy the contents of the tarball into your modules/search_attachments directory and run update.php. The install file deletes index entries produced by previous versions of the module, but you will need to run cron.php to update the index with entries produced by the new module. Before you install this version, back up your current database.
Changes since 5.x-3:
-Extensive code cleanup and addition of inline documentation.
-Replaced search_attachments_nodeapi() with search_attachments_register_files() and search_attachments_update_index() functions as per http://drupal.org/node/188895.
-Rewrote search_attachments_search() so it doesn't rely on the foreach ($drivers as $driver) loop.
-New permissions logic: Use permissions string from file management module, then show link to parent info in search results only if searcher is allowed to see parent node.
-Added support for files that are attached to multiple nodes (e.g., webfm allows this).
-Updated database field definitions to be consistent with system.module; created search_attachments_files db table; renamed search_attachments table to search_attachments_helpers; added code to .install file to perform these updates (users must reindex site and then run update.php).
-Install file's update function deletes all old 'file_xxx' entries from search_index and search_dataset, and tells user to run cron.php after running update.php.
-Updated message that appears in the helper delete confirmation page to describe new behavior.
-Resolved issue http://drupal.org/node/219409.
-Changed driver filenames from .inc to _driver.inc.
-Replaced variable name $attachments with $files in drivers.
-Added 'search_attachments_' to all functions in drivers to avoid possible collision with other modules' functions.
-Added no_file_manager_driver.inc to handle files that are not managed by other modules (i.e., they are FTPed or otherwise copied to the Drupal instance).
-Rewrote attachement_driver.inc to make it work with {search_attachments_files} (e.g., added path information).
-Added get_XXX_register_files() function to upload and webfm drivers.
-Removed get_no_file_manager_file_list($nid) from drivers, since it was only used in search_attachments_nodeapi().
-Added get_xxx_file_nids() functions to all drivers (but not no_file_manager_driver.inc since it is not necessary).
-Replaced 'file_xxx' where xxx is driver name with 'file' in search_dataset and search_index tables.
-Added 'clone' action in helper list.
-Changed settings link text from 'Search attachments settings' to 'Search settings for attachments and other files'.
-Changed admin tabs: renamed 'Edit helpers' to 'Helpers'; relocated 'Add helper' to bottom of helper list; added 'File paths' tab that allows administrators to define what directories to scan for files and also what path patterns to exclude from the scan.
-Added option to log files added/dropped/updated in search index. Added option in helper configuration to log first 100 characters of text extracted by helper during indexing.
-Added a setting that limits number of files indexed during a single cron run.
-Added a "Reindex files" button on the settings page, and use form_alter to put some text on the main search module's reindexing form indicating that files need to be reindexed separately.
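The 'File paths' tab described in the list above (directories to scan plus path patterns to exclude) can be pictured with a short sketch. This is Python rather than the module's PHP, and it is only an illustration: the function name `scan_for_files` and the use of shell-style wildcard patterns are assumptions, not the module's actual code or pattern syntax.

```python
import fnmatch
import os


def scan_for_files(directories, exclude_patterns):
    """Return sorted file paths under `directories` that match no exclude pattern.

    Hypothetical helper: the patterns here are shell-style wildcards
    (fnmatch), which is an assumption about how exclusions might work.
    """
    found = []
    for top in directories:
        for dirpath, _dirnames, filenames in os.walk(top):
            for name in filenames:
                path = os.path.join(dirpath, name)
                # Skip any file whose full path matches an exclude pattern.
                if any(fnmatch.fnmatch(path, pattern) for pattern in exclude_patterns):
                    continue
                found.append(path)
    return sorted(found)
```

Calling `scan_for_files(["/var/www/html/drupal/files"], ["*.tmp"])` would list every file under that directory except those ending in `.tmp`.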
To do for 5.x-4-dev
KNOWN BUGS/ISSUES
-attachement_driver.inc doesn't respect attachment.module's 'hidden' attribute. This means that all users can see links to all files in the search results list, although attachment.module should enforce access to the contents of the file. Respect for 'hidden' in searching/display is currently under development.
-See http://drupal.org/node/215554: Does the {file_revisions} table apply just to upload.module or others as well? Not webfm or attach, at least. Question: If a file is deleted from a node using upload, should the file be then considered to be a no_file_manager file if it is in a directory that is scanned by that driver?
-Editing a helper using /bin/cat %file% throws an error if $test_file_list[0] is not in an encoding convertible by search_attachments_convert_to_utf8().
-Further test search_attachments with porter-stemmer module (no problems found on preliminary testing).
-Add formatting of multiple parent nodes to table search results view. Incorporate check_url(), as in list view.
FUNCTIONALITY
-For files that aren't attached to nodes, is there any way of getting the file's user for display in search results (as opposed to the parent node's user)? Probably depends on the managing module.
NEW FEATURES
-Implement register_shutdown_function(). Add to .install file variable_set statements for search_attachments_cron_last_id and search_attachments_cron_last_change. See PDD page 209 for details.
-If helper not found, invoke 'which' on *nix systems to try to find executable
-Add ability to extract text using PHP code. See http://drupal.org/node/207929.
| Attachment | Size |
|---|---|
| search_attachments_5_x_4-dev-2008-03-05.tgz | 26.63 KB |
Comments
Excellent module, only found 1 problem :)
When adding more than two helpers, the numbering on the helpers admin page seems to break, repeating helper 1 and 2 in the "edit" option, although showing the correct titles in the listing.
As an aside, I would love to see the option to combine files and content in a single search.
Other than that, top marks for a superb mod which is very easy to install, A+ :)
Please send a screen snapshot
When adding more than two helpers, the numbering on the helpers admin page seems to break, repeating helper 1 and 2 in the "edit" option, although showing the correct titles in the listing.
I can't replicate this -- when I add additional helpers they work as expected. It would be useful to have a screen capture illustrating the problem. Can you attach an image to a comment in this thread, and let me know what the IDs for all the helpers shown in the image are (get them by putting your pointer over the 'Edit' link, they are the numbers at the end of the URLs)?
As an aside, I would love to see the option to combine files and content in a single search.
Yes, this would be a great feature. So far I have not been able to figure out how to do it but I will keep on looking. Any suggestions on how this would work would be appreciated.
Other than that, top marks for a superb mod which is very easy to install, A+ :)
Thanks, good to hear you found it easy to install.
Here you go:
I hope these images help. One and two appear fine, then the next one I add becomes one, then two etc. Later I'll try deleting all the helpers and starting from scratch to see what that does, but "out of the box" I've had this happen on test and just now on production too, it's odd.
Fixed
OK, I've fixed this bug. Problem was that I am using the db_next_id() function to generate helper ids, and I forgot to seed the sequences table. Therefore, the first id generated when you added a new helper was 1, which caused the behavior you describe. Thanks for catching this.
I've incorporated the fix into the next release of the module, which I'll make available in the next few days. In the meantime, if you have access to your database I can supply SQL to fix your sequences table.
I have full access to the db
I have full access to the db / servers, would love to have the fix :)
Many thanks again, very nice mod.
Here you go
Note to everyone who has installed this first beta version: You only need to do this fix if you have configured additional helpers. The bug that caused this problem has been fixed and the updated version of the module will be available tomorrow (Feb. 24) some time. Don't perform this fix if you don't need it by then.
First, of course, back up your database before proceeding.
Do select * from search_attachments_helpers; to see your helpers table. I would imagine yours looks like this:
You need to assign unique ids to each helper. So, issue the following commands (the id values below may not be correct for you but the important point is that they are sequential and unique):
Now do another select * from search_attachments_helpers; to make sure that you have unique id values for all of your helpers. If you do, the last step is to update your sequences table so that it records the highest id value in your helpers table (in this example, 5):
Do select * from sequences; to verify that the search_attachments_helpers_id id is the same as the highest id value in your helpers table, and you're done.
Combined search results
I wonder if it's practical to output the text from an attached file into a kind of meta table attached to a node, perhaps using a system similar to the old "node_words" module - http://drupal.org/project/nodewords only instead of typed data the helper adds the text content of the attached file.
What do you mean by combined?
Do you mean search the combined text (i.e., the attachment(s) text appended to the node text) or display the search results from nodes and files in one list, under a single tab? http://drupal.org/project/search_all does the latter in a certain way.
Cheers for the tip with
Cheers for the tip with search_all; I had a look at the module but it's a little bit messy with its output and is not very configurable (though I have been able to modify the code somewhat to make it suitable as a stop gap).
The way I would like to see it work (I think it's possible) would be to have the files tab much as it is now, but with an option in your module to *not* list documents which are attached to a node, maybe the file names of all attached documents could be added to an exclude table in the database and then not displayed on the files tab.
Then in the content tab I would like to see the content of attached and parsed files treated in the same manner as the body and meta text of their parent nodes, thus inheriting search/sort/taxonomy etc. from other modules.
Thanks once again for your module, it's opened up some fantastic new possibilities with Drupal, I'm going to test it combined with WebFM module over the weekend to give us a very nice portal driven file share and manager.
Second release of 5.x-4 is available
See changelog under Update 2008-02-24, above. Thanks to everyone who tested the first version and sent in bug reports.
SQL Error
I have been anticipating the release of this update
It is going to solve a lot of the issues I have with large numbers of "unattached" files residing on a network drive
Have downloaded and updated to 5.x-4-dev
Getting a recurring sql error
With no hits to a search argument :
user warning: Unknown column 'n.nid' in 'field list' query: SELECT DISTINCT(n.nid) FROM node in /var/www/html/drupal/includes/database.mysql.inc on line 172
With multiple hits :
user warning: Unknown column 'n.nid' in 'field list' query: SELECT DISTINCT(n.nid) FROM node in /var/www/html/drupal/includes/database.mysql.inc on line 172
and repeated lines of ...
warning: in_array() [function.in-array]: Wrong datatype for second argument in /var/www/html/drupal/modules/search_attachments/search_attachments.module on line 867
Unfortunately, I have limited experience with SQL
All help is truly appreciated
Thanx
Tim :)
What modules do you have enabled?
Hi Tim,
What file management modules do you have enabled (upload, webfm, or attachment)?
Also, can you confirm that when you upgraded, you ran update.php?
I notice that you have gallery enabled -- were some of the files that are being searched uploaded by gallery?
Mark
Upload 5.7 Gallery
Upload 5.7
Gallery 5.x-1.0
IMCE 5.x-1.0 ( Used with TinyMCE )
Ran update.php when module enabled, am now showing "no updates available"
A lot of files are uploaded by gallery - and these are searching without any error / warning
With the remote drives, I just mount these as shares ( above Drupal root )
Thanx :)
Try this version
Hi Tim,
I'm not exactly sure what the problem is, but I found a bug in the access control mechanisms in the 2008-02-24 version that might have something to do with your problem. Attached is an updated search_attachments.module. Unzip it and replace the one in your search_attachments directory with this one (no need to run update.php). You will also need to go into Administer > User management > Access control and assign the new 'view files not managed by a module' permission as you see fit.
Please let me know if the problem you reported is still happening after you replace the .module file.
Thanx for the quick response
Thanx for the quick response Mark :)
Has fixed the issue with "multiple hits"
Still getting the user warning tho
user warning: Unknown column 'n.nid' in 'field list' query: SELECT DISTINCT(n.nid) FROM node in /var/www/html/drupal/includes/database.mysql.inc on line 172.
doesn't seem to be a critical warning ... line 172 just seems to be part of a generic catch-all error message for the "Helper function for db_query" ... so will prob just ignore it ... For now at least :)
if you want to rule it out as an "opportunity for development" in your coding ( aka a bug ), drop me email and will grant access to the server
Thanx again :)
Let's wait
Hi Tim,
Let's wait until you try the latest version. I'm not sure where the error is coming from (I don't use DISTINCT anywhere in my module) but I'd be curious to poke around a bit to see if we can track it down.
Spoke too soon :|
Would appear I have bigger issues than I thought
I can see the updating of the "non-attached" files in the logs - removing / adding
I can see the file paths/names in the DB
But I get no hits for any content within those files ( doc pdf xls txt - helpers are all working fine )
Using my sad lack of sql, can I assume that last_indexed = 0 indicates never ?
Are successive cron runs indexing these files?
Hi Tim,
Yup, 0 indicates that they have not yet been indexed (or the search index has been wiped and they have not yet been reindexed). Are successive cron runs indexing these files? Or do they remain unindexed even after you run cron again?
I am adding some checks to log non-readable files and empty extracted text. The new version, which will make it easier to troubleshoot the types of error you are reporting, should be available by Monday.
Mark
CRON
Morning Mark
CRON was timing out ... haven't had a successful run since i added search_attachments :|
Increased timeouts, reduced items, etc ... but still timed out
For now, I have disabled the module ... and cron is running fine
Eagerly awaiting the updated release
Thanx
Tim :)
Sorry to hear that
Sorry to hear that Tim, but the latest version of 5.x-4-dev, which I plan to have out by mid-week, has several new features to help with timeouts, i.e., you can select how many files are indexed on each cron run, and a function to record the last ID successfully indexed before the script times out, just like in the core search module. There are also a heap of new troubleshooting features, including a report of how many files have been indexed, logging of files that can't be read by the module due to permissions problems, and the ability to select which file to use during helper testing.
Out of curiosity, how many files do you have on your site? Any really large ones?
Lots !
17,000 files
Just over 1Gb
A lot that are 30Mb and more
A mix of DOC, TXT, PDF, XLS, PPT
ATM, they are just a dump of the network drive ( well, some parts of it )
Once all running nicely, I will just mount the remote shares
----
I will be patient and work on other areas I need to improve on :)
Tim A.
Well there's your problem!
That is lots. The new "files to index at once" feature was added for cases like this. I am curious (and hopeful) that it will help. Is your content fairly stable or does it change frequently? I'm asking because if you have 17,000 files and you index 10 at a time, that's 1,700 cron runs to index the whole wad. Of course, if you update any files that were previously indexed, they get reindexed.
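The arithmetic above is easy to redo for other settings. Here is a rough sketch in Python; the function name `cron_runs_needed` is made up for illustration, and it ignores reindexing of changed files and the extra runs that large files may need.

```python
def cron_runs_needed(total_files, files_per_run):
    """Number of cron runs needed to index every file once (rounded up)."""
    if files_per_run <= 0:
        raise ValueError("files_per_run must be positive")
    # Integer ceiling division: a partial final batch still costs one run.
    return (total_files + files_per_run - 1) // files_per_run
```

With 17,000 files and 10 files per run this gives 1,700 runs; how long that takes in practice depends on how often cron is scheduled.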
Not sure how to handle the large files, but we'll work something out.
Static Files
Luckily most of the shares are static
Some of them haven't changed in 2+ years
Actual web content is pretty much static - some is created from the main msSQL servers, some is just updated / amended
Will also be adding in a mirror ( in some way or another ) of content on a multi-company shared server - when can get search_attachments working I will prob just get the multi-company shared stuff exported as files - I have issues making it talk nicely - probably something to do with language barriers - the server is in Europe somewhere - somewhere close to where Hostel was filmed
Never fully understood the CRON / Index
Just assumed the option in the generic "search" would be the one to decide how many to do per run
Also assumed search would give priority to files that were new / changed
And that once they were all indexed, only new / changed content would be included in the next cron run
Can see I am gonna have to brush up on my php and start reading the code :|
So yeah ... a few files ... getting your module up and running will be manna from heaven
Cron works like this...
The standard search module does have an "items to index per cron run" setting, but I added one that is specific to search_attachments, since extracting text from external files will likely take more time than pulling HTML from the database and it might be good to be able to compensate for that difference. On every cron run, 1) the search_attachments_files table (the one that you screensnapped above) is updated, i.e., new files are added onto the end of the table, changed files have their 'changed' attribute updated to reflect when they were last modified, and files that were deleted are removed from the table, and 2) the module's hook_update_index() is called by cron and the files that need to be reindexed/indexed for the first time according to a query on the search_attachments_files table are indexed. Search_attachments' "items to index per cron run" determines how many rows are pulled from the table during that run. The query asks for files that have changed since the last time they were indexed (new files or ones that could not be indexed for some reason have a 'last_indexed' value of 0).
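To make the selection step concrete, here is a minimal sketch in Python (not the module's PHP). The `changed` and `last_indexed` names come from the description above; the `FileRow` class, the `fid` field, the ordering by id, and the two function names are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class FileRow:
    fid: int            # hypothetical row id
    path: str
    changed: int        # Unix timestamp of the file's last modification
    last_indexed: int   # Unix timestamp of last indexing; 0 = never indexed


def files_to_index(rows, limit):
    """Pick up to `limit` rows that are new or changed since last indexed."""
    pending = [r for r in rows if r.last_indexed == 0 or r.changed > r.last_indexed]
    pending.sort(key=lambda r: r.fid)  # assumed ordering: oldest rows first
    return pending[:limit]


def run_cron(rows, limit, now, extract_text):
    """Index one batch; rows whose text cannot be extracted stay pending."""
    indexed = []
    for row in files_to_index(rows, limit):
        if extract_text(row.path):
            row.last_indexed = now
            indexed.append(row.fid)
    return indexed
```

A row that fails extraction keeps `last_indexed` at its old value, so it is picked up again on a later run, which matches the behaviour described for files that could not be indexed.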
Last night, I added an option for admins to change the maximum timeout limit for Drupal during indexing. Drupal already sets this to 240 seconds of PHP execution time but search_attachments might need more. This setting might help in cases like yours. I'll be interested to see if it does. I'm already cooking up an algorithm that gets the sizes of the files queued up for indexing during the current cron run, determines if there are any 'big' files in this group, and adjusts the number of files it processes accordingly. Just a thought at this point but it might be necessary to add some adaptive file processing of this sort to handle large files.
"Large file threshold" and max execution time setting added
I've added a feature that lets you define a "large file" in megabytes. This setting is used during the cronned index updating process to identify any files in the current cron run's list of files to add or reindex that should be extracted and indexed on their own, i.e., a file identified as large will be the only file processed during that cron run. The more files that are bigger than this setting, the more cron runs it will take to work through the list, but at least the module gets to work on only one file per cron run.
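One way to read the "large file" rule is sketched below in Python. This is an interpretation, not the module's code: the function name `select_batch` and the choice to process the first large file in the queue on its own are assumptions.

```python
MB = 1024 * 1024


def select_batch(candidates, large_file_mb):
    """Choose which queued files to process in this cron run.

    `candidates` is a list of (path, size_in_bytes) tuples queued for the
    run. If any file exceeds the threshold, the first such file is processed
    alone (assumed behaviour); otherwise the whole list is processed.
    """
    threshold = large_file_mb * MB
    for path, size in candidates:
        if size > threshold:
            return [(path, size)]
    return list(candidates)
```

Under this reading, a queue holding one 30 MB file and nine small ones takes two runs: one for the large file alone, then one for the rest.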
Also, I've added a setting that allows you to increase the maximum time Drupal can run during cron without timing out. The default set by Drupal core is 240 seconds.
Finally, I am going to do final testing of the module on Windows this evening, and unless there are any major problems, release what I hope is the last or at least second last -dev version.
Still timing out?
Hi Tim, did you have a chance to test the 2008-03-05 version? Are timeouts still a problem?
Can't activate the module
I can't seem to enable the module. Every time I activate it, the /admin/build/modules/list/confirm page is white. No errors appear in the Apache error.log either - which is odd.
Try increasing PHP's memory limit
If no errors are appearing in your server logs, it's most likely a PHP memory usage problem. Try increasing the amount of memory PHP is allowed to use with one of the techniques described at http://drupal.org/node/29268 and try again.
Update on newest version of 5.x-4-dev
Even though I had planned to make the last release of 5.x-4-dev the last beta release, I have added a considerable number of new features, many of them in an attempt to address the problems that some of you have reported. The changelog since the 2008-02-24 release is:
Sample PHP helper for text is provided. Removed text_helper.txt from the second release distribution file and adjusted README.txt and the .install file
In addition to the various file checks and other troubleshooting features, the two biggest additions are the ability to use PHP snippets as helpers and the ability to configure helpers to work on more than one file type. My focus over the next few days (apart from interface refinements that I might make) is to test for some of the outstanding problems that have been reported, e.g., the porter-stemmer problem, the internationalization problem, and the timeout problem. I'd like to get the new release out by mid week and as usual would very much appreciate your assistance in testing the module, particularly those of you who have taken the time to report problems.
Hi Mark, All set to help
Hi Mark,
All set to help with the testing here :)
I've created a bug report to help track the apparent Porter-Stemmer bug here - http://drupal.org/node/229305 There is an example of the output there too if that helps.
Also, I am unable to find the 02/03/08 build of 5.x-4-dev to start testing, any chance you could point me to the link?
Many thanks,
Danielle
Thanks for the bug report
That will be useful. I haven't released the 02/03/08 version yet, it's my local copy. I'm still doing some testing on my own site and will release it in a couple of days.
Still can't replicate Porter-Stemmer problem
Danielle, I'm using version 5.x-1.x-dev of Porter-Stemmer and my development version of Search Attachments and can't replicate the problem. Maybe it's fixed in this combination of versions? Maybe you can test on the -dev version of P-S when I have my latest -dev ready?
BTW, I assume the Porter-Stemmer module truncates the words like that to do its job, i.e., making searching 'fuzzier'. However, it certainly should not display the words like that.
More than happy to test it
More than happy to test it :)
The combination of features that these two mods add provide valuable functionality on our intranet so I'll give testing top priority as soon as I can get my hands on your new build.
Hi Mark, Only just noticed
Hi Mark,
Only just noticed you posted a new build; I can't seem to extract the file properly though. I can only see one file when I extract the contents which appears to be all the files of the mod in one long text file.
Is it just me being more blonde than usual and I've missed something?
Cheers,
Danielle
Works for me
I can gunzip and tar xvf it no problem, on two different machines. Try again and I can repackage it if necessary.
Did it work?
Hi Danielle, were you able to extract the 2008-03-05 version and try it?
Hi Mark, Got there in the
Hi Mark,
Got there in the end, not sure what was happening but I think it was Rar playing silly buggers on my PC.
Installed the latest build and still had the problems with truncation when using Porter Stemmer, so I dug a bit deeper to see if I could figure it out.
It appears that there was some garbage hanging around from my attempts at getting the Swish-E Drupal mod to run successfully; even though I could find no trace of the module itself, its remains lingered on in the indexing.
The simple fix was to just temporarily rename my sites files folder, reindex everything (including files). Then just name the files folder back to "files" and let cron take care of the rest.
Not the most elegant solution, but it worked and now Search Attachments and Porter Stemmer are living in perfect harmony :)
Wew!
I'm glad that you resolved the Porter-Stemmer problem. Thanks for digging into it and reporting back.
Third version of 5.x-4-dev now available
See changelog at top of this page.
Need help for New Version
Hi Mark
Thanks for releasing new version.
So in this new version no need for me to download any external helper to search inside Pdf Files, am i right?
After Downloading your new version :
In Helper Path : i should write only %file%
Can you tell me what should i write in Helper Php ?
Currently i am leaving helper Php as blank
When i Save configuration i am getting the below warning:
user warning: Unknown column 'extensions' in 'field list' query: UPDATE search_attachments_helpers SET name = 'PDF', extensions = 'pdf', extension_labels = '', helper_path = '%file%', test_file = '', log_text = 0, log_text_length = '200', helper_php = '' where id = 1 in D:\AppServ\www\drupal\includes\database.mysql.inc on line 172.
And then i try to search the File but i can not get any result.
Do I need to run cron, and if yes, please tell me how I can run cron?
I hope this version works for me on the Windows XP platform
Thanks a lot
Ankit
You need pdftotext
No, the only PHP helper I supply currently is for txt files. You need to download pdftotext, which works well on Windows. Install pdftotext, locate the pdftotext.exe program and put the full path to the pdftotext program in the helper path field followed by %file% -. It will work better if you install pdftotext at c:\pdftotext\pdftotext.exe.
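For anyone unsure what the helper path does with `%file%`, here is a rough sketch in Python of how such a template could be expanded and run. It is not the module's code: `build_helper_command` and `run_helper` are hypothetical names, and the sketch assumes the helper path itself contains no spaces. The trailing `-` tells pdftotext to write the extracted text to standard output.

```python
import subprocess


def build_helper_command(helper_path, file_path):
    """Expand a helper template such as 'c:\\pdftotext\\pdftotext.exe %file% -'.

    The template is split on whitespace first (assumes no spaces in the
    helper's own path), then %file% is replaced, so a file name containing
    spaces stays a single argument.
    """
    return [token.replace("%file%", file_path) for token in helper_path.split()]


def run_helper(helper_path, file_path):
    """Run the helper and return whatever text it prints to standard output."""
    command = build_helper_command(helper_path, file_path)
    result = subprocess.run(command, capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")
```

If `run_helper` returns an empty string for a file, the helper either could not read the file or found no text in it (for example, a PDF made only of scanned images).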
From the SQL error it looks like you didn't install the new version properly. First deactivate and uninstall the module, then delete all files in your Drupal's search_attachments module directory. Then put the new files there and activate the new module. It is important to uninstall first. If the install is successful, you should only need to replace the directory in the sample helper path with the one you installed pdftotext in. Everything else should be fine as is.
works for me
installed the 5.x.4-dev-2008-03-05 version on a previously-installed 5.x.3 install (hand-built WAMP). had to uninstall the old one and clear the tables manually; probably did something wrong on my end. not so bad since you require re-indexing anyway. resetting the helper app paths forced me to look at your admin pages.
noticed that the indexing will stick if files are not searchable. e.g. pdfs that are images are not indexed, but the counter still says it is at "95%" (or whatever) and the remainder is the number of files skipped. is that behavior as designed? just so i know what to look for.
other than that, nice work! like the new admin interface; separating the re-indexing is really nice, and the "number per run" with large-file exception is a great idea. noticed that the helper app paths now take backslashes; "nice to have" for windows people.
thank you, thank you! stuff like this may help drupal push into the enterprise space as a generic dms.
Great, glad to hear it
Thanks for reporting your experience. WRT the percentage of files indexed, yes, that is by design. Next to the percentage I provide a link to the logs, which will say what files were not indexed or are not readable. Glad to hear that you like the large-file exception. I'll do a test on 5.x-3 -> the new version to confirm that the installer script is working OK; just to confirm, what database are you using, MySQL or PostgreSQL?
Catdoc for windows
and the catdoc works fine too. (http://webaugur.com/wares, click on the hardware and software list. i used the catdoc-0.94.2-win32.zip version.). now if i could only locate a set that can read ofc2007 files.
Search Text File
Hi Mark,
Sorry to bother again.
Now this new version can search txt files without any external helper.
So to test and i upload the txt file and it goes into this location D:\AppServ\www\drupal\files
and then search for some words inside the uploaded txt file but can not see any result .
So please advise whether I need to do some more settings for txt files also.
One more thing: can you please help me install the pdftotext helper on Windows?
I already tried a normal PDF to text converter, installed it, and then searched, but it didn't work out.
please help!!
Regards
Ankit
You need to run cron.php
Hi Ankit, did you run cron.php before you did your search? The module won't work until you run cron.php (and won't index new files unless you run cron.php periodically).
To install pdftotext, download ftp://ftp.foolabs.com/pub/xpdf/xpdf-3.02pl2-win32.zip, unzip it, and put the files in c:\pdftotext (or any other location, as long as the directory name contains no spaces). In the PDF helper config page, replace the default path to pdftotext.exe with the path where you installed it. It should work fine.
Thanks a lot Mark
Hi Mark
Finally I can search inside pdf files. It's really great. Thanks a lot for your help. I will always appreciate it.
Can you help me with one more thing? Can you give me a link where I can download helper applications for Excel, Word and PowerPoint attachments for Windows?
Thanks a lot again !!
Regards
Ankit
Excellent!
I'm glad to hear that you got pdftotext working. Is the plain text PHP helper working as well?
I don't know of any .doc, .xls, or .ppt helpers for Windows. Perhaps someone else can suggest some. As soon as I release the new version I am going to do some research on finding more helpers and document how to configure them. Until then, I can't suggest any.
Yup , work for plain text as well from your Php Code
Hi Mark
Yes, search works inside simple text files too.
I really need some helper for doc, ppt and xls; even if it's paid, I can pay for it. So if you have any information, please let me know.
I found this page about a doc helper, but I am not sure whether it will work with Drupal. If you have time, please take a look.
http://plone.org/documentation/how-to/enable-full-text-indexing-of-word-...
Another thing: every time I upload a new file and want to search, I need to run cron.php first.
Can you suggest how I can run cron automatically every 15 minutes, and how long a cron run should take if I have more than 1GB of files?
Thanks for your support
Regards
Ankit
wvWare works OK for .doc
Hi Ankit, I'm glad to hear things are working better. Thanks for the pointer to wv. It seems to work OK but includes a lot of HTML markup in the extracted Word text. To configure it, set up a .doc helper and use the following path:
C:\wv\bin\wvWare.exe %file%
using the path where you installed it of course.
Re. configuring cron, the documentation at http://drupal.org/node/31506 looks pretty good.
I don't have any experience indexing a gig of files, but the version of search_attachments you are running has several features to make indexing that many files more reliable. For example, if your cron jobs time out, reduce the number of files you are indexing in one cron run (try fewer than 10 files at a time), and maybe set the 'large file threshold' to 2 or 3 MB.
In general I'd also recommend timing your cron jobs no closer than 15 minutes apart.
Doc Helper
Hi Mark
I did not understand from your last reply how to install that helper for doc.
Let me explain what I did.
I installed the setup from http://gnuwin32.sourceforge.net/packages/wv.htm into the path C:\wv\bin\GnuWin32\bin and then I set the helper path to C:\wv\bin\GnuWin32\bin\wvWare.exe %file%
Now when I search, I see a result, but only the line containing the word I searched for; I don't see a link to the file in which that word appears. So I can't go to the file that contains the word.
I hope you understand my problem.
I am not sure whether it's happening because I installed it the wrong way or because of some other issue.
One more thing: does your module send some parameter to the helper application?
Regards
Ankit
Snippet but no link
I think I understand the problem but if you could attach a screen capture that would be helpful.
Re. sending parameters to the helper app, yes, whatever you put in the path. For example, %file% is a parameter, as is the '-' that pdftotext requires.
Getting Closer - I Think :)
Heya Mark
Thanx for the helpful advice and assistance
I am getting closer to getting this working - At least as far as parsing the files
Never had an issue with PDF
Have had to exclude XLS - turns out our report server creates "munted" xls - Even Excel complains that file extension is different to content
Every now and then I come across a file that just seems to hang the cron process causing a timeout - So I just exclude it :|
At least now I have SOME last_indexed values in the databases
I have cut file numbers right back to about 1,000 - for now
Unfortunately, to avoid timeouts I have had to rem out line 366 in drupal/modules/search/search.module .... I think preg_match is having issues with ( ) around dates
And I am still unable to get any hits to this "unmanaged" content - Attached files return hits with no issues - I am still getting the error ...
user warning: Unknown column 'n.nid' in 'field list' query: SELECT DISTINCT(n.nid) FROM node in /var/www/html/drupal/includes/database.mysql.inc on line 172.
I have even gone to the extent of upgrading to Drupal 5.7 ... uninstalling search_attachments AND search ... ensuring the databases were dropped .... and reinstalling the modules
Read a comment earlier re Porter-Stemmer ... I have no issues with this and managed content ... cannot say with un-managed stuff ... yet !
I intend to remain positive about this
Especially as I now also have to "mirror" an external data source and have it fully searchable :|
Thanks for the detailed status report
Hi Tim, thanks for the detailed status report. Let me respond to some of your points.
Re. excluding the files that hang cron: you shouldn't have to do this. If we can figure out what the problem is with these files, we may be able to make them work. When you indexed them, did any entries show up in your watchdog logs? Are the files really big (and if so, did you try setting the large file threshold accordingly)? Can you extract text from them using the helpers on the command line?
Re. line 366 of search.module: without more investigation I don't know what the side effects of commenting out that regex are, but search.module is likely timing out here because the list of characters it is searching for is so long. If you are experiencing timeouts while search_attachments is active but not while it is inactive, then I would say that the additional data that some large files are feeding into this regex is causing the timeout. Again, I think you should try setting search_attachments' large file threshold to some low figure like 5 MB and see if that has any effect.
Re. getting no hits on the unmanaged content: what do you have in your "Directories to scan" field?
Also, the SQL error you are reporting, are you saying it is associated with unmanaged files in some way? Doing "grep -r 'SELECT DISTINCT' * | grep 'n.nid'" in the root folder of a Drupal 5.7 install gets me hits in taxonomy.module and tracker.module. I forget if you're on Windows or *nix, but if on the latter, can you issue that command in the root of your Drupal install and let me know what your results are?
The Never Ending Saga :)
I am getting no time-outs ( unless line 366 is enabled )
Has been stuck at 31% for some time now - despite successive cron.php
20 files per run
5MB limit
No "file messages" in status logs except when it is adding / removing files
Successful "cron messages" in logs after each run
Very little in httpd/error_log
I usually only show an error with files that it cannot open - did find a lot of temp word doc files on the mounted server - I have since removed these ( were causing catdoc errors in error_log and "empty file" in drupal logs )
Helper files show as extracting text just fine - i have tried samples from /files ( managed ) and the mounted server ( unmanaged )
I can also extract from command line
Directories to Scan :
../.mount_share/
Excluded paths :
/\.xls$/
/CallCentre\/Procedures\/HELPDESK\ OPERATOR\ LEVELS\.doc/
/CallCentre\/Sanitised\ Pec\ Warranty\ Calls/
/CallCentre\/Reports/
/CallCentre\/Sanitised2\ PREVENTATIVE\ MAINTENANCE\ SCHEDULES/
/Procedures\/Workbench/
/Procedures\/Sanitised3/
/Procedures\/Sanitised4/
/Procedures\/ACCOUNTS\ PAYABLE\ NON\ CBA/
/Procedures\/ACCOUNTS\ RECEIVABLE\ NON\ CBA/
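For anyone checking their own exclude patterns, here is a rough sketch of how a list like the one above could be tested against file paths. It is written in Python rather than the module's PHP, and it assumes the patterns are ordinary slash-delimited regular expressions matched anywhere in the path; the function names and the sample paths are hypothetical, not taken from search_attachments.

```python
import re


def compile_exclude(pattern: str) -> re.Pattern:
    """Strip the surrounding '/' delimiters and compile the rest as a regex."""
    if len(pattern) < 2 or not (pattern.startswith("/") and pattern.endswith("/")):
        raise ValueError(f"pattern must be wrapped in /.../: {pattern!r}")
    return re.compile(pattern[1:-1])


def is_excluded(path: str, patterns: list[str]) -> bool:
    """Return True if any exclude pattern matches anywhere in the path."""
    return any(compile_exclude(p).search(path) for p in patterns)


# Three of the patterns listed above.
EXCLUDES = [
    r"/\.xls$/",
    r"/CallCentre\/Reports/",
    r"/Procedures\/ACCOUNTS\ PAYABLE\ NON\ CBA/",
]

# Hypothetical file paths, for illustration only.
SAMPLE_PATHS = [
    "CallCentre/Reports/weekly.doc",
    "Procedures/ACCOUNTS PAYABLE NON CBA/invoice.pdf",
    "CallCentre/rates.xls",
    "CallCentre/rates.xlsx",
    "Procedures/manual.pdf",
]

for path in SAMPLE_PATHS:
    print(path, "->", "excluded" if is_excluded(path, EXCLUDES) else "kept")
```

Running a few known paths through a check like this is a quick way to rule out a syntax problem in the patterns themselves.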
Running on Fedora Core and your grep command ( I added -n ) in drupal root gives :
modules/taxonomy/taxonomy.module:1250
modules/taxonomy/taxonomy.module:1263
modules/tracker/tracker.module:81
modules/tracker/tracker.module:88
Disabled tracker module - and still got the Unknown column 'n.nid' error ( which incidentally does NOT appear in httpd/error_log )
Disabled tracker module - and still get the error
Disabled both - ditto
Unfortunately I am now at the point where something has stopped it from removing files that I placed in /files/temp ... so I have just hit the re-index button
Hope some of that helps :)
Check your path
Hi Tim,
The entries in Directories to scan are interpreted relative to your Drupal install directory. I notice in the directory you indicate (../.mount_share/) you have a period after the forward slash. Is that directory a hidden directory? On my laptop install, when I use ../test to point to a directory that is a sibling of my Drupal install directory, files within it are read; when I try ../.test (which doesn't exist) I get the error "Warning: Can't find ../.test/ listed in Directories to scan", which is what search_attachments is supposed to do. I assume you didn't get that error, since you should have if that directory doesn't exist, but I thought I'd ask.
Also, is that directory readable by your web server user? Even though I check to make sure each directory exists, I don't check to make sure it is readable. I'll add that check now, but can you confirm that ../.mount_share/ both exists relative to your Drupal install directory and is readable by your web server?
The saga is ongoing, but we'll get to the end!
PATH
Yeah
Path is all readable and accessible
I can't mount the "remote" server below drupal root - as users cannot then access them thru Drupal - unless I go and add .htaccess files galore :|
Search_attachments is finding and adding / removing files quite happily from the "hidden mount" above root
I can see this in the DB
Some have been parsed and have updated the date/time field
Some are still 0
Last indexed = 0
Hi Tim, thanks for verifying the path. If the files show last_indexed = 0 in the db, they are either new and not yet indexed, the helper can't read the file, or the helper can't extract the text. If you are not seeing watchdog log entries for files in the last two categories, let me know, since you should be. I might get you to run the helper app on the 0 files to verify that they can be extracted.
Files and paths are in the
Files and paths are in the DB and last_indexed is 0
Initially I thought the issue was white space in file names
I could not catdoc or pdftotext from the shell unless i enclosed the 'file path and name'
Then it started to get interesting
I could extract all the files I selected ( didn't try all 500 + tho ) until I came to a .pdf which said "cannot copy this protected document" (sic). What I found amusing is that if I use this same file path and name in the test pull down for the helper, it extracts without issue.
Selecting the docs with white space in the path/name also extract as a test file for the helper
However, I do not believe white space is the issue ... as I have "no upload manager" files in the DB with white space and a last_indexed != 0
Unfortunately I am now seeing no entries in httpd/error_log or drupal logs about failed files ... it is as tho it has decided that enough is enough
I have a horrid feeling that this is going to be something exceedingly obvious or stupid ... or I am going to have to re-create the site from scratch
I am hoping for the former :)
Don't think it's whitespace
I account for white space in filenames by wrapping them in double quotes. Unless that is not working consistently, I'd rule out whitespace in names as well. PHP doesn't need double quotes in files with spaces in their names to stat or do operations like is_readable() on them, but in the shell you do need quotes.
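To make the quoting concrete, here is a minimal sketch (in Python, not the module's PHP) of turning a helper path containing the %file% placeholder into a command string with the file name wrapped in double quotes. The function name and the sample file name are hypothetical; the pdftotext location is just the example install path suggested earlier in this thread.

```python
def build_helper_command(template: str, file_path: str) -> str:
    """Replace the %file% placeholder with the double-quoted file path.

    Hypothetical sketch: everything else in the template (for example the
    trailing '-' that pdftotext requires) is passed through unchanged.
    """
    if "%file%" not in template:
        raise ValueError("helper path must contain the %file% placeholder")
    return template.replace("%file%", '"' + file_path + '"')


# Hypothetical file name containing a space.
command = build_helper_command(
    r"c:\pdftotext\pdftotext.exe %file% -",
    "files/site manual.pdf",
)
print(command)
```

Without the quotes, a shell would split a name like this at the space and hand the helper two arguments instead of one, which is the failure you saw when running catdoc and pdftotext by hand.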
Don't create the site from scratch, for sure. Try this: determine the name of the first file in your search_attachments_files table that will not index after repeated cron runs and is one of the "no upload manager" files. Then, figure out a pattern that will match this name and enter that pattern in the Paths to exclude field of your module config. Run cron (do not clear the index first, just run cron.php) to see if you get any further down the table. Files that can't be extracted should be showing up in your drupal logs, but maybe they are not for some reason. If we can determine whether the module is stalling or failing on a specific file and not going any further, we might be able to narrow down the problem more easily. Unfortunately, Patterns to exclude only works on "no file manager" files, not ones managed by upload, webfm, etc.
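The procedure above could be sketched roughly as follows. This is Python for illustration only, assuming Python 3.7 or later (where re.escape leaves "/" alone) and assuming the rows have been pulled out of search_attachments_files in table order; the rows, timestamps, and function names are made up, and the column names follow the table listing that appears later in this thread.

```python
import re


def first_stalled(rows):
    """Return the first 'no file manager' row whose last_indexed is still 0."""
    for row in rows:
        if row["module"] == "no_file_manager" and row["last_indexed"] == 0:
            return row
    return None


def exclude_pattern_for(file_path):
    """Build a '/.../'-delimited pattern matching paths that end in file_path."""
    # Escape regex metacharacters, then escape '/' because it is the delimiter.
    escaped = re.escape(file_path).replace("/", r"\/")
    return "/" + escaped + "$/"


# Hypothetical rows, for illustration only.
ROWS = [
    {"id": 1, "module": "no_file_manager", "last_indexed": 1200000000, "file_path": "docs/a.pdf"},
    {"id": 2, "module": "no_file_manager", "last_indexed": 0, "file_path": "docs/b c.pdf"},
    {"id": 3, "module": "no_file_manager", "last_indexed": 0, "file_path": "docs/d.pdf"},
]

stalled = first_stalled(ROWS)
if stalled is not None:
    print(stalled["id"], exclude_pattern_for(stalled["file_path"]))
```

If the next cron run then gets past that file and stalls on the following one, repeat the step; if it sails through the rest of the table, the excluded file is the one to look at more closely.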
Thank you for your diligence, we'll get you searching those files sooner or later.
I moved things ...
Because I changed the way the mount points were being done, I thought perhaps it was getting confused ( I still had a lot of "old mount point / filenames" in the DB ) ... So I cleared the index
While at it, I made sure it was set to 10 files per run with a timeout of 300, file size of 5mb and added back the line in search.module that was causing a timeout
Changed cron interval to 15 minutes so that the cut down list of 900 files would be done by morning
Ran a coupla manual crons b4 I went to bed and found that each cron run is putting in 60+ "extracted" entries into watchdog ... with each file being "extracted" 20 times
So I just let it run ... Woke up ... and not so good
9% indexed - last successful cron 9 1/2 hours ago
a couple of files showing "extracted" per latest cron runs ( same file tho - files/manual a_0.pdf )
one file showing as "empty" ( files/557_OVERNIGHT_BAGS_DESPATCHED_&_RECEIVED.doc )
With cron timeouts
httpd/error_log only shows the PHP timeout
( Did I mention this is sending me bald ? )
files/557_OVERNIGHT_BAGS_DESPATCHED_&_RECEIVED.doc will CATDOC from command line
Exclude path : /files\/557_OVERNIGHT_BAGS_DESPATCHED_\&_RECEIVED\.doc/
Run cron manually
And these are the 5 lines produced by cron :
Message Cron run exceeded the time limit and was aborted.
Severity warning
Message Search_attachments helper "PDF" has extracted the following text from files/manual a_0.pdf (only first 100 characters shown here): Part No. 50557 8850 Site Controller Manual Part 1 (of 2) contains the following chapters: Chapter
Severity notice
Message Search_attachments helper "PDF" has extracted the following text from files/manual a_0.pdf (only first 100 characters shown here): Part No. 50557 8850 Site Controller Manual Part 1 (of 2) contains the following chapters: Chapter
Severity notice
Message The text of files/557_OVERNIGHT_BAGS_DESPATCHED_&_RECEIVED.doc is empty and was not indexed.
Severity notice
Message Search_attachments helper "DOC" has extracted the following text from files/557_OVERNIGHT_BAGS_DESPATCHED_&_RECEIVED.doc (only first 100 characters shown here):
Severity notice
NOTE : Line 2 and 3 ARE the same - this is not a cut/paste error and that 4 and 5 are the same file with different helpers
( Banging head against wall - real hard this time )
Check helpers ... I have not used the same extension twice
Play with the "excluded file path" incase I have the syntax wrong - same result
Remove the file via the drupal edit dialogue ...
Run cron ...
Message Cron run exceeded the time limit and was aborted.
Severity warning
Message Search_attachments helper "PDF" has extracted the following text from files/manual a_0.pdf (only first 100 characters shown here): Part No. 50557 8850 Site Controller Manual Part 1 (of 2) contains the following chapters: Chapter
Severity notice
Message File files/557_OVERNIGHT_BAGS_DESPATCHED_&_RECEIVED.doc has been removed from the search index
Severity notice
Some success there - file "removed" at this stage any step forward is a success :)
As the pdf has now supposedly been indexed a few hundred times according to status logs ... let's check the DB .... last_indexed = 0
Mark, I am officially stumped with this now
And still haven't even started looking at how to resolve that error in line 172 when I try to search attachments
Maybe we can take this thread offline...
Hi Tim,
I am officially stumped too. And already bald. Since your case seems to be getting more complex, I suggest we continue to troubleshoot it via email or Skype.
Files should not show up in the search_attachments_files table more than once, since they are keyed on file path, and if you move the files around, the old file paths will be considered "deleted". This is true of the no file manager files but now that you mention it, moving files that are managed by another module might not get treated this way since they are keyed off the ID that the managing module assigns them. I'll have to look at my code to refresh my memory. But if you moved managed files then the managing modules will complain as well.
Re. the error on line 172 that you reported, I'm hoping that has been fixed in the latest (unreleased at this point) version of search_attachments. http://drupal.org/node/233506 covers a similar error experienced when using i18n.module but I think it applies to others as well since I was using SQL in a permissions query that interacted poorly with other modules.
I think I can get your email from the comment admin pages. Can I contact you at that address?
Taking it offline...
Mark
You are more than welcome to email or IM ... both addresses are the same
Thanx heaps :)
Don't have your address
Hi Tim, I thought Drupal retained email addresses in the comments table but it doesn't. Contact me at mark2jordan 6 gmail com.
Update for WebFM changes
In the newest stable version of WebFM, the fname column was dropped from {webfm_file}, which causes search_attachments to return results with no name and which are not clickable. I updated the webfm_driver.inc file to account for this and also add the user-supplied title, as well as the file name. Let me know if I can help flesh this out further.
Thanks Russell, I'll include that fix
Thanks very much Russell, I'll patch the webfm driver for the 5.x-4 release. Looks like it shouldn't need any more work but I'll test it later today.
Further refinements for WebFM
I made a couple of further refinements to webfm_driver.inc. It appears that the new WebFM also urlencode()s the name and filename before inserting them in the database, so I added a urldecode(). Also, since the title is optional, I included logic to just use the filename if it isn't present.
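In case it helps anyone reading along, the decode-and-fall-back logic amounts to something like the sketch below. It is Python rather than the actual patch to webfm_driver.inc: unquote_plus stands in for PHP's urldecode() (both decode %XX sequences and turn "+" into a space), and the function name and sample values are hypothetical.

```python
from urllib.parse import unquote_plus


def display_name(title, filename):
    """Return the decoded title, or the decoded filename if no title was given."""
    decoded_title = unquote_plus(title or "").strip()
    decoded_filename = unquote_plus(filename)
    return decoded_title if decoded_title else decoded_filename


# Hypothetical stored values, for illustration only.
print(display_name("Site+Controller+Manual", "manual_a.pdf"))
print(display_name("", "Site%20Manual.pdf"))
print(display_name(None, "a%26b.doc"))
```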
Making the attached file "name" searchable
I found a somewhat unique case when searching for documents handled by WebFM. In a few cases I need to search for documents which contain no text, i.e. a PDF which contains only images. In WebFM I can give these documents a title, but that title is not included when indexing the document. I've included a patch which adds this capability. It simply prepends the document "name" (as supplied by the search_attachments_get_{module}_file() function) to the extracted document text, which then allows it to show up in searches. I'm sure there's probably a more elegant way to do this, perhaps even weighting the filename/title more heavily than the text content, but I'll leave that to someone more ambitious.
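As a rough illustration of the idea (Python, not the patch itself; the function name and sample strings are hypothetical), prepending the name means an image-only document still hands the indexer something to work with:

```python
def text_to_index(name, extracted_text):
    """Prepend the document name to the extracted text, skipping empty parts."""
    parts = [part.strip() for part in (name, extracted_text) if part and part.strip()]
    return " ".join(parts)


# An image-only PDF yields no text, so only the name is indexed.
print(text_to_index("Wiring diagram", ""))
# A normal document gets the name followed by its text.
print(text_to_index("Site Manual", "Part No. 50557"))
```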
Interesting problem, and solution
Hi Russell, this is interesting, as it raises the question of how to handle file metadata of various sorts. Two that I can think of are data maintained by the file management module (like your example) and data extracted from files but not part of the text that is normally extracted by helpers. For the latter I am thinking about things like the author, date, etc. metadata in .doc and .pdf files. Before I apply your patch to my local copy, let me see if I can generalize it a bit so that the call is used consistently within all drivers that do this sort of thing, and maybe not just use name but also description or whatever else a file management module lets you record. I don't see any problems with this approach, however.
I've also been thinking a bit about allowing multiple helpers per file type, so that specialized helpers can do things like extract specific metadata. An issue here is performance and scalability, since each call to a helper can be very expensive in terms of memory, script execution time, etc. I think in all cases any "extra" metadata would need to be pre/appended to the file text like you suggest.
Upload File Size
Hi Mark,
Sorry to bother you again.
I have one problem, but it is not related to your module.
When I upload a file, I cannot upload anything larger than 2MB.
Can you help me or suggest how I can make this limit unlimited, or at least 100MB?
Thanks and your module is working great!!
Regards
Ankit
Adjust upload_max_filesize
Hi Ankit, you need to increase the value of upload_max_filesize in your php.ini file. I am not sure where that file is on your system. Many PHP distributions come with a limit of 2MB. http://us3.php.net/features.file-upload has more detailed instructions if you need them.
Also, Drupal has its own limit, which you set in admin/settings/uploads. You will need to adjust this too if you are uploading files using Drupal. Other file management modules might have similar settings.
Good to hear search_attachments is working for you.
Dir Depth / length
Heya Mark
Does the depth of the directory have an impact ? Or the length of the file_path ?
/files/Efpec.pdf
/files/MOUNTS/Call_Centre/HelpFiles/Efpec.pdf
/files/MOUNTS/Call_Centre/Pumps and Consoles/Helpdesk Files/Efpec.pdf
All 3 files are identical
All 3 files parse in the helper app dialogue
All 3 files parse from the command line
The first two show in search results and are indexed
The 3rd file refuses to index / update but does not crash cron
I have excluded a number of files from this same directory for not indexing - tho the same file indexes elsewhere
Other pdf in this same dir are fine - so I am really at a loss for a reason :|
Tim
Max path name is currently 256 chars
Hi Tim,
The maximum length a path can be is 256 characters, but that is not the cause of what you are describing, since your longest path is 69 characters long. I chose 256 to be consistent with what I saw in other file management modules but there is really no good reason to limit this length, so I might change it to unlimited.
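For the record, the arithmetic is easy to check. A quick sketch in Python (the 256 limit is the figure quoted above; the variable names are just for illustration):

```python
MAX_PATH_LENGTH = 256  # limit quoted above

# The three paths from Tim's post.
PATHS = [
    "/files/Efpec.pdf",
    "/files/MOUNTS/Call_Centre/HelpFiles/Efpec.pdf",
    "/files/MOUNTS/Call_Centre/Pumps and Consoles/Helpdesk Files/Efpec.pdf",
]

for path in PATHS:
    status = "ok" if len(path) <= MAX_PATH_LENGTH else "too long"
    print(len(path), status, path)
```

All three come in well under the limit, so path length can be ruled out here.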
Have files listed _after_ /files/MOUNTS/Call_Centre/Pumps and Consoles/Helpdesk Files/Efpec.pdf in your search_attachments_files table been indexed?
Thanx for the reply I have
Thanx for the reply
I have been "excluding" file_paths since id = 9783
I can see these being "removed" in drupal logs
If i enable helper logging, I can see many multiple instances of already indexed files being "extracted"
I can see no attempt to extract data from id = 9791
The pdf helper can extract the data from id = 9791
I can extract the data from command line
There are no php errors
+------+-----------------+--------------+------------------------------------------------------------------------+
| id   | module          | last_indexed | file_path                                                              |
+------+-----------------+--------------+------------------------------------------------------------------------+
| 9783 | no_file_manager |   1207015478 | MOUNTS/Call_Centre/Pumps and Consoles/Helpdesk Files/Compac Main.pdf   |
| 9791 | no_file_manager |            0 | MOUNTS/Call_Centre/Pumps and Consoles/Helpdesk Files/postec.pdf        |
| 9792 | no_file_manager |            0 | MOUNTS/Call_Centre/Pumps and Consoles/Helpdesk Files/STC Main Menu.pdf |
| 9793 | no_file_manager |            0 | MOUNTS/Call_Centre/Pumps and Consoles/Nupi/01A01E02.pdf                |
| 9794 | no_file_manager |            0 | MOUNTS/Call_Centre/Pumps and Consoles/Nupi/054IN06.pdf                 |
| 9795 | no_file_manager |            0 | MOUNTS/Call_Centre/Pumps and Consoles/Orion/Orion.pdf                  |
+------+-----------------+--------------+------------------------------------------------------------------------+
last_indexed
G'day Mark
Can u please advise what "last_indexed" indicates and what causes it to change
From the above example I have filtered out the path "/Helpdesk Files/Compac Main.pdf " and also "/Helpdesk Files/postec.pdf "
Even tho it said it was "last_indexed", it appeared to get stuck at that point ( but with no apparent errors ), but kept on updating "last_indexed" with every cron.php
Now I am seeing the same on the next file
9792 | no_file_manager | 1207538556 | MOUNTS/Call_Centre/Pumps and Consoles/Helpdesk Files/STC Main Menu.pdf
9792 | no_file_manager | 1207538692 | MOUNTS/Call_Centre/Pumps and Consoles/Helpdesk Files/STC Main Menu.pdf
Coupla queries ...
1. How many passes does it take to index a file ?
2. Do multiple passes of a file occur in order to index it ?
3. What tells search_attachments to reindex a file ? I am assuming there is a check against a modified flag ? What situation would also cause it to do multiple passes of the same file in the same cron run ?
On a good note, I am no longer getting the "distinct" error
Thanx :)
last_indexed details
Hi Tim,
The last_indexed field contains a 0 if the file is newly added to the search_attachments_files table, or contains a timestamp that indicates the time at which the file was successfully indexed. This field is updated when the file's text is extracted by the helper and just before it is passed off to Drupal's search_index() function, which actually updates the index (so it should more properly be called 'last_extracted' not 'last_indexed'). If the text that is extracted is not '', the last_indexed field gets updated with the current timestamp; if it is '', the field gets updated with a zero.
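Put as a sketch, the rule in the paragraph above is roughly the following. This is Python for illustration, not the module's PHP; the function name and the `now` parameter are hypothetical, and the sample timestamp is one of the values from Tim's post.

```python
import time


def new_last_indexed(extracted_text, now=None):
    """Return the value to store in last_indexed after an extraction attempt.

    An empty string resets the field to 0; anything else records the
    current timestamp.
    """
    if extracted_text == "":
        return 0
    return int(time.time()) if now is None else now


print(new_last_indexed("", now=1207538556))
print(new_last_indexed("Part No. 50557", now=1207538556))
```

Note that under this rule any non-empty result counts as a success, even if the helper stopped partway through the file.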
The information you provide is very useful. What I think is happening is that only part of the text of Helpdesk Files/STC Main Menu.pdf is being extracted (so my test to see if the text is '' or not is passing and the value of last_indexed is being updated) but that the extraction fails before all of the text has been extracted. We need to figure out why all of the text is not being extracted. Let me think about this for a bit. I think I would like to add a new field to the search_attachments_files table that holds the extracted text. This would record where files like this one fail, and also allow for alternative extraction schedules, i.e., external helper scripts that update only the search_attachments_files table but don't screw with the search_dataset and search_index tables. This feature might not make it into 5.x-4 since I've really got to get that version out, but it would make it into the next version for sure (which might be the first 6.x version).
In response to your queries:
1. How many passes does it take to index a file ?
One. If it can't be indexed in one pass, we'd probably see symptoms like the ones you have documented.
2. Do multiple passes of a file occur in order to index it ?
No, the module assumes that it can extract the text from a file and hand that text off to Drupal for indexing in one pass. It's a one-to-one relationship.
3. What tells search_attachments to reindex a file ? I am assuming there is a check against a modified flag ? What situation would also cause it to do multiple passes of the same file in the same cron run ?
A file is reindexed if its last_changed value is greater than its last_indexed value. When cron.php runs, it invokes search_attachments twice: once to update the search_attachments_files table by checking the last updated times of all the files listed (this pass also adds any new files and removes any files that have been deleted from the file management modules or from the file system), and once via cron's normal invocation of hook_update_index(), which Drupal's search module fires in every module that implements it (in this module, search_attachments_update_index()).
The module doesn't (intentionally, anyway) perform multiple passes of the same file in the same cron run.
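A minimal sketch of the reindex test described in answer 3, in Python for illustration. It assumes the files are taken in table order and capped at the configured number of files per run; the function names, rows, and timestamps are hypothetical, and the real selection query in the module may well differ.

```python
def needs_reindex(last_changed: int, last_indexed: int) -> bool:
    """A file is due for (re)indexing when it changed after it was last indexed."""
    return last_changed > last_indexed


def pending_files(rows, files_per_run):
    """Return at most files_per_run rows that are due for indexing, in table order."""
    pending = [
        row for row in rows
        if needs_reindex(row["last_changed"], row["last_indexed"])
    ]
    return pending[:files_per_run]


# Hypothetical rows, for illustration only.
ROWS = [
    {"file_path": "docs/unchanged.pdf", "last_changed": 1000, "last_indexed": 2000},
    {"file_path": "docs/edited.pdf", "last_changed": 3000, "last_indexed": 2000},
    {"file_path": "docs/new.pdf", "last_changed": 3000, "last_indexed": 0},
]

for row in pending_files(ROWS, files_per_run=20):
    print(row["file_path"])
```

Under this reading, a file whose last_indexed stays at 0 would qualify again on every cron run, since any real modification time is greater than zero.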
Re. the distinct error, did you get my email where I said I logged into your server?
Mark
Heya Mark You can write
Heya Mark
You can write almost as much as I do :)
Nope, did not receive email from you - looks like spam filter ate it and I can't get it back now :|
Your explanations are always welcomed by me ... even if it doesn't always make sense immediately :)
I will have to digest it for a while and compare it to what I am seeing in the logs etc
But as always, I will persevere
Tim
Let's bypass spam filters and use personal email of leatherandlace@netspace.net.au or pariah911@hotmail.com <=== please sanitise this address for me :)
Postgresql install issues
Hi All,
I have tried installing this module on a PostgreSQL database but can't get it to work (it fails to install the required tables).
I then manually installed these tables, and reinstalled the module ( no errors/warning but no search_attachment UI gets created - as if the module is not enabled)
Drupal: 5.1
PostgreSQL 8.2.4
search_attachment: 5_x_.3 (also tried 5_x_4-dev)
Has anyone managed to install it on PostgreSQL? Is there a known issue, or am I missing something?
Thanks for any help/guidelines
--
Walied.