search_attachments.module for Drupal

See the module's project page at http://drupal.org/project/search_attachments for additional information, including archived support requests.

Purpose

search_attachments.module allows searching the text of PDF, MS Word, plain text, and other types of files attached to nodes. As of version 5.x-4, the module will also allow searching of files that are not attached to nodes but that are FTP'ed or otherwise uploaded to a Drupal site.

In order to extract the text from attached files, this module calls 'helper apps'. The module is not limited to using specific helpers described below -- Drupal administrators can configure any helpers they like.

Currently, search_attachments.module is available for Drupal 4.7.6 and 5.x. The 4.7.6 version is no longer supported, but the 5.x version will be maintained to keep pace with Drupal. All versions have been tested on Linux and Mac OS X. Users have reported no problems using the module on Windows other than with helper paths that contain spaces (see below).

If you need Drupal 6.x compatibility you may want to consider using Search Files, which offers a subset of search_attachments' functionality. Search_attachments will be ported to Drupal 6 shortly.

Screen snapshots

Some screen snapshots are available that show both the end user and administrator views.

Helper apps

In order to use search_attachments.module, you will need the appropriate helper apps on the same computer that Drupal is running on. These apps need to print out extracted text to standard output; currently, search_attachments cannot read extracted text that is saved to a file.
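As a rough illustration of the standard-output requirement, the sketch below builds a helper command from a template containing the %file% placeholder used in the module's helper settings, then captures whatever the helper prints. The function names are hypothetical and are not taken from the module's source; escaping the path with escapeshellarg() is an assumption about how one might handle file names that contain spaces.

```php
<?php
// Hypothetical helper functions; not the module's actual code.

/**
 * Replaces the %file% placeholder in a helper command template with the
 * shell-escaped path of the file to be read.
 */
function example_build_helper_command($template, $file_path) {
  return str_replace('%file%', escapeshellarg($file_path), $template);
}

/**
 * Runs the helper and returns whatever it printed to standard output,
 * or an empty string if it printed nothing or could not be run.
 */
function example_extract_text($template, $file_path) {
  $output = shell_exec(example_build_helper_command($template, $file_path));
  return is_string($output) ? $output : '';
}

// pdftotext prints to standard output when '-' is given as the output file.
echo example_build_helper_command('pdftotext %file% -', '/tmp/annual report.pdf'), "\n";
```

A helper that can only write its result to a file would return nothing through example_extract_text(), which is why such helpers cannot be used at present.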

Please see the Recommended Helper Apps list for links to helpers and more information.

Running search_attachments on Windows

Helpers with paths containing spaces, such as c:\Program Files\Acme Helper\ahelper.exe, will not work. You will need to install your helpers in locations that do not contain spaces. I have tried to troubleshoot this problem a number of times and have failed to come up with a reliable solution. Common-sense approaches like imitating Windows' own way of quoting paths, like '"c:\Program Files\pfile.pl" "c:\tmp\data dir\test.txt"', do not work. If anyone has a solid fix for this problem, and can supply some code, I would be happy to receive it.

Module driver files

As of 2007-06-08, the 5.1-dev version of search_attachments requires the use of file manager driver files, which contain several functions specific to each module that is being used to manage attachments. Here is upload_driver.inc, the driver file for the core upload.module:

<?php
// $Id$

/**
 * Each module that manages files needs to have a 'driver' like this one, containing the four functions below,
 * each with the module's name (a.k.a 'the current module' or 'the file management module' below) as the
 * second segment of the function name, e.g., get_upload_file_list(). More documentation is available at
 * http://interoperating.info/mark/search_attachments.
 */

/**
 * Returns a nested array of files which are managed by the current module, with their 'id' and
 * 'path' attributes as defined by the current module, and their last modified time via stat()'s
 * mtime value. We serialize $row->nid because nids are stored in {search_attachments_files} in
 * serialized form in order to accommodate arrays of nids for some file management modules, e.g., webfm.
 * Only files that have an extension defined in the {search_attachments_helpers} table get returned
 * by this function.
 */
function search_attachments_get_upload_register_files() {
  // Start with an empty array so that callers always get an array back.
  $files = array();
  $result = db_query("SELECT fid, filepath FROM {files}");
  while ($row = db_fetch_object($result)) {
    if (search_attachments_has_helper($row->filepath) && (file_exists($row->filepath))) {
      clearstatcache();
      $stats = stat($row->filepath);
      $nids = search_attachments_get_upload_file_nids($row->fid);
      $files[] = array('module_id' => $row->fid, 'nid' => $nids, 'file_path' => $row->filepath,
        'module' => 'upload', 'changed' => $stats['mtime']);
    }
  }
  return $files;
}

/**
 * Given a file's ID in the current module's db table, returns an array of attributes ('mtime', 'url',
 * 'name', 'size', 'nid'). The nid is serialized because nids are stored in {search_attachments_files} in
 * serialized form in order to accommodate arrays of nids for some file management modules, e.g., webfm.
 */
function search_attachments_get_upload_file($fid) {
  // Select the attachment's size, file name, and path.
  $result = db_query("SELECT filesize AS size, filename AS name, filepath AS path FROM {files}
    WHERE fid = '%d'", $fid);
  $file = db_fetch_object($result);
  $info = array();
  clearstatcache();
  $stats = stat($file->path);
  $info['mtime'] = $stats['mtime'];
  $info['url'] = file_create_url($file->path);
  $info['name'] = $file->name;
  $info['size'] = $file->size;
  $info['nid'] = search_attachments_get_upload_file_nids($fid);
  return $info;
}

/**
 * Given a file's ID in the current module's db table, returns a serialized array of all the nodes
 * the file is attached to. For upload.module, there should always only be one nid in this list since
 * upload.module allows a file to be attached to only one node.
 */
function search_attachments_get_upload_file_nids($fid) {
  $nids = array();
  $result = db_query('SELECT nid FROM {files} WHERE fid = %d', $fid);
  while ($row = db_fetch_array($result)) {
    $nids[] = $row['nid'];
  }
  $serialized_nids = serialize($nids);
  return $serialized_nids;
}

/**
 * Returns a permission string that controls who can view files managed by the current module.
 * This string should be identical to the one used by the file management module.
 */
function search_attachments_get_upload_view_permission() {
  // Return permission string from module that allows users to access attachments.
  return 'view uploaded files';
}
?>

Search_attachments comes with three driver files: upload_driver.inc, attachment_driver.inc, and webfm_driver.inc. A fourth driver, no_file_manager_driver.inc, handles files that are not managed by another module, for example if they are FTPed to your Drupal instance. If you are using any of the three file management modules or if you FTP files to your Drupal instance, you don't have to do anything special; just follow the installation instructions below.

If you are using a different file management module, you will need to create a driver for it. To do so, put a PHP file in the search_attachments directory with the same name as the module you want to get file paths from, appended with '_driver', and give it an '.inc' extension (like 'upload_driver.inc'). Then, write four functions in that file whose names follow the patterns 'search_attachments_get_modulename_file_list($nid)', 'search_attachments_get_modulename_file($fid)', 'search_attachments_get_modulename_file_nids($fid)', and 'search_attachments_get_modulename_view_permission()', as illustrated in the upload_driver.inc code above. Your code will need to return the same variables that upload_driver.inc's functions do.
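As a starting point, here is a minimal sketch of what a driver for a hypothetical module called 'mymodule' might look like. The module name, the permission string, and the hard-coded rows are all assumptions made so that the sketch can run outside Drupal; a real driver would read its rows with db_query() and build URLs with file_create_url(), as upload_driver.inc does, and would also skip files that have no helper or do not exist. The first function is named after the one in the upload_driver.inc listing above; check the version of search_attachments you are running for the exact name it expects.

```php
<?php
// Sketch of a hypothetical mymodule_driver.inc. 'mymodule', its permission
// string, and the hard-coded rows below are assumptions for illustration.

/**
 * Stand-in for the file management module's database table. A real driver
 * would read these rows with db_query().
 */
function example_mymodule_rows() {
  return array(
    array('fid' => 1, 'nid' => 10, 'filename' => 'report.pdf',
      'filepath' => 'files/report.pdf', 'filesize' => 2048),
  );
}

/**
 * Returns the file's last modified time, or 0 if the file does not exist.
 */
function example_mymodule_mtime($path) {
  clearstatcache();
  return file_exists($path) ? filemtime($path) : 0;
}

/**
 * Returns one entry per managed file, in the shape upload_driver.inc uses.
 */
function search_attachments_get_mymodule_register_files() {
  $files = array();
  foreach (example_mymodule_rows() as $row) {
    $files[] = array(
      'module_id' => $row['fid'],
      'nid' => search_attachments_get_mymodule_file_nids($row['fid']),
      'file_path' => $row['filepath'],
      'module' => 'mymodule',
      'changed' => example_mymodule_mtime($row['filepath']),
    );
  }
  return $files;
}

/**
 * Returns the attributes of a single file, or an empty array if not found.
 */
function search_attachments_get_mymodule_file($fid) {
  foreach (example_mymodule_rows() as $row) {
    if ($row['fid'] == $fid) {
      return array(
        'mtime' => example_mymodule_mtime($row['filepath']),
        // A real driver would call file_create_url() here.
        'url' => $row['filepath'],
        'name' => $row['filename'],
        'size' => $row['filesize'],
        'nid' => search_attachments_get_mymodule_file_nids($fid),
      );
    }
  }
  return array();
}

/**
 * Returns a serialized array of the nids the file is attached to.
 */
function search_attachments_get_mymodule_file_nids($fid) {
  $nids = array();
  foreach (example_mymodule_rows() as $row) {
    if ($row['fid'] == $fid) {
      $nids[] = $row['nid'];
    }
  }
  return serialize($nids);
}

/**
 * Returns the (hypothetical) permission that lets users view these files.
 */
function search_attachments_get_mymodule_view_permission() {
  return 'view mymodule files';
}
```

The nids are returned serialized, even when there is only one, so that drivers for modules that attach a file to several nodes can use the same format.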

Installation and usage of search_attachments.module

  1. Install helper apps such as catdoc and pdftotext. The Unix cat command is sufficient to test .txt attachments.
  2. Place search_attachments.module in your Drupal modules directory.
  3. Log into Drupal as admin and go to administer > modules and activate the module (the 5.x version installs its own database tables; the 4.7.x version uses only the Drupal variable table).
  4. Go to administer > settings > Search attachments settings and configure the helpers you want to use. If you save the settings and the module can't find the indicated helper apps, it will tell you.
  5. Attach some files to nodes using the file management modules that you have drivers for, or upload some files and make sure you have your directory paths configured properly.
  6. Go to Administer->Site configuration->Search settings, then re-index site.
  7. Running cron.php on your site will index any new attachments (e.g., http://yoursite.org/cron.php).
  8. Test by searching for words contained in your attachments. You should see a 'Files' tab (or whatever you named it in the admin settings) listing your results.

This module uses PHP's shell_exec() function, so you should restrict 'administer' access to trusted users -- normal end users would never need to configure site-wide search settings anyway so this should be an obvious precaution. Just thought I'd point it out.

If you are sure that pdftotext and catdoc are installed (i.e., 'which catdoc' returns a valid path), and search_attachments.module still complains that it can't find the helpers, chances are that PHP is configured to use safe mode (http://php.net/features.safe-mode).
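One way to check this from PHP itself is sketched below. The function names are hypothetical and the check is not part of search_attachments; it only inspects the safe_mode and disable_functions settings, either of which can stop shell_exec() from running helpers.

```php
<?php
// Hypothetical diagnostic; not part of search_attachments.

/**
 * Returns TRUE if $function appears in a comma-separated disable_functions
 * value such as 'exec, shell_exec'.
 */
function example_function_is_disabled($function, $disable_functions) {
  $disabled = array_map('trim', explode(',', $disable_functions));
  return in_array($function, $disabled, TRUE);
}

/**
 * Reports whether shell_exec() looks usable under the current PHP settings.
 */
function example_shell_exec_available() {
  if (!function_exists('shell_exec')) {
    return FALSE;
  }
  if (example_function_is_disabled('shell_exec', (string) ini_get('disable_functions'))) {
    return FALSE;
  }
  // ini_get() returns FALSE for 'safe_mode' on PHP versions that no longer
  // have the setting, which is treated here as "safe mode is off".
  $safe_mode = ini_get('safe_mode');
  return !($safe_mode && strtolower($safe_mode) !== 'off');
}

echo example_shell_exec_available() ? "shell_exec() looks usable\n" : "shell_exec() is blocked\n";
```

If this reports that shell_exec() is blocked and you are on a shared host, the setting is probably controlled by your hosting provider.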

To do

  • The ability to display search results from both nodes and attachments at the same time. This will likely not happen until Drupal 7, since there is an issue open for Drupal 7's search module that will make combined search results lists a lot more feasible.
  • Integrate functionality to allow more efficient parsing of attachments, i.e., to reduce the possibility that cron.php will time out. If this is happening to you, see http://drupal.org/node/65307 for a way to run cron from the Unix command line.
  • The ability to use helper apps that save extracted text to a file instead of printing it to standard output.

Thank you

-Yuri McPhedran for early testing, suggestions, and pointers to helper apps for various platforms.
-Dmitry Arkhipkin for various patches, including one that enabled indexing of attachments separate from nodes.
-Andrew Turner for drupal_get_path patch.
-Jake Ochs for feedback of various types.
-WorldFallz for the attachment.module driver.
-Everyone else who posted comments with suggestions and offers of help, or who emailed me with same.

Attachment                      Size
search_attachments_4_7.tgz      4.79 KB
search_attachments_5_x_3.tgz    16.33 KB

Comments

how about xls, ppt and rtf

how about xls, ppt and rtf searching?

helpers are:

xls --> xls2csv
rtf --> unrtf
ppt --> ppthtml

cheers,

Damian

Have a look at the swish-e module

RE: todo
"The ability to index attachments separately from the node text. Using the current architecture, I don't see how this can be done since the attachment's identity is lost by virtue of its being appended to its parent node's text. Any suggestions are welcome."

This is already possible, if you use the drupal module swish-e, see http://drupal.org/node/16428

Best regards

Juerg

Thanks Juerg, I'll take a

Thanks Juerg, I'll take a look to see how that module accomplishes this.

swish-e !shared hosting

I think this is still something to evaluate due to the fact that swish-e is not suitable for a shared hosting environment and mark has made no mention that might be the case here. Mark, this could be very beneficial. Go forth and conquer!

Most web hosts providing

Most web hosts providing *nix will have pdftotext and cat installed, but not swish-e. Search_attachments is an alternative to swish-e.

move to drupal's site.

You should move this module to the actual drupal site so it is easier for people to find. I found the link after searching the forums for a while.

Migration for Drupal 5.0

Great!

Will this module be migrated to Drupal 5.0?

interested in this module

def interested in this module. What version of drupal have you tested it against?

agreed - add to Drupal.org

Agreed-- please add this gem to drupal.org. I too came across it via that drupal thread in search for exactly this while brainstorming (still just looking forward to trying it one day).

Windows Paths

Excellent module, and I have the PTConverter for PDF documents in the "Program Files" directory, and no matter what I try to do with the path it still does not find the executable.

I saw the Perl example above, but where is it actually located?

Any advice is appreciated.

Thanks
Stephen

Belated reply.. Looking at

Belated reply..

Looking at the PTConverter website, it looks to me like it should work as long as the output .txt file option is not required. If you omit the output file option and run it against a PDF, and the extracted text is printed to the console, it should work. Early versions of search_attachments contained a bug that made extraction from attachments whose filenames contained spaces fail. I'd be interested to hear if the current 4.7.x or 5.x versions work with PTConverter (without the output .txt file option).

Yes!

Yes - I'd love to see this ported to v5.x. Can I help?

Thanks,

Jason

Drupal 5

This is exactly what I need for a site I'm working on. Only problem is I'm using Drupal 5 :-)

Do you have any idea when this might be ready for v 5?

Cheers.

Drupal 5.x

Hi,
I'm very interested in this module.

I need this module for Drupal 5+,
can I help you to upgrade this
module?

thanks
diego

Updated to 5.0

Mark-

I have a working 5.0 module with forms/settings updated, XLS, PPT, and RTF added, bug in the shell command fixed, etc....
Just sent it to you by email, please let me know if you do not get it.

mike@codingclan.com

Great module! I did a quick

Great module! I did a quick port to drupal 5 and it works. It's actually just adding a menu_hook into the file with a callback to the settings function. In the settings function, just add the system_settings_form to return the form.

One small remark. In the settings page, make sure you set the %file% between double quotes. Otherwise, attachments with spaces in their name will not be indexed.

Thanks for the tip

I'll add the quotes -- thanks.

Mark

5.0 soon?

What a great module - should have been part of drupal's search to begin with!

I could really use this pronto - do you have an idea when it will be ready for 5.x?

I also agree you should get into the modules index at drupal.org.

Thanks!

Working on Windows with Cygwin

OK, I've added this fix to the 5.x dev version

Thanks a lot for supplying the fix Andy. I've incorporated it into the 5.x dev version and have tested it on Mac OSX, so it should also work on other *nix platforms.

The ability to index attachments separately from the node text.

Regarding the ability to index attachments separately from the node text, one can do the following :
1. Modify driver to retrieve attachment id along with path
2. Add "search_index($attachment['id'], 'attachment', $attachment_text);" to function search_attachments_nodeapi so text is indexed with type="attachment".
3. Create function "search_attachments_search"(hook_search) and add "name" => "Files",
"search":
a) $find = do_search($keys, 'attachment');
b) Ask driver to retrieve items with $find->sid (node id included)
c) Wrap & return results according to hook_search rules (snippet is parsed from {search_dataset})

Done! We have "Files" tab and can search attachments only (+ node backreference link in result).
I believe it is better than what is done now, because it is impossible to find out which attachment has the information I search for when node has 2+ attachments.

Dmitry, I'll give this a

Dmitry, I'll give this a try, unless you've already got some code I can patch against the current version of 5.x-dev.

Excellent!

Dmitry, I've incorporated your patch -- the module now indexes nodes and their attachments separately. Excellent!

Dmitry's patch and node_access

When performing a search for files, will search_attachments' search results include files that are attached to nodes which are inaccessible to a user?

Not for files managed with

Driver-specific view permissions added

I've added driver-specific view permissions to the latest version of the module. If the user doesn't have the permission defined in the driver file ('view uploaded files' for upload, 'see webfm_attachments' for webfm), the attachment isn't included in the search results. Thanks for picking this up.

The problem is when you have

The problem is when you have a page of results that you don't have access to view.

It is possible, for example, for a user to get a page of upload attachments followed by a page of webfm attachments when they have no access to the upload attachments. They will get an empty page followed by a full page.

Please send screen caps

In the current implementation, the themable information about an attachment (link, title, snippet, etc.) only gets added to the $results array if user_access() returns true, so the list of results should only display entries for attachments that belong to nodes the user has access to. See excerpt from Dmitry's original patch above -- I've updated it to use permission strings from any driver file but the conditional logic remains the same. I don't have enough attachments populated in my test drupal to see what you are describing -- can you send along some screen caps so I can take a look?

Will this restrict results based on taxonomy access?

I'm using taxonomy access and I'm wondering if this tool will restrict search results based on access settings and the user's role?

Thanks,

Moses

I've added db_rewrite_sql to solve this

Hi Moses,

I've modified the module to only display entries in the results whose parent nodes are filtered by db_rewrite_sql. If I understand correctly (and I have to admit I am having a difficult time with db_rewrite_sql), this means that any module that restricts access using hook_db_rewrite_sql (like taxonomy access) will only return nodes that the current user has access to. Nodes, and therefore attachments, that the user doesn't have access to should not show up in the results list. I'd appreciate it if you could test the search_attachments_5_1-dev-db_rewrite_sql.tgz version above and let me know how it works.

This fix should also address Andrew's question.

Thanks, Mark. I'm still

Thanks, Mark. I'm still trying to get the module working. I have it and pdftotext installed and I can see the contents of a pdf file in the testing area, but for some reason the search results aren't bringing up the right nodes.

I'll test the fix as soon as possible.

I got the module working and

I got the module working and the taxonomy access restriction seems to be working. Thanks!

Now what's the best way to combine all the results on one tab?

Currently, there is none

But it would be good to have. Ideally, I'd like to have a checkbox in the advanced search form that would allow you to include attachments in node searches. Is that what you were thinking?

Yes, exactly

It seems like it will be difficult because hook_search doesn't support it. Is this your take on it?

Performance?

This is an excellent module!!! Thank you so much!
I installed it and it is working (Note: the function get_upload_view_permission is missing in the attached search_attachments_5_1-dev-db_rewrite_sql.tgz, I added it from this page).

It is also working with Powerpoint and Excel files with the tools catppt and xls2csv from the catdoc package - Fantastic!

I am planning to use this on a site with a few thousand files (a few gigabytes in total) - does anyone have an idea how this module would perform in such a scenario? Anyway, I will soon be able to test this.

Indexing arbitrary upload directories?

On a large site we're using, only about 1/3rd of our documents are actually managed via attachments. The rest we pile into a generic 'docs' directory that is web-accessible. What I'd like to do is tell search_attachments to also index this directory.

Right now the module scans the attachments of nodes, and for each attachment, indexes it. I tried to work on an INC driver file for this, but the documentation on how to do INC files is sparse, and also requires an associated module to be available for the files being indexed. Sort of a catch-22.

Any pointers to an inc file that can do this would be a huge win for us :)

Sorry for the delay

Sorry for the long delay in answering this question. Currently, search_attachments requires that files be attached to nodes because the theming of the search results requires a node ID, node title, etc. Your .inc driver might be supplying enough information to the module and the files might be getting indexed, but they would not show up in your search results because they don't have a node ID. Would you mind supplying your .inc driver so I can take a look?

help with pdftotext on shared hosted environment

I'm trying to set up the module on my Fedora Core 4-based hosted virtual server. Can someone give me tips on how to install the pdftotext runtime? Someone here commented that it's often already on hosted servers, but a "whereis" for xpdf or pdftotext came up empty. I downloaded the x86-Linux Xpdf from http://www.foolabs.com/xpdf/download.html and put the executables in /usr/local/bin as per the install file, but attempting to run pdftotext from the console or pointing to it from the module doesn't work. Any help would be appreciated.

-Thanx in adv.

problem with installation/indexing

I've tried to install/enable search_attachments on an existing Drupal 5.2 site and when I "Save Configuration" from the modules admin page I get a blank screen. Thereafter, the module indicates that it was installed, and I can configure the app, but my PDF attachments don't show up on search. I've also tried to install it on a vanilla (unmodified) basic Drupal 5.2 install for testing and, while I do get a proper confirmation message after pressing "Save Configuration" and can config the module, I still can't get indexed PDF search results. I have confirmed that the pdftotext module is installed and working on the host. Any help/suggestions would be greatly appreciated.

The first problem, the blank screen...

The first problem, the blank screen on saving your modules page, is usually a symptom of insufficient memory allocation to PHP, set in php.ini. If you're on a shared host, you likely don't have access to this setting so you might want to submit a trouble ticket to your web host tech support.

As for the second problem, modify one of the nodes that you have a PDF attached to, reindex by hitting cron.php, search for a word in the PDF, and see if that works. search_attachments won't pick up attachments unless the node they are attached to has been modified.

Module installed in two locations?

It looks like you have the module installed in two locations, the general site modules directory and also the sites/all/modules directory. Remove it from one and try again. Let me know what happens.

that's one problem out of the way...

Doh! Fixed. OK, so the module installs and configures. I upload a sample pdf containing the text "Homer simpson cromulent words in pdf attachment", run cron and the search for "cromulent." nada. Still no results.

I've added text to remind users to look under the 'Files' tab

Jake, based on our conversation, I've made some changes that will hopefully increase the visibility of the 'Files' tab. You can also now control the label on this tab. Thanks for the feedback.

biblio

hello,

I would like the biblio module to work with the search attachments. Is there somebody who knows how to make an inc file as described in the documentation ?

I would like the search attachments to work on both the attached files and the files linked using the URL field (mostly pdf)

Thanks,

Luc

Biblio is supported

Hi lucDK, if the core upload module is enabled, biblio uses it to manage attachments. Since an .inc driver file for upload is supplied with search_attachments, you already have everything you need. I just confirmed this on my laptop, using biblio 5.x-1.9. If you still can't index attachments on biblio entries, let me know.

Access controls on files uploaded without using Drupal modules

I'd like to get some indication of how people who are wanting to index files not attached to any node restrict access to those files. Are these files accessible to all users, including anonymous users? Do you use any access/authentication modules to restrict access to them? Presumably if these files are uploaded without using Drupal modules, Drupal can't (as far as I can see) control access to them.

Thanks for any info you can provide.

Problem with French characters

Hello,
First, thanks a lot for your module, it works great (almost) for me.
I just have a problem with French characters: they are not recognized in a PDF document.
In the text shown on the results page, French characters are missing every time they should appear in a word.
I've tried to install the Latin2 language package, but it does not fix my problem, and I'm not sure I did it correctly.
I downloaded the package, put it on the server, and added these lines to the xpdf config file:

unicodeMap Latin2 /usr/local/share/xpdf/latin2/Latin2.unicodeMap
textEncoding Latin2

I've tried the other language included in xpdf, but it doesn't work...

Do you have any idea that may help me?
Thanks a lot!

Try pdftotext

pdftotext is part of xpdf. If you run the command 'pdftotext Yourfile.pdf -', on a PDF that contains French characters, you see the French characters. So, xpdf/pdftotext can handle French. However, it appears that during the indexing process, Drupal removes the French diacritics. I'll look into whether this is normal behavior.

Encoding problem fixed

Hi Pierre, I've finally fixed this problem, in version 5.x-3. All text extracted by helper apps is now converted to UTF-8 before being indexed. Sorry for the delay, and thanks for reporting the problem.
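For anyone curious what such a conversion can look like, here is a minimal sketch. It is not the module's actual code: the function name is hypothetical, and it assumes that any helper output which is not already valid UTF-8 is ISO-8859-1 (Latin-1), which will not be true for every helper or document.

```php
<?php
// Hypothetical sketch; the module's own conversion code is not shown here.
// Assumption: text that is not valid UTF-8 is ISO-8859-1 (Latin-1).

/**
 * Returns $text as UTF-8, converting from Latin-1 if it is not valid UTF-8.
 */
function example_to_utf8($text) {
  // With the 'u' modifier, preg_match() fails on input that is not valid UTF-8.
  if (preg_match('//u', $text) === 1) {
    return $text;
  }
  $utf8 = '';
  $length = strlen($text);
  for ($i = 0; $i < $length; $i++) {
    $byte = ord($text[$i]);
    if ($byte < 0x80) {
      // ASCII bytes are identical in Latin-1 and UTF-8.
      $utf8 .= $text[$i];
    }
    else {
      // Latin-1 code points 0x80-0xFF become two-byte UTF-8 sequences.
      $utf8 .= chr(0xC0 | ($byte >> 6)) . chr(0x80 | ($byte & 0x3F));
    }
  }
  return $utf8;
}
```

Text that is already UTF-8 passes through unchanged, so running the conversion on every helper's output does not double-encode accented characters.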

bug in 5 2

Fixed in 5.3-dev?

Can anyone replicate this problem in 5.3-dev? I can't but would like confirmation that no one else can either.

Bug in 5.x-3: only site admin gets files search results

I couldn't figure out why the site administrator was the only one who got search results under the Files tab. I finally tracked it down to the webfm.inc driver file in the get_webfm_view_permission() function. The permission string for webfm should be "view webfm attachments", not "see webfm_attachments".

Good catch, I'll fix that

Good catch, I'll fix that and upload a new version later today.
Thanks,
Mark

opt out certain files

I recently discovered this module and I am excited about it. I was wondering if there's a feature or future feature that enables certain files not to be indexed (per company policies, etc.). I love the idea of searching through my PDFs and Word docs, but I have a few PDFs that we at my company are not allowed to post copyable text of anywhere on the site.

thanks a bunch and keep up the good work!

Interesting feature

Interesting feature -- this is certainly possible, esp. now that I am currently changing this module so that it maintains a central registry of all files to be indexed (see http://drupal.org/node/188895 for preliminary discussions of how this would work). A field in the file registry table could be reserved for a flag that indicates that the file is to be excluded from indexing.

How would you envision that users/admins would indicate whether a file is not to be indexed? Keep in mind that the new version will allow indexing of files not attached to nodes (e.g., uploaded via FTP and other means) so flagging files as not indexible within the edit record for the node they are attached to wouldn't apply. http://interoperating.info/mark/node/72 shows some screen caps of the admin pages for the current version of the module, which might suggest some possible locations, but I assume that you'd want regular users or at least certain roles to be able to control which files are excluded?

Thanks for the suggestion -- I look forward to hearing back from you.

Single list of all files to exclude from indexing, with patterns

Hi Ryan,

Thinking about how this might work, I'd like to suggest that all the files to be excluded from indexing be identified in a single text area, one per line, similar to how the page-specific visibility settings are defined for modules. This list would be defined on the same page where admins define all the directories where files can be located, and would contain patterns (probably regular expressions), so you could define things like specific file paths and patterns like 'files/restricted' (paths) and 'confidential' (in names). I am assuming that 1) the number of files to be excluded would be relatively small, and 2) only users with 'administer search_attachments' access should have this ability -- in other words, ordinary users who might be uploading files would not. Let me know what you think.

Mark
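To make the proposal more concrete, here is a minimal sketch of how a one-pattern-per-line exclusion list could be checked against a file path. No such setting exists in the module yet, so the function name and the treatment of each line as a regular expression without delimiters are assumptions.

```php
<?php
// Hypothetical sketch of the proposal; no such setting exists in the module yet.

/**
 * Returns TRUE if $path matches any of the patterns in $patterns_text,
 * which holds one regular expression per line (without delimiters).
 */
function example_file_is_excluded($path, $patterns_text) {
  foreach (preg_split('/\r\n|\r|\n/', $patterns_text) as $pattern) {
    $pattern = trim($pattern);
    if ($pattern === '') {
      // Ignore blank lines in the text area.
      continue;
    }
    // Use '~' as the delimiter and escape any '~' inside the pattern.
    $regex = '~' . str_replace('~', '\~', $pattern) . '~';
    // An invalid pattern makes preg_match() return FALSE; treat it as no match.
    if (@preg_match($regex, $path) === 1) {
      return TRUE;
    }
  }
  return FALSE;
}

$patterns = "files/restricted\nconfidential\n";
var_dump(example_file_is_excluded('files/restricted/a.pdf', $patterns));
var_dump(example_file_is_excluded('files/docs/public.pdf', $patterns));
```

Because the patterns are unanchored, 'confidential' would exclude any file with that word anywhere in its path, which matches the 'in names' use described above but may be broader than intended for short patterns.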

Helper Program Path Observation

Excellent work on the upgrade. Congratulations and thanks for the module.

One small point is that if the helper program path contains a space, even if in quotes the module does not find the path.

I am running it on a windoze server.

Thanks again
Stephen

Can you send me an example helper path?

Hi Stephen, glad you find search_attachments useful. Can you post one or two helper paths that have this problem so I can take a look?

Mark

Cool! It works!

Thank you! Very nice!
I located a Win32 version of catdoc (David L Norris' site at http://webauger.com), so I can now search "searchable PDFs," Word, PowerPoint and Excel docs. The only thing I might wish for is the ability to look at files as they are uploaded to existing nodes, or are changed/replaced. It is okay, though; small site, so re-indexing is fine.

Indexing on upload

Excellent, glad you like the module and that you're having luck on Windows. I'll look into having the files indexed immediately on upload. Good idea.

Mark

Well done, very good job

Well done, very good job !!!

Could you be so kind and tell us the path that you use & where did you install/copy the files.

Thank you very much.

Paul

file paths

Um, which paths?

- Attachments are in the Drupal "files" folder, augmented by the "upload path" module
- I don't use "Program Files"; I have a custom folder (no spaces) for server apps (\apps, actually). "xpdf" and "catdoc" are in there. No Cygwin. We don't do text-files as attachments, so that is not an issue. Yet.
- Module config paths (for the file-types): I use full local-paths, forward slashes, but otherwise as-is. Per someone else's comment earlier.

Sorry, I was talking about

Sorry,

I was talking about the module config paths. I'm trying to configure it as you say, but the indexing never finishes. Could you please tell me exactly what you enter in the module's "helper path" setting for catdoc (DOC files)?

Thanks again for your help

Paul

oh!

i put all my server-side apps in d:\Apps (catdoc is in \catdoc), so the helper path is d:/Apps/catdoc/catdoc.exe %file% (for DOC)

on the not finishing the indexing; i had problems with that in other areas related to cron, and i had to either decrease the number of items indexed per run, and/or increase my max_execution_time in php.ini to 45 (or more). especially when i had to move an installation to a slower server.

hope it helps!

I couldn't pipe the

I couldn't pipe the shell_exec() output, so it did not work on my system. $test_file_text has no value. How can I handle this?

What platform are you running?

Are you on *nix or Windows?

Windows + IIS

Windows + IIS

search_attachments and porter-stemmer module

When documents are indexed and the porter-stemmer module(which helps our search relevance a ton) is installed, their output on the serp is basically unreadable. The stems are displayed in the snippet. I haven't traced down where this happens, and am hoping it is trivial, have you heard of this issue, or have a fix? The 'content search' still displays the regular snippet with stemming enabled(and searches are improved).

I'll look into it

Thanks for reporting this. Search_attachments uses the Drupal search_excerpt() function just like the core search module so I'm a bit surprised this is happening, but I'll look into the problem.

I get the same really messed

I get the same really messed up summary output from searches with porter stemmer too, I'll do some testing over the weekend to see if I can narrow it down a little.

empty page search results

An issue that we've run into with this module is that the results are returned first and only then checked for user access. This leads to unfilled and/or empty search pages (which mistakenly report that there are no results).
Example:
There are 30 results for the search phrase "the" for an authenticated user. When the authenticated user searches, everything is fine. When an anonymous user searches for this phrase, page 2 is "empty" (because anonymous users don't have view privileges for any content on the second page), and clicking on page 2 from the pager, we get:

Your Search Yielded No Results
* Check if your spelling is correct.
* Remove quotes around phrases to match each word individually: "blue smurf" will match less than blue smurf.
* Consider loosening your query with OR: blue smurf will match less than blue OR smurf.

When looking through your code, it seemed that there was a chicken before the egg issue, so maybe this won't be a trivial fix.
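To illustrate the ordering problem described above, here is a minimal sketch that contrasts paging before the access check with paging after it. It is not the module's code; the function names, the can_view callback, and the page size of 10 are hypothetical.

```python
def page_then_filter(results, can_view, page, per_page=10):
    """Cut out one page first, then drop what the user may not see (can leave a page empty)."""
    start = page * per_page
    return [r for r in results[start:start + per_page] if can_view(r)]


def filter_then_page(results, can_view, page, per_page=10):
    """Drop what the user may not see first, then cut out one page."""
    visible = [r for r in results if can_view(r)]
    start = page * per_page
    # The second value is the count the pager should be built from.
    return visible[start:start + per_page], len(visible)
```

With 30 results of which an anonymous user may view only the first 10, page_then_filter returns an empty second page, while filter_then_page reports 10 visible results so the pager never offers a second page.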

Also wanted to give you a shout out/props for great work on this module.

If you can point out the chicken and egg, I'll take a look

Thanks for reporting this problem, one I thought I had fixed a few versions ago. If you could be more specific about the chicken and the egg (i.e., what code do you suspect is causing the problem), I'd be happy to take a look. I am currently quite far along in a substantial rewrite of this module and am now in the testing phase of a large group of new features, including a new approach to managing files not attached to nodes (as well as files attached to nodes), so this might be a good time to include changes that are not trivial.

I'm glad you find this module useful,

Mark

Update module on Drupal.org

Mark, would you please move development of this module to drupal.org? There's a lot of good project management stuff over there that will help this module along the way, as well as help the module's users. For instance, the update_status and cvs_deploy modules can't be used with this module because it isn't updated in the Drupal repository. Additionally, if there's a security issue found in your module, Drupal's Security Team can work with you to get the problem fixed and help notify all the module's users about the issue.

Excellent reasons to commit to cvs

Hi Tim, the reasons you provide for committing search_attachments to the Drupal cvs are excellent ones. I'll do it this week.

Great, thank you!

Great, thank you!

Any update on this?

Any update on this?

I've tried and failed miserably

I committed 5.x-3 and somehow created a release package that is totally broken. My attempts to use cvs have always resulted in misery, and this instance is no exception. I've followed http://drupal.org/handbook/cvs/quickstart and other resources and still somehow have not been able to get this done. If you could supply any assistance I would be grateful.

Thank you for giving it a

Thank you for giving it a try! I've never actually done a module release, so I'm not entirely sure what the problem is. It looks like there's no DRUPAL-5--3 branch, which I think is a prerequisite for a correct release from that branch. If you want to give me CVS access to the project (I'm Junyor on drupal.org), I can try to clean it up.

FWIW, instead of messing with sending the branch commands myself, I've always just used separate checkouts for each branch, as described in http://drupal.org/handbook/cvs/quickstart#advanced. Then, any time you commit files in the branch checkout, they're automatically checked in on the correct branch.

How can I help?

It looks like the DRUPAL-5 branch isn't correct. If you look at http://cvs.drupal.org/viewvc.py/drupal/contributions/modules/search_atta..., you'll see that there are no version numbers for any of the files on the DRUPAL-5 branch. I now suspect this may be causing the packaging problems.

That makes sense

Hi Tim, I would have guessed I messed up the tag, not the branch, but all I know is something is messed up. I know I said this before, but when 5.x-4 of search_attachments is out, I'll try committing again. I might take you up on your offer. greggles from d.o. also offered to help so I really have no excuse.

blank screen for searches

hi there,

this must be a cool feature but currently i run into blank screens whenever i try to call up the search. also when i logout a blank screen is appearing. the indexing seems to be running well though since the search_* tables in the database are filled.
i already extended the allocated memory in php.ini to 128M but still nothing comes up.

can anyone help?

cheers, peter.

Not necessarily a memory issue

White screen of death can be caused by an error as well. If you have access to your web server's error logs, take a look to see if the module is throwing an error (such as iconv not being installed on older versions). What version of search_attachments are you running?

blank screen

Hi Marc,

thanks for the reply. I just installed version 5.x.4_dev on my installation and everything indexes fine, but a blank screen comes up when I call up the search. The blank screen also appears when I change settings for the module. I do not have access to the server logs, but I will try to get hold of them by contacting the provider. The configuration of the servers seems to be up to date though...
Any other ideas?

Kind regards, Peter.

blank screen issues here too

Hi Marc & Peter,

Yes, I just installed version 5.x.4_dev on my Windows installation and am getting the white screen of death on the module config screen which appears to be caused by the following error (from the log, and as a warning when viewing other content).

Cannot modify header information - headers already sent by (output started at C:\wamp\www\actdev2\sites\all\modules\search_attachments\search_attachments.module:1) in C:\wamp\www\actdev2\includes\common.inc on line 314.

Not a problem with 5.x.3.

Any ideas?

Simon

Hi all, same for me on Unix.

Hi all,

same for me on Unix. I installed 5.x.3 now and this is working fine. I had the same error logs in drupal that you have posted here.
I can't think of a solution myself, help needed!

cheers, peter.

Headers already sent error - more info requested

Hi everyone, sorry to hear about the white screen of death and for the delay in responding. I'm a bit surprised to hear about this error since I can't replicate it on any of my installs, which means it might be PHP version-specific (or something). If you could provide more information about the PHP version you are using, that might help me track down the problem. Any messages from your web server error logs associated with this problem would be super useful as well.

Mark

php version

Hi Marc,
my installation runs on php 5.2.4 with 50M of script-memory. Unfortunately I can't post any server logs and drupal is only presenting what has been mentioned here before.
Thanks for your help.
Cheers, Peter.

Fixed by eliminating strange character

I had this problem too. White screen of death, and in the logs:

Cannot modify header information - headers already sent by (output started at
/****/sites/all/modules/search_attachments/search_attachments.module:1)
in /****/includes/session.inc on line 100.

My friend sorted this: "So that told me that search_attachments.module on line 1 was sending some output, which was wreaking havoc with the rest of the site (once there is html page output, headers can no longer be set and things like sessions break - and you get this typical error message). I looked at that line in the file, and found 3 strange characters before the <?php start tag. I removed them, and the problem seems fixed."
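Three stray bytes before the <?php tag are consistent with a UTF-8 byte order mark (EF BB BF), which some editors add when saving a file. That is an assumption, since the characters were not identified in the report above. A minimal sketch for checking a file's bytes and stripping such a mark, with hypothetical function names:

```python
import codecs


def leading_junk(data, marker=b"<?php"):
    """Return any bytes that come before the opening PHP tag."""
    index = data.find(marker)
    return data[:index] if index > 0 else b""


def strip_utf8_bom(data):
    """Remove a UTF-8 byte order mark from the start of the data, if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```

Read the module file in binary mode (open(path, "rb")) before passing its contents to these functions, and back up the file before writing the stripped bytes back.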

Any comments on this solution?

iconv_get_encoding error

Hi,

I'm trying to install 5_x_3 and am getting the white screen when I try to configure the helper apps. I'm getting this error on my server logs:

PHP Fatal error: Call to undefined function: iconv_get_encoding() in [site location]/sites/all/modules/search_attachments/search_attachments.module on line 644

What would the solution be?

Thanks for your help! Great module by the way...

--SFG

How to get helper file for windows XP

Mark,
Please help, mate. Your module is really great, but I need a helper file to use it, and I need it badly. Please tell me how to get a helper file for Windows XP and what helper path I should enter in your module.

Quick reply will be really appreciated!!

Thanks a lot

Ankit

I can't recommend any

Hi Ankit,

I do not run this module on Windows so can't recommend any helpers. I can tell you that whatever helpers you install can't have any spaces in their paths (see above). Perhaps some search_attachments users who do run it on Windows can supply some suggestions.

Also, the next version of the module, which is still in test phase, will let you use PHP code added to the helper config entries instead of external helpers to extract the text of files. So, if you can extract the text of a .doc or .xls with PHP, you won't need an external helper. I will configure the module by default with whatever PHP helpers I can write myself or collect from other users.

In the meantime, any suggestions for external helpers for Windows anyone? I can compile them into a list and add it to this page.

Mark

Thanks a lot Mark!!

Hi Mark

Thanks a lot for your quick reply!!

So does that mean that with the new version of this module, we just put it in the modules folder on the server, activate it, and start searching, with no need to find any other helper? Am I right?

If so, mate, can you give some idea of when you will be releasing the new version? I will really be waiting for it.

Thanks again, and you are doing a great job for beginners in CMS like me!!

Ankit

Yes, for text files

In the upcoming version, for text files, yes you are right, since it is easy to write PHP code to read the contents of them =8^) Over time I hope to write or have contributed to the project PHP snippets that will work this way on .pdf, .doc, .xls, and other common binary file formats. Until I or someone else writes these snippets, you'll need to configure an external helper for these formats.

I'm gearing up for a third beta release of version 5.x-4, and hope to have it out at the end of this weekend. So, if all goes right, the new version should be tested and stable in a couple of weeks.

Thanks for your update!!

So in your version 5.x-4, will we be able to search only text files, or PDF and Word files also?
Mark, can you suggest some other module or way by which I will be able to search PDF files? If there is no other way, then that's fine, I will wait for your next release.

Thanks again!!

Ankit

You can search for whatever type of files there are helpers for

You will be able to index whatever type of files there are helpers for. The new version will have two ways to configure helpers, the first as it currently does (external helpers) and the second by using PHP code snippets (call them internal helpers if you want) to extract and process text. This means that if someone can write PHP code to extract text from files, they don't need to use an external program. Two ways are better than one!

I'm not aware of any other modules that index attachments and files.

PHP Code snippets will be easily run on windows XP

Hi Mark

Thanks for the reply. As you know, I cannot find a helper for your module that runs on Windows XP. But as you said, with the second way, using PHP snippets, they can easily be used on the Windows platform. Am I right?

I hope you release your next version soon, or that I find an external helper for your current module for Windows XP.

Thanks again for spending your time replying!!

Ankit

Yes, that's correct, PHP

Yes, that's correct, PHP snippets will run as well on Windows as on any other platform.

List of Helper Apps

I wonder if we can get a list of helper apps put together somewhere for this? What made me think of this is that there are probably so many helper apps possible, and it's hard to think of them all. For example, it seems like this ought to have .csv files set up to go by default, since it has .txt files, and they're basically the same thing*.

I only thought of .csv files because I have them currently. The problem I foresee is that the next time somebody uploads a certain kind of file, they (and especially I), won't think to check if search is enabled for their extension. If there was a list of helper apps, that would be very helpful.

Just some thoughts...

*Side note here: it would be nice if you could indicate multiple extensions per helper app, so for /usr/bin/cat you can indicate txt, csv, html, php, etc. without having to set up the same helper app a half dozen times...

Multiple extensions per helper

Nice suggestion, I was thinking about that myself. If that feature doesn't make it into 5.x-4, it will make it into the next one. I've already added so many new features to the latest -dev version that I'm now needing to redo all my prerelease testing, so I'd like to stop adding new stuff for now. It's a must have feature however.

OK, helpers can now handle multiple extensions

I added this feature. It will be in the next version of 5.x-4-dev, which I hope to have out after a bit more testing, maybe by mid week. You can also configure labels to match extensions, if you want to say stuff in your search results like "Comma Separated Values" file instead of "csv". These labels are optional.

Yes, we do need a good list of helpers

I do intend to maintain a list, but I'd love some help finding good helpers. Perhaps if everyone posted a comment with the helpers they are currently using (maybe indicating operating system, helper name, and a URL of where to get it), that would be awesome. After 5.x-4 is out, I'm going to clean up the search_attachments web presence, maybe have a separate page just listing the helpers.

The beginnings of a list

pdftotext --> pdf
cat --> csv, txt, html, php
unrtf --> rtf
ppthtml --> ppt
xls2csv --> xls
antiword --> doc
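The list above, together with the idea of several extensions per helper, can be sketched as a simple lookup table. This is only an illustration and not how the module stores its configuration; the helper names come from the list, while the function names are hypothetical.

```python
# Helper name -> extensions it handles, taken from the list above.
HELPERS = {
    "pdftotext": ["pdf"],
    "cat": ["csv", "txt", "html", "php"],
    "unrtf": ["rtf"],
    "ppthtml": ["ppt"],
    "xls2csv": ["xls"],
    "antiword": ["doc"],
}


def build_extension_index(helpers):
    """Invert the table so each extension points at its helper."""
    index = {}
    for helper, extensions in helpers.items():
        for extension in extensions:
            index[extension.lower()] = helper
    return index


def helper_for(filename, index):
    """Return the helper for a file name, or None if its extension is not configured."""
    _, dot, extension = filename.rpartition(".")
    return index.get(extension.lower()) if dot else None
```

Listing the extensions once per helper avoids configuring the same helper half a dozen times, which is the point of the multiple-extensions feature discussed earlier in this thread.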

Thanks, that's a great start at a list of helpers

I'd like to gather as many as possible, then flesh the list out a bit with some setup info, OS-specific tricks, etc. Thanks again -- keep 'em coming!

Knowing what type of files your users are uploading

Yes, that's a concern. Would a report in search_attachments' admin pages listing extensions for which there was no helper defined be of use? At least you would know what people are uploading.
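A rough sketch of what such a report could compute, assuming the list of uploaded file names and the configured extensions are already available. Both inputs and the function name are hypothetical; this is not code from the module.

```python
import os
from collections import Counter


def unhandled_extensions(filenames, configured_extensions):
    """Count uploaded files per extension for which no helper is configured."""
    configured = {extension.lower() for extension in configured_extensions}
    counts = Counter()
    for name in filenames:
        extension = os.path.splitext(name)[1].lstrip(".").lower()
        if extension and extension not in configured:
            counts[extension] += 1
    # Most frequently uploaded unhandled extensions come first.
    return counts.most_common()
```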

Maybe another way to tackle it is to pick out the lists of allowed file types that various file management modules provide (like upload, for example), and use that information in some way. If the list is stored by the file management module in the Drupal database, search_attachments can retrieve it and display it.

Thanks for the feedback and suggestions,

Mark

That's a good thought, but

That's a good thought, but it probably wouldn't be as helpful as a list. Such a list would by necessity have media files like avi's, jpg's and mp3's, which we probably don't care so much about. I think just having a list of helper apps would probably be the biggest benefit over all. I'll post the ones I've installed, under the comment above, though I don't have many that are that creative.

avi's, jpg's and mp3's

I might include a warning about files that have no helpers, since the logic is already in place for another check. I'll make it unobtrusive so it doesn't irritate admins who don't want it =8^)

As for media files, PHP libraries like http://getid3.sourceforge.net/ can help extract metadata that can be searched, and with search_attachment's new ability to use pure PHP helpers, using this type of library will be pretty easy.

Need help!!

Hi Mark

Now I am changing to the Linux platform so I can try your Search Attachments module. I need some help:

1. The link I see above to download the helper file is not working properly. Can you recommend another site to download a helper from?

2. For a helper, do I need to download the precompiled binaries file?

3. After I download it, how do I use it? I mean, where should I save it and which path should I specify in your module? Or do I need to do something more as well?

Please advise, as I am new to CMS.

Will be waiting for you reply.

Regards

Ankit

Wait before changing operating systems

Thanks for reporting the broken link to catdoc.

I can't provide help setting up a linux machine, but I don't think you should change operating systems just to try this module. The new version, which will be available for testing in a few days, has built-in helpers for text and CSV files. Wait until you can give it a try before moving to a new OS.

New Version will support only text and csv not pdf and word

Hi Mark

Any rough estimate of when the new version will be coming out? And will it only support text and CSV formats, not PDF and Word formats?

Can you advise me which OS to use? I just want to try out your module, and I will deal with the OS later on.

So which OS are you using for this module?

I just need to know which helper path I should enter in your module, and whether to download only the precompiled binaries file from the web site mentioned above.

Thanks for all your help and advice

Regards

Ankit

No, that's not correct

No Ankit, that is not correct. The new version will work with any type of file, as long as there is a helper that can extract searchable text from it. I will probably provide internal PHP helpers for txt, csv, and xml, but a major feature of the module is that you can configure arbitrary helpers. If I ever get time to figure out how to extract text from PDF and .doc files using PHP, I'll include those as well.

I am running my production site on Redhat Enterprise Linux, and am developing the module on Mac OS X. I am also testing installation and configuration, but not a lot of external helpers, on WinXP.

The path that you use on linux depends on the helper app. Your linux machine should have 'cat' installed by default, which is a good helper for txt. To find the path, do 'which cat' at the command line. To use the path to 'cat' in search_attachments, enter the path plus '%file%'. For example, assuming the path is /bin/cat, the helper path should look like this:

/bin/cat %file%

The installation instructions for any helper apps you download and install will probably suggest what the default path for the app is. If you do install something, you should be able to find out what its path is by issuing a 'which' command followed by the name of the application. See 'man which' for more information.
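The same lookup that 'which' performs can be sketched in Python with shutil.which, then combined with the %file% placeholder. The function name is hypothetical, and the result is only a suggestion to paste into the helper path field.

```python
import shutil


def helper_path_for(program):
    """Find the program on the PATH and append the %file% placeholder."""
    path = shutil.which(program)
    if path is None:
        # Not installed, or not on the PATH of the user running this script.
        return None
    return path + " %file%"


if __name__ == "__main__":
    for name in ("cat", "pdftotext", "antiword"):
        print(name, "->", helper_path_for(name))
```

Keep in mind that the web server may run with a different PATH than your login shell, so a helper found this way should still be tested from within the module.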

windoze working fine

@ankit

I have search_attachments working fine in windoze with helpers for PDF, Text, Powerpoint, Excel, Word, and Rich Text Format. Which helpers are you having trouble finding?

Great Module ... Support for large pdf?

Hey,
Thanks for your hard work on this module. I've started using it today and so far it works well. I was wondering, though: for large documents, is there a way to add them to the search index manually? Cron starts timing out on a large (4MB) PDF document.

Any help would be appreciated.

Michael

Just added some features to help with that

Hi Michael, see http://interoperating.info/mark/node/74#comment-4688 -- I've tested it with 14 MB PDFs with no timeout. I'll be releasing the latest version this evening or tomorrow morning, maybe you could give it a try and let me know how it goes.

6.x?

It doesn't seem long ago that people were asking for a 5.0 version. Now I was wondering if you had plans to update it to version 6?

Thanks!

6.x is next

Hi Richard, my plan is to start the port to 6.x as soon as 5.x-4 is released as stable, which should be with