recoll
I have not used this in a long time. Also, some leftover LaTeX markup remains in these notes.
text
It is more convenient to work through recoll -t.
The most convenient form of query was recoll -t -A <query>. For some reason the abstracts now come out empty.
Common options:
-c <configdir> : specify config directory, overriding \verb|$RECOLL_CONFDIR|
-d : also dump file contents
-n [first-]<cnt> : define the result slice. The default value for [first] is 0. Without the option, the default max count is 2000. Use n=0 for no limit
-b : basic. Just output urls, no mime types or titles
-Q : no result lines, just the processed query and result count
-m : dump the whole document meta[] array for each result
-A : output the document abstracts
-S fld : sort by field <fld>
-s stemlang : set stemming language to use (must exist in index…). Use -s "" to turn off stem expansion
-D : sort descending
-i <dbdir> : additional index, several can be given
-F <field name list> : output exactly these fields for each result. The field values are encoded in base64, output in one line and separated by one space character. This is the recommended format for use by other programs. Use a normal query with option -m to see the field names.
recoll -t [ -c <configdir> ] [ -o | -f | -a ] [ -b ] [ -d ] [ -A ] [ -e ] [ -m ] [ -n <[first-]cnt> ] [ -Q ] [ -s <stemming language> ] [ -S <fldname> ] [ -D ] [ -i <additional index directory> ] [ -F <space separated field name list> ] <query string>
recollq -t -P
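The synopsis above can be assembled programmatically. The following is a minimal sketch, not part of Recoll: the function name `build_recoll_cmd` and its keyword arguments are hypothetical, and only flags that appear in the synopsis are used. It returns an argument list, so no shell quoting is needed when it is passed to `subprocess.run`.

```python
import shlex


def build_recoll_cmd(query, config_dir=None, mode=None, basic=False,
                     abstracts=False, max_results=None, sort_field=None,
                     descending=False, fields=None):
    """Return an argv list for `recoll -t` (hypothetical helper).

    mode is None (query language) or one of 'o', 'f', 'a',
    matching the -o / -f / -a flags of the synopsis.
    """
    if mode not in (None, "o", "f", "a"):
        raise ValueError("mode must be None, 'o', 'f' or 'a'")
    argv = ["recoll", "-t"]
    if config_dir is not None:
        argv += ["-c", config_dir]
    if mode is not None:
        argv.append("-" + mode)
    if basic:
        argv.append("-b")
    if abstracts:
        argv.append("-A")
    if max_results is not None:
        argv += ["-n", str(max_results)]
    if sort_field is not None:
        argv += ["-S", sort_field]
        if descending:
            argv.append("-D")
    if fields:
        # -F takes one argument: a space separated field name list.
        argv += ["-F", " ".join(fields)]
    # The query string goes last, after all options.
    argv.append(query)
    return argv


if __name__ == "__main__":
    cmd = build_recoll_cmd("ext:txt *draft*", abstracts=True)
    # shlex.join shows how the same command would be quoted in a shell.
    print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, capture_output=True, text=True)` would run the query, assuming `recoll` is installed and on the `PATH`.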
DESCRIPTION
The recoll -t command will execute the Recoll query specified on the command line and print the results to the standard output. It is primarily designed for diagnostics, or piping the data to some other program.
The basic format and its variations can be useful for command line querying. The -F option should exclusively be used for using the output data in another program, as it is the only one for which output is guaranteed to be fully parseable.
The -c option specifies the configuration directory name, overriding the default or \verb|$RECOLL_CONFDIR|.
The query string is built by concatenating all arguments found at the end of the command line (after the options). It will be interpreted by default as a query language string. Quoting should be used as needed to escape characters that might be interpreted by the shell (ie: wildcards).
If -a is specified, the query string will be interpreted as an all words simple search query.
If -o is specified, the query string will be interpreted as an any word simple search query.
If -f is specified, the query string will be interpreted as a file name simple search query.
-b (basic) can be specified to only print the result urls in the output stream.
If -d is set, the text content of the result files will be dumped to stdout.
If -m is set, the whole metadata array will be dumped for each document.
If -A is set, the document abstracts will be printed.
-S <fieldname> sorts the results according to the specified field. Use -D for descending order.
-n [first-]<cnt> can be used to set the maximum number of results that should be printed. The default is 2000. Use a value of 0 for no limit.
-s <language> selects the word stemming language. The value should match an existing stemming database (as set in the configuration or added with recollindex -s). Use -s "" to turn off stem expansion of the query.
-i <extra dbdir> adds the specified Xapian index to the set used for the query. Can be specified multiple times.
-F <space separated field list> should be used for piping the data to another program. After 2 initial lines showing the actual query and the estimated result counts, it will print one line for each result document. Each line will have exactly the fields requested on the command line. Fields are encoded in base64 and separated by one space character. Empty fields are indicated by consecutive space characters. There is one additional space character at the end of each line.
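The -F line format described above can be decoded with a few lines of Python. This is a sketch under the assumptions stated in that paragraph (two header lines, base64 fields separated by one space, one extra trailing space); the function names are hypothetical, and the decoded values are assumed to be UTF-8.

```python
import base64


def parse_fields_line(line, field_names):
    """Decode one -F result line into a dict keyed by field name."""
    line = line.rstrip("\r\n")
    # Drop the single additional space that ends each result line.
    if line.endswith(" "):
        line = line[:-1]
    # Splitting on exactly one space keeps empty fields as "".
    parts = line.split(" ")
    if len(parts) != len(field_names):
        raise ValueError(
            f"expected {len(field_names)} fields, got {len(parts)}")
    return {
        name: base64.b64decode(part).decode("utf-8", errors="replace")
        for name, part in zip(field_names, parts)
    }


def parse_fields_output(text, field_names):
    """Parse full -F output, skipping the 2 initial header lines."""
    lines = text.splitlines()
    return [parse_fields_line(l, field_names) for l in lines[2:] if l]
```

The field names passed in must be the same ones, in the same order, that were given to -F on the command line.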
-e use url encoding (\%xx) for urls
recoll -t -P (Period) will print the minimum and maximum modification years for documents in the index.
Parameters for the PDF input script
pdfocr : attempt OCR of PDF files with no text content if both tesseract and pdftoppm are installed. The default is off because OCR is so very slow.
This might be worth adopting… cuneiform + ocrodjvu and tesseract-ocr look interesting. As far as I remember, they recognized text fairly well.
\verb|recollindex -c ~/.recoll-texts/ >> recollindexlog 2>&1| is useful when you want several configurations for different text collections, for example.
Links
The recoll query language
It supports AND, OR, and parentheses for grouping. -word excludes a word.
Fields:
\begin{itemize}
\item title, subject or caption are synonyms which specify data to be searched for in the document title or subject.
\item author or from for searching the documents' originators.
\item recipient or to for searching the documents' recipients.
\item keyword for searching the document-specified keywords (few documents actually have any).
\item filename for the document's file name. This is not necessarily set for all documents: internal documents contained inside a compound one (for example an EPUB section) do not inherit the container file name any more; this was replaced by an explicit field (see next). Sub-documents can still have a specific filename, if it is implied by the document format, for example the attachment file name for an email attachment.
\item containerfilename. This is set for all documents, both top-level and contained sub-documents, and is always the name of the filesystem directory entry which contains the data. The terms from this field can only be matched by an explicit field specification (as opposed to terms from filename, which are also indexed as general document content). This avoids getting matches for all the sub-documents when searching for the container file name.
\item ext specifies the file name extension (Ex: ext:html).
\item dir for filtering the results on file location (Ex: dir:/home/me/somedir). -dir also works to find results not in the specified directory (release >= 1.15.8). Tilde expansion will be performed as usual (except for a bug in versions 1.19 to 1.19.11p1). Wildcards will be expanded, but note that there is an important limitation of wildcards in path filters. Relative paths also make sense, for example, dir:share/doc would match either /usr/share/doc or /usr/local/share/doc. Several dir clauses can be specified, both positive and negative. For example the following makes sense: dir:recoll dir:src -dir:utils -dir:common. This would select results which have both recoll and src in the path (in any order), and which have neither utils nor common. You can also use OR conjunctions with dir: clauses. You need to use double-quotes around the path value if it contains space characters.
\item size for filtering the results on file size. Example: size<10000. You can use <, > or = as operators. You can specify a range like the following: size>100 size<1000. The usual k/K, m/M, g/G, t/T can be used as (decimal) multipliers. Ex: size>1k to search for files bigger than 1000 bytes.
\item date for searching or filtering on dates. The syntax for the argument is based on the ISO8601 standard for dates and time intervals. Only dates are supported, no times. The general syntax is 2 elements separated by a / character. Each element can be a date or a period of time. Periods are specified as PnYnMnD. The n numbers are the respective numbers of years, months or days, any of which may be missing. Dates are specified as YYYY-MM-DD. The days and months parts may be missing. If the / is present but an element is missing, the missing element is interpreted as the lowest or highest date in the index. Periods can also be specified with small letters (ie: p2y). Examples:
  \begin{itemize}
  \item 2001-03-01/2002-05-01 the basic syntax for an interval of dates.
  \item 2001-03-01/P1Y2M the same specified with a period.
  \item 2001/ from the beginning of 2001 to the latest date in the index.
  \item 2001 the whole year of 2001.
  \item P2D/ means 2 days ago up to now if there are no documents with dates in the future.
  \item /2003 all documents from 2003 or older.
  \end{itemize}
\item mime or format for specifying the MIME type. These clauses are processed besides the normal Boolean logic of the search. Multiple values will be OR'ed (instead of the normal AND). You can specify types to be excluded, with the usual -, and use wildcards. Example: mime:text/* -mime:text/plain. Specifying an explicit boolean operator before a mime specification is not supported and will produce strange results.
\item type or rclcat for specifying the category (as in text/media/presentation/etc.). The classification of MIME types in categories is defined in the Recoll configuration (mimeconf), and can be modified or extended. The default category names are those which permit filtering results in the main GUI screen. Categories are OR'ed like MIME types above, and can be negated with -.
\end{itemize}
mime, rclcat, size and date criteria always affect the whole query (they are applied as a final filter), even if set with other terms inside parentheses.
mime (or the equivalent rclcat) is the only field with an OR default. You do need to use OR with ext terms for example.
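The size multipliers and the PnYnMnD period syntax described in the field list can be illustrated with two small helpers. This is only a sketch of the argument syntax as described in these notes, not Recoll's own parser; the names `size_to_bytes` and `parse_period` are hypothetical.

```python
import re

# Decimal multipliers, as stated for the size field (k/K, m/M, g/G, t/T).
_MULTIPLIERS = {"k": 10**3, "m": 10**6, "g": 10**9, "t": 10**12}

# P, then optional year, month and day counts; small letters are accepted.
_PERIOD_RE = re.compile(r"[pP](?:(\d+)[yY])?(?:(\d+)[mM])?(?:(\d+)[dD])?")


def size_to_bytes(text):
    """Convert a size value such as '1k' or '10000' to a byte count."""
    suffix = text[-1:].lower()
    if suffix in _MULTIPLIERS:
        return int(text[:-1]) * _MULTIPLIERS[suffix]
    return int(text)


def parse_period(text):
    """Split a PnYnMnD period into a (years, months, days) tuple."""
    match = _PERIOD_RE.fullmatch(text)
    # A bare 'P' with no numbers is rejected as well.
    if match is None or not any(match.groups()):
        raise ValueError(f"not a PnYnMnD period: {text!r}")
    years, months, days = (int(g) if g else 0 for g in match.groups())
    return years, months, days
```

For instance, `size_to_bytes("1k")` gives 1000, matching the "bigger than 1000 bytes" example, and `parse_period("P1Y2M")` gives one year, two months and zero days.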
Some characters are recognized as search modifiers when found immediately after the closing double quote of a phrase, as in "some term"modifierchars. The actual "phrase" can be a single term of course. Supported modifiers:
\begin{itemize}
\item l can be used to turn off stemming (mostly makes sense with p because stemming is off by default for phrases).
\item s can be used to turn off synonym expansion, if a synonyms file is in place (only for Recoll 1.22 and later).
\item o can be used to specify a "slack" for phrase and proximity searches: the number of additional terms that may be found between the specified ones. If o is followed by an integer number, this is the slack, else the default is 10.
\item p can be used to turn the default phrase search into a proximity one (unordered). Example: "order any in"p
\item C will turn on case sensitivity (if the index supports it).
\item D will turn on diacritics sensitivity (if the index supports it).
\item Weight can be specified for a query element by specifying a decimal value at the start of the modifiers. Example: "Important"2.5.
\end{itemize}
To use these modifiers, the term or phrase must be enclosed in double quotes.
python-libxml2 and python-libxslt1 should be added to the 'Recommends', and catdoc should be dropped from the 'Suggests' because it's not used at all any more.
\iffalse
Recoll
Recoll -t …
Runs a recoll query and displays result lines.
Default: will interpret the argument(s) as a xesam query string. The query may be like:
implicit AND, Exclusion, field spec: t1 -t2 title:t3
OR has priority: t1 OR t2 t3 OR t4 means (t1 OR t2) AND (t3 OR t4)
Phrase: "t1 t2" (needs additional quoting on cmd line)
-P : show the date span for all the documents present in the index
[-o|-a|-f] [-q]
-o : emulate the GUI simple search in ANY TERM mode
-a : emulate the GUI simple search in ALL TERMS mode
-f : emulate the GUI simple search in filename mode
-q : is just ignored (compatibility with the recoll GUI command line)
Common options:
-c <configdir> : specify config directory, overriding \verb|$RECOLL_CONFDIR|
-d : also dump file contents
-n [first-]<cnt> : define the result slice. The default value for [first] is 0. Without the option, the default max count is 2000. Use n=0 for no limit
-b : basic. Just output urls, no mime types or titles
-Q : no result lines, just the processed query and result count
-m : dump the whole document meta[] array for each result
-A : output the document abstracts
-S fld : sort by field <fld>
-D : sort descending
-i <dbdir> : additional index, several can be given
-e : use url encoding (\%xx) for urls
-F <field name list> : output exactly these fields for each result. The field values are encoded in base64, output in one line and separated by one space character. This is the recommended format for use by other programs. Use a normal query with option -m to see the field names.
A few links
- http://www.lesbonscomptes.com/recoll/
- http://xapian.org/docs/bindings/perl/Search/Xapian.html
- https://bitbucket.org/medoc/recoll/wiki/FaqsAndHowTos
- http://www.lesbonscomptes.com/recoll/usermanual/rcl.search.tips.html
- http://www.lesbonscomptes.com/recoll/usermanual/rcl.search.commandline.html
- http://www.lesbonscomptes.com/recoll/usermanual/rcl.program.api.html#RCL.PROGRAM.API.PYTHON
You can execute your own version of rclpdf by modifying ~/.recoll/mimeconf:
[index]
application/pdf = exec /path/to/my/own/rclpdf
At this point, recollindex would receive and extract a pdfpages field, but it would not know what to do with it. We are going to tell it to store the value inside the document data record so that it can be displayed in the results, and sorted on. For this we modify the ~/.recoll/fields file:
[stored]
pdfpages=
That's it! After reindexing, you can display pdfpages inside the result list (add a \%(pdfpages) value to the paragraph format), display pdfpages inside the result table (right-click the table header), and sort the results on page count (click the column header). Note that pdfpages has not been defined as searchable (this would not make much sense). To make it searchable, you would have to define a prefix and add it to the [prefixes] section of the fields file:
[prefixes]
pdfpages = XYPDFP
Have a look at the comments inside the fields file for more information.
\fi