Directory Scanner
1. Introduction
The purpose of the Directory Scanner processor is to analyse a directory structure
in a filesystem and to produce an XML document containing metadata about the files,
such as name and size. You can specify which files and directories to
include in or exclude from the scan. The Directory Scanner can also
optionally retrieve image metadata.
2. Inputs and Outputs
Type | Name | Purpose | Mandatory
Input | config | Configuration | Yes
Output | data | Result XML data | Yes
The Directory Scanner is typically called this way from XPL pipelines:
<p:processor name="oxf:directory-scanner" xmlns:p="http://www.orbeon.com/oxf/pipeline">
    <!-- The configuration can often be inline -->
    <p:input name="config">...</p:input>
    <p:output name="data" id="directory-scan"/>
</p:processor>
3. Configuration
The config input configuration has the following format:
<config>
    <base-directory>file:/</base-directory>
    <include>**/*.x?l</include>
    <include>**/*.xhtml</include>
    <include>**/*.java</include>
    <exclude>example-descriptor.xml</exclude>
    <case-sensitive>false</case-sensitive>
</config>
Element | Purpose | Format | Default
base-directory | Directory under which files and directories are scanned, referred to below as the search directory. | A file: or oxf: URL. The URL may be relative to the location of the containing XPL file. See the note below. | None.
include | Specifies which files are included. | Apache Ant pattern. | None.
exclude | Specifies which files are excluded. | Apache Ant pattern. | None.
case-sensitive | Whether include and exclude patterns are case-sensitive. | true or false. | true
default-excludes | Whether a set of default exclusion rules is loaded automatically. The rules are listed below. | true or false. | false
image-metadata/basic-info | Whether basic image metadata is extracted. | true or false. | false
image-metadata/exif-info | Whether Exif image metadata is extracted. | true or false. | false
image-metadata/iptc-info | Whether IPTC image metadata is extracted. | true or false. | false
Note
The oxf: protocol works only with resource managers that allow
accessing the actual path of the file. These include the Filesystem
and WebApp resource managers.
The default exclusion rules loaded by default-excludes are as follows:
- Miscellaneous typical temporary files
  - **/*~
  - **/#*#
  - **/.#*
  - **/%*%
  - **/._*
- CVS
  - **/CVS
  - **/CVS/**
  - **/.cvsignore
- SCCS
- Visual SourceSafe
- Subversion
- Mac
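As an illustration of how these elements might be combined, the following sketch scans a directory of photos and requests basic and Exif metadata. It rests on assumptions: the base-directory URL is hypothetical, and the nesting of basic-info, exif-info and iptc-info inside an image-metadata element is inferred from the slash notation used in the table above, not confirmed by it.

```xml
<config>
    <!-- Hypothetical search directory; replace with a real file: or oxf: URL -->
    <base-directory>file:/path/to/photos</base-directory>
    <include>**/*.jpg</include>
    <include>**/*.png</include>
    <!-- Match .JPG as well as .jpg -->
    <case-sensitive>false</case-sensitive>
    <!-- Skip temporary files, CVS directories, and so on -->
    <default-excludes>true</default-excludes>
    <!-- Assumed nesting, inferred from the image-metadata/... element names -->
    <image-metadata>
        <basic-info>true</basic-info>
        <exif-info>true</exif-info>
        <iptc-info>false</iptc-info>
    </image-metadata>
</config>
```

Elements left out keep the defaults listed in the table, so omitting default-excludes or any image-metadata child is equivalent to setting it to false.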
4. Output Format
4.1. Basic Output
The output format starts with a root directory element with
name and path attributes. The name attribute
specifies the name of the search directory, e.g. web. The
path attribute specifies an absolute path to that directory.
The root element then contains a hierarchical structure of the directory
and file elements found. For example:
<directory name="address-book" path="c:\Documents and Settings\John Doe\OPS\src\examples\web\examples\address-book">
    <directory name="initialization" path="initialization">
        <file last-modified-ms="1101487772375" last-modified-date="2004-11-26T17:49:32.375" size="1250" path="initialization\init-database.xpl" name="init-database.xpl"/>
        <file last-modified-ms="1101512191718" last-modified-date="2004-11-27T00:36:31.718" size="2410" path="initialization\init-script.xpl" name="init-script.xpl"/>
    </directory>
    <file last-modified-ms="1101488200406" last-modified-date="2004-11-26T17:56:40.406" size="5618" path="model.xpl" name="model.xpl"/>
    <file last-modified-ms="1101484041437" last-modified-date="2004-11-26T16:47:21.437" size="941" path="page-flow.xml" name="page-flow.xml"/>
    <file last-modified-ms="1121104181591" last-modified-date="2005-07-11T19:49:41.591" size="3165" path="view.xsl" name="view.xsl"/>
    <file last-modified-ms="1093118707000" last-modified-date="2004-08-21T22:05:07.000" size="934" path="xforms-model.xml" name="xforms-model.xml"/>
</directory>
directory elements contain basic information about a matched directory:
Name | Value
path | Path to the directory, relative to the parent directory. Includes the current directory name.
name | Local directory name.
Note
The path attribute on the root element is an absolute path from
a filesystem root. The path attributes on child directory
elements are relative to their parent directory element.
file elements contain basic information about a matched file:
Name | Value
last-modified-ms | Timestamp of last modification in milliseconds.
last-modified-date | Timestamp of last modification in XML xs:dateTime format.
size | Size of the file in bytes.
path | Path to the file, relative to the parent directory. Includes the file name.
name | Local file name.
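To show how the attributes above can be consumed downstream, here is a minimal XSLT 1.0 sketch that flattens the scanner output into a list of files with a total size. The output element names files and entry are hypothetical, chosen only for this illustration, and the stylesheet assumes its input is the document produced on the data output.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="/">
        <!-- sum() adds up the size attribute of every file at any depth -->
        <files total-size="{sum(//file/@size)}">
            <xsl:for-each select="//file">
                <!-- Copy the attributes of interest from each file element -->
                <entry name="{@name}" size="{@size}" modified="{@last-modified-date}"/>
            </xsl:for-each>
        </files>
    </xsl:template>
</xsl:stylesheet>
```

Selecting with //file ignores the directory nesting, which is convenient for totals and flat listings. A stylesheet that needs the hierarchy would match directory elements recursively instead.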
4.2. Image Metadata
When the configuration's image-metadata element is specified,
metadata about images is extracted.
Note
Images are identified by reading the beginning of the files. This means
that extracting image metadata is usually more expensive in time than just
producing regular file metadata.
When an image is identified, an image-metadata element is
available under the corresponding file element.
When image-metadata/basic-info is true in the
configuration, a basic-info element is created under
image-metadata:
Element Name | Element Value
content-type | Media type of the file: image/jpeg, image/gif, image/png. Other image/* values may be produced for other image formats.
width | Image width, if found.
height | Image height, if found.
comment | Image comment, if found (JPEG only).
When image-metadata/exif-info is true in the
configuration, zero or more exif-info elements are created under
image-metadata. Each element has a name attribute containing the
category of Exif information. Basic Exif information
has the name Exif. Other names may include Canon
Makernote for a Canon camera, Interoperability, etc. Each
exif-info element contains zero or more param elements,
with the following sub-elements:
Element Name | Element Value
id | The Exif parameter id. For example, 271 denotes the make of the camera.
name | A default English name for the given parameter id, when known, for example Make.
value | The value of the parameter, for example Canon.
This is an example of a file element with image metadata:
<file last-modified-ms="1120343217984" last-modified-date="2005-07-03T00:26:57.984" size="961130" path="image0001.jpg" name="image0001.jpg">
    <image-metadata>
        <basic-info>
            <content-type>image/jpeg</content-type>
            <width>2272</width>
            <height>1704</height>
        </basic-info>
        <exif-info name="Exif">
            <param>
                <id>271</id>
                <name>Make</name>
                <value>Canon</value>
            </param>
            <param>
                <id>272</id>
                <name>Model</name>
                <value>Canon PowerShot S40</value>
            </param>...
        </exif-info>...
    </image-metadata>
</file>
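A structure like this can be queried with ordinary XPath. The XSLT 1.0 sketch below, offered as an illustration only, summarizes each identified image with its dimensions and camera make. The output element names images and image are hypothetical, and the lookup assumes the make is reported under the Exif category with parameter id 271, as in the example.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="/">
        <images>
            <!-- Only files identified as images carry an image-metadata child -->
            <xsl:for-each select="//file[image-metadata]">
                <image name="{@name}"
                       width="{image-metadata/basic-info/width}"
                       height="{image-metadata/basic-info/height}"
                       make="{image-metadata/exif-info[@name = 'Exif']/param[id = '271']/value}"/>
            </xsl:for-each>
        </images>
    </xsl:template>
</xsl:stylesheet>
```

When a value is missing, for instance because exif-info was not enabled, the corresponding attribute is simply empty.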
When image-metadata/iptc-info is true in the
configuration, zero or more iptc-info elements are created under
image-metadata. Each element has an attribute containing the
name of the category of IPTC information. The child elements of
iptc-info are the same as for exif-info.
4.3. Other Metadata
The Directory Scanner does not currently provide metadata for other kinds of
files. The processor could be extended to extract more metadata, both for
additional image formats and for other file formats such as sound files.
5. Ant Patterns
Note
This section of the documentation is reproduced from a section of the Apache Ant
Manual, with minor adjustments.
Patterns are used for the inclusion and exclusion of files. These patterns look
very much like the patterns used in DOS and UNIX:
'*' matches zero or more characters, '?' matches one character.
In general, patterns are considered relative paths, relative to a task dependent
base directory (the dir attribute in the case of <fileset> ). Only
files found below that base directory are considered. So while a pattern like
../foo.java is possible, it will not match anything when applied since
the base directory's parent is never scanned for files.
- *.java matches .java, x.java and FooBar.java, but not FooBar.xml (does not end with .java).
- ?.java matches x.java and A.java, but not .java or xyz.java (neither has exactly one character before .java).
Combinations of *'s and ?'s are allowed.
Matching is done per directory. The first directory in the pattern is matched
against the first directory in the path, then the second directory is matched,
and so on. For example, given the pattern /?abc/*/*.java and the path
/xabc/foobar/test.java, ?abc is first matched with xabc, then * is matched
with foobar, and finally *.java is matched with test.java.
They all match, so the path matches the pattern.
To make things a bit more flexible, we add one extra feature, which makes it
possible to match multiple directory levels. This can be used to match a
complete directory tree, or a file anywhere in the directory tree.
To do this, ** must be used as the name of a directory.
When ** is used as the name of a directory in the pattern, it matches zero
or more directories. For example, /test/** matches all files/directories
under /test/, such as /test/x.java or /test/foo/bar/xyz.html, but not /xyz.xml.
There is one "shorthand": if a pattern ends with / or \, then ** is appended.
For example, mypackage/test/ is interpreted as if it were mypackage/test/**.
**/CVS/*
    Matches all files in CVS directories that can be located anywhere in the directory tree.
    Matches:
    - CVS/Repository
    - org/apache/CVS/Entries
    - org/apache/jakarta/tools/ant/CVS/Entries
    But not:
    - org/apache/CVS/foo/bar/Entries (foo/bar/ part does not match)
org/apache/jakarta/**
    Matches all files in the org/apache/jakarta directory tree.
    Matches:
    - org/apache/jakarta/tools/ant/docs/index.html
    - org/apache/jakarta/test.xml
    But not:
    - org/apache/xyz.java (jakarta/ part is missing)
org/apache/**/CVS/*
    Matches all files in CVS directories that are located anywhere in the directory tree under org/apache.
    Matches:
    - org/apache/CVS/Entries
    - org/apache/jakarta/tools/ant/CVS/Entries
    But not:
    - org/apache/CVS/foo/bar/Entries (foo/bar/ part does not match)
**/test/**
    Matches all files that have a test element in their path, including test as a filename.
When these patterns are used in inclusion and exclusion, you have a powerful
way to select just the files you want.
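As a sketch of how such patterns might be combined in a Directory Scanner configuration, the following config includes Java sources under one tree and excludes CVS and test content. The base-directory URL is hypothetical, and the example assumes the processor honors the trailing-slash shorthand exactly as described above.

```xml
<config>
    <!-- Hypothetical search directory; patterns are relative to it -->
    <base-directory>file:/path/to/project</base-directory>
    <!-- Java files at any depth under org/apache -->
    <include>org/apache/**/*.java</include>
    <!-- Trailing slash shorthand: interpreted as **/CVS/** -->
    <exclude>**/CVS/</exclude>
    <!-- Anything with a test element in its path -->
    <exclude>**/test/**</exclude>
</config>
```

With this configuration, org/apache/jakarta/tools/ant/Main.java would be a candidate for inclusion, while org/apache/jakarta/test/Foo.java would be left out by the second exclude.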