<HTML>
<HEAD><TITLE>APR Canonical Filenames</TITLE></HEAD>
<BODY>
<h1>APR Canonical Filename</h1>

<h2>Requirements</h2>

<p>APR porters need to address the underlying discrepancies between
file systems. To achieve a reasonable degree of security, the
program depending upon APR needs to know that two paths may be
compared, and that a mismatch is guaranteed to reflect that the
two paths do not refer to the same resource.</p>

<p>The first discrepancy is in volume roots. Unix and pure derivatives
have only one root path, "/". Win32 and OS2 share root paths of
the form "D:/", where D: is the volume designation. However, this can
also be specified as "//./D:/", indicating the D: volume of
'this' machine. Win32 and OS2 may also employ a UNC root path
of the form "//server/share/", where share is a share-point of the
specified network server. Finally, NetWare root paths are of the
form "server/volume:/", or the simpler "volume:/" syntax for 'this'
machine. All these non-Unix file systems accept volume:path,
without a slash following the colon, as a path relative to the
current working directory, which APR will treat as ambiguous, that
is, neither an absolute nor a relative path per se.</p>

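<p>The root forms listed above can be told apart with a small classifier.
The following is only a sketch in Python, not APR code: the function name
<code>split_root</code>, the <code>flavor</code> strings, and the choice to
report "//./D:/" as "D:/" are all assumptions made for illustration, and a
single leading "/" on Win32 is not covered by the text and is not handled
specially here.</p>

```python
import re


def split_root(path, flavor):
    """Return (root, kind, rest) for a slash-separated path.

    kind is 'absolute', 'relative' or 'ambiguous'. Hypothetical helper.
    """
    if flavor == "unix":
        if path.startswith("/"):
            return "/", "absolute", path.lstrip("/")
        return "", "relative", path

    if flavor == "win32":  # the text describes OS2 the same way
        # "//./D:/" names volume D: on 'this' machine (assumed same as "D:/").
        m = re.match(r"//\./([A-Za-z]:)/", path)
        if m:
            return m.group(1) + "/", "absolute", path[m.end():]
        # "//server/share/" is a complete UNC root.
        m = re.match(r"//[^/]+/[^/]+/", path)
        if m:
            return m.group(0), "absolute", path[m.end():]
        if path.startswith("//"):
            raise ValueError("incomplete UNC root: %r" % path)
        volume = r"[A-Za-z]:"
    elif flavor == "netware":
        # "server/volume:" or plain "volume:"
        volume = r"(?:[^/:]+/)?[^/:]+:"
    else:
        raise ValueError("unknown flavor: %r" % flavor)

    m = re.match(volume, path)
    if not m:
        return "", "relative", path
    if path[m.end():m.end() + 1] == "/":
        # volume followed by a slash: a true root
        return m.group(0) + "/", "absolute", path[m.end() + 1:]
    # "volume:path" is neither absolute nor relative per se
    return m.group(0), "ambiguous", path[m.end():]
```

<p>For instance, <code>split_root("D:tmp", "win32")</code> reports the
ambiguous kind, while <code>split_root("D:/tmp", "win32")</code> reports an
absolute path rooted at "D:/".</p>
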
<p>The second discrepancy is in the meaning of the 'this' directory.
In general, 'this' must be eliminated from the path wherever it occurs.
The syntax "path/./" and "path/" are both aliases to path, and an empty
segment, as in "path//", is normally discarded in the same way. However,
this isn't file system independent, since the double slash "//" has
a special meaning on OS2 and Win32 at the start of the path name,
and is invalid on those platforms before the "//server/share/" UNC
root path is completed. Finally, as noted above, "//./volume/" is
legal root syntax on WinNT, and perhaps others.</p>

<p>The third discrepancy is in the context of the 'parent' directory.
When "parent/path/.." occurs, the path must be unwound to "parent".
It's also critical to simply truncate leading "/../" paths to "/",
since the parent of the root is root. This gets tricky on the
Win32 and OS2 platforms, since the ".." element is invalid before
the "//server/share/" is complete, and the "//server/share/../"
sequence is equivalent to the complete UNC root "//server/share/". In relative
paths, leading ".." elements are significant until they are merged
with an absolute path. The relative form must retain ".."
segments only as leading segments, to be resolved once merged with another
relative or an absolute path.</p>

<p>The fourth discrepancy occurs with acceptance of alternate character
codes for the same element. Path separators are not retained within
the APR canonical forms. The OS filesystem and APR (slashed) forms
can both be returned as strings, to be used in the proper context.
OS2, Win32 and Netware all accept slashes and backslashes as the
same path separator symbol, although Unix strictly accepts slashes.
While the APR form of the name strictly uses slashes, always consider
that there could be a platform that actually accepts slashes as a
character within a segment name.</p>

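<p>A minimal Python sketch of the separator handling follows. The function
names and <code>flavor</code> strings are hypothetical, and the assumption
that the native OS form uses backslashes on the non-Unix platforms is made
only for the sake of the example.</p>

```python
# Platforms assumed to treat '\' and '/' as the same separator.
BACKSLASH_FLAVORS = ("win32", "os2", "netware")


def to_apr_form(path, flavor):
    """Rewrite every accepted separator as '/', the APR (slashed) form."""
    if flavor in BACKSLASH_FLAVORS:
        return path.replace("\\", "/")
    # On Unix a backslash is an ordinary character within a segment name.
    return path


def to_os_form(apr_path, flavor):
    """Rewrite an APR (slashed) path with the assumed native separator."""
    if flavor in BACKSLASH_FLAVORS:
        return apr_path.replace("/", "\\")
    return apr_path
```

<p>Note that the Unix branch leaves backslashes untouched, so
<code>to_apr_form("a\\b", "unix")</code> still names a single segment.</p>
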
<p>The fifth and worst discrepancy plagues Win32, OS2, Netware, and some
filesystems mounted in Unix. Case insensitivity can permit the same
file to slip through in both its proper case and alternate cases.
Simply changing the case is insufficient for any character set beyond
ASCII, since various dialectal forms of characters suffer from one-to-many
or many-to-one translations. An example would be u-umlaut, which
might be accepted as a single character u-umlaut, a two character
sequence u and the zero-width umlaut, the upper case form of the same,
or perhaps even a capital U alone. This can be handled in different
ways depending on the purposes of the APR based program, but the one
requirement is that the path must be absolute in order to resolve these
ambiguities. Methods employed include comparison of device and inode
file uniquifiers, which is a fairly fast operation, or querying the OS
for the true form of the name, which can be much slower. Only the
acknowledgement of the file names by the OS can validate the equality
of two different cases of the same filename.</p>

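<p>The u-umlaut example can be made concrete with Python's
<code>unicodedata</code> module. This sketch only illustrates why a plain
case change is insufficient; the helper names are hypothetical, and the
folding shown is not a claim about what any particular file system does.
As the text says, only the OS can confirm whether two names are equal.</p>

```python
import unicodedata

PRECOMPOSED = "\u00fc"  # u-umlaut as a single character
DECOMPOSED = "u\u0308"  # 'u' followed by the zero-width combining umlaut


def naive_key(name):
    """Case change only: misses the two-character spelling."""
    return name.lower()


def folded_key(name):
    """Compose the characters first, then fold case (illustrative only)."""
    composed = unicodedata.normalize("NFC", name)
    return unicodedata.normalize("NFC", composed.casefold())
```

<p>With these helpers the single-character, two-character, and upper case
spellings all share one <code>folded_key</code>, while
<code>naive_key</code> keeps the first two apart. A file system that also
accepts a capital U alone cannot be modelled this way at all.</p>
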
<p>The sixth discrepancy, illegal or insignificant characters, is especially
significant in non-Unix file systems. Trailing periods are accepted
but never stored, therefore trailing periods must be ignored for any
form of comparison. And all OS's have certain expectations of what
characters are illegal (or undesirable due to confusion).</p>

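<p>As a small illustration of the trailing-period rule, a comparison key for
one segment might be built as below. The helper name is hypothetical, and
keeping "." and ".." intact is an assumption of this sketch, since those
segments are navigation elements rather than stored names.</p>

```python
def comparison_segment(segment):
    """Drop trailing periods from one segment for comparison purposes.

    Intended for the non-Unix file systems described in the text.
    """
    if segment in (".", ".."):
        return segment
    stripped = segment.rstrip(".")
    # A segment made only of periods is left alone rather than emptied.
    return stripped or segment
```
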
<p>A final warning: canonical functions don't transform or resolve case
or character ambiguity issues until the path is resolved into an absolute
path. The relative canonical path, while useful for URL
or similar identifiers, cannot be used for testing or comparison of file
system objects.</p>

<hr>

<h2>Canonical API</h2>

<p>Functions to manipulate the apr_canon_file_t (an opaque type) include:</p>

<ul>
<li>Create canon_file_t (from char* path and canon_file_t parent path)
<li>Merge canon_file_t (from path and parent, both canon_file_t)
<li>Get char* path of all or some segments
<li>Get path flags of IsRelative, IsVirtualRoot, and IsAbsolute
<li>Compare two canon_file_t structures for file equality
</ul>

<p>The path is corrected to the file system case only if it is in absolute
form. The apr_canon_file_t should be preserved as long as possible and
used as the parent to create child entries, to reduce the number of expensive
stat and case canonicalization calls to the OS.</p>

<p>The comparison operation allows APR to postpone correction
of case by simply relying upon the device and inode for equivalence. The
stat implementation can establish that two files are the same even when their
strings are not equivalent, and eliminates the need for the operating
system to return the proper form of the name.</p>

<p>In any case, returning the char* path, with a flag to request the proper
case, forces the OS calls to resolve the true names of each segment. Where
there is a penalty for this operation and the stat device and inode test
is faster, case correction is postponed until the char* result is requested.
On platforms that identify the inode, device, or proper name interchangeably
with no penalties, this may occur when the name is initially processed.</p>

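<p>The device and inode comparison described above can be sketched in Python
with <code>os.stat</code>. This is an illustration of the idea, not the APR
comparison function; <code>same_file</code> is a hypothetical name, and the
temporary files exist only to exercise it. Python's standard library offers
<code>os.path.samefile</code> for the same purpose.</p>

```python
import os
import tempfile


def same_file(path_a, path_b):
    """Compare two existing paths by device and inode, not by string."""
    a = os.stat(path_a)
    b = os.stat(path_b)
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)


# Two different strings for one file, and one genuinely different file.
with tempfile.TemporaryDirectory() as directory:
    first = os.path.join(directory, "a.txt")
    other = os.path.join(directory, "b.txt")
    for name in (first, other):
        open(name, "w").close()
    alias = os.path.join(directory, ".", "a.txt")
    alias_matches = same_file(first, alias)
    other_matches = same_file(first, other)
```

<p>The strings <code>first</code> and <code>alias</code> differ, yet the stat
test identifies them as one file without asking the OS for the proper form
of the name.</p>
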
<hr>

<h2>Unix Example</h2>

<p>First the simplest case:</p>

<pre>
Parse Canonical Name
accepts parent path as canonical_t
        this path as string

Split this path Segments on '/'

For each of this path Segments
  If first Segment
    If this Segment is Empty ([nothing]/)
      Append this Root Segment (don't merge)
      Continue to next Segment
    Else is relative
      Append parent Segments (to merge)
      Continue with this Segment
  If Segment is '.' or empty (2 slashes)
    Discard this Segment
    Continue with next Segment
  If Segment is '..'
    If no previous Segment or previous Segment is '..'
      Append this Segment
      Continue with next Segment
    If previous Segment and previous is not Root Segment
      Discard previous Segment
    Discard this Segment
    Continue with next Segment
  Append this Relative Segment
  Continue with next Segment
</pre>
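
<p>The pseudocode above translates fairly directly into runnable code. The
following Python sketch is one possible reading of it, not the APR
implementation: the names <code>canonicalize</code> and <code>render</code>,
the use of a list of segments in place of the canonical_t, the "/" root
marker, and the early return for an empty path are all assumptions of this
example.</p>

```python
ROOT = "/"  # marker segment; a real segment can never contain '/'


def canonicalize(path, parent=()):
    """Return the canonical segment list for a Unix path.

    parent is the canonical segment list of the parent path.
    """
    if not path:  # assumption: an empty path simply names the parent
        return list(parent)
    segments = []
    for index, segment in enumerate(path.split("/")):
        if index == 0:
            if segment == "":
                # Leading slash: absolute path, so don't merge the parent.
                segments.append(ROOT)
                continue
            # Relative path: merge onto the parent, then process the segment.
            segments.extend(parent)
        if segment in (".", ""):
            continue  # 'this' directory or doubled slash: discard
        if segment == "..":
            if not segments or segments[-1] == "..":
                segments.append(segment)  # leading '..' stays significant
                continue
            if segments[-1] != ROOT:
                segments.pop()  # unwind one level
            continue  # at the root, '..' is simply discarded
        segments.append(segment)
    return segments


def render(segments):
    """Join canonical segments back into a slashed string."""
    if segments and segments[0] == ROOT:
        return ROOT + "/".join(segments[1:])
    return "/".join(segments)
```

<p>For example, <code>render(canonicalize("/usr//lib/./../bin/"))</code>
gives "/usr/bin", and a relative path such as "../../x/../y" keeps its two
leading ".." segments until it is merged with a parent.</p>
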

</BODY>
</HTML>