Minimalist Online Documentation

This is an area where I keep my personal notes while I'm working on a solution to a problem. These aren't in any order, and aren't necessarily consistent with any implementation of MOD, much less the current version.

File and directory order

OK, wanting to reorganize the tutorial has made me realize that not having any way to specify the order of files in a directory is a big PITA. I absolutely do not want to rely on something outside of the file itself (such as a file in the directory dictating the order, or tags in a document indicating what the next one should be) to determine this, since it's far too kludgy for my taste; no maintainability, and it goes against the philosophy of simplicity that's been carried out pretty well so far. I can see a couple possibilities:

filenames/directories that start with a number are assumed to want to be in an order, and will be converted to an integer and ordered accordingly.
start doing everything in alphanumerical order, period.

In either case, the question remains of whether to mix files/directories. Does a directory "3/" come between files "2.mod" and "4.mod", or after them? Very good points for either case.

All in all, nothing strikes me yet as a particularly good solution. So far, I'm still committed to the ideal that all of the information necessary to express the index layout can be intuitively represented by the filesystem itself.

Here's a shot at a ruleset:

=me_first files are absolutely first
Files and directories that begin with a number are ordered, and occur after =me_first files
Uncategorized files go after ordered files/dirs
=me_last files go before the uncategorized directories
Uncategorized directories are absolutely last
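
A minimal sketch of the ruleset above as a sort key, written in Python purely for illustration (it is not taken from any MOD implementation). The Node class, sort_key() and order_nodes() are hypothetical names. Two details are assumptions the rules don't settle: a numbered file that also carries =me_last is treated as numbered, and items within the same band fall back to plain name order.

```python
import re
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for one entry in a source directory."""
    name: str
    is_dir: bool = False
    me_first: bool = False  # file contains the =me_first tag
    me_last: bool = False   # file contains the =me_last tag


_LEADING_NUMBER = re.compile(r"^(\d+)")


def sort_key(node):
    """Place a node in one of the five bands of the ruleset."""
    if not node.is_dir and node.me_first:
        return (0, 0, node.name)            # =me_first files: absolutely first
    match = _LEADING_NUMBER.match(node.name)
    if match:
        # Numbered files and directories, compared as integers (2 before 10).
        return (1, int(match.group(1)), node.name)
    if not node.is_dir and node.me_last:
        return (3, 0, node.name)            # =me_last files: after plain files
    if not node.is_dir:
        return (2, 0, node.name)            # uncategorized files
    return (4, 0, node.name)                # uncategorized directories: last


def order_nodes(nodes):
    """Return the nodes in index order."""
    return sorted(nodes, key=sort_key)
```

With the entries intro.mod (=me_first), 2.mod, 3/, 10.mod, notes.mod, credits.mod (=me_last) and misc/, this yields exactly that order. Note that it answers the "3/" question by mixing numbered directories in with numbered files.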

An alternate depiction:

+-                               <-- files that contain =me_first
|  numbered files & directories
+-
|  uncategorized files
+-                               <-- files that contain =me_last
|  uncategorized directories
+-

Note that this is identical to the current behavior except for the insertion of the new "numbered files & directories" item. Perhaps a better depiction would be:

        +-
        |   files that contain =me_first
        |                                   <-- numbered files & directories
 files <    uncategorized files
        |
        |   files that contain =me_last
        +- 
  dirs <    uncategorized directories
        +-

This would be an awful lot cleaner if directories couldn't be numbered, since it introduces the inconsistency of moving directories into the previously exclusive file space, but being able to use them as containers in an ordered hierarchy seems just too attractive. Perhaps it should just be encouraged that either one technique (=me_first, =me_last) or the other (numbered files & dirs) be used, since using both can be confusing. An additional tag should probably be added that allows the number to be hidden in the index (=hide_number?).

Here's another thought; is it =me_first and =me_last that are introducing the confusion? Is there some way to do away with them? Doing away with =me_first would be easy, since adding a low number would do the same job. No such luck with =me_last.

One more problem; none of this addresses the very reasonable desire to want things in alphabetical order.

OK, new tack. How about a config file option? (default_sort = name|type|both ?)

name: files and dirs mixed and ordered alphanumerically by name. =me_first and =me_last push to absolute top and bottom.
type: files first, then dirs. default. =me_first and =me_last push to top or bottom of type.
both: alphanumeric within type. =me_first and =me_last push to top or bottom of type.
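
A rough sketch of how the three default_sort values could map onto sort keys, again in Python for illustration only. sort_nodes() and the (name, is_dir, pin) tuple layout are hypothetical; pin is -1 for =me_first, 1 for =me_last and 0 otherwise. "Alphanumerically" is assumed here to mean a case-insensitive string comparison, and "type" is assumed to keep whatever order the nodes arrived in within each group.

```python
def sort_nodes(nodes, default_sort="type"):
    """Order (name, is_dir, pin) tuples according to default_sort."""
    keys = {
        # Files and dirs mixed; pins push to the absolute top and bottom.
        "name": lambda n: (n[2], n[0].lower()),
        # Files first, then dirs; pins act within each type. sorted() is
        # stable, so unpinned items keep their incoming order.
        "type": lambda n: (n[1], n[2]),
        # Like "type", but alphanumeric within each type as well.
        "both": lambda n: (n[1], n[2], n[0].lower()),
    }
    if default_sort not in keys:
        raise ValueError("default_sort must be name, type or both")
    return sorted(nodes, key=keys[default_sort])
```

The only difference between "type" and "both" is the trailing name component of the key.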

This is actually sounding like the best so far, especially if there is some clean way to control it by directory... how about the existence of a file "sort.name"? This would also allow the default behavior to remain unchanged.

Keep fleshing this out; there's some way of working this to allow me_first/last control for directories as well; the "type" sort method probably isn't necessary, since it's covered by "both". There's a better name than default_sort that will make me_first/last behaviour obvious, especially if they're changed to something like the tag "=this_file_first", and the file "=this_dir_first". So, the so-called default_sort option really wouldn't control sort method at all (it'd all be alphanumeric), it would only control the "mixture" of files & dirs. What to call it? Files called =mingled_dirs & =unmingled_dirs?

First step is to implement sorted unmingled_dirs as the new default behavior, and to change me_first/last to this_file_first/last. From there we can look at adding control for mingled dirs and this_folder_first/last.

The mingled_dirs option might need to be in the config file (rather than being able to control it at the directory level), so that we know how to sort nodes before we start the find(); otherwise we could be partway through a directory and suddenly discover that we're supposed to be sorting things in another way.

Reexamination of naming

File tags

=topic, =subtopic, =subtopic2 ...
=title
=description
=keywords
=ignore, =end_ignore
implies that formatting won't be performed either. How about =ignore_tags, =end_ignore_tags
=no_format, =end_no_format
don't like the reverse logic. how about =verbatim, =end_verbatim? although that implies no tag processing...
=use_topic_index
don't like it. it feels very ambiguous.
=me_first, =me_last
in the face of being able to do this with directories, should probably change to =this_file_first, =this_file_last.

Template variables

$Title$
$Description$
$Keywords$
$Index$
$Topic_Index$
still don't like it, very ambiguous.
$Body$
$MOD_Source$
$Last_Modified$
$Next$, $Prev$

Config file variables

destination
web_site_url
folder_icon_url, file_icon_url, bullet_icon_url, here_icon_url
default_template
status_dir
this one kind of bothers me, not the name, but whether we should be specifying something else instead, such as the mod dir...

File extensions

.mod
.txt
.tmpl

File and directory names

.mod/
.mod/config
.mod/default.tmpl
still torn as to whether this should go at the root of the source tree instead. I guess it shouldn't, since you could very possibly have a file default.mod.
.mod/status/
.mod/status/status.
=this_folder_first, =this_folder_last
mixed feelings on this one, since it introduces the concept of control files that clutter the source and have nothing directly to do with content. It should be =this_folder_first and =this_folder_last (rather than dir) to accommodate conventions on other platforms.
=mingle_folders, =unmingle_folders
again, mixed feelings on the necessity of introducing control files. I guess the names themselves are reasonable, though.

Directory links?

Should directories in the index be links? Only if there's an index.html?

My initial reaction was that yes, they should be a link if there's an index.html file in that directory. After all, it would be a valid URL.

I decided against it when I thought about what the "you are here" icon would do: you don't want it to point to the directory after you've clicked on it, since that wouldn't really be the file that it's displaying. Since it's going to point to some other file, this means that in terms of UI design, someone clicks on a target, and visually they are taken somewhere _other_ than where they clicked -- that arrow is going to jump down the list, potentially several items away if index.mod doesn't contain a =me_first tag.

I didn't really find that disobedience palatable in terms of user interface design, and I couldn't come up with an alternate behavior for that arrow. So, directories are not links. Perhaps in future versions the index will be expandable and collapsible in a way that does a good job of clearly conveying the change in state. That would be an acceptable directory link behavior.

Source & dest file organization

There's a certain ugliness to the fact that the source tree is all human-maintained, but that the destination tree will probably contain a mixture of auto-generated html and human-maintained images, html and data files. This will require some more thought, as I'm not sure if this is solely a function of the webmaster's organization, or whether any design changes could be made to mod2html to facilitate good separation of manual and automatic content. In my mind, the cleanest organization would be:

    www
     +-images                 $srcdir  = /www/src;
     +-data                   $destdir = /www/html;
     +-src
     |  +-dir1
     |  +-dir2
     |     +-dir3
     +-html
        +-dir1
        +-dir2
           +-dir3

Or perhaps this modification (making sure none of the directories directly under src are called "images" or "data"), which gets rid of the redundant "html/" in the URLs:

  |
  +-src
  |  +-dir1                   $srcdir  = /src;
  |  +-dir2                   $destdir = /www;
  |     +-dir3
  +-www
     +-images
     +-data
     +-dir1
     +-dir2
        +-dir3

This means that 100% of the content in images, data and src is human-maintained, and 100% of the data in html is generated, so you could blow the html directory away at any time. It's also possible to simply blend the two trees, where $srcdir == $destdir. This creates a cleaner presentation from the end viewer's perspective, and allows more flexibility in web site design, but it requires much more attention to maintain, since mistakes (such as accidentally overwriting a file, or making changes to an automatically generated file which are subsequently lost) are far more likely.

Partial update strategy

Just some random notes I took while figuring out how to implement and use mod.status.

Not going to use md5sum, since rebuilding a file really isn't that expensive, but doing the sums could be. In addition, being able to touch a file to have it rebuilt is really the best interface, and doesn't have an equivalent with md5sum.

keep 2 hashes (old & new) that contain filename => mtime

need to recreate all dest files if any are true:

any of the files have changed name/position
we're creating a tar file

allowed to skip a source file if all are true:

not creating a tar file
no source files have changed name/position
this source file has not been modified
the proper dest file for this source file exists

we're allowed to delete a destination file if all are true:

not creating a tar file
it is in the %old_status, but has no corresponding entry in %new_status
the user has said it's ok

don't update the status if you're just creating a tar file
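
The conditions above can be sketched as a few small predicates over the two filename => mtime hashes. This is Python for illustration only; the function names are hypothetical. One assumption is labeled in the code: a change of name/position is detected simply by comparing the sets of filenames in the old and new hashes.

```python
def layout_changed(old_status, new_status):
    """Assumption: name/position changes show up as differing filename sets."""
    return set(old_status) != set(new_status)


def must_rebuild_all(old_status, new_status, making_tar):
    """Recreate every dest file if ANY of the conditions is true."""
    return making_tar or layout_changed(old_status, new_status)


def may_skip(src, old_status, new_status, making_tar, dest_exists):
    """A source file may be skipped only if ALL of the conditions are true."""
    if must_rebuild_all(old_status, new_status, making_tar):
        return False
    unmodified = src in new_status and old_status[src] == new_status[src]
    return unmodified and dest_exists


def may_delete(src, old_status, new_status, making_tar, user_ok):
    """The dest file for src may be deleted only if ALL conditions are true."""
    return (not making_tar
            and src in old_status
            and src not in new_status
            and user_ok)


def should_update_status(making_tar):
    """The status is left alone when we're just creating a tar file."""
    return not making_tar
```

Because mtimes are compared for equality rather than "newer than", touching a file is enough to force its rebuild, which is the interface wanted above.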

Update to partial updates

The desire to maintain multiple configurations for the same source illustrated that a single mod.status file was going to be insufficient for partial updates. Here are some additional notes on partial updates that detail the status file's redesign and reimplementation.

mod.status definitely needs to keep the names of the mod files themselves, since we need to know the difference between a modified file and a newly created file. Timestamp is less important, since the timestamp of the status file itself could represent the time of last update.

The problem arises when we have one source with multiple config files; each destination should have its own set of status info to be done properly. Destinations that appear on the command line don't use any status info, so they're not a problem. The status info kept for each dest should include template and config file info as well, since these could change name or mtime, and the tree should be subsequently recreated.

So, should all status info continue to be maintained in a single file with multiple sections for each destination (unlikely, too cumbersome to program), or should we have separate status files for each dest? The problems with the latter are:

Where do we store all these status files without cluttering the source tree?
How do we generate unique usable names for them? I'd prefer not to have the user have to name files that they don't maintain in the config file.

One solution would be to have a ".mod/" directory at the source tree that contained the files "status.12NKz5XM5JeKI" and such, where the second field is a hash of the destination (crypt, using a common salt). This file could be annotated with comments to make it a bit friendlier for cleanup. The name of the status file could also be included in the debug output.
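
A small sketch of deriving the status file name from the destination. It is in Python for illustration and swaps in an MD5 hex digest in place of crypt with a common salt, simply because it is easy to show; status_file_path() is a hypothetical name. Normalizing the destination first (so that "/www/html" and "/www/html/" map to the same file) is an added assumption, not something decided above.

```python
import hashlib
import os


def status_file_path(srcdir, destination):
    """Return the path of the status file for one destination."""
    # Assumption: normalize so trivially different spellings share one file.
    normalized = os.path.normpath(destination)
    # Swapped-in technique: MD5 hex digest instead of crypt() with a salt.
    digest = hashlib.md5(normalized.encode("utf-8")).hexdigest()
    return os.path.join(srcdir, ".mod", "status." + digest)
```

The same destination always produces the same name, so the file can be found again on the next run without the user ever having to name it.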

Could we detect "stale" status files? A stale status file occurs when a config file changes its destination. If status files are "keyed" on destination, we wouldn't have any way of knowing what the previous destination was. This is very unfortunate, since I really don't like the idea of garbage left behind.

In summary:

One status file per destination. It contains filenames only, of all the source files (mod and txt), the config file, and the templates. Its mtime represents the last update time for that destination.

Here are some possible scenarios for a given source tree and how we'll handle them:

one config file, one destination: This was the original setup; easy to handle -- one status file. Used to be called "mod.status".
one config file, multiple destinations: There currently isn't any way to specify multiple destinations in a single config file. This means that the other destinations must be specified on the command line. This is also easy since command-line destinations don't use status, so this still only uses one status file per source tree.
multiple config files, one destination: This is an odd scenario, but possible. Especially the single case of testing a new config using the command line. Although useful scenarios are possible, for now we'll apply the rule of one status file per destination. This means that there's no way to do partial updates if you're switching config files around, command line or otherwise.
multiple config files, multiple destinations: This is a very likely scenario, the one that prompted this discussion, and the renaming of the "mod.status" file; the common scenario is wanting to maintain two versions of the same content. Each destination has its own status file so that updates to one destination don't affect the status of the other destination. The names of the status files contain a hash of the destination so that they'll be unique and predictable (although unfortunately not very readable). For this reason we'll keep them in a subdirectory.

Random thoughts

Should verbose warn about files missing =description or =meta tags? Maybe this is an example for a "-pedantic" command line option?

Control file reorganization

The introduction of support for multiple config files introduced multiple status files and probably multiple templates as well. Unfortunately, this is really starting to clutter the root of the srcdir. I'm currently thinking about moving everything into a $srcdir/.mod directory:

	$srcdir/
	  |
	  +-.mod/
	  |    |
	  |    +-config
	  |    +-default.tmpl
	  |    +-status/
	  |        |
	  |        +-mod.status.49140baa019934cf8c961b2ce886ae38
	  |        +-mod.status.34d9583cd49c584ef30958340b580945
	  +-index.mod
	  +-cray/
	  |   |
	  |   +- index.mod
	 ...

This would keep the srcdir nice and clean, and still hide the unfriendly status files. The default config file location would go from "mod.conf" to ".mod/config" (in fact, there's no reason why it couldn't look for both, and just prefer ".mod/config"). Multiple config files and templates could be kept in the .mod directory as well. In fact, there would be nothing forcing anyone to use this setup (except for the status files) since the config file can be specified on the command line, and the template can be specified in the config file.

To simplify these changes in the future, I should probably have a hash devoted to pathnames.
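
As a sketch of what such a pathname hash might look like, here is a Python dictionary built from the layout above, plus the "look for both, prefer .mod/config" lookup. build_paths(), find_config() and the key names are hypothetical and for illustration only.

```python
import os


def build_paths(srcdir):
    """Collect every control-file location in one place."""
    mod_dir = os.path.join(srcdir, ".mod")
    return {
        "srcdir": srcdir,
        "mod_dir": mod_dir,
        "config": os.path.join(mod_dir, "config"),
        "legacy_config": os.path.join(srcdir, "mod.conf"),
        "default_template": os.path.join(mod_dir, "default.tmpl"),
        "status_dir": os.path.join(mod_dir, "status"),
    }


def find_config(paths):
    """Prefer .mod/config, fall back to mod.conf, else return None."""
    for key in ("config", "legacy_config"):
        if os.path.isfile(paths[key]):
            return paths[key]
    return None
```

Moving the control files again later would then only mean editing build_paths().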

Templates

On the subject of reorganization, I think it could be really helpful to have more flexibility for specifying templates. How about the following set of rules for determining which template to apply to a given file "filename.mod" or "filename.txt":

if "filename.tmpl" exists for the given source file, it is used as the template.
if "dirname.tmpl" exists for a given dirname, that template becomes the default template for the tree below that directory.
if a source file exists that has the same name as "dirname.tmpl", the template will be used for that file as well.
otherwise, the default template specified in the config file will be applied.

It would probably be easiest to apply these rules during the find() operation, and store each node's template in the node attributes. This way the modifications to second_pass() should be relatively minor. Given a complete node, a subroutine shouldn't have too much trouble figuring out what its template should be.
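
A minimal sketch of such a subroutine, in Python for illustration. resolve_template() is a hypothetical name, and two points are assumptions: "dirname.tmpl" is taken to sit next to the directory it governs (in the parent directory), and the nearest enclosing directory with a template wins. Under that reading, the third rule falls out of the first, since "dirname.mod" sits beside "dirname.tmpl".

```python
import os


def resolve_template(source_path, srcdir, default_template):
    """Pick the template for one source file (.mod or .txt)."""
    source_path = os.path.abspath(source_path)
    srcdir = os.path.abspath(srcdir)

    # Rule 1 (and rule 3): filename.tmpl beside filename.mod.
    own = os.path.splitext(source_path)[0] + ".tmpl"
    if os.path.isfile(own):
        return own

    # Rule 2: walk up toward srcdir looking for dirname.tmpl beside dirname/.
    directory = os.path.dirname(source_path)
    while directory != srcdir:
        candidate = directory + ".tmpl"
        if os.path.isfile(candidate):
            return candidate
        parent = os.path.dirname(directory)
        if parent == directory:  # hit the filesystem root: file is outside srcdir
            break
        directory = parent

    # Rule 4: fall back to the default template from the config file.
    return default_template
```

Calling this once per node during the directory walk and storing the result as a node attribute would keep the later pass unaware of the lookup rules.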

Guidance

I read the following quote while reading Yvon Chouinard's essay The Next Hundred Years (which is an incredibly worthwhile read, go, read it now!). I thought it represented my goals for this project as well.

Have you ever thought, not only about the airplane but about whatever man builds, that all of man's industrial efforts, all his computations and calculations, all the nights spent working over draughts and blueprints, invariably culminate in the production of a thing whose sole and guiding principle is the ultimate principle of simplicity? It is as if there were a natural law which ordained that to achieve this end, to refine the curve of a piece of furniture, or a ship's keel, or the fuselage of an airplane, until gradually it partakes of the elementary purity of the curve of the human breast or shoulder, there must be experimentation of several generations of craftsmen.
In anything at all, perfection is finally attained not when there is no longer anything to add, but when there is no longer anything to take away, when a body has been stripped down to its nakedness.
(Antoine de St.-Exupéry, Wind, Sand and Stars. New York: Harcourt Brace Jovanovich, 1968, 41-42)

On the same note, I recently came across Almost Free Text, which seems to have similar goals to MOD, but only for one document at a time. For that purpose, it accomplishes its goals far better than MOD does (no tags at all!). Good role model.