author | Ralph Amissah <ralph.amissah@gmail.com> | 2023-07-04 23:52:33 -0400 |
---|---|---|
committer | Ralph Amissah <ralph.amissah@gmail.com> | 2023-07-08 15:57:51 -0400 |
commit | 30b5b153afb272e3a7c5c6e76775cffecaea0a78 (patch) | |
tree | a5fe5e6b0857af06051da6ae08d708fddd678151 /markup/pod/sisu-manual/media/text/en | |
parent | homepage updates, re-read (diff) |
add sisu spine description to sisu_markup, review
Diffstat (limited to 'markup/pod/sisu-manual/media/text/en')
-rw-r--r-- | markup/pod/sisu-manual/media/text/en/sisu_markup.sst | 241 |
1 files changed, 241 insertions, 0 deletions
diff --git a/markup/pod/sisu-manual/media/text/en/sisu_markup.sst b/markup/pod/sisu-manual/media/text/en/sisu_markup.sst index ae87b76..f9a27d3 100644 --- a/markup/pod/sisu-manual/media/text/en/sisu_markup.sst +++ b/markup/pod/sisu-manual/media/text/en/sisu_markup.sst @@ -50,6 +50,247 @@ make: :A~ @title-author-date +:B~ SiSU Description + +1~ SiSU Description + +SiSU is an object-centric, lightweight markup based, document structuring, +parser, publishing and search tool for document collections. It is command line +oriented and generates static content that is currently made searchable at an +object level through an SQL database. Markup helps define (delineate) objects +(primarily various types of text block) which are tracked in sequence, +substantive objects being numbered sequentially by the program for object +citation. + +!_ Summary. +An object is a unit of text within a document, the most common being a paragraph. +Objects include individual headings, paragraphs, tables, grouped text of various +types such as code blocks and, within poems, verse. Objects have properties and +attributes; of particular significance are headings and their levels, which +provide document structure. A heading is an object with a hierarchical value +that conceptually contains other objects (such as paragraphs and possibly +sub-headings etc.). Objects are tracked sequentially as they relate to each +other object within a document and substantive objects are numbered +sequentially, for citation purposes. Notably footnotes are not objects in +themselves, rather belonging to the object from which they are referenced, and +following their own numbering sequence. From heading objects (linked) tables of +content may be generated, and if additional metadata is provided book type +indexes can be generated that link back to the objects to which they relate. + +!_ Unpacking this a bit further. 
+SiSU as a concept, independent of its markup language and the parsers that have +been implemented, is based on the following ideas: + +!_ Object-Centricity. On objects: +In SiSU objects are the fundamental unit from which larger constructs within a +document and the document itself are built. Breaking the document into objects +provides interesting possibilities. + +!_ Objects are fundamental building blocks: +Conceptually within SiSU, objects are the building blocks or individual units of +construction of a document. Objects are usually blocks of text, the most common +of which is the paragraph; other examples include: individual headings, tables, +grouped text of various types which include code blocks and verse within poems, +... and as mentioned an object could also, for example, be an image. Objects can +be formatted and placed as needed, providing flexibility and enabling multiple +types of representation across disparate formats and text receptacles, examples +including html, epub, latex (in the past mind-maps) and sql (populated at an +object level, and thereby providing search with that degree of granularity). + +!_ Sequential. Objects have sequence: +That objects have sequence goes largely without saying; this follows +authorship, and is part of the definition of a document and how a document is +written to convey meaning. + +!_ Object Numbers & Citation. Substantive objects are numbered for citation purposes: +Most objects within a document are meant by the author to be a substantive part +of the document. All such objects are numbered sequentially and can be +referenced thereby for citation purposes. Object numbers provide the possibility +of citing/locating text precisely across different document formats and +different languages (assuming the document has been translated). 
For search it +also makes it possible to identify precisely where search criteria are met within +each document in the form of an index or to view those precise text objects +before deciding which documents are of interest. Additionally the use of objects +(and that objects are numbered) frees the possibility to represent the document +in the manner considered most suitable to a specific document format whilst +retaining its structural (and citation) integrity. + +!_ Characteristics. Objects have properties and attributes: +Objects have properties (and may have attributes). By properties I here refer to +the fundamental type of object, be it a heading, a paragraph, table, verse etc. +Attributes extend further and may include other things that one might wish to +associate with the object (examples not necessarily currently available/ +implemented in SiSU might include formatting, such as whether it is indented, or +metadata e.g. the associated language, or programming language for a code block). + +!_ Document structure. Heading objects hold document structure: +Heading objects hold document structure through their heading level property. +The types of document of interest to SiSU have structure that is captured by the +heading level property. Headings are individual objects like any other with the +additional properties that (i) they may be regarded as containing the other +objects following them sequentially (until the next heading of a similar or +higher level), heading objects may include other headings (sub-headings), and +(ii) that they have a hierarchy, the root "heading" being the document title. \\ +A complication was introduced to provide greater flexibility across document +output formats. Headings have two sets of levels: the level under which +substantive text occurs, this would be a chapter or segment level, and above +that in the hierarchy if needed are document section separators, book, section, +part. 
+ +!_ Non-objects +Most but not all parts of a document are treated as objects. Notably footnotes +are not objects in themselves, rather belonging to the object from which they +are referenced, and following their own numbering sequence. From heading objects +(linked) tables of content may be generated, and if additional metadata is +provided book type indexes can be generated that link back to the objects to +which they relate. + +!_ The Document Header. +SiSU documents have headers which contain document metadata, at a minimum the +document title and author. In addition the document header may contain markup +instruction (e.g. how to identify headings within the document, in which case +those headings need not be found and treated accordingly). + +SiSU parsers have now been implemented in different programming paradigms and +languages a couple of times; the chosen markup has been left unchanged though +the document headers have been modified. + +This is the core of sisu, beyond which there is more but largely in the form of +choices based on ... existing output formats and of implementation detail, +deciding what attributes of objects, or within objects should be supported, +extending markup to allow for the generation of book indexes from tagging, if +provided. + +2~ Older Descriptions + +Here is a description that has been used for the original sisu (scribe): + +With minimal preparation of a plain-text (UTF-8) file, using sisu markup syntax +in your text editor of choice, SiSU can generate various document formats, most +of which share a common object numbering system for locating content, including +plain text, HTML, XHTML, XML, EPUB, OpenDocument text (ODF:ODT), LaTeX, PDF +files, and populate an SQL database with objects (roughly paragraph-sized +chunks) so searches may be performed and matches returned with that degree of +granularity. 
Think of being able to finely match text in documents, using common +object numbers, across different output formats (same object identifier for pdf, +epub or html) and across languages if you have translations of the same document +(same object identifier across languages). For search, your criteria are met by +these documents at these locations within each document (equally relevant across +different output formats and languages). To be clear (if obvious) page numbers +provide none of this functionality. Object numbering is particularly suitable +for "published" works (finalized texts as opposed to works that are frequently +changed or updated) for which it provides a fixed means of reference of content. +Document outputs can also share provided semantic meta-data. + +2~ ... + +SiSU is less about document layout than it is about finding a way using little +markup to construct an abstract representation of a document that makes it +possible to produce multiple representations of it which may be rather different +from each other and used for different purposes, whether layout and publishing, +scrollworthy online viewing/reading, or content search. To be able to take +advantage from its minimal preparation starting point of some of the strengths +of rather different established ways of representing documents for different +purposes, whether for search (relational database, or indexed flat files +generated for that purpose whether of complete documents, or say of files made +up of objects), online or other electronic viewing (e.g. html, xml, epub), or +paper publication (e.g. pdf via latex)... + +The solution arrived at is to extract structural information about the document +(document sections and headings within the document, available through pattern +matching or markup) and to track objects (which primarily are defined units of +text such as paragraphs, headings, tables, verse, etc. 
but also images) which +can be reconstituted as the same documents with relevant object identification +numbers so text (objects) can be referenced across different output formats and +presentations. + +SiSU generates tables of content, and through its markup the means for metadata +to be provided for the generation of book style indexes for a document (that +again due to document object numbers are the same and equally relevant across +all document formats). Per document classifying/organizing metadata can also be +provided for automated document curation. + +... there have also been working experiments with sisu markup source, two way +conversion/representation of sisu document markup source in mind-mapping +(software kdissert was used for its strong focus on producing documents (now +apparently called semantik)); also po4a software for translators has been used +successfully in its regular text mode for sisu markup in translation, (which is +more an attribute of po4a than of sisu, but) which is of interest due to +sisu/spine's object citation numbering being available across translations. Open +Document Format text (odf:odt), has been an output, but much more interesting +(and requested by potential users of sisu/spine) would be the ability of a word +processor to save text/a document in sisu markup, making alternative document +processing and presentations with sisu possible. + +Also worth mention, in the relatively long history of this project, there has +been work done on extracting hash representations of each object, that could +hypothetically be shared to prove the content of a document without sharing its +content, or of identifying which objects change; these hashes can also be used +as unique identifiers in a database or as identifying filenames if individual +objects are saved. + +SiSU has evolved; the current implementation focuses on one primary use-case, +books and literary writings. However, the concept on which it is based has wider +application. 
Here is a previously posted souvenir from my encounter with an IBM +software evaluator in London in June 2004 that came about through a chance +encounter with an IBM manager at a Linux Expo, who was curious about my interest +in GNU/Linux with my legal background... on hearing that I also wrote software, +he suggested, maybe IBM should have a look at it. I was interested, the meeting +was set up... with an IBM, Software Innovations evaluator<br>His response after +the meeting: + +"Ralph \\ Good to meet with you today, I was very impressed with your +software. \\ /{ [colleague's name (also posted to an IBM colleague)] }/ - in +summary - Ralph has built an application that runs on linux and takes ASCII +documents and pulls them apart in to the smallest constituent parts, storing +them as XML, PDF and HTML, the HTML are hyperlinked up so the document can be +browsed in its full form. the format and text data created is stored in a +database.<br>This has potential in any place that needs the power of full text +search whilst holding the structural concepts of the document i.e. legal, +pharma, education, research.. which ones we need to figure out, ..." + +Special interest was expressed in the search implications of SiSU. To +paraphrase, the company has document management systems dealing with hundreds of +thousands of texts; these tell you which documents match your search criteria, +but cannot inform you where within a text these matches were found without +opening the documents. This is achieved through defining document objects and +making them the building block of the document, trackable document objects (that +can be placed back in the context of the document or corpus of documents if part +of a collection). SiSU's early design was to - abstract documents to their +structure, and identified objects, numbered in a citable way (as pointed out +document object hashes can be of use for the purpose). 
+ +2~ SiSU Spine + +SiSU Spine is the new generator for documents prepared in sisu markup, written +in D as opposed to the original sisu which was first shared in Ruby. + +Spine code has not as yet been made publicly available. + +As compared with the original sisu generator, sisu spine: + +- Spine uses the same document markup for the document body, but uses yaml for +document headers (which contain document metadata and configuration details); +the original sisu has a bespoke markup for headers. + +- Spine (written in D) is considerably faster at generating native output than +sisu (written in Ruby), on last test at least 60 times faster (what took 1 +minute takes 1 second; 1 hour a minute :-) (admittedly some time ago, ruby has +been getting faster, hopefully this is not over-promising). + +- Spine produces fewer document output types than sisu (html, epub, (odt, +latex) and populates sql db for search) + +- As regards non-native output, so far Spine has greater separation of what it +does and largely leaves calling the external program to the user, e.g.: latex +output is a native output in the sense that it is generated directly by spine, +but the pdfs that can be produced from these are produced through use of an +external program xelatex, which produces fine output but is a very much slower +process. + +- (where both produce the same output type, generally) Spine generally produces +more up to date output format representations. + :B~ SiSU Markup ={ SiSU markup:test } |
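
The object model described in the added text (objects tracked in sequence, substantive objects numbered for citation, footnotes belonging to their parent object with their own numbering, headings giving a table of contents, per-object hashes) can be illustrated with a rough sketch. This is not taken from sisu or spine: the class `DocObject`, its fields, the helper functions and the choice of SHA-256 are all hypothetical, and for simplicity every object in the list is treated as substantive.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class DocObject:
    """One tracked unit of text (hypothetical model, not spine's own)."""
    kind: str                      # property: "heading", "para", "table", ...
    text: str
    level: Optional[int] = None    # heading level; None for non-headings
    footnotes: List[str] = field(default_factory=list)  # owned by this object
    number: int = 0                # sequential citation number, set below
    digest: str = ""               # content hash, set below


def number_objects(objects: List[DocObject]) -> List[DocObject]:
    """Number objects in document order and hash each object's text."""
    for n, obj in enumerate(objects, start=1):
        obj.number = n
        # SHA-256 is an assumption; the text only says "hash representations".
        obj.digest = hashlib.sha256(obj.text.encode("utf-8")).hexdigest()
    return objects


def table_of_contents(objects: List[DocObject]) -> List[Tuple[int, str, int]]:
    """Return (heading level, heading text, object number) per heading."""
    return [(o.level, o.text, o.number) for o in objects if o.kind == "heading"]


def footnote_sequence(objects: List[DocObject]) -> List[Tuple[int, int, str]]:
    """Number footnotes in their own sequence, keyed to the parent object."""
    notes = []
    for obj in objects:
        for note in obj.footnotes:
            notes.append((len(notes) + 1, obj.number, note))
    return notes


doc = number_objects([
    DocObject("heading", "SiSU Description", level=1),
    DocObject("para", "SiSU is an object-centric tool.", footnotes=["A note."]),
    DocObject("heading", "Older Descriptions", level=2),
    DocObject("para", "An earlier description.", footnotes=["Another note."]),
])

for level, title, number in table_of_contents(doc):
    print(f"{'  ' * (level - 1)}{title} [object {number}]")
```

Because the numbers attach to objects rather than to pages or files, the same `number` would identify the same paragraph in any output generated from the list, which is the property the description relies on for citation and search.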