Commit | Line | Data |
---|---|---|
58c4a39d AW |
1 | @c -*-texinfo-*- |
2 | @c This is part of the GNU Guile Reference Manual. | |
3 | @c Copyright (C) 2013 Free Software Foundation, Inc. | |
4 | @c See the file guile.texi for copying conditions. | |
5 | ||
6 | @node SXML | |
7 | @section SXML | |
8 | ||
3e31e75a AW |
9 | SXML is a native representation of XML in terms of standard Scheme data |
10 | types: lists, symbols, and strings. For example, the simple XML | |
11 | fragment: | |
58c4a39d | 12 | |
3e31e75a AW |
13 | @example |
14 | <parrot type="African Grey"><name>Alfie</name></parrot> | |
58c4a39d AW |
15 | @end example |
16 | ||
3e31e75a | 17 | may be represented with the following SXML: |
58c4a39d | 18 | |
3e31e75a AW |
19 | @example |
20 | (parrot (@@ (type "African Grey)) (name "Alfie")) | |
21 | @end example | |
58c4a39d | 22 | |
3e31e75a AW |
23 | SXML is very general, and is capable of representing all of XML. |
24 | Formally, this means that SXML is a conforming implementation of the | |
25 | @uref{XML Information Set,http://www.w3.org/TR/xml-infoset/} standard. | |
58c4a39d | 26 | |
3e31e75a AW |
27 | Guile includes several facilities for working with XML and SXML: |
28 | parsers, serializers, and transformers. | |
58c4a39d | 29 | |
3e31e75a AW |
30 | @menu |
31 | * SXML Overview:: XML, as it was meant to be | |
32 | * Reading and Writing XML:: Convenient XML parsing and serializing | |
33 | * SSAX:: Custom functional-style XML parsers | |
34 | * Transforming SXML:: Munging SXML with @code{pre-post-order} | |
35 | * SXML Tree Fold:: Fold-based SXML transformations | |
36 | * SXPath:: XPath for SXML | |
37 | * sxml apply-templates:: A more XSLT-like approach to SXML transformations | |
38 | * sxml ssax input-parse:: The SSAX tokenizer, optimized for Guile | |
39 | @end menu | |
58c4a39d | 40 | |
3e31e75a AW |
41 | @node SXML Overview |
42 | @subsection SXML Overview | |
58c4a39d | 43 | |
3e31e75a | 44 | (This section needs to be written; volunteers welcome.) |
58c4a39d | 45 | |
58c4a39d | 46 | |
3e31e75a AW |
47 | @node Reading and Writing XML |
48 | @subsection Reading and Writing XML | |
58c4a39d | 49 | |
3e31e75a AW |
50 | The @code{(sxml simple)} module presents a basic interface for parsing |
51 | XML from a port into the Scheme SXML format, and for serializing it back | |
52 | to text. | |
58c4a39d | 53 | |
3e31e75a AW |
54 | @example |
55 | (use-modules (sxml simple)) | |
56 | @end example | |
58c4a39d | 57 | |
a14b6e18 | 58 | @deffn {Scheme Procedure} xml->sxml [string-or-port] [#:namespaces='()] @ |
1488753a | 59 | [#:declare-namespaces?=#t] [#:trim-whitespace?=#f] @ |
e10c2509 AW |
60 | [#:entities='()] [#:default-entity-handler=#f] @ |
61 | [#:doctype-handler=#f] | |
58c4a39d | 62 | Use SSAX to parse an XML document into SXML. Takes one optional |
a14b6e18 AW |
63 | argument, @var{string-or-port}, which defaults to the current input |
64 | port. Returns the resulting SXML document. If @var{string-or-port} is | |
65 | a port, it will be left pointing at the next available character in the | |
66 | port. | |
3e31e75a | 67 | @end deffn |
58c4a39d | 68 | |
1488753a AW |
69 | As is normal in SXML, XML elements parse as tagged lists. Attributes, |
70 | if any, are placed after the tag, within an @code{@@} element. The root | |
71 | of the resulting XML will be contained in a special tag, @code{*TOP*}. | |
72 | This tag will contain the root element of the XML, but also any prior | |
73 | processing instructions. | |
74 | ||
75 | @example | |
a14b6e18 | 76 | (xml->sxml "<foo/>") |
1488753a | 77 | @result{} (*TOP* (foo)) |
a14b6e18 | 78 | (xml->sxml "<foo>text</foo>") |
1488753a | 79 | @result{} (*TOP* (foo "text")) |
a14b6e18 | 80 | (xml->sxml "<foo kind=\"bar\">text</foo>") |
1488753a | 81 | @result{} (*TOP* (foo (@@ (kind "bar")) "text")) |
a14b6e18 | 82 | (xml->sxml "<?xml version=\"1.0\"?><foo/>") |
1488753a AW |
83 | @result{} (*TOP* (*PI* xml "version=\"1.0\"") (foo)) |
84 | @end example | |
85 | ||
86 | All namespaces in the XML document must be declared, via @code{xmlns} | |
87 | attributes. SXML elements built from non-default namespaces will have | |
88 | their tags prefixed with their URI. Users can specify custom prefixes | |
89 | for certain namespaces with the @code{#:namespaces} keyword argument to | |
90 | @code{xml->sxml}. | |
91 | ||
92 | @example | |
a14b6e18 | 93 | (xml->sxml "<foo xmlns=\"http://example.org/ns1\">text</foo>") |
1488753a | 94 | @result{} (*TOP* (http://example.org/ns1:foo "text")) |
a14b6e18 AW |
95 | (xml->sxml "<foo xmlns=\"http://example.org/ns1\">text</foo>" |
96 | #:namespaces '((ns1 . "http://example.org/ns1"))) | |
1488753a | 97 | @result{} (*TOP* (ns1:foo "text")) |
a14b6e18 AW |
98 | (xml->sxml "<foo xmlns:bar=\"http://example.org/ns2\"><bar:baz/></foo>" |
99 | #:namespaces '((ns2 . "http://example.org/ns2"))) | |
1488753a AW |
100 | @result{} (*TOP* (foo (ns2:baz))) |
101 | @end example | |
102 | ||
e10c2509 AW |
103 | By default, namespaces passed to @code{xml->sxml} are treated as if they |
104 | were declared on the root element. Passing a false | |
105 | @code{#:declare-namespaces?} argument will disable this behavior, | |
106 | requiring in-document declarations of namespaces before use.. | |
1488753a AW |
107 | |
108 | @example | |
a14b6e18 | 109 | (xml->sxml "<foo><ns2:baz/></foo>" |
1488753a | 110 | #:namespaces '((ns2 . "http://example.org/ns2"))) |
e10c2509 | 111 | @result{} (*TOP* (foo (ns2:baz))) |
a14b6e18 | 112 | (xml->sxml "<foo><ns2:baz/></foo>" |
1488753a | 113 | #:namespaces '((ns2 . "http://example.org/ns2")) |
e10c2509 AW |
114 | #:declare-namespaces? #f) |
115 | @result{} error: undeclared namespace: `bar' | |
1488753a AW |
116 | @end example |
117 | ||
118 | By default, all whitespace in XML is significant. Passing the | |
119 | @code{#:trim-whitespace?} keyword argument to @code{xml->sxml} will trim | |
120 | whitespace in front, behind and between elements, treating it as | |
121 | ``unsignificant''. Whitespace in text fragments is left alone. | |
122 | ||
123 | @example | |
a14b6e18 | 124 | (xml->sxml "<foo>\n<bar> Alfie the parrot! </bar>\n</foo>") |
e10c2509 | 125 | @result{} (*TOP* (foo "\n" (bar " Alfie the parrot! ") "\n")) |
a14b6e18 | 126 | (xml->sxml "<foo>\n<bar> Alfie the parrot! </bar>\n</foo>" |
1488753a | 127 | #:trim-whitespace? #t) |
e10c2509 | 128 | @result{} (*TOP* (foo (bar " Alfie the parrot! "))) |
1488753a AW |
129 | @end example |
130 | ||
131 | Parsed entities may be declared with the @code{#:entities} keyword | |
132 | argument, or handled with the @code{#:default-entity-handler}. By | |
133 | default, only the standard @code{<}, @code{>}, @code{&}, | |
134 | @code{'} and @code{"} entities are defined, as well as the | |
135 | @code{&#@var{N};} and @code{&#x@var{N};} (decimal and hexadecimal) | |
136 | numeric character entities. | |
137 | ||
138 | @example | |
a14b6e18 | 139 | (xml->sxml "<foo>&</foo>") |
1488753a | 140 | @result{} (*TOP* (foo "&")) |
a14b6e18 | 141 | (xml->sxml "<foo> </foo>") |
1488753a | 142 | @result{} error: undefined entity: nbsp |
a14b6e18 | 143 | (xml->sxml "<foo> </foo>") |
1488753a | 144 | @result{} (*TOP* (foo "\xa0")) |
a14b6e18 | 145 | (xml->sxml "<foo> </foo>" |
1488753a AW |
146 | #:entities '((nbsp . "\xa0"))) |
147 | @result{} (*TOP* (foo "\xa0")) | |
a14b6e18 | 148 | (xml->sxml "<foo> &foo;</foo>" |
1488753a AW |
149 | #:default-entity-handler |
150 | (lambda (port name) | |
151 | (case name | |
152 | ((nbsp) "\xa0") | |
153 | (else | |
154 | (format (current-warning-port) | |
155 | "~a:~a:~a: undefined entitity: ~a\n" | |
156 | (or (port-filename port) "<unknown file>") | |
157 | (port-line port) (port-column port) | |
158 | name) | |
159 | (symbol->string name))))) | |
160 | @print{} <unknown file>:0:17: undefined entitity: foo | |
161 | @result{} (*TOP* (foo "\xa0 foo")) | |
162 | @end example | |
163 | ||
e10c2509 AW |
164 | By default, @code{xml->sxml} skips over the @code{<!DOCTYPE>} |
165 | declaration, if any. This behavior can be overridden with the | |
166 | @code{#:doctype-handler} argument, which should be a procedure of three | |
167 | arguments: the @dfn{docname} (a symbol), @dfn{systemid} (a string), and | |
168 | the internal doctype subset (as a string or @code{#f} if not present). | |
169 | ||
170 | The handler should return keyword arguments as multiple values, as if it | |
171 | were calling its continuation with keyword arguments. The continuation | |
172 | accepts the @code{#:entities} and @code{#:namespaces} keyword arguments, | |
173 | in the same format that @code{xml->sxml} itself takes. These entities | |
174 | and namespaces will be prepended to those given to the @code{xml->sxml} | |
175 | invocation. | |
176 | ||
177 | @example | |
178 | (define (handle-foo docname systemid internal-subset) | |
179 | (case docname | |
180 | ((foo) | |
181 | (values #:entities '((greets . "<i>Hello, world!</i>")))) | |
182 | (else | |
183 | (values)))) | |
184 | ||
185 | (xml->sxml "<!DOCTYPE foo><p>&greets;</p>" | |
186 | #:doctype-handler handle-foo) | |
187 | @result{} (*TOP* (p (i "Hello, world!"))) | |
188 | @end example | |
189 | ||
190 | If the document has no doctype declaration, the @var{doctype-handler} is | |
191 | invoked with @code{#f} for the three arguments. | |
192 | ||
193 | In the future, the continuation may accept other keyword arguments, for | |
194 | example to validate the parsed SXML against the doctype. | |
195 | ||
3e31e75a AW |
196 | @deffn {Scheme Procedure} sxml->xml tree [port] |
197 | Serialize the SXML tree @var{tree} as XML. The output will be written to | |
58c4a39d AW |
198 | the current output port, unless the optional argument @var{port} is |
199 | present. | |
3e31e75a | 200 | @end deffn |
58c4a39d | 201 | |
3e31e75a | 202 | @deffn {Scheme Procedure} sxml->string sxml |
58c4a39d AW |
203 | Detag an sxml tree @var{sxml} into a string. Does not perform any |
204 | formatting. | |
3e31e75a AW |
205 | @end deffn |
206 | ||
207 | @node SSAX | |
208 | @subsection SSAX: A Functional XML Parsing Toolkit | |
209 | ||
210 | Guile's XML parser is based on Oleg Kiselyov's powerful XML parsing | |
211 | toolkit, SSAX. | |
212 | ||
213 | @subsubsection History | |
214 | ||
215 | Back in the 1990s, when the world was young again and XML was the | |
216 | solution to all of its problems, there were basically two kinds of XML | |
217 | parsers out there: DOM parsers and SAX parsers. | |
218 | ||
219 | A DOM parser reads through an entire XML document, building up a tree of | |
220 | ``DOM objects'' representing the document structure. They are very easy | |
221 | to use, but sometimes you don't actually want all of the information in | |
222 | a document; building an object tree is not necessary if all you want to | |
223 | do is to count word frequencies in a document, for example. | |
224 | ||
225 | SAX parsers were created to give the programmer more control on the | |
226 | parsing process. A programmer gives the SAX parser a number of | |
227 | ``callbacks'': functions that will be called on various features of the | |
228 | XML stream as they are encountered. SAX parsers are more efficient, but | |
229 | much harder to user, as users typically have to manually maintain a | |
230 | stack of open elements. | |
231 | ||
232 | Kiselyov realized that the SAX programming model could be made much | |
233 | simpler if the callbacks were formulated not as a linear fold across the | |
234 | features of the XML stream, but as a @emph{tree fold} over the structure | |
235 | implicit in the XML. In this way, the user has a very convenient, | |
236 | functional-style interface that can still generate optimal parsers. | |
237 | ||
238 | The @code{xml->sxml} interface from the @code{(sxml simple)} module is a | |
239 | DOM-style parser built using SSAX, though it returns SXML instead of DOM | |
240 | objects. | |
241 | ||
242 | @subsubsection Implementation | |
243 | ||
244 | @code{(sxml ssax)} is a package of low-to-high level lexing and parsing | |
245 | procedures that can be combined to yield a SAX, a DOM, a validating | |
246 | parser, or a parser intended for a particular document type. The | |
247 | procedures in the package can be used separately to tokenize or parse | |
248 | various pieces of XML documents. The package supports XML Namespaces, | |
249 | internal and external parsed entities, user-controlled handling of | |
250 | whitespace, and validation. This module therefore is intended to be a | |
251 | framework, a set of ``Lego blocks'' you can use to build a parser | |
252 | following any discipline and performing validation to any degree. As an | |
253 | example of the parser construction, this file includes a semi-validating | |
254 | SXML parser. | |
255 | ||
256 | SSAX has a ``sequential'' feel of SAX yet a ``functional style'' of DOM. | |
257 | Like a SAX parser, the framework scans the document only once and | |
258 | permits incremental processing. An application that handles document | |
259 | elements in order can run as efficiently as possible. @emph{Unlike} a | |
260 | SAX parser, the framework does not require an application register | |
261 | stateful callbacks and surrender control to the parser. Rather, it is | |
262 | the application that can drive the framework -- calling its functions to | |
263 | get the current lexical or syntax element. These functions do not | |
264 | maintain or mutate any state save the input port. Therefore, the | |
265 | framework permits parsing of XML in a pure functional style, with the | |
266 | input port being a monad (or a linear, read-once parameter). | |
267 | ||
268 | Besides the @var{port}, there is another monad -- @var{seed}. Most of | |
58c4a39d | 269 | the middle- and high-level parsers are single-threaded through the |
3e31e75a AW |
270 | @var{seed}. The functions of this framework do not process or affect |
271 | the @var{seed} in any way: they simply pass it around as an instance of | |
272 | an opaque datatype. User functions, on the other hand, can use the seed | |
273 | to maintain user's state, to accumulate parsing results, etc. A user | |
274 | can freely mix his own functions with those of the framework. On the | |
275 | other hand, the user may wish to instantiate a high-level parser: | |
276 | @code{SSAX:make-elem-parser} or @code{SSAX:make-parser}. In the latter | |
58c4a39d AW |
277 | case, the user must provide functions of specific signatures, which are |
278 | called at predictable moments during the parsing: to handle character | |
3e31e75a | 279 | data, element data, or processing instructions (PI). The functions are |
58c4a39d AW |
280 | always given the @var{seed}, among other parameters, and must return the |
281 | new @var{seed}. | |
282 | ||
283 | From a functional point of view, XML parsing is a combined | |
3e31e75a | 284 | pre-post-order traversal of a ``tree'' that is the XML document itself. |
58c4a39d | 285 | This down-and-up traversal tells the user about an element when its |
3e31e75a AW |
286 | start tag is encountered. The user is notified about the element once |
287 | more, after all element's children have been handled. The process of | |
288 | XML parsing therefore is a fold over the raw XML document. Unlike a | |
289 | fold over trees defined in [1], the parser is necessarily | |
290 | single-threaded -- obviously as elements in a text XML document are laid | |
291 | down sequentially. The parser therefore is a tree fold that has been | |
292 | transformed to accept an accumulating parameter [1,2]. | |
58c4a39d AW |
293 | |
294 | Formally, the denotational semantics of the parser can be expressed as | |
295 | ||
296 | @smallexample | |
297 | parser:: (Start-tag -> Seed -> Seed) -> | |
298 | (Start-tag -> Seed -> Seed -> Seed) -> | |
299 | (Char-Data -> Seed -> Seed) -> | |
300 | XML-text-fragment -> Seed -> Seed | |
301 | parser fdown fup fchar "<elem attrs> content </elem>" seed | |
302 | = fup "<elem attrs>" seed | |
303 | (parser fdown fup fchar "content" (fdown "<elem attrs>" seed)) | |
304 | ||
305 | parser fdown fup fchar "char-data content" seed | |
306 | = parser fdown fup fchar "content" (fchar "char-data" seed) | |
307 | ||
308 | parser fdown fup fchar "elem-content content" seed | |
309 | = parser fdown fup fchar "content" ( | |
310 | parser fdown fup fchar "elem-content" seed) | |
311 | @end smallexample | |
312 | ||
313 | Compare the last two equations with the left fold | |
314 | ||
315 | @smallexample | |
316 | fold-left kons elem:list seed = fold-left kons list (kons elem seed) | |
317 | @end smallexample | |
318 | ||
319 | The real parser created by @code{SSAX:make-parser} is slightly more | |
320 | complicated, to account for processing instructions, entity references, | |
321 | namespaces, processing of document type declaration, etc. | |
322 | ||
3e31e75a AW |
323 | The XML standard document referred to in this module is |
324 | @uref{http://www.w3.org/TR/1998/REC-xml-19980210.html} | |
58c4a39d AW |
325 | |
326 | The present file also defines a procedure that parses the text of an XML | |
327 | document or of a separate element into SXML, an S-expression-based model | |
3e31e75a AW |
328 | of an XML Information Set. SXML is also an Abstract Syntax Tree of an |
329 | XML document. SXML is similar but not identical to DOM; SXML is | |
58c4a39d | 330 | particularly suitable for Scheme-based XML/HTML authoring, SXPath |
3e31e75a AW |
331 | queries, and tree transformations. See SXML.html for more details. |
332 | SXML is a term implementation of evaluation of the XML document [3]. | |
333 | The other implementation is context-passing. | |
334 | ||
335 | The present frameworks fully supports the XML Namespaces Recommendation: | |
336 | @uref{http://www.w3.org/TR/REC-xml-names/}. | |
58c4a39d | 337 | |
3e31e75a | 338 | Other links: |
58c4a39d AW |
339 | |
340 | @table @asis | |
341 | @item [1] | |
342 | Jeremy Gibbons, Geraint Jones, "The Under-appreciated Unfold," Proc. | |
343 | ICFP'98, 1998, pp. 273-279. | |
344 | ||
345 | @item [2] | |
346 | Richard S. Bird, The promotion and accumulation strategies in | |
347 | transformational programming, ACM Trans. Progr. Lang. Systems, | |
348 | 6(4):487-504, October 1984. | |
349 | ||
350 | @item [3] | |
351 | Ralf Hinze, "Deriving Backtracking Monad Transformers," Functional | |
352 | Pearl. Proc ICFP'00, pp. 186-197. | |
353 | ||
354 | @end table | |
355 | ||
356 | @subsubsection Usage | |
3e31e75a AW |
357 | @deffn {Scheme Procedure} current-ssax-error-port |
358 | @end deffn | |
58c4a39d | 359 | |
3e31e75a AW |
360 | @deffn {Scheme Procedure} with-ssax-error-to-port port thunk |
361 | @end deffn | |
58c4a39d | 362 | |
3e31e75a | 363 | @deffn {Scheme Procedure} xml-token? _ |
58c4a39d AW |
364 | @verbatim |
365 | -- Scheme Procedure: pair? x | |
366 | Return `#t' if X is a pair; otherwise return `#f'. | |
367 | ||
368 | ||
369 | @end verbatim | |
3e31e75a | 370 | @end deffn |
58c4a39d | 371 | |
3e31e75a AW |
372 | @deffn {Scheme Syntax} xml-token-kind token |
373 | @end deffn | |
58c4a39d | 374 | |
3e31e75a AW |
375 | @deffn {Scheme Syntax} xml-token-head token |
376 | @end deffn | |
58c4a39d | 377 | |
3e31e75a AW |
378 | @deffn {Scheme Procedure} make-empty-attlist |
379 | @end deffn | |
58c4a39d | 380 | |
3e31e75a AW |
381 | @deffn {Scheme Procedure} attlist-add attlist name-value |
382 | @end deffn | |
58c4a39d | 383 | |
a4b4fbbd JE |
384 | @deffn {Scheme Procedure} attlist-null? x |
385 | Return @code{#t} if @var{x} is the empty list, else @code{#f}. | |
3e31e75a | 386 | @end deffn |
58c4a39d | 387 | |
3e31e75a AW |
388 | @deffn {Scheme Procedure} attlist-remove-top attlist |
389 | @end deffn | |
58c4a39d | 390 | |
3e31e75a AW |
391 | @deffn {Scheme Procedure} attlist->alist attlist |
392 | @end deffn | |
58c4a39d | 393 | |
3e31e75a AW |
394 | @deffn {Scheme Procedure} attlist-fold kons knil lis1 |
395 | @end deffn | |
58c4a39d | 396 | |
3e31e75a AW |
397 | @deffn {Scheme Procedure} define-parsed-entity! entity str |
398 | Define a new parsed entity. @var{entity} should be a symbol. | |
58c4a39d AW |
399 | |
400 | Instances of &@var{entity}; in XML text will be replaced with the string | |
401 | @var{str}, which will then be parsed. | |
3e31e75a | 402 | @end deffn |
58c4a39d | 403 | |
3e31e75a | 404 | @deffn {Scheme Procedure} reset-parsed-entity-definitions! |
58c4a39d | 405 | Restore the set of parsed entity definitions to its initial state. |
3e31e75a | 406 | @end deffn |
58c4a39d | 407 | |
3e31e75a AW |
408 | @deffn {Scheme Procedure} ssax:uri-string->symbol uri-str |
409 | @end deffn | |
58c4a39d | 410 | |
3e31e75a AW |
411 | @deffn {Scheme Procedure} ssax:skip-internal-dtd port |
412 | @end deffn | |
58c4a39d | 413 | |
3e31e75a AW |
414 | @deffn {Scheme Procedure} ssax:read-pi-body-as-string port |
415 | @end deffn | |
58c4a39d | 416 | |
3e31e75a AW |
417 | @deffn {Scheme Procedure} ssax:reverse-collect-str-drop-ws fragments |
418 | @end deffn | |
58c4a39d | 419 | |
3e31e75a AW |
420 | @deffn {Scheme Procedure} ssax:read-markup-token port |
421 | @end deffn | |
58c4a39d | 422 | |
3e31e75a AW |
423 | @deffn {Scheme Procedure} ssax:read-cdata-body port str-handler seed |
424 | @end deffn | |
58c4a39d | 425 | |
3e31e75a AW |
426 | @deffn {Scheme Procedure} ssax:read-char-ref port |
427 | @end deffn | |
58c4a39d | 428 | |
3e31e75a AW |
429 | @deffn {Scheme Procedure} ssax:read-attributes port entities |
430 | @end deffn | |
58c4a39d | 431 | |
3e31e75a AW |
432 | @deffn {Scheme Procedure} ssax:complete-start-tag tag-head port elems entities namespaces |
433 | @end deffn | |
58c4a39d | 434 | |
3e31e75a AW |
435 | @deffn {Scheme Procedure} ssax:read-external-id port |
436 | @end deffn | |
58c4a39d | 437 | |
3e31e75a AW |
438 | @deffn {Scheme Procedure} ssax:read-char-data port expect-eof? str-handler seed |
439 | @end deffn | |
58c4a39d | 440 | |
3e31e75a AW |
441 | @deffn {Scheme Procedure} ssax:xml->sxml port namespace-prefix-assig |
442 | @end deffn | |
58c4a39d | 443 | |
3e31e75a AW |
444 | @deffn {Scheme Syntax} ssax:make-parser . kw-val-pairs |
445 | @end deffn | |
58c4a39d | 446 | |
3e31e75a AW |
447 | @deffn {Scheme Syntax} ssax:make-pi-parser orig-handlers |
448 | @end deffn | |
58c4a39d | 449 | |
3e31e75a AW |
450 | @deffn {Scheme Syntax} ssax:make-elem-parser my-new-level-seed my-finish-element my-char-data-handler my-pi-handlers |
451 | @end deffn | |
58c4a39d | 452 | |
3e31e75a AW |
453 | @node Transforming SXML |
454 | @subsection Transforming SXML | |
58c4a39d AW |
455 | @subsubsection Overview |
456 | @heading SXML expression tree transformers | |
457 | @subheading Pre-Post-order traversal of a tree and creation of a new tree | |
458 | @smallexample | |
459 | pre-post-order:: <tree> x <bindings> -> <new-tree> | |
460 | @end smallexample | |
461 | ||
462 | where | |
463 | ||
464 | @smallexample | |
465 | <bindings> ::= (<binding> ...) | |
466 | <binding> ::= (<trigger-symbol> *preorder* . <handler>) | | |
467 | (<trigger-symbol> *macro* . <handler>) | | |
468 | (<trigger-symbol> <new-bindings> . <handler>) | | |
469 | (<trigger-symbol> . <handler>) | |
470 | <trigger-symbol> ::= XMLname | *text* | *default* | |
471 | <handler> :: <trigger-symbol> x [<tree>] -> <new-tree> | |
472 | @end smallexample | |
473 | ||
474 | The pre-post-order function visits the nodes and nodelists | |
3e31e75a | 475 | pre-post-order (depth-first). For each @code{<Node>} of the form |
58c4a39d | 476 | @code{(@var{name} <Node> ...)}, it looks up an association with the |
3e31e75a AW |
477 | given @var{name} among its @var{<bindings>}. If failed, |
478 | @code{pre-post-order} tries to locate a @code{*default*} binding. It's | |
479 | an error if the latter attempt fails as well. Having found a binding, | |
58c4a39d AW |
480 | the @code{pre-post-order} function first checks to see if the binding is |
481 | of the form | |
482 | ||
483 | @smallexample | |
484 | (<trigger-symbol> *preorder* . <handler>) | |
485 | @end smallexample | |
486 | ||
3e31e75a | 487 | If it is, the handler is 'applied' to the current node. Otherwise, the |
58c4a39d AW |
488 | pre-post-order function first calls itself recursively for each child of |
489 | the current node, with @var{<new-bindings>} prepended to the | |
3e31e75a AW |
490 | @var{<bindings>} in effect. The result of these calls is passed to the |
491 | @var{<handler>} (along with the head of the current @var{<Node>}). To be | |
58c4a39d | 492 | more precise, the handler is _applied_ to the head of the current node |
3e31e75a AW |
493 | and its processed children. The result of the handler, which should also |
494 | be a @code{<tree>}, replaces the current @var{<Node>}. If the current | |
58c4a39d AW |
495 | @var{<Node>} is a text string or other atom, a special binding with a |
496 | symbol @code{*text*} is looked up. | |
497 | ||
498 | A binding can also be of a form | |
499 | ||
500 | @smallexample | |
501 | (<trigger-symbol> *macro* . <handler>) | |
502 | @end smallexample | |
503 | ||
3e31e75a | 504 | This is equivalent to @code{*preorder*} described above. However, the |
58c4a39d AW |
505 | result is re-processed again, with the current stylesheet. |
506 | ||
507 | @subsubsection Usage | |
3e31e75a | 508 | @deffn {Scheme Procedure} SRV:send-reply . fragments |
58c4a39d AW |
509 | Output the @var{fragments} to the current output port. |
510 | ||
511 | The fragments are a list of strings, characters, numbers, thunks, | |
3e31e75a | 512 | @code{#f}, @code{#t} -- and other fragments. The function traverses the |
58c4a39d | 513 | tree depth-first, writes out strings and characters, executes thunks, |
3e31e75a | 514 | and ignores @code{#f} and @code{'()}. The function returns @code{#t} if |
58c4a39d AW |
515 | anything was written at all; otherwise the result is @code{#f} If |
516 | @code{#t} occurs among the fragments, it is not written out but causes | |
517 | the result of @code{SRV:send-reply} to be @code{#t}. | |
3e31e75a AW |
518 | @end deffn |
519 | ||
520 | @deffn {Scheme Procedure} foldts fdown fup fhere seed tree | |
521 | @end deffn | |
522 | ||
523 | @deffn {Scheme Procedure} post-order tree bindings | |
524 | @end deffn | |
525 | ||
526 | @deffn {Scheme Procedure} pre-post-order tree bindings | |
527 | @end deffn | |
528 | ||
529 | @deffn {Scheme Procedure} replace-range beg-pred end-pred forest | |
530 | @end deffn | |
531 | ||
532 | @node SXML Tree Fold | |
533 | @subsection SXML Tree Fold | |
534 | @subsubsection Overview | |
535 | @code{(sxml fold)} defines a number of variants of the @dfn{fold} | |
536 | algorithm for use in transforming SXML trees. Additionally it defines | |
537 | the layout operator, @code{fold-layout}, which might be described as a | |
538 | context-passing variant of SSAX's @code{pre-post-order}. | |
539 | ||
540 | @subsubsection Usage | |
541 | @deffn {Scheme Procedure} foldt fup fhere tree | |
542 | The standard multithreaded tree fold. | |
543 | ||
544 | @var{fup} is of type [a] -> a. @var{fhere} is of type object -> a. | |
545 | @end deffn | |
546 | ||
547 | @deffn {Scheme Procedure} foldts fdown fup fhere seed tree | |
548 | The single-threaded tree fold originally defined in SSAX. @xref{SSAX}, | |
549 | for more information. | |
550 | @end deffn | |
551 | ||
552 | @deffn {Scheme Procedure} foldts* fdown fup fhere seed tree | |
553 | A variant of @code{foldts} that allows pre-order tree | |
554 | rewrites. Originally defined in Andy Wingo's 2007 paper, | |
555 | @emph{Applications of fold to XML transformation}. | |
556 | @end deffn | |
557 | ||
558 | @deffn {Scheme Procedure} fold-values proc list . seeds | |
559 | A variant of @code{fold} that allows multi-valued seeds. Note that the | |
560 | order of the arguments differs from that of @code{fold}. @xref{SRFI-1 | |
561 | Fold and Map}. | |
562 | @end deffn | |
563 | ||
564 | @deffn {Scheme Procedure} foldts*-values fdown fup fhere tree . seeds | |
565 | A variant of @code{foldts*} that allows multi-valued | |
566 | seeds. Originally defined in Andy Wingo's 2007 paper, @emph{Applications | |
567 | of fold to XML transformation}. | |
568 | @end deffn | |
569 | ||
570 | @deffn {Scheme Procedure} fold-layout tree bindings params layout stylesheet | |
571 | A traversal combinator in the spirit of @code{pre-post-order}. | |
572 | @xref{Transforming SXML}. | |
573 | ||
574 | @code{fold-layout} was originally presented in Andy Wingo's 2007 paper, | |
575 | @emph{Applications of fold to XML transformation}. | |
576 | ||
577 | @example | |
578 | bindings := (<binding>...) | |
579 | binding := (<tag> <bandler-pair>...) | |
580 | | (*default* . <post-handler>) | |
581 | | (*text* . <text-handler>) | |
582 | tag := <symbol> | |
583 | handler-pair := (pre-layout . <pre-layout-handler>) | |
584 | | (post . <post-handler>) | |
585 | | (bindings . <bindings>) | |
586 | | (pre . <pre-handler>) | |
587 | | (macro . <macro-handler>) | |
588 | @end example | |
589 | ||
590 | @table @var | |
591 | @item pre-layout-handler | |
592 | A function of three arguments: | |
593 | ||
594 | @table @var | |
595 | @item kids | |
596 | the kids of the current node, before traversal | |
597 | ||
598 | @item params | |
599 | the params of the current node | |
600 | ||
601 | @item layout | |
602 | the layout coming into this node | |
603 | ||
604 | @end table | |
605 | ||
606 | @var{pre-layout-handler} is expected to use this information to return a | |
607 | layout to pass to the kids. The default implementation returns the | |
608 | layout given in the arguments. | |
609 | ||
610 | @item post-handler | |
611 | A function of five arguments: | |
612 | ||
613 | @table @var | |
614 | @item tag | |
615 | the current tag being processed | |
616 | ||
617 | @item params | |
618 | the params of the current node | |
619 | ||
620 | @item layout | |
621 | the layout coming into the current node, before any kids were processed | |
622 | ||
623 | @item klayout | |
624 | the layout after processing all of the children | |
625 | ||
626 | @item kids | |
627 | the already-processed child nodes | |
628 | ||
629 | @end table | |
58c4a39d | 630 | |
3e31e75a AW |
631 | @var{post-handler} should return two values, the layout to pass to the |
632 | next node and the final tree. | |
58c4a39d | 633 | |
3e31e75a AW |
634 | @item text-handler |
635 | @var{text-handler} is a function of three arguments: | |
58c4a39d | 636 | |
3e31e75a AW |
637 | @table @var |
638 | @item text | |
639 | the string | |
58c4a39d | 640 | |
3e31e75a AW |
641 | @item params |
642 | the current params | |
58c4a39d | 643 | |
3e31e75a AW |
644 | @item layout |
645 | the current layout | |
646 | ||
647 | @end table | |
648 | ||
649 | @var{text-handler} should return two values, the layout to pass to the | |
650 | next node and the value to which the string should transform. | |
651 | ||
652 | @end table | |
653 | @end deffn | |
58c4a39d | 654 | |
3e31e75a AW |
655 | @node SXPath |
656 | @subsection SXPath | |
58c4a39d AW |
657 | @subsubsection Overview |
658 | @heading SXPath: SXML Query Language | |
659 | SXPath is a query language for SXML, an instance of XML Information set | |
3e31e75a AW |
660 | (Infoset) in the form of s-expressions. See @code{(sxml ssax)} for the |
661 | definition of SXML and more details. SXPath is also a translation into | |
58c4a39d AW |
662 | Scheme of an XML Path Language, @uref{http://www.w3.org/TR/xpath,XPath}. |
663 | XPath and SXPath describe means of selecting a set of Infoset's items or | |
664 | their properties. | |
665 | ||
666 | To facilitate queries, XPath maps the XML Infoset into an explicit tree, | |
667 | and introduces important notions of a location path and a current, | |
3e31e75a AW |
668 | context node. A location path denotes a selection of a set of nodes |
669 | relative to a context node. Any XPath tree has a distinguished, root | |
58c4a39d AW |
670 | node -- which serves as the context node for absolute location paths. |
671 | Location path is recursively defined as a location step joined with a | |
3e31e75a AW |
672 | location path. A location step is a simple query of the database |
673 | relative to a context node. A step may include expressions that further | |
674 | filter the selected set. Each node in the resulting set is used as a | |
675 | context node for the adjoining location path. The result of the step is | |
58c4a39d AW |
676 | a union of the sets returned by the latter location paths. |
677 | ||
678 | The SXML representation of the XML Infoset (see SSAX.scm) is rather | |
3e31e75a | 679 | suitable for querying as it is. Bowing to the XPath specification, we |
58c4a39d AW |
680 | will refer to SXML information items as 'Nodes': |
681 | ||
682 | @example | |
683 | <Node> ::= <Element> | <attributes-coll> | <attrib> | |
684 | | "text string" | <PI> | |
685 | @end example | |
686 | ||
687 | This production can also be described as | |
688 | ||
689 | @example | |
690 | <Node> ::= (name . <Nodeset>) | "text string" | |
691 | @end example | |
692 | ||
693 | An (ordered) set of nodes is just a list of the constituent nodes: | |
694 | ||
695 | @example | |
696 | <Nodeset> ::= (<Node> ...) | |
697 | @end example | |
698 | ||
3e31e75a AW |
699 | Nodesets, and Nodes other than text strings are both lists. A <Nodeset> |
700 | however is either an empty list, or a list whose head is not a symbol. A | |
58c4a39d | 701 | symbol at the head of a node is either an XML name (in which case it's a |
3e31e75a | 702 | tag of an XML element), or an administrative name such as '@@'. This |
58c4a39d | 703 | uniform list representation makes processing rather simple and elegant, |
3e31e75a | 704 | while avoiding confusion. The multi-branch tree structure formed by the |
58c4a39d AW |
705 | mutually-recursive datatypes <Node> and <Nodeset> lends itself well to |
706 | processing by functional languages. | |
707 | ||
708 | A location path is in fact a composite query over an XPath tree or its | |
3e31e75a AW |
709 | branch. A singe step is a combination of a projection, selection or a |
710 | transitive closure. Multiple steps are combined via join and union | |
711 | operations. This insight allows us to @emph{elegantly} implement XPath | |
58c4a39d | 712 | as a sequence of projection and filtering primitives -- converters -- |
3e31e75a | 713 | joined by @dfn{combinators}. Each converter takes a node and returns a |
58c4a39d | 714 | nodeset which is the result of the corresponding query relative to that |
3e31e75a | 715 | node. A converter can also be called on a set of nodes. In that case it |
58c4a39d AW |
716 | returns a union of the corresponding queries over each node in the set. |
717 | The union is easily implemented as a list append operation as all nodes | |
3e31e75a AW |
718 | in a SXML tree are considered distinct, by XPath conventions. We also |
719 | preserve the order of the members in the union. Query combinators are | |
58c4a39d | 720 | high-order functions: they take converter(s) (which is a Node|Nodeset -> |
3e31e75a | 721 | Nodeset function) and compose or otherwise combine them. We will be |
58c4a39d AW |
722 | concerned with only relative location paths [XPath]: an absolute |
723 | location path is a relative path applied to the root node. | |
724 | ||
725 | Similarly to XPath, SXPath defines full and abbreviated notations for | |
3e31e75a AW |
726 | location paths. In both cases, the abbreviated notation can be |
727 | mechanically expanded into the full form by simple rewriting rules. In | |
58c4a39d | 728 | case of SXPath the corresponding rules are given as comments to a sxpath |
3e31e75a | 729 | function, below. The regression test suite at the end of this file shows |
58c4a39d | 730 | a representative sample of SXPaths in both notations, juxtaposed with |
3e31e75a | 731 | the corresponding XPath expressions. Most of the samples are borrowed |
58c4a39d AW |
732 | literally from the XPath specification, while the others are adjusted |
733 | for our running example, tree1. | |
734 | ||
735 | @subsubsection Usage | |
3e31e75a AW |
736 | @deffn {Scheme Procedure} nodeset? x |
737 | @end deffn | |
58c4a39d | 738 | |
3e31e75a AW |
739 | @deffn {Scheme Procedure} node-typeof? crit |
740 | @end deffn | |
58c4a39d | 741 | |
3e31e75a AW |
742 | @deffn {Scheme Procedure} node-eq? other |
743 | @end deffn | |
58c4a39d | 744 | |
3e31e75a AW |
745 | @deffn {Scheme Procedure} node-equal? other |
746 | @end deffn | |
58c4a39d | 747 | |
3e31e75a AW |
748 | @deffn {Scheme Procedure} node-pos n |
749 | @end deffn | |
58c4a39d | 750 | |
3e31e75a | 751 | @deffn {Scheme Procedure} filter pred? |
58c4a39d AW |
752 | @verbatim |
753 | -- Scheme Procedure: filter pred list | |
754 | Return all the elements of 2nd arg LIST that satisfy predicate | |
755 | PRED. The list is not disordered - elements that appear in the | |
756 | result list occur in the same order as they occur in the argument | |
3e31e75a AW |
757 | list. The returned list may share a common tail with the argument |
758 | list. The dynamic order in which the various applications of pred | |
58c4a39d AW |
759 | are made is not specified. |
760 | ||
761 | (filter even? '(0 7 8 8 43 -4)) => (0 8 8 -4) | |
762 | ||
763 | ||
764 | @end verbatim | |
3e31e75a | 765 | @end deffn |
58c4a39d | 766 | |
3e31e75a AW |
767 | @deffn {Scheme Procedure} take-until pred? |
768 | @end deffn | |
58c4a39d | 769 | |
3e31e75a AW |
770 | @deffn {Scheme Procedure} take-after pred? |
771 | @end deffn | |
58c4a39d | 772 | |
3e31e75a AW |
773 | @deffn {Scheme Procedure} map-union proc lst |
774 | @end deffn | |
58c4a39d | 775 | |
3e31e75a AW |
776 | @deffn {Scheme Procedure} node-reverse node-or-nodeset |
777 | @end deffn | |
58c4a39d | 778 | |
3e31e75a AW |
779 | @deffn {Scheme Procedure} node-trace title |
780 | @end deffn | |
58c4a39d | 781 | |
3e31e75a AW |
782 | @deffn {Scheme Procedure} select-kids test-pred? |
783 | @end deffn | |
58c4a39d | 784 | |
3e31e75a | 785 | @deffn {Scheme Procedure} node-self pred? |
58c4a39d AW |
786 | @verbatim |
787 | -- Scheme Procedure: filter pred list | |
788 | Return all the elements of 2nd arg LIST that satisfy predicate | |
789 | PRED. The list is not disordered - elements that appear in the | |
790 | result list occur in the same order as they occur in the argument | |
3e31e75a AW |
791 | list. The returned list may share a common tail with the argument |
792 | list. The dynamic order in which the various applications of pred | |
58c4a39d AW |
793 | are made is not specified. |
794 | ||
795 | (filter even? '(0 7 8 8 43 -4)) => (0 8 8 -4) | |
796 | ||
797 | ||
798 | @end verbatim | |
3e31e75a AW |
799 | @end deffn |
800 | ||
801 | @deffn {Scheme Procedure} node-join . selectors | |
802 | @end deffn | |
803 | ||
804 | @deffn {Scheme Procedure} node-reduce . converters | |
805 | @end deffn | |
806 | ||
807 | @deffn {Scheme Procedure} node-or . converters | |
808 | @end deffn | |
809 | ||
810 | @deffn {Scheme Procedure} node-closure test-pred? | |
811 | @end deffn | |
812 | ||
813 | @deffn {Scheme Procedure} node-parent rootnode | |
814 | @end deffn | |
815 | ||
816 | @deffn {Scheme Procedure} sxpath path | |
817 | @end deffn | |
818 | ||
819 | @node sxml ssax input-parse | |
820 | @subsection (sxml ssax input-parse) | |
821 | @subsubsection Overview | |
822 | A simple lexer. | |
823 | ||
824 | The procedures in this module surprisingly often suffice to parse an | |
825 | input stream. They either skip, or build and return tokens, according to | |
826 | inclusion or delimiting semantics. The list of characters to expect, | |
827 | include, or to break at may vary from one invocation of a function to | |
828 | another. This allows the functions to easily parse even | |
829 | context-sensitive languages. | |
830 | ||
831 | EOF is generally frowned on, and thrown up upon if encountered. | |
832 | Exceptions are mentioned specifically. The list of expected characters | |
833 | (characters to skip until, or break-characters) may include an EOF | |
834 | "character", which is to be coded as the symbol, @code{*eof*}. | |
835 | ||
836 | The input stream to parse is specified as a @dfn{port}, which is usually | |
837 | the last (and optional) argument. It defaults to the current input port | |
838 | if omitted. | |
839 | ||
840 | If the parser encounters an error, it will throw an exception to the key | |
841 | @code{parser-error}. The arguments will be of the form @code{(@var{port} | |
842 | @var{message} @var{specialising-msg}*)}. | |
843 | ||
844 | The first argument is a port, which typically points to the offending | |
845 | character or its neighborhood. You can then use @code{port-column} and | |
846 | @code{port-line} to query the current position. @var{message} is the | |
847 | description of the error. Other arguments supply more details about the | |
848 | problem. | |
849 | ||
850 | @subsubsection Usage | |
851 | @deffn {Scheme Procedure} peek-next-char [port] | |
852 | @end deffn | |
853 | ||
854 | @deffn {Scheme Procedure} assert-curr-char expected-chars comment [port] | |
855 | @end deffn | |
856 | ||
857 | @deffn {Scheme Procedure} skip-until arg [port] | |
858 | @end deffn | |
58c4a39d | 859 | |
3e31e75a AW |
860 | @deffn {Scheme Procedure} skip-while skip-chars [port] |
861 | @end deffn | |
58c4a39d | 862 | |
3e31e75a AW |
863 | @deffn {Scheme Procedure} next-token prefix-skipped-chars break-chars [comment] [port] |
864 | @end deffn | |
58c4a39d | 865 | |
3e31e75a AW |
866 | @deffn {Scheme Procedure} next-token-of incl-list/pred [port] |
867 | @end deffn | |
58c4a39d | 868 | |
3e31e75a AW |
869 | @deffn {Scheme Procedure} read-text-line [port] |
870 | @end deffn | |
58c4a39d | 871 | |
3e31e75a AW |
872 | @deffn {Scheme Procedure} read-string n [port] |
873 | @end deffn | |
58c4a39d | 874 | |
3e31e75a AW |
875 | @deffn {Scheme Procedure} find-string-from-port? _ _ . _ |
876 | Looks for @var{str} in @var{<input-port>}, optionally within the first | |
877 | @var{max-no-char} characters. | |
878 | @end deffn | |
879 | ||
880 | @node sxml apply-templates | |
881 | @subsection (sxml apply-templates) | |
882 | @subsubsection Overview | |
883 | Pre-order traversal of a tree and creation of a new tree: | |
884 | ||
885 | @smallexample | |
886 | apply-templates:: tree x <templates> -> <new-tree> | |
887 | @end smallexample | |
888 | ||
889 | where | |
890 | ||
891 | @smallexample | |
892 | <templates> ::= (<template> ...) | |
893 | <template> ::= (<node-test> <node-test> ... <node-test> . <handler>) | |
894 | <node-test> ::= an argument to node-typeof? above | |
895 | <handler> ::= <tree> -> <new-tree> | |
896 | @end smallexample | |
897 | ||
898 | This procedure does a @emph{normal}, pre-order traversal of an SXML | |
899 | tree. It walks the tree, checking at each node against the list of | |
900 | matching templates. | |
901 | ||
902 | If the match is found (which must be unique, i.e., unambiguous), the | |
903 | corresponding handler is invoked and given the current node as an | |
904 | argument. The result from the handler, which must be a @code{<tree>}, | |
905 | takes place of the current node in the resulting tree. The name of the | |
906 | function is not accidental: it resembles rather closely an | |
907 | @code{apply-templates} function of XSLT. | |
908 | ||
909 | @subsubsection Usage | |
910 | @deffn {Scheme Procedure} apply-templates tree templates | |
911 | @end deffn | |
58c4a39d | 912 |