summaryrefslogtreecommitdiff
path: root/examples/json-parser.org
blob: 737f5e8d83fcc17e8cdd98aabcb77dac0d9fdaf7 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530

* A JSON parser written with =parzival=

  This file introduces the =parzival= Common Lisp library by building a simple
  JSON parser. The parser that is completed will not be a "full" JSON parser in
  that it won't parse the full range of number forms in the JSON spec, nor will
  it parse escaped quote characters inside of strings. The code is for
  educational purposes only.

  While as [[https://en.wikipedia.org/wiki/Parzival][Parzival]] sought the Holy Grail through much trial and folly,
  =parzival= searches the space of context free languages while avoiding the
  unholy =<fail<=.

** Preliminary Concepts & Conventions

  =parzival= is a Common Lisp library that provides a domain-specific language
  for writing parsers. Specifically, =parzival= defines a parser to be a
  function that accepts a stream of characters as input and returns structured
  data. For example, a parser might accept a string containing HTML and return a
  data structure representing the hierarchical structure of an HTML document. 

  The =parzival= language uses the *parser-combinator* approach to parser
  construction. Here, the word *combinator* is used to mean "higher-order
  function". If a parser is a function that accepts character stream and returns
  some data, then a parser-combinator is a function that accepts and returns
  parsers.  


*** By your parsers combined....

   The basic notion of the parser-combinator approach is to build large parsers
   by combining smaller ones. You create utility combinators that stick parsers
   together to make bigger parsers. A common approach, and one that you will
   expore in what follows, is to create small parsers for different kinds of
   values, each of which can operate on their own, but that become more powerful
   when combined. Think [[https://en.wikipedia.org/wiki/Captain_Planet_and_the_Planeteers][Captain Planet]] or [[https://en.wikipedia.org/wiki/Voltron][Voltron]], but for parsing text into
   structured data.

*** More About Parsers  

    The characterization of parsers above i *almost* accurate. In =parzival=, a
    parser is a function that accepts a stream of characters, and retunrs three
    values: a possible result, a boolean indicating whether the parse succeed or
    failed, and a stream which may or may not be identical to the stream the
    parser was passed as its argument.

    **NOTE:** The third value, the stream, may not be a part of the final design.
  
    For the most part, however, when you are building your own parsers, you
    don't have to think about the other two values. =parzival= provides
    abstractions that let you work focus your attention on the results of a
    parse while managing success states and stream modifications for
    you.

*** Naming Conventions

    In order to quickly and easily distinguish between parsers and
    parser-combinators while you are building your projects, =parzival= adopts a
    simple naming convention.  

    1. The name of a parser should begin and end with =<= symbols. E.g. =<foo<=
    2. The name of a function that accepts or returns a parser should begin with
       =<<=, E.g. =<<foo=

    There are a few more little naming conventions that the =parzival= package
    adopts, such as a =?= indicating that the parser's value is "optional" and a
    =~= indicating that the parser is "rewinding", i.e. it will rewind the
    stream to a certain point before failing. You do not need to adhere to these
    littler rules if they would chafe you.
 
    These conventions may seem grotesque to some Lispers, but in practice they
    have helped keep clear the meaning of =parzival= code. The =<= symbol was
    chosen to be suggesting of an arrow pointing from right to left, or perhaps
    stream of data flowing in and flowing out, from right to left.

** A JSON Parser

   Now you get to see now fun and easy it is to use =parzival= to make "real"
   parsers.

*** JSON for you

    If you are not familar with [[https://en.wikipedia.org/wiki/JSON][JSON]], you should check it out. The [[https://json.org][json.org]] site
    is especially relevant if you want to create a parser for the the format.

    You will be using the following JSON document as reference for the remainder
    of the tutorial:

#+begin_src json

{"name" : "Boutade",
 "languages" : [ {"lang" : "Common Lisp",
                 "proficiency" : null,
                 "lovesIt" : true }

              , {"lang" : "Rust",
                 "proficiency" : 0.8,
                 "lovesIt" : true,
                 "isAshamedToLoveIt" : true}

              , {"lang" : "Haskell",
                 "proficiency" : 0.5,
                 "lovesIt" : "sometimes, in some ways"} 
              ],
  "pizzaOrder" : ["Tempeh Italian Sausage", "Spinach", "Mushrooms", "Red Pepper Flakes"],
  "isCool" : false,
  "isFunny" : false,
  "thinksPeopleAreLaughing" : true,
  "beHonest_thinksPeopleAreLaughing" : false
}

#+end_src

  Time to start writing code to parse the above!
    
*** Lisp boiler plate

    Open your text editor and define the package you will use throughout:

#+begin_src lisp
;; Assuming you have parzival available to you, e.g. via ~/quicklisp/local-projects/

(defpackage "json-parzival"
  (:use :cl :parzival))

(in-package "json-parzival")

#+end_src

*** Parsing JSON Values  

    A JSON value is one of the following things:
    
    1. string, e.g. ="Hey whats happening?"=
    2. number, e.g. =33.234=
    3. boolean, e.g. =true= or =false=
    4. =null=
    5. object, e.g. ={"foo" : "bar"}=
    6. array, e.g. =["hey", "I'm", "an", "array"]=

    In the following you will build a parser called =<json-value<= that will
    accpet some JSON and return a Lisp structure. The =<json-value<= parser will
    be built out of several smaller parsers, one for each of the items in the
    above list.  

    
*** Matching Whitespace

    The first parser you will write matches any number of whitespace characters.
    You will want to ignore much of the whitespace in your input string, so this
    is your most basic utility parser.

#+begin_src lisp

(defvar <whitespace< (<<* (<<plus <space< <newline<)))

#+end_src

   The definition of =<whitespace<= demonstrates a surprising number of concepts
   central to the =parzival= way.  Here's how it works:

   First the =<space<= and =<newline<= parsers, which are built-in to
   =parzival=, each accept a single character of input, matching a space or a
   newline character repsectively, and returning that character as the parse
   result.

   Next, the =<<plus= combinator is used to create a new parser that will
   succeed if either one of its arguments succeeds. In this case, you make a
   parser =(<<plus <space< <newline<)= that will match and return *either* a
   space character *or* a newline character.  Pretty cool.

   Finally, the combinator =<<*= creates a new parser that matches zero or more
   of its argument, returning the results in a list. That is, if =<foo<= is a
   parser, then the new parser =(<<* <foo<)= will consume the input by parsing
   =<foo= over and over again until =<foo= fails to parse any more input,
   returning all those parse results in a list.

   *ASIDE*: The =*= is meant to be reminiscent of the Kleene-star that you might be
   familiar with from common regular expression syntaxes.)

   To better see how all this fits together, check out this REPL session:

#+begin_src lisp
;; matching and returning a single space from a string of spaces
|json-parzival|> (parse-string <space< "    ")    
#\ 
T
#<REPLAY-STREAMS:STATIC-TEXT-REPLAY-STREAM {10027DD303}>

;; failing to match a space, input starts with an #\x character
|json-parzival|> (parse-string <space< "xyz")     
NIL
NIL
#<REPLAY-STREAMS:STATIC-TEXT-REPLAY-STREAM {100287CE53}>

;; matching and returning a newline
|json-parzival|> (parse-string (<<or <space< <newline<) "
")                                                
#\Newline
T
#<REPLAY-STREAMS:STATIC-TEXT-REPLAY-STREAM {100298F343}>

;; failing to match either, empty string
|json-parzival|> (parse-string (<<or <space< <newline<) "") 
NIL
NIL
#<REPLAY-STREAMS:STATIC-TEXT-REPLAY-STREAM {1002C54CC3}>

;; finally, matching a bunch of spaces or newlines and returning them as a list
|json-parzival|> (parse-string (<<* (<<or <space< <newline<)) "     
  ")
(#\  #\  #\  #\  #\  #\Newline #\  #\ )
T
#<REPLAY-STREAMS:STATIC-TEXT-REPLAY-STREAM {1002D16473}>

;; equivalently, just use your newly minted <whitespace< parser
|json-parzival|> (parse-string <whitespace< "     
  ")
(#\  #\  #\  #\  #\  #\Newline #\  #\ )
T
#<REPLAY-STREAMS:STATIC-TEXT-REPLAY-STREAM {1002ED8693}>
#+end_src

   Looking at the above, you already begin to see how new parsers can be built
   up from old. We can go from a parser that accepts a single character, to one
   that accepts either of two characters, to one that accepts a list of zero or
   more of either of those two characters in a single clean line of code. Nice
   stuff!


**** Ignoring Whitespace

     Now that you can match strings of whitespace you will create a parser to
     ignore whitespace. Specifially, you'll write a parser that effectively
     "strips" the whitespace around some other parser. The combinator doing the
     heavy lifting is called =<<brackets=, here is its docstring:

#+begin_quote

(<<BRACKETS LEFT CENTER RIGHT) parses CENTER delimited by LEFT and RIGHT. The
results of LEFT and RIGHT are ignored. (<<BRACKETS LEFT CENTER RIGHT) fails if
any of its three arguments fails.

#+end_quote
     
    In case it isn't clear, =<<brackets= lets you parse strings like ="[stuff]"=
    or ="{{432}}"= that begin and end with some kind of opening and closing
    "brackets" that serve only to delmite the values you're interested in.
    
    Here's how you can use =<<brackets= to create a new combinator that is used
    to ignore the whitespace surrounding some text of interest:

#+begin_src lisp

(defun <<strip (parser)
   "Ignore whitespace around PARSER"
   (<<brackets <whitespace< parser <whitespace<))

#+end_src

   Well thats enough utilities for now. Your next step is to parse actual JSON values!

*** The simple JSON values: null, bool, and, number

    Here you will build small, stand-alone, parsers capable of transforming a
    few kinds of JSON formatted expressions into Common Lisp atoms.  Specifically
    
    - ="null"= will parse to =null=
    - ="true"= will parse to =T= 
    - ="false"= will parse to =NIL=
    - and a number string like ="33.421"= will become =33.421=

   The first new combinator you need to understand is called =<<string=. Its job
   is to create a parser that matches exactly one string and return it. For
   example, evaluating =(<<string "goobers")= returns a parser that matches
   exactly the string ="goobers"= in the input and returns that same string. If
   ="goobers"= is not found in the input, then the parse fails.  

   The next combinator is slightly interesting, and it is called =<<map=. If
   =my-cool-parser= is a parser that results in a value =x=, then evaluating
   =(<<map func my-cool-parser)= returns a new parser that results in a the
   value =(funcall func x)=. That is, =<<map= changes the output of a successful
   parse. If =my-cool-parser= fails, then so does =(<<map func my-cool-parser)=.

   A concrete example should help:

#+begin_src lisp

(defvar <json-null< (<<map (lambda (null) :null) 
                           (<<string "null")))

#+end_src
   
   So, =<json-null<= matches exactly the string ="null"= using =(<<string "null")=. 
   If the match was successful, instead of returning just returning
   ="null"= you feed it to the function =(lambda (null) :null)=, which just
   ignores its argument and returns =:null=.
   
   Similarly you can define =<json-bool<=:

#+begin_src lisp

(defvar <json-bool< (<<map (lambda (bool) (if (equal bool "true") T NIL))
                           (<<plus (<<string "true") (<<string "false"))))

#+end_src

   The =<json-bool<= will only succed if it matches either the string ="true"=
   or the string ="false"=. You then check which string was matched and return
   =T= or =NIL=.

   Finally, you're going to cheat a bit on number parsing and just use the
   built-in number parser that ships standard with =parzival=.

#+begin_src lisp

(defvar <json-num< <real<)

#+end_src

   It should be re-emphasized that =<json-num<= doesn't actually parse the full
   range of number forms that JSON is supposed to support. This parser is purely
   educational.

*** The JSON String Parser

    The only new forms you must understand in order to write =<json-string<= are
    =<<char= and =<<sat=. Both combinators are used to create parsers that
    accept a single character. Quite simply, evaluating =(<<char #\x)= returns a
    parser that accepts exactly the character =#\x=, and =(<<sat a-predicate)=
    returns a parser that accepts a character =c= if =(funcall a-predicate c)=
    is true. In fact, =<<char= is written in terms of =<<sat=. With that in
    mind, you can proceed.

    #+begin_src lisp

(defvar <json-string< (<<brackets (<<char #\")
                                  (<<to-string (<<* (<<sat (lambda (c) (not (eql c #\"))))))
                                  (<<char #\")))

    #+end_src

    So you can see that =<json-string<= matchs a double quote character, then
    matches zero or more characters that are not a double quote character, then
    another double quote character. The result is returned as a lisp string by
    means of the =<<to-string= combinator, which uses =<<map= and =concatenate=
    under the hood.

    
*** Some trickery with mutually recursive parsers

    You're nearly ready to write your =<json-value<= parser. You're only missing
    =<json-object<= and =<json-array<=, but there's a sticky point to consider:
    A JSON value can be anything from that list of six values in [[Parsing JSON Values]] above.

    But here's the rub. A JSON value can be an JSON object, but a JSON object
    can contain JSON values.  And the same can be said of JSON arrays.  

    Recalling that a parser is just a function, you can defer the definition of
    these two parsers by defining them as functions that simply call some other
    function that has yet to be defined. Here's how it works:

#+begin_src lisp

(defun <json-array< (stream)
   (funcall <real-json-array< stream))

(defun <json-object< (stream)
   (funcall <real-json-object< stream))

;; The <<or combinator is like <<plus, but it accepts two or more arguments.
(defvar <json-value<
   (<<or <json-num< <json-bool< <json-null< <json-string< #'<json-array< #'<json-object<))

#+end_src

  So there it is! You've "defined" you JSON parser, but only in a sense. You
  still need to fill in the missing values for =<real-json-object<= and
  =<real-json-array<=.

*** Parsing sequences sperated by tokens

    To parse JSON arrays and JSON objects, you will make use of the handy
  =<<sep-by= combinator. The job of =<<sep-by= is to produce a parser that
  parses a list of values seperated by some token. In your case, you want to
  parse arrays like =[1,2,3,4]=  and key-value pairs 
  like ={"foo" : true, "bar" : false}= which are seperated by commas.

  In fact, because commas surrounded by whitespace are used to separate JSON
  values in both objects as well as arrays, you should define a handy helper to
  parse that separator:

#+begin_src lisp

(defvar <comma-sep< (<<strip (<<char #\,)))

#+end_src

  Now you're ready to define =<real-json-array<= :

#+begin_src lisp

(defvar <real-json-array<
  (<<char-brackets #\[
                   (<<sep-by <json-value< <comma-sep<)
                   #\]))

#+end_src

  You can almost read what its doing there: Parse a list of comma-separated JSON
  values that are brackted by the =[= and =]= characters.  Incredible!

  The parser for JSON objects seems like it should be very similar. It too is a
  comma-separated list of things, but those things are key-value pairs instead
  of simple json values. Pretend that you have a parser called
  =<json-key-value-pair<= already written. Then you can write:

#+begin_src lisp

(defvar <real-json-object<
   (<<char-brackets #\[
                    (<<sep-by <json-key-value-pair< <comma-sep<)
                    #\]))

#+end_src

  But of course, that wont compile because you haven't actually written that
  missing parser. To do so, you will use the final core combinator that
  =parzival= provides, it is called =<<bind=.

  
*** The <<bind Combinator a.k.a. What's a Monad? 

    Before you read an explanation of how bind works, it might help to see an example:

#+begin_src lisp
> (parse-string (<<bind <nat<
                        (lambda (count) (<<times count (<<char #\x))))
                "4xxxx")
(#\x #\x #\x #\x)
T
#<REPLAY-STREAMS:STATIC-TEXT-REPLAY-STREAM {1003BE2F33}>

|json-parzival|> (parse-string (<<bind <nat<
                                      (lambda (count) (<<times count (<<char #\x))))
                               "4xxx")
NIL
NIL
#<REPLAY-STREAMS:STATIC-TEXT-REPLAY-STREAM {1003CF29D3}>

#+end_src
  
  So what's happening here? First, the parser =<nat<= parses a natural number,
  like =3= or =432=, and evaluating the combinator =(<<times n parser1)= returns
  a parser that parses =parser1= exactly =n= times.

  The above fragment parses a number, and then uses that number to create a new
  parser that will parse the character =#\x= exactly that number of times.

  You can see that "4xxxx" will succeed but "4xxx" will fail.

  With that in mind, its time to write =<json-key-value-pair<= and wrap up this
  document.

  #+begin_src lisp

(defvar <json-key-value-pair<
  (<<bind (<<brackets <whitespace<
                      <json-string<
                      (<<and <whitespace<
                             (<<char #\:)
                             <whitespace<))
          (lambda (key)
            (<<map (lambda (value) (cons key value))
                   <json-value<))))


  #+end_src

  A quick explaination: the call to =<<brackets= produces a parser that matches
  the key as a string, ensuring that a =#\:= occurs after the key. Then, using
  that key, you return a parser that reads any JSON value and returns the cons
  of the key and that value.  Ta da!

  
    
*** Demo


#+begin_src lisp

|json-parzival|> (defvar json-txt "{\"name\" : \"Boutade\",
 \"languages\" : [ {\"lang\" : \"Common Lisp\",
                 \"proficiency\" : null,
                 \"lovesIt\" : true }

              , {\"lang\" : \"Rust\",
                 \"proficiency\" : 0.8,
                 \"isAshamedToLoveIt\" : true}

              , {\"lang\" : \"Haskell\",
                 \"proficiency\" : 0.5,
                 \"lovesIt\" : \"sometimes, in some ways\"} 
              ],
  \"religion\" : \"Goofism\",
  \"pizzaOrder\" : [\"Tempeh Italian Sausage\", \"Spinach\", \"Mushrooms\", \"Red Pepper Flakes\"],
  \"isCool\" : false,
  \"isFunny\" : false,
  \"thinksPeopleAreLaughing\" : true,
  \"beHonest_thinksPeopleAreLaughing\" : false
}")
JSON-TXT

|json-parzival|> (parse json-txt <json-value< t)
(("name" . "Boutade")
 ("languages"
  (("lang" . "Common Lisp") ("proficiency" . :NULL) ("loves-it" . T))
  (("lang" . "Rust") ("proficiency" . 0.8) ("is-ashamed-to-love-it" . T))
  (("lang" . "Haskell") ("proficiency" . 0.5)
   ("loves-it" . "sometimes, in some ways")))
 ("religion" . "Goofism")
 ("pizzaOrder" "Tempeh Italian Sausage" "Spinach" "Mushrooms"
  "Red Pepper Flakes")
 ("isCool") ("isFunny") ("thinksPeopleAreLaughing" . T)
 ("beHonest_thinksPeopleAreLaughing"))
T
#<REPLAY-STREAMS:STATIC-TEXT-REPLAY-STREAM {10033FE0A3}>


#+end_src