source: Scripal instructions, the pattern to use
code: compiled Scripal instructions
text: text to parse
match: some condition is fulfilled, an operand matches portions of the text
a match doesn't always produce a result, it's a logical condition
result: result string and/or result positions, results are stored in an array, as CSV or JSON and returned at the end of the match process
a result may only occur in case of a match
result region: a result string, part of the text, with a begin and end
result positions: relating to results in the original text, a byte position corresponding to the result region or a virtual position (one side may be nPos)
nPos: nPos denotes an invalid position or no position
tags: relating to the result, a tag describes the result
names: result strings may be given a name for further use
template: a template holds part of the source to be used repeatedly
All operations are of the form
|condition| operator |attributes| |operand| |(...)|
where |element| denotes an optional element, and element1|element2 means that one of several alternatives can be chosen
conditions are keywords controlling the program flow
operators are executed one by one, controlling the type of operation (examples: match, loop etc.)
an operand may be logical, a block operand, value(s), text or a name
attributes can be used to further specify the behavior of operands
controls may be found anywhere and separate operators or operands, denote operand types etc. and are often special characters like ';' '"' '[' and others
example: match 'dog';
will match 'dog' at the current position in text
example: matchEnd 'dog';
will match 'dog' at the current position in text and finalize the result
reset : reset (throw away) result, next match will start new result
end : end result region at end of last match, store result text and/or result position, if region is open at end of text, it is stored as well, no end necessary
endlast : end result region before start of last match, if region is open at end of text, it is stored as well, no end necessary
replace : end result region at end of last match, replace last result/position with new one
name text : result is only used as a name specified after operand, not stored. The text denoted by name may be used for further matches.
example: name #code#; match find ( #code# );
will store the current result as label 'code' (much like a variable) and can be used for further matches
nameAdd text : result is stored and also used as a name specified after operand
tag text : tag result, results have tags with the same index to explain the result in detail
example: matchEnd 'test'; tag 'test tag' ;
will store the current result under tag 'test tag' to denote the result text and position
setMatch : set last operation as matched without true result for condition logic
setNomatch : set last operation as not matched for condition logic
loop : repeat search in block if combined with block end {} or from the beginning of the code if not preceded by curly bracket
if end of text is reached, loop will break
example: match find('test'); loop;
or
match 'pre'; ifMatch {match find('test'); loop;}
example: { match find( all(letter char)); end; } total 2
must match at least 2 times
moveon number : move position pointer in text by number of code points, if number is omitted -> move ahead one code point
does not influence match logic
ifMatch : if last operation was a match, perform the following operation(s), either single operation or entire block in {}
ifNomatch : if last operation wasn't a match, perform the following operation(s), either single operation or entire block in {}
ifMatched : if any match has occurred so far, perform the following operation(s), either single operation or entire block in {}
ifNoMatched : if no match has occurred so far, perform the following operation(s), either single operation or entire block in {}
else : alternative branch either of block or previous operator
a logical operand has sub operands and is of the form |attribute| logic |value|[value range]| ( operand1 operand2 ...)
a block operand is of the form |attribute| block |value|[value range]|
a number is of the form |attribute| number or |attribute| [number range]
text is of the form
|attribute| 'text'
|attribute| "text"
|attribute| `text`
a name is of the form |attribute| #name#
value or number ranges are embedded in [x, x] where @ = infinity, example: [0, @]
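As a rough illustration of how a range such as [0, @] can be read, the following Python sketch parses the bracket notation into a pair of bounds. The helper names parse_range and in_range are hypothetical and not part of Scripal; they only mirror the notation described above.

```python
import math

def parse_range(spec: str) -> tuple[float, float]:
    """Parse '[min, max]', where '@' stands for infinity (hypothetical helper)."""
    inner = spec.strip()
    if not (inner.startswith("[") and inner.endswith("]")):
        raise ValueError(f"not a range: {spec!r}")
    parts = [p.strip() for p in inner[1:-1].split(",")]
    if len(parts) != 2:
        raise ValueError(f"expected two bounds: {spec!r}")
    lo, hi = (math.inf if p == "@" else float(p) for p in parts)
    return lo, hi

def in_range(value: float, spec: str) -> bool:
    """True if value lies within the inclusive bounds of spec."""
    lo, hi = parse_range(spec)
    return lo <= value <= hi
```

With these definitions, in_range(4, "[3,5]") is true and in_range(10**9, "[0, @]") is true as well, since '@' leaves the upper bound open.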
many operands push the position pointer further
'text', "text", `text`: match text
example: match ('test')
match the word 'test'
So any of the three delimiters '', "", `` may be used.
If the text contains double quotes, enclose it in single quotes or backticks.
example: ' it is no "surprise" '
special attributes: ~
#name# : match text held in given name
example: match (#ph#)
match the word held in placeholder called 'ph'
special attributes: ~
attribute number or
attribute [number min,number max]
example: match [3,5]
match all numbers >= 3 and <= 5
The number should stand isolated like a word, as in '100 times' or 'I was born in 1989.'
To find numbers in phrases use the pure attribute. Example '0xff6a' or 'hgf-300-GHT' etc.
Here the embedded numbers may be found by using the pure attribute.
Decimal point in Scripal is always '.', example: 0.355. A decimal point in text to scan is set in the config, may be locale specific.
special attributes: hex, oct, bin, pure
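The difference between an isolated number and a number found with the pure attribute can be pictured with a small Python analogue. This is only a sketch under assumptions: it uses the default word separators from the configuration section, handles integers only, and does not claim to reproduce Scripal's actual word-boundary rules. The function names are hypothetical.

```python
import re

# Assumption: the default word separators listed under config value 'separators'.
SEPARATORS = " .!?,;:/()[]{}"

def isolated_ints(text: str, lo: int, hi: int) -> list[int]:
    """Integers that stand alone as words (rough analogue of: match [lo,hi])."""
    words = re.split("[" + re.escape(SEPARATORS) + "]+", text)
    return [int(w) for w in words
            if re.fullmatch(r"[0-9]+", w) and lo <= int(w) <= hi]

def pure_ints(text: str, lo: int, hi: int) -> list[int]:
    """Digit runs anywhere in the text (rough analogue of: match pure [lo,hi])."""
    return [int(d) for d in re.findall(r"[0-9]+", text) if lo <= int(d) <= hi]
```

For 'I was born in 1989.' the isolated variant finds 1989. For 'hgf-300-GHT' it finds nothing, because '-' is not among the assumed separators, while the pure variant finds 300.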
logical operands don't set results themselves but combine or test other operands
example: all('my' blank 'friend' blank 'Greg')
will match only if all terms in this order are given 'my friend Greg'
example: any('cat' 'dog' 'hamster' 'tyrannosaurus rex')
will match if any one of the terms occur
example: each('my' blank 'cat' blank 'is' blank 'hungry')
will match if 'my', 'my cat', also 'my cat is' etc. is found. 'each' matches as long as possible in the order given
example: every('dog' 'cat' 'hamster')
will match 'dog', also 'cat' and 'dog', also 'dog' and 'cat', the order is not relevant, the 'every' operand matches as long as possible in any order
example: find('dog')
will move the position pointer until 'dog' is found
example: findAt[1,5] ('dog')
will move the position pointer until 'dog' is found, but only if the term is found 1 to 5 characters from the current position
example: find( break('cat') 'dog')
will look for the term 'dog' , but if 'cat' is found, the find operation will be aborted
example: repeat[1,3] (space)
will succeed if at least one and up to three spaces have been found
example: int isNumber()
will succeed if the entire match is an integer number
see attribute description further down
example: isWord('grape' any(eow 's' 'fruit'))
will succeed if the sole words 'grape' (eow), 'grapes' or 'grapefruit' have been found
without eow the word 'grape' would not be found
example: isUpper('GRAPE')
will succeed if the sole word 'GRAPE' is all in upper case
example: isLower('grape')
will succeed if the sole word 'grape' is all in lower case
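The semantics of all, any, each and find described above can be sketched with a few Python combinators. This is a simplified model, not Scripal's implementation: positions are character indices rather than byte positions, no results are stored, and all names (lit, blank, all_of, any_of, each_of, find) are hypothetical.

```python
from typing import Callable, Optional

# A matcher takes (text, position) and returns the new position, or None on no match.
Matcher = Callable[[str, int], Optional[int]]

def lit(s: str) -> Matcher:
    """Match the literal text s at the current position."""
    return lambda text, pos: pos + len(s) if text.startswith(s, pos) else None

def blank(text: str, pos: int) -> Optional[int]:
    """Match all contiguous white space; fail if there is none."""
    end = pos
    while end < len(text) and text[end].isspace():
        end += 1
    return end if end > pos else None

def all_of(*ops: Matcher) -> Matcher:
    """Every operand must match, in the order given."""
    def run(text: str, pos: int) -> Optional[int]:
        for op in ops:
            nxt = op(text, pos)
            if nxt is None:
                return None
            pos = nxt
        return pos
    return run

def any_of(*ops: Matcher) -> Matcher:
    """The first operand that matches wins."""
    def run(text: str, pos: int) -> Optional[int]:
        for op in ops:
            end = op(text, pos)
            if end is not None:
                return end
        return None
    return run

def each_of(*ops: Matcher) -> Matcher:
    """Match operands in order for as long as possible; at least one must match."""
    def run(text: str, pos: int) -> Optional[int]:
        end = None
        for op in ops:
            nxt = op(text, pos)
            if nxt is None:
                break
            pos = end = nxt
        return end
    return run

def find(op: Matcher) -> Matcher:
    """Advance the position until op matches; return the position after the match."""
    def run(text: str, pos: int) -> Optional[int]:
        for start in range(pos, len(text) + 1):
            end = op(text, start)
            if end is not None:
                return end
        return None
    return run
```

In this model, all_of(lit('my'), blank, lit('friend'), blank, lit('Greg')) succeeds on 'my friend Greg' and fails on 'my friend Bob', while each_of with 'my', blank, 'cat', blank, 'is' still succeeds on 'my cat sleeps' because the leading operands match.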
block operands set matches and results
space : match if text is single space
blank : match all contiguous white space in text
char[begin[, end]] : match any code point
if a single value or a range is specified -> match that Unicode value or range
example: match char
match any single character
example: match (char[65])
match character 'A'
example: match (char[65,67])
match character 'A', 'B' or 'C'
letter : match if text is a letter, that is a character which is part of an alphabet
digit : match if text is a digit, recognizes digits in most languages, not just 0,1,2..., see config.translateDigits
word : match if text is any word
bos : match beginning of sentence, first non-space character in a sentence
BOM does not count as sentence begin
sentences start at the beginning of text and after EOS markers
(only returns position [pos,nPos])
eos : match end of sentence sequence, that is EOS markers like '.'
bol : match beginning of line (includes beginning of text, if first character)
might be BOM if present, the BOM marks the beginning of the first line
(only returns position [pos,nPos])
eol : match all consecutive end-of-line control characters like linefeed
bot : match begin of text, first byte in text (might be start of BOM if present)
(only returns position [pos,nPos])
eot : match end of text, position after last byte in text
(only returns position [nPos,pos])
bow : match beginning of word
(only returns position [pos,nPos])
eow : match end of word (that is character after word) (only returns position [nPos,pos])
bomark : match byte order mark
move[count] : move position pointer count characters ahead
expand result region
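Two of the block operands above, char[begin, end] and digit, work on Unicode code points. The Python sketch below shows the idea: a code point range test, and a conversion of digits from other scripts to 0-9 in the spirit of config.translateDigits. Both function names are hypothetical, and the exact behaviour of translateDigits is an assumption.

```python
import unicodedata
from typing import Optional

def match_char(text: str, pos: int,
               begin: Optional[int] = None, end: Optional[int] = None) -> bool:
    """True if a code point exists at pos and lies within [begin, end]."""
    if pos >= len(text):
        return False
    if begin is None:          # plain 'char': any code point matches
        return True
    if end is None:            # 'char[65]': a single value
        end = begin
    return begin <= ord(text[pos]) <= end

def translate_digits(text: str) -> str:
    """Replace decimal digits of any script with ASCII 0-9 (assumed behaviour)."""
    return "".join(
        str(unicodedata.decimal(ch)) if ch.isdecimal() else ch
        for ch in text
    )
```

Here match_char('B', 0, 65, 67) is true, corresponding to char[65,67] matching 'A', 'B' or 'C', and translate_digits turns Arabic-Indic digits into their ASCII counterparts.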
attributes further specify the behaviour of operands
example: match (~'dog')
will match 'dog' case insensitive, 'Dog' will match as well
example: match (!'dog')
will match any text but 'dog'
example: match (int 145)
will match the number 145, not 145.13 or -145
example: match (pure 145)
will match 'a145b' or '.145.' at the end of a sentence
hex : use hexadecimal format
(use for number matching only)
example: match (hex a015f)
oct : use octal format
(use for number matching only)
example: match (oct 0157)
bin : use binary format
(use for number matching only)
example: match (bin 01001)
at : conditional match, set position pointer to begin of match but produce no result
example: match (find (at 'test'))
will find the word 'test' and set match condition to true, but no result
the position pointer is at the first 't' of 'test'
example: match (find (skip 'test'))
will find the word 'test' and set match condition to true, but no result
the position pointer is after the last 't' of 'test'
example: match (find (test 'is')); ifMatch match find ('contained')
find the word 'contained' only if anywhere in the text the word 'is' was found, only match condition is set to true
example: match (try 'dog')
will match 'dog', but will also match if 'dog' is not found
try will include text but only if possible, in any case the condition is fulfilled
example: match repeat[2,@]( last ~'S' letter )
will match all contiguous letters, the last letter must be 's' or 'S'
example: match repeat[2,@]( notlast ~'S' letter )
will match all contiguous letters, the last letter cannot be 's' or 'S'
example: match ( unless 'honey ' 'bee' )
will match the word 'bee' if not preceded by 'honey '
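For the hex, oct and bin attributes listed above, it can help to see which decimal values the example literals stand for. The Python sketch below converts them; parse_number is a hypothetical helper and says nothing about how Scripal parses numbers internally.

```python
def parse_number(literal: str, attribute: str = "") -> int:
    """Interpret a number literal according to a hex/oct/bin attribute."""
    bases = {"": 10, "hex": 16, "oct": 8, "bin": 2}
    return int(literal, bases[attribute])

print(parse_number("a015f", "hex"))  # 655711
print(parse_number("0157", "oct"))   # 111
print(parse_number("01001", "bin"))  # 9
```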
controls have various functions, grouping operator blocks, ending lines etc.
{} : operator block
\n \r \n\r : line end and operator end
; : operator end
example: match ( 'test' );
// : comment up to end of line
example: match ( 'test' ) // remark here
// final remark
Scripal is UTF-8 based; all internals use this encoding. Result positions are byte positions, not character positions, because character positions are of little use for further processing in many programming languages.
Result regions may be open, that is, only one side is a byte position, denoting every character from here on [byte pos, nPos] or up to here [nPos, byte pos]. nPos represents the NULL position.
Result positions are related to the position in the UTF-8 representation of the text.
If the text uses a different character encoding, computing positions costs time, so by default they are set to NaN (not a number). Set config.positionType to POS_OFFSET for byte positions in the native encoding or POS_COUNT for character count (0...).
These last two settings add extra processing time; use them only if positions are needed.
nPos is denoted as -1 in JSON results, NaN as -2. In other result types, the words are written out.
Example: position [2,7] means the result extends over bytes 2 up to 7, so 7 would be the last byte of the last code point of the result region.
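To see why byte positions and character positions differ, the following Python sketch computes an inclusive [begin, end] byte region in UTF-8 and compares the begin position with a native-encoding byte offset and a character count. It is an illustration only: the function names are hypothetical, and the mapping to POS_UTF8, POS_OFFSET and POS_COUNT is an assumption based on the description above.

```python
def utf8_region(text: str, needle: str) -> tuple[int, int]:
    """Inclusive [begin, end] UTF-8 byte positions of the first needle in text."""
    start = text.index(needle)
    begin = len(text[:start].encode("utf-8"))
    end = begin + len(needle.encode("utf-8")) - 1
    return begin, end

def begin_positions(text: str, char_index: int,
                    native_encoding: str) -> tuple[int, int, int]:
    """Begin position as (UTF-8 bytes, native-encoding bytes, character count)."""
    prefix = text[:char_index]
    return (len(prefix.encode("utf-8")),
            len(prefix.encode(native_encoding)),
            char_index)

text = "né dog"                      # 'é' takes two bytes in UTF-8
begin, end = utf8_region(text, "dog")
print(begin, end)                    # 4 6
print(text.encode("utf-8")[begin:end + 1].decode("utf-8"))  # dog
print(begin_positions(text, 3, "utf-16-le"))                # (4, 6, 3)
```

Because the end position is inclusive, slicing the byte string needs end + 1.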
Virtual positions are open ranges:
For example: in case of operand 'bow' (begin of word), only the begin byte position is returned and the end is left open; no region is involved. [8,nPos] would indicate that the word starts at byte position 8.
Results are held in arrays or may be obtained in JSON or CSV format.
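A consumer of JSON results has to treat -1 (nPos) and -2 (NaN) as markers rather than real byte positions. The Python sketch below shows one way to do that. The input layout, a plain array of [begin, end] pairs, is an assumption made for illustration; the actual JSON structure of Scripal results is not described in this section.

```python
import json

NPOS = -1  # nPos in JSON results, as stated above
NAN = -2   # NaN in JSON results, as stated above

def describe(pair: list[int]) -> str:
    """Describe a [begin, end] position pair, honouring the two markers."""
    begin, end = pair
    if NAN in (begin, end):
        return "position not computed"
    if begin == NPOS and end == NPOS:
        return "no position"
    if end == NPOS:
        return f"open region from byte {begin}"
    if begin == NPOS:
        return f"open region up to byte {end}"
    return f"bytes {begin}..{end}"

# Hypothetical sample data, not actual Scripal output.
pairs = json.loads("[[2, 7], [8, -1], [-1, 42], [-2, -2]]")
for pair in pairs:
    print(pair, "->", describe(pair))
```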
<name = template> : denote template
Templates are used for repetitive expressions, like macros. By specifying the name of the template you use a certain code fragment several times. A change in the template causes a change in all instances.
< roadMarker = { any( ~'avenue' ~'ave.' ~'road' ~'street' ~'boulevard' ~'drive' ~'lane' ) } > match find( int[1,10000] blank repeat[1,3]( !( < roadMarker > ) word ) blank < roadMarker > ) ifMatch { matchEnd ( ',' blank int[1,@] repeat[1,3]( blank word ) at any(',' eol eot )) }
defines roadMarker as a template to match any road type
the entire expression will match an address like
1007 Mountain Drive, 63527 Gotham City
define a template in the source by using
< name = { code } >
templates may have arguments
<1> <2> ..
example: < person = { match find ( <1> space <2>) } >
which will be substituted by caller arguments in:
< person {'mike'}{'myers'} >
<1> and <2> in template person will be substituted with 'mike' and 'myers', so that template person matches a first name and surname
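Template arguments behave like macro parameters. As a mental model only, the Python sketch below performs the textual substitution of <1>, <2> and so on in a template body. The function expand is hypothetical; how Scripal actually expands templates during compilation is not specified here.

```python
import re

def expand(template_body: str, *args: str) -> str:
    """Replace <1>, <2>, ... in template_body with the caller arguments."""
    def substitute(match: re.Match) -> str:
        index = int(match.group(1))
        return args[index - 1]       # <1> refers to the first argument
    return re.sub(r"<(\d+)>", substitute, template_body)

person = "match find ( <1> space <2>)"
print(expand(person, "'mike'", "'myers'"))
# match find ( 'mike' space 'myers')
```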
The configuration data can be set by special methods. The configuration is thread bound, all Scripal objects in a thread share the same configuration.
Configuration data may be specified at the beginning of source in %xxxxx%, and this configuration is only used for the given object.
example: % = { "showCode" : true, "posSign" : "+" }
If a config file exists at the default location, it is used implicitly. Create by calling
scripal -c reset
default path is:
Linux : ~/.config/scripal/scripal.cnf
MS Windows : .\scripal.cnf
config values:
debugCompile : boolean, compile source with debug option (default: false)
debugRun : boolean, run code with debug option (default: false)
showCode : boolean, show compiled code (default: false)
measureTime : boolean, if true measure process time in milliseconds (default: false)
useEmpty : boolean, also store empty results in replace/split operations (default: false)
translateDigits : boolean, set true to convert digits in current language to 0,1 .. slight performance penalty (default: false)
verboseResult : boolean, set true for very explicit results (default: false)
decimalPoint : character, decimal point in current language (default: '.')
thousandsSep : character, thousands separator in current language (default: ',')
posSign : character, + sign in current language (default: '+')
negSig : character, - sign in current language (default: '-')
encoding : integer, best fit encoding used in environment (default: ENC_DEFAULT to guess best encoding), see paragraph encodings further down
logEncoding : integer, encoding used in logging (file or console) (default: ENC_DEFAULT to guess best encoding), see paragraph encodings further down
maxFileSize : integer, maximum size of file to load in MB used in file matching, prevent out-of-memory errors (default: 1000)
patternNearest : integer, pattern used for nearest match (default: PATTERN_LEVENPLUS_WORD)
PATTERN_LEVEN_WORD = 1 : Levenshtein distance, find similar word
PATTERN_LEVENPLUS_WORD = 2: Levenshtein distance, optimized for word match
PATTERN_LEVEN = 3 : Levenshtein distance, find similar phrase
PATTERN_JARO = 100 : Jaro distance, find similar phrase
PATTERN_JARO_WINKLER = 101 : Jaro Winkler distance, find similar phrase
PATTERN_JARO_WINKLER_WORD = 102 : Jaro Winkler distance, find similar word
patternBlock : integer, pattern used for block match (default: PATTERN_JARO)
PATTERN_JARO = 100 : Jaro distance, compare block using Jaro distance
PATTERN_JARO_WINKLER = 101 : Jaro Winkler distance, compare block using Jaro-Winkler distance
PATTERN_JARO_WINKLER_WORD = 102 : Jaro Winkler distance, compare block using Jaro-Winkler word based distance
positionType : integer, type of result positions (default: POS_UTF8)
POS_UTF8 = 1: result positions relate to UTF-8 or are NaN if other encoding is used (fast)
POS_OFFSET = 2: result positions relate to the native encoding of the text and are byte offsets
POS_COUNT = 3: result positions relate to character count (0,1..)
logChannel : string, log channel to use, 'default', 'stdout', 'buffer' or a filename (default : 'default') default setting: log to stdout in Scripal binary, don't log in Scripal library
pdfReader : string, path to binary pdftotext (see install document) (default : 'pdftotext')
frmReader : string, path to binary pandoc (see install document)(default : 'pandoc')
appPath : string, path to application data (templates) (default : './' under Windows, '~/.config/scripal' under Linux)
sentenceEnd : array, sentence end signs (default: '.', '!', '?')
separators : array, word separators (default: ' ', '.', '!', '?', ',', ';', ':', '/', '(', ')', '[', ']', '{', '}')
abbreviations : array, abbreviations to distinguish word and end of sentence (default: empty set)
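The patternNearest values above refer to Levenshtein distance, the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. The Python sketch below is the textbook dynamic-programming version with a simple nearest-word lookup. It is not Scripal's implementation, and whatever optimisation PATTERN_LEVENPLUS_WORD applies is not reproduced here; nearest_word is a hypothetical helper.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic Levenshtein edit distance, computed row by row."""
    if len(a) < len(b):
        a, b = b, a                          # keep the shorter string as the row
    previous = list(range(len(b) + 1))       # distance from '' to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def nearest_word(word: str, candidates: list[str]) -> str:
    """Return the candidate with the smallest edit distance to word."""
    return min(candidates, key=lambda c: levenshtein(word, c))

print(levenshtein("kitten", "sitting"))                     # 3
print(nearest_word("strret", ["street", "road", "drive"]))  # street
```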
Scripal uses UTF-8 internally, so matching a UTF-8 string or file is the fastest operation. It does support various other encodings: UTF-16BE, UTF-16LE, UTF-32BE, UTF-32LE, Latin and Windows codepages.
The internal encodings are: