I'm writing an application that will parse a script in a custom language (based slightly on C syntax and Allman style formatting) and am looking for a better (read: faster) way of parsing blocks of the script code into string arrays than the way I'm currently doing it (the current method will do, but it was more for debug than anything else).
The script contents are currently read from a file into a string array and passed to the method.
Here's a script block template:
loop [/* some conditional */ ]
{
/* a whole bunch of commands that are to be read into
* a List<string>, then converted to a string[] and
* passed to the next step for execution */
/* some command that has a bracket delimited set of
* properties or attributes */
{
/* some more commands to be acted on */
}
}
Basically, the curly bracket blocks can be nested (just like in any other C-based language), and I'm looking for the best way to find individual blocks like this.
The curly bracket delimited blocks will ALWAYS be formatted like this - the contents of the brackets will start on the line after the open bracket and will be followed by a bracket on the line after the final attribute/command/comment/whatever.
An example might be:
loop [ someVar <= 10 ]
{
informUser "Get ready to do something"
readValue
{
valueToLookFor = 0x54
timeout = 10 /* in seconds */
}
}
This would tell the app to loop whilst someVar is less than 10 (sorry for the sucking eggs comment). Each time round, we pass a message to the user and look for a specific value from somewhere (with a timeout of 10 seconds).
Here's how I'm doing it at the minute (note: the method that calls this passes the entire string[] containing the current script into it with an index to read from):
private string[] findEntireBlock(string[] scriptContents, int indexToReadFrom,
out int newIndex)
{
newIndex = 0;
int openBraceCount = 0; // for '{' char count
int closeBraceCount = 0; // for '}' char count
int openSquareCount = 0; // for '[' char count
int closeSquareCount = 0; // for ']' char count
List<string> fullblock = new List<string>();
for (int i = indexToReadFrom; i < scriptContents.Length; i++)
{
if (scriptContents[i].Contains('}'))
{
if (scriptContents[i].Contains("[") && fullblock.Count > 0)
{
//throw new exception, as we shouldn't expect to
//to find a line which starts with [ when we've already
}
else
{
if (scriptContents[i].Contains('{')) openBraceCount++;
if (scriptContents[i].Contains('}')) closeBraceCount++;
if (scriptContents[i].Contains('[')) openSquareCount++;
if (scriptContents[i].Contains(']')) closeBraceCount++;
newIndex = i;
fullblock.Add(scriptContents[i]);
break;
}
}
else
{
if (scriptContents[i].Contains("[") && fullblock.Count > 0)
{
//throw new exception, as we shouldn't expect to
//to find a line which starts with [ when we've already
}
else
{
if (scriptContents[i].Contains('{')) openBraceCount++;
if (scriptContents[i].Contains('}')) closeBraceCount++;
if (scriptContents[i].Contains('[')) openSquareCount++;
if (scriptContents[i].Contains(']')) closeBraceCount++;
fullblock.Add(scriptContents[i]);
}
}
}
if (openBraceCount == closeBraceCount &&
openSquareCount == closeSquareCount)
return fullblock.ToArray();
else
//throw new exception, the number of open brackets doesn't match
//the number of close brackets
}
I agree that this might be a slightly obtuse and slow method to follow, that's why I'm asking for any ideas on how to re-implement this for speed and clarity (if a balance can be met, that is).
I'm looking to stay away from RegEx, because I can't use it to maintain a bracket count and I'm not sure on whether you can write a RegEx statement (is that the correct term?) that can act recursively. I was thinking of working from the inside outward, but I'm convinced that would be quite slow.
I'm not looking for someone to re-write it for me, but a general idea on algorithms or techniques/libraries that I could use that would improve my method.
As a side question, how do compilers deal with multiple nested brackets in source code?