A modern, high-performance lexical analysis and parsing system with comprehensive PCRE2 support and CognitiveGraph integration
DevelApp.StepLexer is a zero-copy, UTF-8 native lexical analyzer designed for high-performance pattern parsing with advanced PCRE2 support. It implements a forward-only parsing architecture with ambiguity resolution capabilities, making it suitable for both regex pattern parsing and source code tokenization.
The main lexical analyzer that processes input text and generates tokens using a two-phase approach.
public class StepLexer
{
// Two-phase tokenization methods
public bool Phase1_LexicalScan(ZeroCopyStringView input);
public bool Phase2_Disambiguation();
// Configuration
public void AddRule(TokenRule rule);
public void Initialize(ReadOnlyMemory<byte> input, string fileName = "");
}
Key Methods:
Phase1_LexicalScan(): Fast lexical analysis with ambiguity detection
Phase2_Disambiguation(): Resolves ambiguities and constructs the final token stream
AddRule(): Adds a token rule to the lexer
Initialize(): Initializes the lexer with input data
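A minimal end-to-end sketch of the two-phase flow. The TokenRule constructor and the ZeroCopyStringView argument are assumptions for illustration; the real signatures may differ:
using DevelApp.StepLexer;
using System;
using System.Text;

var lexer = new StepLexer();

// Hypothetical TokenRule constructor; consult the actual TokenRule API.
lexer.AddRule(new TokenRule("identifier", "[a-zA-Z][a-zA-Z0-9]*"));

// byte[] converts implicitly to ReadOnlyMemory<byte>.
lexer.Initialize(Encoding.UTF8.GetBytes("foo bar42"), "example.src");

// The ZeroCopyStringView constructor is not shown above; `default` stands in
// for a view over the initialized input in this sketch.
ZeroCopyStringView view = default;

// Phase 1 scans and flags ambiguities; Phase 2 resolves them.
if (lexer.Phase1_LexicalScan(view) && lexer.Phase2_Disambiguation())
{
    Console.WriteLine("Tokenization succeeded.");
}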
Represents tokens that can be split into multiple alternatives during ambiguity resolution.
public class SplittableToken
{
public ZeroCopyStringView Text { get; }
public TokenType Type { get; set; }
public int Position { get; }
public List<SplittableToken>? Alternatives { get; set; }
// Split token into alternatives
public void Split(params (ZeroCopyStringView text, TokenType type)[] alternatives);
}
Properties:
Text: Zero-copy view of token content
Type: Enumerated token classification
Position: Character position in source
Alternatives: List of alternative interpretations for ambiguous tokens
Zero-allocation string view for efficient text processing.
public readonly struct ZeroCopyStringView : IEquatable<ZeroCopyStringView>
{
public int Length { get; }
public bool IsEmpty { get; }
// Efficient slicing without allocation
public ZeroCopyStringView Slice(int start, int length);
// Direct span access
public ReadOnlySpan<byte> AsSpan();
// UTF-8 to string conversion when needed
public override string ToString();
}
Key Features:
Zero-allocation slicing over the underlying buffer
Direct ReadOnlySpan<byte> access for hot paths
UTF-8 to string conversion deferred until ToString() is called
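A short sketch of the view's slicing behavior. The constructor shown here (wrapping a byte array) is an assumption, since the API above only lists members:
using System;
using System.Text;
using DevelApp.StepLexer;

byte[] buffer = Encoding.UTF8.GetBytes("hello world");

// Assumed constructor over a byte buffer; the real signature may differ.
var view = new ZeroCopyStringView(buffer);

// Slicing yields another view over the same buffer; nothing is copied.
ZeroCopyStringView word = view.Slice(6, 5);

// Span access avoids materializing a string.
ReadOnlySpan<byte> span = word.AsSpan();
Console.WriteLine(span.Length);     // 5

// A string is allocated only when explicitly requested.
Console.WriteLine(word.ToString()); // "world"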
Token classification system for regex patterns and source code.
public enum TokenType
{
// Regex pattern tokens
Literal, // Literal character token in regex patterns
EscapeSequence, // Escape sequence token (e.g., \n, \t, \\)
CharacterClass, // Character class token (e.g., [a-z], [^0-9])
GroupStart, // Group start token (opening parenthesis)
GroupEnd, // Group end token (closing parenthesis)
SpecialGroup, // Special group token (e.g., (?:), (?=), (?!))
Quantifier, // Quantifier token (e.g., *, +, ?, {n,m})
LazyQuantifier, // Lazy quantifier token (e.g., *?, +?, ??)
Alternation, // Alternation token (pipe symbol |)
StartAnchor, // Start anchor token (caret ^)
EndAnchor, // End anchor token (dollar sign $)
AnyChar, // Any character token (dot .)
HexEscape, // Hexadecimal escape token (e.g., \x41)
UnicodeEscape, // Unicode escape token (e.g., \u0041, \U00000041)
UnicodeProperty, // Unicode property token (e.g., \p{L}, \P{N})
InlineModifier, // Inline modifier token (e.g., (?i), (?m), (?s))
LiteralText, // Literal text token in \Q...\E construct
RegexComment, // Comment token in (?#...) construct
// Source code tokens
Identifier, // Identifier token in source code
Number, // Numeric literal token
String, // String literal token
Keyword, // Keyword token (language-specific reserved words)
Operator, // Operator token (arithmetic, logical, assignment operators)
Whitespace, // Whitespace token (spaces, tabs, line breaks)
Comment, // Comment token (single-line and multi-line comments)
Punctuation // Punctuation token (semicolons, commas, brackets)
}
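To make the classification concrete, here is a plausible tokenization of a small pattern under this scheme (the exact types assigned to edge cases may differ):
// Pattern: ^(\w+)@example\.com$
//
//   ^        -> StartAnchor
//   (        -> GroupStart
//   \w       -> EscapeSequence
//   +        -> Quantifier
//   )        -> GroupEnd
//   @        -> Literal
//   example  -> Literal (one token per literal character)
//   \.       -> EscapeSequence
//   com      -> Literal
//   $        -> EndAnchor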
High-level parser controller for pattern processing with encoding conversion support.
public class PatternParser
{
public PatternParser(ParserType parserType);
// Zero-copy pattern parsing
public bool ParsePattern(ReadOnlySpan<byte> utf8Pattern, string terminalName);
// Pattern parsing with encoding conversion
public bool ParsePattern(ReadOnlySpan<byte> sourceBytes,
Encoding sourceEncoding,
string terminalName);
public bool ParsePattern(ReadOnlySpan<byte> sourceBytes,
string encodingName,
string terminalName);
public bool ParsePattern(ReadOnlySpan<byte> sourceBytes,
int codePage,
string terminalName);
// Stream-based parsing with encoding
public bool ParsePatternFromStream(Stream stream,
Encoding sourceEncoding,
string terminalName);
public bool ParsePatternFromStreamWithAutoDetect(Stream stream,
string terminalName);
// Access parsed tokens
public List<SplittableToken> GetTokens();
}
Parser Types:
ParserType.Regex: Regular expression pattern parsing
ParserType.Grammar: Grammar-based pattern parsing
Library-based converter using System.Text.Encoding for hundreds of character encodings.
public static class EncodingConverter
{
// Convert from any encoding to UTF-8 using Encoding object
public static byte[] ConvertToUTF8(ReadOnlySpan<byte> sourceBytes,
Encoding sourceEncoding);
// Convert using encoding name (e.g., "shift_jis", "GB2312", "ISO-8859-1")
public static byte[] ConvertToUTF8(ReadOnlySpan<byte> sourceBytes,
string encodingName);
// Convert using code page number (e.g., 932 for shift_jis, 1252 for Windows-1252)
public static byte[] ConvertToUTF8(ReadOnlySpan<byte> sourceBytes,
int codePage);
// Auto-detect encoding from BOM and convert
public static byte[] ConvertToUTF8WithAutoDetect(ReadOnlySpan<byte> sourceBytes);
// Detect encoding from BOM
public static (Encoding encoding, int bomLength) DetectEncodingFromBOM(ReadOnlySpan<byte> bytes);
// Utility methods
public static EncodingInfo[] GetAvailableEncodings();
public static bool IsEncodingAvailable(string encodingName);
public static bool IsEncodingAvailable(int codePage);
}
Key Features:
Conversion to UTF-8 from any Encoding object, encoding name, or code page
BOM-based encoding auto-detection
Enumeration and availability checks for installed encodings
Example Encodings:
// By Encoding object
var bytes = EncodingConverter.ConvertToUTF8(sourceBytes, Encoding.UTF8);
// By name (hundreds supported!)
var shiftJIS = EncodingConverter.ConvertToUTF8(sourceBytes, "shift_jis");
var gb2312 = EncodingConverter.ConvertToUTF8(sourceBytes, "GB2312");
var latin1 = EncodingConverter.ConvertToUTF8(sourceBytes, "ISO-8859-1");
// By code page
var windows1252 = EncodingConverter.ConvertToUTF8(sourceBytes, 1252);
var shiftJIS932 = EncodingConverter.ConvertToUTF8(sourceBytes, 932);
// Auto-detect from BOM
var autoDetected = EncodingConverter.ConvertToUTF8WithAutoDetect(sourceBytes);
The StepLexer handles parsing ambiguities through token splitting:
// Example: Ambiguous quantifier interpretation
var token = new SplittableToken(view, TokenType.Literal, 0);
// Split into alternatives when ambiguity detected
token.Split(
(view.Slice(0, 1), TokenType.Literal),
(view.Slice(0, 2), TokenType.Quantifier)
);
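After a split, downstream phases can inspect the alternatives. A minimal sketch, continuing from the token above:
// Each alternative carries its own text slice and classification.
if (token.Alternatives is { Count: > 0 })
{
    foreach (var alt in token.Alternatives)
    {
        Console.WriteLine($"{alt.Type}: {alt.Text}");
    }
}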
Comprehensive Unicode handling with zero-copy efficiency:
// Unicode code points: \x{1F600}
// Unicode properties: \p{L}, \P{N}
// Unicode newlines: \R (any Unicode newline sequence)
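A brief sketch parsing a pattern that exercises these constructs, using the PatternParser API shown above:
using DevelApp.StepLexer;
using System;
using System.Text;

var parser = new PatternParser(ParserType.Regex);

// Code point escape, Unicode property, and the \R newline construct.
var pattern = Encoding.UTF8.GetBytes(@"\x{1F600}\p{L}+\R");

bool ok = parser.ParsePattern(pattern, "unicode_demo");
Console.WriteLine(ok ? "Unicode pattern parsed." : "Parse failed.");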
using DevelApp.StepLexer;
using System.Text;
// Create pattern parser
var parser = new PatternParser(ParserType.Regex);
// Prepare UTF-8 input
var pattern = @"\d{2,4}-\w+@[a-z]+\.com";
var utf8Data = Encoding.UTF8.GetBytes(pattern);
// Parse the pattern
bool success = parser.ParsePattern(utf8Data.AsSpan(), "email_pattern");
if (success)
{
var results = parser.GetResults();
Console.WriteLine($"Phase 1 tokens: {results.Phase1TokenCount}");
Console.WriteLine($"Ambiguous tokens: {results.AmbiguousTokenCount}");
Console.WriteLine($"Parsing: {(success ? "SUCCESS" : "FAILED")}");
}
using DevelApp.StepLexer;
using System.Text;
// Create pattern parser for regex
var parser = new PatternParser(ParserType.Regex);
// Parse pattern with zero-copy
var pattern = @"[a-zA-Z][a-zA-Z0-9]*";
var utf8Pattern = Encoding.UTF8.GetBytes(pattern);
bool success = parser.ParsePattern(utf8Pattern, "identifier");
if (success)
{
var results = parser.GetResults();
Console.WriteLine($"Pattern parsed successfully!");
Console.WriteLine($"Phase 1 tokens: {results.Phase1TokenCount}");
}
// Unicode-aware pattern
var unicodePattern = @"\p{L}+\x{20}\p{N}+";
var utf8Data = Encoding.UTF8.GetBytes(unicodePattern);
var parser = new PatternParser(ParserType.Regex);
bool success = parser.ParsePattern(utf8Data.AsSpan(), "unicode_pattern");
if (success)
{
Console.WriteLine("Unicode pattern parsed successfully!");
}
using DevelApp.StepLexer;
using System.Text;
// Pattern in UTF-16 format
var pattern = @"\d{2,4}-\w+";
var utf16Bytes = Encoding.Unicode.GetBytes(pattern);
// Convert to UTF-8 and parse using Encoding object
var parser = new PatternParser(ParserType.Regex);
bool success = parser.ParsePattern(utf16Bytes, Encoding.Unicode, "pattern");
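// On .NET Core / .NET 5+, legacy code pages such as shift_jis and 1252 require
// the System.Text.Encoding.CodePages package and a one-time registration:
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);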
// Or use encoding by name
var shiftJISBytes = Encoding.GetEncoding("shift_jis").GetBytes(pattern);
bool sjisSuccess = parser.ParsePattern(shiftJISBytes, "shift_jis", "pattern");
// Or use code page number
var windows1252Bytes = Encoding.GetEncoding(1252).GetBytes(pattern);
bool cpSuccess = parser.ParsePattern(windows1252Bytes, 1252, "pattern");
if (success)
{
var tokens = parser.GetTokens();
// Process tokens...
}
using DevelApp.StepLexer;
using System.IO;
// Read pattern from file with BOM for auto-detection
var parser = new PatternParser(ParserType.Regex);
using var stream = File.OpenRead("pattern.txt");
// Auto-detect encoding from BOM and parse
bool success = parser.ParsePatternFromStreamWithAutoDetect(stream, "pattern");
if (success)
{
Console.WriteLine("Pattern parsed successfully!");
}
using DevelApp.StepLexer;
using System.IO;
// Read bytes from any source
byte[] sourceBytes = File.ReadAllBytes("pattern.dat");
// Detect encoding from BOM
var (encoding, bomLength) = EncodingConverter.DetectEncodingFromBOM(sourceBytes);
Console.WriteLine($"Detected encoding: {encoding.EncodingName}, BOM: {bomLength} bytes");
// Convert to UTF-8
byte[] utf8Bytes = EncodingConverter.ConvertToUTF8(sourceBytes, encoding);
// Parse as UTF-8
var parser = new PatternParser(ParserType.Regex);
parser.ParsePattern(utf8Bytes, "pattern");
using DevelApp.StepLexer;
using System.IO;
using System.Text;
// Check if an encoding is available
if (EncodingConverter.IsEncodingAvailable("GB2312"))
{
// Pattern file encoded in GB2312 (Simplified Chinese)
byte[] gb2312Bytes = File.ReadAllBytes("chinese_pattern.txt");
var parser = new PatternParser(ParserType.Regex);
// Convert and parse using encoding name
bool success = parser.ParsePattern(gb2312Bytes, "GB2312", "chinese_pattern");
if (success)
{
Console.WriteLine("GB2312 pattern processed successfully!");
}
}
using DevelApp.StepLexer;
using System.Linq;
// Get all available encodings
var encodings = EncodingConverter.GetAvailableEncodings();
Console.WriteLine($"Total encodings available: {encodings.Length}");
Console.WriteLine("\nSample encodings:");
foreach (var encoding in encodings.Take(10))
{
Console.WriteLine($" {encoding.Name} (Code Page: {encoding.CodePage}) - {encoding.DisplayName}");
}
// Output examples:
// utf-8 (Code Page: 65001) - Unicode (UTF-8)
// shift_jis (Code Page: 932) - Japanese (Shift-JIS)
// GB2312 (Code Page: 936) - Chinese Simplified (GB2312)
// ISO-8859-1 (Code Page: 28591) - Western European (ISO)
// windows-1252 (Code Page: 1252) - Western European (Windows)
using DevelApp.StepLexer;
using System.IO;
using System.Text;
// Process patterns from different sources
var patterns = new[]
{
(File.ReadAllBytes("pattern_utf8.txt"), Encoding.UTF8),
(File.ReadAllBytes("pattern_utf16.txt"), Encoding.Unicode),
(File.ReadAllBytes("pattern_sjis.txt"), Encoding.GetEncoding("shift_jis")),
(File.ReadAllBytes("pattern_gb2312.txt"), Encoding.GetEncoding("GB2312"))
};
var parser = new PatternParser(ParserType.Regex);
foreach (var (bytes, encoding) in patterns)
{
if (parser.ParsePattern(bytes, encoding, "pattern"))
{
Console.WriteLine($"Successfully parsed {encoding.EncodingName} pattern");
}
}
The StepLexer provides comprehensive error handling:
try
{
// lexer is a configured StepLexer instance; input holds the pattern text.
var tokens = lexer.TokenizeRegexPattern(input);
}
}
catch (ENFA_RegexBuild_Exception ex)
{
Console.WriteLine($"Regex parsing error: {ex.Message}");
Console.WriteLine($"Position: {ex.Location}");
}
catch (ENFA_Exception ex)
{
Console.WriteLine($"General lexer error: {ex.Message}");
}
The StepLexer intentionally excludes certain PCRE2 features that conflict with its forward-only architecture:
Atomic groups ((?>...))
Possessive quantifiers (*+, ++, ?+)
Recursion ((?R), (?&name))
These limitations are architectural decisions that maintain the lexer’s performance and simplicity advantages.
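As an illustration, feeding one of these constructs to the parser is expected to fail; this sketch assumes unsupported syntax surfaces as an unsuccessful parse rather than an exception:
using DevelApp.StepLexer;
using System;
using System.Text;

var parser = new PatternParser(ParserType.Regex);

// Atomic group: valid PCRE2, but intentionally unsupported here.
var atomic = Encoding.UTF8.GetBytes(@"(?>a+)b");

bool ok = parser.ParsePattern(atomic, "atomic_group");
Console.WriteLine(ok ? "unexpected success" : "rejected as unsupported");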
The StepLexer integrates seamlessly with DevelApp.StepParser for complete parsing solutions:
// Lexer tokenizes input
var tokens = stepLexer.TokenizeRegexPattern(pattern);
// Parser builds parse trees and semantic graphs
var parseResult = stepParser.Parse(tokens);
var cognitiveGraph = parseResult.CognitiveGraph;
Comprehensive test coverage includes:
Population of SplittableToken.Alternatives in ambiguous cases