DevelApp.StepLexer Documentation

A modern, high-performance lexical analysis and parsing system with comprehensive PCRE2 support and CognitiveGraph integration

Overview

DevelApp.StepLexer is a zero-copy, UTF-8 native lexical analyzer designed for high-performance pattern parsing with advanced PCRE2 support. It implements a forward-only parsing architecture with ambiguity resolution capabilities, making it suitable for both regex pattern parsing and source code tokenization.

Key Features

🚀 Zero-Copy Architecture: UTF-8 native processing over ZeroCopyStringView, with no intermediate string allocations during tokenization

🔍 Advanced Pattern Recognition: comprehensive PCRE2 pattern support, including Unicode properties, inline modifiers, and lazy quantifiers

🏗️ Type-Safe Token System: strongly typed TokenType classification with splittable tokens for ambiguity resolution

🌐 Encoding Conversion: input in hundreds of character encodings converted to UTF-8 via System.Text.Encoding, with BOM auto-detection

Core Components

StepLexer Class

The main lexical analyzer that processes input text and generates tokens using a two-phase approach.

public class StepLexer
{
    // Two-phase tokenization methods
    public bool Phase1_LexicalScan(ZeroCopyStringView input);
    public bool Phase2_Disambiguation();
    
    // Configuration
    public void AddRule(TokenRule rule);
    public void Initialize(ReadOnlyMemory<byte> input, string fileName = "");
}

Key Methods:

  • Phase1_LexicalScan: performs the first-pass lexical scan over a ZeroCopyStringView, producing candidate tokens
  • Phase2_Disambiguation: resolves ambiguities among the candidate tokens from phase 1
  • AddRule: registers a TokenRule with the lexer
  • Initialize: binds the lexer to a UTF-8 input buffer, optionally tagging it with a file name
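
The sketch below shows the two-phase flow end to end. The rule and inputView variables are assumptions, since the construction of TokenRule and ZeroCopyStringView is not documented in this section.

using DevelApp.StepLexer;

// Minimal sketch of the two-phase flow. The rule and inputView variables
// are assumed to exist; their construction is not shown in this section.
var lexer = new StepLexer();
lexer.AddRule(rule);                          // rule: a pre-built TokenRule
lexer.Initialize(utf8Bytes, "input.txt");     // utf8Bytes: ReadOnlyMemory<byte>

if (lexer.Phase1_LexicalScan(inputView) &&    // inputView: ZeroCopyStringView over the input
    lexer.Phase2_Disambiguation())
{
    // Tokens are now disambiguated and ready for downstream processing.
}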

SplittableToken Class

Represents tokens that can be split into multiple alternatives during ambiguity resolution.

public class SplittableToken
{
    public ZeroCopyStringView Text { get; }
    public TokenType Type { get; set; }
    public int Position { get; }
    public List<SplittableToken>? Alternatives { get; set; }
    
    // Split token into alternatives
    public void Split(params (ZeroCopyStringView text, TokenType type)[] alternatives);
}

Properties:

  • Text: the token's text as a ZeroCopyStringView (no string allocation)
  • Type: the token's TokenType classification (settable, so disambiguation can re-classify)
  • Position: the token's position in the input
  • Alternatives: alternative interpretations populated when the token is split during ambiguity resolution (null when unambiguous)
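
A short sketch of checking for unresolved alternatives after parsing (as recommended under Best Practices); the parser variable is an assumption here, standing in for a PatternParser that has already processed a pattern.

// Sketch: surface ambiguous tokens after parsing. The parser variable is
// assumed to be a PatternParser that has already parsed a pattern.
foreach (var token in parser.GetTokens())
{
    if (token.Alternatives is { Count: > 0 })
    {
        Console.WriteLine($"Ambiguous token at {token.Position}: {token.Text}");
        foreach (var alt in token.Alternatives)
        {
            Console.WriteLine($"  candidate: {alt.Type} '{alt.Text}'");
        }
    }
}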

ZeroCopyStringView Struct

Zero-allocation string view for efficient text processing.

public readonly struct ZeroCopyStringView : IEquatable<ZeroCopyStringView>
{
    public int Length { get; }
    public bool IsEmpty { get; }
    
    // Efficient slicing without allocation
    public ZeroCopyStringView Slice(int start, int length);
    
    // Direct span access
    public ReadOnlySpan<byte> AsSpan();
    
    // UTF-8 to string conversion when needed
    public override string ToString();
}

Key Features:

  • Zero-allocation slicing via Slice
  • Direct ReadOnlySpan<byte> access via AsSpan
  • UTF-8 to string conversion deferred until ToString is called
  • Value-type equality through IEquatable<ZeroCopyStringView>
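
A brief sketch of working with a view; GetViewOverInput is a hypothetical helper, since the struct's constructor is not shown in this section.

// Sketch: slicing and span access. GetViewOverInput() is a hypothetical
// helper standing in for however the view is obtained in real code.
ZeroCopyStringView view = GetViewOverInput();

ZeroCopyStringView firstThree = view.Slice(0, 3);   // no allocation
ReadOnlySpan<byte> rawBytes = firstThree.AsSpan();  // direct UTF-8 byte access

// A string is materialized only when explicitly requested.
string text = firstThree.ToString();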

TokenType Enumeration

Token classification system for regex patterns and source code.

public enum TokenType
{
    // Regex pattern tokens
    Literal,              // Literal character token in regex patterns
    EscapeSequence,       // Escape sequence token (e.g., \n, \t, \\)
    CharacterClass,       // Character class token (e.g., [a-z], [^0-9])
    GroupStart,           // Group start token (opening parenthesis)
    GroupEnd,             // Group end token (closing parenthesis)
    SpecialGroup,         // Special group token (e.g., (?:), (?=), (?!))
    Quantifier,           // Quantifier token (e.g., *, +, ?, {n,m})
    LazyQuantifier,       // Lazy quantifier token (e.g., *?, +?, ??)
    Alternation,          // Alternation token (pipe symbol |)
    StartAnchor,          // Start anchor token (caret ^)
    EndAnchor,            // End anchor token (dollar sign $)
    AnyChar,              // Any character token (dot .)
    HexEscape,            // Hexadecimal escape token (e.g., \x41)
    UnicodeEscape,        // Unicode escape token (e.g., \u0041, \U00000041)
    UnicodeProperty,      // Unicode property token (e.g., \p{L}, \P{N})
    InlineModifier,       // Inline modifier token (e.g., (?i), (?m), (?s))
    LiteralText,          // Literal text token in \Q...\E construct
    RegexComment,         // Comment token in (?#...) construct
    
    // Source code tokens
    Identifier,           // Identifier token in source code
    Number,               // Numeric literal token
    String,               // String literal token
    Keyword,              // Keyword token (language-specific reserved words)
    Operator,             // Operator token (arithmetic, logical, assignment operators)
    Whitespace,           // Whitespace token (spaces, tabs, line breaks)
    Comment,              // Comment token (single-line and multi-line comments)
    Punctuation           // Punctuation token (semicolons, commas, brackets)
}
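
The enumeration makes it straightforward to filter parsed tokens by category; a small sketch, where the parser variable is assumed to be a PatternParser that has already parsed a pattern.

using System.Linq;

// Sketch: collect all quantifier tokens (greedy and lazy) from a parse.
var quantifiers = parser.GetTokens()
    .Where(t => t.Type is TokenType.Quantifier or TokenType.LazyQuantifier)
    .ToList();

Console.WriteLine($"Quantifiers found: {quantifiers.Count}");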

PatternParser Class

High-level parser controller for pattern processing with encoding conversion support.

public class PatternParser
{
    public PatternParser(ParserType parserType);
    
    // Zero-copy pattern parsing
    public bool ParsePattern(ReadOnlySpan<byte> utf8Pattern, string terminalName);
    
    // Pattern parsing with encoding conversion
    public bool ParsePattern(ReadOnlySpan<byte> sourceBytes, 
                           Encoding sourceEncoding, 
                           string terminalName);
    
    public bool ParsePattern(ReadOnlySpan<byte> sourceBytes, 
                           string encodingName, 
                           string terminalName);
    
    public bool ParsePattern(ReadOnlySpan<byte> sourceBytes, 
                           int codePage, 
                           string terminalName);
    
    // Stream-based parsing with encoding
    public bool ParsePatternFromStream(Stream stream, 
                                      Encoding sourceEncoding, 
                                      string terminalName);
    
    public bool ParsePatternFromStreamWithAutoDetect(Stream stream, 
                                                     string terminalName);
    
    // Access parsed tokens
    public List<SplittableToken> GetTokens();
}

Parser Types:

  • ParserType.Regex: parses PCRE2-style regex patterns (the only parser type used in the examples in this document)

EncodingConverter Class

A static converter built on System.Text.Encoding that supports hundreds of character encodings.

public static class EncodingConverter
{
    // Convert from any encoding to UTF-8 using Encoding object
    public static byte[] ConvertToUTF8(ReadOnlySpan<byte> sourceBytes, 
                                       Encoding sourceEncoding);
    
    // Convert using encoding name (e.g., "shift_jis", "GB2312", "ISO-8859-1")
    public static byte[] ConvertToUTF8(ReadOnlySpan<byte> sourceBytes, 
                                       string encodingName);
    
    // Convert using code page number (e.g., 932 for shift_jis, 1252 for Windows-1252)
    public static byte[] ConvertToUTF8(ReadOnlySpan<byte> sourceBytes, 
                                       int codePage);
    
    // Auto-detect encoding from BOM and convert
    public static byte[] ConvertToUTF8WithAutoDetect(ReadOnlySpan<byte> sourceBytes);
    
    // Detect encoding from BOM
    public static (Encoding encoding, int bomLength) DetectEncodingFromBOM(ReadOnlySpan<byte> bytes);
    
    // Utility methods
    public static EncodingInfo[] GetAvailableEncodings();
    public static bool IsEncodingAvailable(string encodingName);
    public static bool IsEncodingAvailable(int codePage);
}

Key Features:

  • Conversion to UTF-8 from an Encoding object, an encoding name, or a code page number
  • BOM-based encoding auto-detection
  • Discovery helpers for enumerating and checking available encodings

Example Encodings:

// By Encoding object
var bytes = EncodingConverter.ConvertToUTF8(sourceBytes, Encoding.UTF8);

// By name (hundreds supported!)
var shiftJIS = EncodingConverter.ConvertToUTF8(sourceBytes, "shift_jis");
var gb2312 = EncodingConverter.ConvertToUTF8(sourceBytes, "GB2312");
var latin1 = EncodingConverter.ConvertToUTF8(sourceBytes, "ISO-8859-1");

// By code page
var windows1252 = EncodingConverter.ConvertToUTF8(sourceBytes, 1252);
var shiftJIS932 = EncodingConverter.ConvertToUTF8(sourceBytes, 932);

// Auto-detect from BOM
var autoDetected = EncodingConverter.ConvertToUTF8WithAutoDetect(sourceBytes);
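
Note that on modern .NET (Core and .NET 5+), code-page encodings such as shift_jis, GB2312, and the Windows code pages are only available after registering the provider from the System.Text.Encoding.CodePages package. This is a platform requirement, not part of the StepLexer API:

using System.Text;

// Register once at startup, before requesting any code-page encoding.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);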

Advanced Features

Ambiguity Resolution

The StepLexer handles parsing ambiguities through token splitting:

// Example: Ambiguous quantifier interpretation
var token = new SplittableToken(view, TokenType.Literal, 0);

// Split into alternatives when ambiguity detected
token.Split(
    (view.Slice(0, 1), TokenType.Literal),
    (view.Slice(0, 2), TokenType.Quantifier)  // e.g. "{n" read as the opening of a {n,m} quantifier
);

Unicode Support

Comprehensive Unicode handling with zero-copy efficiency:

// Unicode code points: \x{1F600}
// Unicode properties: \p{L}, \P{N}
// Unicode newlines: \R (any Unicode newline sequence)

Performance Optimization

The lexer's performance characteristics follow directly from its design: ZeroCopyStringView avoids string allocations during scanning, the forward-only architecture makes a single pass over the input, and UTF-8 native processing removes transcoding overhead for UTF-8 sources. Reusing a single parser instance across patterns (see Best Practices below) avoids repeated setup cost, as sketched next.
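
// Sketch: one parser instance reused across many UTF-8 patterns.
// patternBuffers is an assumed collection of UTF-8 byte arrays.
var parser = new PatternParser(ParserType.Regex);

foreach (byte[] utf8Pattern in patternBuffers)
{
    parser.ParsePattern(utf8Pattern, "rule");
}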

Usage Examples

Basic Tokenization

using DevelApp.StepLexer;
using System.Text;

// Create pattern parser
var parser = new PatternParser(ParserType.Regex);

// Prepare UTF-8 input
var pattern = @"\d{2,4}-\w+@[a-z]+\.com";
var utf8Data = Encoding.UTF8.GetBytes(pattern);

// Parse the pattern
bool success = parser.ParsePattern(utf8Data.AsSpan(), "email_pattern");
Console.WriteLine($"Parsing: {(success ? "SUCCESS" : "FAILED")}");

if (success)
{
    var results = parser.GetResults();
    Console.WriteLine($"Phase 1 tokens: {results.Phase1TokenCount}");
    Console.WriteLine($"Ambiguous tokens: {results.AmbiguousTokenCount}");
}

Pattern Parser Usage

using DevelApp.StepLexer;
using System.Text;

// Create pattern parser for regex
var parser = new PatternParser(ParserType.Regex);

// Parse pattern with zero-copy
var pattern = @"[a-zA-Z][a-zA-Z0-9]*";
var utf8Pattern = Encoding.UTF8.GetBytes(pattern);

bool success = parser.ParsePattern(utf8Pattern, "identifier");

if (success)
{
    var results = parser.GetResults();
    Console.WriteLine($"Pattern parsed successfully!");
    Console.WriteLine($"Phase 1 tokens: {results.Phase1TokenCount}");
}

Unicode Pattern Processing

// Unicode-aware pattern
var unicodePattern = @"\p{L}+\x{20}\p{N}+";
var utf8Data = Encoding.UTF8.GetBytes(unicodePattern);

var parser = new PatternParser(ParserType.Regex);
bool success = parser.ParsePattern(utf8Data.AsSpan(), "unicode_pattern");

if (success)
{
    Console.WriteLine("Unicode pattern parsed successfully!");
}

Encoding Conversion

Converting from Various Encodings to UTF-8

using DevelApp.StepLexer;
using System.Text;

// Pattern in UTF-16 format
var pattern = @"\d{2,4}-\w+";
var utf16Bytes = Encoding.Unicode.GetBytes(pattern);

// Convert to UTF-8 and parse using Encoding object
var parser = new PatternParser(ParserType.Regex);
bool success = parser.ParsePattern(utf16Bytes, Encoding.Unicode, "pattern");

// Or use encoding by name (on modern .NET, code-page encodings such as
// shift_jis require the CodePagesEncodingProvider registration shown earlier)
var shiftJISBytes = Encoding.GetEncoding("shift_jis").GetBytes(pattern);
bool sjisSuccess = parser.ParsePattern(shiftJISBytes, "shift_jis", "pattern");

// Or use code page number
var windows1252Bytes = Encoding.GetEncoding(1252).GetBytes(pattern);
bool cpSuccess = parser.ParsePattern(windows1252Bytes, 1252, "pattern");

if (success)
{
    var tokens = parser.GetTokens();
    // Process tokens...
}

Auto-Detecting Encoding from Stream

using DevelApp.StepLexer;
using System.IO;

// Read pattern from file with BOM for auto-detection
var parser = new PatternParser(ParserType.Regex);
using var stream = File.OpenRead("pattern.txt");

// Auto-detect encoding from BOM and parse
bool success = parser.ParsePatternFromStreamWithAutoDetect(stream, "pattern");

if (success)
{
    Console.WriteLine("Pattern parsed successfully!");
}
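
When the encoding is known in advance, the explicit-encoding stream overload declared on PatternParser can be used instead of BOM auto-detection; a short sketch (the file name is illustrative):

using DevelApp.StepLexer;
using System.IO;
using System.Text;

// Sketch: stream-based parsing with a known encoding (UTF-16 little-endian here).
var parser = new PatternParser(ParserType.Regex);
using var stream = File.OpenRead("pattern_utf16.txt");

bool success = parser.ParsePatternFromStream(stream, Encoding.Unicode, "pattern");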

Manual Encoding Detection and Conversion

using DevelApp.StepLexer;
using System.IO;

// Read bytes from any source
byte[] sourceBytes = File.ReadAllBytes("pattern.dat");

// Detect encoding from BOM
var (encoding, bomLength) = EncodingConverter.DetectEncodingFromBOM(sourceBytes);
Console.WriteLine($"Detected encoding: {encoding.EncodingName}, BOM: {bomLength} bytes");

// Convert to UTF-8
byte[] utf8Bytes = EncodingConverter.ConvertToUTF8(sourceBytes, encoding);

// Parse as UTF-8
var parser = new PatternParser(ParserType.Regex);
parser.ParsePattern(utf8Bytes, "pattern");

Working with Any Encoding by Name

using DevelApp.StepLexer;
using System.IO;
using System.Text;

// Check if an encoding is available
if (EncodingConverter.IsEncodingAvailable("GB2312"))
{
    // Pattern file encoded in GB2312 (Simplified Chinese)
    byte[] gb2312Bytes = File.ReadAllBytes("chinese_pattern.txt");
    var parser = new PatternParser(ParserType.Regex);
    
    // Convert and parse using encoding name
    bool success = parser.ParsePattern(gb2312Bytes, "GB2312", "chinese_pattern");
    
    if (success)
    {
        Console.WriteLine("GB2312 pattern processed successfully!");
    }
}

Discovering Available Encodings

using DevelApp.StepLexer;
using System.Linq;

// Get all available encodings
var encodings = EncodingConverter.GetAvailableEncodings();

Console.WriteLine($"Total encodings available: {encodings.Length}");
Console.WriteLine("\nSample encodings:");

foreach (var encoding in encodings.Take(10))
{
    Console.WriteLine($"  {encoding.Name} (Code Page: {encoding.CodePage}) - {encoding.DisplayName}");
}

// Output examples:
//   utf-8 (Code Page: 65001) - Unicode (UTF-8)
//   shift_jis (Code Page: 932) - Japanese (Shift-JIS)
//   GB2312 (Code Page: 936) - Chinese Simplified (GB2312)
//   ISO-8859-1 (Code Page: 28591) - Western European (ISO)
//   windows-1252 (Code Page: 1252) - Western European (Windows)

Batch Conversion of Multiple Encodings

using DevelApp.StepLexer;
using System.IO;
using System.Text;

// Process patterns from different sources
var patterns = new[]
{
    (File.ReadAllBytes("pattern_utf8.txt"), Encoding.UTF8),
    (File.ReadAllBytes("pattern_utf16.txt"), Encoding.Unicode),
    (File.ReadAllBytes("pattern_sjis.txt"), Encoding.GetEncoding("shift_jis")),
    (File.ReadAllBytes("pattern_gb2312.txt"), Encoding.GetEncoding("GB2312"))
};

var parser = new PatternParser(ParserType.Regex);

foreach (var (bytes, encoding) in patterns)
{
    if (parser.ParsePattern(bytes, encoding, "pattern"))
    {
        Console.WriteLine($"Successfully parsed {encoding.EncodingName} pattern");
    }
}

Error Handling

The StepLexer provides comprehensive error handling:

try
{
    // lexer: a configured StepLexer instance; input: the pattern to tokenize
    var tokens = lexer.TokenizeRegexPattern(input);
}
catch (ENFA_RegexBuild_Exception ex)
{
    Console.WriteLine($"Regex parsing error: {ex.Message}");
    Console.WriteLine($"Position: {ex.Location}");
}
catch (ENFA_Exception ex)
{
    Console.WriteLine($"General lexer error: {ex.Message}");
}

Design Principles

1. Zero-Copy Performance

2. Forward-Only Architecture

3. Type Safety

4. Unicode First

Limitations

By Design Exclusions

The StepLexer intentionally excludes certain PCRE2 features that conflict with its forward-only architecture:

  1. Atomic Grouping ((?>...))
    • Requires backtracking prevention mechanisms
    • Conflicts with forward-only parsing paradigm
  2. Possessive Quantifiers (*+, ++, ?+)
    • Similar to atomic grouping requirements
    • Would compromise zero-copy performance
  3. Recursive Patterns ((?R), (?&name))
    • Adds complexity to lexer architecture
    • Better handled by grammar-based StepParser

These limitations are architectural decisions that maintain the lexer’s performance and simplicity advantages.

Integration with StepParser

The StepLexer integrates seamlessly with DevelApp.StepParser for complete parsing solutions:

// Lexer tokenizes input
var tokens = stepLexer.TokenizeRegexPattern(pattern);

// Parser builds parse trees and semantic graphs
var parseResult = stepParser.Parse(tokens);
var cognitiveGraph = parseResult.CognitiveGraph;

Testing

The library ships with comprehensive test coverage for the components and features described above.

Best Practices

  1. Use UTF-8 Input: Avoid string-to-byte conversions when possible
  2. Reuse Lexer Instances: Initialize once, tokenize multiple patterns
  3. Handle Alternatives: Check for SplittableToken.Alternatives in ambiguous cases
  4. Error Recovery: Implement robust error handling for invalid patterns
  5. Performance Monitoring: Profile memory usage for large-scale processing
