ENFAStepLexer-StepParser

A modern, high-performance lexical analysis and parsing system with comprehensive PCRE2 support and CognitiveGraph integration

The system consists of DevelApp.StepLexer for zero-copy tokenization and DevelApp.StepParser for semantic analysis and grammar-based parsing.

Overview

ENFAStepLexer-StepParser is a complete parsing solution designed for high-performance pattern recognition and semantic analysis. The system uses a two-phase approach: StepLexer handles zero-copy tokenization with PCRE2 support, while StepParser provides grammar-based parsing with CognitiveGraph integration for semantic analysis and code understanding.

Key Features

🚀 DevelApp.StepLexer - Zero-Copy Tokenization

🧠 DevelApp.StepParser - Semantic Analysis

🔧 Advanced Pattern Support

🏗️ Modern Architecture

📚 Comprehensive Documentation

Quick Start

Building the Project

# Clone the repository
git clone https://github.com/DevelApp-ai/ENFAStepLexer-StepPerser.git
cd ENFAStepLexer-StepPerser

# Restore dependencies
dotnet restore

# Build all projects
dotnet build

# Run tests
dotnet test

# Run the demo
cd src/ENFAStepLexer.Demo
dotnet run

Basic StepLexer Usage

using System;
using System.Text;
using DevelApp.StepLexer;

// Create a pattern parser for regex
var parser = new PatternParser(ParserType.Regex);

// Parse a regex pattern with zero-copy
string pattern = @"\d{2,4}-\w+@[a-z]+\.com";
var utf8Pattern = Encoding.UTF8.GetBytes(pattern);

bool success = parser.ParsePattern(utf8Pattern, "email_pattern");

if (success)
{
    Console.WriteLine("Pattern compiled successfully!");
    var results = parser.GetResults();
    Console.WriteLine($"Phase 1 tokens: {results.Phase1TokenCount}");
    Console.WriteLine($"Ambiguous tokens: {results.AmbiguousTokenCount}");
}

StepLexer with Encoding Conversion

using System;
using System.IO;
using System.Text;
using DevelApp.StepLexer;

// Create a pattern parser
var parser = new PatternParser(ParserType.Regex);

// Parse pattern from UTF-16 encoded bytes
var pattern = @"\w+@\w+\.\w+";
var utf16Bytes = Encoding.Unicode.GetBytes(pattern);

// Automatically converts UTF-16 to UTF-8 for processing
bool success = parser.ParsePattern(utf16Bytes, Encoding.Unicode, "email_pattern");

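// Or use an encoding by name. On .NET Core / .NET 5+, legacy code pages such as
// shift_jis are only resolvable after registering CodePagesEncodingProvider.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
var shiftJISBytes = Encoding.GetEncoding("shift_jis").GetBytes(pattern);
bool sjisSuccess = parser.ParsePattern(shiftJISBytes, "shift_jis", "file_pattern");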

// Or auto-detect encoding from BOM in a stream
using var stream = File.OpenRead("pattern.txt");
bool streamSuccess = parser.ParsePatternFromStreamWithAutoDetect(
    stream, 
    "file_pattern"
);

if (success || sjisSuccess || streamSuccess)
{
    Console.WriteLine("Pattern parsed with encoding conversion!");
}
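The encoding conversion used above can be pictured with plain .NET calls. The following is a minimal sketch, not the library's implementation: `ToUtf8` and `DetectBom` are hypothetical helper names, and only `System.Text` APIs are used. It shows transcoding a buffer to UTF-8 and choosing an encoding from a leading byte-order mark.

```csharp
using System;
using System.Text;

// Hypothetical helper: transcode bytes from a known source encoding to UTF-8.
byte[] ToUtf8(byte[] input, Encoding source) =>
    source.CodePage == Encoding.UTF8.CodePage
        ? input                                        // already UTF-8, reuse the buffer
        : Encoding.Convert(source, Encoding.UTF8, input);

// Hypothetical helper: pick an encoding from a leading byte-order mark.
// Returns the encoding and the BOM length; falls back to UTF-8 with no BOM.
(Encoding Detected, int BomLength) DetectBom(ReadOnlySpan<byte> data)
{
    // UTF-32 LE is checked before UTF-16 LE because its BOM also starts with FF FE.
    if (data.Length >= 4 && data[0] == 0xFF && data[1] == 0xFE && data[2] == 0 && data[3] == 0)
        return (Encoding.UTF32, 4);
    if (data.Length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return (Encoding.UTF8, 3);
    if (data.Length >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return (Encoding.Unicode, 2);
    if (data.Length >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return (Encoding.BigEndianUnicode, 2);
    return (Encoding.UTF8, 0);
}

// Transcode a UTF-16 pattern to UTF-8 and read it back.
byte[] utf16 = Encoding.Unicode.GetBytes(@"\w+@\w+\.\w+");
Console.WriteLine(Encoding.UTF8.GetString(ToUtf8(utf16, Encoding.Unicode)));

// Detect a UTF-16 LE BOM and decode the bytes that follow it.
byte[] withBom = { 0xFF, 0xFE, 0x61, 0x00 };
var (encoding, bomLength) = DetectBom(withBom);
Console.WriteLine(encoding.GetString(withBom, bomLength, withBom.Length - bomLength));
```

BOM sniffing only covers Unicode encodings; a legacy encoding such as shift_jis carries no marker, which is presumably why the API above also accepts an explicit encoding or encoding name.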

Basic StepParser Usage

using System;
using DevelApp.StepParser;

// Create parser engine
var engine = new StepParserEngine();

// Load grammar for a simple expression language
var grammar = @"
Grammar: SimpleExpr
TokenSplitter: Space

<NUMBER> ::= /[0-9]+/
<IDENTIFIER> ::= /[a-zA-Z][a-zA-Z0-9]*/
<PLUS> ::= '+'
<MINUS> ::= '-'
<WS> ::= /[ \t\r\n]+/ => { skip }

<expr> ::= <expr> <PLUS> <expr>
        | <expr> <MINUS> <expr>
        | <NUMBER>
        | <IDENTIFIER>
";

engine.LoadGrammarFromContent(grammar);

// Parse source code
var result = engine.Parse("x + 42 - y");

if (result.Success)
{
    Console.WriteLine("Parse successful!");
    var cognitiveGraph = result.CognitiveGraph;
    // Access semantic analysis results
}

StepParser with CognitiveGraph V2 Schema

CognitiveGraph 1.1.0 introduces a V2 schema optimized for massive cognitive graphs. StepParser supports both V1 (default) and V2 schemas:

using System;
using System.Collections.Generic;
using DevelApp.StepParser;
using CognitiveGraph.Schema;

// Create parser engine with V2 schema for massive graphs
var engine = new StepParserEngine(SchemaVersion.V2);

var grammar = @"
Grammar: LargeCodebase
<NUMBER> ::= /[0-9]+/
<IDENTIFIER> ::= /[a-zA-Z][a-zA-Z0-9]*/
<expression> ::= <NUMBER> | <IDENTIFIER>
";

engine.LoadGrammarFromContent(grammar);

// Parse multiple files and build a massive cognitive graph
var files = new Dictionary<string, string>
{
    { "module1.txt", "identifier1" },
    { "module2.txt", "identifier2" },
    // ... thousands more files
};

var result = engine.ParseMultipleFiles(files);

if (result.Success)
{
    Console.WriteLine($"Schema Version: {result.CognitiveGraph.SchemaVersion}"); // V2
    // Work with massive cognitive graph optimized for large codebases
}

// Default constructor maintains backward compatibility with V1
var engineV1 = new StepParserEngine(); // Uses SchemaVersion.V1

Architecture

Core Components

  1. DevelApp.StepLexer: Zero-copy lexical analyzer
    • PatternParser: High-level pattern processing controller
    • StepLexer: Core tokenization engine with PCRE2 support
    • ZeroCopyStringView: Memory-efficient string operations
    • SplittableToken: Ambiguity-aware token representation
  2. DevelApp.StepParser: Semantic analysis and grammar parsing
    • StepParserEngine: Main parsing controller with CognitiveGraph integration
    • GrammarDefinition: Complete grammar specification loader
    • TokenRule/ProductionRule: Grammar component definitions
    • IContextStack: Hierarchical context management
    • IScopeAwareSymbolTable: Symbol resolution and scoping

Processing Pipeline

The system uses a two-phase processing approach:

  1. Lexical Analysis Phase (StepLexer):
    • UTF-8 input processing with zero-copy efficiency
    • PCRE2-compatible pattern recognition
    • Ambiguity detection and token splitting
    • Forward-only parsing for predictable performance
  2. Semantic Analysis Phase (StepParser):
    • Grammar-based syntax tree construction
    • CognitiveGraph integration for semantic analysis
    • Context-sensitive parsing with scope management
    • Symbol table construction and resolution
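The two phases above can be sketched in a few lines of standalone C#. This is a deliberately simplified illustration, not the StepLexer or StepParser implementation: `Tokenize` and `Parse` are hypothetical names, the token kinds mirror the SimpleExpr grammar from the Quick Start, and ambiguity handling, CognitiveGraph output and symbol tables are left out. Phase 1 scans the UTF-8 bytes forward once and records offsets; phase 2 consumes those tokens and builds a left-associative tree, rendered here as a parenthesized string.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

bool IsDigit(byte c) => c >= (byte)'0' && c <= (byte)'9';
bool IsLetter(byte c) => (c >= (byte)'a' && c <= (byte)'z') || (c >= (byte)'A' && c <= (byte)'Z');

// Phase 1 (hypothetical): forward-only scan producing (kind, start, length)
// triples that index into the input without copying it.
List<(string Kind, int Start, int Length)> Tokenize(byte[] input)
{
    var tokens = new List<(string Kind, int Start, int Length)>();
    int i = 0;
    while (i < input.Length)
    {
        byte b = input[i];
        int start = i;
        if (b == ' ' || b == '\t' || b == '\r' || b == '\n') { i++; continue; }
        if (b == '+') { tokens.Add(("PLUS", i++, 1)); continue; }
        if (b == '-') { tokens.Add(("MINUS", i++, 1)); continue; }
        if (IsDigit(b))
        {
            while (i < input.Length && IsDigit(input[i])) i++;
            tokens.Add(("NUMBER", start, i - start));
            continue;
        }
        if (IsLetter(b))
        {
            while (i < input.Length && (IsLetter(input[i]) || IsDigit(input[i]))) i++;
            tokens.Add(("IDENTIFIER", start, i - start));
            continue;
        }
        throw new FormatException($"Unexpected byte 0x{b:X2} at offset {i}");
    }
    return tokens;
}

// Phase 2 (hypothetical): consume the token stream and build a
// left-associative expression tree as a parenthesized string.
string Parse(byte[] input, List<(string Kind, int Start, int Length)> tokens)
{
    int pos = 0;
    string Operand()
    {
        if (pos >= tokens.Count || (tokens[pos].Kind != "NUMBER" && tokens[pos].Kind != "IDENTIFIER"))
            throw new FormatException("Expected NUMBER or IDENTIFIER");
        var t = tokens[pos++];
        return Encoding.UTF8.GetString(input, t.Start, t.Length);
    }

    string tree = Operand();
    while (pos < tokens.Count)
    {
        var op = tokens[pos++];
        if (op.Kind != "PLUS" && op.Kind != "MINUS")
            throw new FormatException("Expected PLUS or MINUS");
        string symbol = op.Kind == "PLUS" ? "+" : "-";
        tree = $"({tree} {symbol} {Operand()})";
    }
    return tree;
}

byte[] source = Encoding.UTF8.GetBytes("x + 42 - y");
Console.WriteLine(Parse(source, Tokenize(source)));   // ((x + 42) - y)
```

Keeping the lexer forward-only means each input byte is visited once, which is what makes its cost predictable; all structural decisions are deferred to the second phase.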

Design Philosophy

PCRE2 Feature Support

✅ Fully Supported (70+ features)

⚠️ Partially Supported

❌ Not Supported (By Design)

The following features are intentionally excluded due to architectural design decisions:

Atomic Grouping ((?>...))

Recursive Pattern Support ((?R), (?&name))

Other Advanced Features

See docs/PCRE2-Support.md for complete feature matrix and detailed explanations.

Project Structure

ENFAStepLexer-StepPerser/
├── src/
│   ├── DevelApp.StepLexer/           # Zero-copy lexical analyzer
│   │   ├── StepLexer.cs              # Core tokenization engine
│   │   ├── PatternParser.cs          # High-level pattern controller
│   │   ├── ZeroCopyStringView.cs     # Memory-efficient string operations
│   │   ├── SplittableToken.cs        # Ambiguity-aware tokens
│   │   └── ...
│   ├── DevelApp.StepParser/          # Grammar-based semantic parser  
│   │   ├── StepParserEngine.cs       # Main parsing controller
│   │   ├── GrammarDefinition.cs      # Grammar specification
│   │   ├── TokenRule.cs              # Lexical analysis rules
│   │   ├── ProductionRule.cs         # Syntax analysis rules
│   │   └── ...
│   ├── DevelApp.StepLexer.Tests/     # StepLexer unit tests
│   ├── DevelApp.StepParser.Tests/    # StepParser unit tests
│   └── ENFAStepLexer.Demo/           # Demo console application
├── docs/
│   ├── StepLexer.md                  # Complete StepLexer documentation
│   ├── StepParser.md                 # Complete StepParser documentation
│   ├── PCRE2-Support.md              # Feature support matrix
│   └── Grammar_File_Creation_Guide.md # DSL development guide
└── README.md                         # This file

Documentation

Component Documentation

Quick Navigation

Contributing

This project welcomes contributions in several areas:

Core Development

  1. Adding new regex features: Extend TokenType enum and implement in StepLexer
  2. Grammar features: Enhance StepParser with new grammar constructs
  3. Performance improvements: Optimize zero-copy operations and memory usage
  4. CognitiveGraph integration: Improve semantic analysis capabilities

Testing and Quality

  1. Comprehensive unit tests: Expand test coverage for edge cases
  2. Performance benchmarks: Add throughput and memory usage benchmarks
  3. Grammar validation: Create test suites for grammar files
  4. Documentation examples: Improve code examples and tutorials

Documentation

  1. API documentation: Enhance inline code documentation
  2. Tutorial content: Create step-by-step guides for common scenarios
  3. Best practices: Document performance optimization techniques
  4. Integration guides: Show integration with other parsing tools

Performance

The StepLexer-StepParser architecture provides:

StepLexer Performance

StepParser Performance

Benchmarks

Future Roadmap

Phase 1 (Immediate)

Phase 2 (Short-term)

Phase 3 (Long-term)

Research Areas

License

This project is derived from @DevelApp/enfaparser but excludes the original license as requested. The enhancements and new code are provided for evaluation and development purposes.

Acknowledgments