Core Components¶
This document provides detailed information about PyBPMN Parser’s core components and their interactions.
Component Overview¶
PyBPMN Parser consists of several major components:
graph TD
A[Parser] --> B[Element Factory]
A --> C[Validator]
B --> D[BPMN Elements]
E[Plugin System] --> B
E --> C
F[Core Utils] --> A
F --> B
Parser Component¶
Main Parse Functions¶
The parser component provides the primary API for parsing BPMN files:
Location: pybpmn_parser/parse.py
def parse(xml_str: str) -> Definitions:
    """Parse a BPMN XML string into a Definitions object."""

def parse_file(xml_file: Path) -> Definitions:
    """Parse a BPMN XML file into a Definitions object."""
Parser Flow¶
- Input Validation - Check XML is not empty
- Schema Validation - Validate against BPMN 2.0 XSD
- XML Parsing - Parse XML using lxml
- Element Construction - Create typed BPMN elements
- Plugin Processing - Apply registered plugins
- Return Definitions - Return root Definitions object
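The steps above can be sketched as a small pipeline. This is a minimal illustration, not the library's code: it uses the standard library's xml.etree.ElementTree instead of lxml, skips schema validation (the standard library has no XSD validator), and `SketchDefinitions`, `sketch_parse`, and the callable-based plugin shape are hypothetical names introduced for the example.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import Callable, List, Sequence

BPMN_NS = "http://www.omg.org/spec/BPMN/20100524/MODEL"


@dataclass
class SketchDefinitions:
    """Hypothetical stand-in for the library's Definitions class."""
    id: str
    process_ids: List[str] = field(default_factory=list)


def sketch_parse(
    xml_str: str,
    plugins: Sequence[Callable[[SketchDefinitions], None]] = (),
) -> SketchDefinitions:
    # 1. Input validation: reject empty input before doing any work.
    if not xml_str or not xml_str.strip():
        raise ValueError("XML input is empty")
    # 2. Schema validation is omitted in this sketch (no XSD support in the stdlib).
    # 3. XML parsing.
    root = ET.fromstring(xml_str)
    # 4. Element construction: build a typed object from the XML tree.
    definitions = SketchDefinitions(
        id=root.get("id", ""),
        process_ids=[p.get("id", "") for p in root.findall(f"{{{BPMN_NS}}}process")],
    )
    # 5. Plugin processing: each plugin is assumed to be a simple callable here.
    for plugin in plugins:
        plugin(definitions)
    # 6. Return the root object.
    return definitions


xml = (
    '<definitions xmlns="http://www.omg.org/spec/BPMN/20100524/MODEL" id="defs_1">'
    '<process id="proc_1"/></definitions>'
)
result = sketch_parse(xml)
```

Keeping each step as a separate statement makes it easy to see where a failure stops the pipeline.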
Implementation Details¶
The parser uses lxml for XML processing:
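import lxml.etree as ET

from pybpmn_parser.validator import validate

def parse(xml_str: str) -> Definitions:
    # Validate first; raises if the document is invalid
    validation_result = validate(xml_str)
    validation_result.raise_for_errors()
    # Parse XML (lxml's element class is ET._Element; ET.Element is a factory function)
    root: ET._Element = ET.fromstring(xml_str.encode("utf-8"))
    # Create Definitions object
    return Definitions.parse(root)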
Element Factory¶
Purpose¶
The element factory creates typed Python dataclasses from XML elements.
Class Hierarchy¶
BPMN elements follow the BPMN 2.0 specification hierarchy:
BaseElement
├── FlowElement
│   ├── FlowNode
│   │   ├── Activity
│   │   │   ├── Task
│   │   │   ├── SubProcess
│   │   │   └── CallActivity
│   │   ├── Event
│   │   │   ├── StartEvent
│   │   │   ├── EndEvent
│   │   │   └── IntermediateEvent
│   │   └── Gateway
│   │       ├── ExclusiveGateway
│   │       ├── ParallelGateway
│   │       └── InclusiveGateway
│   └── SequenceFlow
├── Artifact
│   ├── TextAnnotation
│   ├── Group
│   └── Association
└── DataObject
Element Construction¶
Elements are constructed using dataclasses:
import lxml.etree as ET
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Task(FlowNode):
    """BPMN Task element."""
    id: str
    name: Optional[str] = None
    documentation: List[Documentation] = field(default_factory=list)
    is_for_compensation: bool = False

    @classmethod
    def parse(cls, element: ET._Element) -> "Task":
        """Parse Task from XML element."""
        return cls(
            id=element.get("id"),
            name=element.get("name"),
            # ... parse other attributes
        )
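A self-contained variant of the same pattern is shown below as a sketch. It assumes the standard library's xml.etree.ElementTree in place of lxml, and `SketchTask` is a hypothetical class, not the library's `Task`. It also shows one way the elided attribute parsing might look for the BPMN `isForCompensation` attribute.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchTask:
    """Hypothetical, simplified Task used only for illustration."""
    id: str
    name: Optional[str] = None
    is_for_compensation: bool = False

    @classmethod
    def parse(cls, element: ET.Element) -> "SketchTask":
        # "id" is treated as required here; fail early if it is missing.
        element_id = element.get("id")
        if element_id is None:
            raise ValueError("task element is missing the 'id' attribute")
        return cls(
            id=element_id,
            name=element.get("name"),  # None when the attribute is absent
            # XML Schema booleans may be written as "true"/"false" or "1"/"0".
            is_for_compensation=element.get("isForCompensation", "false") in ("true", "1"),
        )


task = SketchTask.parse(
    ET.fromstring('<task id="task_1" name="Review" isForCompensation="true"/>')
)
```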
Validator Component¶
Purpose¶
The validator ensures BPMN documents conform to the BPMN 2.0 specification.
Location: pybpmn_parser/validator.py
Validation Process¶
graph LR
A[XML Input] --> B[Empty Check]
B --> C[Schema Validation]
C --> D[Structural Validation]
D --> E[Result]
Validation Rules¶
- Empty XML Check - Ensure input is not empty
- Schema Validation - Validate against BPMN 2.0 XSD
- Element Validation - Check required attributes
- Reference Validation - Verify ID references exist
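As an illustration of the reference validation rule, the sketch below collects every `id` in a document and reports `sourceRef` and `targetRef` attributes that point at unknown IDs. It uses the standard library's xml.etree.ElementTree, and `find_dangling_references` is a hypothetical helper, not part of the validator's API.

```python
import xml.etree.ElementTree as ET
from typing import List


def find_dangling_references(xml_str: str) -> List[str]:
    """Return one message per sourceRef/targetRef that matches no element id."""
    root = ET.fromstring(xml_str)
    # First pass: collect all declared IDs.
    known_ids = {el.get("id") for el in root.iter() if el.get("id") is not None}
    # Second pass: check that every reference resolves.
    errors = []
    for el in root.iter():
        for attr in ("sourceRef", "targetRef"):
            ref = el.get(attr)
            if ref is not None and ref not in known_ids:
                errors.append(f"{el.get('id')}: {attr}={ref!r} does not match any element id")
    return errors


sample = (
    '<process id="p1">'
    '<startEvent id="start"/>'
    '<sequenceFlow id="flow_1" sourceRef="start" targetRef="missing_task"/>'
    '</process>'
)
problems = find_dangling_references(sample)
```

Two passes are used so that a reference may legally appear before the element it points to.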
Usage¶
from pybpmn_parser.validator import validate

result = validate(xml_string)
if result.errors:
    for error in result.errors:
        print(f"Validation error: {error}")
else:
    print("Valid BPMN")
BPMN Elements¶
Organization¶
BPMN elements are organized by category:
pybpmn_parser/bpmn/
├── activities/ # Tasks, SubProcesses
├── events/ # Start, End, Intermediate Events
├── gateways/ # Exclusive, Parallel, Inclusive
├── common/ # Shared base classes
├── foundation/ # Base BPMN elements
├── infrastructure/ # Definitions, Process
└── collaboration/ # Pools, Lanes, Message Flows
Dataclass Design¶
Elements use Python dataclasses for type safety:
@dataclass
class StartEvent(Event):
    """BPMN Start Event."""
    id: str
    name: Optional[str] = None
    is_interrupting: bool = True
    event_definitions: List[EventDefinition] = field(default_factory=list)
Benefits¶
- Type Safety - IDE support and type checking
- Immutability Options - Can make dataclasses frozen
- Default Values - Clean handling of optional attributes
- Auto-generated Methods - __init__, __repr__, __eq__
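The benefits listed above can be seen in a few lines. The example is a sketch using a hypothetical `FrozenEvent` class; whether the library's own elements are frozen is not stated here.

```python
from dataclasses import FrozenInstanceError, dataclass
from typing import Optional


@dataclass(frozen=True)
class FrozenEvent:
    """Hypothetical element used to demonstrate dataclass features."""
    id: str
    name: Optional[str] = None  # default value for an optional attribute


a = FrozenEvent(id="start_1")
b = FrozenEvent(id="start_1")

same = a == b    # generated __eq__ compares field values
text = repr(a)   # generated __repr__: "FrozenEvent(id='start_1', name=None)"

try:
    a.name = "changed"  # frozen=True rejects attribute assignment
    mutated = True
except FrozenInstanceError:
    mutated = False
```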
Plugin System¶
Architecture¶
The plugin system allows extensions without modifying core code.
Location: pybpmn_parser/plugins/
Plugin Registry¶
class PluginRegistry:
    """Central registry for plugins."""

    def __init__(self):
        self._plugins = []

    def register(self, plugin):
        """Register a plugin."""
        self._plugins.append(plugin)

    def get_plugins(self):
        """Get all registered plugins."""
        return self._plugins
Plugin Interface¶
Plugins implement standard methods:
class Plugin:
    """Base plugin interface."""
    namespaces: dict

    def parse_extension(self, element, extension_data):
        """Parse extension attributes."""
        raise NotImplementedError

    def validate_extension(self, element, extension_data):
        """Validate extension attributes."""
        return []
Extension Processing¶
- Namespace Detection - Identify which plugins handle which namespaces
- Attribute Extraction - Extract attributes for each namespace
- Plugin Invocation - Call appropriate plugin methods
- Extension Attachment - Attach parsed data to elements
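The four steps above might look like the following sketch. Several things are assumptions made for the example: `namespaces` is taken to map a prefix to a namespace URI, `extension_data` is taken to be a dict of attribute names to values, and `ColorPlugin`, `split_qname`, and `process_extensions` are hypothetical names. The standard library's xml.etree.ElementTree stands in for lxml.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple


class SketchPlugin:
    """Mirrors the Plugin interface shape; namespaces assumed to be prefix -> URI."""
    namespaces: Dict[str, str] = {}

    def parse_extension(self, element, extension_data):
        raise NotImplementedError


class ColorPlugin(SketchPlugin):
    """Hypothetical plugin handling a made-up 'color' namespace."""
    namespaces = {"color": "http://example.com/color"}

    def parse_extension(self, element, extension_data):
        return {"fill": extension_data.get("fill")}


def split_qname(qname: str) -> Tuple[str, str]:
    """Split '{uri}local' into (uri, local); un-namespaced names get an empty uri."""
    if qname.startswith("{"):
        uri, _, local = qname[1:].partition("}")
        return uri, local
    return "", qname


def process_extensions(element: ET.Element, plugins: List[SketchPlugin]) -> Dict[str, dict]:
    # Attribute extraction: group namespaced attributes by namespace URI.
    by_namespace: Dict[str, Dict[str, str]] = {}
    for qname, value in element.attrib.items():
        uri, local = split_qname(qname)
        if uri:
            by_namespace.setdefault(uri, {})[local] = value
    # Namespace detection and plugin invocation.
    attached: Dict[str, dict] = {}
    for plugin in plugins:
        for uri in plugin.namespaces.values():
            if uri in by_namespace:
                attached[uri] = plugin.parse_extension(element, by_namespace[uri])
    # Extension attachment: the caller would store this mapping on the element object.
    return attached


element = ET.fromstring(
    '<task xmlns:color="http://example.com/color" id="t1" color:fill="#ff0000"/>'
)
extensions = process_extensions(element, [ColorPlugin()])
```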
Core Utilities¶
Purpose¶
Core utilities provide shared functionality across components.
Location: pybpmn_parser/core.py
Utility Functions¶
def strtobool(value: str) -> bool:
    """Convert string to boolean."""
    value = str(value).lower()
    return value in ("y", "yes", "on", "1", "true", "t")

def get_fields_by_metadata(data_class, key, val):
    """Get dataclass fields by metadata."""
    # Implementation
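Since the body of `get_fields_by_metadata` is not shown, the following is only one plausible reading of its signature: filter `dataclasses.fields()` by a metadata key and value. The function name `sketch_get_fields_by_metadata`, the list return type, and the `xml_type` metadata key are assumptions made for the example.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


def sketch_get_fields_by_metadata(data_class, key, val):
    """Return the dataclass fields whose metadata[key] equals val (assumed behavior)."""
    return [f for f in fields(data_class) if f.metadata.get(key) == val]


@dataclass
class SketchEvent:
    """Hypothetical element whose fields are tagged with their XML origin."""
    id: str = field(metadata={"xml_type": "attribute"})
    name: Optional[str] = field(default=None, metadata={"xml_type": "attribute"})
    documentation: List[str] = field(default_factory=list, metadata={"xml_type": "element"})


attribute_names = [
    f.name for f in sketch_get_fields_by_metadata(SketchEvent, "xml_type", "attribute")
]
```

Tagging fields with metadata like this lets a generic parser decide which fields to read from XML attributes and which from child elements.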
Data Flow¶
Parse Flow¶
sequenceDiagram
participant User
participant Parser
participant Validator
participant Factory
participant Plugins
User->>Parser: parse_file(path)
Parser->>Validator: validate(xml)
Validator-->>Parser: validation_result
Parser->>Parser: ET.fromstring(xml)
Parser->>Factory: Definitions.parse(root)
Factory->>Plugins: process_extensions()
Plugins-->>Factory: extension_data
Factory-->>Parser: Definitions
Parser-->>User: Definitions
Element Creation Flow¶
- XML Element - Start with lxml Element
- Attribute Extraction - Extract XML attributes
- Child Processing - Recursively process children
- Extension Processing - Apply plugins
- Dataclass Construction - Create typed Python object
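The recursive shape of this flow is sketched below with a generic node type. `SketchNode` and `build` are hypothetical, extension processing is left out, and the standard library's xml.etree.ElementTree replaces lxml.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SketchNode:
    """Hypothetical generic element: tag, attributes, and parsed children."""
    tag: str
    attributes: Dict[str, str]
    children: List["SketchNode"] = field(default_factory=list)


def build(element: ET.Element) -> SketchNode:
    # Attribute extraction: copy the XML attributes.
    attributes = dict(element.attrib)
    # Child processing: recurse into each child element in document order.
    children = [build(child) for child in element]
    # Extension processing would run here (omitted in this sketch).
    # Dataclass construction: assemble the typed object last.
    return SketchNode(tag=element.tag, attributes=attributes, children=children)


tree = build(
    ET.fromstring('<process id="p1"><task id="t1"/><task id="t2"/></process>')
)
```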
Performance Considerations¶
XML Parsing¶
- Uses lxml (C-based) for fast parsing
- Parses entire document into memory (not streaming)
- Suitable for documents up to ~100MB
Memory Usage¶
- One Python object per BPMN element
- Lightweight dataclasses minimize overhead
- References use IDs (strings) not object pointers
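Because references are stored as ID strings, following one requires a lookup. A small index built once per document keeps that lookup cheap. The classes and `build_index` below are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SketchFlowNode:
    id: str


@dataclass
class SketchSequenceFlow:
    id: str
    source_ref: str  # ID string, not an object reference
    target_ref: str


def build_index(nodes: List[SketchFlowNode]) -> Dict[str, SketchFlowNode]:
    """Map each element ID to its object for constant-time resolution."""
    return {node.id: node for node in nodes}


nodes = [SketchFlowNode("start"), SketchFlowNode("task"), SketchFlowNode("end")]
flows = [
    SketchSequenceFlow("f1", "start", "task"),
    SketchSequenceFlow("f2", "task", "end"),
]

index = build_index(nodes)
# Resolve each flow's endpoints through the index.
resolved = [(index[f.source_ref], index[f.target_ref]) for f in flows]
```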
Optimization Strategies¶
# Cache parsed results
from functools import lru_cache
from pathlib import Path

from pybpmn_parser.parse import parse_file

@lru_cache(maxsize=100)
def cached_parse(file_path: str):
    return parse_file(Path(file_path))
Extension Points¶
For Plugin Developers¶
- Namespace Handler - Add support for new namespaces
- Custom Validator - Add validation rules
- Element Extensions - Extend element classes
For Library Users¶
- Custom Element Factories - Override element creation
- Validation Callbacks - Add custom validation
- Post-Processing Hooks - Process after parsing
Error Handling¶
Exception Hierarchy¶
BPMNParserError
├── ValidationError
│   ├── SchemaError
│   ├── EmptyXMLError
│   └── StructuralError
└── ParseError
    ├── MalformedXMLError
    └── UnknownElementError
Error Recovery¶
The parser uses a fail-fast approach:
- Validation errors stop parsing
- Schema violations raise immediately
- No partial/invalid results returned
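Expressed as Python classes, the hierarchy lets callers choose how broadly to catch. The declarations below are a sketch that reuses the names from the tree; how the library actually defines them, and the `classify` helper, are assumptions.

```python
class BPMNParserError(Exception):
    """Root of the hierarchy; catching it covers every parser failure."""


class ValidationError(BPMNParserError):
    pass


class SchemaError(ValidationError):
    pass


class EmptyXMLError(ValidationError):
    pass


class StructuralError(ValidationError):
    pass


class ParseError(BPMNParserError):
    pass


class MalformedXMLError(ParseError):
    pass


class UnknownElementError(ParseError):
    pass


def classify(exc: BPMNParserError) -> str:
    """Hypothetical helper: report which branch of the hierarchy an error is in."""
    if isinstance(exc, ValidationError):
        return "validation"
    if isinstance(exc, ParseError):
        return "parse"
    return "other"


try:
    raise EmptyXMLError("input is empty")
except BPMNParserError as exc:  # the base class catches every subclass
    branch = classify(exc)
```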
Testing Strategy¶
Unit Tests¶
Each component has isolated unit tests:
# tests/test_parse.py
def test_parse_valid_bpmn():
    xml = """<definitions>...</definitions>"""
    result = parse(xml)
    assert isinstance(result, Definitions)
Integration Tests¶
Test component interactions:
# tests/test_integration.py
def test_parse_with_plugins():
    register_plugin(MyPlugin())
    result = parse_file(Path("extended.bpmn"))
    assert result.processes[0].flow_elements
Fixture-based Tests¶
Use real BPMN files:
def test_miwg_suite(miwg_file):
    result = parse_file(miwg_file)
    assert result.processes
Design Decisions¶
Why Dataclasses?¶
Chosen: Python dataclasses
Alternatives Considered: Plain classes, attrs, Pydantic
Rationale:
- Built-in to Python 3.7+
- Minimal boilerplate
- Good IDE support
- Type hints integration
Why lxml?¶
Chosen: lxml
Alternatives Considered: xml.etree, xmltodict
Rationale:
- Fast C-based parser
- XPath support
- Schema validation
- Industry standard
Why Plugins?¶
Chosen: Plugin architecture
Alternatives Considered: Inheritance, monkey-patching
Rationale:
- Extensible without modification
- Clean separation of concerns
- Optional functionality
- Community contributions