AxlFind — Byte-Substring Search Engine

See AxlData — Data Structures for an overview of all data modules.

A Boyer-Moore-Horspool byte-substring search (with case-insensitive and whole-word variants, forward and backward) that runs over an abstract random-access byte source — an AxlByteReader. The same engine drives a flat memory block (the built-in AxlMemReader), a AxlTextBuffer (gap buffer), and a AxlPieceTree (out-of-core piece table); the axl_text_buffer_find / axl_piece_tree_find wrappers build the appropriate reader and call axl_find_in_source.

The reader pulls overlapping windows so a match straddling the source’s internal boundaries (a piece boundary, the gap) is never missed; a contiguous source that supplies the optional peek is scanned in place with no copy. A successful find reports an AxlMatch (start + length); length is carried explicitly so the result shape already fits variable-length matchers.

Header: <axl/axl-find.h>

API Reference

Enums

enum AxlFindFlags

Flags for axl_find_in_source and the axl_*_find wrappers.

axl-find.h:

Byte-substring search over an abstract random-access byte source.

The search engine (Boyer-Moore-Horspool, with case-insensitive and whole-word variants) reads through an AxlByteReader — a tiny function-table over whatever holds the bytes — so the same engine drives a flat memory block, an AxlTextBuffer (gap buffer), and an AxlPieceTree (out-of-core piece table). The reader’s read pulls windowed chunks (with overlap to catch matches spanning the source’s internal boundaries); an optional peek lets a contiguous source be scanned in place with no copy.

axl_find_in_source is the single engine; the per-type axl_*_find wrappers (axl_text_buffer_find, axl_piece_tree_find) build the appropriate reader and call it. A successful find reports an AxlMatch (start + length); length is carried explicitly so the result shape already fits variable-length matchers (a future regex / fuzzy engine slots in behind the same reader without reshaping callers).

Byte-oriented; the only “word” notion is for AXL_FIND_WHOLE_WORD, where a word byte is [A-Za-z0-9_]. Single-threaded (UEFI).

Values:

enumerator AXL_FIND_DEFAULT
enumerator AXL_FIND_CASE_INSENSITIVE

ASCII case fold.

enumerator AXL_FIND_BACKWARD

search toward offset 0

enumerator AXL_FIND_WHOLE_WORD

match only at word boundaries

Functions

void axl_mem_reader_init(AxlMemReader *mem, const void *data, size_t len)

Initialize a contiguous in-memory reader over data / len.

Parameters:
  • mem – reader to initialize (caller-owned)

  • data – borrowed bytes (must outlive the search)

  • len – number of bytes

bool axl_find_in_source(const AxlByteReader *reader, const char *needle, size_t needle_len, size_t from_offset, uint32_t flags, AxlMatch *out)

Search reader for the needle_len bytes at needle, scanning from from_offset. Forward (default) returns the lowest match with start >= from_offset; AXL_FIND_BACKWARD returns the highest match with start <= from_offset. AXL_FIND_CASE_INSENSITIVE folds ASCII case; AXL_FIND_WHOLE_WORD requires non-word bytes on both sides. Matches spanning the source’s internal boundaries are handled. Wrap-around is the caller’s job.

Parameters:
  • reader – byte source to scan

  • needle – bytes to find

  • needle_len – length of needle

  • from_offset – where to start scanning

  • flags – AxlFindFlags

  • out – [out] match on success

Returns:

true and fills out on a match; false if not found (or needle_len is 0, or any required argument is NULL).

struct AxlMatch
#include <axl-find.h>

A successful match: the bytes [start, start + length). length equals the needle length for a literal find, but is reported explicitly so the result also fits variable-length matchers.

Public Members

size_t start

byte offset of the match

size_t length

match length in bytes

struct AxlByteReader

Random-access byte source the search engine reads through. An implementation fills the function pointers and ctx; the engine never inspects ctx itself.

Public Members

size_t (*length)(const AxlByteReader *r)

Total number of bytes the reader can serve.

size_t (*read)(const AxlByteReader *r, size_t offset, size_t len, void *buf)

Copy up to len bytes starting at logical offset into buf, returning the number actually copied (fewer than len only when offset + len runs past the end). Always present.

const char *(*peek)(const AxlByteReader *r, size_t offset, size_t len)

OPTIONAL zero-copy fast path: if [offset, offset + len) is stored contiguously, return a direct pointer to those bytes; otherwise return NULL. NULL is always a safe answer — the engine falls back to read. May itself be NULL (no fast path).

void *ctx

implementation data (opaque to the engine)

struct AxlMemReader
#include <axl-find.h>

Built-in AxlByteReader over a flat, contiguous memory block. The reader supports the zero-copy peek path, so searching a memory block performs no copying. Initialize with axl_mem_reader_init and pass &mem.reader to axl_find_in_source. The bytes are borrowed — data must outlive the search.

Public Members

AxlByteReader reader

pass &reader to axl_find_in_source

const char *data
size_t len

AxlRegex — Regular-Expression Matcher

A compiled-pattern regular-expression matcher (<axl/axl-regex.h>) layered on the same AxlByteReader seam. Compile a pattern once with axl_regex_new, then search many times (find-all loops, one pattern across many lines/buffers). It is a Thompson NFA / Pike VM, not a backtracker, so match time is O(pattern × input) for every pattern — there is no catastrophic (“ReDoS”) blow-up. Backreferences are deliberately unsupported because they are not regular and would force backtracking.

Supported: literals; .; greedy and lazy quantifiers * + ? *? +? ??; anchors ^ $; alternation |; grouping and capture ( ); classes [...] / [^...] with ranges; the escapes \d \w \s \D \W \S and friends. Matching is byte-oriented and leftmost (Perl / grep -P priority). Captures via axl_regex_search_captures; compile flags CASELESS / MULTILINE / DOTALL; an ANCHORED match flag. The axl_text_buffer_find_regex / axl_piece_tree_find_regex wrappers run a compiled regex over those sources.

Regular-expression search — a compiled-pattern matcher over the axl-find.h byte-source seam.

A compiled pattern (AxlRegex): parse and compile once, then search many times — the right shape for find-all loops and matching one pattern across many lines or buffers. The matcher reads through the same AxlByteReader the literal find engine uses, so the same AxlRegex searches a flat buffer, an AxlTextBuffer, or an AxlPieceTree; a successful search reports the same AxlMatch.

The engine is a Thompson NFA / Pike VM, NOT a backtracker, so match time is O(pattern × input) for every pattern — there is no catastrophic (“ReDoS”) blowup. That guarantee is why backreferences are deliberately unsupported: they are not regular and would force backtracking.

Supported syntax: literals; .; greedy and lazy quantifiers * + ? *? +? ??; bounded repetition {n} / {n,} / {n,m} / {,m} (counts clamp to 1024; a { that is not a valid interval is a literal); anchors ^ $; alternation |; grouping and capture ( ); character classes [...] / [^...] with a-z ranges; the escapes \d \w \s \D \W \S \n \t \r \f \v and backslash-escaped metacharacters. Bounded repetition desugars to the base quantifiers, so a capture group repeated by an interval resolves to its last match. Named groups are not in this version. Matching is byte-oriented and leftmost (Perl / grep -P priority, not POSIX leftmost-longest).

AXL_AUTOPTR(AxlRegex) re = axl_regex_new("[0-9]+", AXL_REGEX_DEFAULT);
AxlMatch m;
if (re && axl_regex_search_buf(re, text, len, 0, AXL_REGEX_MATCH_DEFAULT, &m))
    printf("first number at %zu, %zu bytes\n", m.start, m.length);

Defines

AXL_REGEX_NO_MATCH

Sentinel start for a capture group that did not participate.

Typedefs

typedef struct AxlRegex AxlRegex

A compiled regular expression. Opaque; create with axl_regex_new, release with axl_regex_free.

Enums

enum AxlRegexFlags

Compile-time options (affect how the pattern is interpreted).

Values:

enumerator AXL_REGEX_DEFAULT
enumerator AXL_REGEX_CASELESS

ASCII case-insensitive matching.

enumerator AXL_REGEX_MULTILINE

^/$ also match at \n boundaries

enumerator AXL_REGEX_DOTALL

. also matches \n

enum AxlRegexMatchFlags

Match-time options, passed per search.

Values:

enumerator AXL_REGEX_MATCH_DEFAULT
enumerator AXL_REGEX_MATCH_ANCHORED

match must start exactly at from_offset

enumerator AXL_REGEX_MATCH_NOTBOL

Treat from_offset as MID-stream, not the start of the text: ^ (and the start-anchor) does NOT match at from_offset (POSIX REG_NOTBOL). Multiline ^ after an embedded \n still matches. For chunked / windowed scanning of a larger source: pass it on every window except the one that begins at the true source start.

enumerator AXL_REGEX_MATCH_NOTEOL

Treat the buffer end as MID-stream, not the end of the text: $ (and the end-anchor) does NOT match at len (POSIX REG_NOTEOL). Multiline $ before an embedded \n still matches. For windowed scanning: pass it on every window except the one ending at true EOF.

Functions

AxlRegex *axl_regex_new(const char *pattern, uint32_t flags)

Compile a pattern.

Parameters:
  • pattern – the regular-expression text (UTF-8 bytes)

  • flags – AxlRegexFlags

Returns:

a compiled regex (free with axl_regex_free), or NULL if the pattern is malformed (the reason is logged) or on allocation failure. Use axl_regex_new_full to recover the error detail.

AxlRegex *axl_regex_new_full(const char *pattern, uint32_t flags, AxlRegexError *err)

Compile a pattern, reporting compile-error detail.

Identical to axl_regex_new but, on a malformed pattern, fills err (when non-NULL) with the byte offset and a static reason string. err is left untouched on success and on allocation failure.

Parameters:
  • pattern – the regular-expression text

  • flags – AxlRegexFlags

  • err – [out] error detail on failure, or NULL

Returns:

a compiled regex, or NULL (see axl_regex_new).

void axl_regex_free(AxlRegex *re)

Free a compiled regex. NULL-safe.

Parameters:
  • re – compiled regex (may be NULL)

size_t axl_regex_capture_count(const AxlRegex *re)

Number of capture groups in the pattern (excluding group 0).

Group 0 is always the overall match; this counts the ( ) groups, which are numbered 1..N left-to-right by opening parenthesis. Use it to size the groups array for axl_regex_search_captures.

Parameters:
  • re – compiled regex

Returns:

the capture-group count (0 if re is NULL).

Find the leftmost match in a byte source.

Scans reader from from_offset and reports the leftmost match (start >= from_offset). With AXL_REGEX_MATCH_ANCHORED the match must begin exactly at from_offset. Iterate by re-searching from m.start + (m.length ? m.length : 1).

The matcher needs a contiguous view of the scanned region: it uses the reader’s zero-copy peek when available, otherwise it reads the region [from_offset, length) into a temporary buffer (so for a non-contiguous out-of-core source the cost is O(region) memory per call — fine for editor-sized regions; windowed matching is future work).

Parameters:
  • re – compiled regex

  • reader – byte source to scan

  • from_offset – where to start scanning

  • match_flags – AxlRegexMatchFlags

  • out – [out] overall match on success

Returns:

true and fills out on a match; false if no match (or any required argument is NULL).

bool axl_regex_search_buf(const AxlRegex *re, const void *data, size_t len, size_t from_offset, uint32_t match_flags, AxlMatch *out)

Find the leftmost match in a contiguous buffer.

Convenience over axl_regex_search for the common contiguous case; the bytes are borrowed and need not outlive the call.

Parameters:
  • re – compiled regex

  • data – bytes to scan

  • len – number of bytes

  • from_offset – where to start scanning

  • match_flags – AxlRegexMatchFlags

  • out – [out] overall match on success

Returns:

true and fills out on a match; false otherwise.

bool axl_regex_search_captures(const AxlRegex *re, const AxlByteReader *reader, size_t from_offset, uint32_t match_flags, AxlMatch *groups, size_t n_groups)

Find the leftmost match and report capture groups.

Like axl_regex_search, but also fills groups with the captured sub-matches: groups[0] is the overall match and groups[k] is the k-th ( ) group. At most n_groups entries are written (size it via axl_regex_capture_count + 1). A group that did not participate in the match (e.g. an unmatched optional or alternation branch) is reported with start == AXL_REGEX_NO_MATCH and length == 0.

Parameters:
  • re – compiled regex

  • reader – byte source to scan

  • from_offset – where to start scanning

  • match_flags – AxlRegexMatchFlags

  • groups – [out] group array (groups[0] = overall)

  • n_groups – capacity of groups

Returns:

true and fills groups on a match; false otherwise.

struct AxlRegexError
#include <axl-regex.h>

Detail for a failed compile (optional out param of axl_regex_new_full).

Public Members

size_t offset

byte offset in the pattern where parsing failed

const char *message

static, human-readable reason (never freed)