~brenns10/sc-regex

c1de3e869abe747670732d0c18d15f0dbcdc06fb — Stephen Brennan 4 months ago 87fb4cd master
docs: Add API documentation!
5 files changed, 244 insertions(+), 88 deletions(-)

A doc/api.rst
M doc/conf.py.in
M doc/index.rst
M doc/meson.build
M include/sc-regex.h
A doc/api.rst => doc/api.rst +74 -0
@@ 0,0 1,74 @@
Regex API Documentation
=======================

Overview
--------

To use the Regex API, you must follow a few simple steps. First, you must
compile the regular expression's string to a ``struct sc_regex *``. Once the
regex is compiled, you may "execute" it on a given input string. You may execute
the compiled regex on as many strings as you'd like. Finally, the regex must be
freed.

Substrings may be captured by a regex. A given regular expression has a fixed
number of substrings it can capture, and this number is known once the regex has
been compiled. The substrings are indexed in order of their left parentheses. So,
for example, if the below regular expression matched, the captures would be as
follows:

.. code::

    regex: (ab(cd))(ef)
    capture 0: abcd
    capture 1: cd
    capture 2: ef
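
The header comment for ``sc_regex_exec()`` describes captures as pairs of
start/end indices, with the start included and the end excluded. Assuming that
layout, the sketch below shows how the three captures above would map onto the
input ``abcdef``. The helper ``copy_capture`` and the array ``DEMO_SAVED`` are
hypothetical names used only for illustration. They are not part of sc-regex,
and the indices are written by hand rather than produced by the library.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of sc-regex): copy capture number `i` out of
 * `input`, given a flat array of [start, end) index pairs.
 */
char *copy_capture(const char *input, const size_t *saved, size_t i)
{
	size_t start = saved[2 * i];   /* start index is included */
	size_t end = saved[2 * i + 1]; /* end index is excluded */
	char *out;

	assert(start <= end);
	out = malloc(end - start + 1); /* capture length is end - start */
	if (!out)
		return NULL;
	memcpy(out, input + start, end - start);
	out[end - start] = '\0';
	return out;
}

/* Hand-written indices for "(ab(cd))(ef)" matched against "abcdef". */
const size_t DEMO_SAVED[] = { 0, 4, 2, 4, 4, 6 };
```

With these indices, capture 0 spans ``[0, 4)``, capture 1 spans ``[2, 4)``, and
capture 2 spans ``[4, 6)``. Three captures need six index slots, which is why
the buffer size is twice the capture count.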

API
---

Data Structures
^^^^^^^^^^^^^^^

.. doxygenstruct:: sc_regex
.. doxygenenum:: sc_regex_flags


Functions
^^^^^^^^^

.. doxygenfunction:: sc_regex_compile
.. doxygenfunction:: sc_regex_compile2
.. doxygenfunction:: sc_regex_free
.. doxygenfunction:: sc_regex_exec
.. doxygenfunction:: sc_regex_num_captures
.. doxygenfunction:: sc_regex_get_capture
.. doxygenfunction:: sc_regex_get_captures
.. doxygenfunction:: sc_regex_captures_free

Wide String API
^^^^^^^^^^^^^^^

The wide string API uses the same ``struct sc_regex``. However, compilation,
execution, and access to captures must all be done via functions specific to
this API. It is safe to use :c:func:`sc_regex_num_captures()` for the wide
string API, as well as :c:func:`sc_regex_free()`.

.. doxygenfunction:: sc_regex_wcompile
.. doxygenfunction:: sc_regex_wcompile2
.. doxygenfunction:: sc_regex_wexec
.. doxygenfunction:: sc_regex_get_wcapture
.. doxygenfunction:: sc_regex_get_wcaptures
.. doxygenfunction:: sc_regex_wcaptures_free

Bytecode API
^^^^^^^^^^^^

The bytecode API is documented, but is explicitly considered **unstable**. It is
interesting for exploration and testing, but it is not intended as a
production-ready API. The bytecode API allows you to peer into the
"instructions" of the regex engine to understand the implementation, or even
write your own bytecode and then run it. Here be dragons.

.. doxygenfunction:: sc_regex_read
.. doxygenfunction:: sc_regex_fread
.. doxygenfunction:: sc_regex_write

M doc/conf.py.in => doc/conf.py.in +1 -0
@@ 87,6 87,7 @@ pygments_style = None

# default domain
primary_domain = 'c'
highlight_language = 'c'

# -- Options for HTML output -------------------------------------------------


M doc/index.rst => doc/index.rst +45 -5
@@ 1,14 1,54 @@
Welcome to sc-template documentation!
=====================================
Welcome to sc-regex documentation!
==================================

.. toctree::
   :maxdepth: 2
   :caption: Contents:

sc-template
-----------
   api

You should use this as a starting point for your documentation.
sc-regex
--------

sc-regex is a library which contains a regular expression implementation. The
implementation is based on the Pike virtual machine model described by Russ Cox
`here <https://swtch.com/~rsc/regexp/>`_. The advantage of this implementation
is that it is based on finite automata (rather than backtracking), and so its
worst-case runtime is linear in the length of the input string, O(N). Of course,
it lacks the "non-regular" features which backtracking implementations provide,
such as backreferences and forward lookahead assertions.
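
To make the bounded-runtime claim concrete, here is a minimal sketch of the
thread-list simulation idea from Russ Cox's articles. It is not the sc-regex
implementation: the instruction set, the names (``struct instr``,
``add_thread``, ``vm_full_match``, ``A_PLUS_B``) and the fixed ``MAX_PROG``
limit are all assumptions made for illustration, and captures are left out
entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum op { OP_CHAR, OP_SPLIT, OP_JMP, OP_MATCH };

struct instr {
	enum op op;
	char c;      /* OP_CHAR: character to consume */
	size_t x, y; /* OP_JMP uses x; OP_SPLIT uses x and y */
};

#define MAX_PROG 64

/*
 * Mark pc as a live thread, following JMP and SPLIT eagerly. The `on` check
 * ensures each instruction is added at most once per input position.
 * Assumes a well-formed program: all targets in range, last instruction MATCH.
 */
void add_thread(const struct instr *prog, bool *on, size_t pc)
{
	if (on[pc])
		return;
	on[pc] = true;
	if (prog[pc].op == OP_JMP) {
		add_thread(prog, on, prog[pc].x);
	} else if (prog[pc].op == OP_SPLIT) {
		add_thread(prog, on, prog[pc].x);
		add_thread(prog, on, prog[pc].y);
	}
}

/* Return true if the n-instruction program matches all of input. */
bool vm_full_match(const struct instr *prog, size_t n, const char *input)
{
	bool cur[MAX_PROG] = { false }, next[MAX_PROG];

	assert(n <= MAX_PROG);
	add_thread(prog, cur, 0);
	for (const char *p = input; *p; p++) {
		for (size_t i = 0; i < n; i++)
			next[i] = false;
		/* Advance every live thread that can consume *p. */
		for (size_t pc = 0; pc < n; pc++)
			if (cur[pc] && prog[pc].op == OP_CHAR && prog[pc].c == *p)
				add_thread(prog, next, pc + 1);
		for (size_t i = 0; i < n; i++)
			cur[i] = next[i];
	}
	for (size_t pc = 0; pc < n; pc++)
		if (cur[pc] && prog[pc].op == OP_MATCH)
			return true;
	return false;
}

/* Hand-assembled program for the regex a+b. */
const struct instr A_PLUS_B[] = {
	{ .op = OP_CHAR, .c = 'a' },
	{ .op = OP_SPLIT, .x = 0, .y = 2 },
	{ .op = OP_CHAR, .c = 'b' },
	{ .op = OP_MATCH },
};
```

Because a thread list never holds more than one entry per instruction, each
input character costs at most a pass over the program, so the total work for a
fixed program grows linearly with the input. Calling
``vm_full_match(A_PLUS_B, 4, "aab")`` walks the input once and reports a match.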

Google's `RE2 <https://github.com/google/re2>`_ library is based on this same
principle, and I believe it was originally written by Russ Cox as well. I don't
believe that this library has any advantage compared to RE2 (except maybe that
it is implemented in plain C), and I wouldn't advocate using it over RE2. This
library exists mostly in the "do it yourself" model that I sometimes like to
pursue. Caveat emptor.

This library features:

- Support for most standard regex syntax:

  - Character classes (including ranges and negation)
  - Repetition operators (``+*?``)
  - Alternation operator (``|``)
  - Dot-shorthand for "any character"
  - Captures via parentheses operator

- Easy support for accessing captured strings
- Support for operating over single-byte characters (``char``) in a manner that
  is blind to the character encoding (e.g. ASCII, UTF-8). This is the most
  common and useful mode.
- Support for operating over multi-byte "wide" characters (``wchar_t``), in a
  way which might be expected for tools which decode text into a sequence of
  Unicode code points.
- Notably absent syntax:

  - Non-regular features: backreferences, lookahead/lookbehind assertions
  - Restricted repetition (curly brace) operator, e.g.: ``a{3,5}`` meaning
    between 3 and 5 a's.

  - Non-capturing parentheses.
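
The difference between the ``char`` and ``wchar_t`` modes listed above comes
down to what counts as one "character". The sketch below uses only the
standard C library, not sc-regex, and the names ``CAFE_UTF8``, ``CAFE_WIDE``,
``char_units`` and ``wide_units`` are hypothetical. It shows the same text
measured in each kind of unit.

```c
#include <string.h>
#include <wchar.h>

/* "café" as UTF-8 bytes: the final "é" occupies two char units. */
const char CAFE_UTF8[] = "caf\xc3\xa9";

/* The same text as wide characters: "é" is a single wchar_t unit. */
const wchar_t CAFE_WIDE[] = L"caf\u00e9";

/* Number of units a char-based matcher would step over. */
size_t char_units(const char *s)
{
	return strlen(s);
}

/* Number of units a wchar_t-based matcher would step over. */
size_t wide_units(const wchar_t *s)
{
	return wcslen(s);
}
```

If the dot shorthand consumes exactly one unit, as the encoding-blind
description suggests, then a pattern such as ``caf.`` would cover the whole
wide string but stop one byte short of the end of the UTF-8 string.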

Indices and tables
==================

M doc/meson.build => doc/meson.build +1 -0
@@ 74,6 74,7 @@ doxygen_target = custom_target(
# meson.build file
sphinx_files = [
	'index.rst',
	'api.rst',
]
foreach file : sphinx_files
	configure_file(input: file, output: file, copy: true)

M include/sc-regex.h => include/sc-regex.h +123 -83
@@ 1,6 1,8 @@
/*
/**
 * libsc-regex: a simple regular expression library built with a lot of help
 * from Russ Cox's articles
 *
 * @file sc-regex.h
 */

#ifndef SC_REGEX_H


@@ 14,64 16,102 @@
 * Main API
 ***/

/*
 * A regular expression.
/**
 * @struct sc_regex
 * @brief A compiled regular expression
 *
 * Compiled regular expressions are represented by sc-regex as pointers to this
 * structure. They are created by sc_regex_compile() or sc_regex_compile2(), and
 * are freed by sc_regex_free().
 */
struct sc_regex;

enum {
	/* Case insensitive matching */
/**
 * @brief Defines flags values to pass to sc_regex_compile()
 */
enum sc_regex_flags {
	/** @brief Case insensitive matching */
	SC_RE_INSENSITIVE = 1,
};

/*
 * Compile a regular expression. It must be freed by sc_regex_free() after.
 * This interface is deprecated in favor of sc_regex_compile2(). The flags
 * argument defaults to zero when this function is used.
/**
 * @brief Compile a regular expression.
 * @param regex The text form of the regular expression.
 * @returns The compiled bytecode for the regex.
 *
 * @deprecated Please use sc_regex_compile2() instead. The @a flags argument
 * defaults to 0 when this function is used.
 *
 * Compiled regular expressions must be freed by sc_regex_free().
 */
struct sc_regex *sc_regex_compile(const char *regex);

/*
 * Compile a regular expression. It must be freed by sc_regex_free() after.
/**
 * @brief Compile a regular expression.
 * @param regex The text form of the regular expression.
 * @param flags Behavior flags
 * @returns The compiled bytecode for the regex.
 *
 * Compiled regex must be freed by sc_regex_free(). This interface supersedes
 * sc_regex_compile(), and allows regex to be created with flags to change the
 * capabilities of the compiled regex. See #sc_regex_flags for details on
 * the defined flag values.
 */
struct sc_regex *sc_regex_compile2(const char *regex, int flags);

/*
 * Execute a regex on a string.
 *
 * The input string must be nul-terminated. For regular expressions which
 * capture substrings, we simply record the index of the first and last
 * character in the saved string into the `saved` buffer provided by the user.
 * You must provide a buffer that has an appropriate amount of room (e.g. if
 * there are three sets of capturing parens, your buffer should be of length
 * six). Use sc_regex_num_captures() * 2 as the buffer size.
 *
 * Should you want a more convenient way to access captured strings, the
 * function sc_regex_get_captures() returns an array of captured strings.
 *
 * @param r Compiled regular expression to execute.
 * @param input Text to use as input.
/**
 * @brief Execute a regex on a string.
 * @param r Compiled regular expression to execute
 * @param input NUL-terminated string to use as input
 * @param saved Buffer with room for the start/end indices of each captured
 *   string. In other words, the length must be sc_regex_num_captures() * 2. You
 *   may leave this NULL if you don't want the captures.
 * @returns Length of match, or -1 if no match.
 *
 * This function executes the given regex on a nul-terminated input string. If
 * the execution results in a match, then a value of 0 or greater (corresponding
 * to the length of the match) is returned. If there is no match, then the value
 * -1 is returned.
 *
 * For regular expressions which capture substrings, the @a saved parameter is
 * used to store metadata about them. We simply record the index of the first
 * and last character in the saved string into the `saved` buffer provided by
 * the user. If non-null, @a saved must point to a buffer with at least enough
 * space to store indices for all captures in the regex.  For example, if there
 * are three sets of capturing parens, your buffer should be of length six. Use
 * sc_regex_num_captures() * 2 as the buffer size.
 *
 * Captured string indices are stored in order: the numbers at indices 0 and 1
 * correspond to the start and end index of capture 0 within @a input. Indices 2
 * and 3 correspond to the start and end index of capture 1 within @a input,
 * etc. Note that the start index is *included* in the string, and the end index
 * is *excluded*. Thus, a captured string has length `end-start`.
 *
 * Users can access captured strings directly by examining indices, or they can
 * use one of two APIs which return copies of the saved substrings. For
 * accessing a single captured string, see sc_regex_get_capture(). For accessing
 * every captured substring, see sc_regex_get_captures().
 */
ssize_t sc_regex_exec(struct sc_regex *r, const char *input, size_t **saved);

/*
 * Return the number of saved index slots required by a regex.
/**
 * @brief Return the number of strings which may be captured by a regex.
 * @param r The regular expression bytecode.
 * @returns Number of strings the regex can capture.
 *
 * @warning Please note that this is the number of **strings** which can be
 * captured by a regex. For the @a saved parameter of sc_regex_exec(), you must
 * allocate an array **twice** this size, because we store two indices per
 * captured string.
 */
size_t sc_regex_num_captures(struct sc_regex *r);

/**
 * Return a single string captured by a regex.
 * @brief Return a single string captured by a regex.
 * @param string The string matched by a regex
 * @param indices The indices of each matched string
 * @param capture_index The index of the captured string
 * @returns A newly-allocated string containing the captured string
 *
 * Given the string which a regex matched, and the list of captured indices
 * (set by sc_regex_exec()), return the captured string at an index. The


@@ 81,32 121,28 @@ size_t sc_regex_num_captures(struct sc_regex *r);
 * string "hello, Stephen!". The captured indices (set in the call to
 * sc_regex_exec()) would be: [7, 14]. Thus, the call `sc_regex_get_capture()`
 * would return the string "Stephen".
 *
 * @param string The string matched by a regex
 * @param indices The indices of each matched string
 * @param capture_index The index of the captured string
 * @returns A newly-allocated string containing the captured string
 */
char *sc_regex_get_capture(const char *string, const size_t *indices,
                           size_t capture_index);
/*
 * Return a list of all captured strings.
/**
 * @brief Return a list of all captured strings.
 * @param s String to get strings from.
 * @param l List of captures returned from sc_regex_exec().
 * @param n Number of captures (use sc_regex_num_captures())
 * @returns A newly allocated array of newly allocated strings.
 *
 * For every captured string, create a newly allocated copy of it, and return
 * them all in a newly allocated array of strings. These strings need to be
 * freed when you're done with them. You can either manually free each string
 * and then the array, or you can use sc_regex_captures_free() to do this for
 * you.
 *
 * @param s String to get strings from.
 * @param l List of captures returned from sc_regex_exec().
 * @param n Number of captures (use sc_regex_num_captures())
 * @returns A new Capture object.
 */
char **sc_regex_get_captures(const char *s, const size_t *l, size_t n);

/*
 * Free a capture list from sc_regex_get_captures()
/**
 * @brief Free a capture list from sc_regex_get_captures()
 * @param cap Captures to free.
 * @param n Number of captures in the array
 *
 * Since the array and strings were all newly allocated by
 * sc_regex_get_captures(), they need to be cleaned up. This function does the
 * cleanup. It's nothing complicated - you


@@ 114,13 150,11 @@ char **sc_regex_get_captures(const char *s, const size_t *l, size_t n);
 * that if you want to keep one of the strings from the capture list, you'll
 * have to set its entry in the array to NULL (so free() does nothing), or else
 * do manual cleanup.
 *
 * @param c Captures to free.
 */
void sc_regex_captures_free(char **cap, size_t n);

/*
 * Free a regex object.  You must do this when you're done with it.
/**
 * @brief Free a regex object.
 * @param r regex to free.
 */
void sc_regex_free(struct sc_regex *r);


@@ 129,38 163,47 @@ void sc_regex_free(struct sc_regex *r);
 * Wide-string API additions
 ***/

/*
 * Compile a regular expression that uses wchar_t strings. It must be freed by
 * sc_regex_free() after. This interface is disabled in favor of
 * sc_regex_wcompile2(). The flags argument is zero by default.
 *
/**
 * @brief Compile a regular expression that uses wchar_t strings.
 * @param regex The text form of the regular expression.
 * @returns The compiled bytecode for the regex.
 * @see sc_regex_compile()
 * @deprecated Please use sc_regex_wcompile2()
 *
 * It must be freed by sc_regex_free() after. This interface is deprecated in
 * favor of sc_regex_wcompile2(). The flags argument is zero by default.
 */
struct sc_regex *sc_regex_wcompile(const wchar_t *regex);

/*
 * Compile a regular expression that uses wchar_t strings. It must be freed by
 * sc_regex_free() after.
 *
/**
 * @brief Compile a regular expression that uses wchar_t strings.
 * @param regex The text form of the regular expression.
 * @param flags Behavior flags
 * @returns The compiled bytecode for the regex.
 * @see sc_regex_compile2()
 *
 * It must be freed by sc_regex_free() after.
 */
struct sc_regex *sc_regex_wcompile2(const wchar_t *regex, int flags);

/*
 * Execute a regex on a string.
/**
 * @brief Execute a regex on a string.
 * @param r Compiled regular expression bytecode to execute.
 * @param input Text to use as input.
 * @param saved Out pointer for captured indices.
 * @returns Length of match, or -1 if no match.
 * @see sc_regex_exec
 */
ssize_t sc_regex_wexec(struct sc_regex *r, const wchar_t *input,
                       size_t **saved);

/**
 * Return a single string captured by a regex.
 * @brief Return a single wide string captured by a regex.
 * @param string The string matched by a regex
 * @param indices The indices of each matched string
 * @param capture_index The index of the captured string
 * @returns A newly-allocated string containing the captured string
 * @see sc_regex_get_capture
 *
 * Given the string which a regex matched, and the list of captured indices
 * (set by sc_regex_exec()), return the captured string at an index. The


@@ 168,40 211,36 @@ ssize_t sc_regex_wexec(struct sc_regex *r, const wchar_t *input,
 *
 * As a complete example: given the regex "hello, (\w+)!" and the matching
 * string "hello, Stephen!". The captured indices (set in the call to
 * sc_regex_exec()) would be: [7, 14]. Thus, the call `sc_regex_get_capture()`
 * sc_regex_exec()) would be: [7, 14]. Thus, the call sc_regex_get_wcapture()
 * would return the string "Stephen".
 *
 * @param string The string matched by a regex
 * @param indices The indices of each matched string
 * @param capture_index The index of the captured string
 * @returns A newly-allocated string containing the captured string
 */
wchar_t *sc_regex_get_wcapture(const wchar_t *string, const size_t *indices,
                               size_t capture_index);
/*
 * Convert a string and a capture list into a list of strings.
/**
 * @brief Convert a string and a capture list into a list of strings.
 * @param s String to get strings from.
 * @param l List of captures returned from sc_regex_wexec().
 * @param n Number of saves - use sc_regex_num_captures() if you don't know.
 * @returns A newly allocated array of newly allocated wide strings.
 * @see sc_regex_get_captures
 *
 * This copies each capture into a newly allocated string, and returns them all
 * in a newly allocated array of strings. These things need to be freed when
 * you're done with them. You can either manually free each string and then the
 * array, or you can use sc_regex_wcaptures_free() to do this for you.
 *
 * @param s String to get strings from.
 * @param l List of captures returned from sc_regex_wexec().
 * @param n Number of saves - use sc_regex_num_captures() if you don't know.
 * @returns A new sc_regex_wcaptures object.
 */
wchar_t **sc_regex_get_wcaptures(const wchar_t *s, const size_t *l, size_t n);

/**
 * Free a capture list from sc_regex_get_wcaptures()
 * @brief Free a capture list from sc_regex_get_wcaptures()
 * @param cap Captures to free.
 * @param n Number of captures
 * @see sc_regex_captures_free
 *
 * Frees each string in the array, then frees the struct itself. Should you wish
 * to keep one of the strings, set its index to NULL to prevent it from being
 * freed. Otherwise, you may manually free captures in any way you see fit, so
 * long as all strings, and the list, are freed.
 *
 * @param c Captures to free.
 */
void sc_regex_wcaptures_free(wchar_t **cap, size_t n);



@@ 217,7 256,11 @@ void sc_regex_wcaptures_free(wchar_t **cap, size_t n);
 * into the bytecode, which is fun and interesting.
 ***/

/*
/**
 * @brief Read a program (regex) from a string
 * @param str The text representation of the program.
 * @returns The bytecode of the program.
 *
 * Read in a program from a string.  This takes the "assembly like"
 * representation and turns it into compiled instructions.  Every instruction
 * must be on a single line, and spaces are used as delimiters.  Also, labels


@@ 229,21 272,18 @@ void sc_regex_wcaptures_free(wchar_t **cap, size_t n);
 *         split L1 L2
 *     L2:
 *         match
 *
 * @param str The text representation of the program.
 * @returns The bytecode of the program.
 */
struct sc_regex *sc_regex_read(char *str);

/*
 * Reads in a program from a file instead of a string.
/**
 * @brief Reads in a program (regex) from a file containing instructions
 * @param f File to read from.
 * @returns The regex bytecode.
 */
struct sc_regex *sc_regex_fread(FILE *f);

/*
 * Writes a program to the same format as the reread() functions do.
/**
 * @brief Writes program (regex) instructions in the format read by sc_regex_read()
 * @param r The regex to write.
 * @param f The file to write to.
 */