~pierrenn/ripgrep

5d0e666794610bfb858678d41d9fee309acf4b89 — pierrenn 4 months ago 655e332
add hyperscan support and options
A .gitmodules => .gitmodules +3 -0
@@ 0,0 1,3 @@
[submodule "crates/hyperscan"]
	path = crates/hyperscan
	url = https://git.sr.ht/~pierrenn/grep-hyperscan

M Cargo.toml => Cargo.toml +7 -5
@@ 1,17 1,17 @@
[package]
name = "ripgrep"
version = "12.0.0"  #:version
authors = ["Andrew Gallant <jamslam@gmail.com>"]
authors = ["Andrew Gallant <jamslam@gmail.com>", "pnn <git@pnn.sh>"]
description = """
ripgrep is a line-oriented search tool that recursively searches your current
directory for a regex pattern while respecting your gitignore rules. ripgrep
has first class support on Windows, macOS and Linux.
"""
documentation = "https://github.com/BurntSushi/ripgrep"
homepage = "https://github.com/BurntSushi/ripgrep"
repository = "https://github.com/BurntSushi/ripgrep"
documentation = "https://git.sr.ht/~pierrenn/ripgrep"
homepage = "https://git.sr.ht/~pierrenn/ripgrep"
repository = "https://git.sr.ht/~pierrenn/ripgrep"
readme = "README.md"
keywords = ["regex", "grep", "egrep", "search", "pattern"]
keywords = ["regex", "grep", "egrep", "search", "pattern", "hyperscan"]
categories = ["command-line-utilities", "text-processing"]
license = "Unlicense OR MIT"
exclude = ["HomebrewFormula"]


@@ 37,6 37,7 @@ members = [
  "crates/globset",
  "crates/grep",
  "crates/cli",
  "crates/hyperscan",
  "crates/matcher",
  "crates/pcre2",
  "crates/printer",


@@ 79,6 80,7 @@ walkdir = "2"

[features]
simd-accel = ["grep/simd-accel"]
hyperscan = ["grep/hyperscan"]
pcre2 = ["grep/pcre2"]

[profile.release]

M README.md => README.md +51 -392
@@ 1,432 1,91 @@
ripgrep (rg)
------------
ripgrep is a line-oriented search tool that recursively searches your current
directory for a regex pattern. By default, ripgrep will respect your .gitignore
and automatically skip hidden files/directories and binary files. ripgrep
has first class support on Windows, macOS and Linux, with binary downloads
available for [every release](https://github.com/BurntSushi/ripgrep/releases).
ripgrep is similar to other popular search tools like The Silver Searcher, ack
and grep.
'hyper'grep - a fork of ripgrep (rg) with hyperscan support
-----------------------------------------------------------

[![Build status](https://github.com/BurntSushi/ripgrep/workflows/ci/badge.svg)](https://github.com/BurntSushi/ripgrep/actions)
[![Crates.io](https://img.shields.io/crates/v/ripgrep.svg)](https://crates.io/crates/ripgrep)
[![Packaging status](https://repology.org/badge/tiny-repos/ripgrep.svg)](https://repology.org/project/ripgrep/badges)
This is a fork of [`ripgrep`](https://github.com/BurntSushi/ripgrep) adding support for [hyperscan](https://github.com/intel/hyperscan).

Dual-licensed under MIT or the [UNLICENSE](https://unlicense.org).
This fork can be useful when all of the following conditions hold:

1. you have at least several hundred regexps to match simultaneously
2. you have several dozen GB of data on a fast (>500 MB/s) disk and few CPUs (<2) to spend on the regexp search
3. your regexps rarely change, while the data to search changes often

### CHANGELOG

Please see the [CHANGELOG](CHANGELOG.md) for a release history.
The fork was born out of the need to extract a bunch of [fediverse](https://git.sr.ht/~pierrenn/twitter_escape) addresses from scraped web pages.

### Documentation quick links
We only describe here the differences between this fork and the original `ripgrep`. Please refer to the [original `readme`](https://github.com/BurntSushi/ripgrep) for complete information.

* [Installation](#installation)
* [User Guide](GUIDE.md)
* [Frequently Asked Questions](FAQ.md)
* [Regex syntax](https://docs.rs/regex/1/regex/#syntax)
* [Configuration files](GUIDE.md#configuration-file)
* [Shell completions](FAQ.md#complete)
* [Building](#building)
* [Translations](#translations)

### Installation / Building

### Screenshot of search results
Unlike `ripgrep`, we only offer basic installation facilities, e.g. using `cargo`. You can [install `cargo`](https://doc.rust-lang.org/cargo/getting-started/installation.html) if you don't have it already.

[![A screenshot of a sample search with ripgrep](https://burntsushi.net/stuff/ripgrep1.png)](https://burntsushi.net/stuff/ripgrep1.png)
Before building, please read the [building section of the original readme](https://github.com/BurntSushi/ripgrep#building).

Don't forget to install the `hyperscan` library and sources on your system first. Most distributions provide ready-to-go packages (e.g. `libhyperscan-dev` on Debian/Ubuntu) or you can [compile it from source](http://intel.github.io/hyperscan/dev-reference/getting_started.html).

### Quick examples comparing tools
Note that in some environments (e.g. AWS EC2), if you compile `hyperscan` from source you need to add `-fPIC` to the library's compilation flags.

This example searches the entire
[Linux kernel source tree](https://github.com/BurntSushi/linux)
(after running `make defconfig && make -j8`) for `[A-Z]+_SUSPEND`, where
all matches must be words. Timings were collected on a system with an Intel
i7-6900K 3.2 GHz.

Please remember that a single benchmark is never enough! See my
[blog post on ripgrep](https://blog.burntsushi.net/ripgrep/)
for a very detailed comparison with more benchmarks and analysis.

| Tool | Command | Line count | Time |
| ---- | ------- | ---------- | ---- |
| ripgrep (Unicode) | `rg -n -w '[A-Z]+_SUSPEND'` | 452 | **0.136s** |
| [git grep](https://www.kernel.org/pub/software/scm/git/docs/git-grep.html) | `git grep -P -n -w '[A-Z]+_SUSPEND'` | 452 | 0.348s |
| [ugrep (Unicode)](https://github.com/Genivia/ugrep) | `ugrep -r --ignore-files --no-hidden -I -w '[A-Z]+_SUSPEND'` | 452 | 0.506s |
| [git grep](https://www.kernel.org/pub/software/scm/git/docs/git-grep.html) | `LC_ALL=C git grep -E -n -w '[A-Z]+_SUSPEND'` | 452 | 1.150s |
| [The Silver Searcher](https://github.com/ggreer/the_silver_searcher) | `ag -w '[A-Z]+_SUSPEND'` | 452 | 0.654s |
| [ack](https://github.com/beyondgrep/ack3) | `ack -w '[A-Z]+_SUSPEND'` | 452 | 4.054s |
| [git grep (Unicode)](https://www.kernel.org/pub/software/scm/git/docs/git-grep.html) | `LC_ALL=en_US.UTF-8 git grep -E -n -w '[A-Z]+_SUSPEND'` | 452 | 4.205s |

Here's another benchmark on the same corpus as above that disregards gitignore
files and searches with a whitelist instead. The corpus is the same as in the
previous benchmark, and the flags passed to each command ensure that they are
doing equivalent work:

| Tool | Command | Line count | Time |
| ---- | ------- | ---------- | ---- |
| ripgrep | `rg -uuu -tc -n -w '[A-Z]+_SUSPEND'` | 388 | **0.096s** |
| [ugrep](https://github.com/Genivia/ugrep) | `ugrep -r -n --include='*.c' --include='*.h' -w '[A-Z]+_SUSPEND'` | 388 | 0.493s |
| [GNU grep](https://www.gnu.org/software/grep/) | `egrep -r -n --include='*.c' --include='*.h' -w '[A-Z]+_SUSPEND'` | 388 | 0.806s |

And finally, a straight-up comparison between ripgrep, ugrep and GNU grep on a
single large file cached in memory
(~13GB, [`OpenSubtitles.raw.en.gz`](http://opus.nlpl.eu/download.php?f=OpenSubtitles/v2018/mono/OpenSubtitles.raw.en.gz)):

| Tool | Command | Line count | Time |
| ---- | ------- | ---------- | ---- |
| ripgrep | `rg -w 'Sherlock [A-Z]\w+'` | 7882 | **2.769s** |
| [ugrep](https://github.com/Genivia/ugrep) | `ugrep -w 'Sherlock [A-Z]\w+'` | 7882 | 6.802s |
| [GNU grep](https://www.gnu.org/software/grep/) | `LC_ALL=en_US.UTF-8 egrep -w 'Sherlock [A-Z]\w+'` | 7882 | 9.027s |

In the above benchmark, passing the `-n` flag (for showing line numbers)
increases the times to `3.423s` for ripgrep and `13.031s` for GNU grep. ugrep
times are unaffected by the presence or absence of `-n`.


### Why should I use ripgrep?

* It can replace many use cases served by other search tools
  because it contains most of their features and is generally faster. (See
  [the FAQ](FAQ.md#posix4ever) for more details on whether ripgrep can truly
  replace grep.)
* Like other tools specialized to code search, ripgrep defaults to recursive
  directory search and won't search files ignored by your
  `.gitignore`/`.ignore`/`.rgignore` files. It also ignores hidden and binary
  files by default. ripgrep also implements full support for `.gitignore`,
  whereas there are many bugs related to that functionality in other code
  search tools claiming to provide the same functionality.
* ripgrep can search specific types of files. For example, `rg -tpy foo`
  limits your search to Python files and `rg -Tjs foo` excludes Javascript
  files from your search. ripgrep can be taught about new file types with
  custom matching rules.
* ripgrep supports many features found in `grep`, such as showing the context
  of search results, searching multiple patterns, highlighting matches with
  color and full Unicode support. Unlike GNU grep, ripgrep stays fast while
  supporting Unicode (which is always on).
* ripgrep has optional support for switching its regex engine to use PCRE2.
  Among other things, this makes it possible to use look-around and
  backreferences in your patterns, which are not supported in ripgrep's default
  regex engine. PCRE2 support can be enabled with `-P/--pcre2` (use PCRE2
  always) or `--auto-hybrid-regex` (use PCRE2 only if needed). An alternative
  syntax is provided via the `--engine (default|pcre2|auto-hybrid)` option.
* ripgrep supports searching files in text encodings other than UTF-8, such
  as UTF-16, latin-1, GBK, EUC-JP, Shift_JIS and more. (Some support for
  automatically detecting UTF-16 is provided. Other text encodings must be
  specifically specified with the `-E/--encoding` flag.)
* ripgrep supports searching files compressed in a common format (brotli,
  bzip2, gzip, lz4, lzma, xz, or zstandard) with the `-z/--search-zip` flag.
* ripgrep supports arbitrary input preprocessing filters which could be PDF
  text extraction, less supported decompression, decrypting, automatic encoding
  detection and so on.

In other words, use ripgrep if you like speed, filtering by default, fewer
bugs and Unicode support.


### Why shouldn't I use ripgrep?

Despite initially not wanting to add every feature under the sun to ripgrep,
over time, ripgrep has grown support for most features found in other file
searching tools. This includes searching for results spanning across multiple
lines, and opt-in support for PCRE2, which provides look-around and
backreference support.

At this point, the primary reasons not to use ripgrep probably consist of one
or more of the following:

* You need a portable and ubiquitous tool. While ripgrep works on Windows,
  macOS and Linux, it is not ubiquitous and it does not conform to any
  standard such as POSIX. The best tool for this job is good old grep.
* There still exists some other feature (or bug) not listed in this README that
  you rely on that's in another tool that isn't in ripgrep.
* There is a performance edge case where ripgrep doesn't do well where another
  tool does do well. (Please file a bug report!)
* ripgrep isn't possible to install on your machine or isn't available for your
  platform. (Please file a bug report!)


### Is it really faster than everything else?

Generally, yes. A large number of benchmarks with detailed analysis for each is
[available on my blog](https://blog.burntsushi.net/ripgrep/).

Summarizing, ripgrep is fast because:

* It is built on top of
  [Rust's regex engine](https://github.com/rust-lang/regex).
  Rust's regex engine uses finite automata, SIMD and aggressive literal
  optimizations to make searching very fast. (PCRE2 support can be opted into
  with the `-P/--pcre2` flag.)
* Rust's regex library maintains performance with full Unicode support by
  building UTF-8 decoding directly into its deterministic finite automaton
  engine.
* It supports searching with either memory maps or by searching incrementally
  with an intermediate buffer. The former is better for single files and the
  latter is better for large directories. ripgrep chooses the best searching
  strategy for you automatically.
* Applies your ignore patterns in `.gitignore` files using a
  [`RegexSet`](https://docs.rs/regex/1/regex/struct.RegexSet.html).
  That means a single file path can be matched against multiple glob patterns
  simultaneously.
* It uses a lock-free parallel recursive directory iterator, courtesy of
  [`crossbeam`](https://docs.rs/crossbeam) and
  [`ignore`](https://docs.rs/ignore).


### Feature comparison

Andy Lester, author of [ack](https://beyondgrep.com/), has published an
excellent table comparing the features of ack, ag, git-grep, GNU grep and
ripgrep: https://beyondgrep.com/feature-comparison/

Note that ripgrep has grown a few significant new features recently that
are not yet present in Andy's table. This includes, but is not limited to,
configuration files, passthru, support for searching compressed files,
multiline search and opt-in fancy regex support via PCRE2.


### Installation

The binary name for ripgrep is `rg`.

**[Archives of precompiled binaries for ripgrep are available for Windows,
macOS and Linux.](https://github.com/BurntSushi/ripgrep/releases)** Users of
platforms not explicitly mentioned below are advised to download one of these
archives.

Linux binaries are static executables. Windows binaries are available either as
built with MinGW (GNU) or with Microsoft Visual C++ (MSVC). When possible,
prefer MSVC over GNU, but you'll need to have the [Microsoft VC++ 2015
redistributable](https://www.microsoft.com/en-us/download/details.aspx?id=48145)
installed.

If you're a **macOS Homebrew** or a **Linuxbrew** user, then you can install
ripgrep from homebrew-core:

```
$ brew install ripgrep
```

If you're a **MacPorts** user, then you can install ripgrep from the
[official ports](https://www.macports.org/ports.php?by=name&substr=ripgrep):

```
$ sudo port install ripgrep
```

If you're a **Windows Chocolatey** user, then you can install ripgrep from the
[official repo](https://chocolatey.org/packages/ripgrep):

```
$ choco install ripgrep
```

If you're a **Windows Scoop** user, then you can install ripgrep from the
[official bucket](https://github.com/ScoopInstaller/Main/blob/master/bucket/ripgrep.json):

```
$ scoop install ripgrep
```

If you're an **Arch Linux** user, then you can install ripgrep from the official repos:

```
$ pacman -S ripgrep
```

If you're a **Gentoo** user, you can install ripgrep from the
[official repo](https://packages.gentoo.org/packages/sys-apps/ripgrep):

```
$ emerge sys-apps/ripgrep
```

If you're a **Fedora** user, you can install ripgrep from official
repositories.

```
$ sudo dnf install ripgrep
```

If you're an **openSUSE** user, ripgrep is included in **openSUSE Tumbleweed**
and **openSUSE Leap** since 15.1.

```
$ sudo zypper install ripgrep
```

If you're a **RHEL/CentOS 7/8** user, you can install ripgrep from
[copr](https://copr.fedorainfracloud.org/coprs/carlwgeorge/ripgrep/):

```
$ sudo yum-config-manager --add-repo=https://copr.fedorainfracloud.org/coprs/carlwgeorge/ripgrep/repo/epel-7/carlwgeorge-ripgrep-epel-7.repo
$ sudo yum install ripgrep
```

If you're a **Nix** user, you can install ripgrep from
[nixpkgs](https://github.com/NixOS/nixpkgs/blob/master/pkgs/tools/text/ripgrep/default.nix):

```
$ nix-env --install ripgrep
$ # (Or using the attribute name, which is also ripgrep.)
```

If you're a **Debian** user (or a user of a Debian derivative like **Ubuntu**),
then ripgrep can be installed using a binary `.deb` file provided in each
[ripgrep release](https://github.com/BurntSushi/ripgrep/releases).

```
$ curl -LO https://github.com/BurntSushi/ripgrep/releases/download/11.0.2/ripgrep_11.0.2_amd64.deb
$ sudo dpkg -i ripgrep_11.0.2_amd64.deb
```

If you run Debian Buster (currently Debian stable) or Debian sid, ripgrep is
[officially maintained by Debian](https://tracker.debian.org/pkg/rust-ripgrep).
```
$ sudo apt-get install ripgrep
```

If you're an **Ubuntu Cosmic (18.10)** (or newer) user, ripgrep is
[available](https://launchpad.net/ubuntu/+source/rust-ripgrep) using the same
packaging as Debian:

```
$ sudo apt-get install ripgrep
```

(N.B. Various snaps for ripgrep on Ubuntu are also available, but none of them
seem to work right and generate a number of very strange bug reports that I
don't know how to fix and don't have the time to fix. Therefore, it is no
longer a recommended installation option.)

If you're a **FreeBSD** user, then you can install ripgrep from the
[official ports](https://www.freshports.org/textproc/ripgrep/):

```
# pkg install ripgrep
```

If you're an **OpenBSD** user, then you can install ripgrep from the
[official ports](http://openports.se/textproc/ripgrep):
Finally, check out this repository and compile the fork:

```
$ doas pkg_add ripgrep
```

If you're a **NetBSD** user, then you can install ripgrep from
[pkgsrc](http://pkgsrc.se/textproc/ripgrep):

```
# pkgin install ripgrep
```

If you're a **Haiku x86_64** user, then you can install ripgrep from the
[official ports](https://github.com/haikuports/haikuports/tree/master/sys-apps/ripgrep):

```
$ pkgman install ripgrep
```

If you're a **Haiku x86_gcc2** user, then you can install ripgrep from the
same port as Haiku x86_64 using the x86 secondary architecture build:

```
$ pkgman install ripgrep_x86
```

If you're a **Rust programmer**, ripgrep can be installed with `cargo`.

* Note that the minimum supported version of Rust for ripgrep is **1.34.0**,
  although ripgrep may work with older versions.
* Note that the binary may be bigger than expected because it contains debug
  symbols. This is intentional. To remove debug symbols and therefore reduce
  the file size, run `strip` on the binary.

```
$ cargo install ripgrep
$ git clone http://git.sr.ht/~pierrenn/ripgrep
$ cd ripgrep
$ git submodule update --init --recursive
$ cargo install --path . --features 'hyperscan,pcre2' # if you want all 3 engines: default,pcre2,hyperscan
$ # cargo install --path . --features 'hyperscan' # or if you want only 2 engines: default,hyperscan
```

And don't forget to add Cargo's bin directory to your path.

### Building

ripgrep is written in Rust, so you'll need to grab a
[Rust installation](https://www.rust-lang.org/) in order to compile it.
ripgrep compiles with Rust 1.34.0 (stable) or newer. In general, ripgrep tracks
the latest stable release of the Rust compiler.
Note that the binary name for this fork of ripgrep is also `rg`, so it will overwrite the original binary. Since we only add functionality, this shouldn't be a problem.

To build ripgrep:
### Features only available in this fork

```
$ git clone https://github.com/BurntSushi/ripgrep
$ cd ripgrep
$ cargo build --release
$ ./target/release/rg --version
0.1.3
```
TLDR: We just add a new engine named `hyperscan` to ripgrep.

If you have a Rust nightly compiler and a recent Intel CPU, then you can enable
additional optional SIMD acceleration like so:

To use it:
```
RUSTFLAGS="-C target-cpu=native" cargo build --release --features 'simd-accel'
$ rg --engine hyperscan "my pattern" my_file
```

The `simd-accel` feature enables SIMD support in certain ripgrep dependencies
(responsible for transcoding). They are not necessary to get SIMD optimizations
for search; those are enabled automatically. Hopefully, some day, the
`simd-accel` feature will similarly become unnecessary. **WARNING:** Currently,
enabling this option can increase compilation times dramatically.

Finally, optional PCRE2 support can be built with ripgrep by enabling the
`pcre2` feature:

or via a file:
```
$ cargo build --release --features 'pcre2'
$ rg --engine hyperscan -f myregexps my_file
```

(Tip: use `--features 'pcre2 simd-accel'` to also include compile time SIMD
optimizations, which will only work with a nightly compiler.)

Enabling the PCRE2 feature works with a stable Rust compiler and will
attempt to automatically find and link with your system's PCRE2 library via
`pkg-config`. If one doesn't exist, then ripgrep will build PCRE2 from source
using your system's C compiler and then statically link it into the final
executable. Static linking can be forced even when there is an available PCRE2
system library by either building ripgrep with the MUSL target or by setting
`PCRE2_SYS_STATIC=1`.

ripgrep can be built with the MUSL target on Linux by first installing the MUSL
library on your system (consult your friendly neighborhood package manager).
Then you just need to add MUSL support to your Rust toolchain and rebuild
ripgrep, which yields a fully static executable:

where `myregexps` is either a compiled hyperscan DB or a list of regexps in the standard format or the hyperscan format, e.g.:
```
$ rustup target add x86_64-unknown-linux-musl
$ cargo build --release --target x86_64-unknown-linux-musl
some default regexp
/some hyperscan regexp/imsHV8WcQ
```

Applying the `--features` flag from above works as expected. If you want to
build a static executable with MUSL and with PCRE2, then you will need to have
`musl-gcc` installed, which might be in a separate package from the actual
MUSL library, depending on your Linux distribution.
where `imsHV8WcQ` can be any subset of the following (case-sensitive) options:

- 'i' : `HS_FLAG_CASELESS`
- 'm' : `HS_FLAG_MULTILINE`
- 's' : `HS_FLAG_DOTALL`
- 'H' : `HS_FLAG_SINGLEMATCH`
- 'V' : `HS_FLAG_ALLOWEMPTY`
- '8' : `HS_FLAG_UTF8`
- 'W' : `HS_FLAG_UCP`
- 'C' : `HS_FLAG_COMBINATION`
- 'Q' : `HS_FLAG_QUIET`
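
As an illustration of the `/regexp/flags` format above, here is a minimal Rust sketch that splits one line of a pattern file into a pattern and its flags. It is not the parser used by the `grep-hyperscan` crate (which is not shown here): the `HsFlag` enum, `flag_from_char` and `parse_line` are hypothetical names, and the fallback rule (treat anything that is not a well-formed `/regexp/flags` line as a default-format regexp) is an assumption.

```rust
/// Hypothetical names for the flag letters listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum HsFlag {
    Caseless,
    Multiline,
    DotAll,
    SingleMatch,
    AllowEmpty,
    Utf8,
    Ucp,
    Combination,
    Quiet,
}

/// Maps one (case-sensitive) flag letter to its flag, if it is known.
fn flag_from_char(c: char) -> Option<HsFlag> {
    match c {
        'i' => Some(HsFlag::Caseless),
        'm' => Some(HsFlag::Multiline),
        's' => Some(HsFlag::DotAll),
        'H' => Some(HsFlag::SingleMatch),
        'V' => Some(HsFlag::AllowEmpty),
        '8' => Some(HsFlag::Utf8),
        'W' => Some(HsFlag::Ucp),
        'C' => Some(HsFlag::Combination),
        'Q' => Some(HsFlag::Quiet),
        _ => None,
    }
}

/// Splits a line into (pattern, flags).
///
/// A line shaped like `/pattern/flags` with only known flag letters is
/// treated as hyperscan-format. Anything else is returned unchanged with
/// no flags (assumed fallback, see the text above).
fn parse_line(line: &str) -> (String, Vec<HsFlag>) {
    if let Some(rest) = line.strip_prefix('/') {
        if let Some(end) = rest.rfind('/') {
            let (pattern, tail) = (&rest[..end], &rest[end + 1..]);
            // Collecting into Option<Vec<_>> yields None on any unknown letter.
            let flags: Option<Vec<HsFlag>> =
                tail.chars().map(flag_from_char).collect();
            if let Some(flags) = flags {
                return (pattern.to_string(), flags);
            }
        }
    }
    (line.to_string(), Vec::new())
}
```

With this sketch, `parse_line("/foo.*bar/is")` yields the pattern `foo.*bar` with the caseless and dotall flags, while `parse_line("some default regexp")` returns the line untouched.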

### Running tests
We also provide the options `--hyper-allow-empty`, `--hyper-utf8` and `--hyper-ucp` to set the corresponding flag on every textual regular expression provided to the DB (they are ignored if you provide a compiled DB, as we don't support editing a DB). The default `ripgrep` options for caseless, multiline and dotall matching also override the options of all regexps (again, except when using a compiled hyperscan DB).
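
One way to picture these overrides is as bits OR-ed into each regexp's own flags. The sketch below is only an illustration under assumptions: `HyperOverrides` and `apply_overrides` are hypothetical names, and the numeric constants are the values we believe hyperscan's `hs_compile.h` defines, so check them against your installed header before relying on them.

```rust
// Assumed to match hyperscan's hs_compile.h; verify against your header.
const HS_FLAG_ALLOWEMPTY: u32 = 16;
const HS_FLAG_UTF8: u32 = 32;
const HS_FLAG_UCP: u32 = 64;

/// Hypothetical holder for the three command line switches.
#[derive(Default)]
struct HyperOverrides {
    allow_empty: bool,
    utf8: bool,
    ucp: bool,
}

/// Returns the regexp's flags with every requested override forced on.
/// Flags that are not overridden keep the value the regexp asked for.
fn apply_overrides(pattern_flags: u32, overrides: &HyperOverrides) -> u32 {
    let mut flags = pattern_flags;
    if overrides.allow_empty {
        flags |= HS_FLAG_ALLOWEMPTY;
    }
    if overrides.utf8 {
        flags |= HS_FLAG_UTF8;
    }
    if overrides.ucp {
        flags |= HS_FLAG_UCP;
    }
    flags
}
```

Since a compiled DB has no per-regexp text left to adjust, a step like this can only run before compilation, which is consistent with the switches being ignored for compiled DBs.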

ripgrep is relatively well-tested, including both unit tests and integration
tests. To run the full test suite, use:
Finally, you can also save a compiled DB to disk. This can be useful because ripgrep sometimes spends most of its time compiling the DB (on a single core).
Use the `-d/--hyper-write` parameter to save the DB to disk before starting the search:

```
$ cargo test --all
$ # tell rg to read the myregexps text file, compile the regexps, write them to db.hs and finally search my_file
$ rg --engine hyperscan -f myregexps -d db.hs my_file
$
$ # now tell rg to directly read the compiled DB and search my_file2 - this will be quicker
$ rg --engine hyperscan -f db.hs my_file2
```
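
The same compile-once, reuse-later idea can be sketched in Rust with only the standard library. This is not how `rg` implements `-d/--hyper-write` (here the cache is reused automatically, whereas `rg` expects you to pass the DB explicitly via `-f`); `load_or_compile` is a hypothetical helper and the `compile` closure stands in for the slow hyperscan compilation step.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Returns the compiled database bytes, reusing `db_path` when it exists.
///
/// `compile` is a placeholder for the expensive compilation step; it is
/// only called when no saved database is found.
fn load_or_compile<F>(db_path: &Path, compile: F) -> io::Result<Vec<u8>>
where
    F: FnOnce() -> Vec<u8>,
{
    if db_path.exists() {
        // Fast path: read the previously saved database.
        return fs::read(db_path);
    }
    // Slow path: compile, then save the result for the next run.
    let bytes = compile();
    fs::write(db_path, &bytes)?;
    Ok(bytes)
}
```

Note that a saved DB goes stale as soon as the regexps change, which is why this approach fits the "regexps rarely change" condition mentioned at the top.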

from the repository root.


### Translations
### Others

The following is a list of known translations of ripgrep's documentation. These
are unofficially maintained and may not be up to date.

* [Chinese](https://github.com/chinanf-boy/ripgrep-zh#%E6%9B%B4%E6%96%B0-)
Please refer to the [original `readme`](https://github.com/BurntSushi/ripgrep)

M crates/core/app.rs => crates/core/app.rs +74 -6
@@ 592,6 592,10 @@ pub fn all_args_and_flags() -> Vec<RGArg> {
    flag_glob_case_insensitive(&mut args);
    flag_heading(&mut args);
    flag_hidden(&mut args);
    flag_hyper_write(&mut args);
    flag_hyper_allowempty(&mut args);
    flag_hyper_utf8(&mut args);
    flag_hyper_ucp(&mut args);
    flag_iglob(&mut args);
    flag_ignore_case(&mut args);
    flag_ignore_file(&mut args);


@@ 1214,17 1218,20 @@ Specify which regular expression engine to use. When you choose a regex engine,
it applies that choice for every regex provided to ripgrep (e.g., via multiple
-e/--regexp or -f/--file flags).

Accepted values are 'default', 'pcre2', or 'auto'.
Accepted values are 'default', 'pcre2', 'hyperscan' or 'auto'.

The default value is 'default', which is the fastest and should be good for
most use cases. The 'pcre2' engine is generally useful when you want to use
features such as look-around or backreferences. 'auto' will dynamically choose
between supported regex engines depending on the features used in a pattern on
a best effort basis.
features such as look-around or backreferences. The 'hyperscan' engine is
useful when you have several hundred regexps to match and the other engines
are too slow or generate an automaton too big for your machine.

'auto' will dynamically choose between supported regex engines depending on
the features used in a pattern on a best effort basis.

Note that the 'pcre2' engine is an optional ripgrep feature. If PCRE2 wasn't
including in your build of ripgrep, then using this flag will result in ripgrep
printing an error message and exiting.
printing an error message and exiting. The same goes for 'hyperscan'.

This overrides previous uses of --pcre2 and --auto-hybrid-regex flags.
"


@@ 1232,7 1239,7 @@ This overrides previous uses of --pcre2 and --auto-hybrid-regex flags.
    let arg = RGArg::flag("engine", "ENGINE")
        .help(SHORT)
        .long_help(LONG)
        .possible_values(&["default", "pcre2", "auto"])
        .possible_values(&["default", "pcre2", "auto", "hyperscan"])
        .default_value("default")
        .overrides("pcre2")
        .overrides("no-pcre2")


@@ 1460,6 1467,67 @@ This flag can be disabled with --no-hidden.
    args.push(arg);
}

fn flag_hyper_allowempty(args: &mut Vec<RGArg>) {
    const SHORT: &str = "Set HS_FLAG_ALLOWEMPTY on all hyperscan regexps.";
    const LONG: &str = long!(
        "\
This option is for the hyperscan engine only.

It sets the flag HS_FLAG_ALLOWEMPTY on every regexp, overriding its own setting, if any.
"
    );
    let arg = RGArg::switch("hyper-allow-empty")
        .help(SHORT)
        .long_help(LONG);
    args.push(arg);
}

fn flag_hyper_utf8(args: &mut Vec<RGArg>) {
    const SHORT: &str = "Set HS_FLAG_UTF8 on all hyperscan regexps.";
    const LONG: &str = long!(
        "\
This option is for the hyperscan engine only.

It sets the flag HS_FLAG_UTF8 on every regexp, overriding its own setting, if any.
"
    );
    let arg = RGArg::switch("hyper-utf8")
        .help(SHORT)
        .long_help(LONG);
    args.push(arg);
}

fn flag_hyper_ucp(args: &mut Vec<RGArg>) {
    const SHORT: &str = "Set HS_FLAG_UCP on all hyperscan regexps.";
    const LONG: &str = long!(
        "\
This option is for the hyperscan engine only.

It sets the flag HS_FLAG_UCP on every regexp, overriding its own setting, if any.
"
    );
    let arg = RGArg::switch("hyper-ucp")
        .help(SHORT)
        .long_help(LONG);
    args.push(arg);
}

fn flag_hyper_write(args: &mut Vec<RGArg>) {
    const SHORT: &str = "Write the compiled hyperscan db to DBFILE.";
    const LONG: &str = long!(
        "\
This option is for the hyperscan engine only.

Once the provided regexps are compiled, it writes the compiled hyperscan database
into the file DBFILE.
"
    );
    let arg = RGArg::flag("hyper-write", "DBFILE")
        .short("d")
        .help(SHORT)
        .long_help(LONG);
    args.push(arg);
}
fn flag_iglob(args: &mut Vec<RGArg>) {
    const SHORT: &str = "Include or exclude files case insensitively.";
    const LONG: &str = long!(

M crates/core/args.rs => crates/core/args.rs +52 -17
@@ 37,6 37,7 @@ use termcolor::{BufferWriter, ColorChoice, WriteColor};

use crate::app;
use crate::config;
use crate::hyperscan::suggest_hyperscan;
use crate::logger::Logger;
use crate::messages::{set_ignore_messages, set_messages};
use crate::path_printer::{PathPrinter, PathPrinterBuilder};


@@ 100,6 101,9 @@ struct ArgsImp {
    /// The patterns provided at the command line and/or via the -f/--file
    /// flag. This may be empty.
    patterns: Vec<String>,
    /// An optional compiled db (raw bytes rather than text patterns) provided
    /// to rg via the -f/--file flag. This is `None` when no db was given.
    compiled_db: Option<Vec<u8>>,
    /// A matcher built from the patterns.
    ///
    /// It's important that this is only built once, since building this goes


@@ 171,6 175,12 @@ impl Args {
        &self.0.patterns
    }

    /// Return the compiled db, if any, that was read via the -f/--file
    /// flag.
    fn compiled_db(&self) -> Option<&[u8]> {
        self.0.compiled_db.as_deref()
    }

    /// Return the matcher builder from the patterns.
    fn matcher(&self) -> &PatternMatcher {
        &self.0.matcher


@@ 240,7 250,7 @@ impl Args {
            } else {
                Command::FilesParallel
            }
        } else if self.matches().can_never_match(self.patterns()) {
        } else if self.matches().can_never_match(self.patterns(), self.compiled_db()) {
            Command::SearchNever
        } else if one_thread {
            Command::Search


@@ 355,7 365,7 @@ impl Args {
/// `ArgMatches` wraps `clap::ArgMatches` and provides semantic meaning to
/// the parsed arguments.
#[derive(Clone, Debug)]
struct ArgMatches(clap::ArgMatches<'static>);
pub(crate) struct ArgMatches(clap::ArgMatches<'static>);

/// The output format. Generally, this corresponds to the printer that ripgrep
/// uses to show search results.


@@ 548,8 558,9 @@ impl ArgMatches {
    /// configuration structure.
    fn to_args(self) -> Result<Args> {
        // We compute these once since they could be large.
        let patterns = self.patterns()?;
        let matcher = self.matcher(&patterns)?;
        let patterns = self.patterns();
        let (patterns, compiled_db) = self.check_compiled_db(patterns)?;
        let matcher = self.matcher(&patterns, compiled_db.as_deref())?;
        let mut paths = self.paths();
        let using_default_path = if paths.is_empty() {
            paths.push(self.path_default());


@@ 560,6 571,7 @@ impl ArgMatches {
        Ok(Args(Arc::new(ArgsImp {
            matches: self,
            patterns,
            compiled_db,
            matcher,
            paths,
            using_default_path,


@@ 576,14 588,14 @@ impl ArgMatches {
    ///
    /// If there was a problem building the matcher (e.g., a syntax error),
    /// then this returns an error.
    fn matcher(&self, patterns: &[String]) -> Result<PatternMatcher> {
    fn matcher(&self, patterns: &[String], compiled_db: Option<&[u8]>) -> Result<PatternMatcher> {
        if self.is_present("pcre2") {
            self.matcher_engine("pcre2", patterns)
            self.matcher_engine("pcre2", patterns, compiled_db)
        } else if self.is_present("auto-hybrid-regex") {
            self.matcher_engine("auto", patterns)
            self.matcher_engine("auto", patterns, compiled_db)
        } else {
            let engine = self.value_of_lossy("engine").unwrap();
            self.matcher_engine(&engine, patterns)
            self.matcher_engine(engine.as_str(), patterns, compiled_db)
        }
    }



@@ 596,6 608,7 @@ impl ArgMatches {
        &self,
        engine: &str,
        patterns: &[String],
        compiled_db: Option<&[u8]>,
    ) -> Result<PatternMatcher> {
        match engine {
            "default" => {


@@ 616,6 629,15 @@ impl ArgMatches {
            "pcre2" => Err(From::from(
                "PCRE2 is not available in this build of ripgrep",
            )),
            #[cfg(feature = "hyperscan")]
            "hyperscan" => {
                let matcher = self.matcher_hyperscan(patterns, compiled_db)?;
                Ok(PatternMatcher::HYPERSCAN(matcher))
            }
            #[cfg(not(feature = "hyperscan"))]
            "hyperscan" => Err(From::from(
                "hyperscan is not available in this build of ripgrep"
            )),
            "auto" => {
                let rust_err = match self.matcher_rust(patterns) {
                    Ok(matcher) => {


@@ 628,7 650,12 @@ impl ArgMatches {
                    rust_err,
                );

                let pcre_err = match self.matcher_engine("pcre2", patterns) {
                let pcre_err = match self.matcher_engine("pcre2", patterns, compiled_db) {
                    Ok(matcher) => return Ok(matcher),
                    Err(err) => err,
                };

                let hyper_err = match self.matcher_engine("hyperscan", patterns, compiled_db) {
                    Ok(matcher) => return Ok(matcher),
                    Err(err) => err,
                };


@@ 636,11 663,13 @@ impl ArgMatches {
                    "regex could not be compiled with either the default \
                     regex engine or with PCRE2.\n\n\
                     default regex engine error:\n{}\n{}\n{}\n\n\
                     PCRE2 regex engine error:\n{}",
                     PCRE2 regex engine error:\n{}\n\n\
                     Hyperscan regex engine error:\n{}",
                    "~".repeat(79),
                    rust_err,
                    "~".repeat(79),
                    pcre_err,
                    hyper_err,
                )))
            }
            _ => Err(From::from(format!(


@@ 934,15 963,16 @@ impl ArgMatches {

    /// Returns true if the command line configuration implies that a match
    /// can never be shown.
    fn can_never_match(&self, patterns: &[String]) -> bool {
        patterns.is_empty() || self.max_count().ok() == Some(Some(0))
    fn can_never_match(&self, patterns: &[String], compiled_db: Option<&[u8]>) -> bool {
        (patterns.is_empty() && compiled_db.map_or(true, |db| db.is_empty()))
            || self.max_count().ok() == Some(Some(0))
    }

    /// Returns true if and only if case should be ignored.
    ///
    /// If --case-sensitive is present, then case is never ignored, even if
    /// --ignore-case is present.
    fn case_insensitive(&self) -> bool {
    pub(crate) fn case_insensitive(&self) -> bool {
        self.is_present("ignore-case") && !self.is_present("case-sensitive")
    }



@@ 1681,7 1711,7 @@ impl ArgMatches {
/// values and return the last one. (Clap returns the first one.) We only
/// define the ones we need.
impl ArgMatches {
    fn is_present(&self, name: &str) -> bool {
    pub(crate) fn is_present(&self, name: &str) -> bool {
        self.0.is_present(name)
    }



@@ 1689,7 1719,7 @@ impl ArgMatches {
        self.0.occurrences_of(name)
    }

    fn value_of_lossy(&self, name: &str) -> Option<String> {
    pub(crate) fn value_of_lossy(&self, name: &str) -> Option<String> {
        self.0.value_of_lossy(name).map(|s| s.into_owned())
    }



@@ 1697,11 1727,11 @@ impl ArgMatches {
        self.0.values_of_lossy(name)
    }

    fn value_of_os(&self, name: &str) -> Option<&OsStr> {
    pub(crate) fn value_of_os(&self, name: &str) -> Option<&OsStr> {
        self.0.value_of_os(name)
    }

    fn values_of_os(&self, name: &str) -> Option<clap::OsValues> {
    pub(crate) fn values_of_os(&self, name: &str) -> Option<clap::OsValues> {
        self.0.values_of_os(name)
    }
}


@@ 1713,6 1743,11 @@ fn suggest(msg: String) -> String {
    if let Some(pcre_msg) = suggest_pcre2(&msg) {
        return pcre_msg;
    }

    if let Some(hyper_msg) = suggest_hyperscan(&msg) {
        return hyper_msg;
    }

    msg
}


A crates/core/hyperscan.rs => crates/core/hyperscan.rs +110 -0
@@ 0,0 1,110 @@
//! This file is part of a patchset that enables the hyperscan engine for ripgrep.
//!

use crate::Result;
use crate::args::ArgMatches;

#[cfg(feature = "hyperscan")]
use std::io::Read;

#[cfg(feature = "hyperscan")]
use std::fs::File;

#[cfg(feature = "hyperscan")]
use grep::hyperscan::{
    RegexMatcher as HyperScanRegexMatcher,
    RegexMatcherBuilder as HyperScanRegexMatcherBuilder,
};

#[cfg(feature = "hyperscan")]
const HS_DB_MAGIC: u8 = 0xDB;

impl ArgMatches {
    /// Build a matcher using Hyperscan.
    ///
    /// If there was a problem building the matcher (such as a regex syntax
    /// error), then an error is returned.
    #[cfg(feature = "hyperscan")]
    pub(crate) fn matcher_hyperscan(&self, patterns: &[String], compiled_db: Option<&[u8]>) -> Result<HyperScanRegexMatcher> {
        let mut builder = HyperScanRegexMatcherBuilder::new();
        builder.caseless(self.case_insensitive())
               .allow_empty(self.is_present("hyper-allow-empty"))
               .utf8(self.is_present("hyper-utf8"))
               .ucp(self.is_present("hyper-ucp"));

        if self.is_present("multiline") {
            builder.multi_line(true)
                   .dotall(self.is_present("multiline-dotall"));
        }

        Ok(builder.build(patterns, compiled_db)?
                  .write_db(self.value_of_os("hyper-write"))?)
    }

    #[cfg(not(feature = "hyperscan"))]
    pub(crate) fn check_compiled_db(&self, std_patterns: Result<Vec<String>>) -> Result<(Vec<String>, Option<Vec<u8>>)> {
        std_patterns.map(|p| (p, None))
    }

    #[cfg(feature = "hyperscan")]
    pub(crate) fn check_compiled_db(&self, std_patterns: Result<Vec<String>>) -> Result<(Vec<String>, Option<Vec<u8>>)> {
        if self.value_of_lossy("engine").unwrap() != "hyperscan" ||
            std_patterns.is_ok() || self.values_of_os("file").is_none() {
            return std_patterns.map(|p| (p, None))
        }

        let paths = self.values_of_os("file").unwrap();
        if paths.len() != 1 {
            return Err("Found multiple -f flags. To use a compiled hyperscan \
                        db, provide a single DB via the -f flag without any \
                        extra pattern.".into())
        }

        let path = paths.clone().next().unwrap();
        if path != "-" {
            let mut file = File::open(path)?;
            let mut data: Vec<u8> = Vec::new();
            file.read_to_end(&mut data)?;

            // Check the file magic for a hyperscan db. The length guard
            // avoids a slice panic on files shorter than four bytes.
            if data.len() >= 4 && data[..4].iter().all(|&b| b == HS_DB_MAGIC) {
                if self.value_of_os("regexp").is_some() {
                    return Err("A hyperscan db was provided together with \
                                other patterns on the command line. Provide \
                                only a single DB via the -f flag, without any \
                                extra pattern.".into())
                }

                return Ok((Vec::new(), Some(data)));
            }
        }

        std_patterns.map(|p| (p, None))
    }
}

/// Inspect an error resulting from building a Rust regex matcher and, unless
/// it mentions backreferences or look-around (which hyperscan does not
/// support), append a message suggesting the hyperscan engine.
pub(crate) fn suggest_hyperscan(msg: &str) -> Option<String> {
    #[cfg(feature = "hyperscan")]
    fn suggest(msg: &str) -> Option<String> {
        if msg.contains("backreferences") || msg.contains("look-around") {
            None
        } else {
            Some(format!(
                "{}

Consider enabling hyperscan with the --engine=hyperscan flag, which can handle larger
regexps and compiled hyperscan databases.",
                msg
            ))
        }
    }

    #[cfg(not(feature = "hyperscan"))]
    fn suggest(_: &str) -> Option<String> {
        None
    }

    suggest(msg)
}
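
The file-magic test in `check_compiled_db` can be isolated into a small helper that is safe on short inputs. This is a minimal sketch, not part of the patch: the name `looks_like_hyperscan_db` is hypothetical, and the four-byte `0xDB` header is taken from `HS_DB_MAGIC` above rather than from the Hyperscan documentation.

```rust
/// Assumed magic byte, copied from `HS_DB_MAGIC` in the patch.
const HS_DB_MAGIC: u8 = 0xDB;

/// Hypothetical helper: returns true when `data` starts with four magic
/// bytes. `slice::get` yields `None` for inputs shorter than four bytes,
/// so an empty or tiny pattern file is rejected instead of panicking.
fn looks_like_hyperscan_db(data: &[u8]) -> bool {
    data.get(..4)
        .map_or(false, |head| head.iter().all(|&b| b == HS_DB_MAGIC))
}
```

A caller would read the `-f` file into a `Vec<u8>` and pass `&data` to the helper before deciding whether to treat the file as a compiled database or as a plain pattern list.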

M crates/core/main.rs => crates/core/main.rs +1 -0
@@ 15,6 15,7 @@ mod messages;
mod app;
mod args;
mod config;
mod hyperscan;
mod logger;
mod path_printer;
mod search;

M crates/core/search.rs => crates/core/search.rs +8 -0
@@ 6,6 6,8 @@ use std::time::Duration;

use grep::cli;
use grep::matcher::Matcher;
#[cfg(feature = "hyperscan")]
use grep::hyperscan::RegexMatcher as HyperScanRegexMatcher;
#[cfg(feature = "pcre2")]
use grep::pcre2::RegexMatcher as PCRE2RegexMatcher;
use grep::printer::{Standard, Stats, Summary, JSON};


@@ 206,6 208,8 @@ impl SearchResult {
#[derive(Clone, Debug)]
pub enum PatternMatcher {
    RustRegex(RustRegexMatcher),
    #[cfg(feature = "hyperscan")]
    HYPERSCAN(HyperScanRegexMatcher),
    #[cfg(feature = "pcre2")]
    PCRE2(PCRE2RegexMatcher),
}


@@ 425,6 429,8 @@ impl<W: WriteColor> SearchWorker<W> {
        let (searcher, printer) = (&mut self.searcher, &mut self.printer);
        match self.matcher {
            RustRegex(ref m) => search_path(m, searcher, printer, path),
            #[cfg(feature = "hyperscan")]
            HYPERSCAN(ref m) => search_path(m, searcher, printer, path),
            #[cfg(feature = "pcre2")]
            PCRE2(ref m) => search_path(m, searcher, printer, path),
        }


@@ 449,6 455,8 @@ impl<W: WriteColor> SearchWorker<W> {
        let (searcher, printer) = (&mut self.searcher, &mut self.printer);
        match self.matcher {
            RustRegex(ref m) => search_reader(m, searcher, printer, path, rdr),
            #[cfg(feature = "hyperscan")]
            HYPERSCAN(ref m) => search_reader(m, searcher, printer, path, rdr),
            #[cfg(feature = "pcre2")]
            PCRE2(ref m) => search_reader(m, searcher, printer, path, rdr),
        }

M crates/grep/Cargo.toml => crates/grep/Cargo.toml +2 -0
@@ 19,6 19,7 @@ grep-pcre2 = { version = "0.1.4", path = "../pcre2", optional = true }
grep-printer = { version = "0.1.4", path = "../printer" }
grep-regex = { version = "0.1.6", path = "../regex" }
grep-searcher = { version = "0.1.7", path = "../searcher" }
grep-hyperscan = { version = "0.0.1", path = "../hyperscan", optional = true }

[dev-dependencies]
termcolor = "1.0.4"


@@ 27,6 28,7 @@ walkdir = "2.2.7"
[features]
simd-accel = ["grep-searcher/simd-accel"]
pcre2 = ["grep-pcre2"]
hyperscan = ["grep-hyperscan"]

# This feature is DEPRECATED. Runtime dispatch is used for SIMD now.
avx-accel = []

M crates/grep/src/lib.rs => crates/grep/src/lib.rs +2 -0
@@ 15,6 15,8 @@ A cookbook and a guide are planned.
#![deny(missing_docs)]

pub extern crate grep_cli as cli;
#[cfg(feature = "hyperscan")]
pub extern crate grep_hyperscan as hyperscan;
pub extern crate grep_matcher as matcher;
#[cfg(feature = "pcre2")]
pub extern crate grep_pcre2 as pcre2;

A crates/hyperscan => crates/hyperscan +1 -0
@@ 0,0 1,1 @@
Subproject commit 6c76d609fc1584e2ae553a94ba85805eba987fa0