~hrbrmstr/sergeant

9cb4647fbfcdc117eaa50da3fb4a4269d314d1d9 — Bob Rudis 3 years ago 2ed19e0
README tweak
4 files changed, 98 insertions(+), 27 deletions(-)

M R/dplyr.r
M README.Rmd
M README.md
M cran-comments.md
M R/dplyr.r => R/dplyr.r +26 -1
@@ 11,13 11,23 @@
#' @note This is a DBI wrapper around the Drill REST API. TODO username/password support
#' @export
#' @examples \dontrun{
#' db <- src_drill("localhost", "8047")
#' db <- src_drill("localhost", 8047L)
#'
#' print(db)
#' ## src:  DrillConnection
#' ## tbls: INFORMATION_SCHEMA, cp.default, dfs.default, dfs.root, dfs.tmp, sys
#'
#' emp <- tbl(db, "cp.`employee.json`")
#'
#' count(emp, gender, marital_status)
#' ## # Source:   lazy query [?? x 3]
#' ## # Database: DrillConnection
#' ## # Groups:   gender
#' ##   marital_status gender     n
#' ##            <chr>  <chr> <int>
#' ## 1              S      F   297
#' ## 2              M      M   278
#' ## 3              S      M   276
#'
#' # Drill-specific SQL functions are also available
#' select(emp, full_name) %>%


@@ 29,6 39,21 @@
#'                  pos = position("en", full_name),
#'                  rpd = rpad(full_name, 20L),
#'                 rpdw = rpad_with(full_name, 20L, "*"))
#' ## # Source:   lazy query [?? x 9]
#' ## # Database: DrillConnection
#' ##      loc         full_name   len                 rpdw   pos                rx
#' ##    <int>             <chr> <int>                <chr> <int>             <chr>
#' ##  1     0      Sheri Nowmer    12 Sheri Nowmer********     0      Sh*r* N*wm*r
#' ##  2     0   Derrick Whelply    15 Derrick Whelply*****     0   D*rr*ck Wh*lply
#' ##  3     5    Michael Spence    14 Michael Spence******    11    M*ch**l Sp*nc*
#' ##  4     2    Maya Gutierrez    14 Maya Gutierrez******     0    M*y* G*t**rr*z
#' ##  5     7   Roberta Damstra    15 Roberta Damstra*****     0   R*b*rt* D*mstr*
#' ##  6     7  Rebecca Kanagaki    16 Rebecca Kanagaki****     0  R*b*cc* K*n*g*k*
#' ##  7     0       Kim Brunner    11 Kim Brunner*********     0       K*m Br*nn*r
#' ##  8     6   Brenda Blumberg    15 Brenda Blumberg*****     3   Br*nd* Bl*mb*rg
#' ##  9     2      Darren Stanz    12 Darren Stanz********     5      D*rr*n St*nz
#' ## 10     4 Jonathan Murraiin    17 Jonathan Murraiin***     0 J*n*th*n M*rr***n
#' ## # ... with more rows, and 3 more variables: rpd <chr>, rnd <dbl>, first_three <chr>
#' }
src_drill <- function(host=Sys.getenv("DRILL_HOST", "localhost"),
                      port=as.integer(Sys.getenv("DRILL_PORT", 8047L)),

M README.Rmd => README.Rmd +25 -3
@@ 89,9 89,19 @@ options(width=120)
```{r message=FALSE}
library(sergeant)

```

```{r echo=TRUE, eval=FALSE}
ds <- src_drill("localhost")  # use localhost if running standalone on same system otherwise the host or IP of your Drill server
ds
```

```{r echo=FALSE, eval=TRUE}
ds <- src_drill("bigd")
ds
```

```{r message=FALSE}
db <- tbl(ds, "cp.`employee.json`") 

# without `collect()`:


@@ 162,14 172,20 @@ mutate(db, position_title=tolower(position_title)) %>%

### Usage

```{r}
```{r message=FALSE}
library(sergeant)

# current version
packageVersion("sergeant")

```
```{r eval=FALSE}
dc <- drill_connection("localhost") 

```
```{r echo=FALSE}
dc <- drill_connection("bigd") 
```
```{r message=FALSE}
drill_active(dc)

drill_version(dc)


@@ 230,8 246,14 @@ library(RJDBC)
# con <- drill_jdbc("drill-node:2181", "drillbits1") 

# Use the following if running drill-embedded
```
```{r eval=FALSE}
con <- drill_jdbc("localhost:31010", use_zk=FALSE)

```
```{r echo=FALSE}
con <- drill_jdbc("bigd:31010", use_zk=FALSE)
```
```{r message=FALSE}
drill_query(con, "SELECT * FROM cp.`employee.json`")

# but it can work via JDBC function calls, too

M README.md => README.md +30 -15
@@ 8,7 8,7 @@

Drill + `sergeant` is (IMO) a nice alternative to Spark + `sparklyr` if you don't need the ML components of Spark (i.e. just need to query "big data" sources, need to interface with parquet, need to combine disparate data source types — json, csv, parquet, rdbms - for aggregation, etc). Drill also has support for spatial queries.

I find writing SQL queries to parquet files with Drill on a local Linux or macOS workstation to be more performant than doing the data ingestion work with R (for large or disparate data sets). I also work with many tiny JSON files on a daily basis and Drill makes it much easier to do so. YMMV.
I find writing SQL queries to parquet files with Drill on a local Linux or macOS workstation to be more performant than doing the data ingestion work with R (especially for large or disparate data sets). I also work with many tiny JSON files on a daily basis and Drill makes it much easier to do so. YMMV.

You can download Drill from <https://drill.apache.org/download/> (use "Direct File Download"). I use `/usr/local/drill` as the install directory. `drill-embedded` is a super-easy way to get started playing with Drill on a single workstation and most of my workflows can get by using Drill this way. If there is sufficient desire for an automated downloader and a way to start the `drill-embedded` server from within R, please file an issue.



@@ 24,13 24,13 @@ The following functions are implemented:

**`DBI`**

-   As complete of an R `DBI` driver has been implemented using the Drill REST API, mostly to facilitate the `dplyr` interface. Use the `RJDBC` driver interface if you need more `DBI` functionality.
-   This also means that SQL functions unique to Drill have also been "implemented" (i.e. made accessible to the `dplyr` interface). If you have custom Drill SQL functions that need to be implemented please file an issue on GitHub.
-   A "just enough" feature complete R `DBI` driver has been implemented using the Drill REST API, mostly to facilitate the `dplyr` interface. Use the `RJDBC` driver interface if you need more `DBI` functionality.
-   This also means that SQL functions unique to Drill have also been "implemented" (i.e. made accessible to the `dplyr` interface). If you have custom Drill SQL functions that need to be implemented please file an issue on GitHub. Many should work without it, but some may require a custom interface.

**`RJDBC`**

-   `drill_jdbc`: Connect to Drill using JDBC, enabling use of said idioms. See `RJDBC` for more info.
-   NOTE: The DRILL JDBC driver fully-qualified path must be placed in the `DRILL_JDBC_JAR` environment variable. This is best done via `~/.Renviron` for interactive work. i.e. `DRILL_JDBC_JAR=/usr/local/drill/jars/drill-jdbc-all-1.9.0.jar`
-   NOTE: The DRILL JDBC driver fully-qualified path must be placed in the `DRILL_JDBC_JAR` environment variable. This is best done via `~/.Renviron` for interactive work. i.e. `DRILL_JDBC_JAR=/usr/local/drill/jars/drill-jdbc-all-1.10.0.jar`

**`dplyr`**:



@@ 72,12 72,17 @@ devtools::install_github("hrbrmstr/sergeant")

``` r
library(sergeant)
```

``` r
ds <- src_drill("localhost")  # use localhost if running standalone on same system otherwise the host or IP of your Drill server
ds
#> src:  DrillConnection
#> tbls: INFORMATION_SCHEMA, cp.default, dfs.default, dfs.root, dfs.tmp, sys
```

    #> src:  DrillConnection
    #> tbls: INFORMATION_SCHEMA, cp.default, dfs.d, dfs.default, dfs.h, dfs.natexp, dfs.p, dfs.root, dfs.tmp, sys

``` r
db <- tbl(ds, "cp.`employee.json`") 

# without `collect()`:


@@ 225,10 230,14 @@ library(sergeant)

# current version
packageVersion("sergeant")
#> [1] '0.5.0'
#> [1] '0.5.2'
```

``` r
dc <- drill_connection("localhost") 
```

``` r
drill_active(dc)
#> [1] TRUE



@@ 355,11 364,11 @@ drill_query(dc, "SELECT * FROM dfs.`/usr/local/drill/sample-data/nations*/nation
#> # A tibble: 5 x 5
#>              N_COMMENT    N_NAME N_NATIONKEY N_REGIONKEY      dir0
#> *                <chr>     <chr>       <int>       <int>     <chr>
#> 1  haggle. carefully f   ALGERIA           0           0 nationsMF
#> 2 al foxes promise sly ARGENTINA           1           1 nationsMF
#> 3 y alongside of the p    BRAZIL           2           1 nationsMF
#> 4 eas hang ironic, sil    CANADA           3           1 nationsMF
#> 5 y above the carefull     EGYPT           4           4 nationsMF
#> 1  haggle. carefully f   ALGERIA           0           0 nationsSF
#> 2 al foxes promise sly ARGENTINA           1           1 nationsSF
#> 3 y alongside of the p    BRAZIL           2           1 nationsSF
#> 4 eas hang ironic, sil    CANADA           3           1 nationsSF
#> 5 y above the carefull     EGYPT           4           4 nationsSF
```

### A preview of the built-in support for spatial ops


@@ 408,9 417,15 @@ library(RJDBC)
# con <- drill_jdbc("drill-node:2181", "drillbits1") 

# Use the following if running drill-embedded
```

``` r
con <- drill_jdbc("localhost:31010", use_zk=FALSE)
#> Using [jdbc:drill:drillbit=localhost:31010]...
```

    #> Using [jdbc:drill:drillbit=bigd:31010]...

``` r
drill_query(con, "SELECT * FROM cp.`employee.json`")
#> # A tibble: 1,155 x 16
#>    employee_id         full_name first_name last_name position_id         position_title store_id department_id


@@ 460,12 475,12 @@ library(testthat)
#>     matches

date()
#> [1] "Mon Jun 19 00:15:05 2017"
#> [1] "Mon Jul 17 12:23:17 2017"

devtools::test()
#> Loading sergeant
#> Testing sergeant
#> dplyr: ....
#> dplyr: ...
#> rest: ................
#> 
#> DONE ===================================================================================================================

M cran-comments.md => cran-comments.md +17 -8
@@ 1,11 1,12 @@
## Test environments
* local OS X install, R 3.4.1
* ubuntu 12.04 (on travis-ci), R 3.4.1
* win-builder (devel and release)
* local ubuntu 14.04 install, R 3.4.1
* ubuntu 12.04 (on travis-ci), R 3.4.1 and oldrel
* win-builder

## R CMD check results

0 errors | 0 warnings | 1 note
0 errors | 0 warnings | 0 notes

* This is a new release.



@@ 15,9 16,17 @@ This is a new release, so there are no reverse dependencies.

---

* I have run R CMD check on the NUMBER downstream dependencies.
  (Summary at ...). 
  
* FAILURE SUMMARY
* WinBuilder and R-hub both report that httr and covr are not available, so
  I have not been able to get the checks to run successfully on those
  platforms. These errors have nothing to do with the package
  configuration.

* All revdep maintainers were notified of the release on RELEASE DATE.
* The examples and tests are wrapped in \dontrun{} or testthat:::skip_on_cran()
  since they absolutely require a running Apache Drill server. Full tests
  are run on Travis (weekly, now) with results available for review:
  https://travis-ci.org/hrbrmstr/sergeant
  
  The Travis tests install Apache Drill and test out the REST API calls
  as well as the dplyr/dbplyr interface with live queries.
  
* Code coverage is run and is currently at 40%