~hrbrmstr/sergeant

c6e0bb7e89f58dd2961e8255139ab1903ea4cd7c — hrbrmstr 1 year, 6 months ago f0651c5
README
3 files changed, 96 insertions(+), 105 deletions(-)

M .Rbuildignore
M README.md
R README.Rmd => pre/README.Rmd
M .Rbuildignore => .Rbuildignore +1 -0
@@ 12,3 12,4 @@
^apache-drill-1\.10\.0\.tar\.gz$
^cdh4-repository_1\.0_all\.deb$
^cran-comments\.md$
^pre$

M README.md => README.md +89 -84
@@ 6,7 6,7 @@
Status](https://travis-ci.org/hrbrmstr/sergeant.svg?branch=master)](https://travis-ci.org/hrbrmstr/sergeant)
[![Coverage
Status](https://codecov.io/gh/hrbrmstr/sergeant/branch/master/graph/badge.svg)](https://codecov.io/gh/hrbrmstr/sergeant)
[![CRAN\_Status\_Badge](http://www.r-pkg.org/badges/version/sergeant)](https://cran.r-project.org/package=sergeant)
[![CRAN\_Status\_Badge](https://www.r-pkg.org/badges/version/sergeant)](https://cran.r-project.org/package=sergeant)

# 💂 sergeant



@@ 14,26 14,12 @@ Tools to Transform and Query Data with ‘Apache’ ‘Drill’

## \*\* IMPORTANT \*\*

Version 0.7.0 splits off the JDBC interface into a separate package
`sergeant.caffeinated`
([sr.ht](https://git.sr.ht/~hrbrmstr/sergeant);
Version 0.7.0 (a.k.a. the main branch) splits off the JDBC interface
into a separate package `sergeant.caffeinated`
([GitLab](https://gitlab.com/hrbrmstr/sergeant-caffeinated);
[GitHub](https://github.com/hrbrmstr/sergeant-caffeinated)).

If you want to try all the new features coming in 0.8.0 please install from the 0.8.0 branch via:

``` r
# sr.ht
devtools::install_git("https://git.sr.ht/~hrbrmstr/sergeant", ref="0.8.0")

# GitLab
devtools::install_git("https://gitlab.com/hrbrmstr/sergeant", ref="0.8.0")

# GitHub
devtools::install_git("https://github.com/hrbrmstr/sergeant", ref="0.8.0")
```

## Description

Drill + `sergeant` is (IMO) a streamlined alternative to Spark +
`sparklyr` if you don’t need the ML components of Spark (i.e. just need


@@ 133,14 119,28 @@ function mappings.
# Installation

``` r
install.packages("sergeant", repos = "https://cinc.rud.is")
# or
devtools::install_git("https://git.rud.is/hrbrmstr/sergeant.git")
# or
devtools::install_git("https://git.sr.ht/~hrbrmstr/sergeant")
# or
devtools::install_gitlab("hrbrmstr/sergeant")
# or
devtools::install_github("hrbrmstr/sergeant")
```

\`\`{r echo=FALSE, message=FALSE, warning=FALSE, error=FALSE}
options(width=120)

```` 

## Usage

### `dplyr` interface

``` r

```r
library(sergeant)
library(tidyverse)



@@ 198,30 198,32 @@ arrange(db, desc(employee_id)) %>% print(n = 20)
##  # Source:     table<cp.`employee.json`> [?? x 20]
##  # Database:   DrillConnection
##  # Ordered by: desc(employee_id)
##     employee_id full_name first_name last_name position_id position_title store_id department_id birth_date hire_date
##     <chr>       <chr>     <chr>      <chr>     <chr>       <chr>          <chr>    <chr>         <chr>      <chr>    
##   1 999         Beverly … Beverly    Dittmar   17          Store Permane… 8        17            1914-02-02 1998-01-…
##   2 998         Elizabet… Elizabeth  Jantzer   17          Store Permane… 8        17            1914-02-02 1998-01-…
##   3 997         John Swe… John       Sweet     17          Store Permane… 8        17            1914-02-02 1998-01-…
##   4 996         William … William    Murphy    17          Store Permane… 8        17            1914-02-02 1998-01-…
##   5 995         Carol Li… Carol      Lindsay   17          Store Permane… 8        17            1914-02-02 1998-01-…
##   6 994         Richard … Richard    Burke     17          Store Permane… 8        17            1914-02-02 1998-01-…
##   7 993         Ethan Bu… Ethan      Bunosky   17          Store Permane… 8        17            1914-02-02 1998-01-…
##   8 992         Claudett… Claudette  Cabrera   17          Store Permane… 8        17            1914-02-02 1998-01-…
##   9 991         Maria Te… Maria      Terry     17          Store Permane… 8        17            1914-02-02 1998-01-…
##  10 990         Stacey C… Stacey     Case      17          Store Permane… 8        17            1914-02-02 1998-01-…
##  11 99          Elizabet… Elizabeth  Horne     18          Store Tempora… 6        18            1976-10-05 1997-01-…
##  12 989         Dominick… Dominick   Nutter    17          Store Permane… 8        17            1914-02-02 1998-01-…
##  13 988         Brian Wi… Brian      Willeford 17          Store Permane… 8        17            1914-02-02 1998-01-…
##  14 987         Margaret… Margaret   Clendenen 17          Store Permane… 8        17            1914-02-02 1998-01-…
##  15 986         Maeve Wa… Maeve      Wall      17          Store Permane… 8        17            1914-02-02 1998-01-…
##  16 985         Mildred … Mildred    Morrow    16          Store Tempora… 8        16            1914-02-02 1998-01-…
##  17 984         French W… French     Wilson    16          Store Tempora… 8        16            1914-02-02 1998-01-…
##  18 983         Elisabet… Elisabeth  Duncan    16          Store Tempora… 8        16            1914-02-02 1998-01-…
##  19 982         Linda An… Linda      Anderson  16          Store Tempora… 8        16            1914-02-02 1998-01-…
##  20 981         Selene W… Selene     Watson    16          Store Tempora… 8        16            1914-02-02 1998-01-…
##  # … with more rows, and 6 more variables: salary <chr>, supervisor_id <chr>, education_level <chr>,
##  #   marital_status <chr>, gender <chr>, management_role <chr>
##     employee_id full_name first_name last_name position_id position_title
##     <chr>       <chr>     <chr>      <chr>     <chr>       <chr>         
##   1 999         Beverly … Beverly    Dittmar   17          Store Permane…
##   2 998         Elizabet… Elizabeth  Jantzer   17          Store Permane…
##   3 997         John Swe… John       Sweet     17          Store Permane…
##   4 996         William … William    Murphy    17          Store Permane…
##   5 995         Carol Li… Carol      Lindsay   17          Store Permane…
##   6 994         Richard … Richard    Burke     17          Store Permane…
##   7 993         Ethan Bu… Ethan      Bunosky   17          Store Permane…
##   8 992         Claudett… Claudette  Cabrera   17          Store Permane…
##   9 991         Maria Te… Maria      Terry     17          Store Permane…
##  10 990         Stacey C… Stacey     Case      17          Store Permane…
##  11 99          Elizabet… Elizabeth  Horne     18          Store Tempora…
##  12 989         Dominick… Dominick   Nutter    17          Store Permane…
##  13 988         Brian Wi… Brian      Willeford 17          Store Permane…
##  14 987         Margaret… Margaret   Clendenen 17          Store Permane…
##  15 986         Maeve Wa… Maeve      Wall      17          Store Permane…
##  16 985         Mildred … Mildred    Morrow    16          Store Tempora…
##  17 984         French W… French     Wilson    16          Store Tempora…
##  18 983         Elisabet… Elisabeth  Duncan    16          Store Tempora…
##  19 982         Linda An… Linda      Anderson  16          Store Tempora…
##  20 981         Selene W… Selene     Watson    16          Store Tempora…
##  # … with more rows, and 10 more variables: store_id <chr>,
##  #   department_id <chr>, birth_date <chr>, hire_date <chr>, salary <chr>,
##  #   supervisor_id <chr>, education_level <chr>, marital_status <chr>,
##  #   gender <chr>, management_role <chr>

mutate(db, position_title = tolower(position_title)) %>%
  mutate(salary = as.numeric(salary)) %>%


@@ 244,7 246,7 @@ mutate(db, position_title = tolower(position_title)) %>%
##   9 6                            4
##  10 36                           2
##  # … with 102 more rows
```
````

### REST API



@@ 258,57 260,60 @@ drill_version(dc)
##  [1] "1.15.0"

drill_storage(dc)$name
##   [1] "cp"       "dfs"      "drilldat" "hbase"    "hdfs"     "hive"     "kudu"     "mongo"    "my"       "s3"
##   [1] "cp"       "dfs"      "drilldat" "hbase"    "hdfs"     "hive"    
##   [7] "kudu"     "mongo"    "my"       "s3"

drill_query(dc, "SELECT * FROM cp.`employee.json` limit 100")
##  # A tibble: 100 x 16
##     employee_id full_name first_name last_name position_id position_title store_id department_id birth_date hire_date
##     <chr>       <chr>     <chr>      <chr>     <chr>       <chr>          <chr>    <chr>         <chr>      <chr>    
##   1 1           Sheri No… Sheri      Nowmer    1           President      0        1             1961-08-26 1994-12-…
##   2 2           Derrick … Derrick    Whelply   2           VP Country Ma… 0        1             1915-07-03 1994-12-…
##   3 4           Michael … Michael    Spence    2           VP Country Ma… 0        1             1969-06-20 1998-01-…
##   4 5           Maya Gut… Maya       Gutierrez 2           VP Country Ma… 0        1             1951-05-10 1998-01-…
##   5 6           Roberta … Roberta    Damstra   3           VP Informatio… 0        2             1942-10-08 1994-12-…
##   6 7           Rebecca … Rebecca    Kanagaki  4           VP Human Reso… 0        3             1949-03-27 1994-12-…
##   7 8           Kim Brun… Kim        Brunner   11          Store Manager  9        11            1922-08-10 1998-01-…
##   8 9           Brenda B… Brenda     Blumberg  11          Store Manager  21       11            1979-06-23 1998-01-…
##   9 10          Darren S… Darren     Stanz     5           VP Finance     0        5             1949-08-26 1994-12-…
##  10 11          Jonathan… Jonathan   Murraiin  11          Store Manager  1        11            1967-06-20 1998-01-…
##  # … with 90 more rows, and 6 more variables: salary <chr>, supervisor_id <chr>, education_level <chr>,
##  #   marital_status <chr>, gender <chr>, management_role <chr>
##     employee_id full_name first_name last_name position_id position_title
##     <chr>       <chr>     <chr>      <chr>     <chr>       <chr>         
##   1 1           Sheri No… Sheri      Nowmer    1           President     
##   2 2           Derrick … Derrick    Whelply   2           VP Country Ma…
##   3 4           Michael … Michael    Spence    2           VP Country Ma…
##   4 5           Maya Gut… Maya       Gutierrez 2           VP Country Ma…
##   5 6           Roberta … Roberta    Damstra   3           VP Informatio…
##   6 7           Rebecca … Rebecca    Kanagaki  4           VP Human Reso…
##   7 8           Kim Brun… Kim        Brunner   11          Store Manager 
##   8 9           Brenda B… Brenda     Blumberg  11          Store Manager 
##   9 10          Darren S… Darren     Stanz     5           VP Finance    
##  10 11          Jonathan… Jonathan   Murraiin  11          Store Manager 
##  # … with 90 more rows, and 10 more variables: store_id <chr>,
##  #   department_id <chr>, birth_date <chr>, hire_date <chr>, salary <chr>,
##  #   supervisor_id <chr>, education_level <chr>, marital_status <chr>,
##  #   gender <chr>, management_role <chr>

drill_query(dc, "SELECT COUNT(gender) AS gct FROM cp.`employee.json` GROUP BY gender")

drill_options(dc)
##  # A tibble: 179 x 6
##     name                                                        value    defaultValue accessibleScopes kind   optionScope
##     <chr>                                                       <chr>    <chr>        <chr>            <chr>  <chr>      
##   1 debug.validate_iterators                                    FALSE    false        ALL              BOOLE… BOOT       
##   2 debug.validate_vectors                                      FALSE    false        ALL              BOOLE… BOOT       
##   3 drill.exec.functions.cast_empty_string_to_null              FALSE    false        ALL              BOOLE… BOOT       
##   4 drill.exec.hashagg.fallback.enabled                         FALSE    false        ALL              BOOLE… BOOT       
##   5 drill.exec.hashjoin.fallback.enabled                        FALSE    false        ALL              BOOLE… BOOT       
##   6 drill.exec.memory.operator.output_batch_size                16777216 16777216     SYSTEM           LONG   BOOT       
##   7 drill.exec.memory.operator.output_batch_size_avail_mem_fac… 0.1      0.1          SYSTEM           DOUBLE BOOT       
##   8 drill.exec.storage.file.partition.column.label              dir      dir          ALL              STRING BOOT       
##   9 drill.exec.storage.implicit.filename.column.label           filename filename     ALL              STRING BOOT       
##  10 drill.exec.storage.implicit.filepath.column.label           filepath filepath     ALL              STRING BOOT       
##     name              value  defaultValue accessibleScopes kind  optionScope
##     <chr>             <chr>  <chr>        <chr>            <chr> <chr>      
##   1 debug.validate_i… FALSE  false        ALL              BOOL… BOOT       
##   2 debug.validate_v… FALSE  false        ALL              BOOL… BOOT       
##   3 drill.exec.funct… FALSE  false        ALL              BOOL… BOOT       
##   4 drill.exec.hasha… FALSE  false        ALL              BOOL… BOOT       
##   5 drill.exec.hashj… FALSE  false        ALL              BOOL… BOOT       
##   6 drill.exec.memor… 16777… 16777216     SYSTEM           LONG  BOOT       
##   7 drill.exec.memor… 0.1    0.1          SYSTEM           DOUB… BOOT       
##   8 drill.exec.stora… dir    dir          ALL              STRI… BOOT       
##   9 drill.exec.stora… filen… filename     ALL              STRI… BOOT       
##  10 drill.exec.stora… filep… filepath     ALL              STRI… BOOT       
##  # … with 169 more rows

drill_options(dc, "json")
##  # A tibble: 10 x 6
##     name                                                    value defaultValue accessibleScopes kind    optionScope
##     <chr>                                                   <chr> <chr>        <chr>            <chr>   <chr>      
##   1 store.hive.maprdb_json.optimize_scan_with_native_reader FALSE false        ALL              BOOLEAN BOOT       
##   2 store.json.all_text_mode                                TRUE  false        ALL              BOOLEAN SYSTEM     
##   3 store.json.extended_types                               TRUE  false        ALL              BOOLEAN SYSTEM     
##   4 store.json.read_numbers_as_double                       FALSE false        ALL              BOOLEAN BOOT       
##   5 store.json.reader.allow_nan_inf                         TRUE  true         ALL              BOOLEAN BOOT       
##   6 store.json.reader.print_skipped_invalid_record_number   TRUE  false        ALL              BOOLEAN SYSTEM     
##   7 store.json.reader.skip_invalid_records                  TRUE  false        ALL              BOOLEAN SYSTEM     
##   8 store.json.writer.allow_nan_inf                         TRUE  true         ALL              BOOLEAN BOOT       
##   9 store.json.writer.skip_null_fields                      TRUE  true         ALL              BOOLEAN BOOT       
##  10 store.json.writer.uglify                                TRUE  false        ALL              BOOLEAN SYSTEM
##     name               value defaultValue accessibleScopes kind  optionScope
##     <chr>              <chr> <chr>        <chr>            <chr> <chr>      
##   1 store.hive.maprdb… FALSE false        ALL              BOOL… BOOT       
##   2 store.json.all_te… TRUE  false        ALL              BOOL… SYSTEM     
##   3 store.json.extend… TRUE  false        ALL              BOOL… SYSTEM     
##   4 store.json.read_n… FALSE false        ALL              BOOL… BOOT       
##   5 store.json.reader… TRUE  true         ALL              BOOL… BOOT       
##   6 store.json.reader… TRUE  false        ALL              BOOL… SYSTEM     
##   7 store.json.reader… TRUE  false        ALL              BOOL… SYSTEM     
##   8 store.json.writer… TRUE  true         ALL              BOOL… BOOT       
##   9 store.json.writer… TRUE  true         ALL              BOOL… BOOT       
##  10 store.json.writer… TRUE  false        ALL              BOOL… SYSTEM
```

## Working with parquet files


@@ 375,7 380,7 @@ select columns[2] as city, columns[4] as lon, columns[3] as lat
| Lang | \# Files |  (%) |  LoC |  (%) | Blank lines |  (%) | \# Lines |  (%) |
| :--- | -------: | ---: | ---: | ---: | ----------: | ---: | -------: | ---: |
| R    |       18 | 0.95 | 1212 | 0.96 |         349 | 0.86 |      716 | 0.89 |
| Rmd  |        1 | 0.05 |   54 | 0.04 |          56 | 0.14 |       92 | 0.11 |
| Rmd  |        1 | 0.05 |   56 | 0.04 |          55 | 0.14 |       90 | 0.11 |

## Code of Conduct


R README.Rmd => pre/README.Rmd +6 -21
@@ 19,7 19,7 @@ options(sergeant.bigint.warnonce = FALSE)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.1248912.svg)](https://doi.org/10.5281/zenodo.1248912) 
[![Travis-CI Build Status](https://travis-ci.org/hrbrmstr/sergeant.svg?branch=master)](https://travis-ci.org/hrbrmstr/sergeant) 
[![Coverage Status](https://codecov.io/gh/hrbrmstr/sergeant/branch/master/graph/badge.svg)](https://codecov.io/gh/hrbrmstr/sergeant)
[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/sergeant)](https://cran.r-project.org/package=sergeant)
[![CRAN_Status_Badge](https://www.r-pkg.org/badges/version/sergeant)](https://cran.r-project.org/package=sergeant)

# 💂 sergeant



@@ 29,21 29,7 @@ Tools to Transform and Query Data with 'Apache' 'Drill'

Version 0.7.0 (a.k.a. the main branch) splits off the JDBC interface into a separate package `sergeant.caffeinated` ([GitLab](https://gitlab.com/hrbrmstr/sergeant-caffeinated); [GitHub](https://github.com/hrbrmstr/sergeant-caffeinated)).

If you want to try all the new features coming in 0.8.0 please install from the 0.8.0 branch via:

```{r eval=FALSE}
# sr.ht
devtools::install_git("https://git.sr.ht/~hrbrmstr/sergeant", ref="0.8.0")

# GitLab
devtools::install_git("https://gitlab.com/hrbrmstr/sergeant", ref="0.8.0")

# GitHub
devtools::install_git("https://github.com/hrbrmstr/sergeant", ref="0.8.0")
```


## Description

Drill + `sergeant` is (IMO) a streamlined alternative to Spark + `sparklyr` if you don't need the ML components of Spark (i.e., you just need to query "big data" sources, interface with parquet, or combine disparate data source types (json, csv, parquet, rdbms) for aggregation, etc.). Drill also has support for spatial queries.



@@ 107,11 93,10 @@ Note that a number of Drill SQL functions have been mapped to R functions (e.g. 

# Installation

```{r eval=FALSE}
devtools::install_github("hrbrmstr/sergeant")
```

```{r echo=FALSE, message=FALSE, warning=FALSE, error=FALSE}
```{r einstall-ex, results='asis', echo = FALSE}
hrbrpkghelpr::install_block()
```
```{r echo=FALSE, message=FALSE, warning=FALSE, error=FALSE}
options(width=120)
```