["hi", "iv", "ve"]
ngrams() is often used together with the sentences() function.
Each element of the result has two fields:
ngram: the N-gram itself
estfrequency: the (estimated) count of how many times that N-gram occurred, i.e. its frequency
The example below uses the following text as input:
The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Hive provides standard SQL functionality, including many of the later 2003 and 2011 features for analytics. Hive's SQL can also be extended with user code via user defined functions (UDFs), user defined aggregates (UDAFs), and user defined table functions (UDTFs).
SELECT
ngrams(
sentences("The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Hive provides standard SQL functionality, including many of the later 2003 and 2011 features for analytics. Hive's SQL can also be extended with user code via user defined functions (UDFs), user defined aggregates (UDAFs), and user defined table functions (UDTFs)."),
2,
3
);
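sentences() splits the given text into sentences and each sentence into words, and the resulting array of word arrays is passed as the first parameter of ngrams().
The second parameter is N: the N-grams are computed over the word arrays using this value (here 2, i.e. 2-grams).
The third parameter is K, the number of top results to return out of all N-grams found (here 3, i.e. the top 3).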
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 1 1 0 0 0 0
Reducer 2 ...... SUCCEEDED 1 1 0 0 0 0
--------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 9.88 s
--------------------------------------------------------------------------------
OK
[{"ngram":["user","defined"],"estfrequency":3.0},{"ngram":["and","user"],"estfrequency":1.0},{"ngram":["with","user"],"estfrequency":1.0}]
Time taken: 10.326 seconds, Fetched: 1 row(s)
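The result above comes back as a single row holding an array of structs. If one row per N-gram is easier to work with, wrapping the call in explode() is one possible approach. The following is only a sketch: the table name docs and its string column body are hypothetical and do not appear in the example above.

```sql
-- Sketch only: `docs` and its string column `body` are hypothetical names.
-- ngrams() aggregates over all rows of `docs` and returns an array of structs;
-- explode() then emits each struct (ngram, estfrequency) as its own row.
SELECT explode(ngrams(sentences(body), 2, 3)) AS x
FROM docs;
```

With this form, each of the top 3 2-grams should appear on a separate row instead of inside one array.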
Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF