2.1.0
User Documentation for Apache MADlib
crf_feature_gen.sql_in File Reference

SQL function for POS/NER feature extraction.

Functions

void crf_train_fgen (text train_segment_tbl, text regex_tbl, text label_tbl, text dictionary_tbl, text train_feature_tbl, text train_featureset_tbl)
 This function extracts POS/NER features from the training data.
 
void crf_test_fgen (text test_segment_tbl, text dictionary_tbl, text label_tbl, text regex_tbl, text crf_weights_tbl, text viterbi_mtbl, text viterbi_rtbl)
 This function extracts POS/NER features from the testing data.
 

Detailed Description

Date
February 2012
See also
For an introduction to POS/NER feature extraction, see the module description Conditional Random Field

Function Documentation

◆ crf_test_fgen()

void crf_test_fgen ( text  test_segment_tbl,
text  dictionary_tbl,
text  label_tbl,
text  regex_tbl,
text  crf_weights_tbl,
text  viterbi_mtbl,
text  viterbi_rtbl 
)

This feature extraction function produces two factor tables, the "m table" (viterbi_mtbl) and the "r table" (viterbi_rtbl). Both tables are used to calculate the best label sequence for each sentence.

  • The viterbi_mtbl table encodes the edge features, which depend solely on the current label and the previous label (the previous y value). The m table has three columns: prev_label, label, and value. If the number of labels is \( n \), then the m factor table has \( n^2 \) rows. Each row encodes the transition feature weight from the previous label to the current label.

startFeature is treated as a special edge feature that runs from the beginning of the sentence to the first token. Likewise, endFeature is a special edge feature that runs from the last token to the end of the sentence. The m table therefore encodes edgeFeature, startFeature, and endFeature. If the label space contains 45 labels, numbered 0 to 44, then the m factor array is as follows:

                 0  1  2  3  4  5...44
startFeature -1  a  a  a  a  a  a...a
edgeFeature   0  a  a  a  a  a  a...a
edgeFeature   1  a  a  a  a  a  a...a
...
edgeFeature  44  a  a  a  a  a  a...a
endFeature   45  a  a  a  a  a  a...a
  • The viterbi_rtbl table is specific to individual tokens. It encodes the single-state features (e.g., wordFeature and RegexFeature) for all tokens. The r table is laid out as follows:
           0  1  2  3  4...44
    token1 a  a  a  a  a...a
    token2 a  a  a  a  a...a
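
To make the roles of the two factor tables concrete, the following is a minimal Python sketch of how a Viterbi decoder could combine an m array laid out as above with an r array. It is an illustration only, not MADlib's implementation: the function name viterbi_decode, the use of plain nested lists, and the toy weights are all assumptions. The only thing taken from the description above is the row order of the m array (startFeature row first, then one edgeFeature row per previous label, then the endFeature row).

```python
def viterbi_decode(m, r):
    """Return the highest-scoring label sequence for one sentence.

    m: (n + 2) x n nested list (hypothetical in-memory form of the m array).
       Row 0 holds the startFeature weights, rows 1..n hold the edgeFeature
       weights (row 1 + p is previous label p), row n + 1 holds endFeature.
    r: T x n nested list; r[t][y] is the single-state score of label y
       at token t.
    """
    n = len(m[0])
    num_tokens = len(r)
    if num_tokens == 0:
        return []
    start, edge, end = m[0], m[1:n + 1], m[n + 1]

    # score[y] is the best score of any path that ends in label y
    # at the current token.
    score = [start[y] + r[0][y] for y in range(n)]
    backptr = []
    for t in range(1, num_tokens):
        new_score, ptr = [], []
        for y in range(n):
            # Best previous label for reaching label y at token t.
            best_prev = max(range(n), key=lambda p: score[p] + edge[p][y])
            new_score.append(score[best_prev] + edge[best_prev][y] + r[t][y])
            ptr.append(best_prev)
        score = new_score
        backptr.append(ptr)

    # Add the endFeature weight for the label of the last token.
    score = [score[y] + end[y] for y in range(n)]
    path = [max(range(n), key=lambda y: score[y])]

    # Follow the back-pointers from the last token to the first.
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


# Toy example with n = 2 labels and 3 tokens (weights are made up).
m = [
    [0.5, 0.0],  # startFeature
    [1.0, 0.0],  # edgeFeature, previous label 0
    [0.0, 1.0],  # edgeFeature, previous label 1
    [0.0, 0.5],  # endFeature
]
r = [
    [1.0, 0.0],  # token1
    [0.0, 2.0],  # token2
    [0.0, 1.0],  # token3
]
best_path = viterbi_decode(m, r)  # [0, 1, 1]
```

In this sketch the m array is shared by every sentence, while a separate r array is needed for each sentence because its rows correspond to that sentence's tokens.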
Parameters
test_segment_tbl  Name of table containing all the tokenized testing sentences.
dictionary_tbl    Name of table containing the dictionary.
label_tbl         Name of table containing the label space used in POS or other NLP tasks.
regex_tbl         Name of table containing all the regular expressions to capture regex features.
crf_weights_tbl   Name of table containing featureset weights.
viterbi_mtbl      Name of table to store the m factors.
viterbi_rtbl      Name of table to store the r factors.

◆ crf_train_fgen()

void crf_train_fgen ( text  train_segment_tbl,
text  regex_tbl,
text  label_tbl,
text  dictionary_tbl,
text  train_feature_tbl,
text  train_featureset_tbl 
)
Parameters
train_segment_tbl     Name of table containing all the tokenized training sentences.
regex_tbl             Name of table containing all the regular expressions to capture regex features.
label_tbl             Name of the label table containing unique ids and label names.
dictionary_tbl        Name of table containing the dictionary.
train_feature_tbl     Name of table to store the features generated from the training dataset.
train_featureset_tbl  Name of table to store the unique feature set generated from the training dataset.