SQL functions for POS/NER feature extraction.
void | crf_train_fgen (text train_segment_tbl, text regex_tbl, text label_tbl, text dictionary_tbl, text train_feature_tbl, text train_featureset_tbl) | This function extracts POS/NER features from the training data. |
void | crf_test_fgen (text test_segment_tbl, text dictionary_tbl, text label_tbl, text regex_tbl, text crf_weights_tbl, text viterbi_mtbl, text viterbi_rtbl) | This function extracts POS/NER features from the testing data. |
- Date
- February 2012
- See also
- For an introduction to POS/NER feature extraction, see the Conditional Random Field module description.
◆ crf_test_fgen()
void crf_test_fgen (text test_segment_tbl, text dictionary_tbl, text label_tbl, text regex_tbl, text crf_weights_tbl, text viterbi_mtbl, text viterbi_rtbl)
This feature extraction function produces two factor tables, the "m table" (viterbi_mtbl) and the "r table" (viterbi_rtbl), which are used to calculate the best label sequence for each sentence.
- The viterbi_mtbl table encodes the edge features, which depend solely on the current label and the previous label. The m table has three columns: prev_label, label, and value. If the number of labels is \( n \), then the m factor table has \( n^2 \) rows. Each row encodes the transition feature weight from the previous label to the current label.
startFeature is a special edge feature that runs from the beginning of the sentence to the first token. Likewise, endFeature is a special edge feature that runs from the last token to the end of the sentence. The m table therefore encodes edgeFeature, startFeature, and endFeature. If the label space contains 45 labels, numbered 0 to 44, then the m factor array is as follows:
                   0  1  2  3  4  5 ... 44
startFeature  -1   a  a  a  a  a  a ... a
edgeFeature    0   a  a  a  a  a  a ... a
edgeFeature    1   a  a  a  a  a  a ... a
...
edgeFeature   44   a  a  a  a  a  a ... a
endFeature    45   a  a  a  a  a  a ... a
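The layout above can be sketched in Python to make the indexing concrete. This is only an illustration and not part of MADlib: the helper names `build_m_array`, `row_for_prev_label`, and `edge_rows` are hypothetical, and the weights are placeholder zeros standing in for the values marked `a`.

```python
# Illustrative sketch only: helper names are hypothetical and the weights
# are placeholder zeros, not values produced by crf_test_fgen.
N_LABELS = 45  # label ids 0..44, as in the example above


def build_m_array(n_labels, weight=0.0):
    """Return an (n_labels + 2) x n_labels array laid out like the listing.

    Row 0 holds startFeature (prev_label = -1), rows 1..n_labels hold
    edgeFeature for prev_label 0..n_labels-1, and the last row holds
    endFeature (prev_label = n_labels).
    """
    return [[weight] * n_labels for _ in range(n_labels + 2)]


def row_for_prev_label(prev_label):
    """Map a prev_label id in -1..n_labels to its row index in the array."""
    return prev_label + 1


def edge_rows(m_array, n_labels):
    """Flatten the edgeFeature block into (prev_label, label, value) tuples."""
    return [
        (prev, cur, m_array[row_for_prev_label(prev)][cur])
        for prev in range(n_labels)
        for cur in range(n_labels)
    ]


m = build_m_array(N_LABELS)
edges = edge_rows(m, N_LABELS)
```

With 45 labels the array has 47 rows (one startFeature row, 45 edgeFeature rows, one endFeature row) and 45 columns, while the flattened edgeFeature block has \( 45^2 \) entries, matching the \( n^2 \) transition rows described above.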
- Parameters
-
test_segment_tbl | Name of table containing all the tokenized testing sentences. |
dictionary_tbl | Name of table containing the dictionary. |
label_tbl | Name of table containing the label space used in POS or other NLP tasks. |
regex_tbl | Name of table containing all the regular expressions to capture regex features. |
crf_weights_tbl | Name of the table containing featureset weights. |
viterbi_mtbl | Name of table to store the m factors. |
viterbi_rtbl | Name of table to store the r factors. |
◆ crf_train_fgen()
void crf_train_fgen (text train_segment_tbl, text regex_tbl, text label_tbl, text dictionary_tbl, text train_feature_tbl, text train_featureset_tbl)
- Parameters
-
train_segment_tbl | Name of table containing all the tokenized training sentences. |
regex_tbl | Name of table containing all the regular expressions to capture regex features. |
label_tbl | Name of the label table containing unique ids and label names. |
dictionary_tbl | Name of table containing the dictionary. |
train_feature_tbl | Name of table to store the features generated from the training dataset. |
train_featureset_tbl | Name of table to store the unique feature set generated from the training dataset. |