SQL functions for POS/NER feature extraction.

Functions
void	crf_train_fgen (text train_segment_tbl, text regex_tbl, text label_tbl, text dictionary_tbl, text train_feature_tbl, text train_featureset_tbl)
	This function extracts POS/NER features from the training data.

void	crf_test_fgen (text test_segment_tbl, text dictionary_tbl, text label_tbl, text regex_tbl, text crf_weights_tbl, text viterbi_mtbl, text viterbi_rtbl)
	This function extracts POS/NER features from the testing data.

Detailed Description

Date: February 2012

See also: For an introduction to POS/NER feature extraction, see the module description "Conditional Random Field".

Function Documentation

◆ crf_test_fgen()

void crf_test_fgen	(	text	test_segment_tbl,
		text	dictionary_tbl,
		text	label_tbl,
		text	regex_tbl,
		text	crf_weights_tbl,
		text	viterbi_mtbl,
		text	viterbi_rtbl
	)

This feature extraction function produces two factor tables: the "m table" (viterbi_mtbl) and the "r table" (viterbi_rtbl). Together, these two tables are used to calculate the best label sequence for each sentence.

The viterbi_mtbl table encodes the edge features, which depend solely on the current label and the previous label. The m table has three columns: prev_label, label, and value. If the number of labels is \( n \), then the m factor table has \( n^2 \) rows. Each row encodes the transition feature weight from the previous label to the current label.
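The row count can be illustrated with a small Python sketch. It is not the SQL implementation; the helper `build_m_rows` and the `toy_weights` lookup are hypothetical names introduced here only to show that enumerating every ordered (prev_label, label) pair yields \( n^2 \) rows.

```python
from itertools import product

def build_m_rows(num_labels, weight_of):
    """Return one (prev_label, label, value) row per ordered label pair."""
    return [
        (prev, cur, weight_of(prev, cur))
        for prev, cur in product(range(num_labels), repeat=2)
    ]

# Hypothetical transition weights; pairs not listed are assumed to weigh 0.0.
toy_weights = {(0, 1): 1.5, (1, 2): -0.3}
rows = build_m_rows(3, lambda p, c: toy_weights.get((p, c), 0.0))

print(len(rows))  # 3 labels give 3 * 3 = 9 rows
```

With 45 labels the same enumeration would give 2025 edge rows.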

startFeature is a special edge feature that runs from the beginning of the sentence to the first token. Likewise, endFeature is a special edge feature that runs from the last token to the end of the sentence. The m table therefore encodes edgeFeature, startFeature, and endFeature. If the label space contains 45 labels, numbered 0 to 44, then the m factor array is as follows:

                 0  1  2  3  4  5...44
startFeature -1  a  a  a  a  a  a...a
edgeFeature   0  a  a  a  a  a  a...a
edgeFeature   1  a  a  a  a  a  a...a
...
edgeFeature  44  a  a  a  a  a  a...a
endFeature   45  a  a  a  a  a  a...a
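One way to read the array above is as a grid of \( n + 2 \) rows by \( n \) columns, where the row id runs from -1 (startFeature) through 44 (edgeFeature) to 45 (endFeature). The Python sketch below assumes that layout; the helpers `m_array_row` and `empty_m_array` and the sample weights are hypothetical and serve only to show the indexing.

```python
def m_array_row(row_id):
    """Map a row id to a 0-based row index: -1 (start) -> 0, label k -> k + 1."""
    return row_id + 1

def empty_m_array(num_labels):
    """Allocate (num_labels + 2) rows by num_labels columns, filled with 0.0."""
    return [[0.0] * num_labels for _ in range(num_labels + 2)]

NUM_LABELS = 45
m = empty_m_array(NUM_LABELS)
m[m_array_row(-1)][3] = 0.7   # startFeature: sentence begins with label 3
m[m_array_row(44)][0] = -0.2  # edgeFeature: transition from label 44 to label 0
m[m_array_row(45)][5] = 0.4   # endFeature: sentence ends with label 5
```

Shifting every row id by one keeps the start row at index 0 and the end row at index \( n + 1 \), so the edge rows stay contiguous.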

The viterbi_rtbl table is related to specific tokens. It encodes the single-state features, e.g., wordFeature and RegexFeature, for all tokens. The r table is represented as follows:
```
       0  1  2  3  4...44
token1 a  a  a  a  a...a
token2 a  a  a  a  a...a
```
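To show how the two factor tables can be combined into a best label sequence, here is a minimal Viterbi sketch in Python. It is an illustration under stated assumptions, not the module's implementation: `m` is assumed to be an \( (n + 2) \times n \) array (row 0 for startFeature, rows 1 to \( n \) for edgeFeature, row \( n + 1 \) for endFeature), `r` is assumed to hold one row of \( n \) scores per token, and scores are assumed to be additive.

```python
def viterbi(m, r):
    """Return the highest-scoring label sequence for one sentence.

    m: (n + 2) x n list; row 0 = start, rows 1..n = edges, row n + 1 = end.
    r: T x n list of single-state scores, one row per token.
    """
    n = len(r[0])
    # score[j]: best score of any path that ends in label j at the current token
    score = [m[0][j] + r[0][j] for j in range(n)]
    back = []  # back[t - 1][j]: best previous label for label j at token t
    for t in range(1, len(r)):
        prev_best, new_score = [], []
        for j in range(n):
            i = max(range(n), key=lambda k: score[k] + m[k + 1][j])
            prev_best.append(i)
            new_score.append(score[i] + m[i + 1][j] + r[t][j])
        back.append(prev_best)
        score = new_score
    # Add the end feature, pick the best final label, then trace back.
    last = max(range(n), key=lambda j: score[j] + m[n + 1][j])
    path = [last]
    for prev_best in reversed(back):
        path.append(prev_best[path[-1]])
    return path[::-1]

# Toy data: 2 labels, 3 tokens (all values are made up for illustration).
toy_m = [[0.0, 0.0],   # startFeature
         [0.0, 1.0],   # edges from label 0
         [1.0, 0.0],   # edges from label 1
         [0.0, 0.0]]   # endFeature
toy_r = [[2.0, 0.0], [0.0, 0.5], [1.0, 0.0]]
best = viterbi(toy_m, toy_r)
print(best)  # [0, 1, 0]
```

In the toy data, the transition weights reward switching labels, so the best path alternates even though token 2 on its own scores only 0.5 for label 1.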

Parameters

test_segment_tbl	Name of table containing all the tokenized testing sentences.
dictionary_tbl	Name of table containing the dictionary.
label_tbl	Name of table containing the label space used in POS or other NLP tasks.
regex_tbl	Name of table containing all the regular expressions to capture regex features.
crf_weights_tbl	Name of the table containing featureset weights.
viterbi_mtbl	Name of table to store the m factors.
viterbi_rtbl	Name of table to store the r factors.

◆ crf_train_fgen()

void crf_train_fgen	(	text	train_segment_tbl,
		text	regex_tbl,
		text	label_tbl,
		text	dictionary_tbl,
		text	train_feature_tbl,
		text	train_featureset_tbl
	)

Parameters

train_segment_tbl	Name of table containing all the tokenized training sentences.
regex_tbl	Name of table containing all the regular expressions to capture regex features.
label_tbl	Name of the label table containing unique ids and label names.
dictionary_tbl	Name of table containing the dictionary.
train_feature_tbl	Name of table to store the features generated from the training dataset.
train_featureset_tbl	Name of table to store the unique feature set generated from the training dataset.
