Rasa Original Docs: NLU Training Data
Author: Internet
Contents
Regular Expressions for Intent Classification
Regular Expressions for Entity Extraction
Regular Expressions as Features
Regular Expressions for Rule-based Entity Extraction
Entity Roles and Groups influencing dialogue predictions
NLU Training Data
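The goal of NLU (Natural Language Understanding) is to extract structured information from user messages. This usually includes the user's intent and any entities their message contains. You can add extra information such as regular expressions and lookup tables to your training data to help the model identify intents and entities correctly.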
Training Examples
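NLU training data consists of example user utterances categorized by intent. To make your intents easier to work with, give them names that relate to what the user wants to accomplish with that intent, keep them in lowercase, and avoid spaces and special characters.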
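The naming advice above can be checked mechanically. The following is a minimal sketch, not part of Rasa: the helper name `is_conventional_intent_name` is hypothetical, and allowing underscores and digits is an assumption based on intent names such as help_me used elsewhere in this article.

```python
import re

# Assumed convention: starts with a lowercase letter, then lowercase
# letters, digits, or underscores. No spaces or other special characters.
INTENT_NAME_PATTERN = re.compile(r"[a-z][a-z0-9_]*")


def is_conventional_intent_name(name: str) -> bool:
    """Return True if the whole name follows the assumed naming convention."""
    return INTENT_NAME_PATTERN.fullmatch(name) is not None


print(is_conventional_intent_name("check_balance"))
print(is_conventional_intent_name("Check Balance!"))
```

Under this assumed convention, the first name is accepted and the second is rejected because of the uppercase letters, the space, and the exclamation mark.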
Entities
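Entities are structured pieces of information inside a user message. For entity extraction to work, you need to either specify training data to train an ML model, or define regular expressions so that the RegexEntityExtractor can extract entities based on a character pattern.
When deciding which entities you need to extract, think about what information your assistant needs to fulfil its user goals. The user might provide additional pieces of information that you don't need; you don't need to extract these as entities.
See the training data format for details on how to annotate entities in your training data.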
Synonyms
Synonyms map extracted entities to a value other than the literal text extracted. You can use synonyms when there are multiple ways users refer to the same thing. Think of the end goal of extracting an entity, and figure out from there which values should be considered equivalent.
Let's say you had an entity account that you use to look up the user's balance. One of the possible account types is "credit". Your users also refer to their "credit" account as "credit account" and "credit card account".
In this case, you could define "credit card account" and "credit account" as synonyms to "credit":
nlu:
- synonym: credit
  examples: |
    - credit card account
    - credit account
Then, if either of these phrases is extracted as an entity, it will be mapped to the value credit.
See the training data format for details on how to include synonyms in your training data.
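Conceptually, synonym mapping is a lookup applied to a value after it has been extracted. The sketch below illustrates that idea in plain Python; it is not Rasa's implementation. The table `SYNONYMS`, the function `normalize_entity_value`, and the lowercasing of the input are assumptions made for the example.

```python
# Hypothetical synonym table mirroring the credit example: variant -> canonical value.
SYNONYMS = {
    "credit card account": "credit",
    "credit account": "credit",
}


def normalize_entity_value(value: str) -> str:
    """Map an extracted value to its canonical form, or return it unchanged.

    Lowercasing before the lookup is an assumption of this sketch.
    """
    return SYNONYMS.get(value.lower(), value)


print(normalize_entity_value("credit card account"))
print(normalize_entity_value("savings"))
```

Values without a synonym entry, such as "savings" here, pass through unchanged, which matches the idea that synonyms only replace the literal text when a mapping has been defined.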
Regular Expressions
You can use regular expressions to improve intent classification and entity extraction in combination with the RegexFeaturizer and RegexEntityExtractor components in the pipeline.
Regular Expressions for Intent Classification
You can use regular expressions to improve intent classification by including the RegexFeaturizer component in your pipeline. When using the RegexFeaturizer, a regex does not act as a rule for classifying an intent. It only provides a feature that the intent classifier will use to learn patterns for intent classification. Currently, all intent classifiers make use of available regex features.
The name of a regex in this case is a human readable description. It can help you remember what a regex is used for, and it is the title of the corresponding pattern feature. It does not have to match any intent or entity name. A regex for a "help" request might look like this:
nlu:
- regex: help
  examples: |
    - \bhelp\b
The intent being matched could be greet, help_me, assistance or anything else.
Try to write your regular expressions so that they match as few words as possible. For example, use \bhelp\b instead of help.*, as the latter might match the whole message whereas the former only matches a single word.
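The difference between the two patterns can be seen directly with Python's `re` module. This is only a demonstration of the regex behaviour; the sample message is made up for the example.

```python
import re

# Made-up user message for illustration.
message = "can you help me reset my password"

# Word-bounded pattern: matches only the single word "help".
narrow = re.search(r"\bhelp\b", message)

# Greedy pattern: matches from "help" to the end of the message.
greedy = re.search(r"help.*", message)

print(narrow.group())  # help
print(greedy.group())  # help me reset my password
```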
Regular Expressions for Entity Extraction
If your entity has a deterministic structure, you can use regular expressions in one of two ways:
Regular Expressions as Features
You can use regular expressions to create features for the RegexFeaturizer component in your NLU pipeline.
When using a regular expression with the RegexFeaturizer, the name of the regular expression does not matter. The regular expression provides a feature that helps the model learn an association between intents/entities and inputs that fit the regular expression.
Regex features for entity extraction are currently only supported by the CRFEntityExtractor and DIETClassifier components. Other entity extractors, like MitieEntityExtractor or SpacyEntityExtractor, won't use the generated features and their presence will not improve entity recognition for these extractors.
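As a rough illustration of what "a regular expression provides a feature" can mean, the sketch below marks, for each token, whether it overlaps a match of each pattern. This is a conceptual sketch and not the RegexFeaturizer's actual implementation; the function name `regex_token_features`, the whitespace tokenization, and the five-digit pattern are assumptions made for the example.

```python
import re


def regex_token_features(message, patterns):
    """Return (token, [0/1 per pattern]) pairs for whitespace-separated tokens.

    A feature is 1 when the token's character span overlaps a pattern match.
    """
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", message)]
    features = []
    for text, start, end in tokens:
        row = []
        for pattern in patterns:
            overlaps = any(
                m.start() < end and start < m.end()
                for m in re.finditer(pattern, message)
            )
            row.append(1 if overlaps else 0)
        features.append((text, row))
    return features


# Two illustrative patterns: a five-digit number and the word "help".
for token, row in regex_token_features("my zip is 12345", [r"\b\d{5}\b", r"\bhelp\b"]):
    print(token, row)
```

A downstream model would still have to learn what a 1 in a given column means, which is why the feature alone does not act as a rule.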
Regular Expressions for Rule-based Entity Extraction
You can use regular expressions for rule-based entity extraction using the RegexEntityExtractor component in your NLU pipeline.
When using the RegexEntityExtractor, the name of the regular expression should match the name of the entity you want to extract. For example, you could extract account numbers of 10-12 digits by including this regular expression and at least two annotated examples in your training data:
nlu:
- regex: account_number
  examples: |
    - \d{10,12}
- intent: inform
  examples: |
    - my account number is [1234567891](account_number)
    - This is my account number [1234567891](account_number)
Whenever a user message contains a sequence of 10-12 digits, it will be extracted as an account_number entity. RegexEntityExtractor doesn't require training examples to learn to extract the entity, but you do need at least two annotated examples of the entity so that the NLU model can register it as an entity at training time.
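The rule-based idea can be sketched with Python's `re` module: every match of the pattern becomes an entity named after the regex. This is an illustration of the technique, not the RegexEntityExtractor's code; the helper `extract_regex_entities` and the shape of the returned dicts are assumptions modelled on the entity objects shown later in this article.

```python
import re


def extract_regex_entities(message, entity_name, pattern):
    """Return one entity dict per regex match, with character offsets."""
    return [
        {"entity": entity_name, "start": m.start(), "end": m.end(), "value": m.group()}
        for m in re.finditer(pattern, message)
    ]


entities = extract_regex_entities(
    "my account number is 1234567891", "account_number", r"\d{10,12}"
)
print(entities)

# Without word boundaries, the pattern also matches inside longer digit runs.
partial = extract_regex_entities("id 12345678901234", "account_number", r"\d{10,12}")
print(partial[0]["value"])  # 123456789012

# Adding \b on both sides rejects the 14-digit number entirely.
strict = extract_regex_entities("id 12345678901234", "account_number", r"\b\d{10,12}\b")
print(strict)  # []
```

In plain regex terms, a bare \d{10,12} will pick the first 12 digits out of a longer number, so wrapping the pattern in \b may be worth considering if that is not what you want.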
Lookup Tables
Lookup tables are lists of words used to generate case-insensitive regular expression patterns. They can be used in the same ways as regular expressions are used, in combination with the RegexFeaturizer and RegexEntityExtractor components in the pipeline.
You can use lookup tables to help extract entities which have a known set of possible values. Keep your lookup tables as specific as possible. For example, to extract country names, you could add a lookup table of all countries in the world:
nlu:
- lookup: country
  examples: |
    - Afghanistan
    - Albania
    - ...
    - Zambia
    - Zimbabwe
When using lookup tables with RegexFeaturizer, provide enough examples for the intent or entity you want to match so that the model can learn to use the generated regular expression as a feature. When using lookup tables with RegexEntityExtractor, provide at least two annotated examples of the entity so that the NLU model can register it as an entity at training time.
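One way to picture how a word list becomes a case-insensitive pattern is the sketch below. It is an assumption about the general technique, not a copy of how Rasa builds its patterns; the function `lookup_to_pattern` is hypothetical and the country list is shortened to four of the entries above.

```python
import re


def lookup_to_pattern(words):
    """Build one case-insensitive, word-bounded alternation from a word list."""
    # Longer entries first so that a longer name wins over a shorter prefix.
    escaped = [re.escape(word) for word in sorted(words, key=len, reverse=True)]
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


countries = ["Afghanistan", "Albania", "Zambia", "Zimbabwe"]
pattern = lookup_to_pattern(countries)

matches = pattern.findall("flights from ALBANIA to zimbabwe")
print(matches)  # ['ALBANIA', 'zimbabwe']
```

The resulting pattern can then play either role described above: as a source of features or as a rule for extraction.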
Entity Roles and Groups
Annotating words as custom entities allows you to define certain concepts in your training data. For example, you can identify cities by annotating them:
I want to fly from [Berlin]{"entity": "city"} to [San Francisco]{"entity": "city"}.
However, sometimes you want to add more details to your entities.
For example, to build an assistant that should book a flight, the assistant needs to know which of the two cities in the example above is the departure city and which is the destination city. Berlin and San Francisco are both cities, but they play different roles in the message. To distinguish between the different roles, you can assign a role label in addition to the entity label.
- I want to fly from [Berlin]{"entity": "city", "role": "departure"} to [San Francisco]{"entity": "city", "role": "destination"}.
You can also group different entities by specifying a group label next to the entity label. The group label can, for example, be used to define different orders. In the following example, the group label specifies which toppings go with which pizza and what size each pizza should be.
Give me a [small]{"entity": "size", "group": "1"} pizza with [mushrooms]{"entity": "topping", "group": "1"} and
a [large]{"entity": "size", "group": "2"} [pepperoni]{"entity": "topping", "group": "2"}
See the Training Data Format for details on how to define entities with roles and groups in your training data.
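Once entities carry a group label, downstream code can collect them per group. The sketch below assumes each extracted entity is a dict with "entity", "value", and "group" keys, in line with the entity objects described in this section; the function `orders_by_group` is hypothetical and keeps only one value per entity type within a group.

```python
from collections import defaultdict


def orders_by_group(entities):
    """Collect entity values per group label: {group: {entity: value}}."""
    orders = defaultdict(dict)
    for item in entities:
        # A later value of the same entity type in the same group overwrites
        # the earlier one; a real assistant might keep a list instead.
        orders[item["group"]][item["entity"]] = item["value"]
    return dict(orders)


# Entities as they might be extracted from the pizza example above.
pizza_entities = [
    {"entity": "size", "value": "small", "group": "1"},
    {"entity": "topping", "value": "mushrooms", "group": "1"},
    {"entity": "size", "value": "large", "group": "2"},
    {"entity": "topping", "value": "pepperoni", "group": "2"},
]

orders = orders_by_group(pizza_entities)
print(orders["1"])  # {'size': 'small', 'topping': 'mushrooms'}
print(orders["2"])  # {'size': 'large', 'topping': 'pepperoni'}
```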
The entity object returned by the extractor will include the detected role/group label.
{
  "text": "Book a flight from Berlin to SF",
  "intent": "book_flight",
  "entities": [
    {
      "start": 19,
      "end": 25,
      "value": "Berlin",
      "entity": "city",
      "role": "departure",
      "extractor": "DIETClassifier"
    },
    {
      "start": 29,
      "end": 31,
      "value": "San Francisco",
      "entity": "city",
      "role": "destination",
      "extractor": "DIETClassifier"
    }
  ]
}
In order to properly train your model with entities that have roles and groups, make sure to include enough training examples for every combination of entity and role or group label. To enable the model to generalize, make sure to have some variation in your training examples. For example, you should include examples like fly TO y FROM x, not only fly FROM x TO y.
To fill slots from entities with a specific role/group, you need to either define custom slot mappings using forms or use custom actions to extract the corresponding entity directly from the tracker.
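The filtering step that such a custom action would perform can be sketched in plain Python over a list of entity dicts. This sketch does not use the Rasa SDK or its tracker object; the helper `entity_values` and the inline `parsed` dict (a trimmed copy of the entity object shown above) are assumptions made for the example.

```python
def entity_values(entities, entity, role=None, group=None):
    """Return values of entities matching the type and, if given, role and group."""
    return [
        item["value"]
        for item in entities
        if item.get("entity") == entity
        and (role is None or item.get("role") == role)
        and (group is None or item.get("group") == group)
    ]


# Trimmed copy of the parsed message from the example above.
parsed = {
    "text": "Book a flight from Berlin to SF",
    "entities": [
        {"value": "Berlin", "entity": "city", "role": "departure"},
        {"value": "San Francisco", "entity": "city", "role": "destination"},
    ],
}

departure = entity_values(parsed["entities"], "city", role="departure")
destination = entity_values(parsed["entities"], "city", role="destination")
print(departure)    # ['Berlin']
print(destination)  # ['San Francisco']
```

The returned values could then be written to separate slots, for example one for the departure city and one for the destination.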
Entity Roles and Groups influencing dialogue predictions
If you want to influence the dialogue predictions by roles or groups, you need to modify your stories to contain the desired role or group label. You also need to list the corresponding roles and groups of an entity in your domain file.
Let's assume you want to output a different sentence depending on what the user's location is. For example, if the user just arrived from London, you might want to ask how the trip from London was. But if the user is on the way to Madrid, you might want to wish the user a good stay. You can achieve this with the following two stories:
stories:
- story: The user just arrived from another city.
  steps:
    - intent: greet
    - action: utter_greet
    - intent: inform_location
      entities:
        - city: London
          role: from
    - action: utter_ask_about_trip
- story: The user is going to another city.
  steps:
    - intent: greet
    - action: utter_greet
    - intent: inform_location
      entities:
        - city: Madrid
          role: to
    - action: utter_wish_pleasant_stay
BILOU Entity Tagging
The DIETClassifier and CRFEntityExtractor have the option BILOU_flag, which refers to a tagging schema that can be used by the machine learning model when processing entities. BILOU is short for Beginning, Inside, Last, Outside, and Unit-length.
For example, the training example
[Alex]{"entity": "person"} is going with [Marty A. Rick]{"entity": "person"} to [Los Angeles]{"entity": "location"}.
is first split into a list of tokens. Then the machine learning model applies the tagging schema as shown below depending on the value of the option BILOU_flag:
token | BILOU_flag = true | BILOU_flag = false |
---|---|---|
alex | U-person | person |
is | O | O |
going | O | O |
with | O | O |
marty | B-person | person |
a | I-person | person |
rick | L-person | person |
to | O | O |
los | B-location | location |
angeles | L-location | location |
The BILOU tagging schema is richer compared to the normal tagging schema. It may help to improve the performance of the machine learning model when predicting entities.
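The BILOU column of the table can be reproduced with a few lines of Python. This is a sketch of the tagging schema itself, not of Rasa's internals; the function `bilou_tags` and the representation of entities as (start token index, end token index exclusive, label) triples are assumptions made for the example.

```python
def bilou_tags(tokens, spans):
    """Assign BILOU tags to tokens given entity spans over token indices.

    spans: iterable of (start, end_exclusive, label) triples.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        if end - start == 1:
            # Single-token entity: Unit-length.
            tags[start] = f"U-{label}"
        else:
            # Multi-token entity: Beginning, Inside..., Last.
            tags[start] = f"B-{label}"
            for index in range(start + 1, end - 1):
                tags[index] = f"I-{label}"
            tags[end - 1] = f"L-{label}"
    return tags


tokens = ["alex", "is", "going", "with", "marty", "a", "rick", "to", "los", "angeles"]
spans = [(0, 1, "person"), (4, 7, "person"), (8, 10, "location")]

for token, tag in zip(tokens, bilou_tags(tokens, spans)):
    print(token, tag)
```

The extra B/I/L/U prefixes are what make the schema richer: they encode where an entity starts and ends, not just which tokens belong to it.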
Tags: Training, NLU, Rasa, use, account, entity, entities, regular, intent. Source: https://blog.csdn.net/Da___Vinci/article/details/112987255