Question Answering — is it different in conversations?

Sangeetha Venkatesan
LMAO Learning Group
Jan 10, 2023


There is more to consider than the question itself: intelligence is the ability to work through those possibilities in finite space and finite time.


We are in the era of conversational AI, where we augment training phrases using large language models and fine-tune on the synthetic samples using generative or classification models.

There is also the case of remodeling the problem to fit into the existing framework of NLP tasks: most closely, text classification, question answering, and intent detection.

Alignment: right now we perform question-to-question alignment, but there could also be question-to-action alignment, which could pave the way to adding more information to the common-knowledge repository and resolving some cases of the missing-text phenomenon. I have no evidence for this, but it seems like a promising direction.

Hence, in conversational AI, questions to an FAQ chatbot are different from questions handled in context. We don't want to treat each utterance as a separate FAQ trigger every time. Consider the dialog below:

User: I would like to apply for a loan.
Bot: Sure, here is the answer (LoanApplication intent triggered).
User: Sounds good. How about for a car?
Bot: Sorry, I can't understand that.

Let’s explore the purpose of question-in-context rewriting, the models, and the datasets around it. Generation models are of great use here, since prompting can take the context into account and preserve the hierarchy of events.

Questions can be single-turn, multi-hop, or multi-turn, or they can involve complex reasoning over compositional semantics.

I came across a representation called a logical form of a sentence — an intermediate representation between the surface form of a sentence and the intended meaning of a sentence.

Below are some generic approaches, well known in the conversational AI landscape, for manipulating the questions asked to the bot. The goal of uncovering the “hidden slate” information or “common knowledge” is much more than simple question reforming.

📝 Query rewriting, repurposed from recommendation or semantic search:

This rewrites a query by searching a database of query logs for semantically similar queries, offered as recommendations so that users choose within a similar semantic scope. Query-question pairs are ranked by customer click rate, which determines the confidence.
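Below is a minimal sketch of this idea, assuming the sentence-transformers package; the model name and the toy query log are illustrative placeholders rather than a production setup.

```python
# Sketch: rewrite suggestions via semantic search over a query log.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed general-purpose encoder

# Hypothetical log of previously seen, well-formed queries.
query_log = [
    "How do I apply for a personal loan?",
    "What is the interest rate on a car loan?",
    "How do I check my account balance?",
]
log_embeddings = model.encode(query_log, convert_to_tensor=True)

def suggest_rewrites(user_query: str, top_k: int = 2):
    """Return the top-k semantically similar logged queries as rewrite candidates."""
    query_embedding = model.encode(user_query, convert_to_tensor=True)
    hits = util.semantic_search(query_embedding, log_embeddings, top_k=top_k)[0]
    return [(query_log[hit["corpus_id"]], hit["score"]) for hit in hits]

print(suggest_rewrites("loan for car?"))
```

In practice the candidates would come from real query logs, and the similarity score would be blended with click-through confidence before surfacing recommendations.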

Where good-quality data criteria fall short, there is always room for reinforcement learning from human feedback.

📝 Paraphrase Generation:

Either models fine-tuned on domain-specific paraphrase question pairs or general models like T5 are used. This is a very prevalent data augmentation technique for NLU.
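As a hedged sketch, here is what T5-based paraphrase generation can look like with Hugging Face transformers; the checkpoint name is one publicly shared paraphrase model and is only an assumption, so swap in your own fine-tuned model.

```python
# Sketch: paraphrase generation with a T5-style seq2seq model.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoint = "Vamsi/T5_Paraphrase_Paws"  # assumed public checkpoint; use your own
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# This checkpoint expects a "paraphrase:" prefix on the input.
inputs = tokenizer("paraphrase: How do I apply for a car loan?", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_length=64,
    num_beams=5,
    num_return_sequences=3,  # several candidates for NLU data augmentation
    early_stopping=True,
)
for out in outputs:
    print(tokenizer.decode(out, skip_special_tokens=True))
```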

📝 Text normalization:

This converts non-canonical text into standard writing, and is sometimes reframed as a text simplification process. I am wondering whether it can also capture helpfulness, clarity, and utility for conversational AI responses.
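A toy normalization pass might look like the following; the shorthand dictionary is made up for illustration, and real systems use much richer rules or a trained normalization model.

```python
# Toy normalization: lowercase, collapse repeated punctuation, expand shorthand.
import re

SLANG = {"pls": "please", "u": "you", "acct": "account", "bal": "balance"}  # made up

def normalize(utterance: str) -> str:
    text = utterance.lower().strip()
    text = re.sub(r"([!?.])\1+", r"\1", text)  # "!!!" -> "!"
    tokens = []
    for tok in text.split():
        word = tok.strip("!?.,")
        tokens.append(tok.replace(word, SLANG[word]) if word in SLANG else tok)
    return " ".join(tokens)

print(normalize("pls check my acct bal!!!"))  # -> "please check my account balance!"
```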

📝 Back Translation:

A few rounds of back translation can change the wording of a sentence, so we may end up training the model on utterances that US-based customers would never actually ask. That said, I have seen good performance from Russian, German, and French round-trip translations.
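Here is a round-trip translation sketch using the Helsinki-NLP MarianMT checkpoints on the Hugging Face hub; the English-German pair is just one option among the language pairs mentioned above.

```python
# Sketch: English -> German -> English round-trip with MarianMT checkpoints.
from transformers import MarianMTModel, MarianTokenizer

def translate(texts, checkpoint):
    tokenizer = MarianTokenizer.from_pretrained(checkpoint)
    model = MarianMTModel.from_pretrained(checkpoint)
    batch = tokenizer(texts, return_tensors="pt", padding=True)
    generated = model.generate(**batch)
    return [tokenizer.decode(g, skip_special_tokens=True) for g in generated]

english = ["I would like to apply for a car loan."]
german = translate(english, "Helsinki-NLP/opus-mt-en-de")
back = translate(german, "Helsinki-NLP/opus-mt-de-en")
print(back)  # a paraphrase-like variant of the original utterance
```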

Let’s not forget one thing models have to contend with: adversarial examples. Handling them is essential for the reliability of the product and the robustness of the model.

📝 Repurposing the existing customer conversations as training data:

There are also cases where we repurpose chat conversations that land in the unrecognized or fallback bucket; these can be recycled via semantic-closeness techniques to add training data to the model.
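A minimal sketch of that recycling loop, assuming sentence-transformers embeddings; the intents, utterances, and similarity threshold are illustrative and would need tuning against real data.

```python
# Sketch: attach fallback utterances to the nearest existing intent example.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder

intent_examples = {  # one representative phrase per intent, for illustration
    "check_balance": "What is my account balance?",
    "apply_loan": "I want to apply for a loan.",
}
fallback_utterances = ["how much money do i have", "gimme a loan pls"]

intent_names = list(intent_examples)
intent_emb = model.encode(list(intent_examples.values()), convert_to_tensor=True)
fallback_emb = model.encode(fallback_utterances, convert_to_tensor=True)

scores = util.cos_sim(fallback_emb, intent_emb)
for i, utterance in enumerate(fallback_utterances):
    best = int(scores[i].argmax())
    if float(scores[i][best]) > 0.5:  # assumed threshold; tune on real data
        print(f"{utterance!r} -> candidate training phrase for {intent_names[best]}")
```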


🦹‍♂️ Linguistic phenomena tied around it:

Prof. Dr. Walid Saba notes that the kind of understanding we are looking for is not in the data. People are not explicit in their statements or utterances to a chatbot. Human-to-human conversation revolves around the common-ground agreement we maintain in order to continue the conversation. In some sense, we rewrite questions in our minds to bring the other person back into that space of agreement.

  1. Coreference: This is the phenomenon where we have to identify and link all the mentions of a particular entity in the context.

“The balance of my savings account is 300 dollars and my checking account is 100 dollars; could you help transfer from the former?” Here, “the former” refers to the savings account.
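Real systems use trained coreference models, but a deliberately tiny rule-based sketch makes the phenomenon concrete; the resolve_former_latter helper below is hypothetical and only handles this one pattern.

```python
# Hypothetical rule-based resolver for "the former" / "the latter";
# real coreference needs a trained model, this only shows the idea.
def resolve_former_latter(utterance: str, mentions: list) -> str:
    if len(mentions) >= 2:
        utterance = utterance.replace("the former", "the " + mentions[0])
        utterance = utterance.replace("the latter", "the " + mentions[1])
    return utterance

mentions = ["savings account", "checking account"]  # in order of mention
print(resolve_former_latter("Could you help transfer from the former?", mentions))
# -> "Could you help transfer from the savings account?"
```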

2. Ellipsis resolution: Customers tend to keep chat conversations short. Consider a case where a debit card is lost: I don't want to spell out all the information in one go. Ellipsis resolution deciphers those missing words or phrases.

Consider an example of ellipsis resolution in the context of a banking domain conversational AI:

I’d like to check my balance… what’s my account balance?

The expected response after ellipsis resolution: “Which account would you like to check the balance for? We have several accounts linked to your name, including a checking account and a savings account.”

🪄 It’s a prompt asking for clarification, to resolve to the right intended meaning.
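One way to operationalize this is to frame the rewrite as a prompt for a generation model. The template below is an assumption, not a known best practice; the resulting string can be fed to whichever completion API or seq2seq model you use.

```python
# Hypothetical prompt template for question-in-context rewriting.
def build_rewrite_prompt(history, question):
    turns = "\n".join(f"- {turn}" for turn in history)
    return (
        "Rewrite the final question so it is fully self-contained, "
        "filling in any words the user left out. Do not change the meaning.\n"
        f"Conversation so far:\n{turns}\n"
        f"Final question: {question}\n"
        "Rewritten question:"
    )

history = [
    "User: I would like to apply for a loan.",
    "Bot: Sure, here is how loan applications work.",
]
print(build_rewrite_prompt(history, "How about for a car?"))
# A good model should produce something like: "How do I apply for a car loan?"
```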

3. Nominal compounds: The phenomenon where an entity can be addressed by more than one noun. It comprises two or more nouns that function as a single unit.

I’d like to open a new checking account

Expected resolution: Certainly, we offer a range of checking account options for our customers. Would you like to open a standard checking account, a high-yield checking account, or a student checking account?

There could be several nominal compounds in the banking domain relating to accounts, loans, financial products, and services.
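To surface such multi-noun units programmatically, spaCy's noun chunks give a quick first pass (this assumes the en_core_web_sm model is installed); it is a heuristic, not a full nominal-compound analysis.

```python
# Surface multi-noun units ("nominal compounds") via spaCy noun chunks.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed
doc = nlp("I'd like to open a new high-yield checking account.")

for chunk in doc.noun_chunks:
    print(chunk.text)  # e.g. "a new high-yield checking account"
```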

These phenomena revolve around reading comprehension: to resolve the questions asked, we need to find the appropriate text span in the previous discourse. The above problems are generally modeled as a question-answering task.

It ultimately comes down to differentiating language that is genuinely learned from language that is merely induced from data.

There are also approaches that represent the knowledge base as a knowledge graph, where one-hop or few-hop questions can be answered by traversing paths in the graph.
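A toy version of that idea, with the knowledge graph as a plain adjacency dictionary; the entities and relations are made up, and a one-hop question reduces to a single edge lookup.

```python
# Toy knowledge graph as an edge dictionary; entities/relations are made up.
KG = {
    ("savings_account", "interest_rate"): "2.1%",
    ("checking_account", "monthly_fee"): "$5",
    ("personal_loan", "max_term"): "60 months",
}

def one_hop(entity: str, relation: str) -> str:
    """Answer a one-hop question by a single edge lookup."""
    return KG.get((entity, relation), "not found in the knowledge base")

print(one_hop("savings_account", "interest_rate"))  # -> "2.1%"
```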

Recent papers on this, ordered from latest:

  1. Improving Complex Knowledge Base Question Answering via Question-to-Action and Question-to-Question Alignment.

Performing question answering by directly comparing semantic similarity with the knowledge base is not always sufficient; some complex questions need extra reasoning. The framework the authors developed follows a three-step procedure of question rewriting, question-to-action alignment, and question-to-question alignment. Most explicit implementations I have seen revolve around question-to-question alignment with large language models using embeddings.

Rather than using the similar questions in our curated dataset as similarity context, they used a reward function to select the correct action sequence.

2. Asking Better Questions: a large-scale, multi-domain dataset for rewriting ill-formed or context-less vague questions, which in isolation lead to a fallback.

How can we say a question is well-formed and sufficient to understand the intent of the customer? Apart from generally known practices of augmenting questions into a synthetic dataset or paraphrasing, rewriting the question is new to me. Sometimes we get “word salad” from the conversations the customer has with the bot. The crucial step: rewriting without changing the semantics.

This is more of an issue with speech recognition engines, which need a transcription service before moving to the text understanding model. A supervised, well-formed dataset helps us generalize to not-so-structured questions.

🔖 Datasets:

  1. CANARD: a dataset for question-in-context rewriting that couples dialog context with a context-independent rewriting of the question. It can be used to evaluate question rewriting models.

2. QuAC: a conversational reading comprehension dataset where each answer is a span selected from a given section of a passage. If the question cannot be answered from the passage, the expected answer is “Can’t answer based on the knowledge base.”

3. MQR (Multi-domain Question rewriting dataset)

This is based on human-contributed Stack Exchange question edit histories, pairing ill-formed questions with their well-formed rewrites. It helps train question rewriting models with better semantic preservation.

Well-formedness can come at the cost of semantic drift; semantic similarity alone cannot determine whether the intent stays coherent between the original question and its rewrite. Generic approaches here include paraphrasing, grammatical error correction, back translation, etc.

Examples from the MQR Dataset

Some generic constraints on question rewriting don't solve NLU but might help improve text classification:

  1. The questions should be grammatically correct, without semantic drift.
  2. The questions should not contain spelling errors or random out-of-domain words.
  3. The questions should be explicit: a search query, command, or instruction should be reformed into a question (questions starting with “how”, “why”, “when”, “what”, “which”, “who”, “whose”, “do”, “where”, “does”, “is”, “are”, “must”, “may”, “need”, “did”, “was”, “were”, “can”, “has”, “have”). A toy check for this constraint is sketched below.
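The sketch below checks only whether an utterance already starts with a question word; real explicitness detection would need more than a first-token test.

```python
# Toy check for the explicitness constraint: does the utterance already
# start with a question word, or does it need rewriting into a question?
QUESTION_WORDS = {
    "how", "why", "when", "what", "which", "who", "whose", "do", "where",
    "does", "is", "are", "must", "may", "need", "did", "was", "were",
    "can", "has", "have",
}

def is_explicit_question(utterance: str) -> bool:
    words = utterance.strip().lower().split()
    return bool(words) and words[0] in QUESTION_WORDS

print(is_explicit_question("check balance"))        # False: needs rewriting
print(is_explicit_question("What is my balance?"))  # True
```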

🚀 My perspective on understanding the Natural Language Framework:

Conversational AI is not a stand-alone task; there are multiple pipelines to take into consideration, all under tight latency budgets. Context-sensitive and context-insensitive pipeline steps can be combined.

There are different ways to frame this learning: one as knowledge learning and the other as knowledge acquisition. The latter instills a kind of subconscious ability to respond to questions or actions; the former incrementally builds a mapping from questions to responses or actions.

Could there be a framework where the learnability of chatbots is generalized and then simply fine-tuned?

Learning in conversational AI:

  1. Learning from curated datasets of the chatbot's knowledge base, where the ground truth might be wrong: I may put a training phrase in one class bucket while another person thinks it belongs in a different class.
  2. Mapping the existing collected conversations to new conversations coming into the chatbot.
  3. Learning from its own context of the history of messages, to take action on subsequent messages.
  4. Learning from human feedback, where the feedback changes the learnability or the dataset of the model.
  5. Learning by deduction, deriving new knowledge using proven techniques and guided by existing instructions.
  6. Learning from a manually curated factual knowledge base that is in mutual agreement with shared ethics and principles.

Is there a learning framework that satisfies all these different perspectives? How can a margin of error be approximated for all these methods? Not every piece of knowledge comes with a confidence attached. We are now in a framework of understanding systems where each evaluation is made against a confidence threshold.

There is a subtle AI winter in the conversational AI landscape: bundling the process of building chatbots into one specific framework of deep learning or language tasks (classification, semantic search, etc.) while setting stakeholder expectations of true language understanding.

I really admire Prof. Dr. Walid Saba's thoughts on companies adopting rich, multidisciplinary approaches to building conversational AI or intelligent systems. Another perspective is how even an algorithm as simple as k-means can help cluster the data.

In posts to follow, I will go deeper into each of these sections. Keep learning!

🧠 Models:

  1. Fine-tuning sequence-to-sequence models

  2. Enterprise text generation models (OpenAI, Cohere)

📝 Benchmarks

  1. BLEU scores on test and validation data.
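For reference, a sentence-level BLEU computation with NLTK might look like this; corpus-level BLEU (or sacreBLEU) is what you would actually report on test and validation splits.

```python
# Sentence-level BLEU between a reference rewrite and a model rewrite.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["how do i apply for a car loan".split()]  # list of reference token lists
candidate = "how can i apply for a car loan".split()

score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```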

Plausible misleads in question rewriting:

  1. It might introduce new information that is not factually grounded in the knowledge base.
  2. There is also the issue of “prompt injection”: a rewrite shouldn't produce unexpected results or leak information from closed content, in the banking domain for example.
  3. There should be a Cohen's kappa measure that validates the ground truth of a conversational AI dataset across multiple domain experts, as in the sketch below.
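As a sketch of that agreement check, scikit-learn ships cohen_kappa_score; the two annotation lists here are made-up labels from two hypothetical domain experts over the same utterances.

```python
# Cohen's kappa between two hypothetical annotators labeling the same utterances.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["apply_loan", "check_balance", "apply_loan", "fallback"]
annotator_b = ["apply_loan", "check_balance", "check_balance", "fallback"]

print(cohen_kappa_score(annotator_a, annotator_b))  # 1.0 = perfect agreement
```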

One more part will follow on implementing all the above concepts using large language models.

Language actors take the stage, with LLMs creating a pipeline for us to solve NLU tasks.

