Exercise: Module 6, Lesson 3 - The Unstructured Data Audit
Objective: To develop a practical ability to identify sources of high-value, unstructured proprietary data within a typical organization and to anticipate the challenges of preparing that data for use in AI applications.
Your Task
This exercise is a thought experiment in data strategy. You will act as a data strategist tasked with finding the hidden "gold" inside a specific business unit.
- Choose a Department: Select one of the following standard business departments:
  - Sales Department
  - Customer Support Department
  - Human Resources (HR) Department
  - Product Development / R&D Department
- Audit Unstructured Data Sources: For your chosen department, brainstorm and list at least five distinct types of unstructured data that this department likely creates and stores. Think beyond simple spreadsheets; focus on the messy, human-generated content.
- Identify Potential Quality Issues: For each data source you listed, describe one potential data quality issue that would need to be addressed before the data could be used to train a reliable AI model (e.g., inconsistent formatting, slang or jargon, or missing context).
- Propose an AI Application: Describe one specific, high-value AI application that could be built by combining and analyzing the unstructured data sources you identified. What business problem would it solve?
Deliverable
Write a short analysis in a Markdown file. Structure your analysis with the following headings:
- Department Audited: [Your Chosen Department]
- Unstructured Data Audit:
  - Data Source 1: [Name of data source]
    - Potential Quality Issue: [Description of issue]
  - Data Source 2: [Name of data source]
    - Potential Quality Issue: [Description of issue]
  - ...and so on for all five sources.
- Proposed AI Application: [Your description of the high-value AI application]
Example Submission Snippet:
Department Audited: Customer Support Department
Unstructured Data Audit:
- Data Source 1: Customer Support Call Transcripts
  - Potential Quality Issue: Automated transcriptions often contain errors in spelling, punctuation, and speaker identification, which would need to be corrected.
- Data Source 2: Customer Support Email Archives
  - Potential Quality Issue: Emails contain a lot of boilerplate (signatures, legal disclaimers) that is not relevant to the customer's problem and would need to be stripped out.
- Data Source 3: Internal Wiki/Knowledge Base for Support Agents
  - Potential Quality Issue: Many articles may be outdated or contain conflicting information that needs to be versioned and reconciled.
- Data Source 4: Chat Logs from the Website's Help Widget
  - Potential Quality Issue: Chat logs are filled with informal language, slang, and typos that a model might misinterpret.
- Data Source 5: Post-Interaction Customer Satisfaction (CSAT) Survey Verbatim Comments
  - Potential Quality Issue: Comments are often short and sometimes sarcastic, which makes sentiment analysis unreliable unless each comment is linked back to the interaction it refers to.
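Some of these quality issues can be reduced with simple preprocessing. As a rough illustration only, the Python sketch below strips signature and disclaimer lines from an email body and expands a few chat abbreviations. The patterns, the slang map, and the function names (`strip_email_boilerplate`, `normalize_chat`) are hypothetical and would need to be tuned to an organization's own data.

```python
import re

# Hypothetical markers: a line that consists only of a sign-off starts the signature.
SIGNATURE_START = re.compile(
    r"^(--|(best |kind )?regards[,.]?|thanks[,.!]?|cheers[,.!]?|sent from my .*)$",
    re.IGNORECASE,
)
# Hypothetical pattern for a legal disclaimer line.
DISCLAIMER = re.compile(
    r"this (e-?mail|message).*(confidential|intended recipient)",
    re.IGNORECASE,
)
# Hypothetical slang map for chat logs; a real one would be much larger.
CHAT_SLANG = {"pls": "please", "thx": "thanks", "u": "you", "acct": "account"}


def strip_email_boilerplate(body: str) -> str:
    """Keep lines up to the first sign-off and drop disclaimer lines."""
    kept = []
    for line in body.splitlines():
        stripped = line.strip()
        if SIGNATURE_START.match(stripped):
            break  # everything after the sign-off is treated as signature
        if DISCLAIMER.search(stripped):
            continue  # skip legal boilerplate
        kept.append(stripped)
    return "\n".join(kept).strip()


def normalize_chat(message: str) -> str:
    """Lowercase a chat message and expand known abbreviations."""
    tokens = message.lower().split()
    return " ".join(CHAT_SLANG.get(token, token) for token in tokens)
```

Rules like these are brittle: a customer who writes "Thanks" on its own line in the middle of a message would have the rest of the email discarded, so the output should be spot-checked before it is used for training.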
Proposed AI Application:
"Proactive Support Agent." By combining these data sources, we could build an AI agent for our support team. When a new customer email or chat comes in, this agent would analyze the customer's question, search the history of past interactions (calls, emails, chats) with that customer, and find the most relevant articles from the internal knowledge base. It would then present a summary of the customer's history and the three most likely solutions to the human agent. This could substantially reduce the time it takes for an agent to resolve an issue, improving both customer satisfaction and operational efficiency.
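The knowledge-base lookup step of such an agent can be sketched in a few lines. The example below is a minimal illustration, not a proposed implementation: it ranks articles against the customer's question using plain TF-IDF cosine similarity from the Python standard library, where a production system would more likely use a search engine or embedding model. The function names (`tokenize`, `top_articles`) and the sample articles are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def top_articles(query: str, articles: dict[str, str], k: int = 3) -> list[str]:
    """Return up to k article titles ranked by TF-IDF cosine similarity."""
    docs = {title: Counter(tokenize(body)) for title, body in articles.items()}
    n = len(docs)
    # Document frequency: in how many articles each term appears.
    df = Counter(term for counts in docs.values() for term in counts)
    idf = {term: math.log((1 + n) / (1 + d)) + 1.0 for term, d in df.items()}

    def vector(counts: Counter) -> dict[str, float]:
        # Terms that never occur in the knowledge base are ignored.
        return {t: c * idf[t] for t, c in counts.items() if t in idf}

    def cosine(a: dict[str, float], b: dict[str, float]) -> float:
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    q = vector(Counter(tokenize(query)))
    scored = [(cosine(q, vector(counts)), title) for title, counts in docs.items()]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [title for score, title in scored[:k] if score > 0]


# Hypothetical knowledge-base articles (title -> body).
kb = {
    "Resetting a password": "Steps to reset a forgotten password from the login page.",
    "Refund policy": "How to issue a refund for a duplicate charge on an invoice.",
    "Shipping delays": "What to tell customers when a shipment is delayed.",
}
suggestions = top_articles("I was charged twice and need a refund", kb)
```

Because no stop words are removed, common words such as "a" still contribute small scores, so unrelated articles can appear lower in the list. This is one reason the quality issues identified in the audit (outdated articles, informal language) matter for the final application.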