AI-Assisted Quantitative Analysis
Capabilities, Limitations, and Responsible Practice
Level‑Setting: Key Terms
- Artificial Intelligence (AI): any system that helps make decisions or complete tasks using explicit logic or learned patterns.
- That can include simple rules, checklists, and even flowcharts.
- Machine learning (ML): a subset of AI where the system learns patterns from data rather than being fully hand‑coded.
- Generative AI (GenAI): ML systems that generate new outputs (text, code, images, etc.).
- Large language models (LLMs): GenAI models trained on large text corpora; “chat” is a product/interface on top of an LLM.
Today we’ll mostly discuss GenAI/LLMs because they’re the current frontier, but keep in mind that AI spans both well-established and emerging capabilities.
What AI Actually Adds
AI is a tool for augmentation, not replacement
- Your expertise in research design, domain knowledge, and critical interpretation remains essential.
- AI accelerates specific tasks; it does not substitute for thinking.
Concerns around AI are legitimate
- Accuracy, reproducibility, and methodological rigor are valid worries.
- Verification and validation are key to responsible use.
How we’ll approach today’s session
- We’ll demonstrate capabilities, but not promise that AI is a silver bullet.
- Leave with practical frameworks for deciding when and how to use these tools.
Examples of AI-Assisted Analysis
AI Coding Assistant Archetypes
Sheets/Excel with =AI() Functions
AI functions embedded directly in spreadsheet cells: natural language prompts that return structured outputs alongside your data.
=AI("Categorize this survey response as
positive, negative, or neutral.
Return one of: POS, NEU, NEG.", A2)
What this enables
- Text classification at scale
- Entity extraction from unstructured fields
- Quick summaries and normalization
- Lightweight EDA without coding
What to watch for
- Non‑deterministic: the same input may yield different outputs later
- Data exposure: cell contents may be sent to third‑party services
- Auditability: transformations can be hard to reconstruct
- Cost at scale: per‑cell calls add up quickly
Workflow Scaffolding
AI helps you construct an analysis pipeline while you keep control over methodology.
- Describe your goal in plain language
- AI generates initial code structure
- Review and modify critical sections
- Iterate and refine with targeted prompts
What AI handles well
- Boilerplate (imports, loading, plotting); see the sketch below
- Inspecting the dataset and writing appropriate loading code
- Debugging error messages
- Quick refactors

What still needs your judgment
- Method appropriateness and assumptions
- Package version changes
  - e.g. using deprecated functions or passing incorrect arguments
- Interpretation and causal language
  - e.g. stating conclusions without being able to reason about the results
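To make this concrete, here is a minimal sketch of the kind of boilerplate an assistant typically scaffolds. The file name and the score column are hypothetical placeholders; the point is that you review the structure before trusting it.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file and column names; adapt to your dataset.
df = pd.read_csv("responses.csv")

# Inspect before analyzing: dimensions, types, missingness.
print(df.shape)
print(df.dtypes)
print(df.isna().sum())

# Quick distributional check on a numeric column.
df["score"].plot.hist(bins=30)
plt.xlabel("score")
plt.ylabel("count")
plt.show()
```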
Data Cleaning
Visual Studio Code’s Data Wrangler extension provides an intuitive, spreadsheet-like interface for exploring, cleaning, and transforming data directly within VS Code. It allows you to:
- Explore: Open `.csv` and `.tsv` files, visualize distributions, and filter/sort values easily.
- Clean: Perform common preprocessing tasks (remove duplicates, fill missing values, type conversion) with a click.
- Transform: Generate code for all actions (as clean, editable Python with `pandas`), ensuring changes are fully reproducible.
- Undo/Redo: Step through data cleaning operations, refining each step without risk.
Copilot Integration
With the GitHub Copilot plugin enabled in VS Code, you can:
- Generate code for custom operations: Ask Copilot for complex cleaning or transformation scripts tailored to your dataset (e.g., “Impute missing values using median for column ‘income’”; a sketch follows this list).
- Explain transformations: Prompt Copilot to clarify what a piece of code does or to comment each operation as you go.
- Automate documentation: Use Copilot to summarize your cleaning workflow or draft markdown for your analysis notebook, increasing transparency and reproducibility.
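As an illustration, here is the kind of cleaning script Copilot might produce for the ‘income’ prompt above. The file and column names are hypothetical placeholders; review each step before running it on real data.

```python
import pandas as pd

# Hypothetical file and column names, for illustration only.
df = pd.read_csv("survey.csv")

# Remove exact duplicate rows.
df = df.drop_duplicates()

# Impute missing 'income' values with the column median.
df["income"] = df["income"].fillna(df["income"].median())

# Coerce a date column to datetime; unparseable values become NaT.
df["response_date"] = pd.to_datetime(df["response_date"], errors="coerce")

df.to_csv("survey_clean.csv", index=False)
```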
Paths to Failure
- Statistical hallucination
  - Impossible statistics (percentages > 100%, means outside range).
  - LLMs report what sounds right, not necessarily what is true.
  - Use LLMs to generate code, not to give answers directly (a plausibility-check sketch follows this section).
- Incorrect assumptions
  - Applies methods without checking prerequisites.
  - Agent-based tools tend to make assumptions without verifying them.
  - Ask the LLM to state its assumptions before it takes action.
- Poor reasoning
  - LLMs struggle with reasoning, but can be dishonest about their own limitations.
  - e.g. “A linear regression model is the best fit for this data” when it is not.
  - LLMs can be helpful for working with data, but they are not a substitute for critical thinking.
Chat-based AI can incorporate external knowledge beyond your dataset. This can introduce errors or unintended data mixing.
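One way to apply the “generate code, not answers” advice is to have the model write checking code that you run yourself. A minimal sketch, assuming a pandas DataFrame with hypothetical file and column names:

```python
import pandas as pd

df = pd.read_csv("results.csv")  # hypothetical file and columns

# Percentages must lie in [0, 100].
assert df["pct_agree"].between(0, 100).all(), "Impossible percentage values"

# Check an AI-reported statistic against the data itself.
ai_reported_mean = 52.3  # whatever the chat answer claimed
actual_mean = df["score"].mean()
assert abs(actual_mean - ai_reported_mean) < 0.05, (
    f"AI claimed {ai_reported_mean}, data says {actual_mean:.2f}"
)
```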
A Verification Framework
- Randomly select a sample of rows and manually verify AI outputs (see the sampling sketch below).
- Ask the AI to state its assumptions before generating code or analysis.
- Confirm means, ranges, and totals are plausible; watch for impossible values.
- Ask the AI to critique its own output or argue the opposite case.
- Save conversation logs and exact prompts used
- Export generated code separately (don’t rely on regeneration)
- Document which outputs were AI-assisted
- Version control scripts and notebooks
- Re-run analysis on held-out data or fresh subsets
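A minimal sketch of the random spot-check, assuming AI-generated labels sit alongside the inputs in hypothetical columns:

```python
import pandas as pd

df = pd.read_csv("labeled_responses.csv")  # hypothetical file

# Fixed seed so the audit sample itself is reproducible.
sample = df.sample(n=20, random_state=42)

# Export for manual review: compare 'ai_label' against your own judgment.
sample[["response_text", "ai_label"]].to_csv("spot_check.csv", index=False)
```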
The same prompt can yield different outputs tomorrow!
- Save the actual output you used, not just the prompt
- Where possible, set temperature to 0 (reduces but doesn’t eliminate variation)
  - temperature is a parameter that controls how randomly the model samples its output; see the sketch below
- Final analysis should run from your verified, exported code — not live AI calls
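For API-based workflows, temperature can be pinned explicitly. A sketch using the OpenAI Python SDK (v1+); the model name here is an assumption, and other providers expose a similar parameter:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# temperature=0 makes sampling (near-)greedy: it reduces, but does not
# eliminate, run-to-run variation. Save the response you actually used.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model name; substitute your own
    temperature=0,
    messages=[{
        "role": "user",
        "content": "Classify as POS, NEU, or NEG: 'The workshop was excellent.'",
    }],
)
print(response.choices[0].message.content)
```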
UofT Guidance
- You own decisions and outcomes (AI is assistive, not a decider).
- Use the minimum University data required for the task.
- Don’t provide University data to third‑party AI vendors without appropriate authorization.
- Cross‑check facts, calculations, and recommendations; watch for bias.
- DO use AI for drafts, summaries, and routine text (when appropriate).
- DO get consent if using AI to record/summarize meetings.
- DON’T use AI for high‑stakes decision‑making.
- DON’T treat AI output as a source of truth.
- Does REB approval cover AI use and where data flows?
- Remember data residency / third‑party restrictions still apply.
- Document what tool, what prompts, what was AI‑assisted, and what you verified.
- Prefer synthetic/anonymized data for code development; real data only in approved environments.
In some contexts, disclosures are legally required (e.g., Ontario job posting disclosure rules effective January 1, 2026 when AI is used to screen/assess/select applicants).
Citing AI Use
- Any content produced by generative AI
- Functional use of AI tools (editing, translating, code generation, etc.)
- When in doubt, disclose
- APA 7: How to cite ChatGPT — treats AI as software/algorithm
- MLA: How do I cite generative AI?
- Chicago: Q&A on AI-generated content
Rules are evolving. When no official guidance exists, treat AI as software output.
- UofT’s Academic Integrity policy applies
- Graduate students: see SGS guidance on AI in theses
“We used Claude (Anthropic, 2025) to generate initial Python code for data cleaning, which we reviewed, modified, and validated against manual spot-checks. Prompts and final code are available in the supplementary materials.”
UofT Approved Tools
- Microsoft Copilot when signed in through UofT is recommended and described as meeting UofT privacy/security standards for up to Level 3 data.
- ChatGPT Edu provides enhanced protections; free ChatGPT is not protected mode.
- Claude for Education in pilot availability.
Some database‑embedded assistants can provide cited answers within their corpus.
- Assume prompts and data may be retained and/or used beyond your intent.
- Don’t paste University data unless the tool is explicitly approved for that data level.
- As standards evolve, keep an eye on ai.utoronto.ca.
- UofT Libraries offers AI research guides for discipline-specific resources.
UofT Data Levels Refresher
Level 1
- Published research, public websites, course catalogues
- Generally OK in most tools; still validate outputs
- e.g. ChatGPT free, Claude personal, etc.

Level 2
- Non‑public, de‑identified data, most unpublished research, most course materials
- Prefer UofT‑approved/protected tools; avoid consumer accounts
- e.g. ChatGPT Edu, Microsoft Copilot, etc.

Level 3
- FIPPA personal info, contracts, security logs, detailed facilities plans
- Only in tools explicitly approved for L3 (and in protected mode); otherwise don’t
- e.g. Microsoft Copilot (protected), Claude for Education

Level 4
- PHI/PHIPA, SIN/passport, banking/PCI, credentials, high‑risk investigative files
- Never paste; use approved secure environments only
- No tools are approved for Level 4.
UofT Approved Tools: Quick Profiles (1/3)
Microsoft Copilot (protected mode)
Enterprise-grade chatbot with web search capabilities and context-aware responses.
Key features:
- web-connected answers with footnotes
- file upload
- chat data not used for model training
- does not access your Outlook/Teams/SharePoint content
Who: faculty, staff, and students with UofT M365 access.
How: sign in with your UofT M365 account and confirm the enterprise shield (protected mode).
Microsoft 365 Copilot
- Virtual assistant that generates contextual content based on documents in OneDrive and SharePoint.
- Key point: it can access confidential content in your M365 environment.
- Requires a confidentiality agreement and consent when collaborating.
- Who: UofT staff (department-funded).
UofT Approved Tools: Quick Profiles (2/3)
Microsoft Teams Premium
- AI meeting assistant for Microsoft Teams meetings.
- Key point: requires consent from participants; don’t share AI outputs beyond attendees unless everyone authorizes.
- Who: UofT staff (department-funded).
Note: if you already have M365 Copilot, you may not need Teams Premium (overlapping functionality).
ChatGPT Edu
- Privacy-enhanced version of OpenAI’s ChatGPT.
- Key point:
- content/files uploaded are not used to train OpenAI models
- sharing GPTs outside UofT workspace is restricted
- third-party GPTs restricted
- Who: UofT departments, faculty and staff only (not students).
UofT Approved Tools: Quick Profiles (3/3)
Claude for Education
- AI assistant for writing, research, learning and operations support.
- Not included: Claude Code; Claude API access.
- Who: faculty, staff, and sponsored students on approved pilot teams.
- How: request via Service Catalogue / IT Service Centre; pilot (500 licenses) runs Sep 1, 2025 to Jun 2026 (rolling allocation).
- Data: approved for Level 3 by ITS Information Security.
Safe Starting Points
Safer uses
- Code syntax and debugging
- Data visualization
- Documentation and formatting

Use with caution
- Method selection
- Interpretation
- Real participant data
Use AI where you can verify the output. Be skeptical where verification is hard.