Solutions

Data Categorization
Category Generator
Uniqueness Quantification
Free-format Request
Data Anonymization (Coming Soon)
PDF Extraction (Coming Soon)

Data Categorization

Description:

Categorization is inherently complex and often lacks clear, definitive answers due to its subjective nature. Our large language model-powered categorizer introduces a groundbreaking approach by pairing efficient categorization with confidence scores. These scores provide insights into where the model is highly certain and where it might struggle with nuance, helping you identify areas that may require further review. This empowers you to approach the categorization process with greater confidence and clarity, driving more reliable analysis overall.
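For readers who want a mental model of what happens behind the scenes, the following is a minimal sketch of how a single item could be categorized with a confidence score through the OpenAI API. It is illustrative only and not the add-in's actual implementation; the model name, prompt wording, and JSON output format are assumptions made for the example.

    import json
    from openai import OpenAI  # assumes the official openai Python package (v1+)

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def categorize(item: str, categories: list[str], context: str) -> dict:
        """Ask the model for a category plus a self-reported confidence score."""
        prompt = (
            f"Context: {context}\n"
            f"Categories: {', '.join(categories)}\n"
            f"Item: {item}\n"
            'Reply with JSON: {"category": "<one of the categories>", "confidence": <0-1>}'
        )
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # example model; use whichever your subscription supports
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"},
        )
        return json.loads(response.choices[0].message.content)

    print(categorize(
        "Linguine alle Vongole: Linguine, clams, garlic, olive oil, white wine",
        ["Seafood pasta", "Vegetarian pasta", "Meat pasta"],
        "Categorize pasta dishes by their main protein",
    ))

In the add-in itself, these steps happen for you; the sketch only shows why a clear context string and well-separated categories translate directly into better answers and higher confidence.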

What Use Cases to Consider:

This tool is most effective in scenarios where:

1. You have a clean, well-organized dataset that is free from inconsistencies and ready for categorization.

2. You have a clear understanding of the categories you want to use, ensuring they are distinct, mutually exclusive, and easily interpretable by the model.

3. You possess at least a basic understanding of the data itself. This tool is not intended as a primary research tool—it requires the user to validate the output to ensure it aligns with their expectations and objectives.

Beyond that, categorization tasks open up immense opportunities, including binary and multi-class categorization, hierarchical categorization, time-based grouping, and sentiment analysis.

How to Input:

Inputting data is straightforward. First, highlight all the cells of the row or column you want categorized—there is no limit, as the tool can handle hundreds of thousands of rows. The only restriction is the token limit imposed by your OpenAI API key or subscription model. Next, do the same for your categories, with a recommended maximum of 15 to 20 categories.

On the lower end, binary categorization, such as simple 'yes' or 'no', has proven effective for various use cases. Finally, use the third input box for additional context. Think of this as explaining a task to an intern—the more clear and precise the details, the better the results. While it's not necessary to overexplain, adding relevant context can significantly improve accuracy and alignment.

How to Analyze Results:

Once the categorization is complete, review the results by focusing on the confidence scores provided for each entry. Higher confidence scores indicate that the model is more certain about the categorization, while lower scores may warrant a closer inspection.

Confidence scores above 0.98 are generally considered very reliable, while anything below 0.8 should be reviewed manually. Use these scores as a guide to identify areas that might need refinement or additional input. For any categorization with a low confidence score, cross-check the context and input provided to ensure alignment with your expectations.

When discrepancies arise, consider testing with smaller subsets of data rather than re-running the entire dataset. We recommend focusing on 20 to 30 rows where the context seems off. This approach saves API tokens, reduces costs, and minimizes energy consumption at data centers, making it a more sustainable and efficient method.
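As a worked example of this review loop, the sketch below assumes the categorized output has been saved with "item", "category", and "confidence" columns (hypothetical names); it flags everything under the 0.8 threshold and pulls a subset of roughly 25 flagged rows for a re-test.

    import pandas as pd

    # Assumed layout: the categorized output sits in a file with
    # "item", "category" and "confidence" columns (names are hypothetical).
    df = pd.read_excel("categorized_output.xlsx")

    confident = df[df["confidence"] >= 0.98]   # generally safe to accept
    needs_review = df[df["confidence"] < 0.8]  # review these manually

    # Re-test a small subset (20-30 rows) instead of re-running the full dataset.
    retest_sample = needs_review.sample(n=min(25, len(needs_review)), random_state=0)
    retest_sample.to_excel("retest_subset.xlsx", index=False)

    print(f"{len(confident)} rows look solid; {len(needs_review)} of {len(df)} fall below 0.8")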

Limitations and Constraints:

  • Training Data Cutoff: The underlying model's training data does not include information beyond its cutoff date. This limits its ability to handle recent events or changes.

  • Context Sensitivity: Context is critical. If the task is incorrectly assigned or framed, confidence scores may still be high, despite solving the wrong problem. High confidence doesn’t guarantee correctness, similar to a human confidently solving the wrong task.

  • Domain-Specific Categories: Nuanced or specialized categories may require more detailed context, and there is potential for biases based on the training data used by the underlying LLM.

  • Dataset Size: The maximum size of the dataset is constrained by the token limits of your OpenAI API subscription. This may affect the ability to process larger datasets.

Best Practices:

Practical tips for optimizing results, avoiding common pitfalls, and using the tool effectively:

  • Input Context Matters: You don’t need to limit your input to just the item being categorized. Adding context or combining relevant details can enhance the model’s accuracy.

    • Example: When categorizing pasta dishes, combining the dish name and its ingredients adds valuable context. For instance, 'Linguine alle Vongole' can be input as 'Linguine alle Vongole: Linguine, clams, garlic, olive oil, white wine.' (See the sketch after this list.)

  • MECE Categories: Ensuring categories are Mutually Exclusive and Collectively Exhaustive is key. Overlapping categories introduce uncertainty for the model, often leading to lower confidence scores. Emphasizing clear distinctions between categories helps improve accuracy and confidence.

  • Test Upfront: Before committing to a full dataset, test a small subset to identify potential issues with context, input structure, or category definitions. This allows you to refine your approach without expending unnecessary resources.

  • Keep It Simple: Avoid overly complex or nuanced category definitions when possible. Simpler, well-defined categories are easier for the model to interpret and lead to higher confidence scores.
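To make the first tip concrete, here is a small sketch of combining an item column with a context column into the single enriched input string described above. The column names and sample values are assumptions for the example.

    import pandas as pd

    # Hypothetical worksheet with one column for the dish and one for its ingredients.
    df = pd.DataFrame({
        "dish": ["Linguine alle Vongole", "Penne all'Arrabbiata"],
        "ingredients": ["Linguine, clams, garlic, olive oil, white wine",
                        "Penne, tomatoes, garlic, chili, olive oil"],
    })

    # Combine both columns into the single input string the categorizer will see.
    df["categorizer_input"] = df["dish"] + ": " + df["ingredients"]
    print(df["categorizer_input"].tolist())

In Excel, the same result can be achieved by concatenating the two columns into a helper column before selecting the range.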

Category Generator

Description:

The Category Generator helps you quickly identify relevant categories for your dataset, making it easier to organize and analyze information. It provides a structured starting point while ensuring efficient token use.

What Use Cases to Consider:

This tool is useful when working with large datasets where the optimal way to segment data for analysis is unclear. It provides an initial set of categories to consider, helping you refine your approach.

How to Input:

Simply select the range of data to categorize.

Provide the Categorization Context. This is the commonality all categories should share (e.g., when generating categories for a dataset of law questions, the context would be "Law Topics").

Select the number of categories to create. You will also receive outputs with slightly fewer and slightly more categories than the number you selected, so don't worry about being too precise.
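For intuition, the sketch below shows one way such a category-label request could be phrased against the OpenAI API. It is not the tool's actual prompt or implementation; the model name and wording are assumptions.

    from openai import OpenAI  # official openai package, v1+

    client = OpenAI()

    def suggest_categories(sample_rows: list[str], context: str, n: int) -> str:
        """Illustrative category-label request; prompt and model are assumptions."""
        prompt = (
            "Here is a sample of a dataset:\n- " + "\n- ".join(sample_rows) + "\n"
            f"Suggest roughly {n} {context} that could be used to categorize the full dataset. "
            "Return one label per line, with no single category expected to dominate."
        )
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # example model choice
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    print(suggest_categories(
        ["Can my landlord raise rent mid-lease?", "How do I contest a parking ticket?"],
        "Law Topics",
        n=8,
    ))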

How to Analyze Results:

Compare the results against your use case and refine them as needed.

Adjust the number of categories or context input to better align with your dataset’s structure.

Limitations and Constraints:

The model does not determine the ideal number of categories beyond ensuring no single category dominates, and it does not enforce a predefined distribution.

This tool only generates category labels; it does not assign data to those categories. Use the Data Categorization tool for that step.

Best Practices:

Review and Adjust: Regularly assess the generated categories and make necessary adjustments to ensure they meet your specific requirements.

Uniqueness Quantification

Description:
The uniqueness quantification module is an innovative tool for analyzing text data, designed to highlight distinctive ideas that may be underrepresented or unconventional within a dataset, uncovering hidden patterns and novel insights.

What Use Cases to Consider:

This module has shown success in two main areas: analyzing user input, such as comments or text responses, to identify unique or minority opinions that may otherwise be overlooked, and exploring datasets to understand how large language models differentiate between various items. These use cases help bring valuable perspectives to the surface that traditional methods might miss.

How to Input:

Simply highlight and select a range of text to begin the analysis.

How to Analyze Results:

The module outputs a normalized range from 0 to 1, with 0 being the least unique and 1 being the most unique. These values are relative to the dataset and help quickly identify standout responses for deeper insights.
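The module's exact scoring method is not documented here, but as a rough mental model, a dataset-relative uniqueness score can be built from text embeddings: the further a response sits, on average, from the rest of the dataset, the higher its score. The sketch below assumes the OpenAI embeddings endpoint and simple min-max normalization; it is an illustration, not the module's implementation.

    import numpy as np
    from openai import OpenAI  # official openai package, v1+

    client = OpenAI()

    def uniqueness_scores(texts: list[str]) -> np.ndarray:
        """Illustrative dataset-relative uniqueness: mean cosine distance, scaled to [0, 1]."""
        result = client.embeddings.create(model="text-embedding-3-small", input=texts)
        vectors = np.array([item.embedding for item in result.data])
        vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

        similarity = vectors @ vectors.T                    # pairwise cosine similarity
        np.fill_diagonal(similarity, np.nan)
        mean_distance = 1 - np.nanmean(similarity, axis=1)  # average distance to the rest

        # Normalize relative to this dataset only, matching the 0-to-1 output range.
        return (mean_distance - mean_distance.min()) / (mean_distance.max() - mean_distance.min())

Because the final step rescales against the minimum and maximum within the dataset, the resulting scores are only meaningful relative to that dataset, which is exactly the limitation described below.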

Limitations and Constraints:

The uniqueness score is a relative measure, applicable only within the specific dataset being analyzed. Comparisons across datasets or even different iterations of the same dataset with slight alterations are not valid, as the scale recalibrates to the composition of each dataset.

Best Practices:

Uniqueness alone may not provide actionable insights. Apply additional filters to refine your analysis:

  • Obscenity and Irregularity Filters: Exclude responses with excessive capitalization, swear words, or erratic behavior, as these tend to skew uniqueness results.

  • Length Filters: Focus on entries with meaningful length to avoid overly short or overly verbose responses, which can distort the significance of the analysis.

Free-format Request

Description:
The free-format request brings all the utility of LLMs, as you would use them in their native browser UI, directly into your Excel document. This empowers users by enabling seamless integration of LLM capabilities into Excel, allowing for efficient and flexible data manipulation and analysis.

What Use Cases to Consider:

This tool is most effective in scenarios where:

Content Generation - This use case leverages LLMs for tasks like creating summaries, titles, or other forms of text generation based on structured inputs (e.g., "Write a title for this post based on a piece of summary text").

Research - The tool can efficiently build datasets using information broadly available online prior to the selected model’s training cutoff (e.g., "With a list of all US senators, return each senator's birthplace in the form of 'City, State'").

Translation - This application allows users to perform translations of selected data directly within Excel, facilitating multilingual applications and analysis.

How to Input:

Inputting data is straightforward. First, highlight and select a range of text within your spreadsheet to define the scope of analysis. Ensure that the selected data aligns with the task requirements. Next, write the prompt with the output requirements directly within the input interface. Adding clarity and detail to your input enhances the accuracy and relevance of the results.

Remember, the prompt is processed row by row, so write it in the singular form for clarity. For example, use "Return the US Senator's birth city" instead of "Return the birth cities of all US senators."
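Conceptually, the request behaves like the loop sketched below: the same singular-form prompt is applied to each selected row in turn. The code is illustrative only; the model name, prompt, and example rows are assumptions.

    from openai import OpenAI  # official openai package, v1+

    client = OpenAI()

    rows = ["Margaret Chase Smith", "Daniel Inouye"]  # example selected range
    prompt = "Return the US Senator's birth city in the form 'City, State'."

    outputs = []
    for row in rows:
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # example model
            messages=[{"role": "user", "content": f"{prompt}\nInput: {row}"}],
        )
        outputs.append(response.choices[0].message.content.strip())

    print(outputs)  # one output per input row, aligned with the selected range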

How to Analyze Results:

Given the near-infinite variety of possible outputs, there is no single correct way to analyze the results. Some general tips: sort by length to identify any outputs where the LLM deviated from the output instructions. For data correction, it generally makes sense to randomly spot-check 2-10% of outputs to ensure correctness, depending on the complexity of the request and the amount of information available on the request subject.
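For instance, a quick length sort and random spot-check could look like the pandas sketch below; the file and column names are assumptions.

    import pandas as pd

    df = pd.read_excel("free_format_output.xlsx")  # assumed "input" and "output" columns

    # Sort by output length to surface rows where the model ignored the format instructions.
    df["output_length"] = df["output"].str.len()
    print(df.sort_values("output_length").head(10))

    # Randomly spot-check ~5% of rows (adjust between 2% and 10% based on task complexity).
    spot_check = df.sample(frac=0.05, random_state=0)
    spot_check.to_excel("spot_check.xlsx", index=False)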

Limitations and Constraints:

  • Training Data Cutoff: Models are limited to information based on their training data and may not account for recent events or developments.

  • Context Sensitivity: Clear and accurate input context is critical to achieving reliable outputs. Misaligned instructions can lead to incorrect results, even with high confidence scores.

  • Domain-Specific Applications: Specialized or nuanced tasks may require additional context or custom prompts to ensure accuracy, as biases may arise from the training data.

Best Practices:

  • Select an Appropriate Model: Use the minimum viable model that meets your needs to minimize costs and maximize efficiency.

  • Explicit Prompts: Be clear and detailed in your prompt instructions, specifying the desired format and structure of the output.

  • Test First: Validate your prompt and model choice by testing on 5-10 rows of data before scaling up to the full dataset.

  • Iterative Refinement: Use feedback from initial outputs to refine prompts and improve overall results.

Data Anonymization (Coming Soon)

Description:
This module is designed to anonymize personal data while preserving the structure and analytical integrity of the dataset. By leveraging AI, it ensures that anonymized data remains representative of the original population. This approach allows organizations to gain meaningful insights without compromising individual privacy or confidentiality.

What Use Cases to Consider:

The tool is particularly useful in scenarios where personal data needs to be shared across divisions or organizations for analysis while adhering to privacy regulations. It safeguards sensitive information, enabling secure data sharing without revealing identifiable details.

How to Input:

Select the range of data you want to anonymize (e.g., A1:G20) and press "Select Range."

Choose the level of location data to anonymize. Any data at or below the selected level will be anonymized:

  • Street Address

  • Zip Code

  • City

  • Sub-national Division

  • Country

For example, selecting "City" will transform an address like 4 Ash St, Pittsford, NY 14534, USA into 89 Birch St, Rochester, NY 16325, USA.

Optionally, specify columns to exclude from anonymization. For instance, if the customer ID column is omitted, it will remain untouched.
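To illustrate the "at or below the selected level" rule, the sketch below uses a purely hypothetical helper: every field at the selected level and all finer-grained fields are replaced, while coarser fields are left untouched. Field names and replacement values are invented for the example and do not reflect the module's actual logic.

    # Hypothetical illustration of the level rule; not the module's implementation.
    LEVELS = ["street_address", "zip_code", "city", "subnational_division", "country"]

    def fields_to_anonymize(selected_level: str) -> list[str]:
        """Everything at or below (finer than) the selected level gets replaced."""
        return LEVELS[: LEVELS.index(selected_level) + 1]

    record = {"street_address": "4 Ash St", "zip_code": "14534", "city": "Pittsford",
              "subnational_division": "NY", "country": "USA"}

    replacements = {"street_address": "89 Birch St", "zip_code": "16325", "city": "Rochester"}

    for field in fields_to_anonymize("city"):
        record[field] = replacements.get(field, record[field])

    print(record)  # street, zip and city are swapped; state and country are preserved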

How to Analyze Results:

Although this module does not directly provide analytical insights, it supports the establishment of secure workflows for data analysis. Your team can use anonymized data to build dashboards or analytical models that can later sync with the original dataset. This approach ensures privacy while enabling effective data-driven decisions. Additionally, general trends such as customer distribution across metro areas will remain accurate even if specific location details are altered.

Limitations and Constraints:

The module can only anonymize data types included in its database. If your dataset contains unsupported elements, anonymization may not be possible. If you encounter this issue, please contact us at contact@noiric.com to suggest product enhancements.

To process the data, Noiric sends the information to OpenAI. Please review your organization’s data policy and ensure it complies with this practice before starting. More details can be found in our FAQ under security settings.

Best Practices:

It is advisable to anonymize data beyond the required level to minimize risks. If necessary, specific details can be reintroduced later. This proactive approach ensures sensitive information is not unintentionally exposed.

PDF Extraction (Coming Soon)

Description:

The PDF Extraction tool allows you to extract both document-level and line-item data from large sets of PDF files directly into Excel. With this tool, you can select and customize the fields you want to extract without worrying about the structure of the underlying PDF. Each extracted field is accompanied by Noiric’s proprietary confidence score, ensuring data reliability.

What Use Cases to Consider:

This tool is ideal for extracting data from PDFs in situations such as processing invoices, contracts, reports, or any other document where structured data needs to be pulled into Excel for further analysis.

How to Input:

To get started, ensure all the PDFs you want to process are stored in one folder. In the Noiric add-in, select this folder to extract data. Next, enter the desired fields for extraction:

Document-level fields (e.g., Invoice Number, Rental Agent): Extracted once per document.

Line-item fields (e.g., Product Purchased, Quantity): Extracted for repeated items within documents.

You can also choose whether to include confidence scores. By default, these scores are enabled, but for clean and structured documents, you may disable them to simplify the output table.

Additionally, during the extraction process, you can categorize PDFs (e.g., tagging invoices as “Hardware” or “Software” purchases) using the tool’s categorization options for better organization and analysis.
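As an illustration of how the extracted table might be laid out (the exact column layout may differ), document-level fields repeat on every row while line-item fields vary per row, and each field can carry its own confidence column. The field names and values below are hypothetical.

    # Hypothetical rows illustrating the output shape: document-level fields repeat
    # per line item, and each field can carry its own confidence column.
    extracted_rows = [
        {"invoice_number": "INV-1042", "invoice_number_confidence": 0.99,
         "product_purchased": "Laptop stand", "quantity": 2, "quantity_confidence": 0.97,
         "pdf_category": "Hardware"},
        {"invoice_number": "INV-1042", "invoice_number_confidence": 0.99,
         "product_purchased": "Office suite license", "quantity": 5, "quantity_confidence": 0.88,
         "pdf_category": "Software"},
    ]

    # Rows with any confidence below 0.9 are worth checking against the source PDF.
    flagged = [row for row in extracted_rows
               if any(v < 0.9 for k, v in row.items() if k.endswith("_confidence"))]
    print(flagged)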

How to Analyze Results:

Each field includes a confidence score. For fields with scores below 0.9, it’s recommended to cross-check the extracted data against the original PDF to ensure accuracy. Low confidence scores are often due to issues such as poor document scans or typos in the original file.

Limitations and Constraints:

  • The tool cannot extract data from illegible handwriting or severely degraded documents.

  • Individual PDFs over 15 pages are not supported.

  • The tool is currently limited to PDFs only, though additional file types may be supported in the future.

Best Practices:

Test your extraction fields with a small subset of data to ensure they align with the information in the PDFs. Keep in mind that the tool relies on the provided prompts and cannot infer or create information not present in the document. Additionally, consider combining this tool with other Noiric tools to further organize, analyze, or refine your data after extraction.