Using Python to Detect Proper Nouns in a List of Words.

I was doing keyword research for a client and needed to categorize all the keywords. Google Gemini helped me create scripts that go through a list of 2,500 keywords and flag each one that contains a person’s name. Here’s a summary of the process we went through to get the Name Detector script working, along with explanations to help with future projects:

Problem:

I wanted to use Python to detect names in a list of words and update a CSV file accordingly.

Steps:

Environment Setup:
- Virtual Environment (Key Recommendation): We created a virtual environment named ‘spacy_env’ to isolate the project’s dependencies and prevent conflicts with the system’s Python or other packages. This is done using:
```
python3 -m venv spacy_env
source spacy_env/bin/activate
```
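- Homebrew Considerations: Homebrew protects its own Python installation and discourages installing packages into it directly with pip (recent versions refuse with an ‘externally-managed-environment’ error). Working inside the virtual environment above avoids this.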
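
One quick way to confirm that a script is really running inside the virtual environment, and not Homebrew’s Python, is to compare sys.prefix with sys.base_prefix. This is a minimal sketch; the function name in_virtualenv is made up for illustration.

```python
import sys

def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base installation.
    return sys.prefix != sys.base_prefix

print("Interpreter:", sys.executable)
print("Inside a virtual environment:", in_virtualenv())
```

If the second line prints False, the environment was probably not activated in the current terminal session.
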

Installing Dependencies:
- spaCy: We installed the core spaCy library for natural language processing:
```
python3 -m pip install spacy
```
- Language Model: We downloaded a suitable language model (en_core_web_sm) for English named entity recognition:
```
python3 -m spacy download en_core_web_sm
```
- pandas: We installed pandas for working with the CSV file:
```
python3 -m pip install pandas
```
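
Before running the script, it can help to check that all three installs landed in the active environment. The sketch below uses only the standard library; missing_packages is a hypothetical helper name, and it assumes the language model was installed as an importable package called en_core_web_sm, which is how the spacy download command normally installs it.

```python
from importlib.util import find_spec

def missing_packages(names):
    """Return the names from `names` that cannot be imported."""
    return [name for name in names if find_spec(name) is None]

required = ["spacy", "pandas", "en_core_web_sm"]
missing = missing_packages(required)
if missing:
    print("Not installed in this environment:", ", ".join(missing))
else:
    print("All dependencies are available.")
```
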

Coding the Script:
- The script is saved as ‘name_detector.py’, with minor adjustments where needed for file paths or error handling. It adds a ‘Person’ column to the CSV and marks each keyword in which spaCy finds a PERSON entity.

```
import spacy
import pandas as pd

# Load the spaCy model for Named Entity Recognition
nlp = spacy.load("en_core_web_sm")

# File path
csv_path = "/Users/bhafner/Library/CloudStorage/OneDrive-brianhafner.com/Brian Hafner Tech-OneDrive/Clients - BHT/American Promise/Keyword Research/kw_person.csv"

# Read the CSV file
df = pd.read_csv(csv_path)

# Function to label names
def label_person(keyword):
    doc = nlp(keyword)
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            return "Person"
    return ""  # Not a person name

# Apply the function to the 'Keyword' column
df['Person'] = df['Keyword'].apply(label_person)

# Save the modified CSV (this overwrites the original file)
df.to_csv(csv_path, index=False)
```
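
Calling nlp() once per row is fine for 2,500 keywords, but spaCy also provides nlp.pipe, which runs texts through the pipeline in batches. The sketch below is one way to use it; label_people and its batch_size default are illustrative choices, not part of the original script. The pipeline is passed in as a parameter so the function can be reused with any loaded model.

```python
def label_people(keywords, nlp, batch_size=256):
    """Return "Person" or "" for each keyword, processing texts in batches.

    `nlp` is a loaded spaCy pipeline, e.g. spacy.load("en_core_web_sm").
    """
    labels = []
    # nlp.pipe streams the texts through the pipeline in batches
    # and yields one Doc per input text, in the same order.
    for doc in nlp.pipe(keywords, batch_size=batch_size):
        is_person = any(ent.label_ == "PERSON" for ent in doc.ents)
        labels.append("Person" if is_person else "")
    return labels

# Usage with the DataFrame from the script above (not run here):
# df["Person"] = label_people(df["Keyword"].astype(str), nlp)
```
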

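A note on running this as a notebook cell: the following message appears when the selected Python environment does not have ipykernel installed.

Running cells with 'Python 3.12.3' requires the ipykernel package.

Run the following command to install 'ipykernel' into the Python environment.

Command: '/opt/homebrew/bin/python3 -m pip install ipykernel -U --user --force-reinstall'

The suggested command targets Homebrew’s Python (/opt/homebrew/bin/python3), not the virtual environment. Selecting the ‘spacy_env’ interpreter and installing ipykernel there keeps all the dependencies in one place.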

Troubleshooting Type Errors
- CSV Data: Every value in the ‘Keyword’ column has to be a text string. pandas reads purely numeric cells as numbers and empty cells as NaN, and spaCy raises an error when it is handed either one, so those values need to be converted to strings or filtered out first.
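
A small conversion helper is one way to handle this before the keywords reach spaCy. This is a sketch, and to_text is a hypothetical name; it turns numbers into strings and maps empty cells (which pandas represents as float NaN) to an empty string.

```python
import math

def to_text(value):
    """Convert a CSV cell to a string; empty cells become ""."""
    if value is None:
        return ""
    # pandas represents empty cells as float NaN
    if isinstance(value, float) and math.isnan(value):
        return ""
    return str(value)

# Usage with the DataFrame from the script (not run here):
# df["Keyword"] = df["Keyword"].apply(to_text)

print([to_text(v) for v in ["john adams", 2024, float("nan"), None]])
# prints ['john adams', '2024', '', '']
```
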

Key Takeaways

- Virtual Environments: Always use virtual environments to cleanly manage project dependencies.
- Language Models: spaCy relies on language models to perform its analysis. Make sure you download the appropriate model for your language.
- Data Types: Be aware of data types in your files (like CSVs) and ensure they match the requirements of the libraries you’re using.

Additional Notes

- pipx: We discussed pipx as an alternative for installing isolated Python applications.
- Python Version: Be mindful of potential conflicts if you’re working with multiple Python versions on your system.

Feel free to reach out if you have questions or want to explore customizing the name detection further!

Further Scripts

Location Detection

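I also created a script to detect a location in each keyword. It works the same way as the name detector, but looks for spaCy’s GPE label (countries, cities, and states) and stores the first match in a ‘Location’ column.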

```
import spacy
import pandas as pd

# Load spaCy model
nlp = spacy.load("en_core_web_sm")

# File path
csv_path = "/Users/bhafner/Library/CloudStorage/OneDrive-brianhafner.com/Brian Hafner Tech-OneDrive/Clients - BHT/American Promise/Keyword Research/kw_person.csv"

# Read the CSV file
df = pd.read_csv(csv_path)

def identify_location(keyword):
    doc = nlp(keyword)
    for ent in doc.ents:
        if ent.label_ == "GPE":
            return ent.text  # First location found
    return ""  # No location found

# Apply the function to the 'Keyword' column
df['Location'] = df['Keyword'].apply(identify_location)

# Save the modified CSV (this overwrites the original file)
df.to_csv(csv_path, index=False)
```
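
Because identify_location returns as soon as it finds one GPE entity, a keyword that names two places keeps only the first. A possible variation, sketched below, collects every match and also accepts spaCy’s LOC label (non-political locations such as mountain ranges and bodies of water). The function name find_locations and the default label tuple are assumptions for illustration.

```python
def find_locations(doc, labels=("GPE", "LOC")):
    """Return the text of every entity in `doc` whose label is in `labels`."""
    return [ent.text for ent in doc.ents if ent.label_ in labels]

# Usage with the pipeline from the script above (not run here):
# df["Location"] = df["Keyword"].apply(
#     lambda kw: "; ".join(find_locations(nlp(kw)))
# )
```

Joining the matches with a separator keeps the result in a single CSV column.
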

Phrase Count

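This script is meant to find repeating words and phrases in the keywords and count their occurrences. As written, it removes stop words from each keyword, counts the remaining tokens within that one keyword, and records the most frequent token and its count.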

```
import spacy
import pandas as pd
from collections import Counter

# Load spaCy model
nlp = spacy.load("en_core_web_sm")

# File path
csv_path = "/Users/bhafner/Library/CloudStorage/OneDrive-brianhafner.com/Brian Hafner Tech-OneDrive/Clients - BHT/American Promise/Keyword Research/kw_person.csv"

# Read the CSV file
df = pd.read_csv(csv_path)

def find_common_phrases(text, n=5):
    """Finds the top 'n' most common words in a single keyword."""
    doc = nlp(text)
    # Filter out stop words
    words = [token.text for token in doc if not token.is_stop]
    word_counts = Counter(words)
    return word_counts.most_common(n)

def process_keyword(keyword):
    phrases = find_common_phrases(keyword)
    return phrases[0] if phrases else ('', 0)  # Get the top (word, count) pair

# Apply the function to the DataFrame
df[['common_phrases', 'phrase_count']] = df['Keyword'].apply(process_keyword).to_list()

# Save the modified CSV (this overwrites the original file)
df.to_csv(csv_path, index=False)
```
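
The script above counts single words inside one keyword at a time. To find phrases that repeat across the whole keyword list, one option is to count n-grams (runs of n consecutive words) over all keywords together. The sketch below uses plain whitespace splitting instead of spaCy; count_ngrams is a hypothetical helper and the sample keywords are made up for illustration.

```python
from collections import Counter

def count_ngrams(keywords, n=2):
    """Count every run of `n` consecutive words across all keywords."""
    counts = Counter()
    for keyword in keywords:
        words = str(keyword).lower().split()
        # Slide a window of n words over the keyword
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts

# Made-up sample keywords, for illustration only
sample = [
    "campaign finance reform",
    "campaign finance laws",
    "finance reform bill",
]
print(count_ngrams(sample, n=2).most_common(2))

# Usage with the DataFrame from the script above (not run here):
# top_phrases = count_ngrams(df["Keyword"], n=2).most_common(20)
```

In the sample, "campaign finance" and "finance reform" each appear in two keywords, so they come out ahead of the one-off pairs.
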

Brian Hafner Analytics

Digital Marketing Analytics
