SEO Strategies: Using Python to Detect Names in a List of Words

 


Using Python to Detect Proper Nouns in a List of Words.


I was doing keyword research for a client and needed to categorize all the keywords. Google Gemini helped me create scripts that go through a list of 2,500 keywords and flag each one that contains a person’s name. Here’s a summary of the process we went through to get the Name Detector script working, along with explanations to help with future projects:

Problem:

I wanted to use Python to detect names in a list of words and update a CSV file accordingly.

Steps:

  1. Environment Setup:
    • Virtual Environment (Key Recommendation): We created a virtual environment named ‘spacy_env’ to isolate the project’s dependencies and prevent conflicts with the system Python or other packages. This is done using:
      python3 -m venv spacy_env
      source spacy_env/bin/activate
      
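    • Homebrew Considerations: We worked around Homebrew’s protection of its own Python environment. Homebrew marks its Python as externally managed, so pip declines to install packages into it directly; installing inside the virtual environment avoids that restriction.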
  2. Installing Dependencies:
    • spaCy: We installed the core spaCy library for natural language processing:
      python3 -m pip install spacy
      
    • Language Model: We downloaded a suitable language model (en_core_web_sm) for English named entity recognition:
      python3 -m spacy download en_core_web_sm
      
    • pandas: We installed pandas for working with the CSV file:
      python3 -m pip install pandas
      
  3. Coding the Script:
    • The script is saved as ‘name_detector.py’, with minor adjustments where needed for file paths or error handling.
import spacy
import pandas as pd

# Load the spaCy model for Named Entity Recognition
nlp = spacy.load("en_core_web_sm") 

# File path
csv_path = "/Users/bhafner/Library/CloudStorage/OneDrive-brianhafner.com/Brian Hafner Tech-OneDrive/Clients - BHT/American Promise/Keyword Research/kw_person.csv"

# Read the CSV file
df = pd.read_csv(csv_path)

# Function to label names
def label_person(keyword):
    doc = nlp(keyword)
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            return "Person"
    return ""  # Not a person name

# Apply the function to the 'Keyword' column
df['Person'] = df['Keyword'].apply(label_person)

# Save the modified CSV
df.to_csv(csv_path, index=False)
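
For a list of a few thousand keywords, calling nlp(keyword) once per row works, but spaCy also provides nlp.pipe, which runs texts through the pipeline in batches. Below is a minimal sketch of the same PERSON check written that way. The helper name label_people and the batch size of 256 are illustrative choices, not part of the original script, and the loaded pipeline is passed in as an argument so the helper does not depend on how the model is loaded.

```python
from typing import Iterable, List


def label_people(keywords: Iterable[str], nlp, batch_size: int = 256) -> List[str]:
    """Return "Person" for each keyword containing a PERSON entity, else ""."""
    labels = []
    # nlp.pipe streams the texts through the pipeline in batches
    for doc in nlp.pipe(keywords, batch_size=batch_size):
        has_person = any(ent.label_ == "PERSON" for ent in doc.ents)
        labels.append("Person" if has_person else "")
    return labels


# Usage with the objects from the script above (not executed here):
# nlp = spacy.load("en_core_web_sm")
# df["Person"] = label_people(df["Keyword"].astype(str), nlp)
```

The result is a plain list in the same order as the input, so it can be assigned straight to a DataFrame column.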
  4. Troubleshooting Type Errors:
    • CSV Data: We ensured that all values in the ‘Keyword’ column of your CSV were actually text strings. Numbers needed to be either converted to strings or filtered out for spaCy to process them correctly.
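
One way to do that cleanup in pandas before calling spaCy is sketched below. The helper name clean_keywords is hypothetical, and the choice to convert numbers to text (rather than filter them out) and to drop empty cells is an assumption about what suits the keyword list.

```python
import pandas as pd


def clean_keywords(df: pd.DataFrame, column: str = "Keyword") -> pd.DataFrame:
    """Return a copy whose `column` holds only non-empty text strings."""
    # Empty CSV cells are read as NaN, which is not a string, so drop them first
    cleaned = df.dropna(subset=[column]).copy()
    # Convert numbers and other values to text, then trim surrounding whitespace
    cleaned[column] = cleaned[column].astype(str).str.strip()
    # Discard rows that are empty after trimming
    return cleaned[cleaned[column] != ""].reset_index(drop=True)


sample = pd.DataFrame({"Keyword": ["abraham lincoln", 2024, None, "  ", "campaign finance"]})
cleaned = clean_keywords(sample)
print(cleaned["Keyword"].tolist())
```

Note that this drops rows, so the cleaned frame can be shorter than the original CSV.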

Key Takeaways

  • Virtual Environments: Always use virtual environments to cleanly manage project dependencies.
  • Language Models: spaCy relies on language models to perform its analysis. Make sure you download the appropriate model for your language.
  • Data Types: Be aware of data types in your files (like CSVs) and ensure they match the requirements of the libraries you’re using.
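
A quick way to confirm the first two points before running a script is sketched below, using only the standard library. The helper names in_virtualenv and missing_packages are hypothetical. The check relies on the fact that a model downloaded with spacy download is installed as a regular Python package, so it can be looked up by name like any other.

```python
import importlib.util
import sys
from typing import Iterable, List


def in_virtualenv() -> bool:
    """True when the interpreter runs inside a venv-style virtual environment."""
    return sys.prefix != sys.base_prefix


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the names that cannot be imported from this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("virtual environment active:", in_virtualenv())
print("missing:", missing_packages(["spacy", "pandas", "en_core_web_sm"]))
```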

Additional Notes

  • pipx: We discussed pipx as an alternative for installing isolated Python applications.
  • Python Version: Be mindful of potential conflicts if you’re working with multiple Python versions on your system.

Feel free to reach out if you have more questions or want to explore customizing your name detection further!

Further Scripts

Location Detection

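I also created a script to detect a location in each keyword. It records the first entity that spaCy labels GPE (geopolitical entity, such as a country, city, or state).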

import spacy
import pandas as pd

# Load spaCy model
nlp = spacy.load("en_core_web_sm")

# File path
csv_path = "/Users/bhafner/Library/CloudStorage/OneDrive-brianhafner.com/Brian Hafner Tech-OneDrive/Clients - BHT/American Promise/Keyword Research/kw_person.csv"

# Read the CSV file
df = pd.read_csv(csv_path)

def identify_location(keyword):
    doc = nlp(keyword)
    for ent in doc.ents:
        if ent.label_ == "GPE":
            return ent.text
    return ""  # No location found

# Apply the function to the 'Keyword' column
df['Location'] = df['Keyword'].apply(identify_location)

# Save the modified CSV
df.to_csv(csv_path, index=False)
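
Because the location script runs the same model over the same column as the name detector, both labels could be collected in a single pass. The following is a rough sketch under that assumption. The helper name tag_entities is hypothetical, the loaded pipeline is passed in as an argument, and unlike the script above it joins every GPE match with "; " instead of keeping only the first.

```python
from typing import Dict, Iterable, List


def tag_entities(keywords: Iterable[str], nlp) -> List[Dict[str, str]]:
    """Return one {"Person": ..., "Location": ...} dict per keyword."""
    rows = []
    for doc in nlp.pipe(keywords):
        is_person = any(ent.label_ == "PERSON" for ent in doc.ents)
        places = [ent.text for ent in doc.ents if ent.label_ == "GPE"]
        rows.append({
            "Person": "Person" if is_person else "",
            "Location": "; ".join(places),
        })
    return rows


# Usage with the objects from the script above (not executed here):
# tags = tag_entities(df["Keyword"].astype(str), nlp)
# df["Person"] = [t["Person"] for t in tags]
# df["Location"] = [t["Location"] for t in tags]
```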

Phrase Count

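This script is meant to find repeating words and phrases in the keywords and count their occurrences. As written, it tokenizes each keyword separately, drops stop words, and records the most frequent remaining word and its count for that keyword.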

import spacy
import pandas as pd
from collections import Counter

# Load spaCy model
nlp = spacy.load("en_core_web_sm")

# File path
csv_path = "/Users/bhafner/Library/CloudStorage/OneDrive-brianhafner.com/Brian Hafner Tech-OneDrive/Clients - BHT/American Promise/Keyword Research/kw_person.csv"

# Read the CSV file
df = pd.read_csv(csv_path)

def find_common_phrases(text, n=5):
    """Finds the top 'n' most common words/phrases."""
    doc = nlp(text)
    # Filter out stop words
    words = [token.text for token in doc if not token.is_stop]
    word_counts = Counter(words)
    return word_counts.most_common(n)

def process_keyword(keyword):
    phrases = find_common_phrases(keyword)
    return phrases[0] if phrases else ('', 0)  # Get the top phrase

# Apply the function to the DataFrame
df[['common_phrases', 'phrase_count']] = df['Keyword'].apply(process_keyword).to_list()

# Save the modified CSV
df.to_csv(csv_path, index=False)
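
To count words and phrases that repeat across the whole keyword list, the counts need to accumulate over all rows instead of restarting for each keyword. Here is a minimal sketch using only the standard library. The helper name count_ngrams, the whitespace tokenization, and the tiny stop-word set are simplifying assumptions; the script above uses spaCy's tokenizer and stop-word list instead.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Tiny illustrative stop-word set; spaCy's token.is_stop covers far more words.
STOP_WORDS = {"a", "an", "the", "of", "in", "to", "for", "and"}


def count_ngrams(keywords: Iterable[str], n: int = 2, top: int = 5) -> List[Tuple[str, int]]:
    """Count n-word phrases across all keywords and return the `top` most common."""
    counts: Counter = Counter()
    for keyword in keywords:
        # Stop words are removed first, so a phrase can span a dropped word
        words = [w for w in str(keyword).lower().split() if w not in STOP_WORDS]
        # Slide a window of n words over the remaining words
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts.most_common(top)


sample = [
    "campaign finance reform",
    "campaign finance laws",
    "history of campaign finance",
    "citizens united ruling",
]
print(count_ngrams(sample, n=2, top=1))  # [('campaign finance', 3)]
```

Setting n=1 counts single words, and larger values of n count longer phrases.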

 
