Cuban Address Manzanification System

Overview
What is Manzanification?
Algorithm Research & Selection
Installation
Implementation
Usage
Configuration
Limitations & Future Work

Overview

This system automates the process of matching Cuban street addresses to their corresponding manzanas (housing blocks). The system is designed to process batches of addresses from Excel files and return enriched data with manzana assignments and confidence scores.

Use Case Flow

User uploads Excel file with addresses
Backend processes addresses through manzanification engine
System returns Excel with added manzana column
User reviews/corrects results in separate validation tool

What is Manzanification?

Manzanification is the process of geocoding Cuban addresses to their administrative housing block units called manzanas.

A manzana is a city block or housing unit used in Cuban urban planning and administration. Each manzana has a unique identifier (e.g., 781A, 817, 676).

Cuban Address Structure

Cuban addresses typically follow these patterns:

Pattern	Example	Components
Street + Number + Cross Streets	`CALLE 39 # 2012 / 20 Y 22`	Main street, house number, intersecting streets
Avenue + Number	`AVE 48 # 6116 / 61 Y 63`	Avenue, house number, cross streets
Building + Apartment	`EDF 44 APT 5 JS`	Building number, apartment, settlement code
Street + Cross Streets (no number)	`CALLE 61 / 20 Y 22`	Main street, cross streets only

The cross streets (e.g., "20 Y 22") define the block boundaries, making spatial matching feasible.
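
As a rough illustration of how the patterns in the table break down, the sketch below pulls the main street, house number, and cross streets out of the example addresses with regular expressions. The helper name `split_address` and the exact regexes are assumptions made for this example; they are not the parser used by the system.

```python
import re

# Hypothetical helper: the regexes cover only the patterns in the table above.
STREET_RE = re.compile(r"\b(CALLE|AVE)\s+(\d+)", re.IGNORECASE)
NUMBER_RE = re.compile(r"#\s*(\d+)")
CROSS_RE = re.compile(r"/\s*(\d+)\s+Y\s+(\d+)", re.IGNORECASE)


def split_address(address):
    """Split an address into main street, house number, and cross streets."""
    street = STREET_RE.search(address)
    number = NUMBER_RE.search(address)
    cross = CROSS_RE.search(address)
    return {
        "main_street": f"{street.group(1).upper()} {street.group(2)}" if street else None,
        "house_number": number.group(1) if number else None,
        "cross_streets": (cross.group(1), cross.group(2)) if cross else None,
    }


for example in ["CALLE 39 # 2012 / 20 Y 22", "CALLE 61 / 20 Y 22", "EDF 44 APT 5 JS"]:
    print(example, "->", split_address(example))
```

Building-style addresses such as `EDF 44 APT 5 JS` yield no street or cross streets here, which is why they need separate handling.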

Algorithm Research & Selection

We evaluated four spatial matching algorithms for this project:

Algorithm 1: Point-in-Polygon

Approach: Generate a point from the address coordinates and check which manzana polygon contains it.

Process:

Find main street geometry
Find intersection point of cross streets
Use house number to offset point along the street
Check which manzana polygon contains this point

Pros:

Simple and fast
Low computational overhead

Cons:

Requires accurate point generation
Assumes house numbers are evenly distributed
Fails if point falls on boundary or in gaps
No handling of ambiguous cases

Verdict: Too brittle for real-world Cuban address data
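
The point-in-polygon idea can be sketched with shapely on toy geometry. The coordinates, the lateral offset, and the function name `locate` are invented for illustration, and a metric coordinate system is assumed; the block codes are placeholders only.

```python
from shapely.geometry import LineString, Point, Polygon

# Toy data in meters (assumed): one street and two adjacent blocks north of it.
street = LineString([(0, 0), (200, 0)])
manzanas = {
    "781A": Polygon([(0, 5), (100, 5), (100, 85), (0, 85)]),
    "817": Polygon([(100, 5), (200, 5), (200, 85), (100, 85)]),
}


def locate(distance_along, lateral_offset):
    """Place a point along the street, shift it sideways, and test containment."""
    on_street = street.interpolate(distance_along)
    point = Point(on_street.x, on_street.y + lateral_offset)
    for code, polygon in manzanas.items():
        if polygon.contains(point):
            return code
    return None  # the point fell in a gap or exactly on a boundary


print(locate(40, 20))   # well inside the first block
print(locate(40, 2))    # in the gap between the street and the block
print(locate(100, 20))  # on the shared boundary of both blocks
```

The last two calls show the failure modes listed under Cons: a small error in the generated point is enough to lose the match.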

Algorithm 2: Street Segment + Buffer (SELECTED)

Approach: Find the street segment between cross streets, buffer it, and intersect with manzanas.

Process:

Get main street LineString (e.g., "C-39")
Find intersections with cross streets ("C-20" and "C-22")
Extract street segment between these two points
Determine which side using house number logic
Buffer the segment appropriately
Find manzana(s) that intersect this buffer
Return manzana with highest overlap area

Pros:

Uses actual street geometry, not just points
Cross streets provide precise block boundaries
Buffer handles boundary inaccuracies in shapefiles
Works even if house numbers aren't perfectly sequential
Returns confidence scores based on overlap
Graceful degradation when cross streets missing

Cons:

Requires calibration of buffer distance
Assumes street geometries are relatively accurate
Side determination logic may need adjustment

Verdict: ✅ SELECTED - Best balance of accuracy, robustness, and maintainability
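
A minimal sketch of the buffer-and-overlap step on toy geometry, assuming shapely and coordinates in meters. The block names and the function `best_manzana` are hypothetical. Note that "left" and "right" are relative to the direction in which the segment is drawn.

```python
from shapely.geometry import LineString, box

# The toy street runs west to east, so the left of the line is north
# and the right of the line is south.
segment = LineString([(0, 0), (100, 0)])
manzanas = {
    "north_block": box(0, 5, 100, 85),
    "south_block": box(0, -85, 100, -5),
}


def best_manzana(segment, side, buffer_distance=50):
    """Buffer one side of the segment and return the block with the largest overlap."""
    # With single_sided=True, a positive distance buffers to the left of the
    # line direction and a negative distance buffers to the right.
    signed = buffer_distance if side == "left" else -buffer_distance
    buffered = segment.buffer(signed, single_sided=True)
    overlaps = {code: geom.intersection(buffered).area for code, geom in manzanas.items()}
    code = max(overlaps, key=overlaps.get)
    return code if overlaps[code] > 0 else None


print(best_manzana(segment, "left"))
print(best_manzana(segment, "right"))
```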

Algorithm 3: Nearest Neighbor with Constraints

Approach: Find manzanas near the intersection, rank by distance and constraints.

Process:

Find intersection of main street with cross streets
Get all manzanas within X meters of intersection
Filter manzanas that actually touch the main street
Use house number range to pick the right one
Return closest match

Pros:

Handles messy/incomplete data well
Doesn't require perfect street geometries
Good for quick prototyping

Cons:

Requires manual tuning of distance threshold
May return wrong manzana if threshold too large
Less precise than segment-based approach
Doesn't use full geometric information

Verdict: Good fallback option, but less precise than Algorithm 2
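
The filtering and ranking steps of this approach could look roughly like the following. The geometry, both thresholds, and the function name `nearest_candidates` are assumptions for illustration; "touching" the street is approximated with a small distance tolerance because block polygons rarely coincide with street centerlines.

```python
from shapely.geometry import LineString, Point, box

# Toy data in meters (assumed): an intersection, the main street, three blocks.
intersection = Point(0, 0)
main_street = LineString([(0, 0), (300, 0)])
manzanas = {
    "near_touching": box(5, 5, 95, 85),     # close to the intersection, beside the street
    "near_detached": box(5, 120, 95, 200),  # within range, but one block off the street
    "far_touching": box(205, 5, 295, 85),   # beside the street, but far away
}


def nearest_candidates(max_distance=150, street_tolerance=10):
    """Keep blocks within max_distance of the intersection and within
    street_tolerance of the main street, sorted by distance."""
    ranked = []
    for code, geom in manzanas.items():
        d = geom.distance(intersection)
        if d <= max_distance and geom.distance(main_street) <= street_tolerance:
            ranked.append((round(d, 1), code))
    return sorted(ranked)


print(nearest_candidates())
print(nearest_candidates(max_distance=250))  # a looser threshold admits more blocks
```

Raising `max_distance` brings in the distant block as well, which is the threshold sensitivity noted under Cons.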

Algorithm 4: Graph-Based Network Analysis

Approach: Model streets as a network graph and navigate to the address location.

Process:

Build street network graph from shapefiles
Find intersection node (cross streets + main street)
Navigate along main street for estimated distance
Find manzana at that graph position

Pros:

Most accurate for complex urban layouts
Handles one-way streets, connectivity issues
Good for routing applications

Cons:

Significant implementation complexity
Higher computational overhead
Requires clean, well-connected street network
Overkill for static address matching

Verdict: Too complex for current requirements; revisit if routing needed
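
For comparison, the "navigate along the main street" step can be sketched without any graph library. The node names, coordinates, and the function `walk` are hypothetical; a real implementation would build the network from the street shapefile and then map the resulting edge to its adjacent manzanas.

```python
import math

# Hypothetical toy network: node -> (x, y) in meters, plus the ordered nodes of one street.
nodes = {"39x20": (0.0, 0.0), "39x22": (100.0, 0.0), "39x24": (200.0, 0.0)}
street_path = ["39x20", "39x22", "39x24"]


def walk(path, start, distance):
    """Walk `distance` meters along `path` from `start`; return (edge, position)."""
    i = path.index(start)
    remaining = distance
    while i < len(path) - 1:
        (x1, y1), (x2, y2) = nodes[path[i]], nodes[path[i + 1]]
        length = math.hypot(x2 - x1, y2 - y1)
        if remaining <= length:
            t = remaining / length
            return (path[i], path[i + 1]), (x1 + t * (x2 - x1), y1 + t * (y2 - y1))
        remaining -= length
        i += 1
    return None  # ran off the end of the street


print(walk(street_path, "39x20", 130))
```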

Installation

Required Python Libraries

pip install geopandas pandas shapely openpyxl

Library Descriptions

Library	Purpose	Documentation
`geopandas`	Reading shapefiles, spatial operations	docs
`pandas`	Excel I/O, data manipulation	docs
`shapely`	Geometric operations (buffers, intersections)	docs
`openpyxl`	Excel file reading/writing	docs

Additional Dependencies

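These are typically installed automatically with geopandas, although the exact set depends on the geopandas version (recent releases default to pyogrio for file I/O and no longer need rtree):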

fiona - Shapefile I/O
pyproj - Coordinate system transformations
rtree - Spatial indexing for performance
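
Because the installed set varies between environments, a quick check such as the one below can confirm which packages are importable. It uses only the standard library; the helper name `missing` is an assumption for this sketch.

```python
from importlib.util import find_spec

# Package names taken from this section; which ones are present depends on the install.
REQUIRED = ["geopandas", "pandas", "shapely", "openpyxl"]
OPTIONAL = ["fiona", "pyproj", "rtree"]


def missing(packages):
    """Return the subset of package names that cannot be imported."""
    return [name for name in packages if find_spec(name) is None]


print("Missing required:", missing(REQUIRED))
print("Missing optional:", missing(OPTIONAL))
```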

Implementation

Sample Code

Below is the complete implementation of the Street Segment + Buffer algorithm:

```python
import re

import geopandas as gpd
import pandas as pd
from shapely.geometry import LineString


class ManzanaMatcher:
    def __init__(self, streets_shapefile, manzanas_shapefile):
        """
        Initialize the matcher with shapefiles.

        Args:
            streets_shapefile: Path to streets shapefile
            manzanas_shapefile: Path to manzanas shapefile
        """
        self.streets = gpd.read_file(streets_shapefile)
        self.manzanas = gpd.read_file(manzanas_shapefile)

        # Ensure same CRS
        if self.streets.crs != self.manzanas.crs:
            self.streets = self.streets.to_crs(self.manzanas.crs)

    def parse_address(self, address):
        """
        Parse Cuban address into components.

        Returns dict with:
        - main_street: e.g., "C-39" or "A-46"
        - cross_street_1: e.g., "20"
        - cross_street_2: e.g., "22"
        - house_number: e.g., "2406"
        - building: e.g., "44"
        - apartment: e.g., "5"
        """
        result = {
            'main_street': None,
            'cross_street_1': None,
            'cross_street_2': None,
            'house_number': None,
            'building': None,
            'apartment': None
        }

        # Extract main street (CALLE XX, AVE XX, C-XX or A-XX).
        # \s* allows the space in "CALLE 39".
        street_match = re.search(r'(CALLE|AVE|C-|A-)\s*(\d+)', address, re.IGNORECASE)
        if street_match:
            prefix = 'C-' if street_match.group(1).upper().startswith('C') else 'A-'
            result['main_street'] = f"{prefix}{street_match.group(2)}"

        # Extract house number (# XXXX)
        number_match = re.search(r'#?\s*(\d{3,4})(?!\s*APT)', address)
        if number_match:
            result['house_number'] = number_match.group(1)

        # Extract cross streets (/ XX Y XX or ENTRE XX Y XX).
        # A group is needed here; [/|ENTRE] would match a single character.
        cross_match = re.search(r'(?:/|ENTRE)\s*(\d+)\s*[YX]\s*(\d+)', address, re.IGNORECASE)
        if cross_match:
            result['cross_street_1'] = cross_match.group(1)
            result['cross_street_2'] = cross_match.group(2)

        # Extract building (EDF XX or EDIFICIO XX)
        building_match = re.search(r'(?:EDF|EDIFICIO)\s*(\d+)', address, re.IGNORECASE)
        if building_match:
            result['building'] = building_match.group(1)

        # Extract apartment
        apt_match = re.search(r'APT\s*(\d+)', address, re.IGNORECASE)
        if apt_match:
            result['apartment'] = apt_match.group(1)

        return result

    def find_street(self, street_name):
        """Find street geometry by name."""
        # Adjust column name based on your shapefile
        # Common names: 'name', 'NAME', 'street_nam', 'CALLE'
        street_col = 'name'  # CHANGE THIS to match your shapefile

        # Note: this is a substring match, so "C-2" also matches "C-20".
        matches = self.streets[self.streets[street_col].str.contains(
            street_name, case=False, na=False)]

        if len(matches) == 0:
            return None
        return matches.iloc[0].geometry

    def find_intersection(self, main_street_geom, cross_street_1, cross_street_2):
        """
        Find the segment of main street between two cross streets.

        Returns:
            LineString segment between the two cross streets
        """
        # Find cross street geometries
        cross_geom_1 = self.find_street(f"C-{cross_street_1}")
        cross_geom_2 = self.find_street(f"C-{cross_street_2}")

        # Try with A- prefix for any cross street not found as C-
        if cross_geom_1 is None:
            cross_geom_1 = self.find_street(f"A-{cross_street_1}")
        if cross_geom_2 is None:
            cross_geom_2 = self.find_street(f"A-{cross_street_2}")

        if cross_geom_1 is None or cross_geom_2 is None:
            return None

        # Find intersection points
        int_point_1 = main_street_geom.intersection(cross_geom_1)
        int_point_2 = main_street_geom.intersection(cross_geom_2)

        if int_point_1.is_empty or int_point_2.is_empty:
            return None

        # Get the actual point (in case intersection returns multiple geometries)
        if hasattr(int_point_1, 'geoms'):
            int_point_1 = list(int_point_1.geoms)[0]
        if hasattr(int_point_2, 'geoms'):
            int_point_2 = list(int_point_2.geoms)[0]

        # Create a straight segment between the intersection points
        segment = LineString([int_point_1, int_point_2])
        return segment

    def determine_side(self, house_number):
        """
        Determine which side of street based on house number.

        Returns: 'left' or 'right' or 'both'
        """
        if house_number is None:
            return 'both'

        # Even/odd logic (common in many cities)
        # Adjust this based on Cuban numbering conventions
        num = int(house_number)
        if num % 2 == 0:
            return 'right'
        else:
            return 'left'

    def find_manzana(self, address):
        """
        Main function to find manzana for an address.

        Returns:
            dict with 'manzana' code and 'confidence' score
        """
        # Empty Excel cells arrive as NaN (a float), which cannot be parsed
        if not isinstance(address, str):
            return {'manzana': None, 'confidence': 0.0, 'method': 'invalid_address'}

        # Parse address
        parsed = self.parse_address(address)

        # Handle building-only addresses (like EDF 44 APT 5 JS)
        if parsed['building'] and not parsed['main_street']:
            # For building-based addresses, you might need a separate lookup
            # or manual mapping. For now, return zero confidence.
            return {'manzana': None, 'confidence': 0.0, 'method': 'building_only'}

        if not parsed['main_street']:
            return {'manzana': None, 'confidence': 0.0, 'method': 'no_street'}

        # Find main street
        main_street_geom = self.find_street(parsed['main_street'])
        if main_street_geom is None:
            return {'manzana': None, 'confidence': 0.0, 'method': 'street_not_found'}

        # Find street segment between cross streets
        if parsed['cross_street_1'] and parsed['cross_street_2']:
            segment = self.find_intersection(
                main_street_geom,
                parsed['cross_street_1'],
                parsed['cross_street_2']
            )
            if segment is None:
                # Fallback: use entire street
                segment = main_street_geom
                confidence_penalty = 0.3
            else:
                confidence_penalty = 0.0
        else:
            # No cross streets, use entire main street
            segment = main_street_geom
            confidence_penalty = 0.5

        # Determine side
        side = self.determine_side(parsed['house_number'])

        # Buffer the segment (adjust buffer distance based on your data)
        # The value is in CRS units: meters only if the CRS is projected.
        buffer_distance = 50  # meters, adjust as needed

        # Left/right are relative to the direction of the segment.
        if side == 'both':
            buffered = segment.buffer(buffer_distance)
        elif side == 'right':
            # Buffer only to the right (negative distance with single_sided=True)
            buffered = segment.buffer(-buffer_distance, single_sided=True)
        else:  # left
            # Buffer only to the left (positive distance with single_sided=True)
            buffered = segment.buffer(buffer_distance, single_sided=True)

        # Find manzanas that intersect the buffer
        # Adjust column name based on your shapefile
        manzana_col = 'manzana'  # CHANGE THIS to match your shapefile

        intersecting = self.manzanas[self.manzanas.intersects(buffered)]

        if len(intersecting) == 0:
            return {'manzana': None, 'confidence': 0.0, 'method': 'no_intersection'}

        # Calculate overlap area for each candidate
        intersecting = intersecting.copy()
        intersecting['overlap'] = intersecting.geometry.apply(
            lambda geom: geom.intersection(buffered).area
        )

        # Get the manzana with highest overlap
        best_match = intersecting.loc[intersecting['overlap'].idxmax()]
        manzana_code = best_match[manzana_col]

        # Calculate confidence based on:
        # - Whether cross streets were found
        # - How many candidates there were
        # (the overlap ratio is not yet factored in)
        confidence = 1.0 - confidence_penalty
        if len(intersecting) > 1:
            confidence *= 0.8

        return {
            'manzana': manzana_code,
            'confidence': round(confidence, 2),
            'method': 'spatial_match',
            'candidates': len(intersecting)
        }

    def process_excel(self, input_path, output_path, address_column='DIRECCION'):
        """
        Process an Excel file with addresses and add manzana column.

        Args:
            input_path: Path to input Excel file
            output_path: Path to output Excel file
            address_column: Name of column containing addresses
        """
        # Read Excel
        df = pd.read_excel(input_path)

        # Process each address
        results = []
        for address in df[address_column]:
            result = self.find_manzana(address)
            results.append(result)

        # Add results to dataframe
        df['MANZANA_PREDICTED'] = [r['manzana'] for r in results]
        df['CONFIDENCE'] = [r['confidence'] for r in results]
        df['MATCH_METHOD'] = [r['method'] for r in results]

        # Save to Excel
        df.to_excel(output_path, index=False)

        return df
```
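
The confidence logic in `find_manzana()` applies a fallback penalty and a multi-candidate factor, but does not use the overlap areas it computes. One possible way to fold them in is sketched below. The function name `score_confidence` and the weighting are assumptions for illustration, not part of the implementation above.

```python
def score_confidence(overlaps, penalty=0.0):
    """Combine a fallback penalty with the best candidate's share of total overlap.

    overlaps: overlap areas, one per candidate manzana (hypothetical input).
    penalty:  0.0, 0.3 or 0.5, as used in find_manzana().
    """
    total = sum(overlaps)
    if total <= 0:
        return 0.0
    share = max(overlaps) / total  # 1.0 when a single block takes all the overlap
    return round((1.0 - penalty) * share, 2)


print(score_confidence([4500.0]))               # single candidate
print(score_confidence([3000.0, 1000.0]))       # two candidates, one dominant
print(score_confidence([3000.0, 1000.0], 0.3))  # same, after a cross-street fallback
```

Compared with a flat 0.8 factor, this distinguishes a near-tie between two blocks from a case where one block clearly dominates.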

Usage

Basic Usage

```python
from manzana_matcher import ManzanaMatcher

# Initialize matcher with your shapefiles
matcher = ManzanaMatcher(
    streets_shapefile='data/streets.shp',
    manzanas_shapefile='data/manzanas.shp'
)

# Match a single address
address = "CALLE 39 # 2012 / 20 Y 22"
result = matcher.find_manzana(address)

print(f"Manzana: {result['manzana']}")
print(f"Confidence: {result['confidence']}")
print(f"Method: {result['method']}")
```

Batch Processing

```python
# Process entire Excel file
output_df = matcher.process_excel(
    input_path='input/addresses.xlsx',
    output_path='output/addresses_with_manzanas.xlsx',
    address_column='DIRECCION'
)

# View statistics
print(f"Total addresses: {len(output_df)}")
print(f"Matched: {output_df['MANZANA_PREDICTED'].notna().sum()}")
print(f"Average confidence: {output_df['CONFIDENCE'].mean():.2f}")
```
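
Since results are reviewed in a separate validation tool, it may help to split out the rows that most need attention. The sketch below uses the output columns written by `process_excel()`; the sample rows and the 0.6 threshold are invented for illustration.

```python
import pandas as pd

# Stand-in for the DataFrame returned by process_excel(); values are invented.
output_df = pd.DataFrame({
    "DIRECCION": ["CALLE 39 # 2012 / 20 Y 22", "CALLE 61", "EDF 44 APT 5 JS"],
    "MANZANA_PREDICTED": ["781A", "817", None],
    "CONFIDENCE": [0.8, 0.4, 0.0],
    "MATCH_METHOD": ["spatial_match", "spatial_match", "building_only"],
})

REVIEW_THRESHOLD = 0.6  # assumed cut-off; tune against validated samples

# Rows with no match or with low confidence go to manual review first.
needs_review = output_df[
    output_df["MANZANA_PREDICTED"].isna() | (output_df["CONFIDENCE"] < REVIEW_THRESHOLD)
]
print(needs_review[["DIRECCION", "CONFIDENCE", "MATCH_METHOD"]])
print(output_df["MATCH_METHOD"].value_counts())
```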

API Integration

```python
import tempfile

from flask import Flask, request, send_file
from manzana_matcher import ManzanaMatcher

app = Flask(__name__)
matcher = ManzanaMatcher('data/streets.shp', 'data/manzanas.shp')


@app.route('/api/manzanify', methods=['POST'])
def manzanify():
    # Get uploaded file
    file = request.files['excel']

    # Save to temp location (the file is closed before it is read again)
    with tempfile.NamedTemporaryFile(delete=False, suffix='.xlsx') as tmp_in:
        file.save(tmp_in)

    # Reserve an output path, then process
    with tempfile.NamedTemporaryFile(delete=False, suffix='.xlsx') as tmp_out:
        output_path = tmp_out.name
    matcher.process_excel(tmp_in.name, output_path)

    # Note: delete=False means the temp files are not cleaned up automatically
    return send_file(output_path, as_attachment=True)
```

Configuration

Critical Configuration Points

Before running the system, you must configure these settings in the code:

1. Shapefile Column Names

```python
# In find_street() method
street_col = 'name'  # Change to your streets column name
# Common alternatives: 'NAME', 'street_nam', 'CALLE', 'nom_calle'

# In find_manzana() method
manzana_col = 'manzana'  # Change to your manzanas column name
# Common alternatives: 'MANZANA', 'codigo', 'id', 'block_id'
```

How to find your column names:

```python
import geopandas as gpd

# Check streets columns
streets = gpd.read_file('streets.shp')
print("Streets columns:", streets.columns.tolist())

# Check manzanas columns
manzanas = gpd.read_file('manzanas.shp')
print("Manzanas columns:", manzanas.columns.tolist())
```

2. Buffer Distance

```python
# In find_manzana() method
buffer_distance = 50  # meters
```

How to calibrate:

Measure typical manzana width in your shapefiles
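Start with 50 m and adjust based on results (the value is in CRS units, so it means meters only if the shapefiles use a projected, meter-based CRS)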
Too small: misses correct manzanas
Too large: includes too many candidates
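
One way to "measure typical manzana width" is to take the shorter side of each polygon's minimum rotated rectangle and look at the median. The sketch below does this on toy boxes; the half-width rule of thumb and the helper name `short_side` are assumptions, and in practice the polygons would come from `manzanas.geometry`.

```python
from statistics import median

from shapely.geometry import box

# Toy manzana polygons in meters (assumed projected CRS).
polygons = [box(0, 0, 100, 80), box(0, 0, 120, 90), box(0, 0, 90, 70)]


def short_side(polygon):
    """Length of the shorter side of the polygon's minimum rotated rectangle."""
    xs, ys = polygon.minimum_rotated_rectangle.exterior.coords.xy
    # The first two edges of a rectangle give its two distinct side lengths.
    sides = [((xs[i + 1] - xs[i]) ** 2 + (ys[i + 1] - ys[i]) ** 2) ** 0.5 for i in range(2)]
    return min(sides)


typical_width = median(short_side(p) for p in polygons)
# Assumed rule of thumb: reach about halfway into a block.
suggested_buffer = typical_width / 2
print(typical_width, suggested_buffer)
```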

3. House Number Side Logic

```python
# In determine_side() method
def determine_side(self, house_number):
    if house_number is None:
        return 'both'

    num = int(house_number)
    if num % 2 == 0:
        return 'right'  # Even numbers on right
    else:
        return 'left'   # Odd numbers on left
```

Cuban numbering conventions:

Verify if even/odd rule applies in Cuba
Some cities use sequential numbering
May vary by neighborhood
Consider disabling side logic initially (always return 'both')
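
To make the side logic easy to switch off or flip per city, the rule could be passed in as a parameter. This is a standalone sketch, not the method above; the rule names `"both"`, `"even_right"`, and `"even_left"` are hypothetical.

```python
def determine_side(house_number, rule="both"):
    """Map a house number to a street side under a configurable rule.

    rule: "both" disables side logic; "even_right" / "even_left" apply parity.
    """
    if house_number is None or rule == "both":
        return "both"
    try:
        num = int(house_number)
    except ValueError:
        return "both"  # e.g. "2012A": fall back rather than fail
    even_side, odd_side = ("right", "left") if rule == "even_right" else ("left", "right")
    return even_side if num % 2 == 0 else odd_side


print(determine_side("2012"))                # side logic disabled by default
print(determine_side("2012", "even_right"))
print(determine_side("2013", "even_right"))
```

Defaulting to `"both"` matches the suggestion to disable side logic until the local numbering convention has been verified.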

Limitations & Future Work

Current Limitations

Building-Only Addresses
- Addresses like EDF 44 APT 5 JS cannot be spatially matched
- Require separate lookup table or manual mapping
- Currently return a manzana of None with a confidence of 0.0
Missing Cross Streets
- Addresses without cross streets use entire street geometry
- Lower confidence scores
- May match to wrong manzana if street is long
House Numbering Assumptions
- Even/odd side logic may not apply in all Cuban cities
- Sequential numbering not well-handled
- May need city-specific calibration
Shapefile Quality Dependency
- Requires accurate street geometries
- Streets must actually intersect in the data
- Manzana boundaries must align with streets
Performance
- Not optimized for real-time processing
- Large batches (10,000+ addresses) may be slow
- No spatial indexing implemented
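
For the building-only limitation, a separate lookup table could be consulted before giving up on an address. The sketch below is one possible shape for it. The table contents, the key of building number plus settlement code, the `building_lookup` method label, and the confidence of 1.0 for a hit are all assumptions for illustration.

```python
import re

# Hypothetical lookup keyed by (building number, settlement code); the entry is invented.
BUILDING_TO_MANZANA = {("44", "JS"): "676"}

BUILDING_RE = re.compile(
    r"(?:EDF|EDIFICIO)\s*(\d+)(?:\s+APT\s*\d+)?\s*([A-Z]{2,})?$", re.IGNORECASE
)


def lookup_building(address):
    """Return a result dict for building-style addresses, or None for other addresses."""
    match = BUILDING_RE.search(address.strip())
    if not match:
        return None
    key = (match.group(1), (match.group(2) or "").upper())
    manzana = BUILDING_TO_MANZANA.get(key)
    return {
        "manzana": manzana,
        "confidence": 1.0 if manzana else 0.0,  # assumed: table hits are trusted
        "method": "building_lookup",
    }


print(lookup_building("EDF 44 APT 5 JS"))
print(lookup_building("EDF 99 APT 1 JS"))
```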
