A community-based topic aggregation platform built on atproto

Merge branch 'feat/kagi-news-aggregator-phase-1'

+4344 -884
+6
aggregators/kagi-news/.env.example
+# Aggregator Identity (pre-created account credentials)
+AGGREGATOR_HANDLE=kagi-news.local.coves.dev
+AGGREGATOR_PASSWORD=your-secure-password-here
+
+# Optional: Override Coves API URL (defaults to config.yaml)
+# COVES_API_URL=http://localhost:3001
+41
aggregators/kagi-news/.gitignore
+# Environment and config
+.env
+config.yaml
+venv/
+
+# State files
+data/*.json
+data/world.xml
+
+# Python
+__pycache__/
+*.py[cod]
+*$py.class
+*.so
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+
+# Testing
+.pytest_cache/
+.coverage
+htmlcov/
+
+# IDE
+.vscode/
+.idea/
+*.swp
+*.swo
+173
aggregators/kagi-news/README.md
+# Kagi News RSS Aggregator
+
+A Python-based RSS aggregator that posts Kagi News stories to Coves communities using rich text formatting.
+
+## Overview
+
+This aggregator:
+- Fetches RSS feeds from Kagi News daily via CRON
+- Parses HTML descriptions to extract structured content (highlights, perspectives, sources)
+- Formats posts using Coves rich text with facets (bold, italic, links)
+- Hot-links images from Kagi's proxy (no blob upload)
+- Posts to configured communities via XRPC
+
+## Project Structure
+
+```
+aggregators/kagi-news/
+├── src/
+│   ├── models.py              # Data models (KagiStory, Perspective, etc.)
+│   ├── rss_fetcher.py         # RSS feed fetching with retry logic
+│   ├── html_parser.py         # Parse Kagi HTML to structured data
+│   ├── richtext_formatter.py  # Format content with rich text facets (TODO)
+│   ├── atproto_client.py      # ATProto authentication and operations (TODO)
+│   ├── state_manager.py       # Deduplication state tracking (TODO)
+│   ├── config.py              # Configuration loading (TODO)
+│   └── main.py                # Entry point (TODO)
+├── tests/
+│   ├── test_rss_fetcher.py    # RSS fetcher tests ✓
+│   ├── test_html_parser.py    # HTML parser tests ✓
+│   └── fixtures/
+│       ├── sample_rss_item.xml
+│       └── world.xml
+├── scripts/
+│   └── generate_did.py        # Helper to generate aggregator DID (TODO)
+├── requirements.txt           # Python dependencies
+├── config.example.yaml        # Example configuration
+├── .env.example               # Environment variables template
+├── crontab                    # CRON schedule
+└── README.md
+```
+
+## Setup
+
+### Prerequisites
+
+- Python 3.11+
+- python3-venv package (`apt install python3.12-venv`)
+
+### Installation
+
+1. Create virtual environment:
+   ```bash
+   python3 -m venv venv
+   source venv/bin/activate
+   ```
+
+2. Install dependencies:
+   ```bash
+   pip install -r requirements.txt
+   ```
+
+3. Copy configuration templates:
+   ```bash
+   cp config.example.yaml config.yaml
+   cp .env.example .env
+   ```
+
+4. Edit `config.yaml` to map RSS feeds to communities
+5. Set environment variables in `.env` (aggregator handle and password)
+
+## Running Tests
+
+```bash
+# Activate virtual environment
+source venv/bin/activate
+
+# Run all tests
+pytest -v
+
+# Run specific test file
+pytest tests/test_html_parser.py -v
+
+# Run with coverage
+pytest --cov=src --cov-report=html
+```
+
+## Development Status
+
+### ✅ Phase 1-2 Complete (Oct 24, 2025)
+- [x] Project structure created
+- [x] Data models defined (KagiStory, Perspective, Quote, Source)
+- [x] RSS fetcher with retry logic and tests
+- [x] HTML parser extracting all sections (summary, highlights, perspectives, sources, quote, image)
+- [x] Test fixtures from real Kagi News feed
+
+### 🚧 Next Steps (Phase 3-4)
+- [ ] Rich text formatter (convert to Coves format with facets)
+- [ ] State manager for deduplication
+- [ ] Configuration loader
+- [ ] ATProto client for post creation
+- [ ] Main orchestration script
+- [ ] End-to-end tests
+
+## Configuration
+
+Edit `config.yaml` to define feed-to-community mappings:
+
+```yaml
+coves_api_url: "https://api.coves.social"
+
+feeds:
+  - name: "World News"
+    url: "https://news.kagi.com/world.xml"
+    community_handle: "world-news.coves.social"
+    enabled: true
+
+  - name: "Tech News"
+    url: "https://news.kagi.com/tech.xml"
+    community_handle: "tech.coves.social"
+    enabled: true
+```
+
+## Architecture
+
+### Data Flow
+
+```
+Kagi RSS Feed
+    ↓ (HTTP GET)
+RSS Fetcher
+    ↓ (feedparser)
+Parsed RSS Items
+    ↓ (for each item)
+HTML Parser
+    ↓ (BeautifulSoup)
+Structured KagiStory
+    ↓
+Rich Text Formatter
+    ↓ (with facets)
+Post Record
+    ↓ (XRPC)
+Coves Community
+```
+
+### Rich Text Format
+
+Posts use Coves rich text with UTF-8 byte-positioned facets:
+
+```python
+{
+  "content": "Summary text...\n\nHighlights:\n• Point 1\n...",
+  "facets": [
+    {
+      "index": {"byteStart": 20, "byteEnd": 31},
+      "features": [{"$type": "social.coves.richtext.facet#bold"}]
+    },
+    {
+      "index": {"byteStart": 50, "byteEnd": 75},
+      "features": [{"$type": "social.coves.richtext.facet#link", "uri": "https://..."}]
+    }
+  ]
+}
+```
+
+## License
+
+See parent Coves project license.
+
+## Related Documentation
+
+- [PRD: Kagi News Aggregator](../../docs/aggregators/PRD_KAGI_NEWS_RSS.md)
+- [PRD: Aggregator System](../../docs/aggregators/PRD_AGGREGATORS.md)
+- [Coves Rich Text Lexicon](../../internal/atproto/lexicon/social/coves/richtext/README.md)
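The facet indices in the README example are UTF-8 byte offsets rather than character offsets, so a multi-byte character such as the `•` bullet shifts every later position. The sketch below shows one way such a span could be computed. `byte_span` is a hypothetical helper written for illustration (it is not part of this merge), and it assumes `byteEnd` is end-exclusive, as in atproto's facet convention.

```python
from typing import Tuple


def byte_span(text: str, substring: str) -> Tuple[int, int]:
    """Return (byteStart, byteEnd) for the first occurrence of substring.

    Hypothetical helper: offsets are counted in UTF-8 bytes and the end
    offset is assumed to be exclusive.
    """
    char_start = text.index(substring)  # raises ValueError if absent
    byte_start = len(text[:char_start].encode("utf-8"))
    byte_end = byte_start + len(substring.encode("utf-8"))
    return byte_start, byte_end


content = "Highlights:\n• Point 1"
start, end = byte_span(content, "Point 1")

# A bold facet covering "Point 1", shaped like the README example.
facet = {
    "index": {"byteStart": start, "byteEnd": end},
    "features": [{"$type": "social.coves.richtext.facet#bold"}],
}
```

Slicing the encoded bytes with the computed offsets (`content.encode("utf-8")[start:end]`) gives back the original substring, which is a cheap sanity check before posting.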
+29
aggregators/kagi-news/config.example.yaml
+# Kagi News RSS Aggregator Configuration
+
+# Coves API endpoint
+coves_api_url: "https://api.coves.social"
+
+# Feed-to-community mappings
+feeds:
+  - name: "World News"
+    url: "https://news.kagi.com/world.xml"
+    community_handle: "world-news.coves.social"
+    enabled: true
+
+  - name: "Tech News"
+    url: "https://news.kagi.com/tech.xml"
+    community_handle: "tech.coves.social"
+    enabled: true
+
+  - name: "Business News"
+    url: "https://news.kagi.com/business.xml"
+    community_handle: "business.coves.social"
+    enabled: false
+
+  - name: "Science News"
+    url: "https://news.kagi.com/science.xml"
+    community_handle: "science.coves.social"
+    enabled: false
+
+# Logging configuration
+log_level: "info" # debug, info, warning, error
+5
aggregators/kagi-news/crontab
+# Run Kagi News aggregator daily at 1 PM UTC (after Kagi updates around noon)
+0 13 * * * cd /app && /usr/local/bin/python -m src.main >> /var/log/cron.log 2>&1
+
+# Blank line required at end of crontab
+
+12
aggregators/kagi-news/pytest.ini
+[pytest]
+testpaths = tests
+python_files = test_*.py
+python_classes = Test*
+python_functions = test_*
+addopts =
+    -v
+    --strict-markers
+    --tb=short
+    --cov=src
+    --cov-report=term-missing
+    --cov-report=html
+17
aggregators/kagi-news/requirements.txt
+# Core dependencies
+feedparser==6.0.11
+beautifulsoup4==4.12.3
+requests==2.31.0
+atproto==0.0.55
+pyyaml==6.0.1
+
+# Testing
+pytest==8.1.1
+pytest-cov==5.0.0
+responses==0.25.0
+
+# Development
+black==24.3.0
+mypy==1.9.0
+types-PyYAML==6.0.12.12
+types-requests==2.31.0.20240311
+3
aggregators/kagi-news/src/__init__.py
+"""Kagi News RSS Aggregator for Coves."""
+
+__version__ = "0.1.0"
+165
aggregators/kagi-news/src/config.py
+"""
+Configuration Loader for Kagi News Aggregator.
+
+Loads and validates configuration from YAML files.
+"""
+import os
+import logging
+from pathlib import Path
+from typing import Dict, Any
+import yaml
+from urllib.parse import urlparse
+
+from src.models import AggregatorConfig, FeedConfig
+
+logger = logging.getLogger(__name__)
+
+
+class ConfigError(Exception):
+    """Configuration error."""
+    pass
+
+
+class ConfigLoader:
+    """
+    Loads and validates aggregator configuration.
+
+    Supports:
+    - Loading from YAML file
+    - Environment variable overrides
+    - Validation of required fields
+    - URL validation
+    """
+
+    def __init__(self, config_path: Path):
+        """
+        Initialize config loader.
+
+        Args:
+            config_path: Path to config.yaml file
+        """
+        self.config_path = Path(config_path)
+
+    def load(self) -> AggregatorConfig:
+        """
+        Load and validate configuration.
+
+        Returns:
+            AggregatorConfig object
+
+        Raises:
+            ConfigError: If config is invalid or missing
+        """
+        # Check file exists
+        if not self.config_path.exists():
+            raise ConfigError(f"Configuration file not found: {self.config_path}")
+
+        # Load YAML
+        try:
+            with open(self.config_path, 'r') as f:
+                config_data = yaml.safe_load(f)
+        except yaml.YAMLError as e:
+            raise ConfigError(f"Failed to parse YAML: {e}")
+
+        if not config_data:
+            raise ConfigError("Configuration file is empty")
+
+        # Validate and parse
+        try:
+            return self._parse_config(config_data)
+        except Exception as e:
+            raise ConfigError(f"Invalid configuration: {e}")
+
+    def _parse_config(self, data: Dict[str, Any]) -> AggregatorConfig:
+        """
+        Parse and validate configuration data.
+
+        Args:
+            data: Parsed YAML data
+
+        Returns:
+            AggregatorConfig object
+
+        Raises:
+            ConfigError: If validation fails
+        """
+        # Get coves_api_url (with env override)
+        coves_api_url = os.getenv('COVES_API_URL', data.get('coves_api_url'))
+        if not coves_api_url:
+            raise ConfigError("Missing required field: coves_api_url")
+
+        # Validate URL
+        if not self._is_valid_url(coves_api_url):
+            raise ConfigError(f"Invalid URL for coves_api_url: {coves_api_url}")
+
+        # Get log level (default to info)
+        log_level = data.get('log_level', 'info')
+
+        # Parse feeds
+        feeds_data = data.get('feeds', [])
+        if not feeds_data:
+            raise ConfigError("Configuration must include at least one feed")
+
+        feeds = []
+        for feed_data in feeds_data:
+            feed = self._parse_feed(feed_data)
+            feeds.append(feed)
+
+        logger.info(f"Loaded configuration with {len(feeds)} feeds ({sum(1 for f in feeds if f.enabled)} enabled)")
+
+        return AggregatorConfig(
+            coves_api_url=coves_api_url,
+            feeds=feeds,
+            log_level=log_level
+        )
+
+    def _parse_feed(self, data: Dict[str, Any]) -> FeedConfig:
+        """
+        Parse and validate a single feed configuration.
+
+        Args:
+            data: Feed configuration data
+
+        Returns:
+            FeedConfig object
+
+        Raises:
+            ConfigError: If validation fails
+        """
+        # Required fields
+        required_fields = ['name', 'url', 'community_handle']
+        for field in required_fields:
+            if field not in data:
+                raise ConfigError(f"Missing required field in feed config: {field}")
+
+        name = data['name']
+        url = data['url']
+        community_handle = data['community_handle']
+        enabled = data.get('enabled', True)  # Default to True
+
+        # Validate URL
+        if not self._is_valid_url(url):
+            raise ConfigError(f"Invalid URL for feed '{name}': {url}")
+
+        return FeedConfig(
+            name=name,
+            url=url,
+            community_handle=community_handle,
+            enabled=enabled
+        )
+
+    def _is_valid_url(self, url: str) -> bool:
+        """
+        Validate URL format.
+
+        Args:
+            url: URL to validate
+
+        Returns:
+            True if valid, False otherwise
+        """
+        try:
+            result = urlparse(url)
+            return all([result.scheme, result.netloc])
+        except Exception:
+            return False
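The URL check in `_is_valid_url` only requires a non-empty scheme and network location. The standalone sketch below copies that logic into a free function (`is_valid_url`, a name used here only for illustration) to show what it accepts and rejects. The candidate URLs other than the Kagi feed URL are made up for the example.

```python
from urllib.parse import urlparse


def is_valid_url(url: str) -> bool:
    """Standalone copy of the scheme-and-netloc check, for illustration."""
    try:
        result = urlparse(url)
        return all([result.scheme, result.netloc])
    except Exception:
        return False


candidates = [
    "https://news.kagi.com/world.xml",  # scheme and host present
    "news.kagi.com/world.xml",          # no scheme, so no netloc either
    "http://",                          # scheme only
    "ftp://example.com/feed.xml",       # any scheme passes this check
]
results = {url: is_valid_url(url) for url in candidates}
```

Because any scheme is accepted, a stricter loader might additionally compare `result.scheme` against `http` and `https`; that would be a design choice beyond what the committed code does.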
+175
aggregators/kagi-news/src/coves_client.py
+"""
+Coves API Client for posting to communities.
+
+Handles authentication and posting via XRPC.
+"""
+import logging
+import requests
+from typing import Dict, List, Optional
+from atproto import Client
+
+logger = logging.getLogger(__name__)
+
+
+class CovesClient:
+    """
+    Client for posting to Coves communities via XRPC.
+
+    Handles:
+    - Authentication with aggregator credentials
+    - Creating posts in communities (social.coves.post.create)
+    - External embed formatting
+    """
+
+    def __init__(self, api_url: str, handle: str, password: str, pds_url: Optional[str] = None):
+        """
+        Initialize Coves client.
+
+        Args:
+            api_url: Coves AppView URL for posting (e.g., "http://localhost:8081")
+            handle: Aggregator handle (e.g., "kagi-news.coves.social")
+            password: Aggregator password/app password
+            pds_url: Optional PDS URL for authentication (defaults to api_url)
+        """
+        self.api_url = api_url
+        self.pds_url = pds_url or api_url  # Auth through PDS, post through AppView
+        self.handle = handle
+        self.password = password
+        self.client = Client(base_url=self.pds_url)  # Use PDS for auth
+        self._authenticated = False
+
+    def authenticate(self):
+        """
+        Authenticate with Coves API.
+
+        Uses com.atproto.server.createSession directly to avoid
+        Bluesky-specific endpoints that don't exist on Coves PDS.
+
+        Raises:
+            Exception: If authentication fails
+        """
+        try:
+            logger.info(f"Authenticating as {self.handle}")
+
+            # Use createSession directly (avoid app.bsky.actor.getProfile)
+            session = self.client.com.atproto.server.create_session(
+                {"identifier": self.handle, "password": self.password}
+            )
+
+            # Manually set session (skip profile fetch)
+            self.client._session = session
+            self._authenticated = True
+            self.did = session.did
+
+            logger.info(f"Authentication successful (DID: {self.did})")
+        except Exception as e:
+            logger.error(f"Authentication failed: {e}")
+            raise
+
+    def create_post(
+        self,
+        community_handle: str,
+        content: str,
+        facets: List[Dict],
+        embed: Optional[Dict] = None
+    ) -> str:
+        """
+        Create a post in a community.
+
+        Args:
+            community_handle: Community handle (e.g., "world-news.coves.social")
+            content: Post content (rich text)
+            facets: Rich text facets (formatting, links)
+            embed: Optional external embed
+
+        Returns:
+            AT Proto URI of created post (e.g., "at://did:plc:.../social.coves.post/...")
+
+        Raises:
+            Exception: If post creation fails
+        """
+        if not self._authenticated:
+            self.authenticate()
+
+        try:
+            # Prepare post data for social.coves.post.create endpoint
+            post_data = {
+                "community": community_handle,
+                "content": content,
+                "facets": facets
+            }
+
+            # Add embed if provided
+            if embed:
+                post_data["embed"] = embed
+
+            # Use Coves-specific endpoint (not direct PDS write)
+            # This provides validation, authorization, and business logic
+            logger.info(f"Creating post in community: {community_handle}")
+
+            # Make direct HTTP request to XRPC endpoint
+            url = f"{self.api_url}/xrpc/social.coves.post.create"
+            headers = {
+                "Authorization": f"Bearer {self.client._session.access_jwt}",
+                "Content-Type": "application/json"
+            }
+
+            response = requests.post(url, json=post_data, headers=headers, timeout=30)
+
+            # Log detailed error if request fails
+            if not response.ok:
+                error_body = response.text
+                logger.error(f"Post creation failed ({response.status_code}): {error_body}")
+                response.raise_for_status()
+
+            result = response.json()
+            post_uri = result["uri"]
+            logger.info(f"Post created: {post_uri}")
+            return post_uri
+
+        except Exception as e:
+            logger.error(f"Failed to create post: {e}")
+            raise
+
+    def create_external_embed(
+        self,
+        uri: str,
+        title: str,
+        description: str,
+        thumb: Optional[str] = None
+    ) -> Dict:
+        """
+        Create external embed object for hot-linked content.
+
+        Args:
+            uri: External URL (story link)
+            title: Story title
+            description: Story description/summary
+            thumb: Optional thumbnail image URL
+
+        Returns:
+            External embed dictionary
+        """
+        embed = {
+            "$type": "social.coves.embed.external",
+            "external": {
+                "uri": uri,
+                "title": title,
+                "description": description
+            }
+        }
+
+        if thumb:
+            embed["external"]["thumb"] = thumb
+
+        return embed
+
+    def _get_timestamp(self) -> str:
+        """
+        Get current timestamp in ISO 8601 format.
+
+        Returns:
+            ISO timestamp string
+        """
+        from datetime import datetime, timezone
+        return datetime.now(timezone.utc).isoformat().replace("+00:00", "Z")
+300
aggregators/kagi-news/src/html_parser.py
+"""
+Kagi News HTML description parser.
+
+Parses the HTML content from RSS feed item descriptions
+into structured data.
+"""
+import re
+import logging
+from typing import Dict, List, Optional
+from datetime import datetime
+from bs4 import BeautifulSoup
+from urllib.parse import urlparse
+
+from src.models import KagiStory, Perspective, Quote, Source
+
+logger = logging.getLogger(__name__)
+
+
+class KagiHTMLParser:
+    """Parses Kagi News HTML descriptions into structured data."""
+
+    def parse(self, html_description: str) -> Dict:
+        """
+        Parse HTML description into structured data.
+
+        Args:
+            html_description: HTML content from RSS item description
+
+        Returns:
+            Dictionary with extracted data:
+            - summary: str
+            - image_url: Optional[str]
+            - image_alt: Optional[str]
+            - highlights: List[str]
+            - quote: Optional[Dict[str, str]]
+            - perspectives: List[Dict]
+            - sources: List[Dict]
+        """
+        soup = BeautifulSoup(html_description, 'html.parser')
+
+        return {
+            'summary': self._extract_summary(soup),
+            'image_url': self._extract_image_url(soup),
+            'image_alt': self._extract_image_alt(soup),
+            'highlights': self._extract_highlights(soup),
+            'quote': self._extract_quote(soup),
+            'perspectives': self._extract_perspectives(soup),
+            'sources': self._extract_sources(soup),
+        }
+
+    def parse_to_story(
+        self,
+        title: str,
+        link: str,
+        guid: str,
+        pub_date: datetime,
+        categories: List[str],
+        html_description: str
+    ) -> KagiStory:
+        """
+        Parse HTML and create a KagiStory object.
+
+        Args:
+            title: Story title
+            link: Story URL
+            guid: Unique identifier
+            pub_date: Publication date
+            categories: List of categories
+            html_description: HTML content from description
+
+        Returns:
+            KagiStory object
+        """
+        parsed = self.parse(html_description)
+
+        # Convert parsed data to model objects
+        perspectives = [
+            Perspective(
+                actor=p['actor'],
+                description=p['description'],
+                source_url=p['source_url']
+            )
+            for p in parsed['perspectives']
+        ]
+
+        sources = [
+            Source(
+                title=s['title'],
+                url=s['url'],
+                domain=s['domain']
+            )
+            for s in parsed['sources']
+        ]
+
+        quote = None
+        if parsed['quote']:
+            quote = Quote(
+                text=parsed['quote']['text'],
+                attribution=parsed['quote']['attribution']
+            )
+
+        return KagiStory(
+            title=title,
+            link=link,
+            guid=guid,
+            pub_date=pub_date,
+            categories=categories,
+            summary=parsed['summary'],
+            highlights=parsed['highlights'],
+            perspectives=perspectives,
+            quote=quote,
+            sources=sources,
+            image_url=parsed['image_url'],
+            image_alt=parsed['image_alt']
+        )
+
+    def _extract_summary(self, soup: BeautifulSoup) -> str:
+        """Extract summary from first <p> tag."""
+        p_tag = soup.find('p')
+        if p_tag:
+            return p_tag.get_text(strip=True)
+        return ""
+
+    def _extract_image_url(self, soup: BeautifulSoup) -> Optional[str]:
+        """Extract image URL from <img> tag."""
+        img_tag = soup.find('img')
+        if img_tag and img_tag.get('src'):
+            return img_tag['src']
+        return None
+
+    def _extract_image_alt(self, soup: BeautifulSoup) -> Optional[str]:
+        """Extract image alt text from <img> tag."""
+        img_tag = soup.find('img')
+        if img_tag and img_tag.get('alt'):
+            return img_tag['alt']
+        return None
+
+    def _extract_highlights(self, soup: BeautifulSoup) -> List[str]:
+        """Extract highlights list from H3 section."""
+        highlights = []
+
+        # Find "Highlights:" h3 tag
+        h3_tags = soup.find_all('h3')
+        for h3 in h3_tags:
+            if 'Highlights' in h3.get_text():
+                # Get the <ul> that follows this h3
+                ul = h3.find_next_sibling('ul')
+                if ul:
+                    for li in ul.find_all('li'):
+                        highlights.append(li.get_text(strip=True))
+                break
+
+        return highlights
+
+    def _extract_quote(self, soup: BeautifulSoup) -> Optional[Dict[str, str]]:
+        """Extract quote from <blockquote> tag."""
+        blockquote = soup.find('blockquote')
+        if not blockquote:
+            return None
+
+        text = blockquote.get_text(strip=True)
+
+        # Try to split on " - " to separate quote from attribution
+        if ' - ' in text:
+            quote_text, attribution = text.rsplit(' - ', 1)
+            return {
+                'text': quote_text.strip(),
+                'attribution': attribution.strip()
+            }
+
+        # If no attribution found, entire text is the quote
+        # Try to infer attribution from context (often mentioned in highlights/perspectives)
+        return {
+            'text': text,
+            'attribution': self._infer_quote_attribution(soup, text)
+        }
+
+    def _infer_quote_attribution(self, soup: BeautifulSoup, quote_text: str) -> str:
+        """
+        Try to infer quote attribution from context.
+
+        This is a fallback when quote doesn't have explicit attribution.
+        """
+        # For now, check if any perspective mentions similar keywords
+        perspectives_section = soup.find('h3', string=re.compile(r'Perspectives'))
+        if perspectives_section:
+            ul = perspectives_section.find_next_sibling('ul')
+            if ul:
+                for li in ul.find_all('li'):
+                    li_text = li.get_text()
+                    # Extract actor name (before first colon)
+                    if ':' in li_text:
+                        actor = li_text.split(':', 1)[0].strip()
+                        return actor
+
+        return "Unknown"
+
+    def _extract_perspectives(self, soup: BeautifulSoup) -> List[Dict]:
+        """Extract perspectives from H3 section."""
+        perspectives = []
+
+        # Find "Perspectives:" h3 tag
+        h3_tags = soup.find_all('h3')
+        for h3 in h3_tags:
+            if 'Perspectives' in h3.get_text():
+                # Get the <ul> that follows this h3
+                ul = h3.find_next_sibling('ul')
+                if ul:
+                    for li in ul.find_all('li'):
+                        perspective = self._parse_perspective_li(li)
+                        if perspective:
+                            perspectives.append(perspective)
+                break
+
+        return perspectives
+
+    def _parse_perspective_li(self, li) -> Optional[Dict]:
+        """
+        Parse a single perspective <li> element.
+
+        Format: "Actor: Description. (Source)"
+        """
+        # Get full text
+        full_text = li.get_text()
+
+        # Extract actor (before first colon)
+        if ':' not in full_text:
+            return None
+
+        actor, rest = full_text.split(':', 1)
+        actor = actor.strip()
+
+        # Find the <a> tag for source URL
+        a_tag = li.find('a')
+        source_url = a_tag['href'] if a_tag and a_tag.get('href') else ""
+
+        # Extract description (between colon and source link)
+        # Remove the source citation part in parentheses
+        description = rest
+
+        # Remove source citation like "(The Straits Times)" from description
+        if a_tag:
+            # Remove the link text and surrounding parentheses
+            link_text = a_tag.get_text()
+            description = description.replace(f"({link_text})", "").strip()
+
+        # Clean up trailing period
+        description = description.strip('. ')
+
+        return {
+            'actor': actor,
+            'description': description,
+            'source_url': source_url
+        }
+
+    def _extract_sources(self, soup: BeautifulSoup) -> List[Dict]:
+        """Extract sources list from H3 section."""
+        sources = []
+
+        # Find "Sources:" h3 tag
+        h3_tags = soup.find_all('h3')
+        for h3 in h3_tags:
+            if 'Sources' in h3.get_text():
+                # Get the <ul> that follows this h3
+                ul = h3.find_next_sibling('ul')
+                if ul:
+                    for li in ul.find_all('li'):
+                        source = self._parse_source_li(li)
+                        if source:
+                            sources.append(source)
+                break
+
+        return sources
+
+    def _parse_source_li(self, li) -> Optional[Dict]:
+        """
+        Parse a single source <li> element.
+
+        Format: "<a href='...'>Title</a> - domain.com"
+        """
+        a_tag = li.find('a')
+        if not a_tag or not a_tag.get('href'):
+            return None
+
+        title = a_tag.get_text(strip=True)
+        url = a_tag['href']
+
+        # Extract domain from URL
+        parsed_url = urlparse(url)
+        domain = parsed_url.netloc
+
+        # Remove "www." prefix if present
+        if domain.startswith('www.'):
+            domain = domain[4:]
+
+        return {
+            'title': title,
+            'url': url,
+            'domain': domain
+        }
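The quote handling in `_extract_quote` splits on the last `" - "` so that dashes inside the quote itself stay with the quote text. The sketch below isolates that step without BeautifulSoup. `split_quote` is a hypothetical function for illustration, and it returns `None` where the parser would instead fall back to `_infer_quote_attribution`. The sample strings are invented.

```python
from typing import Optional, Tuple


def split_quote(text: str) -> Tuple[str, Optional[str]]:
    """Split 'quote - attribution' on the LAST ' - ' separator.

    Hypothetical stand-in for the splitting step only; returns None as the
    attribution when no separator is present.
    """
    if " - " in text:
        quote_text, attribution = text.rsplit(" - ", 1)
        return quote_text.strip(), attribution.strip()
    return text, None


with_attribution = split_quote("We will act now - Jane Doe")
inner_dash = split_quote("Prices rose - then fell - Jane Doe")
no_attribution = split_quote("We will act now")
```

Using `rsplit(..., 1)` rather than `split` is what keeps `"Prices rose - then fell"` intact in the second case.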
+243
aggregators/kagi-news/src/main.py
+"""
+Main Orchestration Script for Kagi News Aggregator.
+
+Coordinates all components to:
+1. Fetch RSS feeds
+2. Parse HTML content
+3. Format as rich text
+4. Deduplicate stories
+5. Post to Coves communities
+6. Track state
+
+Designed to run via CRON (single execution, then exit).
+"""
+import os
+import sys
+import logging
+from pathlib import Path
+from datetime import datetime
+from typing import Optional
+
+from src.config import ConfigLoader
+from src.rss_fetcher import RSSFetcher
+from src.html_parser import KagiHTMLParser
+from src.richtext_formatter import RichTextFormatter
+from src.state_manager import StateManager
+from src.coves_client import CovesClient
+
+# Setup logging
+logging.basicConfig(
+    level=logging.INFO,
+    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
+)
+logger = logging.getLogger(__name__)
+
+
+class Aggregator:
+    """
+    Main aggregator orchestration.
+
+    Coordinates all components to fetch, parse, format, and post stories.
+    """
+
+    def __init__(
+        self,
+        config_path: Path,
+        state_file: Path,
+        coves_client: Optional[CovesClient] = None
+    ):
+        """
+        Initialize aggregator.
+
+        Args:
+            config_path: Path to config.yaml
+            state_file: Path to state.json
+            coves_client: Optional CovesClient (for testing)
+        """
+        # Load configuration
+        logger.info("Loading configuration...")
+        config_loader = ConfigLoader(config_path)
+        self.config = config_loader.load()
+
+        # Initialize components
+        logger.info("Initializing components...")
+        self.rss_fetcher = RSSFetcher()
+        self.html_parser = KagiHTMLParser()
+        self.richtext_formatter = RichTextFormatter()
+        self.state_manager = StateManager(state_file)
+        self.state_file = state_file
+
+        # Initialize Coves client (or use provided one for testing)
+        if coves_client:
+            self.coves_client = coves_client
+        else:
+            # Get credentials from environment
+            aggregator_handle = os.getenv('AGGREGATOR_HANDLE')
+            aggregator_password = os.getenv('AGGREGATOR_PASSWORD')
+            pds_url = os.getenv('PDS_URL')  # Optional: separate PDS for auth
+
+            if not aggregator_handle or not aggregator_password:
+                raise ValueError(
+                    "Missing AGGREGATOR_HANDLE or AGGREGATOR_PASSWORD environment variables"
+                )
+
+            self.coves_client = CovesClient(
+                api_url=self.config.coves_api_url,
+                handle=aggregator_handle,
+                password=aggregator_password,
+                pds_url=pds_url  # Auth through PDS if specified
+            )
+
+    def run(self):
+        """
+        Run aggregator: fetch, parse, post, and update state.
+
+        This is the main entry point for CRON execution.
+        """
+        logger.info("=" * 60)
+        logger.info("Starting Kagi News Aggregator")
+        logger.info("=" * 60)
+
+        # Get enabled feeds only
+        enabled_feeds = [f for f in self.config.feeds if f.enabled]
+        logger.info(f"Processing {len(enabled_feeds)} enabled feeds")
+
+        # Authenticate once at the start
+        try:
+            self.coves_client.authenticate()
+        except Exception as e:
+            logger.error(f"Failed to authenticate: {e}")
+            logger.error("Cannot continue without authentication")
+            return
+
+        # Process each feed
+        for feed_config in enabled_feeds:
+            try:
+                self._process_feed(feed_config)
+            except Exception as e:
+                # Log error but continue with other feeds
+                logger.error(f"Error processing feed '{feed_config.name}': {e}", exc_info=True)
+                continue
+
+        logger.info("=" * 60)
+        logger.info("Aggregator run completed")
+        logger.info("=" * 60)
+
+    def _process_feed(self, feed_config):
+        """
+        Process a single RSS feed.
+
+        Args:
+            feed_config: FeedConfig object
+        """
+        logger.info(f"Processing feed: {feed_config.name} -> {feed_config.community_handle}")
+
+        # Fetch RSS feed
+        try:
+            feed = self.rss_fetcher.fetch_feed(feed_config.url)
+        except Exception as e:
+            logger.error(f"Failed to fetch feed '{feed_config.name}': {e}")
+            raise
+
+        # Check for feed errors
+        if feed.bozo:
+            logger.warning(f"Feed '{feed_config.name}' has parsing issues (bozo flag set)")
+
+        # Process entries
+        new_posts = 0
+        skipped_posts = 0
+
+        for entry in feed.entries:
+            try:
+                # Check if already posted
+                guid = entry.guid if hasattr(entry, 'guid') else entry.link
+                if self.state_manager.is_posted(feed_config.url, guid):
+                    skipped_posts += 1
+                    logger.debug(f"Skipping already-posted story: {guid}")
+                    continue
+
+                # Parse story
+                story = self.html_parser.parse_to_story(
+                    title=entry.title,
+                    link=entry.link,
+                    guid=guid,
+                    pub_date=entry.published_parsed,
+                    categories=[tag.term for tag in entry.tags] if hasattr(entry, 'tags') else [],
+                    html_description=entry.description
+                )
+
+                # Format as rich text
+                rich_text = self.richtext_formatter.format_full(story)
+
+                # Create external embed
+                embed = self.coves_client.create_external_embed(
+                    uri=story.link,
+                    title=story.title,
+                    description=story.summary[:200] if len(story.summary) > 200 else story.summary,
+                    thumb=story.image_url
+                )
+
+                # Post to community
+                try:
+                    post_uri = self.coves_client.create_post(
+                        community_handle=feed_config.community_handle,
+                        content=rich_text["content"],
+                        facets=rich_text["facets"],
+                        embed=embed
+                    )
+
+                    # Mark as posted (only if successful)
+                    self.state_manager.mark_posted(feed_config.url, guid, post_uri)
+                    new_posts += 1
+                    logger.info(f"Posted: {story.title[:50]}... -> {post_uri}")
+
+                except Exception as e:
+                    # Don't update state if posting failed
+                    logger.error(f"Failed to post story '{story.title}': {e}")
+                    continue
+
+            except Exception as e:
+                # Log error but continue with other entries
+                logger.error(f"Error processing entry: {e}", exc_info=True)
+                continue
+
+        # Update last run timestamp
+        self.state_manager.update_last_run(feed_config.url, datetime.now())
+
+        logger.info(
+            f"Feed '{feed_config.name}': {new_posts} new posts, {skipped_posts} duplicates"
+        )
+
+
+def main():
+    """
+    Main entry point for command-line execution.
+
+    Usage:
+        python -m src.main
+    """
+    # Get paths from environment or use defaults
+    config_path = Path(os.getenv('CONFIG_PATH', 'config.yaml'))
+    state_file = Path(os.getenv('STATE_FILE', 'data/state.json'))
+
+    # Validate config file exists
+    if not config_path.exists():
+        logger.error(f"Configuration file not found: {config_path}")
+        logger.error("Please create config.yaml (see config.example.yaml)")
+        sys.exit(1)
+
+    # Create aggregator and run
+    try:
+        aggregator = Aggregator(
+            config_path=config_path,
+            state_file=state_file
+        )
+        aggregator.run()
+        sys.exit(0)
+    except Exception as e:
+        logger.error(f"Aggregator failed: {e}", exc_info=True)
+        sys.exit(1)
+
+
+if __name__ == '__main__':
+    main()
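One detail in `_process_feed` is worth noting: `KagiStory.pub_date` is annotated as `datetime`, while feedparser exposes `entry.published_parsed` as a `time.struct_time` normalized to UTC. Whether the downstream formatter depends on either type is not visible in this part of the diff. If a timezone-aware `datetime` were wanted, a conversion could look like the sketch below. `struct_time_to_datetime` is a hypothetical helper and does not appear in the merge.

```python
import calendar
import time
from datetime import datetime, timezone


def struct_time_to_datetime(parsed: time.struct_time) -> datetime:
    """Convert a UTC struct_time into a timezone-aware datetime.

    Hypothetical helper: calendar.timegm treats the tuple as UTC, which
    matches how feedparser reports parsed dates.
    """
    return datetime.fromtimestamp(calendar.timegm(parsed), tz=timezone.utc)


# time.gmtime() yields the same kind of UTC struct_time that feedparser exposes.
epoch = struct_time_to_datetime(time.gmtime(0))
sample = struct_time_to_datetime(time.gmtime(1_700_000_000))
```

`calendar.timegm` is used instead of `time.mktime` because the latter would interpret the tuple in the machine's local timezone.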
+79
aggregators/kagi-news/src/models.py
··· 1 + """ 2 + Data models for Kagi News RSS aggregator. 3 + """ 4 + from dataclasses import dataclass, field 5 + from datetime import datetime 6 + from typing import List, Optional 7 + 8 + 9 + @dataclass 10 + class Source: 11 + """A news source citation.""" 12 + title: str 13 + url: str 14 + domain: str 15 + 16 + 17 + @dataclass 18 + class Perspective: 19 + """A perspective from a particular actor/stakeholder.""" 20 + actor: str 21 + description: str 22 + source_url: str 23 + 24 + 25 + @dataclass 26 + class Quote: 27 + """A notable quote from the story.""" 28 + text: str 29 + attribution: str 30 + 31 + 32 + @dataclass 33 + class KagiStory: 34 + """ 35 + Structured representation of a Kagi News story. 36 + 37 + Parsed from RSS feed item with HTML description. 38 + """ 39 + # RSS metadata 40 + title: str 41 + link: str # Kagi story permalink 42 + guid: str 43 + pub_date: datetime 44 + categories: List[str] = field(default_factory=list) 45 + 46 + # Parsed from HTML description 47 + summary: str = "" 48 + highlights: List[str] = field(default_factory=list) 49 + perspectives: List[Perspective] = field(default_factory=list) 50 + quote: Optional[Quote] = None 51 + sources: List[Source] = field(default_factory=list) 52 + image_url: Optional[str] = None 53 + image_alt: Optional[str] = None 54 + 55 + def __post_init__(self): 56 + """Validate required fields.""" 57 + if not self.title: 58 + raise ValueError("title is required") 59 + if not self.link: 60 + raise ValueError("link is required") 61 + if not self.guid: 62 + raise ValueError("guid is required") 63 + 64 + 65 + @dataclass 66 + class FeedConfig: 67 + """Configuration for a single RSS feed.""" 68 + name: str 69 + url: str 70 + community_handle: str 71 + enabled: bool = True 72 + 73 + 74 + @dataclass 75 + class AggregatorConfig: 76 + """Full aggregator configuration.""" 77 + coves_api_url: str 78 + feeds: List[FeedConfig] 79 + log_level: str = "info"
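Review note: a cut-down illustration of the two dataclass features `KagiStory` relies on, namely `field(default_factory=list)` for per-instance lists and `__post_init__` for required-field checks. `Story` is a hypothetical, trimmed stand-in for `KagiStory`, not the class from this diff.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Story:
    """Trimmed stand-in for KagiStory with the same validation style."""
    title: str
    link: str
    guid: str
    highlights: List[str] = field(default_factory=list)

    def __post_init__(self):
        # Reject empty strings as well as None for the required fields.
        for name in ("title", "link", "guid"):
            if not getattr(self, name):
                raise ValueError(f"{name} is required")


first = Story(title="A", link="https://example.com/a", guid="a")
second = Story(title="B", link="https://example.com/b", guid="b")
first.highlights.append("only on first")  # must not leak into `second`

try:
    Story(title="", link="https://example.com/c", guid="c")
    error = None
except ValueError as exc:
    error = str(exc)
```

Each instance gets its own list because `default_factory` is called once per construction.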
+177
aggregators/kagi-news/src/richtext_formatter.py
··· 1 + """ 2 + Rich Text Formatter for Coves posts. 3 + 4 + Converts KagiStory objects to Coves rich text format with facets. 5 + Handles UTF-8 byte position calculation for multi-byte characters. 6 + """ 7 + import logging 8 + from typing import Dict, List, Tuple 9 + from src.models import KagiStory, Perspective, Source 10 + 11 + logger = logging.getLogger(__name__) 12 + 13 + 14 + class RichTextFormatter: 15 + """ 16 + Formats KagiStory into Coves rich text with facets. 17 + 18 + Applies: 19 + - Bold facets for section headers and perspective actors 20 + - Italic facets for quotes 21 + - Link facets for all URLs 22 + """ 23 + 24 + def format_full(self, story: KagiStory) -> Dict: 25 + """ 26 + Format KagiStory into full rich text format. 27 + 28 + Args: 29 + story: KagiStory object to format 30 + 31 + Returns: 32 + Dictionary with 'content' (str) and 'facets' (list) 33 + """ 34 + builder = RichTextBuilder() 35 + 36 + # Summary 37 + builder.add_text(story.summary) 38 + builder.add_text("\n\n") 39 + 40 + # Highlights (if present) 41 + if story.highlights: 42 + builder.add_bold("Highlights:") 43 + builder.add_text("\n") 44 + for highlight in story.highlights: 45 + builder.add_text(f"• {highlight}\n") 46 + builder.add_text("\n") 47 + 48 + # Perspectives (if present) 49 + if story.perspectives: 50 + builder.add_bold("Perspectives:") 51 + builder.add_text("\n") 52 + for perspective in story.perspectives: 53 + # Bold the actor name 54 + actor_with_colon = f"{perspective.actor}:" 55 + builder.add_bold(actor_with_colon) 56 + builder.add_text(f" {perspective.description} (") 57 + 58 + # Add link to source 59 + source_link_text = "Source" 60 + builder.add_link(source_link_text, perspective.source_url) 61 + builder.add_text(")\n") 62 + builder.add_text("\n") 63 + 64 + # Quote (if present) 65 + if story.quote: 66 + quote_text = f'"{story.quote.text}"' 67 + builder.add_italic(quote_text) 68 + builder.add_text(f" — {story.quote.attribution}\n\n") 69 + 70 + # Sources (if present) 
71 + if story.sources: 72 + builder.add_bold("Sources:") 73 + builder.add_text("\n") 74 + for source in story.sources: 75 + builder.add_text("• ") 76 + builder.add_link(source.title, source.url) 77 + builder.add_text(f" - {source.domain}\n") 78 + builder.add_text("\n") 79 + 80 + # Kagi News attribution 81 + builder.add_text("---\n📰 Story aggregated by ") 82 + builder.add_link("Kagi News", story.link) 83 + 84 + return builder.build() 85 + 86 + 87 + class RichTextBuilder: 88 + """ 89 + Helper class to build rich text content with facets. 90 + 91 + Handles UTF-8 byte position tracking automatically. 92 + """ 93 + 94 + def __init__(self): 95 + self.content_parts = [] 96 + self.facets = [] 97 + 98 + def add_text(self, text: str): 99 + """Add plain text without any facets.""" 100 + self.content_parts.append(text) 101 + 102 + def add_bold(self, text: str): 103 + """Add text with bold facet.""" 104 + start_byte = self._get_current_byte_position() 105 + self.content_parts.append(text) 106 + end_byte = self._get_current_byte_position() 107 + 108 + self.facets.append({ 109 + "index": { 110 + "byteStart": start_byte, 111 + "byteEnd": end_byte 112 + }, 113 + "features": [ 114 + {"$type": "social.coves.richtext.facet#bold"} 115 + ] 116 + }) 117 + 118 + def add_italic(self, text: str): 119 + """Add text with italic facet.""" 120 + start_byte = self._get_current_byte_position() 121 + self.content_parts.append(text) 122 + end_byte = self._get_current_byte_position() 123 + 124 + self.facets.append({ 125 + "index": { 126 + "byteStart": start_byte, 127 + "byteEnd": end_byte 128 + }, 129 + "features": [ 130 + {"$type": "social.coves.richtext.facet#italic"} 131 + ] 132 + }) 133 + 134 + def add_link(self, text: str, uri: str): 135 + """Add text with link facet.""" 136 + start_byte = self._get_current_byte_position() 137 + self.content_parts.append(text) 138 + end_byte = self._get_current_byte_position() 139 + 140 + self.facets.append({ 141 + "index": { 142 + "byteStart": start_byte, 143 
+ "byteEnd": end_byte 144 + }, 145 + "features": [ 146 + { 147 + "$type": "social.coves.richtext.facet#link", 148 + "uri": uri 149 + } 150 + ] 151 + }) 152 + 153 + def _get_current_byte_position(self) -> int: 154 + """ 155 + Get the current byte position in the content. 156 + 157 + Uses UTF-8 encoding to handle multi-byte characters correctly. 158 + """ 159 + current_content = ''.join(self.content_parts) 160 + return len(current_content.encode('utf-8')) 161 + 162 + def build(self) -> Dict: 163 + """ 164 + Build the final rich text object. 165 + 166 + Returns: 167 + Dictionary with 'content' and 'facets' 168 + """ 169 + content = ''.join(self.content_parts) 170 + 171 + # Sort facets by start position for consistency 172 + sorted_facets = sorted(self.facets, key=lambda f: f['index']['byteStart']) 173 + 174 + return { 175 + "content": content, 176 + "facets": sorted_facets 177 + }
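Review note: facet indices are UTF-8 byte offsets, so they diverge from Python string indices as soon as a multi-byte character such as the bullet `•` appears earlier in the content. The sketch below shows this with a hypothetical `RunningByteBuilder`, which keeps a running byte count rather than re-joining and re-encoding all parts on every call as `_get_current_byte_position` does. It is an illustrative alternative under that assumption, not the implementation in this diff.

```python
class RunningByteBuilder:
    """Hypothetical variant of RichTextBuilder that tracks a running byte count."""

    def __init__(self):
        self.parts = []
        self.facets = []
        self.byte_len = 0  # UTF-8 length of everything appended so far

    def add_text(self, text: str) -> None:
        self.parts.append(text)
        self.byte_len += len(text.encode("utf-8"))

    def add_link(self, text: str, uri: str) -> None:
        start = self.byte_len
        self.add_text(text)
        self.facets.append({
            "index": {"byteStart": start, "byteEnd": self.byte_len},
            "features": [
                {"$type": "social.coves.richtext.facet#link", "uri": uri}
            ],
        })

    def build(self) -> dict:
        return {"content": "".join(self.parts), "facets": self.facets}


builder = RunningByteBuilder()
builder.add_text("• café: ")  # 8 characters, 11 bytes in UTF-8
builder.add_link("Source", "https://example.com/story")
result = builder.build()

index = result["facets"][0]["index"]
# Slicing the encoded content by the facet index recovers the link text.
sliced = result["content"].encode("utf-8")[index["byteStart"]:index["byteEnd"]]
char_index = result["content"].index("Source")
```

The link text starts at character 8 but at byte 11, which is why the facet must be computed on the encoded form.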
+71
aggregators/kagi-news/src/rss_fetcher.py
··· 1 + """ 2 + RSS feed fetcher with retry logic and error handling. 3 + """ 4 + import time 5 + import logging 6 + import requests 7 + import feedparser 8 + from typing import Optional 9 + 10 + logger = logging.getLogger(__name__) 11 + 12 + 13 + class RSSFetcher: 14 + """Fetches RSS feeds with retry logic.""" 15 + 16 + def __init__(self, timeout: int = 30, max_retries: int = 3): 17 + """ 18 + Initialize RSS fetcher. 19 + 20 + Args: 21 + timeout: Request timeout in seconds 22 + max_retries: Maximum number of retry attempts 23 + """ 24 + self.timeout = timeout 25 + self.max_retries = max_retries 26 + 27 + def fetch_feed(self, url: str) -> feedparser.FeedParserDict: 28 + """ 29 + Fetch and parse an RSS feed. 30 + 31 + Args: 32 + url: RSS feed URL 33 + 34 + Returns: 35 + Parsed feed object 36 + 37 + Raises: 38 + ValueError: If URL is empty 39 + requests.RequestException: If all retry attempts fail 40 + """ 41 + if not url: 42 + raise ValueError("URL cannot be empty") 43 + 44 + last_error = None 45 + 46 + for attempt in range(self.max_retries): 47 + try: 48 + logger.info(f"Fetching feed from {url} (attempt {attempt + 1}/{self.max_retries})") 49 + 50 + response = requests.get(url, timeout=self.timeout) 51 + response.raise_for_status() 52 + 53 + # Parse with feedparser 54 + feed = feedparser.parse(response.content) 55 + 56 + logger.info(f"Successfully fetched feed: {feed.feed.get('title', 'Unknown')}") 57 + return feed 58 + 59 + except requests.RequestException as e: 60 + last_error = e 61 + logger.warning(f"Fetch attempt {attempt + 1} failed: {e}") 62 + 63 + if attempt < self.max_retries - 1: 64 + # Exponential backoff 65 + sleep_time = 2 ** attempt 66 + logger.info(f"Retrying in {sleep_time} seconds...") 67 + time.sleep(sleep_time) 68 + 69 + # All retries exhausted 70 + logger.error(f"Failed to fetch feed after {self.max_retries} attempts") 71 + raise last_error
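Review note: the retry loop above sleeps `2 ** attempt` seconds after every failed attempt except the last. The sketch below lists the resulting waits. `backoff_schedule` is a hypothetical helper written for illustration only.

```python
from typing import List


def backoff_schedule(max_retries: int) -> List[int]:
    """Seconds slept between attempts, mirroring `2 ** attempt` in fetch_feed."""
    # No sleep follows the final attempt, hence max_retries - 1 waits.
    return [2 ** attempt for attempt in range(max_retries - 1)]


default_waits = backoff_schedule(3)
single_attempt_waits = backoff_schedule(1)
five_attempt_waits = backoff_schedule(5)
```

One edge case worth checking in review: with `max_retries=0` the loop body never runs, so `last_error` is still `None` when `raise last_error` executes, which would surface as a `TypeError` rather than a `requests.RequestException`.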
+213
aggregators/kagi-news/src/state_manager.py
··· 1 + """ 2 + State Manager for tracking posted stories. 3 + 4 + Handles deduplication by tracking which stories have already been posted. 5 + Uses JSON file for persistence. 6 + """ 7 + import json 8 + import logging 9 + from pathlib import Path 10 + from datetime import datetime, timedelta 11 + from typing import Optional, Dict, List 12 + 13 + logger = logging.getLogger(__name__) 14 + 15 + 16 + class StateManager: 17 + """ 18 + Manages aggregator state for deduplication. 19 + 20 + Tracks: 21 + - Posted GUIDs per feed (with timestamps) 22 + - Last successful run timestamp per feed 23 + - Automatic cleanup of old entries 24 + """ 25 + 26 + def __init__(self, state_file: Path, max_guids_per_feed: int = 100, max_age_days: int = 30): 27 + """ 28 + Initialize state manager. 29 + 30 + Args: 31 + state_file: Path to JSON state file 32 + max_guids_per_feed: Maximum GUIDs to keep per feed (default: 100) 33 + max_age_days: Maximum age in days for GUIDs (default: 30) 34 + """ 35 + self.state_file = Path(state_file) 36 + self.max_guids_per_feed = max_guids_per_feed 37 + self.max_age_days = max_age_days 38 + self.state = self._load_state() 39 + 40 + def _load_state(self) -> Dict: 41 + """Load state from file, or create new state if file doesn't exist.""" 42 + if not self.state_file.exists(): 43 + logger.info(f"Creating new state file at {self.state_file}") 44 + state = {'feeds': {}} 45 + self._save_state(state) 46 + return state 47 + 48 + try: 49 + with open(self.state_file, 'r') as f: 50 + state = json.load(f) 51 + logger.info(f"Loaded state from {self.state_file}") 52 + return state 53 + except json.JSONDecodeError as e: 54 + logger.error(f"Failed to load state file: {e}. 
Creating new state.") 55 + state = {'feeds': {}} 56 + self._save_state(state) 57 + return state 58 + 59 + def _save_state(self, state: Optional[Dict] = None): 60 + """Save state to file.""" 61 + if state is None: 62 + state = self.state 63 + 64 + # Ensure parent directory exists 65 + self.state_file.parent.mkdir(parents=True, exist_ok=True) 66 + 67 + with open(self.state_file, 'w') as f: 68 + json.dump(state, f, indent=2) 69 + 70 + def _ensure_feed_exists(self, feed_url: str): 71 + """Ensure feed entry exists in state.""" 72 + if feed_url not in self.state['feeds']: 73 + self.state['feeds'][feed_url] = { 74 + 'posted_guids': [], 75 + 'last_successful_run': None 76 + } 77 + 78 + def is_posted(self, feed_url: str, guid: str) -> bool: 79 + """ 80 + Check if a story has already been posted. 81 + 82 + Args: 83 + feed_url: RSS feed URL 84 + guid: Story GUID 85 + 86 + Returns: 87 + True if already posted, False otherwise 88 + """ 89 + self._ensure_feed_exists(feed_url) 90 + 91 + posted_guids = self.state['feeds'][feed_url]['posted_guids'] 92 + return any(entry['guid'] == guid for entry in posted_guids) 93 + 94 + def mark_posted(self, feed_url: str, guid: str, post_uri: str): 95 + """ 96 + Mark a story as posted. 97 + 98 + Args: 99 + feed_url: RSS feed URL 100 + guid: Story GUID 101 + post_uri: AT Proto URI of created post 102 + """ 103 + self._ensure_feed_exists(feed_url) 104 + 105 + # Add to posted list 106 + entry = { 107 + 'guid': guid, 108 + 'post_uri': post_uri, 109 + 'posted_at': datetime.now().isoformat() 110 + } 111 + self.state['feeds'][feed_url]['posted_guids'].append(entry) 112 + 113 + # Auto-cleanup to keep state file manageable 114 + self.cleanup_old_entries(feed_url) 115 + 116 + # Save state 117 + self._save_state() 118 + 119 + logger.info(f"Marked as posted: {guid} -> {post_uri}") 120 + 121 + def get_last_run(self, feed_url: str) -> Optional[datetime]: 122 + """ 123 + Get last successful run timestamp for a feed. 
124 + 125 + Args: 126 + feed_url: RSS feed URL 127 + 128 + Returns: 129 + Datetime of last run, or None if never run 130 + """ 131 + self._ensure_feed_exists(feed_url) 132 + 133 + timestamp_str = self.state['feeds'][feed_url]['last_successful_run'] 134 + if timestamp_str is None: 135 + return None 136 + 137 + return datetime.fromisoformat(timestamp_str) 138 + 139 + def update_last_run(self, feed_url: str, timestamp: datetime): 140 + """ 141 + Update last successful run timestamp. 142 + 143 + Args: 144 + feed_url: RSS feed URL 145 + timestamp: Timestamp of successful run 146 + """ 147 + self._ensure_feed_exists(feed_url) 148 + 149 + self.state['feeds'][feed_url]['last_successful_run'] = timestamp.isoformat() 150 + self._save_state() 151 + 152 + logger.info(f"Updated last run for {feed_url}: {timestamp}") 153 + 154 + def cleanup_old_entries(self, feed_url: str): 155 + """ 156 + Remove old entries from state. 157 + 158 + Removes entries that are: 159 + - Older than max_age_days 160 + - Beyond max_guids_per_feed limit (keeps most recent) 161 + 162 + Args: 163 + feed_url: RSS feed URL 164 + """ 165 + self._ensure_feed_exists(feed_url) 166 + 167 + posted_guids = self.state['feeds'][feed_url]['posted_guids'] 168 + 169 + # Filter out entries older than max_age_days 170 + cutoff_date = datetime.now() - timedelta(days=self.max_age_days) 171 + filtered = [ 172 + entry for entry in posted_guids 173 + if datetime.fromisoformat(entry['posted_at']) > cutoff_date 174 + ] 175 + 176 + # Keep only most recent max_guids_per_feed entries 177 + # Sort by posted_at (most recent first) 178 + filtered.sort(key=lambda x: x['posted_at'], reverse=True) 179 + filtered = filtered[:self.max_guids_per_feed] 180 + 181 + # Update state 182 + old_count = len(posted_guids) 183 + new_count = len(filtered) 184 + self.state['feeds'][feed_url]['posted_guids'] = filtered 185 + 186 + if old_count != new_count: 187 + logger.info(f"Cleaned up {old_count - new_count} old entries for {feed_url}") 188 + 189 + 
def get_posted_count(self, feed_url: str) -> int: 190 + """ 191 + Get count of posted items for a feed. 192 + 193 + Args: 194 + feed_url: RSS feed URL 195 + 196 + Returns: 197 + Number of posted items 198 + """ 199 + self._ensure_feed_exists(feed_url) 200 + return len(self.state['feeds'][feed_url]['posted_guids']) 201 + 202 + def get_all_posted_guids(self, feed_url: str) -> List[str]: 203 + """ 204 + Get all posted GUIDs for a feed. 205 + 206 + Args: 207 + feed_url: RSS feed URL 208 + 209 + Returns: 210 + List of GUIDs 211 + """ 212 + self._ensure_feed_exists(feed_url) 213 + return [entry['guid'] for entry in self.state['feeds'][feed_url]['posted_guids']]
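Review note: `cleanup_old_entries` applies two limits in order, first dropping entries older than `max_age_days`, then keeping only the newest `max_guids_per_feed`. The sketch below reproduces that logic as a pure function with an injected clock so it can be checked deterministically. `prune_entries` and its `now` parameter are hypothetical and not part of this diff.

```python
from datetime import datetime, timedelta


def prune_entries(entries, now, max_age_days=30, max_entries=100):
    """Drop entries older than the cutoff, then keep the newest max_entries."""
    cutoff = now - timedelta(days=max_age_days)
    fresh = [
        entry for entry in entries
        if datetime.fromisoformat(entry["posted_at"]) > cutoff
    ]
    # ISO 8601 strings in one consistent format sort chronologically.
    fresh.sort(key=lambda entry: entry["posted_at"], reverse=True)
    return fresh[:max_entries]


now = datetime(2025, 10, 24, 12, 0, 0)
entries = [
    {"guid": "old", "posted_at": "2025-09-01T08:00:00"},
    {"guid": "mid", "posted_at": "2025-10-20T08:00:00"},
    {"guid": "new", "posted_at": "2025-10-23T08:00:00"},
]

by_age = [e["guid"] for e in prune_entries(entries, now)]
by_count = [e["guid"] for e in prune_entries(entries, now, max_entries=1)]
```

Passing the clock in as an argument is a testing convenience assumed here; the code in the diff calls `datetime.now()` directly.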
+1
aggregators/kagi-news/tests/__init__.py
··· 1 + """Test suite for Kagi News aggregator."""
+12
aggregators/kagi-news/tests/fixtures/sample_rss_item.xml
··· 1 + <?xml version='1.0' encoding='UTF-8'?> 2 + <!-- Sample RSS item from Kagi News - includes quote, highlights, perspectives, sources --> 3 + <item> 4 + <title>Trump to meet Xi in South Korea on Oct 30</title> 5 + <link>https://kite.kagi.com/96cf948f-8a1b-4281-9ba4-8a9e1ad7b3c6/world/10</link> 6 + <description>&lt;p&gt;The White House confirmed President Trump will hold a bilateral meeting with Chinese President Xi Jinping in South Korea on October 30, at the end of an Asia trip that includes Malaysia and Japan . The administration said the meeting will take place Thursday morning local time, and Mr Trump indicated his first question to Xi would concern fentanyl and other bilateral issues . The talks come amid heightened trade tensions after Beijing expanded export curbs on rare-earth minerals and following Mr Trump's recent threat of additional tariffs on Chinese goods, making the meeting a focal point for discussions on trade, technology supply chains and energy .&lt;/p&gt;&lt;img src='https://kagiproxy.com/img/Q2SRXQtwTYBIiQeI0FG-X6taF_wHSJaXDiFUzju2kbCWGuOYIFUX--8L0BqE4VKxpbOJY3ylFPJkDpfSnyQYZ1qdOLXbphHTnsOK4jb7gqC4KCn5nf3ANbWCuaFD5ZUSijiK0k7wOLP2fyX6tynu2mPtXlCbotLo2lTrEswZl4-No2AI4mI4lkResfnRdp-YjpoEfCOHkNfbN1-0cNcHt9T2dmgBSXrQ2w' alt='News image associated with coverage of President Trump&amp;#x27;s Asia trip and planned meeting with President Xi' /&gt;&lt;br /&gt;&lt;h3&gt;Highlights:&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;Itinerary details: The Asia swing begins in Malaysia, continues to Japan and ends with the bilateral meeting in South Korea on Thursday morning local time, White House press secretary Karoline Leavitt said at a briefing .&lt;/li&gt;&lt;li&gt;APEC context: US officials indicated the leaders will meet on the sidelines of the Asia-Pacific Economic Cooperation gathering, shaping expectations for short, high-level talks rather than a lengthy summit .&lt;/li&gt;&lt;li&gt;Tariff escalation: President Trump recently threatened an additional 100% 
tariff on Chinese goods starting in November, a step he has described as unsustainable but that has heightened urgency for talks .&lt;/li&gt;&lt;li&gt;Rare-earth impact: Beijing's expanded curbs on rare-earth exports have exposed supply vulnerabilities because US high-tech firms rely heavily on those materials, raising strategic and economic stakes for the meeting .&lt;/li&gt;&lt;/ul&gt;&lt;blockquote&gt;Work out a lot of our doubts and questions - President Trump&lt;/blockquote&gt;&lt;h3&gt;Perspectives:&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;President Trump: He said his first question to President Xi would be about fentanyl and indicated he hoped to resolve bilateral doubts and questions in the talks. (&lt;a href='https://www.straitstimes.com/world/united-states/trump-to-meet-xi-in-south-korea-on-oct-30-as-part-of-asia-swing'&gt;The Straits Times&lt;/a&gt;)&lt;/li&gt;&lt;li&gt;White House (press secretary): Karoline Leavitt confirmed the bilateral meeting will occur Thursday morning local time during a White House briefing. (&lt;a href='https://www.scmp.com/news/us/diplomacy/article/3330131/donald-trump-meet-chinas-xi-jinping-next-thursday-south-korea-crunch-talks'&gt;South China Morning Post&lt;/a&gt;)&lt;/li&gt;&lt;li&gt;Beijing/Chinese authorities: Officials have defended tighter export controls on rare-earths, a move described in reporting as not explicitly targeting the US though it has raised tensions. 
(&lt;a href='https://www.rt.com/news/626890-white-house-announces-trump-xi-meeting/'&gt;RT&lt;/a&gt;)&lt;/li&gt;&lt;/ul&gt;&lt;h3&gt;Sources:&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;a href='https://www.straitstimes.com/world/united-states/trump-to-meet-xi-in-south-korea-on-oct-30-as-part-of-asia-swing'&gt;Trump to meet Xi in South Korea on Oct 30 as part of Asia swing&lt;/a&gt; - straitstimes.com&lt;/li&gt;&lt;li&gt;&lt;a href='https://www.scmp.com/news/us/diplomacy/article/3330131/donald-trump-meet-chinas-xi-jinping-next-thursday-south-korea-crunch-talks'&gt;Trump to meet Xi in South Korea next Thursday as part of key Asia trip&lt;/a&gt; - scmp.com&lt;/li&gt;&lt;li&gt;&lt;a href='https://www.rt.com/news/626890-white-house-announces-trump-xi-meeting/'&gt;White House announces Trump-Xi meeting&lt;/a&gt; - rt.com&lt;/li&gt;&lt;li&gt;&lt;a href='https://www.thehindu.com/news/international/trump-to-meet-xi-in-south-korea-as-part-of-asia-swing/article70195667.ece'&gt;Trump to meet Xi in South Korea as part of Asia swing&lt;/a&gt; - thehindu.com&lt;/li&gt;&lt;li&gt;&lt;a href='https://www.aljazeera.com/news/2025/10/24/white-house-confirms-trump-to-meet-xi-in-south-korea-as-part-of-asia-tour'&gt;White House confirms Trump to meet Xi in South Korea as part of Asia tour&lt;/a&gt; - aljazeera.com&lt;/li&gt;&lt;/ul&gt;</description> 7 + <guid isPermaLink="true">https://kite.kagi.com/96cf948f-8a1b-4281-9ba4-8a9e1ad7b3c6/world/10</guid> 8 + <category>World</category> 9 + <category>World/Diplomacy</category> 10 + <category>Diplomacy</category> 11 + <pubDate>Thu, 23 Oct 2025 20:56:00 +0000</pubDate> 12 + </item>
+246
aggregators/kagi-news/tests/test_config.py
··· 1 + """ 2 + Tests for Configuration Loader. 3 + 4 + Tests loading and validating aggregator configuration. 5 + """ 6 + import pytest 7 + import tempfile 8 + from pathlib import Path 9 + 10 + from src.config import ConfigLoader, ConfigError 11 + from src.models import AggregatorConfig, FeedConfig 12 + 13 + 14 + @pytest.fixture 15 + def valid_config_yaml(): 16 + """Valid configuration YAML.""" 17 + return """ 18 + coves_api_url: "https://api.coves.social" 19 + 20 + feeds: 21 + - name: "World News" 22 + url: "https://news.kagi.com/world.xml" 23 + community_handle: "world-news.coves.social" 24 + enabled: true 25 + 26 + - name: "Tech News" 27 + url: "https://news.kagi.com/tech.xml" 28 + community_handle: "tech.coves.social" 29 + enabled: true 30 + 31 + - name: "Science News" 32 + url: "https://news.kagi.com/science.xml" 33 + community_handle: "science.coves.social" 34 + enabled: false 35 + 36 + log_level: "info" 37 + """ 38 + 39 + 40 + @pytest.fixture 41 + def temp_config_file(valid_config_yaml): 42 + """Create a temporary config file.""" 43 + with tempfile.NamedTemporaryFile(mode='w', delete=False, suffix='.yaml') as f: 44 + f.write(valid_config_yaml) 45 + temp_path = Path(f.name) 46 + yield temp_path 47 + # Cleanup 48 + if temp_path.exists(): 49 + temp_path.unlink() 50 + 51 + 52 + class TestConfigLoader: 53 + """Test suite for ConfigLoader.""" 54 + 55 + def test_load_valid_config(self, temp_config_file): 56 + """Test loading valid configuration.""" 57 + loader = ConfigLoader(temp_config_file) 58 + config = loader.load() 59 + 60 + assert isinstance(config, AggregatorConfig) 61 + assert config.coves_api_url == "https://api.coves.social" 62 + assert config.log_level == "info" 63 + assert len(config.feeds) == 3 64 + 65 + def test_parse_feed_configs(self, temp_config_file): 66 + """Test parsing feed configurations.""" 67 + loader = ConfigLoader(temp_config_file) 68 + config = loader.load() 69 + 70 + # Check first feed 71 + feed1 = config.feeds[0] 72 + assert 
isinstance(feed1, FeedConfig) 73 + assert feed1.name == "World News" 74 + assert feed1.url == "https://news.kagi.com/world.xml" 75 + assert feed1.community_handle == "world-news.coves.social" 76 + assert feed1.enabled is True 77 + 78 + # Check disabled feed 79 + feed3 = config.feeds[2] 80 + assert feed3.name == "Science News" 81 + assert feed3.enabled is False 82 + 83 + def test_get_enabled_feeds_only(self, temp_config_file): 84 + """Test getting only enabled feeds.""" 85 + loader = ConfigLoader(temp_config_file) 86 + config = loader.load() 87 + 88 + enabled_feeds = [f for f in config.feeds if f.enabled] 89 + assert len(enabled_feeds) == 2 90 + assert all(f.enabled for f in enabled_feeds) 91 + 92 + def test_missing_config_file_raises_error(self): 93 + """Test that missing config file raises error.""" 94 + with pytest.raises(ConfigError, match="not found"): 95 + loader = ConfigLoader(Path("nonexistent.yaml")) 96 + loader.load() 97 + 98 + def test_invalid_yaml_raises_error(self): 99 + """Test that invalid YAML raises error.""" 100 + with tempfile.NamedTemporaryFile(mode='w', delete=False, suffix='.yaml') as f: 101 + f.write("invalid: yaml: content: [[[") 102 + temp_path = Path(f.name) 103 + 104 + try: 105 + with pytest.raises(ConfigError, match="Failed to parse"): 106 + loader = ConfigLoader(temp_path) 107 + loader.load() 108 + finally: 109 + temp_path.unlink() 110 + 111 + def test_missing_required_field_raises_error(self): 112 + """Test that missing required fields raise error.""" 113 + invalid_yaml = """ 114 + feeds: 115 + - name: "Test" 116 + url: "https://test.xml" 117 + # Missing community_handle! 
118 + """ 119 + with tempfile.NamedTemporaryFile(mode='w', delete=False, suffix='.yaml') as f: 120 + f.write(invalid_yaml) 121 + temp_path = Path(f.name) 122 + 123 + try: 124 + with pytest.raises(ConfigError, match="Missing required field"): 125 + loader = ConfigLoader(temp_path) 126 + loader.load() 127 + finally: 128 + temp_path.unlink() 129 + 130 + def test_missing_coves_api_url_raises_error(self): 131 + """Test that missing coves_api_url raises error.""" 132 + invalid_yaml = """ 133 + feeds: 134 + - name: "Test" 135 + url: "https://test.xml" 136 + community_handle: "test.coves.social" 137 + """ 138 + with tempfile.NamedTemporaryFile(mode='w', delete=False, suffix='.yaml') as f: 139 + f.write(invalid_yaml) 140 + temp_path = Path(f.name) 141 + 142 + try: 143 + with pytest.raises(ConfigError, match="coves_api_url"): 144 + loader = ConfigLoader(temp_path) 145 + loader.load() 146 + finally: 147 + temp_path.unlink() 148 + 149 + def test_default_log_level(self): 150 + """Test that log_level defaults to 'info' if not specified.""" 151 + minimal_yaml = """ 152 + coves_api_url: "https://api.coves.social" 153 + feeds: 154 + - name: "Test" 155 + url: "https://test.xml" 156 + community_handle: "test.coves.social" 157 + """ 158 + with tempfile.NamedTemporaryFile(mode='w', delete=False, suffix='.yaml') as f: 159 + f.write(minimal_yaml) 160 + temp_path = Path(f.name) 161 + 162 + try: 163 + loader = ConfigLoader(temp_path) 164 + config = loader.load() 165 + assert config.log_level == "info" 166 + finally: 167 + temp_path.unlink() 168 + 169 + def test_default_enabled_true(self): 170 + """Test that feed enabled defaults to True if not specified.""" 171 + yaml_content = """ 172 + coves_api_url: "https://api.coves.social" 173 + feeds: 174 + - name: "Test" 175 + url: "https://test.xml" 176 + community_handle: "test.coves.social" 177 + # No 'enabled' field 178 + """ 179 + with tempfile.NamedTemporaryFile(mode='w', delete=False, suffix='.yaml') as f: 180 + f.write(yaml_content) 181 + 
temp_path = Path(f.name) 182 + 183 + try: 184 + loader = ConfigLoader(temp_path) 185 + config = loader.load() 186 + assert config.feeds[0].enabled is True 187 + finally: 188 + temp_path.unlink() 189 + 190 + def test_invalid_url_format_raises_error(self): 191 + """Test that invalid URLs raise error.""" 192 + invalid_yaml = """ 193 + coves_api_url: "https://api.coves.social" 194 + feeds: 195 + - name: "Test" 196 + url: "not-a-valid-url" 197 + community_handle: "test.coves.social" 198 + """ 199 + with tempfile.NamedTemporaryFile(mode='w', delete=False, suffix='.yaml') as f: 200 + f.write(invalid_yaml) 201 + temp_path = Path(f.name) 202 + 203 + try: 204 + with pytest.raises(ConfigError, match="Invalid URL"): 205 + loader = ConfigLoader(temp_path) 206 + loader.load() 207 + finally: 208 + temp_path.unlink() 209 + 210 + def test_empty_feeds_list_raises_error(self): 211 + """Test that empty feeds list raises error.""" 212 + invalid_yaml = """ 213 + coves_api_url: "https://api.coves.social" 214 + feeds: [] 215 + """ 216 + with tempfile.NamedTemporaryFile(mode='w', delete=False, suffix='.yaml') as f: 217 + f.write(invalid_yaml) 218 + temp_path = Path(f.name) 219 + 220 + try: 221 + with pytest.raises(ConfigError, match="at least one feed"): 222 + loader = ConfigLoader(temp_path) 223 + loader.load() 224 + finally: 225 + temp_path.unlink() 226 + 227 + def test_load_from_env_override(self, temp_config_file, monkeypatch): 228 + """Test that environment variables can override config values.""" 229 + # Set environment variable 230 + monkeypatch.setenv("COVES_API_URL", "https://test.coves.social") 231 + 232 + loader = ConfigLoader(temp_config_file) 233 + config = loader.load() 234 + 235 + # Should use env var instead of config file 236 + assert config.coves_api_url == "https://test.coves.social" 237 + 238 + def test_get_feed_by_url(self, temp_config_file): 239 + """Test helper to get feed config by URL.""" 240 + loader = ConfigLoader(temp_config_file) 241 + config = loader.load() 
242 + 243 + feed = next((f for f in config.feeds if f.url == "https://news.kagi.com/tech.xml"), None) 244 + assert feed is not None 245 + assert feed.name == "Tech News" 246 + assert feed.community_handle == "tech.coves.social"
+433
aggregators/kagi-news/tests/test_e2e.py
+"""
+End-to-End Integration Tests.
+
+Tests the complete aggregator workflow against live infrastructure:
+- Real HTTP mocking (Kagi RSS)
+- Real PDS (Coves test PDS via Docker)
+- Real community posting
+- Real state management
+
+Requires:
+- Coves test PDS running on localhost:3001
+- Test database with community: e2e-95206.community.coves.social
+- Aggregator account: kagi-news.local.coves.dev
+"""
+import os
+import pytest
+import responses
+from pathlib import Path
+from datetime import datetime
+
+from src.main import Aggregator
+from src.coves_client import CovesClient
+from src.config import ConfigLoader
+
+
+# Skip E2E tests by default (require live infrastructure)
+pytestmark = pytest.mark.skipif(
+    os.getenv('RUN_E2E_TESTS') != '1',
+    reason="E2E tests require RUN_E2E_TESTS=1 and live PDS"
+)
+
+
+@pytest.fixture
+def test_community(aggregator_credentials):
+    """Create a test community for E2E testing."""
+    import time
+    import requests
+
+    handle, password = aggregator_credentials
+
+    # Authenticate
+    auth_response = requests.post(
+        "http://localhost:3001/xrpc/com.atproto.server.createSession",
+        json={"identifier": handle, "password": password}
+    )
+    token = auth_response.json()["accessJwt"]
+
+    # Create community (use short name to avoid handle length limits)
+    community_name = f"e2e-{int(time.time()) % 10000}"  # Last 4 digits only
+    create_response = requests.post(
+        "http://localhost:8081/xrpc/social.coves.community.create",
+        headers={"Authorization": f"Bearer {token}"},
+        json={
+            "name": community_name,
+            "displayName": "E2E Test Community",
+            "description": "Temporary community for aggregator E2E testing",
+            "visibility": "public"
+        }
+    )
+
+    if create_response.ok:
+        community = create_response.json()
+        community_handle = f"{community_name}.community.coves.social"
+        print(f"\n✅ Created test community: {community_handle}")
+        return community_handle
+    else:
+        raise Exception(f"Failed to create community: {create_response.text}")
+
+
+@pytest.fixture
+def test_config_file(tmp_path, test_community):
+    """Create test configuration file with dynamic community."""
+    config_content = f"""
+coves_api_url: http://localhost:8081
+
+feeds:
+  - name: "Kagi World News"
+    url: "https://news.kagi.com/world.xml"
+    community_handle: "{test_community}"
+    enabled: true
+
+log_level: debug
+"""
+    config_file = tmp_path / "config.yaml"
+    config_file.write_text(config_content)
+    return config_file
+
+
+@pytest.fixture
+def test_state_file(tmp_path):
+    """Create temporary state file."""
+    return tmp_path / "state.json"
+
+
+@pytest.fixture
+def mock_kagi_feed():
+    """Load real Kagi RSS feed fixture."""
+    # Load from data directory (where actual feed is stored)
+    fixture_path = Path(__file__).parent.parent / "data" / "world.xml"
+    if not fixture_path.exists():
+        # Fallback to tests/fixtures if moved
+        fixture_path = Path(__file__).parent / "fixtures" / "world.xml"
+    return fixture_path.read_text()
+
+
+@pytest.fixture
+def aggregator_credentials():
+    """Get aggregator credentials from environment."""
+    handle = os.getenv('AGGREGATOR_HANDLE', 'kagi-news.local.coves.dev')
+    password = os.getenv('AGGREGATOR_PASSWORD', 'kagi-aggregator-2024-secure-pass')
+    return handle, password
+
+
+class TestEndToEnd:
+    """Full end-to-end integration tests."""
+
+    @responses.activate
+    def test_full_aggregator_workflow(
+        self,
+        test_config_file,
+        test_state_file,
+        mock_kagi_feed,
+        aggregator_credentials
+    ):
+        """
+        Test complete workflow: fetch → parse → format → post → verify.
+
+        This test:
+        1. Mocks Kagi RSS HTTP request
+        2. Authenticates with real PDS
+        3. Parses real Kagi HTML content
+        4. Formats with rich text facets
+        5. Posts to real community
+        6. Verifies post was created
+        7. Tests deduplication (no repost)
+        """
+        # Mock Kagi RSS feed
+        responses.add(
+            responses.GET,
+            "https://news.kagi.com/world.xml",
+            body=mock_kagi_feed,
+            status=200,
+            content_type="application/xml"
+        )
+
+        # Allow passthrough for localhost (PDS)
+        responses.add_passthru("http://localhost")
+
+        # Set up environment
+        handle, password = aggregator_credentials
+        os.environ['AGGREGATOR_HANDLE'] = handle
+        os.environ['AGGREGATOR_PASSWORD'] = password
+        os.environ['PDS_URL'] = 'http://localhost:3001'  # Auth through PDS
+
+        # Create aggregator
+        aggregator = Aggregator(
+            config_path=test_config_file,
+            state_file=test_state_file
+        )
+
+        # Run first time: should post stories
+        print("\n" + "="*60)
+        print("🚀 Running first aggregator pass (should post stories)")
+        print("="*60)
+        aggregator.run()
+
+        # Verify state was updated (stories marked as posted)
+        posted_count = aggregator.state_manager.get_posted_count(
+            "https://news.kagi.com/world.xml"
+        )
+        print(f"\n✅ First pass: {posted_count} stories posted and tracked")
+        assert posted_count > 0, "Should have posted at least one story"
+
+        # Create new aggregator instance (simulates CRON re-run)
+        aggregator2 = Aggregator(
+            config_path=test_config_file,
+            state_file=test_state_file
+        )
+
+        # Run second time: should skip duplicates
+        print("\n" + "="*60)
+        print("🔄 Running second aggregator pass (should skip duplicates)")
+        print("="*60)
+        aggregator2.run()
+
+        # Verify count didn't change (deduplication worked)
+        posted_count2 = aggregator2.state_manager.get_posted_count(
+            "https://news.kagi.com/world.xml"
+        )
+        print(f"\n✅ Second pass: Still {posted_count2} stories (duplicates skipped)")
+        assert posted_count2 == posted_count, "Should not post duplicates"
+
+    @responses.activate
+    def test_post_with_external_embed(
+        self,
+        test_config_file,
+        test_state_file,
+        mock_kagi_feed,
+        aggregator_credentials
+    ):
+        """
+        Test that posts include external embeds with images.
+
+        Verifies:
+        - External embed is created
+        - Thumbnail URL is included
+        - Title and description are set
+        """
+        # Mock Kagi RSS feed
+        responses.add(
+            responses.GET,
+            "https://news.kagi.com/world.xml",
+            body=mock_kagi_feed,
+            status=200
+        )
+
+        # Allow passthrough for localhost (PDS)
+        responses.add_passthru("http://localhost")
+
+        # Set up environment
+        handle, password = aggregator_credentials
+        os.environ['AGGREGATOR_HANDLE'] = handle
+        os.environ['AGGREGATOR_PASSWORD'] = password
+        os.environ['PDS_URL'] = 'http://localhost:3001'  # Auth through PDS
+
+        # Run aggregator
+        aggregator = Aggregator(
+            config_path=test_config_file,
+            state_file=test_state_file
+        )
+
+        print("\n" + "="*60)
+        print("🖼️ Testing external embed creation")
+        print("="*60)
+        aggregator.run()
+
+        # Verify posts were created
+        posted_count = aggregator.state_manager.get_posted_count(
+            "https://news.kagi.com/world.xml"
+        )
+        print(f"\n✅ Posted {posted_count} stories with external embeds")
+        assert posted_count > 0
+
+    def test_authentication_with_live_pds(self, aggregator_credentials):
+        """
+        Test authentication against live PDS.
+
+        Verifies:
+        - Can authenticate with aggregator account
+        - Receives valid JWT tokens
+        - DID matches expected format
+        """
+        handle, password = aggregator_credentials
+
+        print("\n" + "="*60)
+        print(f"🔐 Testing authentication: {handle}")
+        print("="*60)
+
+        # Create client and authenticate
+        client = CovesClient(
+            api_url="http://localhost:8081",  # AppView for posting
+            handle=handle,
+            password=password,
+            pds_url="http://localhost:3001"  # PDS for auth
+        )
+
+        client.authenticate()
+
+        print(f"\n✅ Authentication successful")
+        print(f"   Handle: {client.handle}")
+        print(f"   Authenticated: {client._authenticated}")
+
+        assert client._authenticated is True
+        assert hasattr(client, 'did')
+        assert client.did.startswith("did:plc:")
+
+    @responses.activate
+    def test_state_persistence_across_runs(
+        self,
+        test_config_file,
+        test_state_file,
+        aggregator_credentials
+    ):
+        """
+        Test that state persists correctly across multiple runs.
+
+        Verifies:
+        - State file is created
+        - Posted GUIDs are tracked
+        - Last run timestamp is updated
+        - State survives aggregator restart
+        """
+        # Mock empty feed (to avoid posting). @responses.activate tears the
+        # mock down even if an assertion below fails.
+        responses.add(
+            responses.GET,
+            "https://news.kagi.com/world.xml",
+            body='<?xml version="1.0"?><rss version="2.0"><channel></channel></rss>',
+            status=200
+        )
+
+        handle, password = aggregator_credentials
+        os.environ['AGGREGATOR_HANDLE'] = handle
+        os.environ['AGGREGATOR_PASSWORD'] = password
+
+        print("\n" + "="*60)
+        print("💾 Testing state persistence")
+        print("="*60)
+
+        # First run
+        aggregator1 = Aggregator(
+            config_path=test_config_file,
+            state_file=test_state_file
+        )
+        aggregator1.run()
+
+        # Verify state file was created
+        assert test_state_file.exists(), "State file should be created"
+        print(f"\n✅ State file created: {test_state_file}")
+
+        # Verify last run was recorded
+        last_run1 = aggregator1.state_manager.get_last_run(
+            "https://news.kagi.com/world.xml"
+        )
+        assert last_run1 is not None, "Last run should be recorded"
+        print(f"   Last run: {last_run1}")
+
+        # Second run (new instance)
+        aggregator2 = Aggregator(
+            config_path=test_config_file,
+            state_file=test_state_file
+        )
+        aggregator2.run()
+
+        # Verify state persisted
+        last_run2 = aggregator2.state_manager.get_last_run(
+            "https://news.kagi.com/world.xml"
+        )
+        assert last_run2 >= last_run1, "Last run should be updated"
+        print(f"   Last run (after restart): {last_run2}")
+        print(f"\n✅ State persisted across aggregator restarts")
+
+    @responses.activate
+    def test_error_recovery(
+        self,
+        test_config_file,
+        test_state_file,
+        aggregator_credentials
+    ):
+        """
+        Test that aggregator handles errors gracefully.
+
+        Verifies:
+        - Continues processing on feed errors
+        - Doesn't crash on network failures
+        - Logs errors appropriately
+        """
+        # Mock feed failure. @responses.activate tears the mock down even if
+        # the test fails.
+        responses.add(
+            responses.GET,
+            "https://news.kagi.com/world.xml",
+            body="Internal Server Error",
+            status=500
+        )
+
+        handle, password = aggregator_credentials
+        os.environ['AGGREGATOR_HANDLE'] = handle
+        os.environ['AGGREGATOR_PASSWORD'] = password
+
+        print("\n" + "="*60)
+        print("🛡️ Testing error recovery")
+        print("="*60)
+
+        # Should not crash
+        aggregator = Aggregator(
+            config_path=test_config_file,
+            state_file=test_state_file
+        )
+
+        try:
+            aggregator.run()
+            print(f"\n✅ Aggregator handled feed error gracefully")
+        except Exception as e:
+            pytest.fail(f"Aggregator should handle errors gracefully: {e}")
+
+
+def test_coves_client_external_embed_format(aggregator_credentials):
+    """
+    Test external embed formatting.
+
+    Verifies:
+    - Embed structure matches social.coves.embed.external
+    - All required fields are present
+    - Optional thumbnail is included when provided
+    """
+    handle, password = aggregator_credentials
+
+    client = CovesClient(
+        api_url="http://localhost:8081",
+        handle=handle,
+        password=password
+    )
+
+    # Test with thumbnail
+    embed = client.create_external_embed(
+        uri="https://example.com/story",
+        title="Test Story",
+        description="Test description",
+        thumb="https://example.com/image.jpg"
+    )
+
+    assert embed["$type"] == "social.coves.embed.external"
+    assert embed["external"]["uri"] == "https://example.com/story"
+    assert embed["external"]["title"] == "Test Story"
+    assert embed["external"]["description"] == "Test description"
+    assert embed["external"]["thumb"] == "https://example.com/image.jpg"
+
+    # Test without thumbnail
+    embed_no_thumb = client.create_external_embed(
+        uri="https://example.com/story2",
+        title="Test Story 2",
+        description="Test description 2"
+    )
+
+    assert "thumb" not in embed_no_thumb["external"]
+    print("\n✅ External embed format correct")
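The embed-format test above pins down the shape of a `social.coves.embed.external` record: a `$type` tag, an `external` object with `uri`, `title` and `description`, and a `thumb` key that is present only when a thumbnail was supplied. A minimal sketch of a builder that would satisfy those assertions follows. `build_external_embed` is a hypothetical stand-in written for illustration; the actual `CovesClient.create_external_embed` is not part of this diff section and may differ.

```python
from typing import Any, Dict, Optional


def build_external_embed(
    uri: str,
    title: str,
    description: str,
    thumb: Optional[str] = None,
) -> Dict[str, Any]:
    """Hypothetical sketch of an external embed builder (not the real client)."""
    external: Dict[str, Any] = {
        "uri": uri,
        "title": title,
        "description": description,
    }
    # Add the optional thumbnail only when one was provided, so that
    # `"thumb" not in embed["external"]` holds for stories without an image.
    if thumb is not None:
        external["thumb"] = thumb
    return {"$type": "social.coves.embed.external", "external": external}
```

Omitting the key, rather than setting it to `None`, is what lets the test check `"thumb" not in embed_no_thumb["external"]`.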
+122
aggregators/kagi-news/tests/test_html_parser.py
+"""
+Tests for Kagi HTML description parser.
+"""
+import pytest
+from pathlib import Path
+from datetime import datetime
+import html
+
+from src.html_parser import KagiHTMLParser
+from src.models import KagiStory, Perspective, Quote, Source
+
+
+@pytest.fixture
+def sample_html_description():
+    """Load sample HTML from RSS item fixture."""
+    # This is the escaped HTML from the RSS description field
+    html_content = """<p>The White House confirmed President Trump will hold a bilateral meeting with Chinese President Xi Jinping in South Korea on October 30, at the end of an Asia trip that includes Malaysia and Japan . The administration said the meeting will take place Thursday morning local time, and Mr Trump indicated his first question to Xi would concern fentanyl and other bilateral issues . The talks come amid heightened trade tensions after Beijing expanded export curbs on rare-earth minerals and following Mr Trump's recent threat of additional tariffs on Chinese goods, making the meeting a focal point for discussions on trade, technology supply chains and energy .</p><img src='https://kagiproxy.com/img/Q2SRXQtwTYBIiQeI0FG-X6taF_wHSJaXDiFUzju2kbCWGuOYIFUX--8L0BqE4VKxpbOJY3ylFPJkDpfSnyQYZ1qdOLXbphHTnsOK4jb7gqC4KCn5nf3ANbWCuaFD5ZUSijiK0k7wOLP2fyX6tynu2mPtXlCbotLo2lTrEswZl4-No2AI4mI4lkResfnRdp-YjpoEfCOHkNfbN1-0cNcHt9T2dmgBSXrQ2w' alt='News image associated with coverage of President Trump&#x27;s Asia trip and planned meeting with President Xi' /><br /><h3>Highlights:</h3><ul><li>Itinerary details: The Asia swing begins in Malaysia, continues to Japan and ends with the bilateral meeting in South Korea on Thursday morning local time, White House press secretary Karoline Leavitt said at a briefing .</li><li>APEC context: US officials indicated the leaders will meet on the sidelines of the Asia-Pacific Economic Cooperation gathering, shaping expectations for short, high-level talks rather than a lengthy summit .</li></ul><blockquote>Work out a lot of our doubts and questions - President Trump</blockquote><h3>Perspectives:</h3><ul><li>President Trump: He said his first question to President Xi would be about fentanyl and indicated he hoped to resolve bilateral doubts and questions in the talks. (<a href='https://www.straitstimes.com/world/united-states/trump-to-meet-xi-in-south-korea-on-oct-30-as-part-of-asia-swing'>The Straits Times</a>)</li><li>White House (press secretary): Karoline Leavitt confirmed the bilateral meeting will occur Thursday morning local time during a White House briefing. (<a href='https://www.scmp.com/news/us/diplomacy/article/3330131/donald-trump-meet-chinas-xi-jinping-next-thursday-south-korea-crunch-talks'>South China Morning Post</a>)</li></ul><h3>Sources:</h3><ul><li><a href='https://www.straitstimes.com/world/united-states/trump-to-meet-xi-in-south-korea-on-oct-30-as-part-of-asia-swing'>Trump to meet Xi in South Korea on Oct 30 as part of Asia swing</a> - straitstimes.com</li><li><a href='https://www.scmp.com/news/us/diplomacy/article/3330131/donald-trump-meet-chinas-xi-jinping-next-thursday-south-korea-crunch-talks'>Trump to meet Xi in South Korea next Thursday as part of key Asia trip</a> - scmp.com</li></ul>"""
+    return html_content
+
+
+class TestKagiHTMLParser:
+    """Test suite for Kagi HTML parser."""
+
+    def test_parse_summary(self, sample_html_description):
+        """Test extracting summary paragraph."""
+        parser = KagiHTMLParser()
+        result = parser.parse(sample_html_description)
+
+        assert result['summary'].startswith("The White House confirmed President Trump")
+        assert "bilateral meeting with Chinese President Xi Jinping" in result['summary']
+
+    def test_parse_image_url(self, sample_html_description):
+        """Test extracting image URL and alt text."""
+        parser = KagiHTMLParser()
+        result = parser.parse(sample_html_description)
+
+        assert result['image_url'] is not None
+        assert result['image_url'].startswith("https://kagiproxy.com/img/")
+        assert result['image_alt'] is not None
+        assert "Trump" in result['image_alt']
+
+    def test_parse_highlights(self, sample_html_description):
+        """Test extracting highlights list."""
+        parser = KagiHTMLParser()
+        result = parser.parse(sample_html_description)
+
+        assert len(result['highlights']) == 2
+        assert "Itinerary details" in result['highlights'][0]
+        assert "APEC context" in result['highlights'][1]
+
+    def test_parse_quote(self, sample_html_description):
+        """Test extracting blockquote."""
+        parser = KagiHTMLParser()
+        result = parser.parse(sample_html_description)
+
+        assert result['quote'] is not None
+        assert result['quote']['text'] == "Work out a lot of our doubts and questions"
+        assert result['quote']['attribution'] == "President Trump"
+
+    def test_parse_perspectives(self, sample_html_description):
+        """Test extracting perspectives list."""
+        parser = KagiHTMLParser()
+        result = parser.parse(sample_html_description)
+
+        assert len(result['perspectives']) == 2
+
+        # First perspective
+        assert result['perspectives'][0]['actor'] == "President Trump"
+        assert "fentanyl" in result['perspectives'][0]['description']
+        assert result['perspectives'][0]['source_url'] == "https://www.straitstimes.com/world/united-states/trump-to-meet-xi-in-south-korea-on-oct-30-as-part-of-asia-swing"
+
+        # Second perspective
+        assert "White House" in result['perspectives'][1]['actor']
+
+    def test_parse_sources(self, sample_html_description):
+        """Test extracting sources list."""
+        parser = KagiHTMLParser()
+        result = parser.parse(sample_html_description)
+
+        assert len(result['sources']) >= 2
+
+        # Check first source
+        assert result['sources'][0]['title'] == "Trump to meet Xi in South Korea on Oct 30 as part of Asia swing"
+        assert result['sources'][0]['url'].startswith("https://www.straitstimes.com")
+        assert result['sources'][0]['domain'] == "straitstimes.com"
+
+    def test_parse_missing_sections(self):
+        """Test parsing HTML with missing sections."""
+        html_minimal = "<p>Just a summary, no other sections.</p>"
+
+        parser = KagiHTMLParser()
+        result = parser.parse(html_minimal)
+
+        assert result['summary'] == "Just a summary, no other sections."
+        assert result['highlights'] == []
+        assert result['perspectives'] == []
+        assert result['sources'] == []
+        assert result['quote'] is None
+        assert result['image_url'] is None
+
+    def test_parse_to_kagi_story(self, sample_html_description):
+        """Test converting parsed HTML to KagiStory object."""
+        parser = KagiHTMLParser()
+
+        # Simulate full RSS item data
+        story = parser.parse_to_story(
+            title="Trump to meet Xi in South Korea on Oct 30",
+            link="https://kite.kagi.com/test/world/10",
+            guid="https://kite.kagi.com/test/world/10",
+            pub_date=datetime(2025, 10, 23, 20, 56, 0),
+            categories=["World", "World/Diplomacy"],
+            html_description=sample_html_description
+        )
+
+        assert isinstance(story, KagiStory)
+        assert story.title == "Trump to meet Xi in South Korea on Oct 30"
+        assert story.link == "https://kite.kagi.com/test/world/10"
+        assert len(story.highlights) == 2
+        assert len(story.perspectives) == 2
+        assert len(story.sources) >= 2
+        assert story.quote is not None
+        assert story.image_url is not None
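`test_parse_quote` above expects the blockquote text `Work out a lot of our doubts and questions - President Trump` to come back as separate `text` and `attribution` fields. One way to get that split is sketched below. `split_quote` is a hypothetical helper written for illustration and assumes the attribution always follows the last `" - "` separator; how `KagiHTMLParser` actually does it is not shown in this diff section.

```python
from typing import Dict, Optional


def split_quote(blockquote_text: str) -> Optional[Dict[str, str]]:
    """Split 'quote - attribution' into its two parts (illustrative sketch)."""
    text = blockquote_text.strip()
    if not text:
        return None
    # rpartition splits on the LAST " - ", so dashes inside the quote survive.
    quote, separator, attribution = text.rpartition(" - ")
    if not separator:
        # No separator found: treat the whole string as an unattributed quote.
        return {"text": text, "attribution": ""}
    return {"text": quote.strip(), "attribution": attribution.strip()}
```

Splitting from the right is a deliberate choice here: a quote may itself contain `" - "`, while the attribution in the sample feed item sits at the end.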
+460
aggregators/kagi-news/tests/test_main.py
+"""
+Tests for Main Orchestration Script.
+
+Tests the complete flow: fetch → parse → format → dedupe → post → update state.
+"""
+import pytest
+from pathlib import Path
+from datetime import datetime
+from unittest.mock import Mock, MagicMock, patch, call
+import feedparser
+
+from src.main import Aggregator
+from src.models import KagiStory, AggregatorConfig, FeedConfig, Perspective, Quote, Source
+
+
+@pytest.fixture
+def mock_config():
+    """Mock aggregator configuration."""
+    return AggregatorConfig(
+        coves_api_url="https://api.coves.social",
+        feeds=[
+            FeedConfig(
+                name="World News",
+                url="https://news.kagi.com/world.xml",
+                community_handle="world-news.coves.social",
+                enabled=True
+            ),
+            FeedConfig(
+                name="Tech News",
+                url="https://news.kagi.com/tech.xml",
+                community_handle="tech.coves.social",
+                enabled=True
+            ),
+            FeedConfig(
+                name="Disabled Feed",
+                url="https://news.kagi.com/disabled.xml",
+                community_handle="disabled.coves.social",
+                enabled=False
+            )
+        ],
+        log_level="info"
+    )
+
+
+@pytest.fixture
+def sample_story():
+    """Sample KagiStory for testing."""
+    return KagiStory(
+        title="Test Story",
+        link="https://kite.kagi.com/test/world/1",
+        guid="https://kite.kagi.com/test/world/1",
+        pub_date=datetime(2024, 1, 15, 12, 0, 0),
+        categories=["World"],
+        summary="Test summary",
+        highlights=["Highlight 1", "Highlight 2"],
+        perspectives=[
+            Perspective(
+                actor="Test Actor",
+                description="Test description",
+                source_url="https://example.com/source"
+            )
+        ],
+        quote=Quote(text="Test quote", attribution="Test Author"),
+        sources=[
+            Source(title="Source 1", url="https://example.com/1", domain="example.com")
+        ],
+        image_url="https://example.com/image.jpg",
+        image_alt="Test image"
+    )
+
+
+@pytest.fixture
+def mock_rss_feed():
+    """Mock RSS feed with sample entries."""
+    feed = MagicMock()
+    feed.bozo = 0
+    feed.entries = [
+        MagicMock(
+            title="Story 1",
+            link="https://kite.kagi.com/test/world/1",
+            guid="https://kite.kagi.com/test/world/1",
+            published_parsed=(2024, 1, 15, 12, 0, 0, 0, 15, 0),
+            tags=[MagicMock(term="World")],
+            description="<p>Story 1 description</p>"
+        ),
+        MagicMock(
+            title="Story 2",
+            link="https://kite.kagi.com/test/world/2",
+            guid="https://kite.kagi.com/test/world/2",
+            published_parsed=(2024, 1, 15, 13, 0, 0, 0, 15, 0),
+            tags=[MagicMock(term="World")],
+            description="<p>Story 2 description</p>"
+        )
+    ]
+    return feed
+
+
+class TestAggregator:
+    """Test suite for Aggregator orchestration."""
+
+    def test_initialize_aggregator(self, mock_config, tmp_path):
+        """Test aggregator initialization."""
+        state_file = tmp_path / "state.json"
+
+        with patch('src.main.ConfigLoader') as MockConfigLoader:
+            mock_loader = Mock()
+            mock_loader.load.return_value = mock_config
+            MockConfigLoader.return_value = mock_loader
+
+            aggregator = Aggregator(
+                config_path=Path("config.yaml"),
+                state_file=state_file,
+                coves_client=Mock()
+            )
+
+            assert aggregator.config == mock_config
+            assert aggregator.state_file == state_file
+
+    def test_process_enabled_feeds_only(self, mock_config, tmp_path):
+        """Test that only enabled feeds are processed."""
+        state_file = tmp_path / "state.json"
+        mock_client = Mock()
+
+        with patch('src.main.ConfigLoader') as MockConfigLoader, \
+             patch('src.main.RSSFetcher') as MockRSSFetcher:
+
+            mock_loader = Mock()
+            mock_loader.load.return_value = mock_config
+            MockConfigLoader.return_value = mock_loader
+
+            mock_fetcher = Mock()
+            MockRSSFetcher.return_value = mock_fetcher
+
+            aggregator = Aggregator(
+                config_path=Path("config.yaml"),
+                state_file=state_file,
+                coves_client=mock_client
+            )
+
+            # Mock empty feeds
+            mock_fetcher.fetch_feed.return_value = MagicMock(bozo=0, entries=[])
+
+            aggregator.run()
+
+            # Should only fetch enabled feeds (2)
+            assert mock_fetcher.fetch_feed.call_count == 2
+
+    def test_full_successful_flow(self, mock_config, mock_rss_feed, sample_story, tmp_path):
+        """Test complete flow: fetch → parse → format → post → update state."""
+        state_file = tmp_path / "state.json"
+        mock_client = Mock()
+        mock_client.create_post.return_value = "at://did:plc:test/social.coves.post/abc123"
+
+        with patch('src.main.ConfigLoader') as MockConfigLoader, \
+             patch('src.main.RSSFetcher') as MockRSSFetcher, \
+             patch('src.main.KagiHTMLParser') as MockHTMLParser, \
+             patch('src.main.RichTextFormatter') as MockFormatter:
+
+            # Setup mocks
+            mock_loader = Mock()
+            mock_loader.load.return_value = mock_config
+            MockConfigLoader.return_value = mock_loader
+
+            mock_fetcher = Mock()
+            mock_fetcher.fetch_feed.return_value = mock_rss_feed
+            MockRSSFetcher.return_value = mock_fetcher
+
+            mock_parser = Mock()
+            mock_parser.parse_to_story.return_value = sample_story
+            MockHTMLParser.return_value = mock_parser
+
+            mock_formatter = Mock()
+            mock_formatter.format_full.return_value = {
+                "content": "Test content",
+                "facets": []
+            }
+            MockFormatter.return_value = mock_formatter
+
+            # Run aggregator
+            aggregator = Aggregator(
+                config_path=Path("config.yaml"),
+                state_file=state_file,
+                coves_client=mock_client
+            )
+            aggregator.run()
+
+            # Verify RSS fetching
+            assert mock_fetcher.fetch_feed.call_count == 2
+
+            # Verify parsing (2 entries per feed * 2 feeds = 4 total)
+            assert mock_parser.parse_to_story.call_count == 4
+
+            # Verify formatting
+            assert mock_formatter.format_full.call_count == 4
+
+            # Verify posting (should call create_post for each story)
+            assert mock_client.create_post.call_count == 4
+
+    def test_deduplication_skips_posted_stories(self, mock_config, mock_rss_feed, sample_story, tmp_path):
+        """Test that already-posted stories are skipped."""
+        state_file = tmp_path / "state.json"
+        mock_client = Mock()
+        mock_client.create_post.return_value = "at://did:plc:test/social.coves.post/abc123"
+
+        with patch('src.main.ConfigLoader') as MockConfigLoader, \
+             patch('src.main.RSSFetcher') as MockRSSFetcher, \
+             patch('src.main.KagiHTMLParser') as MockHTMLParser, \
+             patch('src.main.RichTextFormatter') as MockFormatter:
+
+            # Setup mocks
+            mock_loader = Mock()
+            mock_loader.load.return_value = mock_config
+            MockConfigLoader.return_value = mock_loader
+
+            mock_fetcher = Mock()
+            mock_fetcher.fetch_feed.return_value = mock_rss_feed
+            MockRSSFetcher.return_value = mock_fetcher
+
+            mock_parser = Mock()
+            mock_parser.parse_to_story.return_value = sample_story
+            MockHTMLParser.return_value = mock_parser
+
+            mock_formatter = Mock()
+            mock_formatter.format_full.return_value = {
+                "content": "Test content",
+                "facets": []
+            }
+            MockFormatter.return_value = mock_formatter
+
+            # First run: posts all stories
+            aggregator = Aggregator(
+                config_path=Path("config.yaml"),
+                state_file=state_file,
+                coves_client=mock_client
+            )
+            aggregator.run()
+
+            # Verify first run posted stories
+            first_run_posts = mock_client.create_post.call_count
+            assert first_run_posts == 4
+
+            # Second run: should skip all (already posted)
+            mock_client.reset_mock()
+            aggregator2 = Aggregator(
+                config_path=Path("config.yaml"),
+                state_file=state_file,
+                coves_client=mock_client
+            )
+            aggregator2.run()
+
+            # Should not post any (all duplicates)
+            assert mock_client.create_post.call_count == 0
+
+    def test_continue_on_feed_error(self, mock_config, tmp_path):
+        """Test that processing continues if one feed fails."""
+        state_file = tmp_path / "state.json"
+        mock_client = Mock()
+
+        with patch('src.main.ConfigLoader') as MockConfigLoader, \
+             patch('src.main.RSSFetcher') as MockRSSFetcher:
+
+            mock_loader = Mock()
+            mock_loader.load.return_value = mock_config
+            MockConfigLoader.return_value = mock_loader
+
+            mock_fetcher = Mock()
+            # First feed fails, second succeeds
+            mock_fetcher.fetch_feed.side_effect = [
+                Exception("Network error"),
+                MagicMock(bozo=0, entries=[])
+            ]
+            MockRSSFetcher.return_value = mock_fetcher
+
+            aggregator = Aggregator(
+                config_path=Path("config.yaml"),
+                state_file=state_file,
+                coves_client=mock_client
+            )
+
+            # Should not raise exception
+            aggregator.run()
+
+            # Should have attempted both feeds
+            assert mock_fetcher.fetch_feed.call_count == 2
+
+    def test_handle_empty_feed(self, mock_config, tmp_path):
+        """Test handling of empty RSS feeds."""
+        state_file = tmp_path / "state.json"
+        mock_client = Mock()
+
+        with patch('src.main.ConfigLoader') as MockConfigLoader, \
+             patch('src.main.RSSFetcher') as MockRSSFetcher:
+
+            mock_loader = Mock()
+            mock_loader.load.return_value = mock_config
+            MockConfigLoader.return_value = mock_loader
+
+            mock_fetcher = Mock()
+            mock_fetcher.fetch_feed.return_value = MagicMock(bozo=0, entries=[])
+            MockRSSFetcher.return_value = mock_fetcher
+
+            aggregator = Aggregator(
+                config_path=Path("config.yaml"),
+                state_file=state_file,
+                coves_client=mock_client
+            )
+            aggregator.run()
+
+            # Should not post anything
+            assert mock_client.create_post.call_count == 0
+
+    def test_dont_update_state_on_failed_post(self, mock_config, mock_rss_feed, sample_story, tmp_path):
+        """Test that state is not updated if posting fails."""
+        state_file = tmp_path / "state.json"
+        mock_client = Mock()
+        mock_client.create_post.side_effect = Exception("Post failed")
+
+        with patch('src.main.ConfigLoader') as MockConfigLoader, \
+             patch('src.main.RSSFetcher') as MockRSSFetcher, \
+             patch('src.main.KagiHTMLParser') as MockHTMLParser, \
+             patch('src.main.RichTextFormatter') as MockFormatter:
+
+            # Setup mocks
+            mock_loader = Mock()
+            mock_loader.load.return_value = mock_config
+            MockConfigLoader.return_value = mock_loader
+
+            mock_fetcher = Mock()
+            mock_fetcher.fetch_feed.return_value = mock_rss_feed
+            MockRSSFetcher.return_value = mock_fetcher
+
+            mock_parser = Mock()
+            mock_parser.parse_to_story.return_value = sample_story
+            MockHTMLParser.return_value = mock_parser
+
+            mock_formatter = Mock()
+            mock_formatter.format_full.return_value = {
+                "content": "Test content",
+                "facets": []
+            }
+            MockFormatter.return_value = mock_formatter
+
+            # Run aggregator (posts will fail)
+            aggregator = Aggregator(
+                config_path=Path("config.yaml"),
+                state_file=state_file,
+                coves_client=mock_client
+            )
+            aggregator.run()
+
+            # Fresh client that succeeds (reset_mock() would keep the failing side_effect)
+            mock_client = Mock()
+            mock_client.create_post.return_value = "at://did:plc:test/social.coves.post/abc123"
+
+            # Second run: should try to post again (state wasn't updated)
+            aggregator2 = Aggregator(
+                config_path=Path("config.yaml"),
+                state_file=state_file,
+                coves_client=mock_client
+            )
+            aggregator2.run()
+
+            # Should post stories (they weren't marked as posted)
+            assert mock_client.create_post.call_count == 4
+
+    def test_update_last_run_timestamp(self, mock_config, tmp_path):
+        """Test that last_run timestamp is updated after successful processing."""
+        state_file = tmp_path / "state.json"
+        mock_client = Mock()
+
+        with patch('src.main.ConfigLoader') as MockConfigLoader, \
+             patch('src.main.RSSFetcher') as MockRSSFetcher:
+
+            mock_loader = Mock()
+            mock_loader.load.return_value = mock_config
+            MockConfigLoader.return_value = mock_loader
+
+            mock_fetcher = Mock()
+            mock_fetcher.fetch_feed.return_value = MagicMock(bozo=0, entries=[])
+            MockRSSFetcher.return_value = mock_fetcher
+
+            aggregator = Aggregator(
+                config_path=Path("config.yaml"),
+                state_file=state_file,
+                coves_client=mock_client
+            )
+            aggregator.run()
+
+            # Verify last_run was updated for both feeds
+            feed1_last_run = aggregator.state_manager.get_last_run(
+                "https://news.kagi.com/world.xml"
+            )
+            feed2_last_run = aggregator.state_manager.get_last_run(
+                "https://news.kagi.com/tech.xml"
+            )
+
+            assert feed1_last_run is not None
+            assert feed2_last_run is not None
+
+    def test_create_post_with_image_embed(self, mock_config, mock_rss_feed, sample_story, tmp_path):
+        """Test that posts include external image embeds."""
+        state_file = tmp_path / "state.json"
+        mock_client = Mock()
+        mock_client.create_post.return_value = "at://did:plc:test/social.coves.post/abc123"
+
+        # Mock create_external_embed to return proper embed structure
+        mock_client.create_external_embed.return_value = {
+            "$type": "social.coves.embed.external",
+            "external": {
+                "uri": sample_story.link,
+                "title": sample_story.title,
+                "description": sample_story.summary,
+                "thumb": sample_story.image_url
+            }
+        }
+
+        with patch('src.main.ConfigLoader') as MockConfigLoader, \
+             patch('src.main.RSSFetcher') as MockRSSFetcher, \
+             patch('src.main.KagiHTMLParser') as MockHTMLParser, \
+             patch('src.main.RichTextFormatter') as MockFormatter:
+
+            # Setup mocks
+            mock_loader = Mock()
+            mock_loader.load.return_value = mock_config
+            MockConfigLoader.return_value = mock_loader
+
+            mock_fetcher = Mock()
+            # Only one entry for simplicity
+            single_entry_feed = MagicMock(bozo=0, entries=[mock_rss_feed.entries[0]])
+            mock_fetcher.fetch_feed.return_value = single_entry_feed
+            MockRSSFetcher.return_value = mock_fetcher
+
+            mock_parser = Mock()
+            mock_parser.parse_to_story.return_value = sample_story
+            MockHTMLParser.return_value = mock_parser
+
+            mock_formatter = Mock()
+            mock_formatter.format_full.return_value = {
+                "content": "Test content",
+                "facets": []
+            }
+            MockFormatter.return_value = mock_formatter
+
+            # Run aggregator
+            aggregator = Aggregator(
+                config_path=Path("config.yaml"),
+                state_file=state_file,
+                coves_client=mock_client
+            )
+            aggregator.run()
+
+            # Verify create_post was called with embed
+            mock_client.create_post.assert_called()
+            call_kwargs = mock_client.create_post.call_args.kwargs
+
+            assert "embed" in call_kwargs
+            assert call_kwargs["embed"]["$type"] == "social.coves.embed.external"
+            assert call_kwargs["embed"]["external"]["uri"] == sample_story.link
+            assert call_kwargs["embed"]["external"]["title"] == sample_story.title
+            assert call_kwargs["embed"]["external"]["thumb"] == sample_story.image_url
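Two of the tests above, `test_deduplication_skips_posted_stories` and `test_dont_update_state_on_failed_post`, together describe one rule: a GUID is recorded as posted only after the post call succeeds, and recorded GUIDs are skipped on later runs. A minimal sketch of that rule follows. `post_new_stories`, its parameters and the JSON layout are hypothetical and chosen for illustration; they are not the `StateManager` or `Aggregator` from this merge.

```python
import json
from pathlib import Path
from typing import Callable, Dict, Iterable, List


def load_state(state_file: Path) -> Dict[str, List[str]]:
    """Read {feed_url: [posted guids]} from disk, or start with empty state."""
    if state_file.exists():
        return json.loads(state_file.read_text())
    return {}


def post_new_stories(
    feed_url: str,
    guids: Iterable[str],
    state_file: Path,
    post: Callable[[str], object],
) -> int:
    """Post every unseen GUID and return how many posts succeeded."""
    state = load_state(state_file)
    posted = state.setdefault(feed_url, [])
    succeeded = 0
    for guid in guids:
        if guid in posted:
            continue  # already posted on an earlier run
        try:
            post(guid)
        except Exception:
            continue  # leave the GUID unmarked so the next run retries it
        posted.append(guid)  # mark as posted only after success
        succeeded += 1
    state_file.write_text(json.dumps(state))
    return succeeded
```

Writing the state back after the loop means a crash mid-run could repost a few stories; a real implementation might persist after each successful post instead.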
+299
aggregators/kagi-news/tests/test_richtext_formatter.py
··· 1 + """ 2 + Tests for Rich Text Formatter. 3 + 4 + Tests conversion of KagiStory to Coves rich text format with facets. 5 + """ 6 + import pytest 7 + from datetime import datetime 8 + 9 + from src.richtext_formatter import RichTextFormatter 10 + from src.models import KagiStory, Perspective, Quote, Source 11 + 12 + 13 + @pytest.fixture 14 + def sample_story(): 15 + """Create a sample KagiStory for testing.""" 16 + return KagiStory( 17 + title="Trump to meet Xi in South Korea", 18 + link="https://kite.kagi.com/test/world/10", 19 + guid="https://kite.kagi.com/test/world/10", 20 + pub_date=datetime(2025, 10, 23, 20, 56, 0), 21 + categories=["World", "World/Diplomacy"], 22 + summary="The White House confirmed President Trump will hold a bilateral meeting with Chinese President Xi Jinping in South Korea on October 30.", 23 + highlights=[ 24 + "Itinerary details: The Asia swing begins in Malaysia, continues to Japan.", 25 + "APEC context: US officials indicated the leaders will meet on the sidelines." 
26 + ], 27 + perspectives=[ 28 + Perspective( 29 + actor="President Trump", 30 + description="He said his first question to President Xi would be about fentanyl.", 31 + source_url="https://www.straitstimes.com/world/test" 32 + ), 33 + Perspective( 34 + actor="White House (press secretary)", 35 + description="Karoline Leavitt confirmed the bilateral meeting.", 36 + source_url="https://www.scmp.com/news/test" 37 + ) 38 + ], 39 + quote=Quote( 40 + text="Work out a lot of our doubts and questions", 41 + attribution="President Trump" 42 + ), 43 + sources=[ 44 + Source( 45 + title="Trump to meet Xi in South Korea", 46 + url="https://www.straitstimes.com/world/test", 47 + domain="straitstimes.com" 48 + ), 49 + Source( 50 + title="Trump meeting Xi next Thursday", 51 + url="https://www.scmp.com/news/test", 52 + domain="scmp.com" 53 + ) 54 + ], 55 + image_url="https://kagiproxy.com/img/test123", 56 + image_alt="Test image" 57 + ) 58 + 59 + 60 + class TestRichTextFormatter: 61 + """Test suite for RichTextFormatter.""" 62 + 63 + def test_format_full_returns_content_and_facets(self, sample_story): 64 + """Test that format_full returns content and facets.""" 65 + formatter = RichTextFormatter() 66 + result = formatter.format_full(sample_story) 67 + 68 + assert 'content' in result 69 + assert 'facets' in result 70 + assert isinstance(result['content'], str) 71 + assert isinstance(result['facets'], list) 72 + 73 + def test_content_structure(self, sample_story): 74 + """Test that content has correct structure.""" 75 + formatter = RichTextFormatter() 76 + result = formatter.format_full(sample_story) 77 + content = result['content'] 78 + 79 + # Check all sections are present 80 + assert sample_story.summary in content 81 + assert "Highlights:" in content 82 + assert "Perspectives:" in content 83 + assert "Sources:" in content 84 + assert sample_story.quote.text in content 85 + assert "📰 Story aggregated by Kagi News" in content 86 + 87 + def test_facets_for_bold_headers(self, 
sample_story): 88 + """Test that section headers have bold facets.""" 89 + formatter = RichTextFormatter() 90 + result = formatter.format_full(sample_story) 91 + 92 + # Find bold facets 93 + bold_facets = [ 94 + f for f in result['facets'] 95 + if any(feat.get('$type') == 'social.coves.richtext.facet#bold' 96 + for feat in f['features']) 97 + ] 98 + 99 + assert len(bold_facets) > 0 100 + 101 + # Check that "Highlights:" is bolded (facet indices are UTF-8 byte offsets) 102 + content = result['content'] 103 + highlights_pos = len(content[:content.find("Highlights:")].encode('utf-8')) 104 + 105 + # Should have a bold facet covering "Highlights:" 106 + has_highlights_bold = any( 107 + f['index']['byteStart'] <= highlights_pos and 108 + f['index']['byteEnd'] >= highlights_pos + len("Highlights:") 109 + for f in bold_facets 110 + ) 111 + assert has_highlights_bold 112 + 113 + def test_facets_for_italic_quote(self, sample_story): 114 + """Test that quotes have italic facets.""" 115 + formatter = RichTextFormatter() 116 + result = formatter.format_full(sample_story) 117 + 118 + # Find italic facets 119 + italic_facets = [ 120 + f for f in result['facets'] 121 + if any(feat.get('$type') == 'social.coves.richtext.facet#italic' 122 + for feat in f['features']) 123 + ] 124 + 125 + assert len(italic_facets) > 0 126 + 127 + # The quote text is wrapped with quotes, so search for that 128 + content = result['content'] 129 + quote_with_quotes = f'"{sample_story.quote.text}"' 130 + quote_char_pos = content.find(quote_with_quotes) 131 + 132 + # Convert character position to byte position 133 + quote_byte_start = len(content[:quote_char_pos].encode('utf-8')) 134 + quote_byte_end = len(content[:quote_char_pos + len(quote_with_quotes)].encode('utf-8')) 135 + 136 + has_quote_italic = any( 137 + f['index']['byteStart'] <= quote_byte_start and 138 + f['index']['byteEnd'] >= quote_byte_end 139 + for f in italic_facets 140 + ) 141 + assert has_quote_italic 142 + 143 + def test_facets_for_links(self, sample_story): 144 + """Test that URLs have link facets.""" 
145 + formatter = RichTextFormatter() 146 + result = formatter.format_full(sample_story) 147 + 148 + # Find link facets 149 + link_facets = [ 150 + f for f in result['facets'] 151 + if any(feat.get('$type') == 'social.coves.richtext.facet#link' 152 + for feat in f['features']) 153 + ] 154 + 155 + # Should have links for: 2 sources + 2 perspectives + 1 Kagi News link = 5 minimum 156 + assert len(link_facets) >= 5 157 + 158 + # Check that every source URL has a link facet 159 + source_urls = [s.url for s in sample_story.sources] 160 + for url in source_urls: 161 + has_link = any( 162 + any(feat.get('uri') == url for feat in f['features']) 163 + for f in link_facets 164 + ) 165 + assert has_link, f"Missing link facet for {url}" 166 + 167 + def test_utf8_byte_positions(self): 168 + """Test UTF-8 byte position calculation with multi-byte characters.""" 169 + # Create story with emoji and non-ASCII characters 170 + story = KagiStory( 171 + title="Test 👋 Story", 172 + link="https://test.com", 173 + guid="https://test.com", 174 + pub_date=datetime.now(), 175 + categories=["Test"], 176 + summary="Hello 世界 this is a test with emoji 🎉", 177 + highlights=["Test highlight"], 178 + perspectives=[], 179 + quote=None, 180 + sources=[], 181 + ) 182 + 183 + formatter = RichTextFormatter() 184 + result = formatter.format_full(story) 185 + 186 + # Verify content contains the emoji 187 + assert "👋" in result['content'] or "🎉" in result['content'] 188 + 189 + # Verify all facet byte positions are valid 190 + content_bytes = result['content'].encode('utf-8') 191 + for facet in result['facets']: 192 + start = facet['index']['byteStart'] 193 + end = facet['index']['byteEnd'] 194 + 195 + # Positions should be within bounds 196 + assert 0 <= start < len(content_bytes) 197 + assert start < end <= len(content_bytes) 198 + 199 + def test_format_story_without_optional_fields(self): 200 + """Test formatting story with missing optional fields.""" 201 + minimal_story = KagiStory( 202 + 
title="Minimal Story", 203 + link="https://test.com", 204 + guid="https://test.com", 205 + pub_date=datetime.now(), 206 + categories=["Test"], 207 + summary="Just a summary.", 208 + highlights=[], # Empty 209 + perspectives=[], # Empty 210 + quote=None, # Missing 211 + sources=[], # Empty 212 + ) 213 + 214 + formatter = RichTextFormatter() 215 + result = formatter.format_full(minimal_story) 216 + 217 + # Should still have content and facets 218 + assert result['content'] 219 + assert result['facets'] 220 + 221 + # Should have summary 222 + assert "Just a summary." in result['content'] 223 + 224 + # Should NOT have empty sections 225 + assert "Highlights:" not in result['content'] 226 + assert "Perspectives:" not in result['content'] 227 + 228 + def test_perspective_actor_is_bolded(self, sample_story): 229 + """Test that perspective actor names are bolded.""" 230 + formatter = RichTextFormatter() 231 + result = formatter.format_full(sample_story) 232 + 233 + content = result['content'] 234 + bold_facets = [ 235 + f for f in result['facets'] 236 + if any(feat.get('$type') == 'social.coves.richtext.facet#bold' 237 + for feat in f['features']) 238 + ] 239 + 240 + # Find "President Trump:" in perspectives section 241 + actor = "President Trump:" 242 + perspectives_start = content.find("Perspectives:") 243 + actor_char_pos = content.find(actor, perspectives_start) 244 + 245 + if actor_char_pos != -1: # If found in perspectives 246 + # Convert character position to byte position 247 + actor_byte_start = len(content[:actor_char_pos].encode('utf-8')) 248 + actor_byte_end = len(content[:actor_char_pos + len(actor)].encode('utf-8')) 249 + 250 + has_actor_bold = any( 251 + f['index']['byteStart'] <= actor_byte_start and 252 + f['index']['byteEnd'] >= actor_byte_end 253 + for f in bold_facets 254 + ) 255 + assert has_actor_bold 256 + 257 + def test_kagi_attribution_link(self, sample_story): 258 + """Test that Kagi News attribution has a link to the story.""" 259 + formatter = 
RichTextFormatter() 260 + result = formatter.format_full(sample_story) 261 + 262 + # Should have link to Kagi story 263 + link_facets = [ 264 + f for f in result['facets'] 265 + if any(feat.get('$type') == 'social.coves.richtext.facet#link' 266 + for feat in f['features']) 267 + ] 268 + 269 + # Find link to the Kagi story URL 270 + kagi_link = any( 271 + any(feat.get('uri') == sample_story.link for feat in f['features']) 272 + for f in link_facets 273 + ) 274 + assert kagi_link, "Missing link to Kagi story in attribution" 275 + 276 + def test_facets_do_not_overlap(self, sample_story): 277 + """Test that facets with same feature type don't overlap.""" 278 + formatter = RichTextFormatter() 279 + result = formatter.format_full(sample_story) 280 + 281 + # Group facets by type 282 + facets_by_type = {} 283 + for facet in result['facets']: 284 + for feature in facet['features']: 285 + ftype = feature['$type'] 286 + if ftype not in facets_by_type: 287 + facets_by_type[ftype] = [] 288 + facets_by_type[ftype].append(facet) 289 + 290 + # Check for overlaps within each type 291 + for ftype, facets in facets_by_type.items(): 292 + for i, f1 in enumerate(facets): 293 + for f2 in facets[i+1:]: 294 + start1, end1 = f1['index']['byteStart'], f1['index']['byteEnd'] 295 + start2, end2 = f2['index']['byteStart'], f2['index']['byteEnd'] 296 + 297 + # Check if they overlap 298 + overlaps = (start1 < end2 and start2 < end1) 299 + assert not overlaps, f"Overlapping facets of type {ftype}: {f1} and {f2}"
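Several of the tests above convert a character position into a UTF-8 byte offset before comparing it with `byteStart` and `byteEnd`. A small sketch of that conversion as a reusable helper follows; `byte_span` and `make_facet` are hypothetical names for illustration, not functions from `src/richtext_formatter.py`, whose implementation is not shown in this diff.

```python
from typing import Optional, Tuple


def byte_span(content: str, needle: str, start: int = 0) -> Optional[Tuple[int, int]]:
    """Return the (byteStart, byteEnd) of the first match of needle, or None."""
    char_pos = content.find(needle, start)
    if char_pos == -1:
        return None
    # Facet indices count UTF-8 bytes, so encode the prefix to measure it.
    byte_start = len(content[:char_pos].encode("utf-8"))
    byte_end = byte_start + len(needle.encode("utf-8"))
    return byte_start, byte_end


def make_facet(content: str, needle: str, feature: dict) -> Optional[dict]:
    """Wrap a span and a single feature in the facet shape the tests inspect."""
    span = byte_span(content, needle)
    if span is None:
        return None
    return {
        "index": {"byteStart": span[0], "byteEnd": span[1]},
        "features": [feature],
    }


content = "Hello 世界 🎉\nHighlights:"
facet = make_facet(
    content, "Highlights:", {"$type": "social.coves.richtext.facet#bold"}
)
```

In this example "Highlights:" starts at character 11 but at byte 18, because the two CJK characters take three bytes each and the emoji takes four. Using the character position directly would place the facet seven bytes too early.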
+91
aggregators/kagi-news/tests/test_rss_fetcher.py
··· 1 + """ 2 + Tests for RSS feed fetching functionality. 3 + """ 4 + import pytest 5 + import responses 6 + from pathlib import Path 7 + 8 + from src.rss_fetcher import RSSFetcher 9 + 10 + 11 + @pytest.fixture 12 + def sample_rss_feed(): 13 + """Load sample RSS feed from fixtures.""" 14 + fixture_path = Path(__file__).parent / "fixtures" / "world.xml" 15 + # For now, return a minimal inline feed instead of reading fixture_path 16 + return """<?xml version='1.0' encoding='UTF-8'?> 17 + <rss version="2.0"> 18 + <channel> 19 + <title>Kagi News - World</title> 20 + <item> 21 + <title>Test Story</title> 22 + <link>https://kite.kagi.com/test/world/1</link> 23 + <guid>https://kite.kagi.com/test/world/1</guid> 24 + <pubDate>Fri, 24 Oct 2025 12:00:00 +0000</pubDate> 25 + <category>World</category> 26 + </item> 27 + </channel> 28 + </rss>""" 29 + 30 + 31 + class TestRSSFetcher: 32 + """Test suite for RSSFetcher.""" 33 + 34 + @responses.activate 35 + def test_fetch_feed_success(self, sample_rss_feed): 36 + """Test successful RSS feed fetch.""" 37 + url = "https://news.kagi.com/world.xml" 38 + responses.add(responses.GET, url, body=sample_rss_feed, status=200) 39 + 40 + fetcher = RSSFetcher() 41 + feed = fetcher.fetch_feed(url) 42 + 43 + assert feed is not None 44 + assert feed.feed.title == "Kagi News - World" 45 + assert len(feed.entries) == 1 46 + assert feed.entries[0].title == "Test Story" 47 + 48 + @responses.activate 49 + def test_fetch_feed_timeout(self): 50 + """Test that an HTTP 408 (Request Timeout) response raises.""" 51 + url = "https://news.kagi.com/world.xml" 52 + responses.add(responses.GET, url, body="timeout", status=408) 53 + 54 + fetcher = RSSFetcher(timeout=5) 55 + 56 + with pytest.raises(Exception): # Should raise on the 408 response 57 + fetcher.fetch_feed(url) 58 + 59 + @responses.activate 60 + def test_fetch_feed_with_retry(self, sample_rss_feed): 61 + """Test fetch with retry on failure then success.""" 62 + url = "https://news.kagi.com/world.xml" 63 + 64 + # First call fails, second succeeds 65 + responses.add(responses.GET, 
url, body="error", status=500) 66 + responses.add(responses.GET, url, body=sample_rss_feed, status=200) 67 + 68 + fetcher = RSSFetcher(max_retries=2) 69 + feed = fetcher.fetch_feed(url) 70 + 71 + assert feed is not None 72 + assert len(feed.entries) == 1 73 + 74 + @responses.activate 75 + def test_fetch_feed_invalid_xml(self): 76 + """Test handling of invalid XML.""" 77 + url = "https://news.kagi.com/world.xml" 78 + responses.add(responses.GET, url, body="Not valid XML!", status=200) 79 + 80 + fetcher = RSSFetcher() 81 + feed = fetcher.fetch_feed(url) 82 + 83 + # feedparser is lenient, but should have bozo flag set 84 + assert feed.bozo == 1 # feedparser uses 1 for True 85 + 86 + def test_fetch_feed_requires_url(self): 87 + """Test that fetch_feed requires a URL.""" 88 + fetcher = RSSFetcher() 89 + 90 + with pytest.raises((ValueError, TypeError)): 91 + fetcher.fetch_feed("")
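The retry test above expects a failed request to be followed by a second attempt that succeeds. One way to get that behavior is sketched below, under the assumption that the fetch is wrapped in a callable. `fetch_with_retry`, `base_delay` and the exponential backoff are illustrative choices, not the actual `RSSFetcher` implementation, which is not part of this section.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def fetch_with_retry(
    fetch: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() up to max_retries times, backing off between attempts."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    last_error: Optional[Exception] = None
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception as exc:  # broad on purpose: this is only a sketch
            last_error = exc
            # Wait 1x, 2x, 4x ... base_delay, but not after the final attempt.
            if attempt < max_retries - 1:
                sleep(base_delay * 2 ** attempt)
    assert last_error is not None
    raise last_error
```

Passing `sleep` as a parameter lets a test substitute a recording function, so the backoff schedule can be checked without the test actually waiting.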
+227
aggregators/kagi-news/tests/test_state_manager.py
··· 1 + """ 2 + Tests for State Manager. 3 + 4 + Tests deduplication state tracking and persistence. 5 + """ 6 + import pytest 7 + import json 8 + import tempfile 9 + from pathlib import Path 10 + from datetime import datetime, timedelta 11 + 12 + from src.state_manager import StateManager 13 + 14 + 15 + @pytest.fixture 16 + def temp_state_file(): 17 + """Create a temporary state file for testing.""" 18 + with tempfile.NamedTemporaryFile(mode='w', delete=False, suffix='.json') as f: 19 + temp_path = Path(f.name) 20 + yield temp_path 21 + # Cleanup 22 + if temp_path.exists(): 23 + temp_path.unlink() 24 + 25 + 26 + class TestStateManager: 27 + """Test suite for StateManager.""" 28 + 29 + def test_initialize_new_state_file(self, temp_state_file): 30 + """Test initializing a new state file.""" 31 + manager = StateManager(temp_state_file) 32 + 33 + # Should create an empty state 34 + assert temp_state_file.exists() 35 + state = json.loads(temp_state_file.read_text()) 36 + assert 'feeds' in state 37 + assert state['feeds'] == {} 38 + 39 + def test_is_posted_returns_false_for_new_guid(self, temp_state_file): 40 + """Test that is_posted returns False for new GUIDs.""" 41 + manager = StateManager(temp_state_file) 42 + feed_url = "https://news.kagi.com/world.xml" 43 + guid = "https://kite.kagi.com/test/world/1" 44 + 45 + assert not manager.is_posted(feed_url, guid) 46 + 47 + def test_mark_posted_stores_guid(self, temp_state_file): 48 + """Test that mark_posted stores GUIDs.""" 49 + manager = StateManager(temp_state_file) 50 + feed_url = "https://news.kagi.com/world.xml" 51 + guid = "https://kite.kagi.com/test/world/1" 52 + post_uri = "at://did:plc:test/social.coves.post/abc123" 53 + 54 + manager.mark_posted(feed_url, guid, post_uri) 55 + 56 + # Should now return True 57 + assert manager.is_posted(feed_url, guid) 58 + 59 + def test_state_persists_across_instances(self, temp_state_file): 60 + """Test that state persists when creating new instances.""" 61 + feed_url = 
"https://news.kagi.com/world.xml" 62 + guid = "https://kite.kagi.com/test/world/1" 63 + post_uri = "at://did:plc:test/social.coves.post/abc123" 64 + 65 + # First instance marks as posted 66 + manager1 = StateManager(temp_state_file) 67 + manager1.mark_posted(feed_url, guid, post_uri) 68 + 69 + # Second instance should see the same state 70 + manager2 = StateManager(temp_state_file) 71 + assert manager2.is_posted(feed_url, guid) 72 + 73 + def test_track_last_run_timestamp(self, temp_state_file): 74 + """Test tracking last successful run timestamp.""" 75 + manager = StateManager(temp_state_file) 76 + feed_url = "https://news.kagi.com/world.xml" 77 + timestamp = datetime.now() 78 + 79 + manager.update_last_run(feed_url, timestamp) 80 + 81 + retrieved = manager.get_last_run(feed_url) 82 + assert retrieved is not None 83 + # Compare timestamps (allow small difference due to serialization) 84 + assert abs((retrieved - timestamp).total_seconds()) < 1 85 + 86 + def test_get_last_run_returns_none_for_new_feed(self, temp_state_file): 87 + """Test that get_last_run returns None for new feeds.""" 88 + manager = StateManager(temp_state_file) 89 + feed_url = "https://news.kagi.com/world.xml" 90 + 91 + assert manager.get_last_run(feed_url) is None 92 + 93 + def test_cleanup_old_guids(self, temp_state_file): 94 + """Test cleanup of old GUIDs (> 30 days).""" 95 + manager = StateManager(temp_state_file) 96 + feed_url = "https://news.kagi.com/world.xml" 97 + 98 + # Add recent GUID 99 + recent_guid = "https://kite.kagi.com/test/world/1" 100 + manager.mark_posted(feed_url, recent_guid, "at://test/1") 101 + 102 + # Manually add old GUID (> 30 days) 103 + old_timestamp = (datetime.now() - timedelta(days=31)).isoformat() 104 + state_data = json.loads(temp_state_file.read_text()) 105 + state_data['feeds'][feed_url]['posted_guids'].append({ 106 + 'guid': 'https://kite.kagi.com/test/world/old', 107 + 'post_uri': 'at://test/old', 108 + 'posted_at': old_timestamp 109 + }) 110 + 
temp_state_file.write_text(json.dumps(state_data, indent=2)) 111 + 112 + # Reload and cleanup 113 + manager = StateManager(temp_state_file) 114 + manager.cleanup_old_entries(feed_url) 115 + 116 + # Recent GUID should still be there 117 + assert manager.is_posted(feed_url, recent_guid) 118 + 119 + # Old GUID should be removed 120 + assert not manager.is_posted(feed_url, 'https://kite.kagi.com/test/world/old') 121 + 122 + def test_limit_guids_to_100_per_feed(self, temp_state_file): 123 + """Test that only last 100 GUIDs are kept per feed.""" 124 + manager = StateManager(temp_state_file) 125 + feed_url = "https://news.kagi.com/world.xml" 126 + 127 + # Add 150 GUIDs 128 + for i in range(150): 129 + guid = f"https://kite.kagi.com/test/world/{i}" 130 + manager.mark_posted(feed_url, guid, f"at://test/{i}") 131 + 132 + # Cleanup (should limit to 100) 133 + manager.cleanup_old_entries(feed_url) 134 + 135 + # Reload state 136 + manager = StateManager(temp_state_file) 137 + 138 + # Should have exactly 100 entries (most recent) 139 + state_data = json.loads(temp_state_file.read_text()) 140 + assert len(state_data['feeds'][feed_url]['posted_guids']) == 100 141 + 142 + # Oldest entries should be removed 143 + assert not manager.is_posted(feed_url, "https://kite.kagi.com/test/world/0") 144 + assert not manager.is_posted(feed_url, "https://kite.kagi.com/test/world/49") 145 + 146 + # Recent entries should still be there 147 + assert manager.is_posted(feed_url, "https://kite.kagi.com/test/world/149") 148 + assert manager.is_posted(feed_url, "https://kite.kagi.com/test/world/100") 149 + 150 + def test_multiple_feeds_tracked_separately(self, temp_state_file): 151 + """Test that multiple feeds are tracked independently.""" 152 + manager = StateManager(temp_state_file) 153 + 154 + feed1 = "https://news.kagi.com/world.xml" 155 + feed2 = "https://news.kagi.com/tech.xml" 156 + guid1 = "https://kite.kagi.com/test/world/1" 157 + guid2 = "https://kite.kagi.com/test/tech/1" 158 + 159 + 
manager.mark_posted(feed1, guid1, "at://test/1") 160 + manager.mark_posted(feed2, guid2, "at://test/2") 161 + 162 + # Each feed should only know about its own GUIDs 163 + assert manager.is_posted(feed1, guid1) 164 + assert not manager.is_posted(feed1, guid2) 165 + 166 + assert manager.is_posted(feed2, guid2) 167 + assert not manager.is_posted(feed2, guid1) 168 + 169 + def test_get_posted_count(self, temp_state_file): 170 + """Test getting count of posted items per feed.""" 171 + manager = StateManager(temp_state_file) 172 + feed_url = "https://news.kagi.com/world.xml" 173 + 174 + # Initially 0 175 + assert manager.get_posted_count(feed_url) == 0 176 + 177 + # Add 5 items 178 + for i in range(5): 179 + manager.mark_posted(feed_url, f"guid-{i}", f"post-{i}") 180 + 181 + assert manager.get_posted_count(feed_url) == 5 182 + 183 + def test_state_file_format_is_valid_json(self, temp_state_file): 184 + """Test that state file is always valid JSON.""" 185 + manager = StateManager(temp_state_file) 186 + feed_url = "https://news.kagi.com/world.xml" 187 + 188 + manager.mark_posted(feed_url, "test-guid", "test-post-uri") 189 + manager.update_last_run(feed_url, datetime.now()) 190 + 191 + # Should be valid JSON 192 + with open(temp_state_file) as f: 193 + state = json.load(f) 194 + 195 + assert 'feeds' in state 196 + assert feed_url in state['feeds'] 197 + assert 'posted_guids' in state['feeds'][feed_url] 198 + assert 'last_successful_run' in state['feeds'][feed_url] 199 + 200 + def test_automatic_cleanup_on_mark_posted(self, temp_state_file): 201 + """Test that cleanup happens automatically when marking posted.""" 202 + manager = StateManager(temp_state_file) 203 + feed_url = "https://news.kagi.com/world.xml" 204 + 205 + # Add old entry manually 206 + old_timestamp = (datetime.now() - timedelta(days=31)).isoformat() 207 + state_data = { 208 + 'feeds': { 209 + feed_url: { 210 + 'posted_guids': [{ 211 + 'guid': 'old-guid', 212 + 'post_uri': 'old-uri', 213 + 'posted_at': 
old_timestamp 214 + }], 215 + 'last_successful_run': None 216 + } 217 + } 218 + } 219 + temp_state_file.write_text(json.dumps(state_data, indent=2)) 220 + 221 + # Reload and add new entry (should trigger cleanup) 222 + manager = StateManager(temp_state_file) 223 + manager.mark_posted(feed_url, "new-guid", "new-uri") 224 + 225 + # Old entry should be gone 226 + assert not manager.is_posted(feed_url, "old-guid") 227 + assert manager.is_posted(feed_url, "new-guid")
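The cleanup tests above describe two pruning rules: entries older than 30 days are dropped, and at most 100 entries are kept per feed. A minimal sketch of those rules over the `posted_guids` list format used in the state file follows; `prune_posted_guids` is a hypothetical function, not the real `StateManager.cleanup_old_entries`.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def prune_posted_guids(
    entries: List[Dict[str, str]],
    now: datetime,
    max_age_days: int = 30,
    max_entries: int = 100,
) -> List[Dict[str, str]]:
    """Drop entries older than max_age_days, then keep the newest max_entries."""
    cutoff = now - timedelta(days=max_age_days)
    fresh = [
        entry
        for entry in entries
        if datetime.fromisoformat(entry["posted_at"]) >= cutoff
    ]
    # sorted() is stable, so entries with equal timestamps keep insertion order.
    fresh = sorted(fresh, key=lambda e: datetime.fromisoformat(e["posted_at"]))
    return fresh[-max_entries:]
```

Applying the age filter before the size cap means an expired entry never takes one of the 100 slots, which is the outcome the two cleanup tests check for separately.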
+40
docs/PRD_COMMUNITIES.md
··· 201 201 202 202 --- 203 203 204 + ### Blob Upload Proxy System 205 + **Status:** Design documented, implementation TODO 206 + **Priority:** CRITICAL for Beta - Required for image/video posts in communities 207 + 208 + **Problem:** Users on external PDSs cannot directly upload blobs to community-owned PDS repositories because they lack authentication credentials for the community's PDS. 209 + 210 + **Solution:** Coves AppView acts as an authenticated proxy for blob uploads: 211 + 212 + **Flow:** 213 + 1. User uploads blob to Coves AppView via `social.coves.blob.uploadForCommunity` 214 + 2. AppView validates user can post to community (not banned, community accessible) 215 + 3. AppView uses community's PDS credentials to upload blob via `com.atproto.repo.uploadBlob` 216 + 4. AppView returns CID to user 217 + 5. User creates post record referencing the CID 218 + 6. Post and blob both live in community's PDS 219 + 220 + **Implementation Checklist:** 221 + - [ ] Handler: `social.coves.blob.uploadForCommunity` endpoint 222 + - [ ] Validation: Check user authorization to post in community 223 + - [ ] Credential Management: Reuse community token refresh logic 224 + - [ ] Upload Proxy: Forward blob to community's PDS with community credentials 225 + - [ ] Security: Size limits, content-type validation, rate limiting 226 + - [ ] Testing: E2E test with federated user uploading to community 227 + 228 + **Why This Approach:** 229 + - ✅ Works with federated users (any PDS) 230 + - ✅ Reuses existing community credential infrastructure 231 + - ✅ Matches V2 architecture (AppView orchestrates, communities own data) 232 + - ✅ Blobs stored on correct PDS (community's repository) 233 + - ❌ AppView becomes upload intermediary (bandwidth cost) 234 + 235 + **Alternative Considered:** Direct user uploads to community PDS 236 + - Rejected: Would require creating temporary user accounts on every community PDS (complex, insecure) 237 + 238 + **See:** Design discussion in context of 
ATProto blob architecture 239 + 240 + --- 241 + 204 242 ### Posts in Communities 205 243 **Status:** Lexicon designed, implementation TODO 206 244 **Priority:** HIGHEST for Beta 1 ··· 214 252 - [ ] Decide membership requirements for posting 215 253 216 254 **Without posts, communities exist but can't be used!** 255 + 256 + **Depends on:** Blob Upload Proxy System (for image/video posts) 217 257 218 258 --- 219 259
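The Blob Upload Proxy flow in this PRD change can be sketched as a single function that validates the request and then uploads with the community's credentials. Everything here is an assumption for illustration: the function name, the size limit, the allowed content types, and the injected `can_post` and `upload_blob` callables are hypothetical, since the handler is still marked TODO.

```python
from typing import Callable

# Hypothetical limits; the PRD lists size and content-type checks without values.
MAX_BLOB_BYTES = 1_000_000
ALLOWED_CONTENT_TYPES = {"image/jpeg", "image/png", "image/webp"}


class UploadRejected(Exception):
    """Raised when the proxy refuses to forward a blob."""


def upload_for_community(
    user_did: str,
    community_did: str,
    data: bytes,
    content_type: str,
    can_post: Callable[[str, str], bool],
    upload_blob: Callable[[str, bytes, str], str],
) -> str:
    """Validate the upload, forward it as the community, and return the CID."""
    # Step 2 of the flow: the user must be allowed to post in the community.
    if not can_post(user_did, community_did):
        raise UploadRejected("user may not post to this community")
    # Security checklist item: content-type validation and size limits.
    if content_type not in ALLOWED_CONTENT_TYPES:
        raise UploadRejected(f"unsupported content type: {content_type}")
    if len(data) > MAX_BLOB_BYTES:
        raise UploadRejected("blob exceeds size limit")
    # Step 3: upload_blob stands in for the uploadBlob call made with
    # the community's PDS credentials; step 4 returns the resulting CID.
    return upload_blob(community_did, data, content_type)
```

Injecting `can_post` and `upload_blob` keeps the sketch independent of any PDS client, and it mirrors the PRD's split between authorization (AppView) and storage (community PDS).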
+704 -884
docs/aggregators/PRD_KAGI_NEWS_RSS.md
··· 1 1 # Kagi News RSS Aggregator PRD 2 2 3 - **Status:** Planning Phase 3 + **Status:** ✅ Phase 1 Complete - Ready for Deployment 4 4 **Owner:** Platform Team 5 - **Last Updated:** 2025-10-20 5 + **Last Updated:** 2025-10-24 6 6 **Parent PRD:** [PRD_AGGREGATORS.md](PRD_AGGREGATORS.md) 7 + **Implementation:** Python + Docker Compose 8 + 9 + ## 🎉 Implementation Complete 10 + 11 + All core components have been implemented and tested: 12 + 13 + - ✅ **RSS Fetcher** - Fetches feeds with retry logic and error handling 14 + - ✅ **HTML Parser** - Extracts all structured data (summary, highlights, perspectives, quote, sources) 15 + - ✅ **Rich Text Formatter** - Formats content with proper facets for Coves 16 + - ✅ **State Manager** - Tracks posted stories to prevent duplicates 17 + - ✅ **Config Manager** - Loads and validates YAML configuration 18 + - ✅ **Coves Client** - Handles authentication and post creation 19 + - ✅ **Main Orchestrator** - Coordinates all components 20 + - ✅ **Comprehensive Tests** - 57 tests with 83% code coverage 21 + - ✅ **Documentation** - README with setup and deployment instructions 22 + - ✅ **Example Configs** - config.example.yaml and .env.example 23 + 24 + **Test Results:** 25 + ``` 26 + 57 passed, 6 skipped, 1 warning in 8.76s 27 + Coverage: 83% 28 + ``` 29 + 30 + **Ready for:** 31 + - Integration testing with live Coves API 32 + - Aggregator DID creation and authorization 33 + - Production deployment 7 34 8 35 ## Overview 9 36 ··· 15 42 - **Rich metadata**: Categories, highlights, source links included 16 43 - **Legal & free**: CC BY-NC licensed for non-commercial use 17 44 - **Low complexity**: No LLM deduplication needed (Kagi does it) 45 + - **Simple deployment**: Python + Docker Compose, runs alongside Coves on same instance 18 46 19 47 ## Data Source: Kagi News RSS Feeds 20 48 ··· 46 74 47 75 **Known Categories:** 48 76 - `world.xml` - World news 49 - - `tech.xml` - Technology (likely) 50 - - `business.xml` - Business (likely) 77 + - 
`tech.xml` - Technology 78 + - `business.xml` - Business 51 79 - `sports.xml` - Sports (likely) 52 80 - Additional categories TBD (need to scrape homepage) 53 81 ··· 55 83 56 84 **Update Frequency:** One daily update (~noon UTC) 57 85 86 + **Important Note on Domain Migration (October 2025):** 87 + Kagi migrated their RSS feeds from `kite.kagi.com` to `news.kagi.com`. The old domain now redirects (302) to the new domain, but for reliability, always use `news.kagi.com` directly in your feed URLs. Story links within the RSS feed still reference `kite.kagi.com` as permalinks. 88 + 58 89 --- 59 90 60 91 ### RSS Item Schema ··· 99 130 </ul> 100 131 ``` 101 132 133 + **✅ Verified Feed Structure:** 134 + Analysis of live Kagi News feeds confirms the following structure: 135 + - **Only 3 H3 sections:** Highlights, Perspectives, Sources (no other sections like Timeline or Historical Background) 136 + - **Historical context** is woven into the summary paragraph and highlights (not a separate section) 137 + - **Not all stories have all sections** - Quote (blockquote) and image are optional 138 + - **Feed contains everything shown on website** except for Timeline (which is a frontend-only feature) 139 + 102 140 **Key Features:** 103 141 - Multiple source citations inline 104 142 - Balanced perspectives from different actors 105 - - Highlights extract key points 106 - - Direct quotes preserved 143 + - Highlights extract key points with historical context 144 + - Direct quotes preserved (when available) 107 145 - All sources linked with attribution 146 + - Images from Kagi's proxy CDN 108 147 109 148 --- 110 149 ··· 123 162 │ HTTP GET one job after update 124 163 125 164 ┌─────────────────────────────────────────────────────────────┐ 126 - │ Kagi News Aggregator Service │ 127 - │ DID: did:web:kagi-news.coves.social │ 165 + │ Kagi News Aggregator Service (Python + Docker Compose) │ 166 + │ DID: did:plc:[generated-on-creation] │ 167 + │ Location: aggregators/kagi-news/ │ 128 168 │ 
│ 129 169 │ Components: │ 130 - │ 1. Feed Poller: Fetches RSS feeds on schedule │ 131 - │ 2. Item Parser: Extracts structured data from HTML │ 132 - │ 3. Deduplication: Tracks posted GUIDs (no LLM needed) │ 133 - │ 4. Category Mapper: Maps Kagi categories to communities │ 170 + │ 1. RSS Fetcher: Fetches RSS feeds on schedule (feedparser) │ 171 + │ 2. Item Parser: Extracts structured data from HTML (bs4) │ 172 + │ 3. Deduplication: Tracks posted items via JSON state file │ 173 + │ 4. Feed Mapper: Maps feed URLs to community handles │ 134 174 │ 5. Post Formatter: Converts to Coves post format │ 135 - │ 6. Post Publisher: Calls social.coves.post.create │ 175 + │ 6. Post Publisher: Calls social.coves.post.create via XRPC │ 176 + │ 7. Blob Uploader: Handles image upload to ATProto │ 136 177 └─────────────────────────────────────────────────────────────┘ 137 178 138 179 │ Authenticated XRPC calls ··· 140 181 ┌─────────────────────────────────────────────────────────────┐ 141 182 │ Coves AppView (social.coves.post.create) │ 142 183 │ - Validates aggregator authorization │ 143 - │ - Creates post with author = did:web:kagi-news.coves.social│ 184 + │ - Creates post with author = did:plc:[aggregator-did] │ 144 185 │ - Indexes to community feeds │ 145 186 └─────────────────────────────────────────────────────────────┘ 146 187 ``` ··· 152 193 ```json 153 194 { 154 195 "$type": "social.coves.aggregator.service", 155 - "did": "did:web:kagi-news.coves.social", 196 + "did": "did:plc:[generated-on-creation]", 156 197 "displayName": "Kagi News Aggregator", 157 198 "description": "Automatically posts breaking news from Kagi News RSS feeds. 
Kagi News aggregates multiple sources per story with balanced perspectives and comprehensive source citations.", 158 199 "aggregatorType": "social.coves.aggregator.types#rss", ··· 160 201 "configSchema": { 161 202 "type": "object", 162 203 "properties": { 163 - "categories": { 164 - "type": "array", 165 - "items": { 166 - "type": "string", 167 - "enum": ["world", "tech", "business", "sports", "science"] 168 - }, 169 - "description": "Kagi News categories to monitor", 170 - "minItems": 1 171 - }, 172 - "subcategoryFilter": { 173 - "type": "array", 174 - "items": { "type": "string" }, 175 - "description": "Optional: only post stories with these subcategories (e.g., 'World/Middle East', 'Tech/AI')" 176 - }, 177 - "minSources": { 178 - "type": "integer", 179 - "minimum": 1, 180 - "default": 2, 181 - "description": "Minimum number of sources required for a story to be posted" 182 - }, 183 - "includeImages": { 184 - "type": "boolean", 185 - "default": true, 186 - "description": "Include images from Kagi proxy in posts" 187 - }, 188 - "postFormat": { 204 + "feedUrl": { 189 205 "type": "string", 190 - "enum": ["full", "summary", "minimal"], 191 - "default": "full", 192 - "description": "How much content to include: full (all sections), summary (main paragraph + sources), minimal (title + link only)" 206 + "format": "uri", 207 + "description": "Kagi News RSS feed URL (e.g., https://news.kagi.com/world.xml)" 193 208 } 194 209 }, 195 - "required": ["categories"] 210 + "required": ["feedUrl"] 196 211 }, 197 212 "sourceUrl": "https://github.com/coves-social/kagi-news-aggregator", 198 213 "maintainer": "did:plc:coves-platform", 199 - "createdAt": "2025-10-20T12:00:00Z" 214 + "createdAt": "2025-10-23T00:00:00Z" 200 215 } 201 216 ``` 202 217 218 + **Note:** The MVP implementation uses a simpler configuration model. Feed-to-community mappings are defined in the aggregator's own config file rather than per-community configuration. 
This allows one aggregator instance to post to multiple communities. 219 + 203 220 --- 204 221 205 - ## Community Configuration Examples 222 + ## Aggregator Configuration (MVP) 206 223 207 - ### Example 1: World News Community 224 + The MVP uses a simplified configuration model where the aggregator service defines feed-to-community mappings in its own config file. 208 225 209 - ```json 210 - { 211 - "aggregatorDid": "did:web:kagi-news.coves.social", 212 - "enabled": true, 213 - "config": { 214 - "categories": ["world"], 215 - "minSources": 3, 216 - "includeImages": true, 217 - "postFormat": "full" 218 - } 219 - } 220 - ``` 226 + ### Configuration File: `config.yaml` 221 227 222 - **Result:** Posts all world news stories with 3+ sources, full content including images/highlights/perspectives. 228 + ```yaml 229 + # Aggregator credentials (from environment variables) 230 + # AGGREGATOR_DID=did:plc:xyz... 231 + # AGGREGATOR_PRIVATE_KEY=base64-encoded-key... 223 232 224 - --- 233 + # Coves API endpoint 234 + coves_api_url: "https://api.coves.social" 225 235 226 - ### Example 2: AI/Tech Community (Filtered) 236 + # Feed-to-community mappings 237 + feeds: 238 + - name: "World News" 239 + url: "https://news.kagi.com/world.xml" 240 + community_handle: "world-news.coves.social" 241 + enabled: true 227 242 228 - ```json 229 - { 230 - "aggregatorDid": "did:web:kagi-news.coves.social", 231 - "enabled": true, 232 - "config": { 233 - "categories": ["tech", "business"], 234 - "subcategoryFilter": ["Tech/AI", "Tech/Machine Learning", "Business/Tech Industry"], 235 - "minSources": 2, 236 - "includeImages": true, 237 - "postFormat": "full" 238 - } 239 - } 240 - ``` 243 + - name: "Tech News" 244 + url: "https://news.kagi.com/tech.xml" 245 + community_handle: "tech.coves.social" 246 + enabled: true 241 247 242 - **Result:** Only posts tech stories about AI/ML or tech industry business news with 2+ sources. 
248 + - name: "Science News" 249 + url: "https://news.kagi.com/science.xml" 250 + community_handle: "science.coves.social" 251 + enabled: false # Can be disabled without removing 243 252 244 - --- 245 - 246 - ### Example 3: Breaking News (Minimal) 253 + # Scheduling 254 + check_interval: "24h" # Run once daily 247 255 248 - ```json 249 - { 250 - "aggregatorDid": "did:web:kagi-news.coves.social", 251 - "enabled": true, 252 - "config": { 253 - "categories": ["world", "business", "tech"], 254 - "minSources": 5, 255 - "includeImages": false, 256 - "postFormat": "minimal" 257 - } 258 - } 256 + # Logging 257 + log_level: "info" 259 258 ``` 260 259 261 - **Result:** Only major stories (5+ sources), minimal format (headline + link), no images. 260 + **Key Decisions:** 261 + - Uses **community handles** (not DIDs) for easier configuration - resolved at runtime 262 + - One aggregator can post to multiple communities 263 + - Feed mappings managed in aggregator config (not per-community config) 264 + - No complex filtering logic in MVP - one feed = one community 262 265 263 266 --- 264 267 ··· 269 272 ```json 270 273 { 271 274 "$type": "social.coves.post.record", 272 - "author": "did:web:kagi-news.coves.social", 273 - "community": "did:plc:worldnews123", 275 + "author": "did:plc:[aggregator-did]", 276 + "community": "world-news.coves.social", 274 277 "title": "{Kagi story title}", 275 - "content": "{formatted content based on postFormat config}", 278 + "content": "{formatted content - full format for MVP}", 276 279 "embed": { 277 - "$type": "app.bsky.embed.external", 280 + "$type": "social.coves.embed.external", 278 281 "external": { 279 - "uri": "https://kite.kagi.com/{uuid}/{category}/{id}", 282 + "uri": "{Kagi story URL}", 280 283 "title": "{story title}", 281 - "description": "{summary excerpt}", 282 - "thumb": "{image blob if includeImages=true}" 284 + "description": "{summary excerpt - first 200 chars}", 285 + "thumb": "{Kagi proxy image URL from HTML}" 283 286 } 284 287 
}, 285 288 "federatedFrom": { ··· 296 299 } 297 300 ``` 298 301 302 + **MVP Notes:** 303 + - Uses `social.coves.embed.external` for hot-linked images (no blob upload) 304 + - Community specified as handle (resolved to DID by post creation endpoint) 305 + - Images referenced via original Kagi proxy URLs 306 + - "Full" format only for MVP (no format variations) 307 + - Content uses Coves rich text with facets (not markdown) 308 + 299 309 --- 300 310 301 - ### Content Formatting by `postFormat` 311 + ### Content Formatting (MVP: "Full" Format Only) 302 312 303 - #### Format: `full` (Default) 313 + The MVP implements a single "full" format using Coves rich text with facets: 304 314 305 - ```markdown 315 + **Plain Text Structure:** 316 + ``` 306 317 {Main summary paragraph with source citations} 307 318 308 - **Highlights:** 319 + Highlights: 309 320 • {Bullet point 1} 310 321 • {Bullet point 2} 311 322 • ... 312 323 313 - **Perspectives:** 314 - • **{Actor}**: {Their perspective} ([Source]({url})) 324 + Perspectives: 325 + • {Actor}: {Their perspective} (Source) 315 326 • ... 316 327 317 - > {Notable quote} — {Attribution} 328 + "{Notable quote}" — {Attribution} 318 329 319 - **Sources:** 320 - • [{Title}]({url}) - {domain} 330 + Sources: 331 + • {Title} - {domain} 321 332 • ... 322 333 323 334 --- 324 - 📰 Story aggregated by [Kagi News]({kagi_story_url}) 335 + 📰 Story aggregated by Kagi News 325 336 ``` 326 337 327 - **Rationale:** Preserves Kagi's rich multi-source analysis, provides maximum value. 
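The plain text structure above can be sketched as a small renderer. This is a minimal illustration rather than the aggregator's actual formatter: `Story` is a hypothetical, trimmed stand-in for `KagiStory`, and `render_plain_text` is an assumed helper name. Sections with no data are skipped, matching the note that not every story has every section.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

BULLET = "\u2022"                      # the bullet character used in the template
EM_DASH = "\u2014"                     # separator between quote and attribution
FOOTER = "---\n\U0001F4F0 Story aggregated by Kagi News"


@dataclass
class Story:
    """Hypothetical, trimmed stand-in for KagiStory (illustration only)."""
    summary: str
    highlights: List[str] = field(default_factory=list)
    perspectives: List[Tuple[str, str]] = field(default_factory=list)  # (actor, description)
    quote: Optional[Tuple[str, str]] = None                            # (text, attribution)
    sources: List[Tuple[str, str]] = field(default_factory=list)       # (title, domain)


def render_plain_text(story: Story) -> str:
    """Join the sections that are present, separated by blank lines."""
    blocks = [story.summary]
    if story.highlights:
        lines = [f"{BULLET} {h}" for h in story.highlights]
        blocks.append("Highlights:\n" + "\n".join(lines))
    if story.perspectives:
        lines = [f"{BULLET} {actor}: {desc} (Source)" for actor, desc in story.perspectives]
        blocks.append("Perspectives:\n" + "\n".join(lines))
    if story.quote:
        text, attribution = story.quote
        blocks.append(f'"{text}" {EM_DASH} {attribution}')
    if story.sources:
        lines = [f"{BULLET} {title} - {domain}" for title, domain in story.sources]
        blocks.append("Sources:\n" + "\n".join(lines))
    blocks.append(FOOTER)
    return "\n\n".join(blocks)


example = Story(
    summary="Summary.",
    highlights=["A", "B"],
    sources=[("Title", "example.com")],
)
text = render_plain_text(example)
```

The renderer returns text only. Bold, italic, and link ranges are tracked separately as facets, so nothing here emits markdown markers.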
338 + **Rich Text Facets Applied:** 339 + - **Bold** (`social.coves.richtext.facet#bold`) on section headers: "Highlights:", "Perspectives:", "Sources:" 340 + - **Bold** on perspective actors 341 + - **Italic** (`social.coves.richtext.facet#italic`) on quotes 342 + - **Link** (`social.coves.richtext.facet#link`) on all URLs (source links, Kagi story link, perspective sources) 343 + - Byte ranges calculated using UTF-8 byte positions 328 344 329 - --- 345 + **Example with Facets:** 346 + ```json 347 + { 348 + "content": "Main summary [source.com#1]\n\nHighlights:\n• Key point 1...", 349 + "facets": [ 350 + { 351 + "index": {"byteStart": 29, "byteEnd": 40}, 352 + "features": [{"$type": "social.coves.richtext.facet#bold"}] 353 + }, 354 + { 355 + "index": {"byteStart": 14, "byteEnd": 26}, 356 + "features": [{"$type": "social.coves.richtext.facet#link", "uri": "https://source.com"}] 357 + } 358 + ] 359 + } 360 + ``` 330 361 331 - #### Format: `summary` 332 - 333 - ```markdown 334 - {Main summary paragraph with source citations} 335 - 336 - **Sources:** 337 - • [{Title}]({url}) - {domain} 338 - • ... 362 + **Rationale:** 363 + - Uses native Coves rich text format (not markdown) 364 + - Preserves Kagi's rich multi-source analysis 365 + - Provides maximum value to communities 366 + - Meets CC BY-NC attribution requirements 367 + - Additional formats ("summary", "minimal") can be added post-MVP 339 368 340 369 --- 341 - 📰 Story aggregated by [Kagi News]({kagi_story_url}) 342 - ``` 343 370 344 - **Rationale:** Clean summary with source links, less overwhelming.
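Because facet ranges are UTF-8 byte positions rather than character positions, one way to keep them correct is to record offsets while the text is being assembled. The sketch below assumes `byteEnd` is exclusive (as in AT Protocol rich text facets). `FacetBuilder` is a hypothetical helper, not a class from the aggregator's code.

```python
from typing import Dict, List, Optional

BOLD = "social.coves.richtext.facet#bold"
LINK = "social.coves.richtext.facet#link"


class FacetBuilder:
    """Accumulates plain text and records facets by UTF-8 byte offset."""

    def __init__(self) -> None:
        self._parts: List[str] = []
        self._byte_len = 0
        self.facets: List[Dict] = []

    def add(self, text: str, facet_type: Optional[str] = None,
            uri: Optional[str] = None) -> None:
        """Append text; if facet_type is given, mark the appended span."""
        start = self._byte_len
        self._parts.append(text)
        # Advance by encoded length, so multi-byte characters count fully.
        self._byte_len += len(text.encode("utf-8"))
        if facet_type is None:
            return
        feature = {"$type": facet_type}
        if uri is not None:
            feature["uri"] = uri
        self.facets.append({
            "index": {"byteStart": start, "byteEnd": self._byte_len},
            "features": [feature],
        })

    def build(self) -> Dict:
        return {"content": "".join(self._parts), "facets": self.facets}


builder = FacetBuilder()
builder.add("Caf\u00e9 r\u00e9sum\u00e9\n\n")      # 13 characters, 16 bytes
builder.add("Highlights:", BOLD)                    # bytes 16..27
builder.add("\n\u2022 Key point\n\n")               # the bullet is 3 bytes
builder.add("Kagi News", LINK, uri="https://news.kagi.com")
post = builder.build()
```

In this sample the header starts at byte 16 although it is character 13. That gap is the error a character-based index would introduce as soon as the summary contains accented letters, bullets, or emoji.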
371 + ## Implementation Details (Python MVP) 345 372 346 - --- 373 + ### Technology Stack 347 374 348 - #### Format: `minimal` 375 + **Language:** Python 3.11+ 349 376 350 - ```markdown 351 - {Story title} 352 - 353 - Read more: {kagi_story_url} 377 + **Key Libraries:** 378 + - `feedparser` - RSS/Atom parsing 379 + - `beautifulsoup4` - HTML parsing for RSS item descriptions 380 + - `requests` - HTTP client for fetching feeds 381 + - `atproto` - Official ATProto Python SDK for authentication 382 + - `pyyaml` - Configuration file parsing 383 + - `pytest` - Testing framework 354 384 355 - **Sources:** {domain1}, {domain2}, {domain3}... 385 + ### Project Structure 356 386 357 - --- 358 - 📰 Via [Kagi News]({kagi_story_url}) 359 387 ``` 360 - 361 - **Rationale:** Just headlines with link, for high-volume communities or breaking news alerts. 388 + aggregators/kagi-news/ 389 + ├── Dockerfile 390 + ├── docker-compose.yml 391 + ├── requirements.txt 392 + ├── config.example.yaml 393 + ├── crontab # CRON schedule configuration 394 + ├── .env.example # Environment variables template 395 + ├── scripts/ 396 + │ └── generate_did.py # Helper to generate aggregator DID 397 + ├── src/ 398 + │ ├── main.py # Entry point (single run, called by CRON) 399 + │ ├── config.py # Configuration loading and validation 400 + │ ├── rss_fetcher.py # RSS feed fetching with retry logic 401 + │ ├── html_parser.py # Parse Kagi HTML to structured data 402 + │ ├── richtext_formatter.py # Format content with rich text facets 403 + │ ├── atproto_client.py # ATProto authentication and operations 404 + │ ├── state_manager.py # Deduplication state tracking (JSON) 405 + │ └── models.py # Data models (KagiStory, etc.) 
406 + ├── tests/ 407 + │ ├── test_parser.py 408 + │ ├── test_richtext_formatter.py 409 + │ ├── test_state_manager.py 410 + │ └── fixtures/ # Sample RSS feeds for testing 411 + └── README.md 412 + ``` 362 413 363 414 --- 364 415 365 - ## Implementation Details 416 + ### Component 1: RSS Fetcher (`rss_fetcher.py`) ✅ COMPLETE 366 417 367 - ### Component 1: Feed Poller 418 + **Responsibility:** Fetch RSS feeds with retry logic and error handling 368 419 369 - **Responsibility:** Fetch RSS feeds on schedule 420 + **Key Functions:** 421 + - `fetch_feed(url: str) -> feedparser.FeedParserDict` 422 + - Uses `requests` with timeout (30s) 423 + - Retry logic: 3 attempts with exponential backoff 424 + - Returns parsed RSS feed or raises exception 370 425 371 - ```go 372 - type FeedPoller struct { 373 - categories []string 374 - pollInterval time.Duration 375 - httpClient *http.Client 376 - } 426 + **Error Handling:** 427 + - Network timeouts 428 + - Invalid XML 429 + - HTTP errors (404, 500, etc.) 377 430 378 - func (p *FeedPoller) Start(ctx context.Context) error { 379 - ticker := time.NewTicker(p.pollInterval) // 15 minutes 380 - defer ticker.Stop() 381 - 382 - for { 383 - select { 384 - case <-ticker.C: 385 - for _, category := range p.categories { 386 - feedURL := fmt.Sprintf("https://news.kagi.com/%s.xml", category) 387 - feed, err := p.fetchFeed(feedURL) 388 - if err != nil { 389 - log.Printf("Failed to fetch %s: %v", feedURL, err) 390 - continue 391 - } 392 - p.handleFeed(ctx, category, feed) 393 - } 394 - case <-ctx.Done(): 395 - return nil 396 - } 397 - } 398 - } 399 - 400 - func (p *FeedPoller) fetchFeed(url string) (*gofeed.Feed, error) { 401 - parser := gofeed.NewParser() 402 - feed, err := parser.ParseURL(url) 403 - return feed, err 404 - } 405 - ``` 406 - 407 - **Libraries:** 408 - - `github.com/mmcdole/gofeed` - RSS/Atom parser 431 + **Implementation Status:** 432 + - ✅ Implemented with comprehensive error handling 433 + - ✅ Tests passing (5 tests) 434 + - ✅ 
Handles retries with exponential backoff 409 435 410 436 --- 411 437 412 - ### Component 2: Item Parser 438 + ### Component 2: HTML Parser (`html_parser.py`) ✅ COMPLETE 413 439 414 - **Responsibility:** Extract structured data from RSS item HTML 415 - 416 - ```go 417 - type KagiStory struct { 418 - Title string 419 - Link string 420 - GUID string 421 - PubDate time.Time 422 - Categories []string 423 - 424 - // Parsed from HTML description 425 - Summary string 426 - Highlights []string 427 - Perspectives []Perspective 428 - Quote *Quote 429 - Sources []Source 430 - ImageURL string 431 - ImageAlt string 432 - } 440 + **Responsibility:** Extract structured data from Kagi's HTML description field 433 441 434 - type Perspective struct { 435 - Actor string 436 - Description string 437 - SourceURL string 438 - } 442 + **Key Class:** `KagiHTMLParser` 439 443 440 - type Quote struct { 441 - Text string 442 - Attribution string 443 - } 444 + **Data Model (`models.py`):** 445 + ```python 446 + @dataclass 447 + class KagiStory: 448 + title: str 449 + link: str 450 + guid: str 451 + pub_date: datetime 452 + categories: List[str] 444 453 445 - type Source struct { 446 - Title string 447 - URL string 448 - Domain string 449 - } 454 + # Parsed from HTML 455 + summary: str 456 + highlights: List[str] 457 + perspectives: List[Perspective] 458 + quote: Optional[Quote] 459 + sources: List[Source] 460 + image_url: Optional[str] 461 + image_alt: Optional[str] 450 462 451 - func (p *ItemParser) Parse(item *gofeed.Item) (*KagiStory, error) { 452 - doc, err := goquery.NewDocumentFromReader(strings.NewReader(item.Description)) 453 - if err != nil { 454 - return nil, err 455 - } 463 + @dataclass 464 + class Perspective: 465 + actor: str 466 + description: str 467 + source_url: str 456 468 457 - story := &KagiStory{ 458 - Title: item.Title, 459 - Link: item.Link, 460 - GUID: item.GUID, 461 - PubDate: *item.PublishedParsed, 462 - Categories: item.Categories, 463 - } 469 + @dataclass 470 + 
class Quote: 471 + text: str 472 + attribution: str 464 473 465 - // Extract summary (first <p> tag) 466 - story.Summary = doc.Find("p").First().Text() 474 + @dataclass 475 + class Source: 476 + title: str 477 + url: str 478 + domain: str 479 + ``` 467 480 468 - // Extract highlights 469 - doc.Find("h3:contains('Highlights')").Next("ul").Find("li").Each(func(i int, s *goquery.Selection) { 470 - story.Highlights = append(story.Highlights, s.Text()) 471 - }) 481 + **Parsing Strategy:** 482 + - Use BeautifulSoup to parse HTML description 483 + - Extract sections by finding `<h3>` tags (Highlights, Perspectives, Sources) 484 + - Handle missing sections gracefully (not all stories have all sections) 485 + - Clean and normalize text 472 486 473 - // Extract perspectives 474 - doc.Find("h3:contains('Perspectives')").Next("ul").Find("li").Each(func(i int, s *goquery.Selection) { 475 - text := s.Text() 476 - link := s.Find("a").First() 477 - sourceURL, _ := link.Attr("href") 487 + **Implementation Status:** 488 + - ✅ Extracts all 3 H3 sections (Highlights, Perspectives, Sources) 489 + - ✅ Handles optional elements (quote, image) 490 + - ✅ Tests passing (8 tests) 491 + - ✅ Validates against real feed data 478 492 479 - // Parse format: "Actor: Description (Source)" 480 - parts := strings.SplitN(text, ":", 2) 481 - if len(parts) == 2 { 482 - story.Perspectives = append(story.Perspectives, Perspective{ 483 - Actor: strings.TrimSpace(parts[0]), 484 - Description: strings.TrimSpace(parts[1]), 485 - SourceURL: sourceURL, 486 - }) 487 - } 488 - }) 493 + --- 489 494 490 - // Extract quote 491 - doc.Find("blockquote").Each(func(i int, s *goquery.Selection) { 492 - text := s.Text() 493 - parts := strings.Split(text, " - ") 494 - if len(parts) == 2 { 495 - story.Quote = &Quote{ 496 - Text: strings.TrimSpace(parts[0]), 497 - Attribution: strings.TrimSpace(parts[1]), 498 - } 499 - } 500 - }) 495 + ### Component 3: State Manager (`state_manager.py`) ✅ COMPLETE 501 496 502 - // Extract 
sources 503 - doc.Find("h3:contains('Sources')").Next("ul").Find("li").Each(func(i int, s *goquery.Selection) { 504 - link := s.Find("a").First() 505 - url, _ := link.Attr("href") 506 - title := link.Text() 507 - domain := extractDomain(s.Text()) 497 + **Responsibility:** Track processed stories to prevent duplicates 508 498 509 - story.Sources = append(story.Sources, Source{ 510 - Title: title, 511 - URL: url, 512 - Domain: domain, 513 - }) 514 - }) 499 + **Implementation:** Simple JSON file persistence 515 500 516 - // Extract image 517 - img := doc.Find("img").First() 518 - if img.Length() > 0 { 519 - story.ImageURL, _ = img.Attr("src") 520 - story.ImageAlt, _ = img.Attr("alt") 501 + **State File Format:** 502 + ```json 503 + { 504 + "feeds": { 505 + "https://news.kagi.com/world.xml": { 506 + "last_successful_run": "2025-10-23T12:00:00Z", 507 + "posted_guids": [ 508 + "https://kite.kagi.com/uuid1/world/123", 509 + "https://kite.kagi.com/uuid2/world/124" 510 + ] 521 511 } 522 - 523 - return story, nil 512 + } 524 513 } 525 514 ``` 526 515 527 - **Libraries:** 528 - - `github.com/PuerkitoBio/goquery` - HTML parsing 516 + **Key Functions:** 517 + - `is_posted(feed_url: str, guid: str) -> bool` 518 + - `mark_posted(feed_url: str, guid: str, post_uri: str)` 519 + - `get_last_run(feed_url: str) -> Optional[datetime]` 520 + - `update_last_run(feed_url: str, timestamp: datetime)` 529 521 530 - --- 522 + **Deduplication Strategy:** 523 + - Keep last 100 GUIDs per feed (rolling window) 524 + - Stories older than 30 days are automatically removed 525 + - Simple, no database needed 531 526 532 - ### Component 3: Deduplication 527 + **Implementation Status:** 528 + - ✅ JSON-based persistence with atomic writes 529 + - ✅ GUID tracking with rolling window 530 + - ✅ Tests passing (12 tests) 531 + - ✅ Thread-safe operations 533 532 534 - **Responsibility:** Track posted stories to prevent duplicates 533 + --- 535 534 536 - ```go 537 - type Deduplicator struct { 538 - db *sql.DB 
539 - } 535 + ### Component 4: Rich Text Formatter (`richtext_formatter.py`) ✅ COMPLETE 540 536 541 - func (d *Deduplicator) AlreadyPosted(guid string) (bool, error) { 542 - var exists bool 543 - err := d.db.QueryRow(` 544 - SELECT EXISTS( 545 - SELECT 1 FROM kagi_news_posted_stories 546 - WHERE guid = $1 547 - ) 548 - `, guid).Scan(&exists) 549 - return exists, err 550 - } 537 + **Responsibility:** Format parsed Kagi stories into Coves rich text with facets 551 538 552 - func (d *Deduplicator) MarkPosted(guid, postURI string) error { 553 - _, err := d.db.Exec(` 554 - INSERT INTO kagi_news_posted_stories (guid, post_uri, posted_at) 555 - VALUES ($1, $2, NOW()) 556 - ON CONFLICT (guid) DO NOTHING 557 - `, guid, postURI) 558 - return err 559 - } 560 - ``` 539 + **Key Function:** 540 + - `format_full(story: KagiStory) -> dict` 541 + - Returns: `{"content": str, "facets": List[dict]}` 542 + - Builds plain text content with all sections 543 + - Calculates UTF-8 byte positions for facets 544 + - Applies bold, italic, and link facets 545 + - Includes all sections: summary, highlights, perspectives, quote, sources 546 + - Adds Kagi News attribution footer with link 561 547 562 - **Database Table:** 563 - ```sql 564 - CREATE TABLE kagi_news_posted_stories ( 565 - guid TEXT PRIMARY KEY, 566 - post_uri TEXT NOT NULL, 567 - posted_at TIMESTAMPTZ NOT NULL DEFAULT NOW() 568 - ); 548 + **Facet Types Applied:** 549 + - `social.coves.richtext.facet#bold` - Section headers, perspective actors 550 + - `social.coves.richtext.facet#italic` - Quotes 551 + - `social.coves.richtext.facet#link` - All URLs (sources, Kagi story link) 569 552 570 - CREATE INDEX idx_kagi_posted_at ON kagi_news_posted_stories(posted_at DESC); 571 - ``` 553 + **Key Challenge:** UTF-8 byte position calculation 554 + - Must handle multi-byte characters correctly (emoji, non-ASCII) 555 + - Use `str.encode('utf-8')` to get byte positions 556 + - Test with complex characters 572 557 573 - **Cleanup:** Periodic job 
deletes rows older than 30 days (Kagi unlikely to re-post old stories). 558 + **Implementation Status:** 559 + - ✅ Full rich text formatting with facets 560 + - ✅ UTF-8 byte position calculation working correctly 561 + - ✅ Tests passing (10 tests) 562 + - ✅ Handles all sections: summary, highlights, perspectives, quote, sources 574 563 575 564 --- 576 565 577 - ### Component 4: Category Mapper 566 + ### Component 5: Coves Client (`coves_client.py`) ✅ COMPLETE 578 567 579 - **Responsibility:** Map Kagi categories to authorized communities 568 + **Responsibility:** Handle authentication and post creation via Coves API 580 569 581 - ```go 582 - func (m *CategoryMapper) GetTargetCommunities(story *KagiStory) ([]*CommunityAuth, error) { 583 - // Get all communities that have authorized this aggregator 584 - allAuths, err := m.aggregator.GetAuthorizedCommunities(context.Background()) 585 - if err != nil { 586 - return nil, err 587 - } 570 + **Implementation Note:** Uses direct HTTP client instead of ATProto SDK for simplicity in MVP. 588 571 589 - var targets []*CommunityAuth 590 - for _, auth := range allAuths { 591 - if !auth.Enabled { 592 - continue 593 - } 572 + **Key Functions:** 573 + - `authenticate() -> dict` 574 + - Authenticates aggregator using credentials 575 + - Returns auth token for subsequent API calls 594 576 595 - config := auth.Config 577 + - `create_post(community_handle: str, title: str, content: str, facets: List[dict], ...) 
-> dict` 578 + - Calls Coves post creation endpoint 579 + - Includes aggregator authentication 580 + - Returns post URI and metadata 596 581 597 - // Check if story's primary category is in config.categories 598 - primaryCategory := story.Categories[0] 599 - if !contains(config["categories"], primaryCategory) { 600 - continue 601 - } 582 + **Authentication Flow:** 583 + - Load aggregator credentials from environment 584 + - Authenticate with Coves API 585 + - Store and use auth token for requests 586 + - Handle token refresh if needed 602 587 603 - // Check subcategory filter (if specified) 604 - if subcatFilter, ok := config["subcategoryFilter"].([]string); ok && len(subcatFilter) > 0 { 605 - if !hasAnySubcategory(story.Categories, subcatFilter) { 606 - continue 607 - } 608 - } 609 - 610 - // Check minimum sources requirement 611 - minSources := config["minSources"].(int) 612 - if len(story.Sources) < minSources { 613 - continue 614 - } 615 - 616 - targets = append(targets, auth) 617 - } 618 - 619 - return targets, nil 620 - } 621 - ``` 588 + **Implementation Status:** 589 + - ✅ HTTP-based client implementation 590 + - ✅ Authentication and token management 591 + - ✅ Post creation with all required fields 592 + - ✅ Error handling and retries 622 593 623 594 --- 624 595 625 - ### Component 5: Post Formatter 626 - 627 - **Responsibility:** Convert Kagi story to Coves post format 628 - 629 - ```go 630 - func (f *PostFormatter) Format(story *KagiStory, format string) string { 631 - switch format { 632 - case "full": 633 - return f.formatFull(story) 634 - case "summary": 635 - return f.formatSummary(story) 636 - case "minimal": 637 - return f.formatMinimal(story) 638 - default: 639 - return f.formatFull(story) 640 - } 641 - } 642 - 643 - func (f *PostFormatter) formatFull(story *KagiStory) string { 644 - var buf strings.Builder 645 - 646 - // Summary 647 - buf.WriteString(story.Summary) 648 - buf.WriteString("\n\n") 649 - 650 - // Highlights 651 - if 
len(story.Highlights) > 0 { 652 - buf.WriteString("**Highlights:**\n") 653 - for _, h := range story.Highlights { 654 - buf.WriteString(fmt.Sprintf("• %s\n", h)) 655 - } 656 - buf.WriteString("\n") 657 - } 596 + ### Component 6: Config Manager (`config.py`) ✅ COMPLETE 658 597 659 - // Perspectives 660 - if len(story.Perspectives) > 0 { 661 - buf.WriteString("**Perspectives:**\n") 662 - for _, p := range story.Perspectives { 663 - buf.WriteString(fmt.Sprintf("• **%s**: %s ([Source](%s))\n", p.Actor, p.Description, p.SourceURL)) 664 - } 665 - buf.WriteString("\n") 666 - } 598 + **Responsibility:** Load and validate configuration from YAML and environment 667 599 668 - // Quote 669 - if story.Quote != nil { 670 - buf.WriteString(fmt.Sprintf("> %s — %s\n\n", story.Quote.Text, story.Quote.Attribution)) 671 - } 600 + **Key Functions:** 601 + - `load_config(config_path: str) -> AggregatorConfig` 602 + - Loads YAML configuration 603 + - Validates structure and required fields 604 + - Merges with environment variables 605 + - Returns validated config object 672 606 673 - // Sources 674 - buf.WriteString("**Sources:**\n") 675 - for _, s := range story.Sources { 676 - buf.WriteString(fmt.Sprintf("• [%s](%s) - %s\n", s.Title, s.URL, s.Domain)) 677 - } 678 - buf.WriteString("\n") 607 + **Implementation Status:** 608 + - ✅ YAML parsing with validation 609 + - ✅ Environment variable support 610 + - ✅ Tests passing (3 tests) 611 + - ✅ Clear error messages for config issues 679 612 680 - // Attribution 681 - buf.WriteString(fmt.Sprintf("---\n📰 Story aggregated by [Kagi News](%s)", story.Link)) 613 + --- 682 614 683 - return buf.String() 684 - } 615 + ### Main Orchestration (`main.py`) ✅ COMPLETE 685 616 686 - func (f *PostFormatter) formatSummary(story *KagiStory) string { 687 - var buf strings.Builder 617 + **Responsibility:** Coordinate all components in a single execution (called by CRON) 688 618 689 - buf.WriteString(story.Summary) 690 - buf.WriteString("\n\n**Sources:**\n") 
691 - for _, s := range story.Sources { 692 - buf.WriteString(fmt.Sprintf("• [%s](%s) - %s\n", s.Title, s.URL, s.Domain)) 693 - } 694 - buf.WriteString("\n") 695 - buf.WriteString(fmt.Sprintf("---\n📰 Story aggregated by [Kagi News](%s)", story.Link)) 619 + **Flow (Single Run):** 620 + 1. Load configuration from `config.yaml` 621 + 2. Load environment variables (AGGREGATOR_DID, AGGREGATOR_PRIVATE_KEY) 622 + 3. Initialize all components (fetcher, parser, formatter, client, state) 623 + 4. For each enabled feed in config: 624 + a. Fetch RSS feed 625 + b. Parse all items 626 + c. Filter out already-posted items (check state) 627 + d. For each new item: 628 + - Parse HTML to structured KagiStory 629 + - Format post content with rich text facets 630 + - Build post record (with hot-linked image if present) 631 + - Create post via XRPC 632 + - Mark as posted in state 633 + e. Update last run timestamp 634 + 5. Save state to disk 635 + 6. Log summary (posts created, errors encountered) 636 + 7. Exit (CRON will call again on schedule) 696 637 697 - return buf.String() 698 - } 638 + **Error Isolation:** 639 + - Feed-level: One feed failing doesn't stop others 640 + - Item-level: One item failing doesn't stop feed processing 641 + - Continue on non-fatal errors, log all failures 642 + - Exit code 0 even with partial failures (CRON won't alert) 643 + - Exit code 1 only on catastrophic failure (config missing, auth failure) 699 644 700 - func (f *PostFormatter) formatMinimal(story *KagiStory) string { 701 - sourceDomains := make([]string, len(story.Sources)) 702 - for i, s := range story.Sources { 703 - sourceDomains[i] = s.Domain 704 - } 705 - 706 - return fmt.Sprintf( 707 - "%s\n\nRead more: %s\n\n**Sources:** %s\n\n---\n📰 Via [Kagi News](%s)", 708 - story.Title, 709 - story.Link, 710 - strings.Join(sourceDomains, ", "), 711 - story.Link, 712 - ) 713 - } 714 - ``` 645 + **Implementation Status:** 646 + - ✅ Complete orchestration logic implemented 647 + - ✅ Feed-level and 
item-level error isolation 648 + - ✅ Structured logging throughout 649 + - ✅ Tests passing (9 tests covering various scenarios) 650 + - ✅ Dry-run mode for testing 715 651 716 652 --- 717 653 718 - ### Component 6: Post Publisher 719 - 720 - **Responsibility:** Create posts via Coves API 654 + ## Deployment (Docker Compose with CRON) 721 655 722 - ```go 723 - func (p *PostPublisher) PublishStory(ctx context.Context, story *KagiStory, communities []*CommunityAuth) error { 724 - for _, comm := range communities { 725 - config := comm.Config 656 + ### Dockerfile 726 657 727 - // Format content based on config 728 - postFormat := config["postFormat"].(string) 729 - content := p.formatter.Format(story, postFormat) 730 - 731 - // Build embed 732 - var embed *aggregator.Embed 733 - if config["includeImages"].(bool) && story.ImageURL != "" { 734 - // TODO: Handle image upload/blob creation 735 - embed = &aggregator.Embed{ 736 - Type: "app.bsky.embed.external", 737 - External: &aggregator.External{ 738 - URI: story.Link, 739 - Title: story.Title, 740 - Description: truncate(story.Summary, 300), 741 - Thumb: story.ImageURL, // or blob reference 742 - }, 743 - } 744 - } 658 + ```dockerfile 659 + FROM python:3.11-slim 745 660 746 - // Create post 747 - post := aggregator.Post{ 748 - Title: story.Title, 749 - Content: content, 750 - Embed: embed, 751 - FederatedFrom: &aggregator.FederatedSource{ 752 - Platform: "kagi-news-rss", 753 - URI: story.Link, 754 - ID: story.GUID, 755 - OriginalCreatedAt: story.PubDate, 756 - }, 757 - ContentLabels: story.Categories, 758 - } 661 + WORKDIR /app 759 662 760 - err := p.aggregator.CreatePost(ctx, comm.CommunityDID, post) 761 - if err != nil { 762 - log.Printf("Failed to create post in %s: %v", comm.CommunityDID, err) 763 - continue 764 - } 663 + # Install cron 664 + RUN apt-get update && apt-get install -y cron && rm -rf /var/lib/apt/lists/* 765 665 766 - // Mark as posted 767 - _ = p.deduplicator.MarkPosted(story.GUID, 
"post-uri-from-response") 768 - } 666 + # Install dependencies 667 + COPY requirements.txt . 668 + RUN pip install --no-cache-dir -r requirements.txt 769 669 770 - return nil 771 - } 772 - ``` 670 + # Copy source code and scripts 671 + COPY src/ ./src/ 672 + COPY scripts/ ./scripts/ 673 + COPY crontab /etc/cron.d/kagi-news-cron 773 674 774 - --- 675 + # Set up cron 676 + RUN chmod 0644 /etc/cron.d/kagi-news-cron && \ 677 + crontab /etc/cron.d/kagi-news-cron && \ 678 + touch /var/log/cron.log 775 679 776 - ## Image Handling Strategy 680 + # Create non-root user for security 681 + RUN useradd --create-home appuser && \ 682 + chown -R appuser:appuser /app && \ 683 + chown appuser:appuser /var/log/cron.log 777 684 778 - ### Initial Implementation (MVP) 685 + USER appuser 779 686 780 - **Approach:** Use Kagi proxy URLs directly in embeds 687 + # Run cron in foreground 688 + CMD ["cron", "-f"] 689 + ``` 781 690 782 - **Rationale:** 783 - - Simplest implementation 784 - - Kagi proxy likely allows hotlinking for non-commercial use 785 - - No storage costs 786 - - Images are already optimized by Kagi 691 + ### Crontab Configuration (`crontab`) 787 692 788 - **Risk Mitigation:** 789 - - Monitor for broken images 790 - - Add fallback: if image fails to load, skip embed 791 - - Prepare migration plan to self-hosting if needed 693 + ```bash 694 + # Run Kagi News aggregator daily at 1 PM UTC (after Kagi updates around noon) 695 + 0 13 * * * cd /app && /usr/local/bin/python -m src.main >> /var/log/cron.log 2>&1 792 696 793 - **Code:** 794 - ```go 795 - if config["includeImages"].(bool) && story.ImageURL != "" { 796 - // Use Kagi proxy URL directly 797 - embed = &aggregator.Embed{ 798 - External: &aggregator.External{ 799 - Thumb: story.ImageURL, // https://kagiproxy.com/img/... 
800 - }, 801 - } 802 - } 697 + # Blank line required at end of crontab 803 698 ``` 804 699 805 700 --- 806 701 807 - ### Future Enhancement (If Issues Arise) 702 + ### docker-compose.yml 808 703 809 - **Approach:** Download and re-host images 704 + ```yaml 705 + version: '3.8' 810 706 811 - **Implementation:** 812 - 1. Download image from Kagi proxy 813 - 2. Upload to Coves blob storage (or S3/CDN) 814 - 3. Use blob reference in embed 707 + services: 708 + kagi-news-aggregator: 709 + build: . 710 + container_name: kagi-news-aggregator 711 + restart: unless-stopped 815 712 816 - **Code:** 817 - ```go 818 - func (p *PostPublisher) uploadImage(imageURL string) (string, error) { 819 - // Download from Kagi proxy 820 - resp, err := http.Get(imageURL) 821 - if err != nil { 822 - return "", err 823 - } 824 - defer resp.Body.Close() 713 + environment: 714 + # Aggregator identity (from aggregator creation) 715 + - AGGREGATOR_DID=${AGGREGATOR_DID} 716 + - AGGREGATOR_PRIVATE_KEY=${AGGREGATOR_PRIVATE_KEY} 825 717 826 - // Upload to blob storage 827 - blob, err := p.blobStore.Upload(resp.Body, resp.Header.Get("Content-Type")) 828 - if err != nil { 829 - return "", err 830 - } 718 + volumes: 719 + # Config file (read-only) 720 + - ./config.yaml:/app/config.yaml:ro 721 + # State file (read-write for deduplication) 722 + - ./data/state.json:/app/data/state.json 831 723 832 - return blob.Ref, nil 833 - } 724 + logging: 725 + driver: "json-file" 726 + options: 727 + max-size: "10m" 728 + max-file: "3" 834 729 ``` 835 730 836 - **Decision Point:** Only implement if: 837 - - Kagi blocks hotlinking 838 - - Kagi proxy becomes unreliable 839 - - Legal clarification needed 731 + **Environment Variables:** 732 + - `AGGREGATOR_DID`: PLC DID created for this aggregator instance 733 + - `AGGREGATOR_PRIVATE_KEY`: Base64-encoded private key for signing 840 734 841 - --- 735 + **Volumes:** 736 + - `config.yaml`: Feed-to-community mappings (user-editable) 737 + - `data/state.json`: Deduplication 
state (managed by aggregator) 842 738 843 - ## Rate Limiting & Performance 739 + **Deployment:** 740 + ```bash 741 + # On same host as Coves 742 + cd aggregators/kagi-news 743 + cp config.example.yaml config.yaml 744 + # Edit config.yaml with your feed mappings 844 745 845 - ### Rate Limits 746 + # Set environment variables 747 + export AGGREGATOR_DID="did:plc:xyz..." 748 + export AGGREGATOR_PRIVATE_KEY="base64-key..." 846 749 847 - **RSS Fetching:** 848 - - Poll each category feed every 15 minutes 849 - - Max 4 categories = 4 requests per 15 min = 16 req/hour 850 - - Well within any reasonable limit 750 + # Start aggregator 751 + docker-compose up -d 851 752 852 - **Post Creation:** 853 - - Aggregator rate limit: 10 posts/hour per community 854 - - Global limit: 100 posts/hour across all communities 855 - - Kagi News publishes ~5-10 stories per category per day 856 - - = ~20-40 posts/day total across all categories 857 - - = ~2-4 posts/hour average 858 - - Well within limits 859 - 860 - **Performance Targets:** 861 - - Story posted within 15 minutes of appearing in RSS feed 862 - - < 1 second to parse and format a story 863 - - < 500ms to publish a post via API 753 + # View logs 754 + docker-compose logs -f 755 + ``` 864 756 865 757 --- 866 758 867 - ## Monitoring & Observability 759 + ## Image Handling Strategy (MVP) 868 760 869 - ### Metrics to Track 761 + ### Approach: Hot-Linked Images via External Embed 870 762 871 - **Feed Polling:** 872 - - `kagi_feed_poll_total` (counter) - Total feed polls by category 873 - - `kagi_feed_poll_errors` (counter) - Failed polls by category/error 874 - - `kagi_feed_items_fetched` (gauge) - Items per poll by category 875 - - `kagi_feed_poll_duration_seconds` (histogram) - Poll latency 763 + The MVP uses hot-linked images from Kagi's proxy: 876 764 877 - **Story Processing:** 878 - - `kagi_stories_parsed_total` (counter) - Successfully parsed stories 879 - - `kagi_stories_parse_errors` (counter) - Parse failures by error type 
880 - - `kagi_stories_filtered` (counter) - Stories filtered out by reason (duplicate, min sources, category) 881 - - `kagi_stories_posted` (counter) - Stories successfully posted by community 765 + **Flow:** 766 + 1. Extract image URL from HTML description (`https://kagiproxy.com/img/...`) 767 + 2. Include in post using `social.coves.embed.external`: 768 + ```json 769 + { 770 + "$type": "social.coves.embed.external", 771 + "external": { 772 + "uri": "{Kagi story URL}", 773 + "title": "{Story title}", 774 + "description": "{Summary excerpt}", 775 + "thumb": "{Kagi proxy image URL}" 776 + } 777 + } 778 + ``` 779 + 3. Frontend renders image from Kagi proxy URL 882 780 883 - **Post Publishing:** 884 - - `kagi_posts_created_total` (counter) - Total posts created 885 - - `kagi_posts_failed` (counter) - Failed posts by error type 886 - - `kagi_post_publish_duration_seconds` (histogram) - Post creation latency 781 + **Rationale:** 782 + - Simpler MVP implementation (no blob upload complexity) 783 + - No storage requirements on our end 784 + - Kagi proxy is reliable and CDN-backed 785 + - Faster posting (no download/upload step) 786 + - Images already properly sized and optimized 887 787 888 - **Health:** 889 - - `kagi_aggregator_up` (gauge) - Service health (1 = healthy, 0 = down) 890 - - `kagi_last_successful_poll_timestamp` (gauge) - Last successful poll time by category 788 + **Future Consideration:** If Kagi proxy becomes unreliable, migrate to blob storage in Phase 2. 
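The flow above could be sketched roughly as follows. This is a minimal illustration, not the aggregator's actual parser: `extract_image_url` and `build_external_embed` are hypothetical names, and omitting `thumb` when no image is found is an assumption. Only the embed shape is taken from the JSON shown above.

```python
from html.parser import HTMLParser
from typing import Optional


class _FirstImageFinder(HTMLParser):
    """Records the src attribute of the first <img> tag encountered."""

    def __init__(self) -> None:
        super().__init__()
        self.src: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; keep only the first image.
        if tag == "img" and self.src is None:
            self.src = dict(attrs).get("src")


def extract_image_url(description_html: str) -> Optional[str]:
    """Step 1: pull the image URL out of the RSS item's HTML description."""
    finder = _FirstImageFinder()
    finder.feed(description_html)
    return finder.src


def build_external_embed(
    story_url: str, title: str, summary: str, image_url: Optional[str] = None
) -> dict:
    """Step 2: wrap the story in a social.coves.embed.external record."""
    external = {"uri": story_url, "title": title, "description": summary}
    if image_url:
        # Hot-link: the proxy URL is passed through as-is, nothing is uploaded.
        external["thumb"] = image_url
    return {"$type": "social.coves.embed.external", "external": external}
```

Because the image URL is passed through unchanged, the aggregator never touches image bytes; step 3 (rendering) is left entirely to the frontend.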
891 789 892 790 --- 893 791 894 - ### Logging 792 + ## Rate Limiting & Performance (MVP) 895 793 896 - **Structured Logging:** 897 - ```go 898 - log.Info("Story posted", 899 - "guid", story.GUID, 900 - "title", story.Title, 901 - "community", comm.CommunityDID, 902 - "post_uri", postURI, 903 - "sources", len(story.Sources), 904 - "format", postFormat, 905 - ) 794 + ### Simplified Rate Strategy 906 795 907 - log.Error("Failed to parse story", 908 - "guid", item.GUID, 909 - "feed", feedURL, 910 - "error", err, 911 - ) 912 - ``` 796 + **RSS Fetching:** 797 + - Poll each feed once per day (~noon UTC after Kagi updates) 798 + - No aggressive polling needed (Kagi updates daily) 799 + - ~3-5 feeds = minimal load 913 800 914 - **Log Levels:** 915 - - DEBUG: Feed items, parsing details 916 - - INFO: Stories posted, communities targeted 917 - - WARN: Parse errors, rate limit approaching 918 - - ERROR: Failed posts, feed fetch failures 801 + **Post Creation:** 802 + - One run per day = 5-15 posts per feed 803 + - Total: ~15-75 posts/day across all communities 804 + - Well within any reasonable rate limits 919 805 920 - --- 806 + **Performance:** 807 + - RSS fetch + parse: < 5 seconds per feed 808 + - Image handling: no download or upload step (images are hot-linked from Kagi's proxy) 809 + - Post creation: < 1 second per post 810 + - Total runtime per day: < 5 minutes 921 811 922 - ### Alerts 923 - 924 - **Critical:** 925 - - Feed polling failing for > 1 hour 926 - - Post creation failing for > 10 consecutive attempts 927 - - Aggregator unauthorized (auth record disabled/deleted) 928 - 929 - **Warning:** 930 - - Post creation rate < 50% of expected 931 - - Parse errors > 10% of items 932 - - Approaching rate limits (> 80% of quota) 812 + No complex rate limiting needed for MVP.
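As a quick sanity check on the figures above, the daily volume and a worst-case runtime can be worked out from the stated bounds. The helper names are illustrative only, and the runtime estimate counts just the fetch/parse and post-creation bounds, on the assumption that hot-linked images add no per-image work.

```python
def daily_post_range(feeds_min: int, feeds_max: int,
                     posts_min: int, posts_max: int) -> tuple:
    """Lowest and highest total posts per day across all feeds."""
    return feeds_min * posts_min, feeds_max * posts_max


def worst_case_runtime_seconds(feeds: int, posts: int,
                               fetch_s: float = 5.0, post_s: float = 1.0) -> float:
    """Upper bound on one run: every feed and every post hits its time limit."""
    return feeds * fetch_s + posts * post_s


# 3-5 feeds at 5-15 posts each gives the 15-75 posts/day quoted above.
low, high = daily_post_range(3, 5, 5, 15)

# 5 feeds and 75 posts at the stated per-item limits.
runtime = worst_case_runtime_seconds(feeds=5, posts=high)
```

With 5 feeds and 75 posts this comes to 5 × 5 s + 75 × 1 s = 100 s, comfortably inside the 5-minute budget.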
933 813 934 814 --- 935 815 936 - ## Deployment 937 - 938 - ### Infrastructure 939 - 940 - **Service Type:** Long-running daemon 816 + ## Logging & Observability (MVP) 941 817 942 - **Hosting:** Kubernetes (same cluster as Coves AppView) 818 + ### Structured Logging 943 819 944 - **Resources:** 945 - - CPU: 0.5 cores (low CPU usage, mostly I/O) 946 - - Memory: 512 MB (small in-memory cache for recent GUIDs) 947 - - Storage: 1 GB (SQLite for deduplication tracking) 820 + **Python logging module** with JSON formatter: 948 821 949 - --- 822 + ```python 823 + import logging 824 + import json 950 825 951 - ### Configuration 826 + logging.basicConfig( 827 + level=logging.INFO, 828 + format='%(message)s' 829 + ) 952 830 953 - **Environment Variables:** 954 - ```bash 955 - # Aggregator identity 956 - AGGREGATOR_DID=did:web:kagi-news.coves.social 957 - AGGREGATOR_PRIVATE_KEY_PATH=/secrets/private-key.pem 831 + logger = logging.getLogger(__name__) 958 832 959 - # Coves API 960 - COVES_API_URL=https://api.coves.social 833 + # Example structured log 834 + logger.info(json.dumps({ 835 + "event": "post_created", 836 + "feed": "world.xml", 837 + "story_title": "Breaking News...", 838 + "community": "world-news.coves.social", 839 + "post_uri": "at://...", 840 + "timestamp": "2025-10-23T12:00:00Z" 841 + })) 842 + ``` 961 843 962 - # Feed polling 963 - POLL_INTERVAL=15m 964 - CATEGORIES=world,tech,business,sports 844 + **Key Events to Log:** 845 + - `feed_fetched`: RSS feed successfully fetched 846 + - `story_parsed`: Story successfully parsed from HTML 847 + - `post_created`: Post successfully created 848 + - `error`: Any failures (with context) 849 + - `run_completed`: Summary of entire run 965 850 966 - # Database (for deduplication) 967 - DB_PATH=/data/kagi-news.db 851 + **Log Levels:** 852 + - INFO: Successful operations 853 + - WARNING: Retryable errors, skipped items 854 + - ERROR: Fatal errors, failed posts 968 855 969 - # Monitoring 970 - METRICS_PORT=9090 971 - 
LOG_LEVEL=info 972 - ``` 856 + ### Simple Monitoring 973 857 974 - --- 858 + **Health Check:** Check last successful run timestamp 859 + - If > 48 hours: alert (should run daily) 860 + - If errors > 50% of items: investigate 975 861 976 - ### Deployment Manifest 862 + **Metrics to Track (manually via logs):** 863 + - Posts created per run 864 + - Parse failures per run 865 + - Post creation failures per run 866 + - Total runtime 977 867 978 - ```yaml 979 - apiVersion: apps/v1 980 - kind: Deployment 981 - metadata: 982 - name: kagi-news-aggregator 983 - namespace: coves 984 - spec: 985 - replicas: 1 986 - selector: 987 - matchLabels: 988 - app: kagi-news-aggregator 989 - template: 990 - metadata: 991 - labels: 992 - app: kagi-news-aggregator 993 - spec: 994 - containers: 995 - - name: aggregator 996 - image: coves/kagi-news-aggregator:latest 997 - env: 998 - - name: AGGREGATOR_DID 999 - value: did:web:kagi-news.coves.social 1000 - - name: COVES_API_URL 1001 - value: https://api.coves.social 1002 - - name: POLL_INTERVAL 1003 - value: 15m 1004 - - name: CATEGORIES 1005 - value: world,tech,business,sports 1006 - - name: DB_PATH 1007 - value: /data/kagi-news.db 1008 - - name: AGGREGATOR_PRIVATE_KEY_PATH 1009 - value: /secrets/private-key.pem 1010 - volumeMounts: 1011 - - name: data 1012 - mountPath: /data 1013 - - name: secrets 1014 - mountPath: /secrets 1015 - readOnly: true 1016 - ports: 1017 - - name: metrics 1018 - containerPort: 9090 1019 - resources: 1020 - requests: 1021 - cpu: 250m 1022 - memory: 256Mi 1023 - limits: 1024 - cpu: 500m 1025 - memory: 512Mi 1026 - volumes: 1027 - - name: data 1028 - persistentVolumeClaim: 1029 - claimName: kagi-news-data 1030 - - name: secrets 1031 - secret: 1032 - secretName: kagi-news-private-key 1033 - ``` 868 + No complex metrics infrastructure needed for MVP - Docker logs are sufficient. 
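The event logging and the two health-check rules above could be wrapped in a few small helpers. This is a sketch under stated assumptions: `log_event`, `needs_alert`, and `needs_investigation` are hypothetical names, and the logger name `kagi_news` is made up for illustration. The 48-hour and 50% thresholds are the ones listed above.

```python
import json
import logging
from datetime import datetime, timedelta, timezone

logger = logging.getLogger("kagi_news")  # hypothetical logger name


def log_event(event: str, level: int = logging.INFO, **fields) -> str:
    """Emit one JSON log line for an event and return the serialized line."""
    record = {
        "event": event,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        **fields,
    }
    line = json.dumps(record)
    logger.log(level, line)
    return line


def needs_alert(last_success: datetime, now: datetime,
                max_age: timedelta = timedelta(hours=48)) -> bool:
    """True when the last successful run is older than the allowed age."""
    return now - last_success > max_age


def needs_investigation(errors: int, items: int) -> bool:
    """True when more than half of the processed items failed."""
    return items > 0 and errors / items > 0.5
```

A run might end with `log_event("run_completed", posts_created=12, parse_failures=1)`, which gives the manual metrics above a single line per run to grep for in the Docker logs.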
1034 869 1035 870 --- 1036 871 1037 - ## Testing Strategy 872 + ## Testing Strategy ✅ COMPLETE 1038 873 1039 - ### Unit Tests 874 + ### Unit Tests - 57 Tests Passing (83% Coverage) 1040 875 1041 - **Feed Parsing:** 1042 - ```go 1043 - func TestParseFeed(t *testing.T) { 1044 - feed := loadTestFeed("testdata/world.xml") 1045 - stories, err := parser.Parse(feed) 1046 - assert.NoError(t, err) 1047 - assert.Len(t, stories, 10) 876 + **Test Coverage by Component:** 877 + - ✅ **RSS Fetcher** (5 tests) 878 + - Successful feed fetch 879 + - Timeout handling 880 + - Retry logic with exponential backoff 881 + - Invalid XML handling 882 + - Empty URL validation 1048 883 1049 - story := stories[0] 1050 - assert.NotEmpty(t, story.Title) 1051 - assert.NotEmpty(t, story.Summary) 1052 - assert.Greater(t, len(story.Sources), 1) 1053 - } 884 + - ✅ **HTML Parser** (8 tests) 885 + - Summary extraction 886 + - Image URL and alt text extraction 887 + - Highlights list parsing 888 + - Quote extraction with attribution 889 + - Perspectives parsing with actors and sources 890 + - Sources list extraction 891 + - Missing sections handling 892 + - Full story object creation 1054 893 1055 - func TestParseStoryHTML(t *testing.T) { 1056 - html := `<p>Summary [source.com#1]</p> 1057 - <h3>Highlights:</h3> 1058 - <ul><li>Point 1</li></ul> 1059 - <h3>Sources:</h3> 1060 - <ul><li><a href="https://example.com">Title</a> - example.com</li></ul>` 894 + - ✅ **Rich Text Formatter** (10 tests) 895 + - Full format generation 896 + - Bold facets on headers and actors 897 + - Italic facets on quotes 898 + - Link facets on URLs 899 + - UTF-8 byte position calculation 900 + - Multi-byte character handling (emoji, special chars) 901 + - All sections formatted correctly 1061 902 1062 - story, err := parser.ParseHTML(html) 1063 - assert.NoError(t, err) 1064 - assert.Equal(t, "Summary [source.com#1]", story.Summary) 1065 - assert.Len(t, story.Highlights, 1) 1066 - assert.Len(t, story.Sources, 1) 1067 - } 1068 - ``` 
903 + - ✅ **State Manager** (12 tests) 904 + - GUID tracking 905 + - Duplicate detection 906 + - Rolling window (100 GUID limit) 907 + - Age-based cleanup (30 days) 908 + - Last run timestamp tracking 909 + - JSON persistence 910 + - Atomic file writes 911 + - Concurrent access safety 1069 912 1070 - **Formatting:** 1071 - ```go 1072 - func TestFormatFull(t *testing.T) { 1073 - story := &KagiStory{ 1074 - Summary: "Test summary", 1075 - Sources: []Source{ 1076 - {Title: "Article", URL: "https://example.com", Domain: "example.com"}, 1077 - }, 1078 - } 913 + - ✅ **Config Manager** (3 tests) 914 + - YAML loading and validation 915 + - Environment variable merging 916 + - Error handling for missing/invalid config 1079 917 1080 - content := formatter.Format(story, "full") 1081 - assert.Contains(t, content, "Test summary") 1082 - assert.Contains(t, content, "**Sources:**") 1083 - assert.Contains(t, content, "📰 Story aggregated by") 1084 - } 1085 - ``` 918 + - ✅ **Main Orchestrator** (9 tests) 919 + - End-to-end flow 920 + - Feed-level error isolation 921 + - Item-level error isolation 922 + - Dry-run mode 923 + - State persistence across runs 924 + - Multiple feed handling 1086 925 1087 - **Deduplication:** 1088 - ```go 1089 - func TestDeduplication(t *testing.T) { 1090 - guid := "test-guid-123" 926 + - ✅ **E2E Tests** (6 skipped - require live API) 927 + - Integration with Coves API (manual testing required) 928 + - Authentication flow 929 + - Post creation 1091 930 1092 - posted, err := deduplicator.AlreadyPosted(guid) 1093 - assert.NoError(t, err) 1094 - assert.False(t, posted) 1095 - 1096 - err = deduplicator.MarkPosted(guid, "at://...") 1097 - assert.NoError(t, err) 1098 - 1099 - posted, err = deduplicator.AlreadyPosted(guid) 1100 - assert.NoError(t, err) 1101 - assert.True(t, posted) 1102 - } 931 + **Test Results:** 932 + ``` 933 + 57 passed, 6 skipped, 1 warning in 8.76s 934 + Coverage: 83% 1103 935 ``` 1104 936 1105 - --- 937 + **Test Fixtures:** 938 + - Real 
Kagi News RSS item with all sections 939 + - Sample HTML descriptions 940 + - Mock HTTP responses 1106 941 1107 942 ### Integration Tests 1108 943 1109 - **With Mock Coves API:** 1110 - ```go 1111 - func TestPublishStory(t *testing.T) { 1112 - // Setup mock Coves API 1113 - mockAPI := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { 1114 - assert.Equal(t, "/xrpc/social.coves.post.create", r.URL.Path) 944 + **Manual Integration Testing Required:** 945 + - [ ] Can authenticate with live Coves API 946 + - [ ] Can create post via Coves API 947 + - [ ] Can fetch real Kagi RSS feed 948 + - [ ] Images display correctly from Kagi proxy 949 + - [ ] State persistence works in production 950 + - [ ] CRON scheduling works correctly 1115 951 1116 - var input CreatePostInput 1117 - json.NewDecoder(r.Body).Decode(&input) 952 + **Pre-deployment Checklist:** 953 + - [x] All unit tests passing 954 + - [x] Can parse real Kagi HTML 955 + - [x] State persistence works 956 + - [x] Config validation works 957 + - [x] Error handling comprehensive 958 + - [ ] Aggregator DID created 959 + - [ ] Can authenticate with Coves API 960 + - [ ] Docker container builds and runs 1118 961 1119 - assert.Equal(t, "did:plc:test-community", input.Community) 1120 - assert.NotEmpty(t, input.Title) 1121 - assert.Contains(t, input.Content, "📰 Story aggregated by") 962 + --- 1122 963 1123 - w.WriteHeader(200) 1124 - json.NewEncoder(w).Encode(CreatePostOutput{URI: "at://..."}) 1125 - })) 1126 - defer mockAPI.Close() 964 + ## Success Metrics 1127 965 1128 - // Test story publishing 1129 - publisher := NewPostPublisher(mockAPI.URL) 1130 - err := publisher.PublishStory(ctx, testStory, []*CommunityAuth{testComm}) 1131 - assert.NoError(t, err) 1132 - } 1133 - ``` 966 + ### ✅ Phase 1: Implementation - COMPLETE 1134 967 1135 - --- 968 + - [x] All core components implemented 969 + - [x] 57 tests passing with 83% coverage 970 + - [x] RSS fetching and parsing working 971 + - [x] Rich 
text formatting with facets 972 + - [x] State management and deduplication 973 + - [x] Configuration management 974 + - [x] Comprehensive error handling 975 + - [x] Documentation complete 1136 976 1137 - ### E2E Tests 977 + ### 🔄 Phase 2: Integration Testing - IN PROGRESS 1138 978 1139 - **With Real RSS Feed:** 1140 - ```go 1141 - func TestE2E_FetchAndParse(t *testing.T) { 1142 - if testing.Short() { 1143 - t.Skip("Skipping E2E test") 1144 - } 979 + - [ ] Aggregator DID created (PLC) 980 + - [ ] Aggregator authorized in 1+ test communities 981 + - [ ] Can authenticate with Coves API 982 + - [ ] First post created end-to-end 983 + - [ ] Attribution visible ("Via Kagi News") 984 + - [ ] No duplicate posts on repeated runs 985 + - [ ] Images display correctly 1145 986 1146 - // Fetch real Kagi News feed 1147 - feed, err := poller.fetchFeed("https://news.kagi.com/world.xml") 1148 - assert.NoError(t, err) 1149 - assert.NotEmpty(t, feed.Items) 987 + ### 📋 Phase 3: Alpha Deployment (First Week) 1150 988 1151 - // Parse first item 1152 - story, err := parser.Parse(feed.Items[0]) 1153 - assert.NoError(t, err) 1154 - assert.NotEmpty(t, story.Title) 1155 - assert.NotEmpty(t, story.Summary) 1156 - assert.Greater(t, len(story.Sources), 0) 1157 - } 1158 - ``` 1159 - 1160 - **With Test Coves Instance:** 1161 - ```go 1162 - func TestE2E_CreatePost(t *testing.T) { 1163 - if testing.Short() { 1164 - t.Skip("Skipping E2E test") 1165 - } 1166 - 1167 - // Create post in test community 1168 - post := aggregator.Post{ 1169 - Title: "Test Kagi News Post", 1170 - Content: "Test content...", 1171 - } 989 + - [ ] Docker Compose runs successfully in production 990 + - [ ] 2-3 communities receiving posts 991 + - [ ] 20+ posts created successfully 992 + - [ ] Zero duplicates 993 + - [ ] < 10% errors (parse or post creation) 994 + - [ ] CRON scheduling reliable 1172 995 1173 - err := aggregator.CreatePost(ctx, testCommunityDID, post) 1174 - assert.NoError(t, err) 996 + ### 🎯 Phase 4: Beta (First 
Month) 1175 997 1176 - // Verify post appears in feed 1177 - // (requires test community setup) 1178 - } 1179 - ``` 998 + - [ ] 5+ communities using aggregator 999 + - [ ] 200+ posts created 1000 + - [ ] Positive community feedback 1001 + - [ ] No rate limit issues 1002 + - [ ] < 5% error rate 1003 + - [ ] Performance metrics tracked 1180 1004 1181 1005 --- 1182 1006 1183 - ## Success Metrics 1007 + ## What's Next: Integration & Deployment 1184 1008 1185 - ### Pre-Launch Checklist 1009 + ### Immediate Next Steps 1186 1010 1187 - - [ ] Aggregator service declaration published 1188 - - [ ] DID created and configured (did:web:kagi-news.coves.social) 1189 - - [ ] RSS feed parser handles all Kagi HTML structures 1190 - - [ ] Deduplication prevents duplicate posts 1191 - - [ ] Category mapping works for all configs 1192 - - [ ] All 3 post formats render correctly 1193 - - [ ] Attribution to Kagi News visible on all posts 1194 - - [ ] Rate limiting prevents spam 1195 - - [ ] Monitoring/alerting configured 1196 - - [ ] E2E tests passing against test instance 1011 + 1. **Create Aggregator Identity** 1012 + - Generate DID for aggregator 1013 + - Store credentials securely 1014 + - Test authentication with Coves API 1197 1015 1198 - --- 1016 + 2. **Integration Testing** 1017 + - Test with live Coves API 1018 + - Verify post creation works 1019 + - Validate rich text rendering 1020 + - Check image display from Kagi proxy 1199 1021 1200 - ### Alpha Goals (First Week) 1022 + 3. **Docker Deployment** 1023 + - Build Docker image 1024 + - Test docker-compose setup 1025 + - Verify CRON scheduling 1026 + - Set up monitoring/logging 1201 1027 1202 - - [ ] 3+ communities using Kagi News aggregator 1203 - - [ ] 50+ posts created successfully 1204 - - [ ] Zero duplicate posts 1205 - - [ ] < 5% parse errors 1206 - - [ ] < 1% post creation failures 1207 - - [ ] Stories posted within 15 minutes of RSS publication 1028 + 4. 
**Community Authorization** 1029 + - Get aggregator authorized in test community 1030 + - Verify authorization flow works 1031 + - Test posting to real community 1208 1032 1209 - --- 1033 + 5. **Production Deployment** 1034 + - Deploy to production server 1035 + - Configure feeds for real communities 1036 + - Monitor first batch of posts 1037 + - Gather community feedback 1210 1038 1211 - ### Beta Goals (First Month) 1039 + ### Open Questions to Resolve 1212 1040 1213 - - [ ] 10+ communities using aggregator 1214 - - [ ] 500+ posts created 1215 - - [ ] Community feedback positive (surveys) 1216 - - [ ] Attribution compliance verified 1217 - - [ ] No rate limit violations 1218 - - [ ] < 1% error rate (parsing + posting) 1041 + 1. **Aggregator DID Creation:** 1042 + - Need helper script or manual process? 1043 + - Where to store credentials securely? 1219 1044 1220 - --- 1045 + 2. **Authorization Flow:** 1046 + - How does community admin authorize aggregator? 1047 + - UI flow or XRPC endpoint? 1221 1048 1222 - ## Future Enhancements 1049 + 3. **Image Strategy:** 1050 + - Confirm Kagi proxy images work reliably 1051 + - Fallback plan if proxy becomes unreliable? 1223 1052 1224 - ### Phase 2 Features 1053 + 4. **Monitoring:** 1054 + - What metrics to track initially? 1055 + - Alerting strategy for failures? 
1225 1056 1226 - **Smart Category Detection:** 1227 - - Use LLM to suggest additional categories for stories 1228 - - Map Kagi categories to community tags automatically 1057 + --- 1229 1058 1230 - **Customizable Templates:** 1231 - - Allow communities to customize post format with templates 1232 - - Support Markdown/Handlebars templates in config 1059 + ## Future Enhancements (Post-MVP) 1233 1060 1234 - **Story Scoring:** 1235 - - Prioritize high-impact stories (many sources, breaking news) 1236 - - Delay low-priority stories to avoid flooding feed 1061 + ### Phase 2 1062 + - Multiple post formats (summary, minimal) 1063 + - Per-community filtering (subcategories, min sources) 1064 + - More sophisticated deduplication 1065 + - Metrics dashboard 1237 1066 1238 - **Cross-posting Prevention:** 1239 - - Detect when multiple communities authorize same category 1240 - - Intelligently cross-post vs. duplicate 1067 + ### Phase 3 1068 + - Interactive features (bot responds to comments) 1069 + - Cross-posting prevention 1070 + - Federation support 1241 1071 1242 1072 --- 1243 1073 1244 - ### Phase 3 Features 1074 + ## References 1245 1075 1246 - **Interactive Features:** 1247 - - Bot responds to comments with additional sources 1248 - - Updates megathread with new sources as story develops 1249 - 1250 - **Analytics Dashboard:** 1251 - - Show communities which stories get most engagement 1252 - - Trending topics from Kagi News 1253 - - Source diversity metrics 1254 - 1255 - **Federation:** 1256 - - Support other Coves instances using same aggregator 1257 - - Shared deduplication across instances 1076 + - Kagi News About: https://news.kagi.com/about 1077 + - Kagi News RSS: https://news.kagi.com/world.xml 1078 + - CC BY-NC License: https://creativecommons.org/licenses/by-nc/4.0/ 1079 + - Parent PRD: [PRD_AGGREGATORS.md](PRD_AGGREGATORS.md) 1080 + - ATProto Python SDK: https://github.com/MarshalX/atproto 1081 + - Implementation: [aggregators/kagi-news/](/aggregators/kagi-news/) 
1258 1082 1259 1083 --- 1260 1084 1261 - ## Open Questions 1085 + ## Implementation Summary 1262 1086 1263 - ### Need to Resolve Before Launch 1087 + **Phase 1 Status:** ✅ **COMPLETE** 1264 1088 1265 - 1. **Image Licensing:** 1266 - - ❓ Are images from Kagi proxy covered by CC BY-NC? 1267 - - ❓ Do we need to attribute original image sources? 1268 - - **Action:** Email support@kagi.com for clarification 1089 + The Kagi News RSS Aggregator implementation is complete and ready for integration testing and deployment. All 7 core components have been implemented with comprehensive test coverage (57 tests, 83% coverage). 1269 1090 1270 - 2. **Hotlinking Policy:** 1271 - - ❓ Is embedding Kagi proxy images acceptable? 1272 - - ❓ Should we download and re-host? 1273 - - **Action:** Test in staging, monitor for issues 1091 + **What Was Built:** 1092 + - Complete RSS feed fetching and parsing pipeline 1093 + - HTML parser that extracts all structured data from Kagi News feeds (summary, highlights, perspectives, quote, sources) 1094 + - Rich text formatter with proper facets for Coves 1095 + - State management system for deduplication 1096 + - Configuration management with YAML and environment variables 1097 + - HTTP client for Coves API authentication and post creation 1098 + - Main orchestrator with robust error handling 1099 + - Comprehensive test suite with real feed fixtures 1100 + - Documentation and example configurations 1274 1101 1275 - 3. **Category Discovery:** 1276 - - ❓ How to discover all available category feeds? 1277 - - ❓ Are there categories beyond world/tech/business/sports? 
1278 - - **Action:** Scrape https://news.kagi.com/ for all .xml links 1102 + **Key Findings:** 1103 + - Kagi News RSS feeds contain only 3 structured sections (Highlights, Perspectives, Sources) 1104 + - Historical context is woven into the summary and highlights, not a separate section 1105 + - Timeline feature visible on Kagi website is not in the RSS feed 1106 + - All essential data for rich posts is available in the feed 1107 + - Feed structure is stable and well-formed 1279 1108 1280 - 4. **Attribution Format:** 1281 - - ❓ Is "📰 Story aggregated by Kagi News" sufficient? 1282 - - ❓ Do we need more prominent attribution? 1283 - - **Action:** Review CC BY-NC best practices 1109 + **Next Phase:** 1110 + Integration testing with live Coves API, followed by alpha deployment to test communities. 1284 1111 1285 1112 --- 1286 1113 1287 - ## References 1288 - 1289 - - Kagi News About Page: https://news.kagi.com/about 1290 - - Kagi News RSS Example: https://news.kagi.com/world.xml 1291 - - Kagi Kite Public Repo: https://github.com/kagisearch/kite-public 1292 - - CC BY-NC License: https://creativecommons.org/licenses/by-nc/4.0/ 1293 - - Parent PRD: [PRD_AGGREGATORS.md](PRD_AGGREGATORS.md) 1294 - - Aggregator SDK: [TBD] 1114 + **End of PRD - Phase 1 Implementation Complete**