Migrating Soomal.cc to Hugo

Earlier this year, after obtaining the source code for the Soomal.com website, I uploaded it to my VPS. However, due to the outdated architecture of the original site, which was inconvenient to manage and not mobile-friendly, I recently undertook a complete overhaul, converting and migrating the entire site to Hugo.

Migration Plan Design

I had long considered revamping Soomal.cc. I had previously run some preliminary tests but encountered numerous issues, which led me to shelve the project temporarily.

Challenges and Difficulties

  1. Large Volume of Articles

    Soomal contains 9,630 articles, with the earliest dating back to 2003, totaling 19 million words.

    The site also hosts 326,000 JPG images across more than 4,700 folders. Most images come in three sizes, though some are missing, resulting in a total size of nearly 70 GB.

  2. Complexity in Article Conversion

    The Soomal source code only includes HTML files for article pages. While these files might have been generated by the same program, preliminary tests revealed that the page structure had undergone multiple changes over time, with different tags used in various periods, making information extraction from the HTML files highly challenging.

    • Encoding Issues: The original HTML files use GB2312 encoding and were likely hosted on a Windows server, requiring special handling for character encoding and escape sequences during conversion.

    • Image Issues: The site contains a vast number of images, which are the essence of Soomal. However, these images use diverse tags and styles, making it difficult to extract links and descriptions without omissions.

    • Tags and Categories: The site has nearly 12,000 article tags and over 20 categories. The article HTML files themselves carry no category information; it exists only in the 2,000+ paginated category listing pages. Tags also present problems: some contain spaces, special characters, or duplicates within the same article.

    • Article Content: The HTML files include the main text, related articles, and tags, all nested inside a div with the class Doc. I initially overlooked that the related-article blocks use a lowercase doc class, which caused extraction errors during testing. Only after noticing the discrepancy while browsing the site did I restart the conversion project.

  3. Storage Solution Dilemma

    I initially hosted Soomal.cc on a VPS. Over a few months, despite low traffic, it served nearly 1.5 TB of data. Although the VPS offers unlimited bandwidth, this was concerning. After migrating to Hugo, I found that most free hosting services impose restrictions: GitHub recommends repositories under 1 GB, Cloudflare Pages limits a deployment to 20,000 files, Cloudflare R2's free tier caps storage at 10 GB, and Vercel and Netlify both limit traffic to 100 GB.


Conversion Methodology

Given the potential challenges in converting Soomal to Hugo, I devised a five-step migration plan.

Step 1: Convert HTML Files to Markdown

  1. Define Conversion Requirements

    • Extract Titles: Retrieve article titles from the <head> tag. For example, extract 谈谈手机产业链和手机厂商的相互影响 from <title>刘延作品 - 谈谈手机产业链和手机厂商的相互影响 [Soomal]</title>.
    • Extract Tags: Use keyword filtering to locate tags in the HTML, extract tag names, and enclose them in quotes to handle spaces in tag names.
    • Extract Main Text: Retrieve the article body from the Doc block and strip the nested lowercase doc blocks (related articles).
    • Extract Metadata: Gather publication dates, author information, and header images from the HTML.
    • Extract Images: Identify and extract all image references (e.g., smallpic, bigpic, smallpic2, wrappic).
    • Extract Special Content: Include subheadings, download links, tables, etc.
  2. File Conversion

    Given the clear requirements, I used Python scripts for the conversion.

Click to View Conversion Script Example
  1import os
  2import re
  3from bs4 import BeautifulSoup, Tag, NavigableString
  4from datetime import datetime
  5
  6def convert_html_to_md(html_path, output_dir):
  7    try:
  8        # Read HTML files with GB2312 encoding
  9        with open(html_path, 'r', encoding='gb2312', errors='ignore') as f:
 10            html_content = f.read()
 11        
 12        soup = BeautifulSoup(html_content, 'html.parser')
 13        
 14        # 1. Extract title
 15        title = extract_title(soup)
 16        
 17        # 2. Extract bookmark tags
 18        bookmarks = extract_bookmarks(soup)
 19        
 20        # 3. Extract title image and info
 21        title_img, info_content = extract_title_info(soup)
 22        
 23        # 4. Extract main content
 24        body_content = extract_body_content(soup)
 25        
 26        # Generate YAML frontmatter
 27        frontmatter = f"""---
 28title: "{title}"
 29date: {datetime.now().strftime('%Y-%m-%dT%H:%M:%S+08:00')}
 30tags: {bookmarks}
 31title_img: "{title_img}"
 32info: "{info_content}"
 33---\n\n"""
 34        
 35        # Generate Markdown content
 36        markdown_content = frontmatter + body_content
 37        
 38        # Save Markdown file
 39        output_path = os.path.join(output_dir, os.path.basename(html_path).replace('.htm', '.md'))
 40        with open(output_path, 'w', encoding='utf-8') as f:
 41            f.write(markdown_content)
 42            
 43        return f"Conversion successful: {os.path.basename(html_path)}"
 44    except Exception as e:
 45        return f"Conversion failed {os.path.basename(html_path)}: {str(e)}"
 46
 47def extract_title(soup):
 48    """Extract title"""
 49    if soup.title:
 50        return soup.title.string.strip()
 51    return ""
 52
 53def extract_bookmarks(soup):
 54    """Extract bookmark tags, each enclosed in quotes"""
 55    bookmarks = []
 56    bookmark_element = soup.find(string=re.compile(r'本文的相关书签:'))
 57    
 58    if bookmark_element:
 59        parent = bookmark_element.find_parent(['ul', 'li'])
 60        if parent:
 61            # Extract text from all <a> tags
 62            for a_tag in parent.find_all('a'):
 63                text = a_tag.get_text().strip()
 64                if text:
 65                    # Enclose each tag in quotes
 66                    bookmarks.append(f'"{text}"')
 67    
 68    return f"[{', '.join(bookmarks)}]" if bookmarks else "[]"
 69
 70def extract_title_info(soup):
 71    """Extract title image and info content"""
 72    title_img = ""
 73    info_content = ""
 74    
 75    titlebox = soup.find('div', class_='titlebox')
 76    if titlebox:
 77        # Extract title image
 78        title_img_div = titlebox.find('div', class_='titleimg')
 79        if title_img_div and title_img_div.img:
 80            title_img = title_img_div.img['src']
 81        
 82        # Extract info content
 83        info_div = titlebox.find('div', class_='info')
 84        if info_div:
 85            # Remove all HTML tags, keeping only text
 86            info_content = info_div.get_text().strip()
 87    
 88    return title_img, info_content
 89
 90def extract_body_content(soup):
 91    """Extract main content and process images"""
 92    body_content = ""
 93    doc_div = soup.find('div', class_='Doc')  # Note uppercase 'D'
 94    
 95    if doc_div:
 96        # Remove all nested div class="doc" (lowercase)
 97        for nested_doc in doc_div.find_all('div', class_='doc'):
 98            nested_doc.decompose()
 99        
100        # Process images
101        process_images(doc_div)
102        
103        # Iterate through all child elements to build Markdown content
104        for element in doc_div.children:
105            if isinstance(element, Tag):
106                if element.name == 'div' and 'subpagetitle' in element.get('class', []):
107                    # Convert to subheading
108                    body_content += f"## {element.get_text().strip()}\n\n"
109                else:
110                    # Preserve other content
111                    body_content += element.get_text().strip() + "\n\n"
112            elif isinstance(element, NavigableString):
113                body_content += element.strip() + "\n\n"
114    
115    return body_content.strip()
116
117def process_images(container):
118    """Process image content (Rules A/B/C)"""
119    # A: Handle <li data-src> tags
120    for li in container.find_all('li', attrs={'data-src': True}):
121        img_url = li['data-src'].replace('..', 'https://soomal.cc', 1)
122        caption_div = li.find('div', class_='caption')
123        content_div = li.find('div', class_='content')
124        
125        alt_text = caption_div.get_text().strip() if caption_div else ""
126        meta_text = content_div.get_text().strip() if content_div else ""
127        
128        # Create Markdown image syntax
129        img_md = f"![{alt_text}]({img_url})\n\n{meta_text}\n\n"
130        li.replace_with(img_md)
131    
132    # B: Process <span class="smallpic"> tags
133    for span in container.find_all('span', class_='smallpic'):
134        img = span.find('img')
135        if img and 'src' in img.attrs:
136            img_url = img['src'].replace('..', 'https://soomal.cc', 1)
137            caption_div = span.find('div', class_='caption')
138            content_div = span.find('div', class_='content')
139            
140            alt_text = caption_div.get_text().strip() if caption_div else ""
141            meta_text = content_div.get_text().strip() if content_div else ""
142            
143            # Create Markdown image syntax
144            img_md = f"![{alt_text}]({img_url})\n\n{meta_text}\n\n"
145            span.replace_with(img_md)
146            
147    # C: Process <div class="bigpic"> tags
148    for div in container.find_all('div', class_='bigpic'):
149        img = div.find('img')
150        if img and 'src' in img.attrs:
151            img_url = img['src'].replace('..', 'https://soomal.cc', 1)
152            caption_div = div.find('div', class_='caption')
153            content_div = div.find('div', class_='content')
154            
155            alt_text = caption_div.get_text().strip() if caption_div else ""
156            meta_text = content_div.get_text().strip() if content_div else ""
157            
158            # Create Markdown image syntax
159            img_md = f"![{alt_text}]({img_url})\n\n{meta_text}\n\n"
160            div.replace_with(img_md)
161
162if __name__ == "__main__":
163    input_dir = 'doc'
164    output_dir = 'markdown_output'
165    
166    # Create output directory
167    os.makedirs(output_dir, exist_ok=True)
168    
169    # Process all HTML files
170    for filename in os.listdir(input_dir):
171        if filename.endswith('.htm'):
172            html_path = os.path.join(input_dir, filename)
173            result = convert_html_to_md(html_path, output_dir)
174            print(result)

Step 2: Process Categories and Abstracts

Because the original HTML files contain no category information, the category listing pages had to be processed separately. Article abstracts were handled during the same pass.

  1. Extracting Category and Abstract Information

    I used Python to extract and format the category and abstract information from the 2,000+ category pages.

Click to view conversion code
  1import os
  2import re
  3from bs4 import BeautifulSoup
  4import codecs
  5from collections import defaultdict
  6
  7def extract_category_info(folder_path):
  8    # Use defaultdict to automatically initialize nested dictionaries
  9    article_categories = defaultdict(set)  # Stores article ID to category mapping
 10    article_summaries = {}  # Stores article ID to abstract mapping
 11    
 12    # Iterate through all HTM files in the folder
 13    for filename in os.listdir(folder_path):
 14        if not filename.endswith('.htm'):
 15            continue
 16            
 17        file_path = os.path.join(folder_path, filename)
 18        
 19        try:
 20            # Read file with GB2312 encoding and convert to UTF-8
 21            with codecs.open(file_path, 'r', encoding='gb2312', errors='replace') as f:
 22                content = f.read()
 23                
 24            soup = BeautifulSoup(content, 'html.parser')
 25            
 26            # Extract category name
 27            title_tag = soup.title
 28            if title_tag:
 29                title_text = title_tag.get_text().strip()
 30                # Extract content before the first hyphen
 31                category_match = re.search(r'^([^-]+)', title_text)
 32                if category_match:
 33                    category_name = category_match.group(1).strip()
 34                    # Add quotes if category name contains spaces
 35                    if ' ' in category_name:
 36                        category_name = f'"{category_name}"'
 37                else:
 38                    category_name = "Unknown_Category"
 39            else:
 40                category_name = "Unknown_Category"
 41            
 42            # Extract article information
 43            for item in soup.find_all('div', class_='item'):
 44                # Extract article ID
 45                article_link = item.find('a', href=True)
 46                if article_link:
 47                    href = article_link['href']
 48                    article_id = re.search(r'../doc/(\d+)\.htm', href)
 49                    if article_id:
 50                        article_id = article_id.group(1)
 51                    else:
 52                        continue
 53                else:
 54                    continue
 55                
 56                # Extract article abstract
 57                synopsis_div = item.find('div', class_='synopsis')
 58                synopsis = synopsis_div.get_text().strip() if synopsis_div else ""
 59                
 60                # Store category information
 61                article_categories[article_id].add(category_name)
 62                
 63                # Store abstract (only once to avoid overwriting)
 64                if article_id not in article_summaries:
 65                    article_summaries[article_id] = synopsis
 66    
 67        except UnicodeDecodeError:
 68            # Attempt using GBK encoding as fallback
 69            try:
 70                with codecs.open(file_path, 'r', encoding='gbk', errors='replace') as f:
 71                    content = f.read()
 72                # Reprocess content...
 73                # Note: Repeated processing code omitted here; should be extracted as a function
 74                # For code completeness, we include the repeated logic
 75                soup = BeautifulSoup(content, 'html.parser')
 76                title_tag = soup.title
 77                if title_tag:
 78                    title_text = title_tag.get_text().strip()
 79                    category_match = re.search(r'^([^-]+)', title_text)
 80                    if category_match:
 81                        category_name = category_match.group(1).strip()
 82                        if ' ' in category_name:
 83                            category_name = f'"{category_name}"'
 84                    else:
 85                        category_name = "Unknown_Category"
 86                else:
 87                    category_name = "Unknown_Category"
 88                
 89                for item in soup.find_all('div', class_='item'):
 90                    article_link = item.find('a', href=True)
 91                    if article_link:
 92                        href = article_link['href']
 93                        article_id = re.search(r'../doc/(\d+)\.htm', href)
 94                        if article_id:
 95                            article_id = article_id.group(1)
 96                        else:
 97                            continue
 98                    else:
 99                        continue
100
101                    synopsis_div = item.find('div', class_='synopsis')
102                    synopsis = synopsis_div.get_text().strip() if synopsis_div else ""
103
104                    article_categories[article_id].add(category_name)
105
106                    if article_id not in article_summaries:
107                        article_summaries[article_id] = synopsis
108
109            except Exception as e:
110                print(f"Error processing file {filename} (after trying GBK): {str(e)}")
111                continue
112
113        except Exception as e:
114            print(f"Error processing file {filename}: {str(e)}")
115            continue
116
117    return article_categories, article_summaries
118
119def save_to_markdown(article_categories, article_summaries, output_path):
120    with open(output_path, 'w', encoding='utf-8') as md_file:
121        # Write Markdown header
122        md_file.write("# Article Categories and Summaries\n\n")
123        md_file.write("> This file contains IDs, categories and summaries of all articles\n\n")
124
125        # Sort by article ID
126        sorted_article_ids = sorted(article_categories.keys(), key=lambda x: int(x))
127
128        for article_id in sorted_article_ids:
129            # Get sorted category list
130            categories = sorted(article_categories[article_id])
131            # Format as list string
132            categories_str = ", ".join(categories)
133
134            # Get summary
135            summary = article_summaries.get(article_id, "No summary available")
136
137            # Write Markdown content
138            md_file.write(f"## Filename: {article_id}\n")
139            md_file.write(f"**Categories**: {categories_str}\n")
140            md_file.write(f"**Summary**: {summary}\n\n")
141            md_file.write("---\n\n")
142
143if __name__ == "__main__":
144    # Configure input and output paths
145    input_folder = 'Categories'  # Replace with your HTM folder path
146    output_md = 'articles_categories.md'
147
148    # Execute extraction
149    article_categories, article_summaries = extract_category_info(input_folder)
150
151    # Save results to Markdown file
152    save_to_markdown(article_categories, article_summaries, output_md)
153
154    # Print statistics
155    print(f"Successfully processed data for {len(article_categories)} articles")
156    print(f"Saved to {output_md}")
157    print(f"Found {len(article_summaries)} articles with summaries")
  2. Writing Category and Summary Information to the Markdown Files

    This step is relatively simple: the extracted category and summary data are written into the previously converted Markdown files one by one.

Click to view the writing script
  1import os
  2import re
  3import ruamel.yaml
  4from collections import defaultdict
  5
  6def parse_articles_categories(md_file_path):
  7    """
  8    Parse articles_categories.md file to extract article IDs, categories and summaries
  9    """
 10    article_info = defaultdict(dict)
 11    current_id = None
 12    
 13    try:
 14        with open(md_file_path, 'r', encoding='utf-8') as f:
 15            for line in f:
 16                # Match filename
 17                filename_match = re.match(r'^## Filename: (\d+)$', line.strip())
 18                if filename_match:
 19                    current_id = filename_match.group(1)
 20                    continue
 21                
 22                # Match category information
 23                categories_match = re.match(r'^\*\*Categories\*\*: (.+)$', line.strip())
 24                if categories_match and current_id:
 25                    categories_str = categories_match.group(1)
 26                    # Clean category string, remove extra spaces and quotes
 27                    categories = [cat.strip().strip('"') for cat in categories_str.split(',')]
 28                    article_info[current_id]['categories'] = categories
 29                    continue
 30                
 31                # Match summary information
 32                summary_match = re.match(r'^\*\*Summary\*\*: (.+)$', line.strip())
 33                if summary_match and current_id:
 34                    summary = summary_match.group(1)
 35                    article_info[current_id]['summary'] = summary
 36                    continue
 37                
 38                # Reset current ID when encountering separator
 39                if line.startswith('---'):
 40                    current_id = None
 41    
 42    except Exception as e:
 43        print(f"Error parsing articles_categories.md file: {str(e)}")
 44    
 45    return article_info
 46
 47def update_markdown_files(article_info, md_folder):
 48    """
 49    Update Markdown files by adding category and summary information to frontmatter
 50    """
 51    updated_count = 0
 52    skipped_count = 0
 53    
 54    # Initialize YAML parser
 55    yaml = ruamel.yaml.YAML()
 56    yaml.preserve_quotes = True
 57    yaml.width = 1000  # Prevent long summaries from line breaking
 58    
 59    for filename in os.listdir(md_folder):
 60        if not filename.endswith('.md'):
 61            continue
 62            
 63        article_id = filename[:-3]  # Remove .md extension
 64        file_path = os.path.join(md_folder, filename)
 65        
 66        # Check if information exists for this article
 67        if article_id not in article_info:
 68            skipped_count += 1
 69            continue
 70            
 71        try:
 72            with open(file_path, 'r', encoding='utf-8') as f:
 73                content = f.read()
 74            
 75            # Parse frontmatter
 76            frontmatter_match = re.search(r'^---\n(.*?)\n---', content, re.DOTALL)
 77            if not frontmatter_match:
 78                print(f"No frontmatter found in file {filename}, skipping")
 79                skipped_count += 1
 80                continue
 81                
 82            frontmatter_content = frontmatter_match.group(1)
 83            
 84            # Convert frontmatter to dictionary
 85            data = yaml.load(frontmatter_content)
 86            if data is None:
 87                data = {}
 88            
 89            # Add category and summary information
 90            info = article_info[article_id]
 91            
 92            # Add categories
 93            if 'categories' in info:
 94                # If categories already exist, merge them (deduplicate)
 95                existing_categories = set(data.get('categories', []))
 96                new_categories = set(info['categories'])
 97                combined_categories = sorted(existing_categories.union(new_categories))
 98                data['categories'] = combined_categories
 99            
100            # Add summary (if summary exists and is not empty)
101            if 'summary' in info and info['summary']:
102                # Only update if summary doesn't exist or new summary is not empty
103                if 'summary' not in data or info['summary']:
104                    data['summary'] = info['summary']
105            
106            # Regenerate frontmatter
107            new_frontmatter = '---\n'
108            # Dump the updated metadata back to a YAML string using an
109            # in-memory buffer, then close the frontmatter block
110            with ruamel.yaml.compat.StringIO() as stream:
111                yaml.dump(data, stream)
112                new_frontmatter += stream.getvalue().strip()
113            new_frontmatter += '\n---'
114
115            # Replace original frontmatter
116            new_content = content.replace(frontmatter_match.group(0), new_frontmatter)
117
118            # Write to file
119            with open(file_path, 'w', encoding='utf-8') as f:
120                f.write(new_content)
121
122            updated_count += 1
123
124        except Exception as e:
125            print(f"Error updating file {filename}: {str(e)}")
126            skipped_count += 1
127
128    return updated_count, skipped_count
129
130
131
132if __name__ == "__main__":
133    # Configure paths
134    articles_md = 'articles_categories.md'  # Markdown file containing category and summary information
135    md_folder = 'markdown_output'  # Folder containing Markdown articles
136    
137    # Parse articles_categories.md file
138    print("Parsing articles_categories.md file...")
139    article_info = parse_articles_categories(articles_md)
140    print(f"Successfully parsed information for {len(article_info)} articles")
141    
142    # Update Markdown files
143    print(f"\nUpdating category and summary information for {len(article_info)} articles...")
144    updated, skipped = update_markdown_files(article_info, md_folder)
145    
146    # Print statistics
147    print(f"\nProcessing complete!")
148    print(f"Successfully updated: {updated} files")
149    print(f"Skipped: {skipped} files")
150    print(f"Articles with found information: {len(article_info)}")

Step 3: Convert Article Frontmatter Information

This step primarily involves correcting the frontmatter section of the output Markdown files to meet Hugo theme requirements.

  1. Revise Article Header Information According to the Frontmatter Specification

    This mainly handles special characters, date formats, authors, featured images, tags, and categories.
View conversion code
  1import os
  2import re
  3import frontmatter
  4import yaml
  5from datetime import datetime
  6
  7def escape_special_characters(text):
  8    """Escape special characters in YAML"""
  9    # Escape backslashes while preserving already escaped characters
 10    return re.sub(r'(?<!\\)\\(?!["\\/bfnrt]|u[0-9a-fA-F]{4})', r'\\\\', text)
 11
 12def process_md_files(folder_path):
 13    for filename in os.listdir(folder_path):
 14        if filename.endswith(".md"):
 15            file_path = os.path.join(folder_path, filename)
 16            try:
 17                # Read file content
 18                with open(file_path, 'r', encoding='utf-8') as f:
 19                    content = f.read()
 20                
 21                # Manually split frontmatter and content
 22                if content.startswith('---\n'):
 23                    parts = content.split('---\n', 2)
 24                    if len(parts) >= 3:
 25                        fm_text = parts[1]
 26                        body_content = parts[2] if len(parts) > 2 else ""
 27                        
 28                        # Escape special characters
 29                        fm_text = escape_special_characters(fm_text)
 30                        
 31                        # Recombine content
 32                        new_content = f"---\n{fm_text}---\n{body_content}"
 33                        
 34                        # Parse frontmatter using safe loading
 35                        post = frontmatter.loads(new_content)
 36                        
 37                        # Process info field
 38                        if 'info' in post.metadata:
 39                            info = post.metadata['info']
 40                            
 41                            # Extract date
 42                            date_match = re.search(r'On (\d{4}\.\d{1,2}\.\d{1,2} \d{1,2}:\d{2}:\d{2})', info)
 43                            if date_match:
 44                                date_str = date_match.group(1)
 45                                try:
 46                                    dt = datetime.strptime(date_str, "%Y.%m.%d %H:%M:%S")
 47                                    post.metadata['date'] = dt.strftime("%Y-%m-%dT%H:%M:%S+08:00")
 48                                except ValueError:
 49                                    # Keep original date as fallback
 50                                    pass
 51                            
 52                            # Extract author
 53                            author_match = re.match(r'^(.+?)作品', info)
 54                            if author_match:
 55                                authors = author_match.group(1).strip()
 56                                # Split multiple authors
 57                                author_list = [a.strip() for a in re.split(r'\s+', authors) if a.strip()]
 58                                post.metadata['author'] = author_list
 59                            
 60                            # Create description
 61                            desc_parts = info.split('|', 1)
 62                            if len(desc_parts) > 1:
 63                                post.metadata['description'] = desc_parts[1].strip()
 64                            
 65                            # Remove original info
 66                            del post.metadata['info']
 67                        
 68                        # Process title_img
 69                        if 'title_img' in post.metadata:
 70                            img_url = post.metadata['title_img'].replace("../", "https://soomal.cc/")
 71                            # Handle potential double slashes
 72                            img_url = re.sub(r'(?<!:)/{2,}', '/', img_url)
 73                            post.metadata['cover'] = {
 74                                'image': img_url,
 75                                'caption': "",
 76                                'alt': "",
 77                                'relative': False
 78                            }
 79                            del post.metadata['title_img']
 80                        
 81                        # Modify title
 82                        if 'title' in post.metadata:
 83                            title = post.metadata['title']
 84                            # Remove content before "-"
 85                            if '-' in title:
 86                                new_title = title.split('-', 1)[1].strip()
 87                                post.metadata['title'] = new_title
 88                        
 89                        # Save modified file
 90                        with open(file_path, 'w', encoding='utf-8') as f_out:
 91                            f_out.write(frontmatter.dumps(post))
 92            except Exception as e:
 93                print(f"Error processing file {filename}: {str(e)}")
 94                # Log error files for later review
 95                with open("processing_errors.log", "a", encoding="utf-8") as log:
 96                    log.write(f"Error in {filename}: {str(e)}\n")
 97
 98
 99
100if __name__ == "__main__":
101    folder_path = "markdown_output"  # Replace with your actual path
102    process_md_files(folder_path)
103    print("Frontmatter processing completed for all Markdown files!")
  2. Streamlining Tags and Categories

    Soomal.com originally had over 20 article categories, some of which were meaningless (e.g., the “All Articles” category). There was also significant overlap between categories and tags. To keep categories and tags unique, I simplified them further; another goal was to minimize the number of files generated during the final site build.
View the code for streamlining tags and categories
 1import os
 2import yaml
 3import frontmatter
 4
 5def clean_hugo_tags_categories(folder_path):
 6    """
 7    Clean up tags and categories in Hugo articles:
 8    1. Remove "All Articles" from categories
 9    2. Remove tags that duplicate categories
10    """
11    # Valid categories list ("All Articles" removed)
12    valid_categories = [
13        "Digital Devices", "Audio", "Music", "Mobile Digital", "Reviews", "Introductions", 
14        "Evaluation Reports", "Galleries", "Smartphones", "Android", "Headphones", 
15        "Musicians", "Imaging", "Digital Terminals", "Speakers", "iOS", "Cameras", 
16        "Sound Cards", "Album Reviews", "Tablets", "Technology", "Applications", 
17        "Portable Players", "Windows", "Digital Accessories", "Essays", "DACs", 
18        "Audio Systems", "Lenses", "Musical Instruments", "Audio Codecs"
19    ]
20    
21    # Process all Markdown files in the folder
22    for filename in os.listdir(folder_path):
23        if not filename.endswith('.md'):
24            continue
25            
26        filepath = os.path.join(folder_path, filename)
27        with open(filepath, 'r', encoding='utf-8') as f:
28            post = frontmatter.load(f)
29            
30            # 1. Clean categories (remove invalid entries and deduplicate)
31            if 'categories' in post.metadata:
32                # Convert to set for deduplication + filter invalid categories
33                categories = list(set(post.metadata['categories']))
34                cleaned_categories = [
35                    cat for cat in categories 
36                    if cat in valid_categories
37                ]
38                post.metadata['categories'] = cleaned_categories
39            
40            # 2. Clean tags (remove duplicates with categories)
41            if 'tags' in post.metadata:
42                current_cats = post.metadata.get('categories', [])
43                # Convert to set for deduplication + filter category duplicates
44                tags = list(set(post.metadata['tags']))
45                cleaned_tags = [
46                    tag for tag in tags 
47                    if tag not in current_cats
48                ]
49                post.metadata['tags'] = cleaned_tags
50                
51            # Save modified file
52            with open(filepath, 'w', encoding='utf-8') as f_out:
53                f_out.write(frontmatter.dumps(post))
54
55if __name__ == "__main__":
56    # Example usage (modify with your actual path)
57    md_folder = "./markdown_output"
58    clean_hugo_tags_categories(md_folder)
59    print(f"Processing completed: {len(os.listdir(md_folder))} files")

Step 4: Reduce Image Quantity

During the HTML-to-Markdown conversion only the article content was extracted, so many of the cropped image variants from the original site were no longer needed. I therefore matched the converted Markdown files against the original site's images to keep only those the new site actually references.

This step reduced the total number of images from 326,000 to 118,000.

  1. Extracting Image Links

    Extract all image links from the Markdown files. Since the links were standardized during conversion, this was straightforward.
View the extraction code
 1import os
 2import re
 3import argparse
 4
 5def extract_image_links(directory):
 6    """Extract image links from all md files in directory"""
 7    image_links = set()
 8    pattern = re.compile(r'https://soomal\.cc[^\s\)\]\}]*?\.jpg', re.IGNORECASE)
 9    
10    for root, _, files in os.walk(directory):
11        for filename in files:
12            if filename.endswith('.md'):
13                filepath = os.path.join(root, filename)
14                try:
15                    with open(filepath, 'r', encoding='utf-8') as f:
16                        content = f.read()
17                        matches = pattern.findall(content)
18                        if matches:
19                            image_links.update(matches)
20                except Exception as e:
21                    print(f"Error processing {filepath}: {str(e)}")
22    
23    return sorted(image_links)
24
25def save_links_to_file(links, output_file):
26    """Save links to file"""
27    with open(output_file, 'w', encoding='utf-8') as f:
28        for link in links:
29            f.write(link + '\n')
30
31if __name__ == "__main__":
32    parser = argparse.ArgumentParser(description='Extract image links from Markdown')
33    parser.add_argument('--input', default='markdown_output', help='Path to Markdown directory')
34    parser.add_argument('--output', default='image_links.txt', help='Output file path')
35    args = parser.parse_args()
36
37    print(f"Scanning directory: {args.input}")
38    links = extract_image_links(args.input)
39    
40    print(f"Found {len(links)} unique image links")
41    save_links_to_file(links, args.output)
42    print(f"Links saved to: {args.output}")
  2. Copying the Corresponding Images

    Use the extracted image links to locate and copy the matching files from the original site directory, preserving the directory structure.
A. View Windows Copy Code
  1import os
  2import shutil
  3import time
  4import sys
  5
  6def main():
  7    # Configuration
  8    source_drive = "F:\\"
  9    target_drive = "D:\\"
 10    image_list_file = r"D:\trans-soomal\image_links.txt"
 11    log_file = r"D:\trans-soomal\image_copy_log.txt"
 12    error_log_file = r"D:\trans-soomal\image_copy_errors.txt"
 13    
 14    print("Image copy script starting...")
 15    
 16    # Record start time
 17    start_time = time.time()
 18    
 19    # Create log files
 20    with open(log_file, "w", encoding="utf-8") as log, open(error_log_file, "w", encoding="utf-8") as err_log:
 21        log.write(f"Image Copy Log - Start Time: {time.ctime(start_time)}\n")
 22        err_log.write("Failed copies:\n")
 23        
 24        try:
 25            # Read image list
 26            with open(image_list_file, "r", encoding="utf-8") as f:
 27                image_paths = [line.strip() for line in f if line.strip()]
 28            
 29            total_files = len(image_paths)
 30            success_count = 0
 31            fail_count = 0
 32            skipped_count = 0
 33            
 34            print(f"Found {total_files} images to copy")
 35            
 36            # Process each file
 37            for i, relative_path in enumerate(image_paths):
 38                # Display progress
 39                progress = (i + 1) / total_files * 100
 40                sys.stdout.write(f"\rProgress: {progress:.2f}% ({i+1}/{total_files})")
 41                sys.stdout.flush()
 42                
 43                # Build full paths
 44                source_path = os.path.join(source_drive, relative_path)
 45                target_path = os.path.join(target_drive, relative_path)
 46                
 47                try:
 48                    # Check if source exists
 49                    if not os.path.exists(source_path):
 50                        err_log.write(f"Source missing: {source_path}\n")
 51                        fail_count += 1
 52                        continue
 53                    
 54                    # Check if target already exists
 55                    if os.path.exists(target_path):
 56                        log.write(f"File already exists, skipping: {target_path}\n")
 57                        skipped_count += 1
 58                        continue
 59                    
 60                    # Create target directory
 61                    target_dir = os.path.dirname(target_path)
 62                    os.makedirs(target_dir, exist_ok=True)
 63                    
 64                    # Copy file
 65                    shutil.copy2(source_path, target_path)
 66                    
 67                    # Log success
 68                    log.write(f"[SUCCESS] Copied {source_path} to {target_path}\n")
 69                    success_count += 1
 70                    
 71                except Exception as e:
 72                    # Log failure
 73                    err_log.write(f"[FAILED] {source_path} -> {target_path} : {str(e)}\n")
 74                    fail_count += 1
 75            
 76            # Calculate elapsed time
 77            end_time = time.time()
 78            elapsed_time = end_time - start_time
 79            minutes, seconds = divmod(elapsed_time, 60)
 80            hours, minutes = divmod(minutes, 60)
 81            
 82            # Write summary
 83            summary = f"""
 84================================
 85Copy operation completed
 86Start time: {time.ctime(start_time)}
 87End time: {time.ctime(end_time)}
 88Total duration: {int(hours)}h {int(minutes)}m {seconds:.2f}s
 89
 90Total files: {total_files}
 91Successfully copied: {success_count}
 92Skipped (existing): {skipped_count}
 93Failed: {fail_count}
 94================================
 95"""
 96            log.write(summary)
 97            print(summary)
 98            
 99        except Exception as e:
100            print(f"\nError occurred: {str(e)}")
101            err_log.write(f"Script error: {str(e)}\n")
102
103if __name__ == "__main__":
104    main()
B. View Linux Copy Code
  1#!/bin/bash
  2
  3# Configuration parameters
  4LINK_FILE="/user/image_links.txt"  # Replace with actual link file path
  5SOURCE_BASE="/user/soomal.cc/index"
  6DEST_BASE="/user/images.soomal.cc/index"
  7LOG_FILE="/var/log/image_copy_$(date +%Y%m%d_%H%M%S).log"
  8THREADS=3  # Number of parallel copy jobs (set manually; could use $(nproc) instead)
  9
 10# Start logging
 11{
 12echo "===== Copy Task Started: $(date) ====="
 13echo "Source base directory: $SOURCE_BASE"
 14echo "Destination base directory: $DEST_BASE"
 15echo "Link file: $LINK_FILE"
 16echo "Thread count: $THREADS"
 17
 18# Path validation example
 19echo -e "\n=== Path Validation ==="
 20sample_url="https://soomal.cc/images/doc/20090406/00000007.jpg"
 21expected_src="${SOURCE_BASE}/images/doc/20090406/00000007.jpg"
 22expected_dest="${DEST_BASE}/images/doc/20090406/00000007.jpg"
 23
 24echo "Example URL: $sample_url"
 25echo "Expected source path: $expected_src"
 26echo "Expected destination path: $expected_dest"
 27
 28if [[ -f "$expected_src" ]]; then
 29    echo "Validation successful: Example source file exists"
 30else
 31    echo "Validation failed: Example source file missing! Please check paths"
 32    exit 1
 33fi
 34
 35# Create destination base directory
 36mkdir -p "${DEST_BASE}/images"
 37
 38# Prepare parallel processing
 39echo -e "\n=== Processing Started ==="
 40total=$(wc -l < "$LINK_FILE")
 41echo "Total links: $total"
 42counter=0
 43
 44# Processing function
 45process_link() {
 46    local url="$1"
 47    local rel_path="${url#https://soomal.cc}"
 48    
 49    # Build full paths
 50    local src_path="${SOURCE_BASE}${rel_path}"
 51    local dest_path="${DEST_BASE}${rel_path}"
 52    
 53    # Create destination directory
 54    mkdir -p "$(dirname "$dest_path")"
 55    
 56    # Copy file
 57    if [[ -f "$src_path" ]]; then
 58        if cp -f "$src_path" "$dest_path"; then
 59            echo "SUCCESS: $rel_path"
 60            return 0
 61        else
 62            echo "COPY FAILED: $rel_path"
 63            return 2
 64        fi
 65    else
 66        echo "MISSING: $rel_path"
 67        return 1
 68    fi
 69}
 70
 71# Export function for parallel use
 72export -f process_link
 73export SOURCE_BASE DEST_BASE
 74
 75# Use parallel for concurrent processing
 76echo "Starting parallel copying..."
 77parallel --bar --jobs $THREADS --progress \
 78         --halt soon,fail=1 \
 79         --joblog "${LOG_FILE}.jobs" \
 80         --tagstring "{}" \
 81         "process_link {}" < "$LINK_FILE" | tee -a "$LOG_FILE"
 82
 83# Collect results
 84success=$(grep -c 'SUCCESS:' "$LOG_FILE")
 85missing=$(grep -c 'MISSING:' "$LOG_FILE")
 86failed=$(grep -c 'COPY FAILED:' "$LOG_FILE")
 87
 88# Final statistics
 89echo -e "\n===== Copy Task Completed: $(date) ====="
 90echo "Total links: $total"
 91echo "Successfully copied: $success"
 92echo "Missing files: $missing"
 93echo "Copy failures: $failed"
 94echo "Success rate: $((success * 100 / total))%"
 95
 96} | tee "$LOG_FILE"
 97
 98# Save missing files list
 99grep '^MISSING:' "$LOG_FILE" | cut -d' ' -f2- > "${LOG_FILE%.log}_missing.txt"
100echo "Missing files list: ${LOG_FILE%.log}_missing.txt"

Step 5: Compress Image Sizes

I had previously compressed the website's source images once, but it wasn't enough. My goal was to reduce the total image size to under 10 GB, in case I later need to migrate to Cloudflare R2.

  1. Convert JPG to WebP

    During the earlier compression pass I kept the images in JPG format, because the numerous HTML files referenced them directly and changing the format could have caused access problems. Now that the site runs on Hugo, there is no need to keep JPG, so I converted the images directly to WebP. Additionally, since my page layout is 960px wide and I am not using a lightbox plugin, resizing the images to 960px reduces the size further.

    Actual tests showed that this pass brought the images down to 7.7 GB. However, I noticed a flaw in the processing logic: Soomal has many portrait images as well as landscape ones, and a 960px width looks somewhat small on 4K displays. I ultimately converted the images with the short edge capped at 1280px at 85% quality, which came to about 14 GB and fits within my VPS's 20 GB of storage; a sketch of that adjusted resize command follows the conversion code below. I also tested a 1150px short edge at 80% quality, which met the 10 GB target.

View Image Conversion Code
  1import os
  2import subprocess
  3import time
  4import sys
  5import shutil
  6from pathlib import Path
  7
  8def main():
  9    # Configure paths
 10    source_dir = Path("D:\\images")  # Original image directory
 11    output_dir = Path("D:\\images_webp")  # WebP output directory
 12    temp_dir = Path("D:\\temp_webp")  # Temporary processing directory
 13    magick_path = "C:\\webp\\magick.exe"  # ImageMagick path
 14    
 15    # Create necessary directories
 16    output_dir.mkdir(parents=True, exist_ok=True)
 17    temp_dir.mkdir(parents=True, exist_ok=True)
 18    
 19    # Log files
 20    log_file = output_dir / "conversion_log.txt"
 21    stats_file = output_dir / "conversion_stats.csv"
 22    start_time = time.time()  # Record the start time for the duration report
 23    print("Image conversion script starting...")
 24    print(f"Source directory: {source_dir}")
 25    print(f"Output directory: {output_dir}")
 26    print(f"Temporary directory: {temp_dir}")
 27    
 28    # Initialize log
 29    with open(log_file, "w", encoding="utf-8") as log:
 30        log.write(f"Image conversion log - Start time: {time.ctime()}\n")
 31    
 32    # Initialize stats file
 33    with open(stats_file, "w", encoding="utf-8") as stats:
 34        stats.write("Original File,Converted File,Original Size (KB),Converted Size (KB),Space Saved (KB),Savings Percentage\n")
 35    
 36    # Collect all image files
 37    image_exts = ('.jpg', '.jpeg', '.png', '.bmp', '.tiff', '.gif')
 38    all_images = []
 39    for root, _, files in os.walk(source_dir):
 40        for file in files:
 41            if file.lower().endswith(image_exts):
 42                all_images.append(Path(root) / file)
 43    
 44    total_files = len(all_images)
 45    converted_files = 0
 46    skipped_files = 0
 47    error_files = 0
 48    
 49    print(f"Found {total_files} image files to process")
 50    
 51    # Process each image
 52    for idx, img_path in enumerate(all_images):
 53        try:
 54            # Progress display
 55            # (written to a single updating console line via \r)
 56            progress = (idx + 1) / total_files * 100  
 57            sys.stdout.write(f"\rProgress: {progress:.2f}% ({idx+1}/{total_files})")  
 58            sys.stdout.flush()  
 59              
 60            # Create relative path structure  
 61            rel_path = img_path.relative_to(source_dir)  
 62            webp_path = output_dir / rel_path.with_suffix('.webp')  
 63            webp_path.parent.mkdir(parents=True, exist_ok=True)  
 64              
 65            # Check if file already exists  
 66            if webp_path.exists():  
 67                skipped_files += 1  
 68                continue  
 69              
 70            # Create temporary file path  
 71            temp_path = temp_dir / f"{img_path.stem}_temp.webp"  
 72              
 73            # Get original file size  
 74            orig_size = img_path.stat().st_size / 1024  # KB  
 75              
 76            # Convert and resize using ImageMagick  
 77            cmd = [  
 78                magick_path,  
 79                str(img_path),  
 80                "-resize", "960>",   # Resize only if width exceeds 960px  
 81                "-quality", "85",    # Initial quality 85  
 82                "-define", "webp:lossless=false",  
 83                str(temp_path)  
 84            ]  
 85              
 86            # Execute command  
 87            result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)  
 88              
 89            if result.returncode != 0:  
 90                # Log conversion failure  
 91                with open(log_file, "a", encoding="utf-8") as log:  
 92                    log.write(f"[ERROR] Failed to convert {img_path}: {result.stderr}\n")  
 93                error_files += 1  
 94                continue  
 95              
 96            # Move temporary file to target location  
 97            shutil.move(str(temp_path), str(webp_path))  
 98              
 99            # Get converted file size  
100            new_size = webp_path.stat().st_size / 1024  # KB  
101              
102            # Calculate space savings  
103            saved = orig_size - new_size  
104            saved_percent = (saved / orig_size) * 100 if orig_size > 0 else 0  
105              
106            # Record statistics  
107            with open(stats_file, "a", encoding="utf-8") as stats:  
108                stats.write(f"{img_path},{webp_path},{orig_size:.2f},{new_size:.2f},{saved:.2f},{saved_percent:.2f}\n")  
109              
110            converted_files += 1  
111          
112        except Exception as e:  
113            with open(log_file, "a", encoding="utf-8") as log:  
114                log.write(f"[EXCEPTION] Error processing {img_path}: {str(e)}\n")  
115            error_files += 1  
116      
117    # Completion report  
118    total_size = sum(f.stat().st_size for f in output_dir.glob('**/*') if f.is_file())  
119    total_size_gb = total_size / (1024 ** 3)  # Convert to GB  
120      
121    end_time = time.time()  
122    elapsed = end_time - start_time  # Total processing time
123    mins, secs = divmod(elapsed, 60)  
124    hours, mins = divmod(mins, 60)  
125      
126    with open(log_file, "a", encoding="utf-8") as log:  
127        log.write("\nConversion Report:\n")  
128        log.write(f"Total files: {total_files}\n")  
129        log.write(f"Successfully converted: {converted_files}\n")  
130        log.write(f"Skipped files: {skipped_files}\n")  
131        log.write(f"Error files: {error_files}\n")  
132        log.write(f"Output directory size: {total_size_gb:.2f} GB\n")  
133      
134    print("\n\nConversion completed!")  
135    print(f"Total files: {total_files}")  
136    print(f"Successfully converted: {converted_files}")  
137    print(f"Skipped files: {skipped_files}")  
138    print(f"Error files: {error_files}")  
139    print(f"Output directory size: {total_size_gb:.2f} GB")  
140      
141    # Clean up temporary directory  
142    try:  
143        shutil.rmtree(temp_dir)  
144        print(f"Cleaned temporary directory: {temp_dir}")  
145    except Exception as e:  
146        print(f"Error cleaning temporary directory: {str(e)}")  
147      
148    print(f"Log file: {log_file}")  
149    print(f"Statistics file: {stats_file}")  
150    print(f"Total time elapsed: {int(hours)} hours {int(mins)} minutes {secs:.2f} seconds")  
151
152if __name__ == "__main__":  
153    main()  
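
The script above still uses the initial 960px-width geometry. For the final pass described earlier (short edge capped at 1280px at 85% quality), only the resize geometry changes; the snippet below is my reconstruction of that adjustment rather than the exact command I ran, and the paths are hypothetical examples. It relies on ImageMagick's "^" (minimum-fit) and ">" (shrink-only) geometry flags.

import subprocess

# Hypothetical example paths; in practice these come from the same directory
# walk as the conversion script above.
magick_path = r"C:\webp\magick.exe"
src = r"D:\images\doc\20090406\00000007.jpg"
dst = r"D:\images_webp\doc\20090406\00000007.webp"

# "1280x1280^" treats 1280 as a minimum for the smaller dimension, and ">"
# limits the operation to shrinking, so the short edge is capped at 1280px.
subprocess.run([
    magick_path, src,
    "-resize", "1280x1280^>",
    "-quality", "85",
    "-define", "webp:lossless=false",
    dst,
], check=True)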
  2. Further Image Compression
    I originally designed this step to further compress images if the initial conversion didn’t reduce the total size below 10GB. However, the first step successfully resolved the issue, making additional compression unnecessary. Nevertheless, I tested further compression by converting images to WebP with a maximum short edge of 1280px and 60% quality, which resulted in a total size of only 9GB.
View Secondary Compression Code
  1import os  
  2import subprocess  
  3import time  
  4import sys  
  5import shutil  
  6from pathlib import Path  
  7
  8def main():  
  9    # Configure paths  
 10    webp_dir = Path("D:\\images_webp")  # WebP directory  
 11    temp_dir = Path("D:\\temp_compress")  # Temporary directory  
 12    cwebp_path = "C:\\Windows\\System32\\cwebp.exe"  # cwebp path  
 13    start_time = time.time()  # Record the start time for the duration report
 14    # Create temporary directory  
 15    temp_dir.mkdir(parents=True, exist_ok=True)  
 16      
 17    # Log files  
 18    log_file = webp_dir / "compression_log.txt"  
 19    stats_file = webp_dir / "compression_stats.csv"  
 20      
 21    print("WebP compression script starting...")  
 22    print(f"Processing directory: {webp_dir}")  
 23    print(f"Temporary directory: {temp_dir}")  
 24      
 25    # Initialize log  
 26    with open(log_file, "w", encoding="utf-8") as log:  
 27        log.write(f"WebP Compression Log - Start time: {time.ctime()}\n")  
 28      
 29    # Initialize statistics file  
 30    with open(stats_file, "w", encoding="utf-8") as stats:  
 31        stats.write("Original File,Compressed File,Original Size (KB),New Size (KB),Space Saved (KB),Savings Percentage\n")  
 32      
 33    # Collect all WebP files  
 34    all_webp = list(webp_dir.glob('**/*.webp'))  
 35    total_files = len(all_webp)  
 36      
 37    if total_files == 0:  
 38        print("No WebP files found. Please run the conversion script first.")  
 39        return  
 40      
 41    print(f"Found {total_files} WebP files to compress")  
 42      
 43    compressed_count = 0  
 44    skipped_count = 0  
 45    error_count = 0  
 46      
 47    # Process each WebP file  
 48    for idx, webp_path in enumerate(all_webp):  
 49        try:  
 50            # Display progress  
 51            progress = (idx + 1) / total_files * 100  
 52            sys.stdout.write(f"\rProgress: {progress:.2f}% ({idx+1}/{total_files})")  
 53            sys.stdout.flush()  
 54              
 55            # Original size  
 56            orig_size = webp_path.stat().st_size / 1024  # KB  
 57              
 58            # Create temporary file path  
 59            temp_path = temp_dir / f"{webp_path.stem}_compressed.webp"  
 60              
 61            # Perform secondary compression using cwebp  
 62            cmd = [  
 63                cwebp_path,  
 64                "-q", "75",  # Quality parameter  
 65                "-m", "6",   # Maximum compression mode  
 66                str(webp_path),  
 67                "-o", str(temp_path)  
 68            ]  
 69              
 70            # Execute command  
 71            result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)  
 72              
 73            if result.returncode != 0:  
 74                # Log compression failure  
 75                with open(log_file, "a", encoding="utf-8") as log:  
 76                    log.write(f"[ERROR] Failed to compress {webp_path}: {result.stderr}\n")  
 77                error_count += 1  
 78                continue  
 79              
 80            # Get new file size  
 81            new_size = temp_path.stat().st_size / 1024  # KB
 82            # Skip if the new file is larger than the original
 83            if new_size >= orig_size:
 84                skipped_count += 1
 85                temp_path.unlink()  # Delete temporary file
 86                continue
 87
 88            # Calculate space savings
 89            saved = orig_size - new_size
 90            saved_percent = (saved / orig_size) * 100 if orig_size > 0 else 0
 91
 92            # Record statistics
 93            with open(stats_file, "a", encoding="utf-8") as stats:
 94                stats.write(f"{webp_path},{webp_path},{orig_size:.2f},{new_size:.2f},{saved:.2f},{saved_percent:.2f}\n")
 95
 96            # Replace original file
 97            webp_path.unlink()  # Delete original file
 98            shutil.move(str(temp_path), str(webp_path))
 99            compressed_count += 1
100
101        except Exception as e:
102            with open(log_file, "a", encoding="utf-8") as log:
103                log.write(f"[Error] Processing {webp_path} failed: {str(e)}\n")
104            error_count += 1
105
106    # Completion report
107    total_size = sum(f.stat().st_size for f in webp_dir.glob('**/*') if f.is_file())
108    total_size_gb = total_size / (1024 ** 3)  # Convert to GB
109
110    end_time = time.time()
111    elapsed = end_time - start_time  # Total processing time
112    mins, secs = divmod(elapsed, 60)
113    hours, mins = divmod(mins, 60)
114
115    with open(log_file, "a", encoding="utf-8") as log:
116        log.write("\nCompression Report:\n")
117        log.write(f"Files processed: {total_files}\n")
118        log.write(f"Successfully compressed: {compressed_count}\n")
119        log.write(f"Skipped files: {skipped_count}\n")
120        log.write(f"Error files: {error_count}\n")
121        log.write(f"Total output directory size: {total_size_gb:.2f} GB\n")
122
123    print("\n\nCompression completed!")
124    print(f"Files processed: {total_files}")
125    print(f"Successfully compressed: {compressed_count}")
126    print(f"Skipped files: {skipped_count}")
127    print(f"Error files: {error_count}")
128    print(f"Total output directory size: {total_size_gb:.2f} GB")
129
130    # Clean temporary directory
131    try:
132        shutil.rmtree(temp_dir)
133        print(f"Cleaned temporary directory: {temp_dir}")
134    except Exception as e:
135        print(f"Error cleaning temporary directory: {str(e)}")
136
137    print(f"Log file: {log_file}")
138    print(f"Stats file: {stats_file}")
139    print(f"Total duration: {int(hours)}h {int(mins)}m {secs:.2f}s")
140
141if __name__ == "__main__":
142    main()

Implementation Plan

Selecting the Right Hugo Theme

For a Hugo project with tens of thousands of markdown files, choosing a theme can be quite challenging.

I tested one visually appealing theme that still had not finished generating the site after more than three hours. Some themes threw constant errors during generation, while others produced over 200,000 output files.

Ultimately, I settled on the most stable option - the PaperMod theme. By default, this theme generates only about 100 files, and the final website contains fewer than 50,000 files, which is relatively efficient.

Although it doesn’t meet Cloudflare Pages’ 20,000-file limit, it’s sufficiently lean. The build took 6.5 minutes on GitHub Pages and 8 minutes on Vercel.

However, some issues emerged during the build, chiefly around the sheer number of tag pages generated.

The tag issue also presents an optimization opportunity: keep only the top 1,000 most-used tags and incorporate the rest into the article titles. This could potentially bring the file count below 20,000, meeting Cloudflare Pages' requirements; a rough sketch of such a tag-pruning pass follows.
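
As a rough illustration of that idea (not something I have implemented yet), the sketch below counts tag frequency across the converted Markdown files with python-frontmatter and drops every tag outside the top 1,000; the folder name and cutoff are assumptions.

import os
from collections import Counter
import frontmatter

MD_DIR = "markdown_output"  # assumed location of the converted articles
KEEP_TOP_N = 1000           # assumed cutoff

def prune_rare_tags(md_dir, keep_top_n):
    paths = [os.path.join(md_dir, f) for f in os.listdir(md_dir) if f.endswith(".md")]

    # First pass: count how often each tag is used across all articles
    counts = Counter()
    for path in paths:
        counts.update(frontmatter.load(path).metadata.get("tags", []))
    keep = {tag for tag, _ in counts.most_common(keep_top_n)}

    # Second pass: drop tags outside the top N so Hugo builds far fewer term pages
    for path in paths:
        post = frontmatter.load(path)
        tags = post.metadata.get("tags", [])
        pruned = [t for t in tags if t in keep]
        if pruned != tags:
            post.metadata["tags"] = pruned
            with open(path, "w", encoding="utf-8") as f:
                f.write(frontmatter.dumps(post))

if __name__ == "__main__":
    prune_rare_tags(MD_DIR, KEEP_TOP_N)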

Choosing Static Site Hosting

The Hugo project itself is under 100MB (with 80MB being markdown files), making GitHub hosting feasible. Given GitHub Pages’ slower speeds, I opted for Vercel deployment. While Vercel’s 100GB bandwidth limit might seem restrictive, it should suffice for static content.

Selecting Image Hosting

I am still evaluating options. I initially considered Cloudflare R2 but hesitated over the risk of exceeding its free tier limits. For now, I am using a budget $7/year “fake Alibaba Cloud” VPS as a temporary solution.

#soomal #hugo #html #python
