Migrating Soomal.cc to Hugo

Earlier this year, after obtaining the source code of the Soomal.com website, I uploaded it to my VPS. However, the original site's architecture is outdated, inconvenient to manage, and not mobile-friendly, so I recently undertook a complete overhaul, converting and migrating the entire site to Hugo.
Migration Plan Design
I had long considered revamping Soomal.cc. I had previously run some preliminary tests but encountered numerous issues, which led me to shelve the project temporarily.
Challenges and Difficulties
Large Volume of Articles
Soomal contains 9,630 articles, with the earliest dating back to 2003, totaling 19 million words.
The site also hosts 326,000 JPG images across more than 4,700 folders. Most images come in three sizes, though some are missing, resulting in a total size of nearly 70 GB.
Complexity in Article Conversion
The Soomal source code only includes HTML files for article pages. While these files might have been generated by the same program, preliminary tests revealed that the page structure had undergone multiple changes over time, with different tags used in various periods, making information extraction from the HTML files highly challenging.
- Encoding Issues: The original HTML files use GB2312 encoding and were likely hosted on a Windows server, requiring special handling for character encoding and escape sequences during conversion.
- Image Issues: The site contains a vast number of images, which are the essence of Soomal. However, these images use diverse tags and styles, making it difficult to extract links and descriptions without omissions.
- Tags and Categories: The site has nearly 12,000 article tags and over 20 categories. However, the HTML files lack category information, which can only be found in the 2,000+ category slice HTML files. Tags also present problems: some contain spaces or special characters, and some are duplicated within the same article.
- Article Content: The HTML files include the main text, related articles, and tags, all nested under the DOC tag. Initially, I overlooked that the related articles use lowercase doc tags, which led to extraction errors during testing. It was only after noticing this discrepancy while browsing the site that I restarted the conversion project.
Storage Solution Dilemma
I initially hosted Soomal.cc on a VPS. Over a few months, despite low traffic, bandwidth usage soared to nearly 1.5TB. Although the VPS offers unlimited bandwidth, this was concerning. After migrating to Hugo, I found that most free hosting services impose restrictions: GitHub recommends repositories under 1GB, Cloudflare Pages limits a site to 20,000 files, Cloudflare R2 caps free-tier storage at 10GB, and Vercel and Netlify both limit traffic to 100GB.
Conversion Methodology
Given the potential challenges in converting Soomal to Hugo, I devised a five-step migration plan.
Step 1: Convert HTML Files to Markdown
Define Conversion Requirements
- Extract Titles: Retrieve article titles from the <head> tag. For example, extract 谈谈手机产业链和手机厂商的相互影响 from <title>刘延作品 - 谈谈手机产业链和手机厂商的相互影响 [Soomal]</title>.
- Extract Tags: Use keyword filtering to locate tags in the HTML, extract tag names, and enclose them in quotes to handle spaces in tag names.
- Extract Main Text: Retrieve the article body from the DOC tag and truncate content after the doc tag.
- Extract Metadata: Gather publication dates, author information, and header images from the HTML.
- Extract Images: Identify and extract all image references (e.g., smallpic, bigpic, smallpic2, wrappic).
- Extract Special Content: Include subheadings, download links, tables, etc.
File Conversion
Given the clear requirements, I used Python scripts for the conversion.
Click to View Conversion Script Example
1import os
2import re
3from bs4 import BeautifulSoup, Tag, NavigableString
4from datetime import datetime
5
6def convert_html_to_md(html_path, output_dir):
7 try:
8 # Read HTML files with GB2312 encoding
9 with open(html_path, 'r', encoding='gb2312', errors='ignore') as f:
10 html_content = f.read()
11
12 soup = BeautifulSoup(html_content, 'html.parser')
13
14 # 1. Extract title
15 title = extract_title(soup)
16
17 # 2. Extract bookmark tags
18 bookmarks = extract_bookmarks(soup)
19
20 # 3. Extract title image and info
21 title_img, info_content = extract_title_info(soup)
22
23 # 4. Extract main content
24 body_content = extract_body_content(soup)
25
26 # Generate YAML frontmatter
27 frontmatter = f"""---
28title: "{title}"
29date: {datetime.now().strftime('%Y-%m-%dT%H:%M:%S+08:00')}
30tags: {bookmarks}
31title_img: "{title_img}"
32info: "{info_content}"
33---\n\n"""
34
35 # Generate Markdown content
36 markdown_content = frontmatter + body_content
37
38 # Save Markdown file
39 output_path = os.path.join(output_dir, os.path.basename(html_path).replace('.htm', '.md'))
40 with open(output_path, 'w', encoding='utf-8') as f:
41 f.write(markdown_content)
42
43 return f"Conversion successful: {os.path.basename(html_path)}"
44 except Exception as e:
45 return f"Conversion failed {os.path.basename(html_path)}: {str(e)}"
46
47def extract_title(soup):
48 """Extract title"""
49 if soup.title:
50 return soup.title.string.strip()
51 return ""
52
53def extract_bookmarks(soup):
54 """Extract bookmark tags, each enclosed in quotes"""
55 bookmarks = []
56 bookmark_element = soup.find(string=re.compile(r'本文的相关书签:'))
57
58 if bookmark_element:
59 parent = bookmark_element.find_parent(['ul', 'li'])
60 if parent:
61 # Extract text from all <a> tags
62 for a_tag in parent.find_all('a'):
63 text = a_tag.get_text().strip()
64 if text:
65 # Enclose each tag in quotes
66 bookmarks.append(f'"{text}"')
67
68 return f"[{', '.join(bookmarks)}]" if bookmarks else "[]"
69
70def extract_title_info(soup):
71 """Extract title image and info content"""
72 title_img = ""
73 info_content = ""
74
75 titlebox = soup.find('div', class_='titlebox')
76 if titlebox:
77 # Extract title image
78 title_img_div = titlebox.find('div', class_='titleimg')
79 if title_img_div and title_img_div.img:
80 title_img = title_img_div.img['src']
81
82 # Extract info content
83 info_div = titlebox.find('div', class_='info')
84 if info_div:
85 # Remove all HTML tags, keeping only text
86 info_content = info_div.get_text().strip()
87
88 return title_img, info_content
89
90def extract_body_content(soup):
91 """Extract main content and process images"""
92 body_content = ""
93 doc_div = soup.find('div', class_='Doc') # Note uppercase 'D'
94
95 if doc_div:
96 # Remove all nested div class="doc" (lowercase)
97 for nested_doc in doc_div.find_all('div', class_='doc'):
98 nested_doc.decompose()
99
100 # Process images
101 process_images(doc_div)
102
103 # Iterate through all child elements to build Markdown content
104 for element in doc_div.children:
105 if isinstance(element, Tag):
106 if element.name == 'div' and 'subpagetitle' in element.get('class', []):
107 # Convert to subheading
108 body_content += f"## {element.get_text().strip()}\n\n"
109 else:
110 # Preserve other content
111 body_content += element.get_text().strip() + "\n\n"
112 elif isinstance(element, NavigableString):
113 body_content += element.strip() + "\n\n"
114
115 return body_content.strip()
116
117def process_images(container):
118 """Process image content (Rules A/B/C)"""
119 # A: Handle <li data-src> tags
120 for li in container.find_all('li', attrs={'data-src': True}):
121 img_url = li['data-src'].replace('..', 'https://soomal.cc', 1)
122 caption_div = li.find('div', class_='caption')
123 content_div = li.find('div', class_='content')
124
125 alt_text = caption_div.get_text().strip() if caption_div else ""
126 meta_text = content_div.get_text().strip() if content_div else ""
127
128 # Create Markdown image syntax
        img_md = f"![{alt_text}]({img_url})\n\n{meta_text}\n\n"
130 li.replace_with(img_md)
131
132 # B: Process <span class="smallpic"> tags
133 for span in container.find_all('span', class_='smallpic'):
134 img = span.find('img')
135 if img and 'src' in img.attrs:
136 img_url = img['src'].replace('..', 'https://soomal.cc', 1)
137 caption_div = span.find('div', class_='caption')
138 content_div = span.find('div', class_='content')
139
140 alt_text = caption_div.get_text().strip() if caption_div else ""
141 meta_text = content_div.get_text().strip() if content_div else ""
142
143 # Create Markdown image syntax
            img_md = f"![{alt_text}]({img_url})\n\n{meta_text}\n\n"
145 span.replace_with(img_md)
146
147 # C: Process <div class="bigpic"> tags
148 for div in container.find_all('div', class_='bigpic'):
149 img = div.find('img')
150 if img and 'src' in img.attrs:
151 img_url = img['src'].replace('..', 'https://soomal.cc', 1)
152 caption_div = div.find('div', class_='caption')
153 content_div = div.find('div', class_='content')
154
155 alt_text = caption_div.get_text().strip() if caption_div else ""
156 meta_text = content_div.get_text().strip() if content_div else ""
157
158 # Create Markdown image syntax
            img_md = f"![{alt_text}]({img_url})\n\n{meta_text}\n\n"
160 div.replace_with(img_md)
161
162if __name__ == "__main__":
163 input_dir = 'doc'
164 output_dir = 'markdown_output'
165
166 # Create output directory
167 os.makedirs(output_dir, exist_ok=True)
168
169 # Process all HTML files
170 for filename in os.listdir(input_dir):
171 if filename.endswith('.htm'):
172 html_path = os.path.join(input_dir, filename)
173 result = convert_html_to_md(html_path, output_dir)
            print(result)
Step 2: Processing Categories and Abstracts
Because the original HTML files contain no category information, the category index pages had to be processed separately. Article abstracts were also handled at the same time during category processing.
Extracting Category and Abstract Information
A Python script extracts and formats the category and abstract information from the 2,000+ category pages.
Click to view conversion code
1import os
2import re
3from bs4 import BeautifulSoup
4import codecs
5from collections import defaultdict
6
7def extract_category_info(folder_path):
8 # Use defaultdict to automatically initialize nested dictionaries
9 article_categories = defaultdict(set) # Stores article ID to category mapping
10 article_summaries = {} # Stores article ID to abstract mapping
11
12 # Iterate through all HTM files in the folder
13 for filename in os.listdir(folder_path):
14 if not filename.endswith('.htm'):
15 continue
16
17 file_path = os.path.join(folder_path, filename)
18
19 try:
20 # Read file with GB2312 encoding and convert to UTF-8
21 with codecs.open(file_path, 'r', encoding='gb2312', errors='replace') as f:
22 content = f.read()
23
24 soup = BeautifulSoup(content, 'html.parser')
25
26 # Extract category name
27 title_tag = soup.title
28 if title_tag:
29 title_text = title_tag.get_text().strip()
30 # Extract content before the first hyphen
31 category_match = re.search(r'^([^-]+)', title_text)
32 if category_match:
33 category_name = category_match.group(1).strip()
34 # Add quotes if category name contains spaces
35 if ' ' in category_name:
36 category_name = f'"{category_name}"'
37 else:
38 category_name = "Unknown_Category"
39 else:
40 category_name = "Unknown_Category"
41
42 # Extract article information
43 for item in soup.find_all('div', class_='item'):
44 # Extract article ID
45 article_link = item.find('a', href=True)
46 if article_link:
47 href = article_link['href']
48 article_id = re.search(r'../doc/(\d+)\.htm', href)
49 if article_id:
50 article_id = article_id.group(1)
51 else:
52 continue
53 else:
54 continue
55
56 # Extract article abstract
57 synopsis_div = item.find('div', class_='synopsis')
58 synopsis = synopsis_div.get_text().strip() if synopsis_div else ""
59
60 # Store category information
61 article_categories[article_id].add(category_name)
62
63 # Store abstract (only once to avoid overwriting)
64 if article_id not in article_summaries:
65 article_summaries[article_id] = synopsis
66
67 except UnicodeDecodeError:
68 # Attempt using GBK encoding as fallback
69 try:
70 with codecs.open(file_path, 'r', encoding='gbk', errors='replace') as f:
71 content = f.read()
72 # Reprocess content...
73 # Note: Repeated processing code omitted here; should be extracted as a function
74 # For code completeness, we include the repeated logic
75 soup = BeautifulSoup(content, 'html.parser')
76 title_tag = soup.title
77 if title_tag:
78 title_text = title_tag.get_text().strip()
79 category_match = re.search(r'^([^-]+)', title_text)
80 if category_match:
81 category_name = category_match.group(1).strip()
82 if ' ' in category_name:
83 category_name = f'"{category_name}"'
84 else:
85 category_name = "Unknown_Category"
86 else:
87 category_name = "Unknown_Category"
88
89 for item in soup.find_all('div', class_='item'):
90 article_link = item.find('a', href=True)
91 if article_link:
92 href = article_link['href']
93 article_id = re.search(r'../doc/(\d+)\.htm', href)
94 if article_id:
95 article_id = article_id.group(1)
96 else:
                            continue
                    else:
                        continue

                    synopsis_div = item.find('div', class_='synopsis')
                    synopsis = synopsis_div.get_text().strip() if synopsis_div else ""

                    article_categories[article_id].add(category_name)

                    if article_id not in article_summaries:
                        article_summaries[article_id] = synopsis

            except Exception as e:
                print(f"Error processing file {filename} (after trying GBK): {str(e)}")
                continue

        except Exception as e:
            print(f"Error processing file {filename}: {str(e)}")
            continue

    return article_categories, article_summaries
118
119def save_to_markdown(article_categories, article_summaries, output_path):
120 with open(output_path, 'w', encoding='utf-8') as md_file:
121 # Write Markdown header
122 md_file.write("# Article Categories and Summaries\n\n")
123 md_file.write("> This file contains IDs, categories and summaries of all articles\n\n")
124
125 # Sort by article ID
126 sorted_article_ids = sorted(article_categories.keys(), key=lambda x: int(x))
127
128 for article_id in sorted_article_ids:
129 # Get sorted category list
130 categories = sorted(article_categories[article_id])
131 # Format as list string
132 categories_str = ", ".join(categories)
133
134 # Get summary
135 summary = article_summaries.get(article_id, "No summary available")
136
137 # Write Markdown content
138 md_file.write(f"## Filename: {article_id}\n")
139 md_file.write(f"**Categories**: {categories_str}\n")
140 md_file.write(f"**Summary**: {summary}\n\n")
141 md_file.write("---\n\n")
142
143if __name__ == "__main__":
144 # Configure input and output paths
145 input_folder = 'Categories' # Replace with your HTM folder path
146 output_md = 'articles_categories.md'
147
148 # Execute extraction
149 article_categories, article_summaries = extract_category_info(input_folder)
150
151 # Save results to Markdown file
152 save_to_markdown(article_categories, article_summaries, output_md)
153
154 # Print statistics
155 print(f"Successfully processed data for {len(article_categories)} articles")
156 print(f"Saved to {output_md}")
157 print(f"Found {len(article_summaries)} articles with summaries")Writing category and summary information to markdown files
This step is relatively simple: the extracted category and summary data are written into the previously converted Markdown files one by one.
Click to view the writing script
1import os
2import re
3import ruamel.yaml
4from collections import defaultdict
5
6def parse_articles_categories(md_file_path):
7 """
8 Parse articles_categories.md file to extract article IDs, categories and summaries
9 """
10 article_info = defaultdict(dict)
11 current_id = None
12
13 try:
14 with open(md_file_path, 'r', encoding='utf-8') as f:
15 for line in f:
16 # Match filename
17 filename_match = re.match(r'^## Filename: (\d+)$', line.strip())
18 if filename_match:
19 current_id = filename_match.group(1)
20 continue
21
22 # Match category information
23 categories_match = re.match(r'^\*\*Categories\*\*: (.+)$', line.strip())
24 if categories_match and current_id:
25 categories_str = categories_match.group(1)
26 # Clean category string, remove extra spaces and quotes
27 categories = [cat.strip().strip('"') for cat in categories_str.split(',')]
28 article_info[current_id]['categories'] = categories
29 continue
30
31 # Match summary information
32 summary_match = re.match(r'^\*\*Summary\*\*: (.+)$', line.strip())
33 if summary_match and current_id:
34 summary = summary_match.group(1)
35 article_info[current_id]['summary'] = summary
36 continue
37
38 # Reset current ID when encountering separator
39 if line.startswith('---'):
40 current_id = None
41
42 except Exception as e:
43 print(f"Error parsing articles_categories.md file: {str(e)}")
44
45 return article_info
46
47def update_markdown_files(article_info, md_folder):
48 """
49 Update Markdown files by adding category and summary information to frontmatter
50 """
51 updated_count = 0
52 skipped_count = 0
53
54 # Initialize YAML parser
55 yaml = ruamel.yaml.YAML()
56 yaml.preserve_quotes = True
57 yaml.width = 1000 # Prevent long summaries from line breaking
58
59 for filename in os.listdir(md_folder):
60 if not filename.endswith('.md'):
61 continue
62
63 article_id = filename[:-3] # Remove .md extension
64 file_path = os.path.join(md_folder, filename)
65
66 # Check if information exists for this article
67 if article_id not in article_info:
68 skipped_count += 1
69 continue
70
71 try:
72 with open(file_path, 'r', encoding='utf-8') as f:
73 content = f.read()
74
75 # Parse frontmatter
76 frontmatter_match = re.search(r'^---\n(.*?)\n---', content, re.DOTALL)
77 if not frontmatter_match:
78 print(f"No frontmatter found in file {filename}, skipping")
79 skipped_count += 1
80 continue
81
82 frontmatter_content = frontmatter_match.group(1)
83
84 # Convert frontmatter to dictionary
85 data = yaml.load(frontmatter_content)
86 if data is None:
87 data = {}
88
89 # Add category and summary information
90 info = article_info[article_id]
91
92 # Add categories
93 if 'categories' in info:
94 # If categories already exist, merge them (deduplicate)
95 existing_categories = set(data.get('categories', []))
96 new_categories = set(info['categories'])
97 combined_categories = sorted(existing_categories.union(new_categories))
98 data['categories'] = combined_categories
99
100 # Add summary (if summary exists and is not empty)
101 if 'summary' in info and info['summary']:
102 # Only update if summary doesn't exist or new summary is not empty
103 if 'summary' not in data or info['summary']:
104 data['summary'] = info['summary']
105
106 # Regenerate frontmatter
107 new_frontmatter = '---\n'
            with ruamel.yaml.compat.StringIO() as stream:  # io.StringIO works equally well here
                yaml.dump(data, stream)
                new_frontmatter += stream.getvalue().strip()
            new_frontmatter += '\n---'

            # Replace original frontmatter
            new_content = content.replace(frontmatter_match.group(0), new_frontmatter)

            # Write to file
            with open(file_path, 'w', encoding='utf-8') as f:
                f.write(new_content)

            updated_count += 1

        except Exception as e:
            print(f"Error updating file {filename}: {str(e)}")
            skipped_count += 1

    return updated_count, skipped_count
131
132if __name__ == "__main__":
133 # Configure paths
134 articles_md = 'articles_categories.md' # Markdown file containing category and summary information
135 md_folder = 'markdown_output' # Folder containing Markdown articles
136
137 # Parse articles_categories.md file
138 print("Parsing articles_categories.md file...")
139 article_info = parse_articles_categories(articles_md)
140 print(f"Successfully parsed information for {len(article_info)} articles")
141
142 # Update Markdown files
143 print(f"\nUpdating category and summary information for {len(article_info)} articles...")
144 updated, skipped = update_markdown_files(article_info, md_folder)
145
146 # Print statistics
147 print(f"\nProcessing complete!")
148 print(f"Successfully updated: {updated} files")
149 print(f"Skipped: {skipped} files")
150 print(f"Articles with found information: {len(article_info)}")Step 3: Convert article frontmatter information
This step primarily involves correcting the frontmatter section of the output Markdown files to meet Hugo theme requirements.
- Revise article header information according to frontmatter specifications: this mainly handles special characters, date formats, authors, featured images, tags, categories, etc.
View conversion code
1import os
2import re
3import frontmatter
4import yaml
5from datetime import datetime
6
7def escape_special_characters(text):
8 """Escape special characters in YAML"""
9 # Escape backslashes while preserving already escaped characters
10 return re.sub(r'(?<!\\)\\(?!["\\/bfnrt]|u[0-9a-fA-F]{4})', r'\\\\', text)
11
12def process_md_files(folder_path):
13 for filename in os.listdir(folder_path):
14 if filename.endswith(".md"):
15 file_path = os.path.join(folder_path, filename)
16 try:
17 # Read file content
18 with open(file_path, 'r', encoding='utf-8') as f:
19 content = f.read()
20
21 # Manually split frontmatter and content
22 if content.startswith('---\n'):
23 parts = content.split('---\n', 2)
24 if len(parts) >= 3:
25 fm_text = parts[1]
26 body_content = parts[2] if len(parts) > 2 else ""
27
28 # Escape special characters
29 fm_text = escape_special_characters(fm_text)
30
31 # Recombine content
32 new_content = f"---\n{fm_text}---\n{body_content}"
33
34 # Parse frontmatter using safe loading
35 post = frontmatter.loads(new_content)
36
37 # Process info field
38 if 'info' in post.metadata:
39 info = post.metadata['info']
40
41 # Extract date
42 date_match = re.search(r'On (\d{4}\.\d{1,2}\.\d{1,2} \d{1,2}:\d{2}:\d{2})', info)
43 if date_match:
44 date_str = date_match.group(1)
45 try:
46 dt = datetime.strptime(date_str, "%Y.%m.%d %H:%M:%S")
47 post.metadata['date'] = dt.strftime("%Y-%m-%dT%H:%M:%S+08:00")
48 except ValueError:
49 # Keep original date as fallback
50 pass
51
52 # Extract author
53 author_match = re.match(r'^(.+?)作品', info)
54 if author_match:
55 authors = author_match.group(1).strip()
56 # Split multiple authors
57 author_list = [a.strip() for a in re.split(r'\s+', authors) if a.strip()]
58 post.metadata['author'] = author_list
59
60 # Create description
61 desc_parts = info.split('|', 1)
62 if len(desc_parts) > 1:
63 post.metadata['description'] = desc_parts[1].strip()
64
65 # Remove original info
66 del post.metadata['info']
67
68 # Process title_img
69 if 'title_img' in post.metadata:
70 img_url = post.metadata['title_img'].replace("../", "https://soomal.cc/")
71 # Handle potential double slashes
72 img_url = re.sub(r'(?<!:)/{2,}', '/', img_url)
73 post.metadata['cover'] = {
74 'image': img_url,
75 'caption': "",
76 'alt': "",
77 'relative': False
78 }
79 del post.metadata['title_img']
80
81 # Modify title
82 if 'title' in post.metadata:
83 title = post.metadata['title']
84 # Remove content before "-"
85 if '-' in title:
86 new_title = title.split('-', 1)[1].strip()
87 post.metadata['title'] = new_title
88
89 # Save modified file
90 with open(file_path, 'w', encoding='utf-8') as f_out:
91 f_out.write(frontmatter.dumps(post))
92 except Exception as e:
93 print(f"Error processing file {filename}: {str(e)}")
94 # Log error files for later review
95 with open("processing_errors.log", "a", encoding="utf-8") as log:
96 log.write(f"Error in {filename}: {str(e)}\n")
97
if __name__ == "__main__":
    folder_path = "markdown_output"  # Replace with your actual path
    process_md_files(folder_path)
    print("Frontmatter processing completed for all Markdown files!")
- Streamlining Tags and Categories: Soomal.com originally had over 20 article categories, some of which were meaningless (e.g., the “All Articles” category). Additionally, there was significant overlap between article categories and tags. To ensure uniqueness between categories and tags, further simplification was implemented. Another goal was to minimize the number of files generated during the final website build.
View the code for streamlining tags and categories
1import os
2import yaml
3import frontmatter
4
5def clean_hugo_tags_categories(folder_path):
6 """
7 Clean up tags and categories in Hugo articles:
8 1. Remove "All Articles" from categories
9 2. Remove tags that duplicate categories
10 """
11 # Valid categories list ("All Articles" removed)
12 valid_categories = [
13 "Digital Devices", "Audio", "Music", "Mobile Digital", "Reviews", "Introductions",
14 "Evaluation Reports", "Galleries", "Smartphones", "Android", "Headphones",
15 "Musicians", "Imaging", "Digital Terminals", "Speakers", "iOS", "Cameras",
16 "Sound Cards", "Album Reviews", "Tablets", "Technology", "Applications",
17 "Portable Players", "Windows", "Digital Accessories", "Essays", "DACs",
18 "Audio Systems", "Lenses", "Musical Instruments", "Audio Codecs"
19 ]
20
21 # Process all Markdown files in the folder
22 for filename in os.listdir(folder_path):
23 if not filename.endswith('.md'):
24 continue
25
26 filepath = os.path.join(folder_path, filename)
27 with open(filepath, 'r', encoding='utf-8') as f:
28 post = frontmatter.load(f)
29
30 # 1. Clean categories (remove invalid entries and deduplicate)
31 if 'categories' in post.metadata:
32 # Convert to set for deduplication + filter invalid categories
33 categories = list(set(post.metadata['categories']))
34 cleaned_categories = [
35 cat for cat in categories
36 if cat in valid_categories
37 ]
38 post.metadata['categories'] = cleaned_categories
39
40 # 2. Clean tags (remove duplicates with categories)
41 if 'tags' in post.metadata:
42 current_cats = post.metadata.get('categories', [])
43 # Convert to set for deduplication + filter category duplicates
44 tags = list(set(post.metadata['tags']))
45 cleaned_tags = [
46 tag for tag in tags
47 if tag not in current_cats
48 ]
49 post.metadata['tags'] = cleaned_tags
50
51 # Save modified file
52 with open(filepath, 'w', encoding='utf-8') as f_out:
53 f_out.write(frontmatter.dumps(post))
54
55if __name__ == "__main__":
56 # Example usage (modify with your actual path)
57 md_folder = "./markdown_output"
58 clean_hugo_tags_categories(md_folder)
59 print(f"Processing completed: {len(os.listdir(md_folder))} files")Step 4: Reducing Image Quantity
During the HTML-to-Markdown conversion, only article content was extracted, so many of the cropped image variants from the original site became unnecessary. I therefore matched the converted Markdown files against the original site's images to identify only those needed for the new site.
This step reduced the total number of images from 326,000 to 118,000.
- Extracting Image Links: Extract all image links from the Markdown files. Since the image links were standardized during conversion, this process was straightforward.
View the extraction code
1import os
2import re
3import argparse
4
5def extract_image_links(directory):
6 """Extract image links from all md files in directory"""
7 image_links = set()
8 pattern = re.compile(r'https://soomal\.cc[^\s\)\]\}]*?\.jpg', re.IGNORECASE)
9
10 for root, _, files in os.walk(directory):
11 for filename in files:
12 if filename.endswith('.md'):
13 filepath = os.path.join(root, filename)
14 try:
15 with open(filepath, 'r', encoding='utf-8') as f:
16 content = f.read()
17 matches = pattern.findall(content)
18 if matches:
19 image_links.update(matches)
20 except Exception as e:
21 print(f"Error processing {filepath}: {str(e)}")
22
23 return sorted(image_links)
24
25def save_links_to_file(links, output_file):
26 """Save links to file"""
27 with open(output_file, 'w', encoding='utf-8') as f:
28 for link in links:
29 f.write(link + '\n')
30
31if __name__ == "__main__":
32 parser = argparse.ArgumentParser(description='Extract image links from Markdown')
33 parser.add_argument('--input', default='markdown_output', help='Path to Markdown directory')
34 parser.add_argument('--output', default='image_links.txt', help='Output file path')
35 args = parser.parse_args()
36
37 print(f"Scanning directory: {args.input}")
38 links = extract_image_links(args.input)
39
40 print(f"Found {len(links)} unique image links")
41 save_links_to_file(links, args.output)
42 print(f"Links saved to: {args.output}")- Copying Corresponding Images Use the extracted image links to locate and copy corresponding files from the original site directory, ensuring directory accuracy.
A. View Windows Copy Code
1import os
2import shutil
3import time
4import sys
5
6def main():
7 # Configuration
8 source_drive = "F:\\"
9 target_drive = "D:\\"
10 image_list_file = r"D:\trans-soomal\image_links.txt"
11 log_file = r"D:\trans-soomal\image_copy_log.txt"
12 error_log_file = r"D:\trans-soomal\image_copy_errors.txt"
13
14 print("Image copy script starting...")
15
16 # Record start time
17 start_time = time.time()
18
19 # Create log files
20 with open(log_file, "w", encoding="utf-8") as log, open(error_log_file, "w", encoding="utf-8") as err_log:
21 log.write(f"Image Copy Log - Start Time: {time.ctime(start_time)}\n")
22 err_log.write("Failed copies:\n")
23
24 try:
25 # Read image list
26 with open(image_list_file, "r", encoding="utf-8") as f:
27 image_paths = [line.strip() for line in f if line.strip()]
28
29 total_files = len(image_paths)
30 success_count = 0
31 fail_count = 0
32 skipped_count = 0
33
34 print(f"Found {total_files} images to copy")
35
36 # Process each file
37 for i, relative_path in enumerate(image_paths):
38 # Display progress
39 progress = (i + 1) / total_files * 100
40 sys.stdout.write(f"\rProgress: {progress:.2f}% ({i+1}/{total_files})")
41 sys.stdout.flush()
42
43 # Build full paths
44 source_path = os.path.join(source_drive, relative_path)
45 target_path = os.path.join(target_drive, relative_path)
46
47 try:
48 # Check if source exists
49 if not os.path.exists(source_path):
50 err_log.write(f"Source missing: {source_path}\n")
51 fail_count += 1
52 continue
53
54 # Check if target already exists
                    if os.path.exists(target_path):
56 log.write(f"File already exists, skipping: {target_path}\n")
57 skipped_count += 1
58 continue
59
60 # Create target directory
61 target_dir = os.path.dirname(target_path)
62 os.makedirs(target_dir, exist_ok=True)
63
64 # Copy file
65 shutil.copy2(source_path, target_path)
66
67 # Log success
68 log.write(f"[SUCCESS] Copied {source_path} to {target_path}\n")
69 success_count += 1
70
71 except Exception as e:
72 # Log failure
73 err_log.write(f"[FAILED] {source_path} -> {target_path} : {str(e)}\n")
74 fail_count += 1
75
76 # Calculate elapsed time
77 end_time = time.time()
78 elapsed_time = end_time - start_time
79 minutes, seconds = divmod(elapsed_time, 60)
80 hours, minutes = divmod(minutes, 60)
81
82 # Write summary
83 summary = f"""
84================================
85Copy operation completed
86Start time: {time.ctime(start_time)}
87End time: {time.ctime(end_time)}
88Total duration: {int(hours)}h {int(minutes)}m {seconds:.2f}s
89
90Total files: {total_files}
91Successfully copied: {success_count}
92Skipped (existing): {skipped_count}
93Failed: {fail_count}
94================================
95"""
96 log.write(summary)
97 print(summary)
98
99 except Exception as e:
100 print(f"\nError occurred: {str(e)}")
101 err_log.write(f"Script error: {str(e)}\n")
102
103if __name__ == "__main__":
    main()
B. View Linux Copy Code
1#!/bin/bash
2
3# Configuration parameters
4LINK_FILE="/user/image_links.txt" # Replace with actual link file path
5SOURCE_BASE="/user/soomal.cc/index"
6DEST_BASE="/user/images.soomal.cc/index"
7LOG_FILE="/var/log/image_copy_$(date +%Y%m%d_%H%M%S).log"
THREADS=3 # Number of parallel copy jobs (could also be set to the CPU core count)
9
10# Start logging
11{
12echo "===== Copy Task Started: $(date) ====="
13echo "Source base directory: $SOURCE_BASE"
14echo "Destination base directory: $DEST_BASE"
15echo "Link file: $LINK_FILE"
16echo "Thread count: $THREADS"
17
18# Path validation example
19echo -e "\n=== Path Validation ==="
20sample_url="https://soomal.cc/images/doc/20090406/00000007.jpg"
21expected_src="${SOURCE_BASE}/images/doc/20090406/00000007.jpg"
22expected_dest="${DEST_BASE}/images/doc/20090406/00000007.jpg"
23
24echo "Example URL: $sample_url"
25echo "Expected source path: $expected_src"
26echo "Expected destination path: $expected_dest"
27
28if [[ -f "$expected_src" ]]; then
29 echo "Validation successful: Example source file exists"
30else
31 echo "Validation failed: Example source file missing! Please check paths"
32 exit 1
33fi
34
35# Create destination base directory
36mkdir -p "${DEST_BASE}/images"
37
38# Prepare parallel processing
39echo -e "\n=== Processing Started ==="
40total=$(wc -l < "$LINK_FILE")
41echo "Total links: $total"
42counter=0
43
44# Processing function
45process_link() {
46 local url="$1"
47 local rel_path="${url#https://soomal.cc}"
48
49 # Build full paths
50 local src_path="${SOURCE_BASE}${rel_path}"
51 local dest_path="${DEST_BASE}${rel_path}"
52
53 # Create destination directory
54 mkdir -p "$(dirname "$dest_path")"
55
56 # Copy file
57 if [[ -f "$src_path" ]]; then
58 if cp -f "$src_path" "$dest_path"; then
59 echo "SUCCESS: $rel_path"
60 return 0
61 else
62 echo "COPY FAILED: $rel_path"
63 return 2
64 fi
65 else
66 echo "MISSING: $rel_path"
67 return 1
68 fi
69}
70
71# Export function for parallel use
72export -f process_link
73export SOURCE_BASE DEST_BASE
74
75# Use parallel for concurrent processing
76echo "Starting parallel copying..."
77parallel --bar --jobs $THREADS --progress \
78 --halt soon,fail=1 \
79 --joblog "${LOG_FILE}.jobs" \
80 --tagstring "{}" \
81 "process_link {}" < "$LINK_FILE" | tee -a "$LOG_FILE"
82
83# Collect results
84success=$(grep -c 'SUCCESS:' "$LOG_FILE")
85missing=$(grep -c 'MISSING:' "$LOG_FILE")
86failed=$(grep -c 'COPY FAILED:' "$LOG_FILE")
87
88# Final statistics
89echo -e "\n===== Copy Task Completed: $(date) ====="
90echo "Total links: $total"
91echo "Successfully copied: $success"
92echo "Missing files: $missing"
93echo "Copy failures: $failed"
94echo "Success rate: $((success * 100 / total))%"
95
96} | tee "$LOG_FILE"
97
98# Save missing files list
99grep '^MISSING:' "$LOG_FILE" | cut -d' ' -f2- > "${LOG_FILE%.log}_missing.txt"
100echo "Missing files list: ${LOG_FILE%.log}_missing.txt"Step 5: Compress Image Sizes
I had previously compressed the website's source images once, but it wasn't enough. My goal was to reduce the total image size to under 10GB to meet the potential future requirement of migrating to Cloudflare R2.
- Convert JPG to WebP: When I compressed the images previously, I kept them in JPG format because the numerous HTML files referenced .jpg paths and I wanted to avoid breaking access. Since this migration is to Hugo, there is no longer any need to retain the JPG format, so I converted the images directly to WebP. Additionally, since my page width is set to 960px and I'm not using any fancy lightbox plugins, resizing the images to 960px further reduces their size.
Actual tests showed that after this compression, the total image size dropped to 7.7GB. However, I noticed a minor issue with this processing logic: Soomal has many vertical images as well as horizontal ones, and a 960px width looks somewhat small on 4K displays. I ultimately converted the images with the short edge capped at 1280px at 85% quality, resulting in about 14GB, which fits within my VPS's 20GB of storage. I also tested a 1150px short edge at 80% quality, which met the 10GB target.
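For reference, here is a minimal sketch of that short-edge variant, following the same ImageMagick-via-subprocess pattern as the full script below. It assumes ImageMagick 7's magick CLI; the magick.exe path and the example file paths are placeholders.

```python
import subprocess
from pathlib import Path

MAGICK = r"C:\webp\magick.exe"  # placeholder path, matching the script below

def convert_short_edge(src: Path, dst: Path, max_short_edge: int = 1280, quality: int = 85) -> None:
    """Convert one image to WebP, shrinking it only if its short edge exceeds max_short_edge."""
    # Read the image dimensions with ImageMagick's identify
    out = subprocess.run(
        [MAGICK, "identify", "-format", "%w %h", str(src)],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    width, height = int(out[0]), int(out[1])

    cmd = [MAGICK, str(src)]
    if min(width, height) > max_short_edge:
        # "^" fill-area geometry: scale so the smaller dimension equals max_short_edge
        cmd += ["-resize", f"{max_short_edge}x{max_short_edge}^"]
    cmd += ["-quality", str(quality), str(dst)]  # output format inferred from the .webp extension
    subprocess.run(cmd, check=True)

# Example: convert_short_edge(Path("D:/images/doc/x.jpg"), Path("D:/images_webp/doc/x.webp"))
```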
View Image Conversion Code
1import os
2import subprocess
3import time
4import sys
5import shutil
6from pathlib import Path
7
8def main():
9 # Configure paths
10 source_dir = Path("D:\\images") # Original image directory
11 output_dir = Path("D:\\images_webp") # WebP output directory
12 temp_dir = Path("D:\\temp_webp") # Temporary processing directory
13 magick_path = "C:\\webp\\magick.exe" # ImageMagick path
14
15 # Create necessary directories
16 output_dir.mkdir(parents=True, exist_ok=True)
17 temp_dir.mkdir(parents=True, exist_ok=True)
18
19 # Log files
20 log_file = output_dir / "conversion_log.txt"
21 stats_file = output_dir / "conversion_stats.csv"
22
23 print("Image conversion script starting...")
24 print(f"Source directory: {source_dir}")
25 print(f"Output directory: {output_dir}")
26 print(f"Temporary directory: {temp_dir}")
27
28 # Initialize log
29 with open(log_file, "w", encoding="utf-8") as log:
30 log.write(f"Image conversion log - Start time: {time.ctime()}\n")
31
32 # Initialize stats file
33 with open(stats_file, "w", encoding="utf-8") as stats:
34 stats.write("Original File,Converted File,Original Size (KB),Converted Size (KB),Space Saved (KB),Savings Percentage\n")
35
36 # Collect all image files
37 image_exts = ('.jpg', '.jpeg', '.png', '.bmp', '.tiff', '.gif')
38 all_images = []
39 for root, _, files in os.walk(source_dir):
40 for file in files:
41 if file.lower().endswith(image_exts):
42 all_images.append(Path(root) / file)
43
44 total_files = len(all_images)
45 converted_files = 0
46 skipped_files = 0
47 error_files = 0
48
49 print(f"Found {total_files} image files to process")
50
51 # Process each image
52 for idx, img_path in enumerate(all_images):
53 try:
            # Display progress
56 progress = (idx + 1) / total_files * 100
57 sys.stdout.write(f"\rProgress: {progress:.2f}% ({idx+1}/{total_files})")
58 sys.stdout.flush()
59
60 # Create relative path structure
61 rel_path = img_path.relative_to(source_dir)
62 webp_path = output_dir / rel_path.with_suffix('.webp')
63 webp_path.parent.mkdir(parents=True, exist_ok=True)
64
65 # Check if file already exists
66 if webp_path.exists():
67 skipped_files += 1
68 continue
69
70 # Create temporary file path
71 temp_path = temp_dir / f"{img_path.stem}_temp.webp"
72
73 # Get original file size
74 orig_size = img_path.stat().st_size / 1024 # KB
75
76 # Convert and resize using ImageMagick
77 cmd = [
78 magick_path,
79 str(img_path),
80 "-resize", "960>", # Resize only if width exceeds 960px
81 "-quality", "85", # Initial quality 85
82 "-define", "webp:lossless=false",
83 str(temp_path)
84 ]
85
86 # Execute command
87 result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
88
89 if result.returncode != 0:
90 # Log conversion failure
91 with open(log_file, "a", encoding="utf-8") as log:
92 log.write(f"[ERROR] Failed to convert {img_path}: {result.stderr}\n")
93 error_files += 1
94 continue
95
96 # Move temporary file to target location
97 shutil.move(str(temp_path), str(webp_path))
98
99 # Get converted file size
100 new_size = webp_path.stat().st_size / 1024 # KB
101
102 # Calculate space savings
103 saved = orig_size - new_size
104 saved_percent = (saved / orig_size) * 100 if orig_size > 0 else 0
105
106 # Record statistics
107 with open(stats_file, "a", encoding="utf-8") as stats:
108 stats.write(f"{img_path},{webp_path},{orig_size:.2f},{new_size:.2f},{saved:.2f},{saved_percent:.2f}\n")
109
110 converted_files += 1
111
112 except Exception as e:
113 with open(log_file, "a", encoding="utf-8") as log:
114 log.write(f"[EXCEPTION] Error processing {img_path}: {str(e)}\n")
115 error_files += 1
116
117 # Completion report
118 total_size = sum(f.stat().st_size for f in output_dir.glob('**/*') if f.is_file())
119 total_size_gb = total_size / (1024 ** 3) # Convert to GB
120
121 end_time = time.time()
    elapsed = end_time - start_time
123 mins, secs = divmod(elapsed, 60)
124 hours, mins = divmod(mins, 60)
125
126 with open(log_file, "a", encoding="utf-8") as log:
127 log.write("\nConversion Report:\n")
128 log.write(f"Total files: {total_files}\n")
129 log.write(f"Successfully converted: {converted_files}\n")
130 log.write(f"Skipped files: {skipped_files}\n")
131 log.write(f"Error files: {error_files}\n")
132 log.write(f"Output directory size: {total_size_gb:.2f} GB\n")
133
134 print("\n\nConversion completed!")
135 print(f"Total files: {total_files}")
136 print(f"Successfully converted: {converted_files}")
137 print(f"Skipped files: {skipped_files}")
138 print(f"Error files: {error_files}")
139 print(f"Output directory size: {total_size_gb:.2f} GB")
140
141 # Clean up temporary directory
142 try:
143 shutil.rmtree(temp_dir)
144 print(f"Cleaned temporary directory: {temp_dir}")
145 except Exception as e:
146 print(f"Error cleaning temporary directory: {str(e)}")
147
148 print(f"Log file: {log_file}")
149 print(f"Statistics file: {stats_file}")
150 print(f"Total time elapsed: {int(hours)} hours {int(mins)} minutes {secs:.2f} seconds")
151
152if __name__ == "__main__":
    main()
- Further Image Compression
I originally designed this step to further compress images if the initial conversion didn’t reduce the total size below 10GB. However, the first step successfully resolved the issue, making additional compression unnecessary. Nevertheless, I tested further compression by converting images to WebP with a maximum short edge of 1280px and 60% quality, which resulted in a total size of only 9GB.
View Secondary Compression Code
1import os
2import subprocess
3import time
4import sys
5import shutil
6from pathlib import Path
7
8def main():
9 # Configure paths
10 webp_dir = Path("D:\\images_webp") # WebP directory
11 temp_dir = Path("D:\\temp_compress") # Temporary directory
12 cwebp_path = "C:\\Windows\\System32\\cwebp.exe" # cwebp path
13
14 # Create temporary directory
15 temp_dir.mkdir(parents=True, exist_ok=True)
16
17 # Log files
18 log_file = webp_dir / "compression_log.txt"
19 stats_file = webp_dir / "compression_stats.csv"
20
21 print("WebP compression script starting...")
22 print(f"Processing directory: {webp_dir}")
23 print(f"Temporary directory: {temp_dir}")
24
25 # Initialize log
26 with open(log_file, "w", encoding="utf-8") as log:
27 log.write(f"WebP Compression Log - Start time: {time.ctime()}\n")
28
29 # Initialize statistics file
30 with open(stats_file, "w", encoding="utf-8") as stats:
31 stats.write("Original File,Compressed File,Original Size (KB),New Size (KB),Space Saved (KB),Savings Percentage\n")
32
33 # Collect all WebP files
34 all_webp = list(webp_dir.glob('**/*.webp'))
35 total_files = len(all_webp)
36
37 if total_files == 0:
38 print("No WebP files found. Please run the conversion script first.")
39 return
40
41 print(f"Found {total_files} WebP files to compress")
42
43 compressed_count = 0
44 skipped_count = 0
45 error_count = 0
46
47 # Process each WebP file
48 for idx, webp_path in enumerate(all_webp):
49 try:
50 # Display progress
51 progress = (idx + 1) / total_files * 100
52 sys.stdout.write(f"\rProgress: {progress:.2f}% ({idx+1}/{total_files})")
53 sys.stdout.flush()
54
55 # Original size
56 orig_size = webp_path.stat().st_size / 1024 # KB
57
58 # Create temporary file path
59 temp_path = temp_dir / f"{webp_path.stem}_compressed.webp"
60
61 # Perform secondary compression using cwebp
62 cmd = [
63 cwebp_path,
64 "-q", "75", # Quality parameter
65 "-m", "6", # Maximum compression mode
66 str(webp_path),
67 "-o", str(temp_path)
68 ]
69
70 # Execute command
71 result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
72
73 if result.returncode != 0:
74 # Log compression failure
75 with open(log_file, "a", encoding="utf-8") as log:
76 log.write(f"[ERROR] Failed to compress {webp_path}: {result.stderr}\n")
77 error_count += 1
78 continue
79
80 # Get new file size
            new_size = temp_path.stat().st_size / 1024  # KB

            # Skip if the new file is larger than the original
            if new_size >= orig_size:
                skipped_count += 1
                temp_path.unlink()  # Delete temporary file
                continue

            # Calculate space savings
            saved = orig_size - new_size
            saved_percent = (saved / orig_size) * 100 if orig_size > 0 else 0

            # Record statistics
            with open(stats_file, "a", encoding="utf-8") as stats:
                stats.write(f"{webp_path},{webp_path},{orig_size:.2f},{new_size:.2f},{saved:.2f},{saved_percent:.2f}\n")

            # Replace original file
            webp_path.unlink()  # Delete original file
            shutil.move(str(temp_path), str(webp_path))
            compressed_count += 1

        except Exception as e:
            with open(log_file, "a", encoding="utf-8") as log:
                log.write(f"[Error] Processing {webp_path} failed: {str(e)}\n")
            error_count += 1

    # Completion report
    total_size = sum(f.stat().st_size for f in webp_dir.glob('**/*') if f.is_file())
    total_size_gb = total_size / (1024 ** 3)  # Convert to GB

    end_time = time.time()
    elapsed = end_time - start_time
    mins, secs = divmod(elapsed, 60)
    hours, mins = divmod(mins, 60)

    with open(log_file, "a", encoding="utf-8") as log:
        log.write("\nCompression Report:\n")
        log.write(f"Files processed: {total_files}\n")
        log.write(f"Successfully compressed: {compressed_count}\n")
        log.write(f"Skipped files: {skipped_count}\n")
        log.write(f"Error files: {error_count}\n")
        log.write(f"Total output directory size: {total_size_gb:.2f} GB\n")

    print("\n\nCompression completed!")
    print(f"Files processed: {total_files}")
    print(f"Successfully compressed: {compressed_count}")
    print(f"Skipped files: {skipped_count}")
    print(f"Error files: {error_count}")
    print(f"Total output directory size: {total_size_gb:.2f} GB")

    # Clean temporary directory
    try:
        shutil.rmtree(temp_dir)
        print(f"Cleaned temporary directory: {temp_dir}")
    except Exception as e:
        print(f"Error cleaning temporary directory: {str(e)}")

    print(f"Log file: {log_file}")
    print(f"Stats file: {stats_file}")
    print(f"Total duration: {int(hours)}h {int(mins)}m {secs:.2f}s")

if __name__ == "__main__":
    main()
Implementation Plan
Selecting the Right Hugo Theme
For a Hugo project with tens of thousands of markdown files, choosing a theme can be quite challenging.
One visually appealing theme I tested ran for more than three hours without finishing the build. Some themes threw constant errors during generation, while others produced over 200,000 files.
Ultimately, I settled on the most stable option - the PaperMod theme. By default, this theme generates only about 100 files, and the final website contains fewer than 50,000 files, which is relatively efficient.
Although it doesn’t meet Cloudflare Pages’ 20,000-file limit, it’s sufficiently lean. The build took 6.5 minutes on GitHub Pages and 8 minutes on Vercel.
However, some issues emerged during the build:
- Search functionality: Due to the massive article volume, the default index file reached 80MB, rendering it practically unusable. I had to limit indexing to only article titles and summaries.
- Sitemap generation: The default 4MB sitemap consistently failed to load in Google Search Console, though Bing Webmaster Tools handled it without issues.
- Pagination: With 12,000 tags and 20 articles per page, this would generate 60,000 files. Even after increasing to 200 articles per page, there were still 37,000 files (while other files totaled only 12,000).
The tag issue presents an optimization opportunity: only displaying the top 1,000 most-used tags while incorporating others into article titles. This could potentially reduce the file count below 20,000, meeting Cloudflare Pages’ requirements.
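As a rough sketch of that tag-pruning idea, reusing the python-frontmatter pattern from the scripts above: count tag frequency across all articles, keep only the most common 1,000, and drop the rest (folding dropped tags into titles is omitted here). The folder path and the top_n default are assumptions.

```python
import os
from collections import Counter

import frontmatter

def prune_tags(md_folder: str, top_n: int = 1000) -> None:
    """Keep only the top_n most frequent tags across all articles; drop the rest."""
    posts = {}
    counts = Counter()

    # First pass: count tag usage across every article
    for filename in os.listdir(md_folder):
        if not filename.endswith('.md'):
            continue
        path = os.path.join(md_folder, filename)
        with open(path, 'r', encoding='utf-8') as f:
            post = frontmatter.load(f)
        posts[path] = post
        counts.update(post.metadata.get('tags', []))

    keep = {tag for tag, _ in counts.most_common(top_n)}

    # Second pass: rewrite each article with only the surviving tags
    for path, post in posts.items():
        tags = post.metadata.get('tags', [])
        post.metadata['tags'] = [t for t in tags if t in keep]
        with open(path, 'w', encoding='utf-8') as f:
            f.write(frontmatter.dumps(post))

if __name__ == "__main__":
    prune_tags("./markdown_output")  # folder name assumed from the earlier scripts
```

Since every tag gets its own listing pages, trimming the taxonomy this way directly reduces the number of generated files.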
Choosing Static Site Hosting
The Hugo project itself is under 100MB (with 80MB being markdown files), making GitHub hosting feasible. Given GitHub Pages’ slower speeds, I opted for Vercel deployment. While Vercel’s 100GB bandwidth limit might seem restrictive, it should suffice for static content.
Selecting Image Hosting
I'm still evaluating options. I initially considered Cloudflare R2 but hesitated over concerns about exceeding the free tier limits. For now, I'm using a budget $7/year "fake Alibaba Cloud" VPS as a temporary solution.