Merge branch 'PR/293'

Closes #292
Resolve unicode escape sequences not being processed correctly
3 changed files with 29 additions and 20 deletions
--- a/README-SUBMODULES.md
+++ b/README-SUBMODULES.md
@@ -27,10 +27,10 @@ command line option.

 ## Example

-Example mercurial repo folder structure (~/mercurial):
+Example mercurial repo folder structure (~/mercurial) containing two subrepos:
    src/...
-    subrepo/subrepo1
-    subrepo/subrepo2
+    subrepos/subrepo1
+    subrepos/subrepo2

 ### Setup
 Create an empty new folder where all the converted git modules will be imported:
@@ -41,18 +41,18 @@ Create an empty new folder where all the converted git modules will be imported:
    mkdir submodule1
    cd submodule1
    git init
-    hg-fast-export.sh -r ~/mercurial/subrepo1
+    hg-fast-export.sh -r ~/mercurial/subrepos/subrepo1
    cd ..
    mkdir submodule2
    cd submodule2
    git init
-    hg-fast-export.sh -r ~/mercurial/subrepo2
+    hg-fast-export.sh -r ~/mercurial/subrepos/subrepo2

 ### Create mapping file
    cd ~/imported-gits
    cat > submodule-mappings << EOF
-    "subrepo/subrepo1"="../submodule1"
-    "subrepo/subrepo2"="../submodule2"
+    "subrepos/subrepo1"="../submodule1"
+    "subrepos/subrepo2"="../submodule2"
    EOF

 ### Convert main repository
@@ -60,16 +60,16 @@ Create an empty new folder where all the converted git modules will be imported:
    mkdir git-main-repo
    cd git-main-repo
    git init
-    hg-fast-export.sh -r ~/mercurial --subrepo-map=../submodule-mappings
+    hg-fast-export.sh -r ~/mercurial --subrepo-map=~/imported-gits/submodule-mappings

 ### Result
-The resulting repository will now contain the subrepo/subrepo1 and
-subrepo/subrepo1 submodules. The created .gitmodules file will look
-like:
+The resulting repository will now contain the submodules at the paths
+`subrepos/subrepo1` and `subrepos/subrepo2`. The created .gitmodules
+file will look like:

-    [submodule "subrepo/subrepo1"]
-          path = subrepo/subrepo1
+    [submodule "subrepos/subrepo1"]
+          path = subrepos/subrepo1
          url = ../submodule1
-    [submodule "subrepo/subrepo2"]
-          path = subrepo/subrepo2
+    [submodule "subrepos/subrepo2"]
+          path = subrepos/subrepo2
          url = ../submodule2
--- a/README.md
+++ b/README.md
@@ -133,7 +133,10 @@ is to convert line endings in text files from CRLF to git's preferred LF:
 # $2 = Mercurial's hash of the file
 # $3 = "1" if Mercurial reports the file as binary, otherwise "0"

-if [ "$3" == "1" ]; then cat; else dos2unix; fi
+if [ "$3" == "1" ]; then cat; else dos2unix -q; fi
+# The -q option makes dos2unix run quietly, which avoids returning an
+# error code when handling non-ASCII text files (such as UTF-16
+# encoded text files)
 -- End of crlf-filter.sh --
 ```

--- a/hg-fast-export.py
+++ b/hg-fast-export.py
@@ -266,7 +266,7 @@ def sanitize_name(name,what="branch", mapping={}):
  if not auto_sanitize:
    return mapping.get(name,name)
  n=mapping.get(name,name)
-  p=re.compile(b'([[ ~^:?\\\\*]|\.\.)')
+  p=re.compile(b'([\\[ ~^:?\\\\*]|\.\.)')
  n=p.sub(b'_', n)
  if n[-1:] in (b'/', b'.'): n=n[:-1]+b'_'
  n=b'/'.join([dot(s) for s in n.split(b'/')])
@@ -434,9 +434,15 @@ def load_mapping(name, filename, mapping_is_raw):
  def process_unicode_escape_sequences(s):
    # Replace unicode escape sequences in the otherwise UTF8-encoded bytestring s with
    # the UTF8-encoded characters they represent. We need to do an additional
-    # .decode('utf8').encode('unicode-escape') to convert any non-ascii characters into
-    # their escape sequences so that the subsequent .decode('unicode-escape') succeeds:
-    return s.decode('utf8').encode('unicode-escape').decode('unicode-escape').encode('utf8')
+    # .decode('utf8').encode('ascii', 'backslashreplace') to convert any non-ascii
+    # characters into their escape sequences so that the subsequent
+    # .decode('unicode-escape') succeeds:
+    return (
+      s.decode('utf8')
+      .encode('ascii', 'backslashreplace')
+      .decode('unicode-escape')
+      .encode('utf8')
+    )

  def parse_quoted_line(line):
    m=quoted_regexp.match(line)
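
The effect of the `process_unicode_escape_sequences()` change can be checked in isolation. The following is a minimal sketch, not part of the patch: `old_version` and `new_version` are hypothetical standalone copies of the two return expressions shown in the hunk above, and `sample` is an invented input that mixes a literal backslash escape sequence with a real non-ASCII character.

```python
def old_version(s: bytes) -> bytes:
    # Pre-patch expression: 'unicode-escape' also escapes the backslash itself,
    # so an existing \uXXXX sequence survives the round trip unchanged.
    return s.decode('utf8').encode('unicode-escape').decode('unicode-escape').encode('utf8')


def new_version(s: bytes) -> bytes:
    # Post-patch expression: ASCII (including backslashes) passes through,
    # only non-ASCII characters are turned into escape sequences.
    return (
        s.decode('utf8')
        .encode('ascii', 'backslashreplace')
        .decode('unicode-escape')
        .encode('utf8')
    )


# Hypothetical input: the six literal characters \u00e9 followed by a real 'ü'.
sample = 'caf\\u00e9 ü'.encode('utf8')

old_result = old_version(sample)
new_result = new_version(sample)

print(old_result.decode('utf8'))  # escape sequence is left as literal text
print(new_result.decode('utf8'))  # escape sequence is resolved to the character
```

With the pre-patch expression the input comes back byte-for-byte identical, which matches the problem described in the commit message. With the patched expression the escape sequence is interpreted while the already-encoded non-ASCII character is preserved.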
Author	SHA1	Message	Date
Frej Drejhammar	6700b164d0	Merge branch 'PR/293' Closes #292	2022-10-23 14:47:04 +02:00
chrisjbillington	13c273f10c	Resolve unicode escape sequences not being processed correctly In `process_unicode_escape_sequences()`, any backslash escape sequences in the original string are escaped upon the first `.encode('unicode-escape')` and therefore round-trip the sequence of `.encode('unicode-escape').decode('unicode-escape')`. That is not what we want - we want these sequences to be passed-through the `.encode` unchanged, so that they will be converted to the character they represent upon `.decode()`. This patch changes the `.encode()` step to pass through any ascii characters unchanged, only escaping non-ascii characters. This ensures any existing backslash escape sequences will be interpreted as the character they represent upon `.decode()`.	2022-10-23 11:51:33 +11:00
Frej Drejhammar	667404e836	Merge branch 'PR291'	2022-09-21 18:31:16 +02:00
Nicolas Vanhoren	38e236962d	Update README.md to change recommandation for crlf filtering	2022-09-21 01:37:39 +02:00
Frej Drejhammar	dbb8158527	Merge branch 'frej/submodule-doc-improvement'	2022-02-10 20:05:07 +01:00
Frej Drejhammar	bb0bcda7ba	Merge branch 'frej/fix-re-future-warning'	2022-02-10 20:04:14 +01:00
Frej Drejhammar	838b654614	Remove inconsistencies from submodule documentation The submodule documentation is not consistent with regards to the example directory structure. Update the example to be consistent. Closes #277.	2022-02-09 15:58:48 +01:00
Frej Drejhammar	f179afce65	Fix FutureWarning about nested sets in re Since Python 3.7 the re module warns for syntax which could, in the future, be misparsed as a nested set. Avoid this by escaping the literal `[` we search for in the regexp. Reported by Monte Davidoff @mndavidoff Closes #269.	2022-02-09 15:37:29 +01:00
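
The `sanitize_name()` regexp change from commit f179afce65 can be illustrated in the same way. This is a sketch under stated assumptions: `OLD` and `NEW` are raw-bytes spellings of the two patterns from the hunk (chosen here to avoid doubled backslashes), `compile_and_collect` is a hypothetical helper, and the branch name is made up. It compiles both patterns while recording warnings, then applies the same substitution as the patched code.

```python
import re
import warnings

OLD = rb'([[ ~^:?\\*]|\.\.)'   # unescaped '[' inside the character set
NEW = rb'([\[ ~^:?\\*]|\.\.)'  # escaped '[' as introduced by the patch


def compile_and_collect(pattern):
    """Compile a pattern and return it with the names of any warnings raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter('always')
        compiled = re.compile(pattern)
    return compiled, [w.category.__name__ for w in caught]


old_re, old_warnings = compile_and_collect(OLD)
new_re, new_warnings = compile_and_collect(NEW)

# Hypothetical branch name containing characters the pattern replaces.
name = b'feature[1] my..branch'
old_result = old_re.sub(b'_', name)
new_result = new_re.sub(b'_', name)

print(old_warnings)  # warnings recorded for the unescaped form, if any
print(new_warnings)  # warnings recorded for the escaped form
print(new_result)
```

Both patterns should replace the same characters, so the patch is expected to change only whether the `re` module warns, not how names are sanitized.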